Table of Contents
cs.CL
[1] Subjective Evaluation Profile Analysis of Science Fiction Short Stories and its Critical-Theoretical Significance
Kazuyoshi Otsuka
Main category: cs.CL
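TL;DR: This study uses LLMs as "subjective literary critics" to analyze aesthetic preferences and evaluation patterns in literary assessment, finding that LLMs exhibit individual evaluation characteristics similar to human critical schools.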
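The abstract for this entry reports that TF-IDF analysis separates the models' evaluation vocabularies. As a rough illustration of that idea only, the sketch below computes plain `tf * log(N / df)` weights with the standard library. The review snippets, the model names, and the helper names `tfidf` and `top_terms` are hypothetical, and the study's actual weighting scheme and tokenization are not stated in this digest.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def top_terms(term_weights, k=3):
    """Highest-weighted terms, ties broken alphabetically."""
    return sorted(term_weights, key=lambda t: (-term_weights[t], t))[:k]


# Hypothetical review snippets, one per model (not data from the study).
reviews = {
    "model_a": "vivid imagery and vivid prose carry the plot",
    "model_b": "tight plot logic and consistent plot worldbuilding",
}
scores = dict(zip(reviews, tfidf(list(reviews.values()))))
print(top_terms(scores["model_a"], 1))
```

Terms shared by every model (here "plot" and "and") receive zero weight, so only vocabulary that distinguishes one model from the others rises to the top.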
Details
Motivation: The study aims to explore aesthetic preferences and evaluation patterns in literary assessment, and to observe the implicit value systems shaped by RLHF and their influence on literary judgment. Method: Six state-of-the-art LLMs evaluated English translations of ten Japanese science fiction short stories; principal component analysis and clustering were used to examine evaluation consistency and patterns. Result: Evaluation consistency varied significantly, five distinct evaluation patterns emerged, and each model used a distinctive evaluation vocabulary. Conclusion: LLMs exhibit individual evaluation characteristics in literary assessment, resembling human critical schools rather than neutral benchmarkers. Abstract: This study positions large language models (LLMs) as "subjective literary critics" to explore aesthetic preferences and evaluation patterns in literary assessment. Ten Japanese science fiction short stories were translated into English and evaluated by six state-of-the-art LLMs across seven independent sessions. Principal component analysis and clustering techniques revealed significant variations in evaluation consistency (α ranging from 1.00 to 0.35) and five distinct evaluation patterns. Additionally, evaluation variance across stories differed by up to 4.5-fold, with TF-IDF analysis confirming distinctive evaluation vocabularies for each model. Our seven-session within-day protocol using an original Science Fiction corpus strategically minimizes external biases, allowing us to observe implicit value systems shaped by RLHF and their influence on literary judgment. These findings suggest that LLMs may possess individual evaluation characteristics similar to human critical schools, rather than functioning as neutral benchmarkers.
[2] MapIQ: Benchmarking Multimodal Large Language Models for Map Question Answering
Varun Srivastava,Fan Lei,Srija Mukhopadhyay,Vivek Gupta,Ross Maciejewski
Main category: cs.CL
TL;DR: MapIQ is introduced as a new benchmark dataset for evaluating multimodal large language models (MLLMs) on various map types and visual analytical tasks, providing insights into their performance, robustness, and potential improvements.
Details
Motivation: To address the limited scope of Map-VQA research that has primarily focused on choropleth maps, MapIQ introduces a broader range of map types and thematic categories for more comprehensive evaluation of MLLMs. Method: Introduction of MapIQ, a benchmark dataset with 14,706 question-answer pairs across three map types, and evaluation of multiple MLLMs on six visual analytical tasks, including experiments on map design changes. Result: The study evaluates MLLMs' performance across different map types and tasks, compares them to a human baseline, and identifies the impact of map design changes on their robustness and sensitivity. Conclusion: MapIQ serves as a comprehensive benchmark dataset for evaluating MLLMs' performance on various types of maps and visual analytical tasks, offering insights into their robustness, sensitivity, and potential improvements in Map-VQA. Abstract: Recent advancements in multimodal large language models (MLLMs) have driven researchers to explore how well these models read data visualizations, e.g., bar charts, scatter plots. More recently, attention has shifted to visual question answering with maps (Map-VQA). However, Map-VQA research has primarily focused on choropleth maps, which cover only a limited range of thematic categories and visual analytical tasks. To address these gaps, we introduce MapIQ, a benchmark dataset comprising 14,706 question-answer pairs across three map types: choropleth maps, cartograms, and proportional symbol maps spanning topics from six distinct themes (e.g., housing, crime). We evaluate multiple MLLMs using six visual analytical tasks, comparing their performance against one another and a human baseline. 
An additional experiment examining the impact of map design changes (e.g., altered color schemes, modified legend designs, and removal of map elements) provides insights into the robustness and sensitivity of MLLMs, their reliance on internal geographic knowledge, and potential avenues for improving Map-VQA performance.
[3] Cross-lingual Few-shot Learning for Persian Sentiment Analysis with Incremental Adaptation
Farideh Majidi,Ziaeddin Beheshtifard
Main category: cs.CL
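TL;DR: This study develops a Persian sentiment analysis model that is effective with limited data by combining few-shot learning and incremental learning with multilingual pre-trained models, notably mDeBERTa and XLM-RoBERTa, reaching up to 96% accuracy.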
Details
Motivation: To develop a model that can perform Persian sentiment analysis with limited data while drawing on prior knowledge from high-resource languages. Method: Three pre-trained multilingual models (XLM-RoBERTa, mDeBERTa, and DistilBERT) were fine-tuned with few-shot and incremental learning on small samples of Persian data from multiple sources. Result: mDeBERTa and XLM-RoBERTa reached 96% accuracy on Persian sentiment analysis. Conclusion: Combined with few-shot and incremental learning, mDeBERTa and XLM-RoBERTa perform strongly on Persian sentiment analysis at 96% accuracy, supporting the effectiveness of pairing these methods with multilingual pre-trained models. Abstract: This research examines cross-lingual sentiment analysis using few-shot learning and incremental learning methods in Persian. The main objective is to develop a model capable of performing sentiment analysis in Persian using limited data, while drawing on prior knowledge from high-resource languages. To achieve this, three pre-trained multilingual models (XLM-RoBERTa, mDeBERTa, and DistilBERT) were employed, which were fine-tuned using few-shot and incremental learning approaches on small samples of Persian data from diverse sources, including X, Instagram, Digikala, Snappfood, and Taaghche. This variety enabled the models to learn from a broad range of contexts. Experimental results show that mDeBERTa and XLM-RoBERTa achieved high performance, reaching 96% accuracy on Persian sentiment analysis. These findings highlight the effectiveness of combining few-shot learning and incremental learning with multilingual pre-trained models.
[4] Partitioner Guided Modal Learning Framework
Guimin Hu,Yi Xin,Lijie Hu,Zhihong Zhu,Hasti Seifi
Main category: cs.CL
TL;DR: This paper proposes PgM, a framework for multimodal learning that partitions and learns uni-modal and paired-modal features separately, enhancing performance and adaptability across tasks.
Details
Motivation: Multimodal learning benefits from leveraging both uni-modal features (learned independently) and paired-modal features (from cross-modal interaction), requiring a framework to effectively partition and utilize these features. Method: PgM framework includes a modal partitioner, uni-modal learner, paired-modal learner, and uni-paired modal decoder to segment, learn, and reconstruct modal representations for multimodal learning. Result: PgM demonstrates effectiveness across four multimodal tasks, offers insights into feature contributions through visualization, and shows transferability to existing models. Conclusion: PgM provides thorough learning of uni-modal and paired-modal features with flexible distribution adjustment and different learning rates across modalities, demonstrating effectiveness and transferability across multimodal tasks. Abstract: Multimodal learning benefits from multiple modal information, and each learned modal representations can be divided into uni-modal that can be learned from uni-modal training and paired-modal features that can be learned from cross-modal interaction. Building on this perspective, we propose a partitioner-guided modal learning framework, PgM, which consists of the modal partitioner, uni-modal learner, paired-modal learner, and uni-paired modal decoder. Modal partitioner segments the learned modal representation into uni-modal and paired-modal features. Modal learner incorporates two dedicated components for uni-modal and paired-modal learning. Uni-paired modal decoder reconstructs modal representation based on uni-modal and paired-modal features. PgM offers three key benefits: 1) thorough learning of uni-modal and paired-modal features, 2) flexible distribution adjustment for uni-modal and paired-modal representations to suit diverse downstream tasks, and 3) different learning rates across modalities and partitions. 
Extensive experiments demonstrate the effectiveness of PgM across four multimodal tasks and further highlight its transferability to existing models. Additionally, we visualize the distribution of uni-modal and paired-modal features across modalities and tasks, offering insights into their respective contributions.
[5] ExpliCIT-QA: Explainable Code-Based Image Table Question Answering
Maximiliano Hormazábal Lagos,Álvaro Bueno Sáez,Pedro Alonso Doval,Jorge Alcalde Vesteiro,Héctor Cerezo-Costas
Main category: cs.CL
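TL;DR: ExpliCIT-QA is an explainable multimodal table question answering system that improves transparency through a modular design and performs well on the TableVQA-Bench benchmark.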
Details
Motivation: To address the lack of explainability in existing end-to-end table visual question answering systems, meeting the needs of sensitive domains where results must be audited. Method: A modular multimodal pipeline comprising multimodal table understanding, language-based reasoning, automatic code generation, code execution, and natural language explanation. Result: ExpliCIT-QA was evaluated on the TableVQA-Bench benchmark and showed improvements in interpretability and transparency over existing baselines. Conclusion: ExpliCIT-QA improves the explainability and transparency of table visual question answering systems, opening the door to applications in sensitive domains such as finance and healthcare. Abstract: We present ExpliCIT-QA, a system that extends our previous MRT approach for tabular question answering into a multimodal pipeline capable of handling complex table images and providing explainable answers. ExpliCIT-QA follows a modular design, consisting of: (1) Multimodal Table Understanding, which uses a Chain-of-Thought approach to extract and transform content from table images; (2) Language-based Reasoning, where a step-by-step explanation in natural language is generated to solve the problem; (3) Automatic Code Generation, where Python/Pandas scripts are created based on the reasoning steps, with feedback for handling errors; (4) Code Execution to compute the final answer; and (5) Natural Language Explanation that describes how the answer was computed. The system is built for transparency and auditability: all intermediate outputs, parsed tables, reasoning steps, generated code, and final answers are available for inspection. This strategy works towards closing the explainability gap in end-to-end TableVQA systems. We evaluated ExpliCIT-QA on the TableVQA-Bench benchmark, comparing it with existing baselines. We demonstrated improvements in interpretability and transparency, which open the door for applications in sensitive domains like finance and healthcare where auditing results are critical.
[6] CRABS: A syntactic-semantic pincer strategy for bounding LLM interpretation of Python notebooks
Meng Li,Timothy M. McPhillips,Dingmin Wang,Shin-Rong Tsai,Bertram Ludäscher
Main category: cs.CL
TL;DR: CRABS improves notebook understanding by combining syntactic analysis with LLM-based resolution, achieving high accuracy in identifying information flows and execution dependencies.
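To make the syntactic half of this idea concrete, here is a minimal sketch using Python's standard `ast` module to compute rough lower and upper estimates of one cell's inputs, plus an upper estimate of its outputs. It is not the CRABS implementation: the function name `cell_io_estimates` and the returned keys are hypothetical, and the sketch ignores scoping details such as except-handler names, `global`/`nonlocal`, and star imports.

```python
import ast
import builtins

BUILTINS = set(dir(builtins))


def cell_io_estimates(source):
    """Estimate the names a notebook cell reads from, and binds for, other cells."""
    tree = ast.parse(source)
    loaded, bound = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                loaded.add(node.id)
            elif isinstance(node.ctx, ast.Store):
                bound.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)  # function parameters are local bindings
        elif isinstance(node, ast.alias):
            bound.add((node.asname or node.name).split(".")[0])
    reads = loaded - BUILTINS
    return {
        "inputs_lower": reads - bound,  # read here and never bound here
        "inputs_upper": reads,          # every non-builtin name that is read
        "outputs_upper": bound,         # every name this cell binds
    }


cell = "import pandas as pd\ndf = pd.read_csv(path)\ndf = df.dropna()\nprint(len(df))"
est = cell_io_estimates(cell)
ambiguous = cell_io_estimates("total = total + x")
print(est["inputs_lower"], ambiguous["inputs_upper"] - ambiguous["inputs_lower"])
```

In the second cell, `total` is both read and bound, so syntax alone cannot settle whether it is a true input. Names in the gap between the two estimates are the kind of ambiguity the entry describes handing to an LLM.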
Details
Motivation: Understanding Python notebooks is crucial for evaluation, reuse, and adaptation, but re-execution is often impractical due to dependency issues. LLMs struggle with some realistic notebooks due to hallucinations and long-context challenges. Method: The Capture and Resolve Assisted Bounding Strategy (CRABS) uses shallow syntactic parsing and AST analysis to establish bounds on inter-cell I/O sets, followed by LLM-based zero-shot learning to resolve remaining ambiguities. Result: CRABS achieves an average F1 score of 98% in identifying cell-to-cell information flows and 99% in identifying transitive execution dependencies on a dataset of 50 Kaggle notebooks. The LLM resolves 98% of ambiguities left by syntactic analysis. Conclusion: CRABS proves effective in notebook understanding by combining syntactic analysis and LLM-based resolution of ambiguities. Abstract: Recognizing the information flows and operations comprising data science and machine learning Python notebooks is critical for evaluating, reusing, and adapting notebooks for new tasks. Investigating a notebook via re-execution often is impractical due to the challenges of resolving data and software dependencies. While Large Language Models (LLMs) pre-trained on large codebases have demonstrated effectiveness in understanding code without running it, we observe that they fail to understand some realistic notebooks due to hallucinations and long-context challenges. To address these issues, we propose a notebook understanding task yielding an information flow graph and corresponding cell execution dependency graph for a notebook, and demonstrate the effectiveness of a pincer strategy that uses limited syntactic analysis to assist full comprehension of the notebook using an LLM. 
Our Capture and Resolve Assisted Bounding Strategy (CRABS) employs shallow syntactic parsing and analysis of the abstract syntax tree (AST) to capture the correct interpretation of a notebook between lower and upper estimates of the inter-cell I/O sets, then uses an LLM to resolve remaining ambiguities via cell-by-cell zero-shot learning, thereby identifying the true data inputs and outputs of each cell. We evaluate and demonstrate the effectiveness of our approach using an annotated dataset of 50 representative, highly up-voted Kaggle notebooks that together represent 3454 actual cell inputs and outputs. The LLM correctly resolves 1397 of 1425 (98%) ambiguities left by analyzing the syntactic structure of these notebooks. Across 50 notebooks, CRABS achieves average F1 scores of 98% identifying cell-to-cell information flows and 99% identifying transitive cell execution dependencies.
[7] AI Wizards at CheckThat! 2025: Enhancing Transformer-Based Embeddings with Sentiment for Subjectivity Detection in News Articles
Matteo Fasulo,Luca Babboni,Luca Tedeschini
Main category: cs.CL
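TL;DR: The AI Wizards team participated in Task 1 of the CLEF 2025 CheckThat! Lab, aiming to improve transformer-based classifiers for subjectivity detection in news articles by combining sentiment scores with sentence representations.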
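The abstract for this entry says class imbalance was handled with decision threshold calibration optimized on the development set. A minimal sketch of that step follows, assuming a binary classifier that outputs a probability for the "subjective" class and a grid search that maximizes macro F1. The grid, the toy probabilities, and the function names are hypothetical, not the team's actual setup.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 for binary labels 0 and 1."""
    scores = []
    for label in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def calibrate_threshold(probs, y_true, grid=None):
    """Pick the threshold on P(subjective) that maximizes macro F1."""
    grid = grid or [i / 100 for i in range(5, 100, 5)]
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        preds = [int(p >= t) for p in probs]
        f1 = macro_f1(y_true, preds)
        if f1 > best_f1:  # strict '>' keeps the lowest threshold on ties
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical dev-set probabilities for the subjective class (label 1).
dev_probs = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.80]
dev_labels = [0, 0, 0, 1, 0, 1, 1]
best_t, best_f1 = calibrate_threshold(dev_probs, dev_labels)
default_f1 = macro_f1(dev_labels, [int(p >= 0.5) for p in dev_probs])
print(best_t, round(best_f1, 3), round(default_f1, 3))
```

On this toy data the calibrated threshold falls below 0.5, which is the typical effect when a classifier is under-confident on the minority class.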
Details
Motivation: Detecting subjectivity in news articles in multilingual and zero-shot settings is challenging, particularly under class imbalance. Method: mDeBERTaV3-base, ModernBERT-base, and Llama3.2-1B were combined with sentiment scores and sentence representations, with decision threshold calibration used to address class imbalance. Result: Integrating sentiment features significantly improved performance, especially subjective F1, and the system ranked 1st for Greek (Macro F1 = 0.51). Conclusion: The framework combining sentiment scores with sentence representations performs well on subjectivity detection and generalizes well across languages. Abstract: This paper presents AI Wizards' participation in the CLEF 2025 CheckThat! Lab Task 1: Subjectivity Detection in News Articles, classifying sentences as subjective/objective in monolingual, multilingual, and zero-shot settings. Training/development datasets were provided for Arabic, German, English, Italian, and Bulgarian; final evaluation included additional unseen languages (e.g., Greek, Romanian, Polish, Ukrainian) to assess generalization. Our primary strategy enhanced transformer-based classifiers by integrating sentiment scores, derived from an auxiliary model, with sentence representations, aiming to improve upon standard fine-tuning. We explored this sentiment-augmented architecture with mDeBERTaV3-base, ModernBERT-base (English), and Llama3.2-1B. To address class imbalance, prevalent across languages, we employed decision threshold calibration optimized on the development set. Our experiments show sentiment feature integration significantly boosts performance, especially subjective F1 score. This framework led to high rankings, notably 1st for Greek (Macro F1 = 0.51).
[8] Tracing Facts or just Copies? A critical investigation of the Competitions of Mechanisms in Large Language Models
Dante Campregher,Yanxu Chen,Sander Hoffman,Maria Heuss
Main category: cs.CL
TL;DR: This paper studies how attention heads in LLMs manage factual and counterfactual information, finding that they suppress copying generally rather than selectively, and that their behavior varies by domain and model size.
Details
Motivation: To reconcile conflicting findings from recent studies on how LLMs handle factual and counterfactual information and to better understand the role of attention heads in this process. Method: The study conducts a reproducibility analysis focusing on attention heads, using Mechanistic Interpretability tools to evaluate hypotheses about suppression mechanisms and investigate domain specificity. Result: Attention heads promoting factual output suppress copying broadly rather than selectively inhibiting counterfactuals, and attention patterns are domain-specific, with larger models showing more nuanced, category-sensitive behavior. Conclusion: Attention heads in LLMs promote factual output through general copy suppression rather than selective counterfactual suppression, and their behavior is domain-dependent, with larger models showing more specialized patterns. Abstract: This paper presents a reproducibility study examining how Large Language Models (LLMs) manage competing factual and counterfactual information, focusing on the role of attention heads in this process. We attempt to reproduce and reconcile findings from three recent studies (Ortu et al.; Yu, Merullo, and Pavlick; and McDougall et al.) that investigate the competition between model-learned facts and contradictory context information through Mechanistic Interpretability tools. Our study specifically examines the relationship between attention head strength and factual output ratios, evaluates competing hypotheses about attention heads' suppression mechanisms, and investigates the domain specificity of these attention patterns. Our findings suggest that attention heads promoting factual output do so via general copy suppression rather than selective counterfactual suppression, as strengthening them can also inhibit correct facts. 
Additionally, we show that attention head behavior is domain-dependent, with larger models exhibiting more specialized and category-sensitive patterns.
[9] ILID: Native Script Language Identification for Indian Languages
Yash Ingle,Pruthwik Mishra
Main category: cs.CL
TL;DR: This paper introduces a large dataset for language identification, including English and 22 Indian languages, and develops baseline models comparable to state-of-the-art approaches.
Details
Motivation: Language identification is a crucial task in NLP, particularly challenging in noisy, short, and code-mixed environments, especially for Indian languages with lexical and phonetic similarities and shared scripts. Method: Developed robust baseline models using state-of-the-art machine learning and deep learning approaches. Result: A dataset of 230K sentences consisting of English and all 22 official Indian languages was released, along with robust baseline models. Conclusion: Baseline models developed in this paper are comparable to state-of-the-art models for language identification. Abstract: The language identification task is a crucial fundamental step in NLP. Often it serves as a pre-processing step for widely used NLP applications such as multilingual machine translation, information retrieval, question answering, and text summarization. The core challenge of language identification lies in distinguishing languages in noisy, short, and code-mixed environments. This becomes even harder in the case of diverse Indian languages that exhibit lexical and phonetic similarities, but have distinct differences. Many Indian languages share the same script, making the task even more challenging. In this paper, we release a dataset of 230K sentences consisting of English and all 22 official Indian languages labeled with their language identifiers where data in most languages are newly created. We also develop and release robust baseline models using state-of-the-art approaches in machine learning and deep learning that can aid the research in this field. Our baseline models are comparable to the state-of-the-art models for the language identification task.
[10] Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential
Mohammad Samragh,Arnav Kundu,David Harrison,Kumari Nishu,Devang Naik,Minsik Cho,Mehrdad Farajtabar
Main category: cs.CL
TL;DR: The paper proposes a framework to speed up autoregressive language models by enabling simultaneous prediction of multiple tokens.
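Predicting several tokens at once only helps if the extra tokens can be checked against what the base model would have produced. The sketch below shows a generic greedy draft-and-verify step of the kind used in speculative decoding. It is a swapped-in illustration, not the paper's gated-LoRA or sampler design: `verify_draft`, `toy_next_token`, and the fixed target sentence are hypothetical, and a real system would verify all draft positions in one parallel forward pass rather than in a Python loop.

```python
def verify_draft(next_token, prefix, draft):
    """Accept the longest draft prefix that matches the base model's greedy choices."""
    accepted = []
    context = list(prefix)
    for token in draft:
        expected = next_token(context)
        if token != expected:
            # First mismatch: keep the base model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)
    # Whole draft accepted; the base model contributes one extra token.
    accepted.append(next_token(context))
    return accepted


# Hypothetical stand-in for a base model: always continues a fixed sentence.
TARGET = "the cat sat on the mat".split()


def toy_next_token(context):
    return TARGET[len(context)] if len(context) < len(TARGET) else "<eos>"


out_good = verify_draft(toy_next_token, ["the"], ["cat", "sat", "on"])
out_bad = verify_draft(toy_next_token, ["the"], ["cat", "ran", "on"])
print(out_good, out_bad)
```

Each call emits at least one token that the base model agrees with, so the output matches plain greedy decoding while a correct draft advances several positions per step.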
Details
Motivation: The motivation is to overcome the sequential generation limitation of autoregressive language models, which hinders inference speed and parallelism. Method: The method involves a masked-input formulation, a gated LoRA formulation, a learnable sampler module, auxiliary training losses, and a speculative generation strategy. Result: The framework achieves significant speedups in generating code, math, and improves general chat and knowledge tasks. Conclusion: The paper concludes that the proposed framework significantly speeds up the generation process of autoregressive language models without compromising quality. Abstract: Autoregressive language models are constrained by their inherently sequential nature, generating one token at a time. This paradigm limits inference speed and parallelism, especially during later stages of generation when the direction and semantics of text are relatively certain. In this work, we propose a novel framework that leverages the inherent knowledge of vanilla autoregressive language models about future tokens, combining techniques to realize this potential and enable simultaneous prediction of multiple subsequent tokens. Our approach introduces several key innovations: (1) a masked-input formulation where multiple future tokens are jointly predicted from a common prefix; (2) a gated LoRA formulation that preserves the original LLM's functionality, while equipping it for multi-token prediction; (3) a lightweight, learnable sampler module that generates coherent sequences from the predicted future tokens; (4) a set of auxiliary training losses, including a consistency loss, to enhance the coherence and accuracy of jointly generated tokens; and (5) a speculative generation strategy that expands tokens quadratically in the future while maintaining high fidelity. Our method achieves significant speedups through supervised fine-tuning on pretrained models. 
For example, it generates code and math nearly 5x faster, and improves general chat and knowledge tasks by almost 2.5x. These gains come without any loss in quality.
[11] Cross-Domain Transfer and Few-Shot Learning for Personal Identifiable Information Recognition
Junhong Ye,Xu Yuan,Xinying Qiu
Main category: cs.CL
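TL;DR: This study examines the effectiveness of cross-domain model transfer, multi-domain data fusion, and sample-efficient learning for recognizing personally identifiable information (PII), finding that legal-domain data transfers to biographical texts while the medical domain resists transfer, and that low-specialization domains need only a small amount of data for high-quality recognition.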
Details
Motivation: Accurate recognition of personally identifiable information (PII) is essential for automated text anonymization. Method: Using annotated corpora from healthcare (I2B2), legal (TAB), and biography (Wikipedia), models are evaluated along four dimensions: in-domain performance, cross-domain transferability, fusion, and few-shot learning. Result: Legal-domain data transfers well to biographical texts, while the medical domain resists incoming transfer; fusion benefits are domain-specific; in low-specialization domains, high-quality recognition is achievable with only 10% of the training data. Conclusion: Cross-domain model transfer, multi-domain data fusion, and sample-efficient learning show promise for PII recognition, particularly in low-specialization domains where a small amount of data suffices for high-quality recognition. Abstract: Accurate recognition of personally identifiable information (PII) is central to automated text anonymization. This paper investigates the effectiveness of cross-domain model transfer, multi-domain data fusion, and sample-efficient learning for PII recognition. Using annotated corpora from healthcare (I2B2), legal (TAB), and biography (Wikipedia), we evaluate models across four dimensions: in-domain performance, cross-domain transferability, fusion, and few-shot learning. Results show legal-domain data transfers well to biographical texts, while medical domains resist incoming transfer. Fusion benefits are domain-specific, and high-quality recognition is achievable with only 10% of training data in low-specialization domains.
[12] COLA-GEC: A Bidirectional Framework for Enhancing Grammatical Acceptability and Error Correction
Xiangyu Yang,Xinying Qiu
Main category: cs.CL
TL;DR: This paper introduces COLA-GEC, a bidirectional framework that enhances grammatical error correction and acceptability judgment through mutual knowledge transfer, achieving state-of-the-art results while identifying areas for future improvement.
Details
Motivation: The motivation is to bridge the gap between grammatical error correction (GEC) and grammatical acceptability judgment (COLA), two core NLP tasks that typically evolve independently despite sharing foundational grammatical knowledge. Method: The paper proposes COLA-GEC, a bidirectional framework that improves GEC and COLA tasks through mutual knowledge transfer. It augments grammatical acceptability models with GEC datasets and integrates grammatical acceptability signals into GEC model training using a dynamic loss function. Result: The approach achieves state-of-the-art results on several multilingual benchmarks, with performance improvements in both GEC and COLA tasks. Error analysis highlights challenges in punctuation error correction. Conclusion: The paper concludes that the proposed COLA-GEC framework successfully enhances both grammatical error correction and grammatical acceptability judgment tasks through mutual knowledge transfer, achieving state-of-the-art results on multilingual benchmarks while identifying punctuation error correction as a key area for future improvement. Abstract: Grammatical Error Correction (GEC) and grammatical acceptability judgment (COLA) are core tasks in natural language processing, sharing foundational grammatical knowledge yet typically evolving independently. This paper introduces COLA-GEC, a novel bidirectional framework that enhances both tasks through mutual knowledge transfer. First, we augment grammatical acceptability models using GEC datasets, significantly improving their performance across multiple languages. Second, we integrate grammatical acceptability signals into GEC model training via a dynamic loss function, effectively guiding corrections toward grammatically acceptable outputs. Our approach achieves state-of-the-art results on several multilingual benchmarks. 
Comprehensive error analysis highlights remaining challenges, particularly in punctuation error correction, providing insights for future improvements in grammatical modeling.
[13] DualReward: A Dynamic Reinforcement Learning Framework for Cloze Tests Distractor Generation
Tianyou Huang,Xinglu Chen,Jingshen Zhang,Xinying Qiu,Ruiying Niu
Main category: cs.CL
TL;DR: This paper proposes DualReward, a reinforcement learning framework for generating distractors in cloze tests, showing improved performance through a dual reward mechanism with adaptive scaling.
Details
Motivation: The motivation is to improve automatic distractor generation in cloze tests by overcoming the limitations of conventional supervised learning and static generative models. Method: The method uses a dual reward structure with adaptive scaling to differentiate between human-created and model-generated distractors, adjusting reward intensity dynamically based on performance and confidence. Result: The approach showed consistent improvements over baselines, with modest gains on homogeneous datasets (CLOTH-F) and significant improvements (3.48-3.86% in P@1) on cross-domain data (MCQ). Conclusion: DualReward provides a flexible framework for balancing learning from human examples and exploring new distractors, making it effective for varied question types and domains. Abstract: This paper introduces DualReward, a novel reinforcement learning framework for automatic distractor generation in cloze tests. Unlike conventional approaches that rely primarily on supervised learning or static generative models, our method employs a dual reward structure with adaptive scaling that differentiates between human-created gold standard distractors and model-generated candidates. The framework dynamically adjusts reward signal intensity based on model performance and confidence. We evaluate our approach on both passage-level (CLOTH-F) and sentence-level (MCQ) cloze test datasets, demonstrating consistent improvements over state-of-the-art baselines. Experimental results show that our adaptive reward scaling mechanism provides modest but consistent benefits on homogeneous datasets (CLOTH-F) and more substantial improvements (3.48-3.86% in P@1) on diverse, cross-domain data (MCQ), suggesting its particular effectiveness for handling varied question types and domains. 
Our work offers a flexible framework that effectively balances learning from reliable human examples while exploring novel, high-quality distractors for automated test generation.
[14] LLMs Encode Harmfulness and Refusal Separately
Jiachen Zhao,Jing Huang,Zhengxuan Wu,David Bau,Weiyan Shi
Main category: cs.CL
TL;DR: This paper identifies an internal harmfulness direction in LLMs separate from refusal, showing that models understand harmfulness beyond just refusing harmful instructions. It introduces Latent Guard, a safeguard using harmfulness representation that effectively detects unsafe inputs and resists finetuning attacks, suggesting a new approach to AI safety.
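A "direction" in this sense is a vector in the model's hidden-state space. One common way to estimate such a vector is the difference between mean activations on two prompt sets. The sketch below assumes that difference-of-means recipe, which this digest does not confirm as the paper's exact procedure, and uses toy 3-dimensional vectors in place of real hidden states. All function names and numbers are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def unit_direction(harmful, harmless):
    """Normalized difference between the two class means."""
    diff = [a - b for a, b in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def projection(hidden, direction):
    """Scalar score: how far a hidden state lies along the direction."""
    return sum(h * d for h, d in zip(hidden, direction))


def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha units along the direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Hypothetical 3-d "hidden states" (not real model activations).
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
direction = unit_direction(harmful, harmless)

state = [0.5, 1.0, 1.0]
before = projection(state, direction)
after = projection(steer(state, direction, 2.0), direction)
print(direction, before, after)
```

A detector in the spirit of Latent Guard could threshold the projection score, while steering corresponds to adding a multiple of the direction to the hidden state; where the threshold sits and which layer is read are choices this sketch leaves open.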
Details
Motivation: The motivation behind this research is to understand whether LLMs truly comprehend harmfulness beyond just refusing harmful instructions. It explores the internal mechanisms of LLMs related to harmfulness and refusal, aiming to improve AI safety by developing more robust safeguards against harmful inputs. Method: The research identifies a harmfulness direction in LLMs, distinct from the refusal direction, and investigates its effects through steering techniques. The study analyzes how jailbreak methods affect refusal signals and internal harmfulness beliefs, and explores the impact of adversarial finetuning on harmfulness representations. A practical application, Latent Guard, is developed using the harmfulness concept to detect unsafe inputs. Result: The research found that LLMs encode harmfulness internally as a separate concept from refusal, with a distinct harmfulness direction. Steering along this direction changes how models interpret instructions, while the refusal direction mainly triggers refusals without altering harmfulness beliefs. Jailbreak methods were found to reduce refusal signals without changing the model's harmfulness belief, and adversarial finetuning had minimal impact on harmfulness understanding. The proposed Latent Guard demonstrated performance comparable to or better than existing safeguard models. Conclusion: The study concludes that LLMs have an internal understanding of harmfulness that is more robust than their refusal decisions, offering a new perspective on AI safety. The harmfulness concept can be used as an intrinsic safeguard (Latent Guard) to detect unsafe inputs and reduce over-refusals, showing robustness against finetuning attacks. Abstract: LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond just refusing? Prior work has shown that LLMs' refusal behaviors can be mediated by a one-dimensional subspace, i.e., a refusal direction. 
In this work, we identify a new dimension to analyze safety mechanisms in LLMs, i.e., harmfulness, which is encoded internally as a separate concept from refusal. There exists a harmfulness direction that is distinct from the refusal direction. As causal evidence, steering along the harmfulness direction can lead LLMs to interpret harmless instructions as harmful, but steering along the refusal direction tends to elicit refusal responses directly without reversing the model's judgment on harmfulness. Furthermore, using our identified harmfulness concept, we find that certain jailbreak methods work by reducing the refusal signals without reversing the model's internal belief of harmfulness. We also find that adversarially finetuning models to accept harmful instructions has minimal impact on the model's internal belief of harmfulness. These insights lead to a practical safety application: The model's latent harmfulness representation can serve as an intrinsic safeguard (Latent Guard) for detecting unsafe inputs and reducing over-refusals that is robust to finetuning attacks. For instance, our Latent Guard achieves performance comparable to or better than Llama Guard 3 8B, a dedicated finetuned safeguard model, across different jailbreak methods. Our findings suggest that LLMs' internal understanding of harmfulness is more robust than their refusal decision to diverse input instructions, offering a new perspective to study AI safety.
[15] Marco-Bench-MIF: On Multilingual Instruction-Following Capability of Large Language Models
Bo Zeng,Chenyang Lyu,Sinuo Liu,Mingyan Zeng,Minghao Wu,Xuanfan Ni,Tianqi Shi,Yu Zhao,Yefeng Liu,Chenyu Zhu,Ruizhe Li,Jiahui Geng,Qing Li,Yu Tong,Longyue Wang,Weihua Luo,Kaifu Zhang
Main category: cs.CL
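TL;DR: This paper presents Marco-Bench-MIF, a localized multilingual benchmark for evaluating the multilingual instruction-following capability of large language models, and analyzes how resource level, model scale, and localization affect performance.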
Details
Motivation: Existing datasets such as IFEval are predominantly English or simply machine translated, which limits their use in multilingual settings, so a localized multilingual benchmark is needed. Method: A hybrid pipeline combining translation with verification was used to build Marco-Bench-MIF, a multilingual benchmark covering 30 languages, and more than 20 LLMs were comprehensively evaluated on it. Result: The evaluation found a 25-35% accuracy gap between high- and low-resource languages; larger model scale brings 45-60% gains, but script-specific challenges remain; machine-translated data yields 7-22% lower accuracy than localized data. Conclusion: The Marco-Bench-MIF evaluation shows that language resource level, model scale, and data localization significantly affect multilingual instruction following, and it reveals challenges in cross-lingual keyword consistency and compositional constraints. Abstract: Instruction-following capability has become a major ability to be evaluated for Large Language Models (LLMs). However, existing datasets, such as IFEval, are either predominantly monolingual and centered on English or simply machine translated to other languages, limiting their applicability in multilingual contexts. In this paper, we present a carefully curated extension of IFEval to a localized multilingual version named Marco-Bench-MIF, covering 30 languages with varying levels of localization. Our benchmark addresses linguistic constraints (e.g., modifying capitalization requirements for Chinese) and cultural references (e.g., substituting region-specific company names in prompts) via a hybrid pipeline combining translation with verification. Through comprehensive evaluation of 20+ LLMs on our Marco-Bench-MIF, we found (1) a 25-35% accuracy gap between high- and low-resource languages, (2) that model scale affects performance by 45-60%, yet script-specific challenges persist, and (3) that machine-translated data underestimates accuracy by 7-22% versus localized data. Our analysis identifies challenges in multilingual instruction following, including keyword consistency preservation and compositional constraint adherence across languages. Our Marco-Bench-MIF is available at https://github.com/AIDC-AI/Marco-Bench-MIF.
[16] A Survey of Deep Learning for Geometry Problem Solving
Jianzhe Ma,Wenxuan Wang,Qin Jin
Main category: cs.CL
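TL;DR: This paper surveys deep learning for geometry problem solving, covering a summary of tasks, a review of methods, an analysis of evaluation, and a discussion of future directions, and provides a continuously updated list of papers.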
Details
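Motivation: Geometry problem solving is a key area of mathematical reasoning that touches important fields such as education, the assessment of AI mathematical ability, and the assessment of multimodal ability. The rapid development of deep learning in recent years, especially the rise of multimodal large language models, has triggered a wave of research, so a systematic summary and analysis of this area is needed. Method: The survey comprehensively summarizes the tasks in geometry problem solving, reviews the related deep learning methods in depth, analyzes evaluation metrics and methods in detail, and critically discusses current challenges and future directions. Result: The paper provides a comprehensive survey of deep learning for geometry problem solving, including a summary of tasks, methods, and evaluation metrics, together with a discussion of current challenges and future research directions. The authors also maintain a continuously updated list of papers on GitHub. Conclusion: The paper summarizes the applications of deep learning in geometry problem solving, offers a comprehensive and practical reference, and maintains a continuously updated list of papers on GitHub to promote further development of the field. Abstract: Geometry problem solving is a key area of mathematical reasoning, which is widely involved in many important fields such as education, mathematical ability assessment of artificial intelligence, and multimodal ability assessment. In recent years, the rapid development of deep learning technology, especially the rise of multimodal large language models, has triggered a widespread research boom. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our goal is to provide a comprehensive and practical reference of deep learning for geometry problem solving to promote further developments in this field. We create a continuously updated list of papers on GitHub: https://github.com/majianz/dl4gps.[17] POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering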
Yichen Xu,Liangyu Chen,Liang Zhang,Wenxuan Wang,Qin Jin
Main category: cs.CL
TL;DR: This paper introduces PolyChartQA, the first large-scale multilingual chart question answering benchmark. It highlights performance gaps in vision-language models across different languages and lays the foundation for more inclusive models.
Details
Motivation: Existing chart understanding benchmarks are predominantly English-centric, limiting their accessibility and applicability to global audiences. There is a need for a multilingual benchmark to advance globally inclusive vision-language models. Method: The paper introduces a decoupled pipeline that separates chart data from rendering code, allowing multilingual charts to be generated by translating the data and reusing the code. State-of-the-art LLM-based translation and quality control measures are employed. Result: PolyChartQA includes 22,606 multilingual charts and 26,151 question-answering pairs across 10 diverse languages. Experiments reveal a significant performance gap between English and other languages, especially low-resource ones with non-Latin scripts. Conclusion: PolyChartQA enables systematic evaluation of multilingual chart understanding and highlights performance gaps in vision-language models across different languages. Abstract: Charts are a universally adopted medium for interpreting and communicating data. However, existing chart understanding benchmarks are predominantly English-centric, limiting their accessibility and applicability to global audiences. In this paper, we present PolyChartQA, the first large-scale multilingual chart question answering benchmark covering 22,606 charts and 26,151 question-answering pairs across 10 diverse languages. PolyChartQA is built using a decoupled pipeline that separates chart data from rendering code, allowing multilingual charts to be flexibly generated by simply translating the data and reusing the code. We leverage state-of-the-art LLM-based translation and enforce rigorous quality control in the pipeline to ensure the linguistic and semantic consistency of the generated multilingual charts. PolyChartQA facilitates systematic evaluation of multilingual chart understanding. 
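The decoupled design described above can be illustrated with a minimal sketch. It is not the PolyChartQA pipeline itself: the chart specification, the `TOY_FR` lookup table (standing in for an LLM-based translator), and the text renderer are all hypothetical names introduced here for illustration. The point is only that translation touches the data while the rendering code is reused unchanged.

```python
from typing import Callable

# Hypothetical chart specification: pure data, no rendering logic.
chart_en = {
    "title": "Annual Rainfall",
    "x_label": "City",
    "categories": ["Paris", "Tokyo"],
    "values": [3, 5],
}

# Toy lookup table standing in for an LLM-based translator (an assumption).
TOY_FR = {"Annual Rainfall": "Précipitations annuelles", "City": "Ville"}


def translate_chart(chart: dict, translate: Callable[[str], str]) -> dict:
    """Translate only the text fields; numeric data is copied unchanged."""
    return {
        "title": translate(chart["title"]),
        "x_label": translate(chart["x_label"]),
        "categories": [translate(c) for c in chart["categories"]],
        "values": list(chart["values"]),
    }


def render_chart(chart: dict) -> str:
    """Language-agnostic renderer: builds a text bar chart from the spec."""
    lines = [chart["title"], f"[{chart['x_label']}]"]
    for name, value in zip(chart["categories"], chart["values"]):
        lines.append(f"{name}: " + "#" * value)
    return "\n".join(lines)


# Strings missing from the toy table fall back to the source text.
chart_fr = translate_chart(chart_en, lambda s: TOY_FR.get(s, s))
print(render_chart(chart_en))
print(render_chart(chart_fr))
```

Because `render_chart` never inspects the language of its input, adding another language in this sketch only requires another translated copy of the data.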
Experiments on both open- and closed-source large vision-language models reveal a significant performance gap between English and other languages, especially low-resource ones with non-Latin scripts. This benchmark lays a foundation for advancing globally inclusive vision-language models.[18] BlockBPE: Parallel BPE Tokenization
Amos You
Main category: cs.CL
TL;DR: BlockBPE is a GPU-based tokenization method that significantly improves throughput for batch inference by parallelizing token merges, though with a slight reduction in generation quality.
Details
Motivation: Existing tokenization methods are CPU-bound, suboptimal for GPU batch inference, and dominated by Regex pre-tokenization, leading to higher runtime complexity. Method: BlockBPE uses a parallel GPU implementation of byte-pair encoding (BPE) that eliminates Regex pre-tokenization, enabling highly parallelized token merges and reducing runtime complexity. Result: BlockBPE achieves up to 2x higher throughput than tiktoken and 2.5x over HuggingFace Tokenizers on high-batch inference workloads, with near linear-time complexity. Conclusion: BlockBPE provides a more efficient and high-throughput alternative to existing tokenization methods by leveraging parallel GPU processing, despite a minor compromise in generation quality. Abstract: Tokenization is a critical preprocessing step in large language model pipelines, yet widely used implementations remain CPU-bound and suboptimal for batch inference workflows on GPU. We present BlockBPE, a parallel GPU implementation of byte-pair encoding (BPE) that achieves near linear-time complexity under realistic assumptions and is optimized for high-throughput, batch inference. Unlike existing Rust-based tokenizers such as HuggingFace Tokenizers or OpenAI's tiktoken, whose runtimes are dominated by Regex pre-tokenization and exhibit $O(n \log n)$ runtime, BlockBPE eliminates the Regex pre-tokenization, which leads to a small loss in generation quality but enables highly parallelized token merges within thread blocks, reducing overall complexity to $O(nd)$ where $d \ll n$. On high-batch inference workloads, BlockBPE achieves up to 2x higher throughput than tiktoken and 2.5x over HuggingFace Tokenizers.[19] DAC: A Dynamic Attention-aware Approach for Task-Agnostic Prompt Compression
Yi Zhao,Zuchao Li,Hai Zhao,Baoyuan Qi,Guoming Liu
Main category: cs.CL
TL;DR: This paper proposes DAC, a dynamic attention-aware method for task-agnostic prompt compression that integrates entropy and attention information, showing robust improvements across multiple domains and LLMs.
Details
Motivation: Existing prompt compression methods overlook attention-critical tokens and entropy shifts during compression, necessitating a more refined approach. Method: A dynamic attention-aware approach for task-agnostic prompt compression (DAC) that integrates entropy and attention information while dynamically sensing entropy shifts during compression. Result: DAC demonstrated consistent and substantial improvements in prompt compression across multiple domains like LongBench, GSM8K, and BBH, proving its efficacy. Conclusion: The proposed DAC method excels in prompt compression by integrating entropy and attention information, showing robust improvements across tasks and LLMs. Abstract: Task-agnostic prompt compression leverages the redundancy in natural language to reduce computational overhead and enhance information density within prompts, especially in long-context scenarios. Existing methods predominantly rely on information entropy as the metric to compress lexical units, aiming to achieve minimal information loss. However, these approaches overlook two critical aspects: (i) the importance of attention-critical tokens at the algorithmic level, and (ii) shifts in information entropy during the compression process. Motivated by these challenges, we propose a dynamic attention-aware approach for task-agnostic prompt compression (DAC). This approach effectively integrates entropy and attention information, dynamically sensing entropy shifts during compression to achieve fine-grained prompt compression. Extensive experiments across various domains, including LongBench, GSM8K, and BBH, show that DAC consistently yields robust and substantial improvements across a diverse range of tasks and LLMs, offering compelling evidence of its efficacy.[20] IAM: Efficient Inference through Attention Mapping between Different-scale LLMs
Yi Zhao,Zuchao Li,Hai Zhao
Main category: cs.CL
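TL;DR: This paper proposes IAM, a new framework that performs attention mapping between small and large LLMs to gain the dual benefits of accelerated attention computation and reduced KV cache usage, offering a new way to improve LLM efficiency.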
Details
Motivation: Existing methods for improving LLM inference efficiency mainly exploit sparsity inside the model and do not use external information for optimization. By identifying the high similarity of attention matrices across LLMs of different scales, this paper offers a new perspective for optimization. Method: The authors first analyze how to measure similarity, how to select mapping layers, and whether the mapping is consistent, and then introduce the IAM framework based on these insights. Result: Experiments show that IAM accelerates the prefill stage by 15% and reduces KV cache usage by 22.1% without appreciably affecting performance, and that the approach generalizes across different model series. Conclusion: IAM offers a new route to improving LLM efficiency and is complementary to many existing KV cache optimization methods, making it a flexible addition to the LLM efficiency toolkit. Abstract: LLMs encounter significant challenges in resource consumption nowadays, especially with long contexts. Despite extensive efforts dedicated to enhancing inference efficiency, these methods primarily exploit internal sparsity within the models, without leveraging external information for optimization. We identify the high similarity of attention matrices across different-scale LLMs, which offers a novel perspective for optimization. We first conduct a comprehensive analysis of how to measure similarity, how to select mapping layers, and whether mapping is consistent. Based on these insights, we introduce the IAM framework, which achieves dual benefits of accelerated attention computation and reduced KV cache usage by performing attention mapping between small and large LLMs. Our experimental results demonstrate that IAM can accelerate prefill by 15% and reduce KV cache usage by 22.1% without appreciably sacrificing performance. Experiments on different series of models show the generalizability of IAM. Importantly, it is also orthogonal to many existing KV cache optimization methods, making it a versatile addition to the current toolkit for enhancing LLM efficiency.[21] The benefits of query-based KGQA systems for complex and temporal questions in LLM era
Artem Alekseev,Mikhail Chaichuk,Miron Butko,Alexander Panchenko,Elena Tutubalina,Oleg Somov
Main category: cs.CL
TL;DR: This paper proposes a multi-stage query generation approach for WikiData QA that improves answer accuracy on multi-hop and temporal questions.
Details
Motivation: Large language models still struggle with multi-hop reasoning and temporal questions, and query-based knowledge graph QA offers a modular alternative. Method: The paper proposes a multi-stage query generation approach and introduces a novel entity linking and predicate matching method that uses CoT reasoning. Result: The proposed multi-stage approach shows improved performance on multi-hop and temporal benchmarks, and its robustness is evaluated through generalization and rejection studies. Conclusion: The query-based multi-stage KGQA framework shows potential for improving multi-hop and temporal QA, particularly with small language models. Abstract: Large language models excel in question-answering (QA) yet still struggle with multi-hop reasoning and temporal questions. Query-based knowledge graph QA (KGQA) offers a modular alternative by generating executable queries instead of direct answers. We explore a multi-stage query-based framework for WikiData QA, proposing a multi-stage approach that enhances performance on challenging multi-hop and temporal benchmarks. Through generalization and rejection studies, we evaluate robustness across multi-hop and temporal QA datasets. Additionally, we introduce a novel entity linking and predicate matching method using CoT reasoning. Our results demonstrate the potential of a query-based multi-stage KGQA framework for improving multi-hop and temporal QA with small language models. Code and data: https://github.com/ar2max/NLDB-KGQA-System[22] PoTPTQ: A Two-step Power-of-Two Post-training for LLMs
Xinyu Wang,Vahid Partovi Nia,Peng Lu,Jerry Huang,Xiao-Wen Chang,Boxing Chen,Yufei Cui
Main category: cs.CL
TL;DR: This paper proposes a novel Power-of-two (PoT) quantization framework for Large Language Models (LLMs) that improves accuracy in low-precision formats and accelerates inference speed on GPUs.
Details
Motivation: The motivation is to address the challenge of deploying Large Language Models (LLMs) due to substantial computational resource requirements, particularly focusing on improving the effectiveness of Power-of-two (PoT) quantization on GPUs. Method: The paper introduces a two-step post-training algorithm to maintain the accuracy of the quantized model: (i) initializing quantization scales with a robust starting point, and (ii) refining these scales using a minimal calibration set. Result: The performance of the PoT post-training algorithm surpasses the current state of the art in integer quantization, especially at low precisions like 2- and 3-bit formats. The PoT quantization leads to a $3.67\times$ speedup on an NVIDIA V100 and $1.63\times$ on an NVIDIA RTX 4090 compared to uniform integer dequantization. Conclusion: The proposed PoT quantization framework outperforms state-of-the-art accuracy in extremely low-precision number formats and enables faster inference through more efficient dequantization. Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, their deployment is challenging due to the substantial computational resources required. Power-of-two (PoT) quantization is a general tool to counteract this difficulty. Although previous PoT quantization methods can be efficiently dequantized on CPUs using fixed-point addition, they have proven less effective on GPUs. The reason is the entanglement of the sign bit and the sequential bit manipulations needed for dequantization. We propose a novel PoT quantization framework for LLM weights that (i) outperforms state-of-the-art accuracy in extremely low-precision number formats, and (ii) enables faster inference through more efficient dequantization. 
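For readers unfamiliar with PoT quantization, the following is a minimal sketch of the general idea only, not the paper's algorithm or its GPU dequantization kernel. It assumes a single per-group `scale`, one sign bit, and the remaining bits used for a non-positive exponent; the function name `pot_quantize` and the pass-through of zero are illustrative assumptions.

```python
import math


def pot_quantize(w: float, scale: float, bits: int = 3) -> float:
    """Map w to sign * scale * 2**e, rounding e in the log domain.

    Assumed layout: 1 sign bit plus (bits - 1) exponent bits, so the
    representable magnitudes are scale * {2**0, 2**-1, ..., 2**-(L-1)}
    with L = 2**(bits - 1).
    """
    if w == 0.0:
        # Simplification: zero is passed through; real formats treat it
        # differently (for example by clamping to the smallest level).
        return 0.0
    sign = 1.0 if w > 0 else -1.0
    n_levels = 2 ** (bits - 1)
    exp = round(math.log2(abs(w) / scale))   # nearest exponent in log domain
    exp = max(-(n_levels - 1), min(0, exp))  # clamp to the representable range
    return sign * scale * 2.0 ** exp


weights = [0.3, -0.9, 0.01, 5.0]
print([pot_quantize(w, scale=1.0) for w in weights])
```

Because every quantized magnitude is the scale times a power of two, dequantization reduces to an exponent shift rather than a general multiplication, which is the property the abstract refers to.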
To maintain the accuracy of the quantized model, we introduce a two-step post-training algorithm: (i) initialize the quantization scales with a robust starting point, and (ii) refine these scales using a minimal calibration set. The performance of our PoT post-training algorithm surpasses the current state-of-the-art in integer quantization, particularly at low precisions such as 2- and 3-bit formats. Our PoT quantization accelerates the dequantization step required for the floating point inference and leads to a $3.67\times$ speedup on an NVIDIA V100, and $1.63\times$ on an NVIDIA RTX 4090, compared to uniform integer dequantization.[23] Toxicity-Aware Few-Shot Prompting for Low-Resource Singlish Translation
Ziyu Ge,Gabriel Chua,Leanne Tan,Roy Ka-Wei Lee
Main category: cs.CL
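TL;DR: This paper introduces a toxicity-preserving translation framework for low-resource languages, using Singlish as a case study, and demonstrates its value for content moderation and regional platform governance.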
Details
Motivation: Online communication increasingly uses under-represented languages and colloquial dialects, yet standard translation systems often fail to preserve local slang, code-mixing, and culturally embedded markers of harmful speech. Method: The approach first performs human-verified few-shot prompt engineering, then optimizes model-prompt pairs by benchmarking several large language models on semantic similarity via direct and back-translation. Result: The framework improves translation quality and supports culturally sensitive moderation and benchmarking in low-resource contexts. Conclusion: The paper proposes a reproducible, two-stage framework for toxicity-preserving translation, validates its effectiveness through quantitative human evaluation, and contributes to the safety of multicultural LLMs for low-resource languages. Abstract: As online communication increasingly incorporates under-represented languages and colloquial dialects, standard translation systems often fail to preserve local slang, code-mixing, and culturally embedded markers of harmful speech. Translating toxic content between low-resource language pairs poses additional challenges due to scarce parallel data and safety filters that sanitize offensive expressions. In this work, we propose a reproducible, two-stage framework for toxicity-preserving translation, demonstrated on a code-mixed Singlish safety corpus. First, we perform human-verified few-shot prompt engineering: we iteratively curate and rank annotator-selected Singlish-target examples to capture nuanced slang, tone, and toxicity. Second, we optimize model-prompt pairs by benchmarking several large language models using semantic similarity via direct and back-translation. Quantitative human evaluation confirms the effectiveness and efficiency of our pipeline. Beyond improving translation quality, our framework contributes to the safety of multicultural LLMs by supporting culturally sensitive moderation and benchmarking in low-resource contexts. By positioning Singlish as a testbed for inclusive NLP, we underscore the importance of preserving sociolinguistic nuance in real-world applications such as content moderation and regional platform governance.[24] Graph Representations for Reading Comprehension Analysis using Large Language Model and Eye-Tracking Biomarker
Yuhong Zhang,Jialu Li,Shilai Yang,Yuchen Xu,Gert Cauwenberghs,Tzyy-Ping Jung
Main category: cs.CL
TL;DR: This study uses a graph-based text representation to explore how LLMs and humans understand language, finding that LLMs show high consistency in comprehension at the graph structural level, offering insights into human-AI learning strategies.
Details
Motivation: The motivation for this study stems from the need to better understand how humans and LLMs comprehend language across different contexts, particularly in functional tasks like inference and information retrieval. Previous work showed distinct patterns in biomarkers for words with varying relevance to inference targets but was limited in depth by focusing solely on individual words. Method: The study used an LLM-based AI agent to group words from a reading passage into nodes and edges, forming a graph-based text representation based on semantic meaning and question-oriented prompts. The distribution of eye fixations on important nodes and edges was then compared. Result: The findings indicate that LLMs exhibit high consistency in language understanding at the level of graph topological structure, enhancing previous findings and providing a more nuanced understanding of language comprehension. Conclusion: The study concludes that LLMs demonstrate high consistency in understanding language at the level of graph topological structure, building on previous findings and offering insights into human-AI co-learning strategies. Abstract: Reading comprehension is a fundamental skill in human cognitive development. With the advancement of Large Language Models (LLMs), there is a growing need to compare how humans and LLMs understand language across different contexts and apply this understanding to functional tasks such as inference, emotion interpretation, and information retrieval. Our previous work used LLMs and human biomarkers to study the reading comprehension process. The results showed that the biomarkers corresponding to words with high and low relevance to the inference target, as labeled by the LLMs, exhibited distinct patterns, particularly when validated using eye-tracking data. However, focusing solely on individual words limited the depth of understanding, which made the conclusions somewhat simplistic despite their potential significance. 
This study used an LLM-based AI agent to group words from a reading passage into nodes and edges, forming a graph-based text representation based on semantic meaning and question-oriented prompts. We then compare the distribution of eye fixations on important nodes and edges. Our findings indicate that LLMs exhibit high consistency in language understanding at the level of graph topological structure. These results build on our previous findings and offer insights into effective human-AI co-learning strategies.[25] Value-Based Large Language Model Agent Simulation for Mutual Evaluation of Trust and Interpersonal Closeness
Yuki Sakamoto,Takahisa Uchida,Hiroshi Ishiguro
Main category: cs.CL
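TL;DR: Using simulation experiments with LLM agents, this study finds that value similarity increases trust and interpersonal closeness between agents, providing a valid testbed for social science theories.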
Details
Motivation: In human societies, value similarity plays an important role in building trust and close relationships, but whether this principle holds in artificial societies composed of LLM agents has not been explored. This study therefore investigates how value similarity affects relationship building between LLM agents. Method: Two experiments were conducted. The first evaluated the controllability of values in LLMs to identify the most effective model and prompt design; the second generated pairs of LLM agents with specific values and analyzed their mutual evaluations of trust and interpersonal closeness after a dialogue. The experiments were run in English and Japanese to examine language dependence. Result: The experiments confirmed that agent pairs with higher value similarity showed greater mutual trust and interpersonal closeness. Conclusion: The study concludes that LLM agent simulation can serve as a valid testbed for social science theories, helps clarify how values influence relationship building, and provides a foundation for new theories and insights in social science research. Abstract: Large language models (LLMs) have emerged as powerful tools for simulating complex social phenomena using human-like agents with specific traits. In human societies, value similarity is important for building trust and close relationships; however, it remains unexplored whether this principle holds true in artificial societies comprising LLM agents. Therefore, this study investigates the influence of value similarity on relationship-building among LLM agents through two experiments. First, in a preliminary experiment, we evaluated the controllability of values in LLMs to identify the most effective model and prompt design for controlling the values. Subsequently, in the main experiment, we generated pairs of LLM agents imbued with specific values and analyzed their mutual evaluations of trust and interpersonal closeness following a dialogue. The experiments were conducted in English and Japanese to investigate language dependence. The results confirmed that pairs of agents with higher value similarity exhibited greater mutual trust and interpersonal closeness. Our findings demonstrate that the LLM agent simulation serves as a valid testbed for social science theories, contributes to elucidating the mechanisms by which values influence relationship building, and provides a foundation for inspiring new theories and insights into the social sciences.[26] Simplifications are Absolutists: How Simplified Language Reduces Word Sense Awareness in LLM-Generated Definitions
Lukas Ellinger,Miriam Anschütz,Georg Groh
Main category: cs.CL
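TL;DR: This paper studies how simplifying definitions for different target groups affects the quality of LLM-generated homonym definitions, finding that simplification causes information loss and misunderstanding, while fine-tuning with Direct Preference Optimization substantially improves response quality.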
Details
Motivation: Large language models (LLMs) can provide accurate word definitions and explanations for any context, but the scope of a definition varies across target groups such as children or language learners. For homonyms in particular, oversimplification may omit key senses and thereby mislead users who rely on LLM outputs. Method: Using two new multilingual evaluation datasets, the authors test DeepSeek v3, Llama 4 Maverick, Qwen3-30B A3B, GPT-4o mini, and Llama 3.1 8B with LLM-as-Judge and human annotation. Result: Simplification substantially reduces definition completeness by neglecting polysemy, increasing the risk of misunderstanding. Fine-tuning Llama 3.1 8B with Direct Preference Optimization substantially improves homonym response quality across all prompt types. Conclusion: The paper concludes that educational NLP needs to balance simplicity and completeness to ensure reliable, context-aware definitions for all learners. Abstract: Large Language Models (LLMs) can provide accurate word definitions and explanations for any context. However, the scope of the definition changes for different target groups, like children or language learners. This is especially relevant for homonyms, words with multiple meanings, where oversimplification might risk information loss by omitting key senses, potentially misleading users who trust LLM outputs. We investigate how simplification impacts homonym definition quality across three target groups: Normal, Simple, and ELI5. Using two novel evaluation datasets spanning multiple languages, we test DeepSeek v3, Llama 4 Maverick, Qwen3-30B A3B, GPT-4o mini, and Llama 3.1 8B via LLM-as-Judge and human annotations. Our results show that simplification drastically degrades definition completeness by neglecting polysemy, increasing the risk of misunderstanding. Fine-tuning Llama 3.1 8B with Direct Preference Optimization substantially improves homonym response quality across all prompt types. These findings highlight the need to balance simplicity and completeness in educational NLP to ensure reliable, context-aware definitions for all learners.[27] Improving Data and Parameter Efficiency of Neural Language Models Using Representation Analysis
Josip Jukić
Main category: cs.CL
TL;DR: This thesis improves the efficiency and robustness of language models using representation smoothness analysis, novel optimization techniques, and weak supervision, especially effective in low-resource settings.
Details
Motivation: To address the challenges of data and parameter inefficiency in neural language models, aiming to improve robustness, stability, and performance with minimal labeling and computational cost. Method: The work employs Jacobian and Hessian-based regularization, smoothness-informed early-stopping, active learning combined with parameter-efficient fine-tuning, and in-context learning for weak supervision. Result: Extensive experiments show that the proposed methods outperform traditional approaches in performance, stability, and efficiency, especially under low-resource conditions. Conclusion: The thesis concludes that integrating representation smoothness analysis with optimization techniques significantly enhances robustness, generalization, and efficiency of neural language models, particularly in low-resource and dynamic environments. Abstract: This thesis addresses challenges related to data and parameter efficiency in neural language models, with a focus on representation analysis and the introduction of new optimization techniques. The first part examines the properties and dynamics of language representations within neural models, emphasizing their significance in enhancing robustness and generalization. It proposes innovative approaches based on representation smoothness, including regularization strategies that utilize Jacobian and Hessian matrices to stabilize training and mitigate sensitivity to input perturbations. The second part focuses on methods to significantly enhance data and parameter efficiency by integrating active learning strategies with parameter-efficient fine-tuning, guided by insights from representation smoothness analysis. It presents smoothness-informed early-stopping techniques designed to eliminate the need for labeled validation sets and proposes innovative combinations of active learning and parameter-efficient fine-tuning to reduce labeling efforts and computational resources. 
Extensive experimental evaluations across various NLP tasks demonstrate that these combined approaches substantially outperform traditional methods in terms of performance, stability, and efficiency. The third part explores weak supervision techniques enhanced by in-context learning to effectively utilize unlabeled data, further reducing dependence on extensive labeling. It shows that using in-context learning as a mechanism for weak supervision enables models to better generalize from limited labeled data by leveraging unlabeled examples more effectively during training. Comprehensive empirical evaluations confirm significant gains in model accuracy, adaptability, and robustness, especially in low-resource settings and dynamic data environments.[28] A Comparative Approach to Assessing Linguistic Creativity of Large Language Models and Humans
Anca Dinu,Andra-Maria Florescu,Alina Resceanu
Main category: cs.CL
TL;DR: The paper introduces a linguistic creativity test for humans and LLMs, revealing that LLMs outperform humans in creativity tasks, with LLMs favoring F(ixed)-creativity and humans showing inclination towards E(extending)-creativity.
Details
Motivation: The motivation behind the paper is to assess and compare the linguistic creativity of humans and Large Language Models (LLMs) through a structured test that examines their capacity for creating new words and phrases. Method: The method involved administering a linguistic creativity test to 24 humans and an equal number of LLMs. The test assessed the ability to generate original words and phrases based on word formation processes and metaphorical language use. Answers were evaluated using the OCSAI tool for Originality, Elaboration, and Flexibility, with additional analysis on uniqueness and a manual analysis of the dataset. Result: LLMs outperformed humans in all assessed criteria and did better in six out of eight test tasks. There were minor differences in the uniqueness of answers between humans and LLMs. The manual analysis indicated a preference of humans for E(extending)-creativity and of LLMs for F(ixed)-creativity. Conclusion: The paper concludes that LLMs demonstrate a higher level of creativity compared to humans, excelling in most test tasks and favoring F(ixed)-creativity, while humans lean towards E(extending)-creativity. Abstract: The following paper introduces a general linguistic creativity test for humans and Large Language Models (LLMs). The test consists of various tasks aimed at assessing their ability to generate new original words and phrases based on word formation processes (derivation and compounding) and on metaphorical language use. We administered the test to 24 humans and to an equal number of LLMs, and we automatically evaluated their answers using OCSAI tool for three criteria: Originality, Elaboration, and Flexibility. The results show that LLMs not only outperformed humans in all the assessed criteria, but did better in six out of the eight test tasks. We then computed the uniqueness of the individual answers, which showed some minor differences between humans and LLMs. 
Finally, we performed a short manual analysis of the dataset, which revealed that humans are more inclined towards E(extending)-creativity, while LLMs favor F(ixed)-creativity.[29] Evaluating the Ability of Large Language Models to Reason about Cardinal Directions, Revisited
Anthony G Cohn,Robert E Blackwell
Main category: cs.CL
TL;DR: This paper studies how well large language models reason about cardinal directions.
Details
Motivation: To study the ability of 28 large language models (LLMs) to reason about cardinal directions. Method: A benchmark generated from a set of templates is used to extensively test an LLM's ability to determine the correct direction in a given scenario. Result: Even the newer Large Reasoning Models cannot reliably determine the correct direction for all questions. Conclusion: Large Reasoning Models cannot consistently determine the correct direction for all questions. Abstract: We investigate the abilities of 28 Large Language Models (LLMs) to reason about cardinal directions (CDs) using a benchmark generated from a set of templates, extensively testing an LLM's ability to determine the correct CD given a particular scenario. The templates allow for a number of degrees of variation such as means of locomotion of the agent involved, and whether set in the first, second or third person. Even the newer Large Reasoning Models are unable to reliably determine the correct CD for all questions. This paper summarises and extends earlier work presented at COSIT-24.[30] StylOch at PAN: Gradient-Boosted Trees with Frequency-Based Stylometric Features
Jeremi K. Ochab,Mateusz Matias,Tymoteusz Boba,Tomasz Walkowiak
Main category: cs.CL
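TL;DR: This paper presents an efficient method for detecting AI-generated text based on spaCy and gradient boosting machines, offering explainability and computational efficiency.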
Details
Motivation: To develop a non-neural, computationally inexpensive yet effective classification method, and to use a large corpus of machine-generated text to improve classifier performance. Method: Public spaCy models are used to preprocess text and extract several thousand features (including n-grams), and a light gradient boosting machine is used for classification. Result: The paper describes a scalable and efficient classification method suited to large-scale detection of AI-generated text. Conclusion: The paper presents a solution to the binary AI detection task based on a modular stylometric pipeline, with a light gradient boosting machine as the classifier; the approach is explainable and computationally inexpensive. Abstract: This submission to the binary AI detection task is based on a modular stylometric pipeline, where: public spaCy models are used for text preprocessing (including tokenisation, named entity recognition, dependency parsing, part-of-speech tagging, and morphology annotation) and extracting several thousand features (frequencies of n-grams of the above linguistic annotations); light-gradient boosting machines are used as the classifier. We collect a large corpus of more than 500 000 machine-generated texts for the classifier's training. We explore several parameter options to increase the classifier's capacity and take advantage of that training set. Our approach follows the non-neural, computationally inexpensive but explainable approach found effective previously.[31] BOOKCOREF: Coreference Resolution at Book Scale
Giuliano Martinelli,Tommaso Bonomo,Pere-Lluís Huguet Cabot,Roberto Navigli
Main category: cs.CL
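TL;DR: This paper introduces BOOKCOREF, a new benchmark for book-scale coreference resolution with an average document length of more than 200,000 tokens, and proposes an automatic pipeline for producing high-quality coreference annotations on long texts.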
Details
Motivation: Existing coreference resolution benchmarks such as LitBank are limited in document length and cannot adequately assess system performance at book scale, that is, when co-referring mentions span hundreds of thousands of tokens. Method: The authors propose a new automatic pipeline that produces high-quality coreference resolution annotations on full narrative texts and use it to create the BOOKCOREF benchmark. Result: Experiments show that the automatic pipeline is robust and that BOOKCOREF enables current long-document coreference systems to gain up to +20 CoNLL-F1 points when evaluated on full books. Conclusion: BOOKCOREF provides a new evaluation standard for book-scale coreference resolution and reveals new challenges that existing models face on long documents. Abstract: Coreference Resolution systems are typically evaluated on benchmarks containing small- to medium-scale documents. When it comes to evaluating long texts, however, existing benchmarks, such as LitBank, remain limited in length and do not adequately assess system capabilities at the book scale, i.e., when co-referring mentions span hundreds of thousands of tokens. To fill this gap, we first put forward a novel automatic pipeline that produces high-quality Coreference Resolution annotations on full narrative texts. Then, we adopt this pipeline to create the first book-scale coreference benchmark, BOOKCOREF, with an average document length of more than 200,000 tokens. We carry out a series of experiments showing the robustness of our automatic procedure and demonstrating the value of our resource, which enables current long-document coreference systems to gain up to +20 CoNLL-F1 points when evaluated on full books. Moreover, we report on the new challenges introduced by this unprecedented book-scale setting, highlighting that current models fail to deliver the same performance they achieve on smaller documents. We release our data and code to encourage research and development of new book-scale Coreference Resolution systems at https://github.com/sapienzanlp/bookcoref.[32] Findings of MEGA: Maths Explanation with LLMs using the Socratic Method for Active Learning
Tosin Adewumi,Foteini Simistira Liwicki,Marcus Liwicki,Viktor Gardelli,Lama Alkhaled,Hamam Mokayed
Main category: cs.CL
TL;DR: This paper introduces the MEGA method, which combines Socratic questioning, Chain of Thought reasoning, gamification, and formative feedback to improve university students' mathematics learning with large language models (LLMs). Students rated it as more effective than the traditional step-by-step CoT method, especially on more complex maths problems.
Details
Motivation: Many students struggle with Mathematics, often due to ineffective teaching methods, despite its importance in various disciplines. This study aims to develop and evaluate a more engaging and effective pedagogical approach for Maths learning. Method: The study used a within-group design, comparing the MEGA method to the traditional step-by-step Chain of Thought (CoT) method. Samples (n=60) were drawn from the GSM8K and MATH datasets and evaluated by university students. Two capable LLMs, GPT-4o and Claude 3.5 Sonnet, were employed for the analysis. Result: The results showed that students found the MEGA method more effective for learning, especially for difficult problems in the MATH dataset, where MEGA outperformed CoT by a significant margin (47.5% vs. 26.67%). Conclusion: The MEGA method is more effective than the traditional CoT approach in enhancing students' understanding of Mathematics, particularly for complex problems, highlighting its potential for improving pedagogy in this field. Abstract: This paper presents an intervention study on the effects of the combined methods of (1) the Socratic method, (2) Chain of Thought (CoT) reasoning, (3) simplified gamification and (4) formative feedback on university students' Maths learning driven by large language models (LLMs). We call our approach Mathematics Explanations through Games by AI LLMs (MEGA). Some students struggle with Maths and as a result avoid Maths-related disciplines or subjects despite the importance of Maths across many fields, including signal processing. Oftentimes, students' Maths difficulties stem from suboptimal pedagogy. We compared the MEGA method to the traditional step-by-step (CoT) method to ascertain which is better by using a within-group design after randomly assigning questions for the participants, who are university students.
Samples (n=60) were randomly drawn from each of the two test sets of the Grade School Math 8K (GSM8K) and Mathematics Aptitude Test of Heuristics (MATH) datasets, based on the error margin of 11%, the confidence level of 90%, and a manageable number of samples for the student evaluators. These samples were used to evaluate two capable LLMs at length (Generative Pretrained Transformer 4o (GPT4o) and Claude 3.5 Sonnet) out of the initial six that were tested for capability. The results showed that students agree in more instances that the MEGA method is experienced as better for learning for both datasets. It is even much better than the CoT (47.5% compared to 26.67%) in the more difficult MATH dataset, indicating that MEGA is better at explaining difficult Maths problems.
[33] Iterative Augmentation with Summarization Refinement (IASR) Evaluation for Unstructured Survey data Modeling and Analysis
Payal Bhattad,Sai Manoj Pudukotai Dinakarrao,Anju Gupta
Main category: cs.CL
TL;DR: This paper proposes a framework for evaluating LLM-based text augmentation, demonstrating improved semantic consistency and diversity, with GPT-3.5 Turbo showing the best performance. Applied to topic modeling, it significantly enhances granularity and eliminates overlaps.
Details
Motivation: The motivation stems from the limitations of existing text data augmentation techniques in ensuring semantic preservation during large-scale or iterative generation, which leads to redundancy and instability in NLP tasks. Method: The method involves a framework with Scalability Analysis for measuring semantic consistency and Iterative Augmentation with Summarization Refinement (IASR) to evaluate semantic drift. Empirical evaluations are conducted on state-of-the-art LLMs, particularly applied to a real-world task using BERTopic with GPT-enhanced few-shot labeling. Result: GPT-3.5 Turbo showed the best balance of semantic fidelity, diversity, and generation efficiency. The approach resulted in a 400% increase in topic granularity and eliminated topic overlaps in the real-world application. Conclusion: The proposed evaluation framework for LLM-based text augmentation demonstrates practical utility in improving topic modeling tasks, ensuring semantic preservation and achieving increased topic granularity without overlaps. Abstract: Text data augmentation is a widely used strategy for mitigating data sparsity in natural language processing (NLP), particularly in low-resource settings where limited samples hinder effective semantic modeling. While augmentation can improve input diversity and downstream interpretability, existing techniques often lack mechanisms to ensure semantic preservation during large-scale or iterative generation, leading to redundancy and instability. This work introduces a principled evaluation framework for large language model (LLM) based text augmentation, comprising two components: (1) Scalability Analysis, which measures semantic consistency as augmentation volume increases, and (2) Iterative Augmentation with Summarization Refinement (IASR), which evaluates semantic drift across recursive paraphrasing cycles. 
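The idea of tracking semantic drift across paraphrasing cycles can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function, the `semantic_drift` helper, and the example sentences are all hypothetical stand-ins, and a real pipeline would presumably use an LLM paraphraser and a sentence-embedding model instead.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real sentence embedding)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_drift(original, cycles):
    """Return 1 - cosine similarity to the original text for each paraphrasing cycle."""
    ref = embed(original)
    return [1.0 - cosine(ref, embed(text)) for text in cycles]


# Hypothetical outputs of three successive paraphrasing cycles.
original = "the survey respondents want shorter waiting times"
cycles = [
    "the survey respondents want shorter waiting times",
    "survey respondents would like shorter waiting times",
    "people asked for quicker service",
]
drift = semantic_drift(original, cycles)
```

Under these assumptions, a drift value near 0 means the paraphrase stays close to the source, while values approaching 1 indicate that recursive rewriting has moved away from the original wording.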
Empirical evaluations across state-of-the-art LLMs show that GPT-3.5 Turbo achieved the best balance of semantic fidelity, diversity, and generation efficiency. Applied to a real-world topic modeling task using BERTopic with GPT-enhanced few-shot labeling, the proposed approach results in a 400% increase in topic granularity and complete elimination of topic overlaps. These findings validated the utility of the proposed frameworks for structured evaluation of LLM-based augmentation in practical NLP pipelines.
[34] Overview of the Sensemaking Task at the ELOQUENT 2025 Lab: LLMs as Teachers, Students and Evaluators
Pavel Šindelář,Ondřej Bojar
Main category: cs.CL
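TL;DR: ELOQUENT's Sensemaking task evaluates generative models through a three-step process and finds that challenges remain in question generation, in restricting answers to the source material, and in evaluation accuracy.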
Details
Motivation: To provide easily testable, high-level criteria for evaluating generative language models and to advance their comprehension and generation abilities. Method: Generative models are tested in a three-step process inspired by classroom exams (question generation, question answering, and answer evaluation), using both automatic and manual evaluation. Result: LLMs answer questions acceptably but have marked problems restricting their answers to the given sources and evaluating answers accurately, performing especially poorly under adversarial tests. Conclusion: The Sensemaking task exposes challenges for generative models in question generation, answering, and evaluation; adhering strictly to the input materials and scoring answers accurately in particular need further improvement. Abstract: ELOQUENT is a set of shared tasks that aims to create easily testable high-level criteria for evaluating generative language models. Sensemaking is one such shared task. In Sensemaking, we try to assess how well generative models "make sense out of a given text" in three steps inspired by exams in a classroom setting: (1) Teacher systems should prepare a set of questions, (2) Student systems should answer these questions, and (3) Evaluator systems should score these answers, all adhering rather strictly to a given set of input materials. We report on the 2025 edition of Sensemaking, where we had 7 sources of test materials (fact-checking analyses of statements, textbooks, transcribed recordings of a lecture, and educational videos) spanning English, German, Ukrainian, and Czech languages. This year, 4 teams participated, providing us with 2 Teacher submissions, 2 Student submissions, and 2 Evaluator submissions. We added baselines for Teacher and Student using commercial large language model systems. We devised a fully automatic evaluation procedure, which we compare to a minimalistic manual evaluation. We were able to make some interesting observations. For the first task, the creation of questions, better evaluation strategies will still have to be devised because it is difficult to discern the quality of the various candidate question sets. In the second task, question answering, the LLMs examined overall perform acceptably, but restricting their answers to the given input texts remains problematic.
In the third task, evaluation of question answers, our adversarial tests reveal that systems using the LLM-as-a-Judge paradigm erroneously rate both garbled question-answer pairs and answers to mixed-up questions as acceptable.
[35] Toward a Behavioural Translation Style Space: Simulating the Temporal Dynamics of Affect, Behaviour, and Cognition in Human Translation Production
Michael Carl,Takanori Mizowaki,Aishvarya Ray,Masaru Yamada,Devi Sri Bandaru,Xinyue Ren
Main category: cs.CL
TL;DR: The paper proposes a Behavioural Translation Style Space (BTSS) to understand translation patterns, suggesting that physical translation behaviour is shaped by cognitive processes and affective states, with BTSS built using keystroke and gaze data.
Details
Motivation: The paper aims to describe possible behavioural translation patterns through a Behavioural Translation Style Space (BTSS), suggesting that observable translation behaviour is fundamental to the physical act of translation, shaped by higher-order cognitive processes and affective translation states. Method: The paper uses records of keystrokes and gaze data as indicators of the hidden mental processing structure and organizes the behavioural patterns as a multi-layered embedded BTSS. Result: The result is the development of a hierarchical BTSS that organizes behavioural translation patterns into various embedded processing layers, based on the analysis of keystrokes and gaze data. Conclusion: BTSS serves as the basis for a computational translation agent to simulate the temporal dynamics of affect, automatized behaviour and cognition during human translation production. Abstract: The paper introduces a Behavioural Translation Style Space (BTSS) that describes possible behavioural translation patterns. The suggested BTSS is organized as a hierarchical structure that entails various embedded processing layers. We posit that observable translation behaviour - i.e., eye and finger movements - is fundamental when executing the physical act of translation but it is caused and shaped by higher-order cognitive processes and affective translation states. We analyse records of keystrokes and gaze data as indicators of the hidden mental processing structure and organize the behavioural patterns as a multi-layered embedded BTSS. The BTSS serves as the basis for a computational translation agent to simulate the temporal dynamics of affect, automatized behaviour and cognition during human translation production.
[36] Towards few-shot isolated word reading assessment
Reuben Smit,Retief Louw,Herman Kamper
Main category: cs.CL
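TL;DR: This paper proposes a few-shot method based on SSL models for assessing child speech in low-resource settings, and finds that SSL representations are of limited effectiveness on child data.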
Details
Motivation: To explore an ASR-free method for isolated word reading assessment in low-resource settings, addressing the challenges of child speech data. Method: Input child speech is compared with a small set of adult-provided reference templates, both encoded using intermediate layers of SSL models; design options such as discretising the features and barycentre averaging of the templates are investigated. Result: Idealised experiments show reasonable performance for adult speech but a substantial drop for child speech input, which child templates do not remedy. Conclusion: Although self-supervised learned (SSL) representations have been successful in low-resource speech tasks, they remain limited for processing child speech, particularly in few-shot classification systems. Abstract: We explore an ASR-free method for isolated word reading assessment in low-resource settings. Our few-shot approach compares input child speech to a small set of adult-provided reference templates. Inputs and templates are encoded using intermediate layers from large self-supervised learned (SSL) models. Using an Afrikaans child speech benchmark, we investigate design options such as discretising SSL features and barycentre averaging of the templates. Idealised experiments show reasonable performance for adults, but a substantial drop for child speech input, even with child templates. Despite the success of employing SSL representations in low-resource speech tasks, our work highlights the limitations of SSL representations for processing child data when used in a few-shot classification system.
[37] Improving Contextual ASR via Multi-grained Fusion with Large Language Models
Shilin Zhou,Zhenghua Li
Main category: cs.CL
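TL;DR: This paper proposes a multi-grained approach that combines token-level and phrase-level fusion, integrating automatic speech recognition with large language models to improve keyword recognition accuracy, and validates its performance advantages experimentally.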
Details
Motivation: Existing end-to-end ASR models struggle to recognize contextually relevant keywords such as proper nouns or user-specific entities, so a more effective way to improve keyword recognition accuracy is needed. Method: A late-fusion strategy combines ASR's acoustic information with the rich contextual knowledge of LLMs, and ablation studies verify the contributions of the token-level and phrase-level fusion components. Result: Experiments on Chinese and English datasets show that the approach achieves state-of-the-art performance on keyword-related metrics while preserving high accuracy on non-keyword text. Conclusion: The paper proposes a novel multi-grained fusion approach that combines the strengths of token-level and phrase-level fusion, and experiments confirm substantial gains on keyword-related metrics. Abstract: While end-to-end Automatic Speech Recognition (ASR) models have shown impressive performance in transcribing general speech, they often struggle to accurately recognize contextually relevant keywords, such as proper nouns or user-specific entities. Previous approaches have explored leveraging keyword dictionaries in the textual modality to improve keyword recognition, either through token-level fusion that guides token-by-token generation or phrase-level fusion that enables direct copying of keyword phrases. However, these methods operate at different granularities and have their own limitations. In this paper, we propose a novel multi-grained fusion approach that jointly leverages the strengths of both token-level and phrase-level fusion with Large Language Models (LLMs). Our approach incorporates a late-fusion strategy that elegantly combines ASR's acoustic information with LLM's rich contextual knowledge, balancing fine-grained token precision with holistic phrase-level understanding. Experiments on Chinese and English datasets demonstrate that our approach achieves state-of-the-art performance on keyword-related metrics while preserving high accuracy on non-keyword text. Ablation studies further confirm that the token-level and phrase-level components both contribute significantly to the performance gains, complementing each other in our joint multi-grained framework. The code and models will be publicly available at https://github.com/.
[38] Translationese-index: Using Likelihood Ratios for Graded and Generalizable Measurement of Translationese
Yikang Liu,Wanyang Zhang,Yiming Wang,Jialong Tang,Pei Zhang,Baosong Yang,Fei Huang,Rui Wang,Hai Hu
Main category: cs.CL
TL;DR: This paper introduces T-index, a new measure for translationese that is robust, efficient, and complementary to existing MT QE metrics.
Details
Motivation: There was no quantitative measure for translationese before, and such a measure could improve machine translation quality estimation. Method: T-index is computed from the likelihood ratios of two contrastively fine-tuned language models (LMs), evaluated using a synthesized dataset and a dataset with translations in the wild. Result: T-index is both robust and efficient, with absolute values correlating well with human ratings of degrees of translationese (Pearson's $r = 0.568$) and relative differences predicting pairwise annotations. Additionally, T-index has low correlation with existing metrics like BLEU and COMET. Conclusion: T-index is a robust and efficient quantitative measure for translationese, which can serve as a complementary metric in MT QE as it is not covered by existing MT QE metrics. Abstract: In this paper, we propose the first quantitative measure for translationese -- the translationese-index (T-index) for graded and generalizable measurement of translationese, computed from the likelihood ratios of two contrastively fine-tuned language models (LMs). We use a synthesized dataset and a dataset with translations in the wild to evaluate T-index's generalizability in cross-domain settings and its validity against human judgments. Our results show that T-index is both robust and efficient. T-index scored by two 0.5B LMs fine-tuned on only 1-5k pairs of synthetic data can well capture translationese in the wild. We find that the relative differences in T-indices between translations can well predict pairwise translationese annotations obtained from human annotators; and the absolute values of T-indices correlate well with human ratings of degrees of translationese (Pearson's $r = 0.568$). 
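As a rough illustration of a likelihood-ratio score of this kind, the sketch below computes a length-normalised log-likelihood ratio from per-token log-probabilities. The abstract does not give the exact formula, so the normalisation, the function name `t_index`, and the example log-probabilities are all assumptions; in practice the two lists would come from the two contrastively fine-tuned LMs scoring the same tokens.

```python
def t_index(logp_translationese, logp_natural):
    """Length-normalised log-likelihood ratio (assumed form, not taken from the paper).

    Each argument holds per-token log-probabilities that one LM assigns to the
    same sentence: the first from a model tuned on translationese-like text,
    the second from a model tuned on natural-sounding text.
    """
    if len(logp_translationese) != len(logp_natural):
        raise ValueError("both models must score the same tokens")
    n = len(logp_natural)
    return (sum(logp_translationese) - sum(logp_natural)) / n


# Hypothetical per-token log-probabilities for a three-token sentence.
score = t_index([-1.0, -2.0, -0.5], [-2.0, -2.5, -1.5])
```

With this sign convention, a positive score means the translationese-tuned model finds the sentence more likely than the natural-text model does, and a negative score means the opposite.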
Additionally, the correlation between T-index and existing machine translation (MT) quality estimation (QE) metrics such as BLEU and COMET is low, suggesting that T-index is not covered by these metrics and can serve as a complementary metric in MT QE.
[39] Infherno: End-to-end Agent-based FHIR Resource Synthesis from Free-form Clinical Notes
Johann Frei,Nils Feldhus,Lisa Raithel,Roland Roller,Alexander Meyer,Frank Kramer
Main category: cs.CL
TL;DR: This paper introduces Infherno, an end-to-end framework using LLM agents and healthcare terminology tools to convert unstructured clinical notes into structured FHIR resources, improving upon the generalizability and structural conformity of prior methods.
Details
Motivation: Previous methods for translating clinical notes into structured FHIR resources, such as rule-based systems or LLMs with instruction tuning, often suffer from limited generalizability and structural inconformity. This motivates the development of a more robust and automated solution. Method: The study proposes an end-to-end framework called Infherno, which utilizes LLM agents, code execution, and healthcare terminology database tools to translate unstructured clinical text into FHIR resources. Result: Infherno adheres to the FHIR document schema and performs competitively with a human baseline in predicting FHIR resources from unstructured text, supporting clinical data integration and interoperability across institutions. Conclusion: Infherno is a feasible and effective solution for translating free-form clinical notes into structured FHIR resources, offering improved generalizability and structural conformity compared to previous methods. Abstract: For clinical data integration and healthcare services, the HL7 FHIR standard has established itself as a desirable format for interoperability between complex health data. Previous attempts at automating the translation from free-form clinical notes into structured FHIR resources rely on modular, rule-based systems or LLMs with instruction tuning and constrained decoding. Since they frequently suffer from limited generalizability and structural inconformity, we propose an end-to-end framework powered by LLM agents, code execution, and healthcare terminology database tools to address these issues. Our solution, called Infherno, is designed to adhere to the FHIR document schema and competes well with a human baseline in predicting FHIR resources from unstructured text. 
The implementation features a front end for custom and synthetic data and both local and proprietary models, supporting clinical data integration processes and interoperability across institutions.
[40] Text-ADBench: Text Anomaly Detection Benchmark based on LLMs Embedding
Feng Xiao,Jicong Fan
Main category: cs.CL
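TL;DR: This paper builds a comprehensive benchmark for text anomaly detection, shows that embedding quality critically affects detection performance and that conventional methods perform well on LLM embeddings, and open-sources a toolkit to support future research.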
Details
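Motivation: The lack of standardized, comprehensive benchmarks for evaluating existing text anomaly detection methods limits rigorous comparison and innovation, so a systematic benchmark is needed to advance research. Method: A comprehensive empirical study systematically evaluates text anomaly detection based on embeddings from different pre-trained language models, covering early models (GloVe, BERT), multiple LLMs (LLaMa-2, LLama-3, Mistral, OpenAI models), and multi-domain text datasets (news, social media, scientific publications), with AUROC and AUPRC as evaluation metrics. Result: Embedding quality significantly affects anomaly detection performance; conventional shallow algorithms perform on par with deep learning methods when using LLM embeddings; cross-model performance matrices are strongly low-rank, which supports efficient model evaluation and selection; and a benchmark toolkit containing all model embeddings and code is open-sourced. Conclusion: The effectiveness of text anomaly detection depends heavily on embedding quality; with LLM embeddings, deep learning methods show no performance advantage over conventional shallow algorithms; cross-model performance matrices are strongly low-rank, enabling efficient model evaluation and selection; and the open-source toolkit lays a foundation for future research. Abstract: Text anomaly detection is a critical task in natural language processing (NLP), with applications spanning fraud detection, misinformation identification, spam detection and content moderation, etc. Despite significant advances in large language models (LLMs) and anomaly detection algorithms, the absence of standardized and comprehensive benchmarks for evaluating the existing anomaly detection methods on text data limits rigorous comparison and development of innovative approaches. This work performs a comprehensive empirical study and introduces a benchmark for text anomaly detection, leveraging embeddings from diverse pre-trained language models across a wide array of text datasets. Our work systematically evaluates the effectiveness of embedding-based text anomaly detection by incorporating (1) early language models (GloVe, BERT); (2) multiple LLMs (LLaMa-2, LLama-3, Mistral, OpenAI (small, ada, large)); (3) multi-domain text datasets (news, social media, scientific publications); (4) comprehensive evaluation metrics (AUROC, AUPRC).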
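To make the evaluation setup concrete, here is a minimal sketch of embedding-based anomaly detection with a shallow detector, scored by AUROC. It is only an illustration under stated assumptions, not the benchmark's code: the 2-D vectors stand in for real language-model embeddings, and `knn_scores` and `auroc` are hypothetical helper names. KNN is used because it is one of the shallow algorithms named in this abstract.

```python
import math


def knn_scores(train, test, k=2):
    """Anomaly score = mean Euclidean distance to the k nearest training embeddings."""
    scores = []
    for x in test:
        dists = sorted(math.dist(x, t) for t in train)
        scores.append(sum(dists[:k]) / k)
    return scores


def auroc(scores, labels):
    """Probability that a random anomaly (label 1) outscores a random normal
    sample (label 0); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical 2-D "embeddings"; real ones would come from a language model.
train = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
test = [(0.05, 0.05), (0.0, 0.05), (1.0, 1.0), (0.9, 1.2)]
labels = [0, 0, 1, 1]  # 1 marks an anomalous test sample

scores = knn_scores(train, test)
result = auroc(scores, labels)
```

Because the detector only sees distances between vectors, swapping in a different embedding model changes the scores without changing the detector, which is why embedding quality can dominate the outcome in a setup like this.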
Our experiments reveal a critical empirical insight: embedding quality significantly governs anomaly detection efficacy, and deep learning-based approaches demonstrate no performance advantage over conventional shallow algorithms (e.g., KNN, Isolation Forest) when leveraging LLM-derived embeddings. In addition, we observe strongly low-rank characteristics in cross-model performance matrices, which enables an efficient strategy for rapid model evaluation (or embedding evaluation) and selection in practical applications. Furthermore, by open-sourcing our benchmark toolkit that includes all embeddings from different models and code at https://github.com/jicongfan/Text-Anomaly-Detection-Benchmark, this work provides a foundation for future research in robust and scalable text anomaly detection systems.
[41] Chain-of-Descriptions: Improving Code LLMs for VHDL Code Generation and Summarization
Prashanth Vijayaraghavan,Apoorva Nitsure,Charles Mackin,Luyao Shi,Stefano Ambrogio,Arvind Haran,Viresh Paruthi,Ali Elzein,Dan Coops,David Beymer,Tyler Baldwin,Ehsan Degan
Main category: cs.CL
TL;DR: This paper evaluates LLMs for VHDL code generation and summarization, identifies their shortcomings, and proposes the CoDes method to enhance performance, achieving better results than standard approaches.
Details
Motivation: There is a lack of research evaluating and refining LLMs for hardware description languages like VHDL, despite their widespread use in NLP and code-related tasks. Method: The study evaluates existing code LLMs using two datasets (VHDL-Eval and VHDL-Xform) and proposes the CoDes method, which generates intermediate descriptive steps to enhance LLM performance. Result: Existing LLMs consistently underperformed across multiple metrics for VHDL code generation and summarization. The CoDes approach showed significant improvement over standard prompting methods on both datasets. Conclusion: The study concludes that existing LLMs underperform in VHDL code generation and summarization, and the proposed Chain-of-Descriptions (CoDes) approach significantly improves their performance, offering a framework for future research. Abstract: Large Language Models (LLMs) have become widely used across diverse NLP tasks and domains, demonstrating their adaptability and effectiveness. In the realm of Electronic Design Automation (EDA), LLMs show promise for tasks like Register-Transfer Level (RTL) code generation and summarization. However, despite the proliferation of LLMs for general code-related tasks, there's a dearth of research focused on evaluating and refining these models for hardware description languages (HDLs), notably VHDL. In this study, we evaluate the performance of existing code LLMs for VHDL code generation and summarization using various metrics and two datasets -- VHDL-Eval and VHDL-Xform. The latter, an in-house dataset, aims to gauge LLMs' understanding of functionally equivalent code. Our findings reveal consistent underperformance of these models across different metrics, underscoring a significant gap in their suitability for this domain. To address this challenge, we propose Chain-of-Descriptions (CoDes), a novel approach to enhance the performance of LLMs for VHDL code generation and summarization tasks. 
CoDes involves generating a series of intermediate descriptive steps based on: (i) the problem statement for code generation, and (ii) the VHDL code for summarization. These steps are then integrated with the original input prompt (problem statement or code) and provided as input to the LLMs to generate the final output. Our experiments demonstrate that the CoDes approach significantly surpasses the standard prompting strategy across various metrics on both datasets. This method not only improves the quality of VHDL code generation and summarization but also serves as a framework for future research aimed at enhancing code LLMs for VHDL.
[42] Exploring Gender Bias in Alzheimer's Disease Detection: Insights from Mandarin and Greek Speech Perception
Liu He,Yuanchao Li,Rui Feng,XinRan Han,Yin-Long Liu,Yuwei Yang,Zude Zhu,Jiahong Yuan
Main category: cs.CL
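TL;DR: The study reveals a gender bias in the perception of Alzheimer's disease (AD) speech: male speech is more likely to be identified as AD speech, and this is linked to specific acoustic features.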
Details
Motivation: Gender bias is widespread in speech perception tasks and is influenced by fundamental voicing differences between genders. The authors investigate whether this bias also affects the perception of Alzheimer's disease speech, and thereby the accuracy of related detection models. Method: A perception experiment in which 16 Chinese listeners evaluated Chinese and Greek speech was combined with acoustic analysis of the relationship between basic acoustic features and AD perception. Result: Male speech was more frequently identified as AD speech, with the bias particularly pronounced in Chinese speech. Acoustic analysis showed that shimmer values in male speech were significantly associated with AD perception, while speech portion had a significant negative correlation with AD identification. Conclusion: The study reveals a gender bias in the perception of Alzheimer's disease (AD) speech and stresses the importance of addressing it when developing AD detection models. Abstract: Gender bias has been widely observed in speech perception tasks, influenced by the fundamental voicing differences between genders. This study reveals a gender bias in the perception of Alzheimer's Disease (AD) speech. In a perception experiment involving 16 Chinese listeners evaluating both Chinese and Greek speech, we identified that male speech was more frequently identified as AD, with this bias being particularly pronounced in Chinese speech. Acoustic analysis showed that shimmer values in male speech were significantly associated with AD perception, while speech portion exhibited a significant negative correlation with AD identification. Although language did not have a significant impact on AD perception, our findings underscore the critical role of gender bias in AD speech perception. This work highlights the necessity of addressing gender bias when developing AD detection models and calls for further research to validate model performance across different linguistic contexts.
[43] Beyond Single Models: Enhancing LLM Detection of Ambiguity in Requests through Debate
Ana Davila,Jacinto Colan,Yasuhisa Hasegawa
Main category: cs.CL
TL;DR: This paper shows that a multi-agent debate framework can enhance the ability of Large Language Models to handle ambiguous user requests.
Details
Motivation: To overcome challenges of ambiguity in user requests faced by Large Language Models. Method: Introduced and evaluated a multi-agent debate framework with three LLM architectures and a dataset of ambiguous queries. Result: The debate framework improved the performance of the Llama3-8B and Mistral-7B variants over their individual baselines, with Mistral-7B-led debates achieving a 76.7% success rate and proving particularly effective for complex ambiguities. Conclusion: Structured debates can significantly improve the clarity and performance of Large Language Models in handling ambiguous user requests. Abstract: Large Language Models (LLMs) have demonstrated significant capabilities in understanding and generating human language, contributing to more natural interactions with complex systems. However, they face challenges such as ambiguity in user requests processed by LLMs. To address these challenges, this paper introduces and evaluates a multi-agent debate framework designed to enhance detection and resolution capabilities beyond single models. The framework consists of three LLM architectures (Llama3-8B, Gemma2-9B, and Mistral-7B variants) and a dataset with diverse ambiguities. The debate framework markedly enhanced the performance of Llama3-8B and Mistral-7B variants over their individual baselines, with Mistral-7B-led debates achieving a notable 76.7% success rate and proving particularly effective for complex ambiguities and efficient consensus. While acknowledging varying model responses to collaborative strategies, these findings underscore the debate framework's value as a targeted method for augmenting LLM capabilities. This work offers important insights for developing more robust and adaptive language understanding systems by showing how structured debates can lead to improved clarity in interactive systems.
[44] Web-Browsing LLMs Can Access Social Media Profiles and Infer User Demographics
Meysam Alizadeh,Fabrizio Gilardi,Zeynab Samei,Mohsen Mosleh
Main category: cs.CL
TL;DR: This study shows that web-browsing large language models can infer social media user demographics from usernames, raising both potential benefits for research and concerns about misuse.
Details
Motivation: The motivation is to explore whether LLMs equipped with web-browsing capabilities can directly retrieve and analyze social media data, particularly for inferring user demographics. Method: The researchers used a synthetic dataset of 48 X (Twitter) accounts and a survey dataset of 1,384 international participants to evaluate the LLMs' ability to predict user demographics based on usernames. Result: LLMs were shown to predict user demographics with reasonable accuracy and revealed insights into how they parse social media profiles, which could introduce gender and political biases for less active accounts. Conclusion: The study concludes that web-browsing LLMs can infer demographic attributes from social media usernames, which has potential applications in computational social science but also risks misuse. The authors recommend restricted access to this capability. Abstract: Large language models (LLMs) have traditionally relied on static training data, limiting their knowledge to fixed snapshots. Recent advancements, however, have equipped LLMs with web-browsing capabilities, enabling real-time information retrieval and multi-step reasoning over live web content. While prior studies have demonstrated LLMs' ability to access and analyze websites, their capacity to directly retrieve and analyze social media data remains unexplored. Here, we evaluate whether web-browsing LLMs can infer demographic attributes of social media users given only their usernames. Using a synthetic dataset of 48 X (Twitter) accounts and a survey dataset of 1,384 international participants, we show that these models can access social media content and predict user demographics with reasonable accuracy. Analysis of the synthetic dataset further reveals how LLMs parse and interpret social media profiles, which may introduce gender and political biases against accounts with minimal activity.
While this capability holds promise for computational social science in the post-API era, it also raises risks of misuse, particularly in information operations and targeted advertising, underscoring the need for safeguards. We recommend that LLM providers restrict this capability in public-facing applications, while preserving controlled access for verified research purposes.
[45] Probing for Arithmetic Errors in Language Models
Yucheng Sun,Alessandro Stolfo,Mrinmaya Sachan
Main category: cs.CL
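TL;DR: This paper investigates whether internal activations in language models can be used to detect arithmetic errors. Simple probes decode both the model's predicted output and the correct answer from hidden states, and lightweight error detectors predict model correctness with over 90% accuracy. The probes can also guide selective re-prompting of erroneous reasoning steps, improving task accuracy with minimal disruption to correct outputs.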
Details
Motivation: To explore whether the internal activations of language models can be used to detect arithmetic errors, making model predictions more reliable. Method: Starting from a controlled 3-digit addition setting, the paper shows how simple probes can accurately decode both the model's predicted output and the correct answer. It then trains lightweight error detectors and extends the analysis to structured chain-of-thought traces on addition-only GSM8K problems. Result: Simple probes accurately decode both the model's predicted output and the correct answer from hidden states, regardless of whether the model's output is correct. The lightweight error detectors predict model correctness with over 90% accuracy, and the probes generalize well to the more complex addition-only GSM8K setting. Conclusion: Arithmetic errors in language models can be anticipated from internal activations, and simple probes offer a viable, lightweight path toward model self-correction. Abstract: We investigate whether internal activations in language models can be used to detect arithmetic errors. Starting with a controlled setting of 3-digit addition, we show that simple probes can accurately decode both the model's predicted output and the correct answer from hidden states, regardless of whether the model's output is correct. Building on this, we train lightweight error detectors that predict model correctness with over 90% accuracy. We then extend our analysis to structured chain-of-thought traces on addition-only GSM8K problems and find that probes trained on simple arithmetic generalize well to this more complex setting, revealing consistent internal representations. Finally, we demonstrate that these probes can guide selective re-prompting of erroneous reasoning steps, improving task accuracy with minimal disruption to correct outputs. Our findings suggest that arithmetic errors can be anticipated from internal activations alone, and that simple probes offer a viable path toward lightweight model self-correction.
[46] Advancing Retrieval-Augmented Generation for Structured Enterprise and Internal Data
Chandana Cheerla
Main category: cs.CL
TL;DR: This paper proposes an advanced RAG framework to better handle structured and semi-structured enterprise data, showing significant improvements in retrieval accuracy and response quality through hybrid retrieval methods and semantic chunking.
Details
Motivation: Organizations increasingly depend on structured enterprise data for decision-making, but LLMs face limitations in handling heterogeneous data formats, context windows, and static pretraining. Conventional RAG frameworks struggle with structured and semi-structured data, necessitating an improved approach. Method: The framework combines hybrid retrieval strategies using dense embeddings (all-mpnet-base-v2) and BM25, enhanced with metadata-aware filtering via SpaCy NER and cross-encoder reranking. It employs semantic chunking to maintain textual coherence, preserves tabular data structures, utilizes quantized indexing for efficiency, and incorporates human-in-the-loop feedback and conversation memory for adaptability. Result: Experiments on enterprise datasets showed significant improvements: Precision@5 increased by 15% (90 vs. 75), Recall@5 by 13% (87 vs. 74), and Mean Reciprocal Rank by 16% (0.85 vs. 0.69). Qualitative evaluations also showed higher scores for Faithfulness (4.6 vs. 3.0), Completeness (4.2 vs. 2.5), and Relevance (4.5 vs. 3.2) on a 5-point Likert scale. Conclusion: The proposed advanced RAG framework significantly enhances the retrieval and generation capabilities for structured and semi-structured enterprise data, demonstrating improved precision, recall, and response quality. Future work involves expanding to multimodal data and incorporating agent-based retrieval. Abstract: Organizations increasingly rely on proprietary enterprise data, including HR records, structured reports, and tabular documents, for critical decision-making. While Large Language Models (LLMs) have strong generative capabilities, they are limited by static pretraining, short context windows, and challenges in processing heterogeneous data formats. Conventional Retrieval-Augmented Generation (RAG) frameworks address some of these gaps but often struggle with structured and semi-structured data. 
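The Result summary above reports Precision@5, Recall@5, and Mean Reciprocal Rank; the reported gains read as absolute differences (for example, 90 minus 75 gives 15). As a reminder of how these retrieval metrics are usually computed, here is a minimal sketch using the standard definitions. The document IDs and relevance sets are invented for illustration and have nothing to do with the paper's enterprise datasets.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)

def mean_reciprocal_rank(rankings, relevants):
    """Average over queries of 1 / rank of the first relevant document (0 if none)."""
    total = 0.0
    for ranked, relevant in zip(rankings, relevants):
        for position, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / position
                break
    return total / len(rankings)

# Hypothetical toy queries: ranked retrieval results and ground-truth relevant sets.
rankings = [["d3", "d1", "d7", "d2", "d9"], ["d5", "d6"]]
relevants = [{"d1", "d2", "d4"}, {"d5"}]

p5 = precision_at_k(rankings[0], relevants[0], 5)   # 2 of 5 hits -> 0.4
r5 = recall_at_k(rankings[0], relevants[0], 5)      # 2 of 3 relevant docs found
mrr = mean_reciprocal_rank(rankings, relevants)     # (1/2 + 1/1) / 2 -> 0.75
print(p5, r5, mrr)
```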
This work proposes an advanced RAG framework that combines hybrid retrieval strategies using dense embeddings (all-mpnet-base-v2) and BM25, enhanced by metadata-aware filtering with SpaCy NER and cross-encoder reranking. The framework applies semantic chunking to maintain textual coherence and retains tabular data structures to preserve row-column integrity. Quantized indexing optimizes retrieval efficiency, while human-in-the-loop feedback and conversation memory improve adaptability. Experiments on enterprise datasets show notable improvements: Precision@5 increased by 15 percent (90 versus 75), Recall@5 by 13 percent (87 versus 74), and Mean Reciprocal Rank by 16 percent (0.85 versus 0.69). Qualitative evaluations show higher scores in Faithfulness (4.6 versus 3.0), Completeness (4.2 versus 2.5), and Relevance (4.5 versus 3.2) on a 5-point Likert scale. These results demonstrate the framework's effectiveness in delivering accurate, comprehensive, and contextually relevant responses for enterprise tasks. Future work includes extending to multimodal data and integrating agent-based retrieval. The source code will be released at https://github.com/CheerlaChandana/Enterprise-Chatbot
[47] Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models
Yik Siu Chan,Zheng-Xin Yong,Stephen H. Bach
Main category: cs.CL
TL;DR: This paper explores the use of CoT activations with linear probes to effectively predict the safety of final responses in language models, offering potential for real-time safety monitoring during response generation.
Details
Motivation: The motivation behind the study is to address alignment risks in open-weights reasoning language models by predicting potential misalignment in final responses using chains-of-thought (CoTs), which often contain harmful content. Method: The researchers evaluated various monitoring approaches, including humans, large language models, and text classifiers, using either CoT text or activations. They focused on assessing the effectiveness of a linear probe trained on CoT activations in predicting the safety of final responses. Result: A linear probe trained on CoT activations significantly outperformed text-based methods in predicting whether a final response would be safe or unsafe. Additionally, the probe was effective even when applied to early segments of the CoT, enabling early intervention. Conclusion: The study concludes that using linear probes on CoT activations is a reliable and effective method for predicting the safety of final responses in language models, offering potential for real-time safety monitoring during generation. Abstract: Open-weights reasoning language models generate long chains-of-thought (CoTs) before producing a final response, which improves performance but introduces additional alignment risks, with harmful content often appearing in both the CoTs and the final outputs. In this work, we investigate if we can use CoTs to predict final response misalignment. We evaluate a range of monitoring approaches, including humans, highly-capable large language models, and text classifiers, using either CoT text or activations. First, we find that a simple linear probe trained on CoT activations can significantly outperform all text-based methods in predicting whether a final response will be safe or unsafe. CoT texts are often unfaithful and can mislead humans and classifiers, while model latents (i.e., CoT activations) offer a more reliable predictive signal. 
Second, the probe makes accurate predictions before reasoning completes, achieving strong performance even when applied to early CoT segments. These findings generalize across model sizes, families, and safety benchmarks, suggesting that lightweight probes could enable real-time safety monitoring and early intervention during generation.
[48] S2WTM: Spherical Sliced-Wasserstein Autoencoder for Topic Modeling
Suman Adhya,Debarshi Kumar Sanyal
Main category: cs.CL
TL;DR: S2WTM uses the Spherical Sliced-Wasserstein distance to address posterior collapse in VAE-NTMs, achieving better topic modeling and downstream task performance.
Details
Motivation: VAE-NTMs often suffer from posterior collapse when encoding hyperspherical structure, which leads to ineffective latent representations. S2WTM is proposed to address this. Method: S2WTM employs a prior distribution supported on the unit hypersphere and uses the Spherical Sliced-Wasserstein distance to align the aggregated posterior distribution with the prior. Result: Experiments show that S2WTM outperforms existing state-of-the-art topic models, generating more coherent and diverse topics while performing better on downstream tasks. Conclusion: S2WTM effectively mitigates posterior collapse in VAE-NTMs while producing more coherent and diverse topics and improving downstream task performance. Abstract: Modeling latent representations in a hyperspherical space has proven effective for capturing directional similarities in high-dimensional text data, benefiting topic modeling. Variational autoencoder-based neural topic models (VAE-NTMs) commonly adopt the von Mises-Fisher prior to encode hyperspherical structure. However, VAE-NTMs often suffer from posterior collapse, where the KL divergence term in the objective function diminishes substantially, leading to ineffective latent representations. To mitigate this issue while modeling hyperspherical structure in the latent space, we propose the Spherical Sliced Wasserstein Autoencoder for Topic Modeling (S2WTM). S2WTM employs a prior distribution supported on the unit hypersphere and leverages the Spherical Sliced-Wasserstein distance to align the aggregated posterior distribution with the prior. Experimental results demonstrate that S2WTM outperforms state-of-the-art topic models, generating more coherent and diverse topics while improving performance on downstream tasks.
[49] Language Models Improve When Pretraining Data Matches Target Tasks
David Mizrahi,Anders Boesen Lindbo Larsen,Jesse Allardice,Suzie Petryk,Yuri Gorokhov,Jeffrey Li,Alex Fang,Josh Gardner,Tom Gunter,Afshin Dehghan
Main category: cs.CL
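TL;DR: BETR selects pretraining documents by their similarity to benchmark training examples and significantly improves performance across multiple tasks and model scales.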
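According to this entry's abstract, BETR embeds benchmark examples and a sample of pretraining documents in a shared space and scores the documents by similarity to the benchmarks. The sketch below shows one way such a scoring step could look. The scoring rule (maximum cosine similarity), the top-fraction cutoff, the function names, and the tiny hand-written embeddings are all assumptions for illustration; the paper's lightweight classifier that extends scores to the full corpus is not shown.

```python
import numpy as np

def normalize_rows(x):
    """Scale each row to unit L2 norm so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def benchmark_similarity_scores(doc_emb, bench_emb):
    """Score each document by its highest cosine similarity to any benchmark example."""
    sims = normalize_rows(doc_emb) @ normalize_rows(bench_emb).T  # (n_docs, n_bench)
    return sims.max(axis=1)

def select_top_fraction(scores, fraction):
    """Return indices of the highest-scoring documents, keeping the given fraction."""
    keep = max(1, int(round(len(scores) * fraction)))
    return np.argsort(-scores)[:keep]

# Hypothetical 2-D embeddings standing in for a real embedding model's output.
bench_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_emb = np.array([[2.0, 0.0], [1.0, 1.0], [-1.0, 0.0], [0.0, 3.0]])

scores = benchmark_similarity_scores(doc_emb, bench_emb)
selected = select_top_fraction(scores, fraction=0.5)
print(scores, sorted(selected.tolist()))
```

The abstract also notes that larger models require less aggressive filtering, which in this sketch would correspond to raising `fraction` as model scale grows.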
Details
Motivation: The research question is how to explicitly optimize the target of data selection methods in order to improve model performance. Method: The paper proposes BETR, a simple method that selects pretraining documents based on similarity to benchmark training examples. Result: BETR improves performance on 9 of 10 tasks and also generalizes well to different benchmark sets. Conclusion: Directly matching pretraining data to target tasks precisely shapes model capabilities, and the optimal selection strategy must adapt to model scale. Abstract: Every data selection method inherently has a target. In practice, these targets often emerge implicitly through benchmark-driven iteration: researchers develop selection strategies, train models, measure benchmark performance, then refine accordingly. This raises a natural question: what happens when we make this optimization explicit? To explore this, we propose benchmark-targeted ranking (BETR), a simple method that selects pretraining documents based on similarity to benchmark training examples. BETR embeds benchmark examples and a sample of pretraining documents in a shared space, scores this sample by similarity to benchmarks, then trains a lightweight classifier to predict these scores for the full corpus. We compare data selection methods by training over 500 models spanning $10^{19}$ to $10^{22}$ FLOPs and fitting scaling laws to them. From this, we find that simply aligning pretraining data to evaluation benchmarks using BETR achieves a 2.1x compute multiplier over DCLM-Baseline (4.7x over unfiltered data) and improves performance on 9 out of 10 tasks across all scales. BETR also generalizes well: when targeting a diverse set of benchmarks disjoint from our evaluation suite, it still matches or outperforms baselines. Our scaling analysis further reveals a clear trend: larger models require less aggressive filtering. Overall, our findings show that directly matching pretraining data to target tasks precisely shapes model capabilities and highlight that optimal selection strategies must adapt to model scale.
cs.CV [Back]
[50] An Memory-Efficient Framework for Deformable Transformer with Neural Architecture Search
Wendong Mao,Mingfan Zhao,Jianfeng Guan,Qiwei Dong,Zhongfeng Wang
Main category: cs.CV
TL;DR: This paper proposes a hardware-friendly optimization framework for Deformable Attention Transformers (DAT) that reduces DRAM access times by 18% without significantly compromising accuracy (only 0.2% drop).
Details
Motivation: The motivation stems from the challenges posed by the irregular memory access patterns of Deformable Attention Transformers (DAT) due to their data-dependent sampling mechanism. These challenges make efficient hardware deployment difficult, and existing acceleration methods either incur high hardware overhead or compromise model accuracy. Method: The paper proposes a hardware-friendly optimization framework for DAT, which includes a neural architecture search (NAS)-based method with a new slicing strategy to divide input features into uniform patches, avoiding memory conflicts. Additionally, an FPGA-based verification system is designed to test the framework's performance on edge-side hardware. Result: Algorithm experiments on the ImageNet-1K dataset showed that the proposed framework maintains a very small accuracy drop of only 0.2% compared to the baseline DAT. Hardware experiments on Xilinx FPGA demonstrated an 18% reduction in DRAM access times compared to existing DAT acceleration methods. Conclusion: The paper concludes that their hardware-friendly framework for Deformable Attention Transformers (DAT) successfully reduces DRAM access times by 18% compared to existing DAT acceleration methods while maintaining a minimal accuracy drop of only 0.2% compared to the baseline DAT. Abstract: Deformable Attention Transformers (DAT) have shown remarkable performance in computer vision tasks by adaptively focusing on informative image regions. However, their data-dependent sampling mechanism introduces irregular memory access patterns, posing significant challenges for efficient hardware deployment. Existing acceleration methods either incur high hardware overhead or compromise model accuracy. To address these issues, this paper proposes a hardware-friendly optimization framework for DAT. 
First, a neural architecture search (NAS)-based method with a new slicing strategy is proposed to automatically divide the input feature into uniform patches during the inference process, avoiding memory conflicts without modifying model architecture. The method explores the optimal slice configuration by jointly optimizing hardware cost and inference accuracy. Secondly, an FPGA-based verification system is designed to test the performance of this framework on edge-side hardware. Algorithm experiments on the ImageNet-1K dataset demonstrate that our hardware-friendly framework incurs only a 0.2% accuracy drop compared to the baseline DAT. Hardware experiments on Xilinx FPGA show the proposed method reduces DRAM access times to 18% compared with existing DAT acceleration methods.
[51] Deformable Dynamic Convolution for Accurate yet Efficient Spatio-Temporal Traffic Prediction
Hyeonseok Jin,Geonmin Kim,Kyungbaek Kim
Main category: cs.CV
TL;DR: The paper proposes DDCN, a novel CNN-based architecture for spatio-temporal traffic prediction that addresses scalability and heterogeneity issues of traditional methods like GNNs.
Details
Motivation: The motivation is to overcome limitations of previous methods, especially Graph Neural Networks (GNNs), which require predefined adjacency matrices and struggle with scalability and heterogeneity in traffic patterns. Method: The paper proposes Deformable Dynamic Convolution Network (DDCN), which dynamically applies deformable filters based on offset, and decomposes a transformer-style CNN into an encoder-decoder structure to emphasize important features. Result: DDCN achieves competitive performance in traffic prediction on four real-world datasets, showing the potential of CNN-based approaches for spatio-temporal prediction tasks. Conclusion: The paper concludes that DDCN is an effective and efficient solution for spatio-temporal traffic prediction, demonstrating competitive performance on real-world datasets. Abstract: Spatio-temporal traffic prediction plays a key role in intelligent transportation systems by enabling accurate prediction in complex urban areas. Although not only accuracy but also efficiency for scalability is important, some previous methods struggle to capture heterogeneity such as varying traffic patterns across regions and time periods. Moreover, Graph Neural Networks (GNNs), which are the mainstream of traffic prediction, not only require a predefined adjacency matrix, but also limit scalability to large-scale data containing many nodes due to their inherent complexity. To overcome these limitations, we propose Deformable Dynamic Convolution Network (DDCN) for accurate yet efficient traffic prediction. Traditional Convolutional Neural Networks (CNNs) are limited in modeling non-Euclidean spatial structures and spatio-temporal heterogeneity; DDCN overcomes these challenges by dynamically applying deformable filters based on offset. 
Specifically, DDCN decomposes a transformer-style CNN into an encoder-decoder structure, and applies the proposed approaches to the spatial and spatio-temporal attention blocks of the encoder to emphasize important features. The decoder, composed of a feed-forward module, complements the output of the encoder. This novel structure enables DDCN to perform accurate yet efficient traffic prediction. In comprehensive experiments on four real-world datasets, DDCN achieves competitive performance, emphasizing the potential and effectiveness of CNN-based approaches for spatio-temporal traffic prediction.
[52] Inversion-DPO: Precise and Efficient Post-Training for Diffusion Models
Zejian Li,Yize Li,Chenye Meng,Zhongni Liu,Yang Ling,Shengyuan Zhang,Guang Yang,Changyuan Yang,Zhiyuan Yang,Lingyun Sun
Main category: cs.CV
TL;DR: Inversion-DPO is a novel, efficient, and high-precision alignment framework for diffusion models that avoids reward modeling and enables improved performance on complex realistic generation tasks.
Details
Motivation: The motivation is to address the limitations of existing alignment methods that require computation-intensive training of a base model and a reward model, which can compromise model accuracy and training efficiency. Method: The method reformulates Direct Preference Optimization (DPO) with DDIM inversion for DMs, using deterministic inversion from winning and losing samples to noise for intractable posterior sampling. Result: Inversion-DPO achieves substantial performance improvements compared to existing post-training methods, with the ability to generate high-fidelity compositionally coherent images. A paired dataset of 11,140 images was curated to enhance compositional capabilities. Conclusion: Inversion-DPO is a new alignment framework for diffusion models that improves precision and efficiency by eliminating the need for auxiliary reward models or inaccurate approximation. Abstract: Recent advancements in diffusion models (DMs) have been propelled by alignment methods that post-train models to better conform to human preferences. However, these approaches typically require computation-intensive training of a base model and a reward model, which not only incurs substantial computational overhead but may also compromise model accuracy and training efficiency. To address these limitations, we propose Inversion-DPO, a novel alignment framework that circumvents reward modeling by reformulating Direct Preference Optimization (DPO) with DDIM inversion for DMs. Our method conducts intractable posterior sampling in Diffusion-DPO with the deterministic inversion from winning and losing samples to noise and thus derives a new post-training paradigm. This paradigm eliminates the need for auxiliary reward models or inaccurate approximation, significantly enhancing both precision and efficiency of training. We apply Inversion-DPO to a basic task of text-to-image generation and a challenging task of compositional image generation. 
Extensive experiments show substantial performance improvements achieved by Inversion-DPO compared to existing post-training methods and highlight the ability of the trained generative models to generate high-fidelity compositionally coherent images. For the post-training of compositional image generation, we curate a paired dataset consisting of 11,140 images with complex structural annotations and comprehensive scores, designed to enhance the compositional capabilities of generative models. Inversion-DPO explores a new avenue for efficient, high-precision alignment in diffusion models, advancing their applicability to complex realistic generation tasks. Our code is available at https://github.com/MIGHTYEZ/Inversion-DPO
[53] Reprogramming Vision Foundation Models for Spatio-Temporal Forecasting
Changlu Chen,Yanbin Liu,Chaoxi Niu,Ling Chen,Tianqing Zhu
Main category: cs.CV
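TL;DR: This paper proposes ST-VFM, a framework that applies vision foundation models to spatio-temporal forecasting through a dual-branch architecture and reprogramming strategies, addressing the temporal modeling and modality gap problems and significantly improving forecasting performance.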
Details
Motivation: Existing large language models (LLMs) fall short on spatio-temporal forecasting because they primarily capture one-dimensional sequential dependencies and cannot effectively model complex spatio-temporal correlations. A new approach is therefore needed that exploits the spatial priors of vision foundation models (VFMs) and strengthens their temporal modeling capacity. Method: The proposed ST-VFM framework adopts a dual-branch architecture that combines raw spatio-temporal inputs with auxiliary spatio-temporal flow inputs, and introduces two dedicated reprogramming stages, pre-VFM reprogramming and post-VFM reprogramming, to strengthen the modeling of spatio-temporal correlations. Result: Experiments on ten spatio-temporal datasets show that ST-VFM performs well across VFM backbones (e.g., DINO, CLIP, DEIT), demonstrating effectiveness and robustness. Conclusion: ST-VFM is an effective general framework for spatio-temporal forecasting; the experiments show that it outperforms state-of-the-art baselines and works with different VFM backbones. Abstract: Foundation models have achieved remarkable success in natural language processing and computer vision, demonstrating strong capabilities in modeling complex patterns. While recent efforts have explored adapting large language models (LLMs) for time-series forecasting, LLMs primarily capture one-dimensional sequential dependencies and struggle to model the richer spatio-temporal (ST) correlations essential for accurate ST forecasting. In this paper, we present ST-VFM, a novel framework that systematically reprograms Vision Foundation Models (VFMs) for general-purpose spatio-temporal forecasting. While VFMs offer powerful spatial priors, two key challenges arise when applying them to ST tasks: (1) the lack of inherent temporal modeling capacity and (2) the modality gap between visual and ST data. To address these, ST-VFM adopts a dual-branch architecture that integrates raw ST inputs with auxiliary ST flow inputs, where the flow encodes lightweight temporal difference signals interpretable as dynamic spatial cues. To effectively process these dual-branch inputs, ST-VFM introduces two dedicated reprogramming stages. The pre-VFM reprogramming stage applies a Temporal-Aware Token Adapter to embed temporal context and align both branches into VFM-compatible feature spaces. The post-VFM reprogramming stage introduces a Bilateral Cross-Prompt Coordination module, enabling dynamic interaction between branches through prompt-based conditioning, thus enriching joint representation learning without modifying the frozen VFM backbone. 
Extensive experiments on ten spatio-temporal datasets show that ST-VFM outperforms state-of-the-art baselines, demonstrating effectiveness and robustness across VFM backbones (e.g., DINO, CLIP, DEIT) and ablation studies, establishing it as a strong general framework for spatio-temporal forecasting.
[54] Expert Operational GANS: Towards Real-Color Underwater Image Restoration
Ozer Can Devecioglu,Serkan Kiranyaz,Mehmet Yamac,Moncef Gabbouj
Main category: cs.CV
TL;DR: xOp-GAN improves underwater image restoration by using multiple expert generators and a discriminator for image selection, leading to higher PSNR results compared to traditional single-generator GANs.
Details
Motivation: Underwater image restoration is challenging due to complex deformation artifacts, and conventional GAN-based methods struggle because a single generator cannot effectively capture the full range of visual degradations. Method: The proposed xOp-GAN employs multiple expert generator networks trained on specific image quality subsets, with the discriminator selecting the best restored image based on perceptual confidence scores during inference. Result: On the LSUI dataset, xOp-GAN achieved PSNR levels up to 25.16 dB, significantly outperforming existing single-regressor models. Conclusion: xOp-GAN, a novel GAN model with multiple expert generator networks, outperforms single-regressor models in underwater image restoration, achieving higher PSNR levels with reduced complexity. Abstract: The wide range of deformation artifacts that arise from complex light propagation, scattering, and depth-dependent attenuation makes underwater image restoration a challenging problem. Like other single deep regressor networks, conventional GAN-based restoration methods struggle to perform well across this heterogeneous domain, since a single generator network is typically insufficient to capture the full range of visual degradations. In order to overcome this limitation, we propose xOp-GAN, a novel GAN model with several expert generator networks, each trained solely on a particular subset with a certain image quality. Thus, each generator can learn to maximize its restoration performance for a particular quality range. Once an xOp-GAN is trained, each generator can restore the input image and the best restored image can then be selected by the discriminator based on its perceptual confidence score. As a result, xOp-GAN is the first GAN model with multiple generators where the discriminator is used during the inference of the regression task. 
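The inference-time selection described above (every expert generator restores the input, and the discriminator's confidence score picks the winner) can be sketched in a few lines, together with the standard PSNR formula used to report results. The generators and the discriminator below are toy stand-in functions invented for illustration, not trained networks, and the helper names are hypothetical.

```python
import numpy as np

def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((reference - restored) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

def restore_with_experts(image, generators, discriminator):
    """Run every expert generator, then keep the output the discriminator scores highest."""
    candidates = [generate(image) for generate in generators]
    scores = [discriminator(candidate) for candidate in candidates]
    best = int(np.argmax(scores))
    return candidates[best], best, scores

# Toy stand-ins (assumptions): each "expert" applies a fixed brightness shift,
# and the "discriminator" prefers outputs whose mean intensity is close to 0.5.
target = np.full((4, 4), 0.5)
generators = [lambda x: x + 0.1, lambda x: x * 1.0, lambda x: x - 0.2]
discriminator = lambda candidate: -abs(candidate.mean() - 0.5)

restored, best_index, scores = restore_with_experts(target, generators, discriminator)
shifted_psnr = psnr(target, target + 0.1)  # MSE = 0.01, so about 20 dB
print(best_index, round(shifted_psnr, 2))
```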
Experimental results on the benchmark Large Scale Underwater Image (LSUI) dataset demonstrate that xOp-GAN achieves PSNR levels up to 25.16 dB, surpassing all single-regressor models by a large margin, even with reduced complexity.
[55] Data-Driven Meta-Analysis and Public-Dataset Evaluation for Sensor-Based Gait Age Estimation
Varun Velankar
Main category: cs.CV
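TL;DR: This study surveys methods for estimating age from gait data and, through large-scale experiments and visualization analysis, provides baselines and guidelines for reducing error in real-world scenarios.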
Details
Motivation: Estimating a person's age from gait has important applications in healthcare, security, and human-computer interaction, which calls for accurate age estimation methods and a deeper understanding of the factors involved. Method: The study analyzes 59 studies covering more than 75,000 subjects and estimates age with convolutional neural networks, support vector machines, decision trees, random forests, and multilayer perceptrons. It also analyzes 63,846 gait cycles to quantify correlations between age and key metrics, and uses a ResNet34 model with Grad-CAM for visual analysis. Result: Convolutional neural networks show an average error of about 4.2 years, inertial-sensor models about 4.5 years, and multi-sensor fusion as low as 3.4 years. Fine-tuning ResNet34 with Grad-CAM reveals that the network attends to the knee and pelvic regions. On a 100,000-sample subset of the VersatileGait database, deep networks reach up to 96% accuracy while processing each sample in under 0.1 seconds. Conclusion: Through a comprehensive meta-analysis and new large-scale experiments, the study establishes solid performance baselines and practical guidelines for reducing gait-age error below three years in real-world scenarios. Abstract: Estimating a person's age from their gait has important applications in healthcare, security and human-computer interaction. In this work, we review fifty-nine studies involving over seventy-five thousand subjects recorded with video, wearable and radar sensors. We observe that convolutional neural networks produce an average error of about 4.2 years, inertial-sensor models about 4.5 years and multi-sensor fusion as low as 3.4 years, with notable differences between lab and real-world data. We then analyse sixty-three thousand eight hundred forty-six gait cycles from the OU-ISIR Large-Population dataset to quantify correlations between age and five key metrics: stride length, walking speed, step cadence, step-time variability and joint-angle entropy, with correlation coefficients of at least 0.27. Next, we fine-tune a ResNet34 model and apply Grad-CAM to reveal that the network attends to the knee and pelvic regions, consistent with known age-related gait changes. Finally, on a one hundred thousand sample subset of the VersatileGait database, we compare support vector machines, decision trees, random forests, multilayer perceptrons and convolutional neural networks, finding that deep networks achieve up to 96 percent accuracy while processing each sample in under 0.1 seconds. 
By combining a broad meta-analysis with new large-scale experiments and interpretable visualizations, we establish solid performance baselines and practical guidelines for reducing gait-age error below three years in real-world scenarios.
[56] What cat is that? A re-id model for feral cats
Victor Caquilpan
Main category: cs.CV
TL;DR: This study adapts a deep learning model (PPGNet) for identifying individual feral cats from camera trap images, achieving high accuracy and offering a promising tool for wildlife monitoring.
Details
Motivation: Feral cats have a significant negative impact on Australian wildlife, making their monitoring crucial to minimizing these effects. Re-Identification (re-ID) techniques can enhance monitoring efforts by identifying individual feral cats from camera trap images. Method: This project modifies a part-pose guided network (PPGNet) model previously used for Amur tiger re-ID to make it applicable for feral cats, resulting in the PPGNet-Cat model. The study also explores contrastive learning approaches such as ArcFace loss. Result: The PPGNet-Cat model achieved a mean Average Precision (mAP) of 0.86 and a rank-1 accuracy of 0.95, indicating high performance in identifying individual feral cats. Conclusion: PPGNet-Cat proves to be a competitive model for re-ID in feral cats with high performance, offering potential for enhanced monitoring of feral cats using camera trap images. Abstract: Feral cats exert a substantial and detrimental impact on Australian wildlife, placing them among the most dangerous invasive species worldwide. Therefore, closely monitoring these cats is essential to minimising their effects. In this context, the potential application of Re-Identification (re-ID) emerges to enhance monitoring activities for these animals, utilising images captured by camera traps. This project explores different CV approaches to create a re-ID model able to identify individual feral cats in the wild. The main approach consists of modifying a part-pose guided network (PPGNet) model, initially used in the re-ID of Amur tigers, to be applicable for feral cats. This adaptation, resulting in PPGNet-Cat, incorporates specific modifications to suit the characteristics of feral cat images. Additionally, various experiments were conducted, particularly exploring contrastive learning approaches such as ArcFace loss. 
The main results indicate that PPGNet-Cat excels in identifying feral cats, achieving high performance with a mean Average Precision (mAP) of 0.86 and a rank-1 accuracy of 0.95. These outcomes establish PPGNet-Cat as a competitive model within the realm of re-ID.
[57] SketchDNN: Joint Continuous-Discrete Diffusion for CAD Sketch Generation
Sathvik Chereddy,John Femiani
Main category: cs.CV
TL;DR: SketchDNN proposes Gaussian-Softmax diffusion to address key problems in CAD sketch generation and achieve higher-quality generation.
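This entry's abstract describes Gaussian-Softmax diffusion as perturbing logits with Gaussian noise and projecting them onto the probability simplex with a softmax, which yields blended class labels. The sketch below illustrates only that single noising step. How the clean logits are built (a scaled one-hot vector here), the `logit_scale` value, and the function names are assumptions for illustration; the paper's actual noise schedule and parameterization are not reproduced.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def gaussian_softmax_perturb(class_ids, num_classes, sigma, rng, logit_scale=4.0):
    """Add Gaussian noise to class logits, then map back onto the probability simplex."""
    clean_logits = logit_scale * np.eye(num_classes)[class_ids]  # assumed one-hot logits
    noisy_logits = clean_logits + sigma * rng.normal(size=clean_logits.shape)
    return softmax(noisy_logits)

rng = np.random.default_rng(0)
class_ids = np.array([0, 2, 1])  # hypothetical primitive types, e.g. line / arc / circle

blended = gaussian_softmax_perturb(class_ids, num_classes=3, sigma=1.0, rng=rng)
clean = gaussian_softmax_perturb(class_ids, num_classes=3, sigma=0.0, rng=rng)
print(blended.round(3))
```

Each row of `blended` is a valid probability vector, so a discrete label is represented as a soft mixture whose sharpness depends on `sigma`.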
Details
Motivation: To address two key challenges in CAD sketch generation: the heterogeneity of primitive parameterizations and the permutation invariance of primitives. Method: The paper proposes Gaussian-Softmax diffusion, in which logits perturbed with Gaussian noise are projected onto the probability simplex via a softmax transformation, enabling blended class labels for discrete variables. Result: On the SketchGraphs dataset, generation quality improves significantly: Fréchet Inception Distance (FID) falls from 16.04 to 7.80 and negative log-likelihood (NLL) from 84.8 to 81.33. Conclusion: With Gaussian-Softmax diffusion, SketchDNN sets a new state of the art in CAD sketch generation and significantly improves generation quality. Abstract: We present SketchDNN, a generative model for synthesizing CAD sketches that jointly models both continuous parameters and discrete class labels through a unified continuous-discrete diffusion process. Our core innovation is Gaussian-Softmax diffusion, where logits perturbed with Gaussian noise are projected onto the probability simplex via a softmax transformation, facilitating blended class labels for discrete variables. This formulation addresses two key challenges, namely, the heterogeneity of primitive parameterizations and the permutation invariance of primitives in CAD sketches. Our approach significantly improves generation quality, reducing Fréchet Inception Distance (FID) from 16.04 to 7.80 and negative log-likelihood (NLL) from 84.8 to 81.33, establishing a new state-of-the-art in CAD sketch generation on the SketchGraphs dataset.
[58] Interpretable Prediction of Lymph Node Metastasis in Rectal Cancer MRI Using Variational Autoencoders
Benjamin Keel,Aaron Quyn,David Jayne,Maryam Mohsin,Samuel D. Relton
Main category: cs.CV
TL;DR: This paper proposes a VAE-based model (VAE-MLP) for improved lymph node metastasis staging in rectal cancer using MRI data.
Details
Motivation: Current radiological criteria for lymph node metastasis staging have limited accuracy, and VAEs offer a more interpretable latent space by reconstructing images and capturing meaningful patterns. Method: A Variational Autoencoder (VAE) was used as a feature encoder to replace traditional CNNs, trained on an in-house MRI dataset with post-operative pathological N stages as ground truth. Result: VAE-MLP achieved state-of-the-art performance with AUC 0.86 +/- 0.05, Sensitivity 0.79 +/- 0.06, and Specificity 0.85 +/- 0.05 on the MRI dataset. Conclusion: The proposed VAE-MLP model demonstrates high performance in lymph node metastasis staging, offering a more interpretable approach compared to traditional CNNs. Abstract: Effective treatment for rectal cancer relies on accurate lymph node metastasis (LNM) staging. However, radiological criteria based on lymph node (LN) size, shape and texture morphology have limited diagnostic accuracy. In this work, we investigate applying a Variational Autoencoder (VAE) as a feature encoder model to replace the large pre-trained Convolutional Neural Network (CNN) used in existing approaches. The motivation for using a VAE is that the generative model aims to reconstruct the images, so it directly encodes visual features and meaningful patterns across the data. This leads to a disentangled and structured latent space which can be more interpretable than a CNN. Models are deployed on an in-house MRI dataset with 168 patients who did not undergo neo-adjuvant treatment. The post-operative pathological N stage was used as the ground truth to evaluate model predictions. Our proposed model 'VAE-MLP' achieved state-of-the-art performance on the MRI dataset, with cross-validated metrics of AUC 0.86 +/- 0.05, Sensitivity 0.79 +/- 0.06, and Specificity 0.85 +/- 0.05. Code is available at: https://github.com/benkeel/Lymph_Node_Classification_MIUA
[59] Posture-Driven Action Intent Inference for Playing style and Fatigue Assessment
Abhishek Jaiswal,Nisheeth Srivastava
Main category: cs.CV
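TL;DR: This study proposes a posture-based intent recognition method, applies it to cricket, and shows that posture carries a strong signal for intent inference.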
Details
Motivation: Posture-based mental state inference has potential for diagnosing fatigue, preventing injury, and enhancing performance, but it is hampered by the sensitivity of human subject data, so alternative data sources are needed. Method: The authors test their hypothesis in cricket, propose a posture-based intent recognition method, and use existing data statistics as weak supervision to validate the findings. Result: The method achieves over 75% F1 score and over 80% AUC-ROC in discriminating aggressive from defensive shot intent. Conclusion: Posture leaks strong signals for intent inference; the study offers a potential way around data labelling limitations and supports applications in sports analytics and human behavior analysis. Abstract: Posture-based mental state inference has significant potential in diagnosing fatigue, preventing injury, and enhancing performance across various domains. Such tools must be research-validated with large datasets before being translated into practice. Unfortunately, such vision diagnosis faces serious challenges due to the sensitivity of human subject data. To address this, we identify sports settings as a viable alternative for accumulating data from human subjects experiencing diverse emotional states. We test our hypothesis in the game of cricket and present a posture-based solution to identify human intent from activity videos. Our method achieves over 75% F1 score and over 80% AUC-ROC in discriminating aggressive and defensive shot intent through motion analysis. These findings indicate that posture leaks out strong signals for intent inference, even with inherent noise in the data pipeline. Furthermore, we utilize existing data statistics as weak supervision to validate our findings, offering a potential solution for overcoming data labelling limitations. This research contributes to generalizable techniques for sports analytics and also opens possibilities for applying human behavior analysis across various fields.
[60] VISTA: Monocular Segmentation-Based Mapping for Appearance and View-Invariant Global Localization
Hannah Shafferman,Annika Thomas,Jouko Kinnari,Michael Ricard,Jose Nino,Jonathan How
Main category: cs.CV
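TL;DR: This paper proposes VISTA, a novel monocular global localization framework that combines object-based segmentation and tracking with a submap correspondence search to localize reliably across different scenes and seasonal changes, without domain-specific training or finetuning.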
Details
Motivation: Global localization is critical for autonomous navigation, especially when the relationship between reference frames is unknown. Traditional methods tend to fail under viewpoint variation, seasonal changes, spatial aliasing, and occlusions, so a more robust solution is needed. Method: VISTA has two main components: 1) a front-end, object-based segmentation and tracking pipeline; 2) a submap correspondence search that exploits geometric consistencies between environment maps to align vehicle reference frames. Result: On seasonal and oblique-angle aerial datasets, recall improves by up to 69% over baseline methods, while the map is only 0.6% the size of the most memory-conservative baseline. Conclusion: VISTA offers an efficient and robust global localization method suited to resource-constrained platforms, capable of accurate real-time localization across varied environmental conditions. Abstract: Global localization is critical for autonomous navigation, particularly in scenarios where an agent must localize within a map generated in a different session or by another agent, as agents often have no prior knowledge about the correlation between reference frames. However, this task remains challenging in unstructured environments due to appearance changes induced by viewpoint variation, seasonal changes, spatial aliasing, and occlusions -- known failure modes for traditional place recognition methods. To address these challenges, we propose VISTA (View-Invariant Segmentation-Based Tracking for Frame Alignment), a novel open-set, monocular global localization framework that combines: 1) a front-end, object-based, segmentation and tracking pipeline, followed by 2) a submap correspondence search, which exploits geometric consistencies between environment maps to align vehicle reference frames. VISTA enables consistent localization across diverse camera viewpoints and seasonal changes, without requiring any domain-specific training or finetuning. We evaluate VISTA on seasonal and oblique-angle aerial datasets, achieving up to a 69% improvement in recall over baseline methods. Furthermore, we maintain a compact object-based map that is only 0.6% the size of the most memory-conservative baseline, making our approach capable of real-time implementation on resource-constrained platforms.[61] Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis
Maciej Szankin,Vidhyananth Venkatasamy,Lihang Ying
Main category: cs.CV
TL;DR: This study compares Vision-Language Models and CNN-based OCR methods for outdoor billboard text recognition, finding that while VLMs offer holistic understanding, lightweight CNNs remain efficient and accurate for practical deployment.
Details
Motivation: Verifying billboard text visibility in real-world outdoor conditions is challenging due to complex scenes, varying fonts, and weather-induced noise, which traditional OCR methods struggle to handle effectively. Method: The study benchmarks multimodal Vision-Language Models (VLMs) such as Qwen 2.5 VL 3B, InternVL3, and SmolVLM2 against a CNN-based OCR baseline (PaddleOCRv4) on public datasets (ICDAR 2015 and SVT), with synthetic weather distortions applied to simulate real-world conditions. Result: Selected VLMs perform well in understanding complex scenes without explicit text detection, but lightweight CNN pipelines like PaddleOCRv4 achieve comparable accuracy for cropped text at significantly lower computational cost. Conclusion: While Vision-Language Models (VLMs) show promise in holistic scene understanding for billboard text visibility, lightweight CNN-based OCR methods like PaddleOCRv4 remain competitive in accuracy for cropped text with lower computational costs, making them more suitable for edge deployment. Abstract: Outdoor advertisements remain a critical medium for modern marketing, yet accurately verifying billboard text visibility under real-world conditions is still challenging. Traditional Optical Character Recognition (OCR) pipelines excel at cropped text recognition but often struggle with complex outdoor scenes, varying fonts, and weather-induced visual noise. Recently, multimodal Vision-Language Models (VLMs) have emerged as promising alternatives, offering end-to-end scene understanding with no explicit detection step. This work systematically benchmarks representative VLMs - including Qwen 2.5 VL 3B, InternVL3, and SmolVLM2 - against a compact CNN-based OCR baseline (PaddleOCRv4) across two public datasets (ICDAR 2015 and SVT), augmented with synthetic weather distortions to simulate realistic degradation. 
Our results reveal that while selected VLMs excel at holistic scene reasoning, lightweight CNN pipelines still achieve competitive accuracy for cropped text at a fraction of the computational cost, an important consideration for edge deployment. To foster future research, we release our weather-augmented benchmark and evaluation code publicly.[62] Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning
Fan Shi,Bin Li,Xiangyang Xue
Main category: cs.CV
TL;DR: This paper introduces UCGS, a unified framework for abstract visual reasoning that solves multiple tasks without task-specific designs and enables zero-shot reasoning.
Details
Motivation: Current deep AVR solvers require task-specific designs and retraining for new tasks, increasing costs. A unified approach is needed to generalize across tasks efficiently. Method: The paper proposes a Unified Conditional Generative Solver (UCGS) that reformulates AVR tasks as predictability estimation problems and uses a single conditional generative model to solve them. Result: UCGS successfully demonstrates abstract reasoning across various AVR tasks with a single multi-task training round and shows zero-shot reasoning capabilities during testing. Conclusion: UCGS is a unified framework that can solve multiple AVR tasks, reducing the need for task-specific designs and enabling zero-shot reasoning. Abstract: Abstract visual reasoning (AVR) enables humans to quickly discover and generalize abstract rules to new scenarios. Designing intelligent systems with human-like AVR abilities has been a long-standing topic in the artificial intelligence community. Deep AVR solvers have recently achieved remarkable success in various AVR tasks. However, they usually use task-specific designs or parameters in different tasks. In such a paradigm, solving new tasks often means retraining the model, and sometimes retuning the model architectures, which increases the cost of solving AVR problems. In contrast to task-specific approaches, this paper proposes a novel Unified Conditional Generative Solver (UCGS), aiming to address multiple AVR tasks in a unified framework. First, we prove that some well-known AVR tasks can be reformulated as the problem of estimating the predictability of target images in problem panels. Then, we illustrate that, under the proposed framework, training one conditional generative model can solve various AVR tasks. The experiments show that with a single round of multi-task training, UCGS demonstrates abstract reasoning ability across various AVR tasks. 
Notably, UCGS exhibits zero-shot reasoning ability, enabling it to perform abstract reasoning on problems from unseen AVR tasks in the testing phase.[63] CorrMoE: Mixture of Experts with De-stylization Learning for Cross-Scene and Cross-Domain Correspondence Pruning
Peiwen Xia,Tangfei Liao,Wei Zhu,Danhuai Zhao,Jianjun Ke,Kaihao Zhang,Tong Lu,Tao Wang
Main category: cs.CV
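TL;DR: This paper proposes CorrMoE, a new correspondence pruning framework that improves robustness under cross-domain and cross-scene variations through a De-stylization Dual Branch and a Bi-Fusion Mixture of Experts module.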
Details
Motivation: Although recent methods have made progress in pruning dense correspondence sets, they usually assume consistent visual domains and overlook the challenges posed by diverse scene structures. Method: A De-stylization Dual Branch and a Bi-Fusion Mixture of Experts module are introduced; the latter adaptively fuses multi-perspective features using linear-complexity attention and dynamic expert routing. Result: Extensive experiments on benchmark datasets show that CorrMoE achieves higher accuracy and better generalization than state-of-the-art methods. Conclusion: CorrMoE is a new correspondence pruning framework that improves robustness under cross-domain and cross-scene variations. Abstract: Establishing reliable correspondences between image pairs is a fundamental task in computer vision, underpinning applications such as 3D reconstruction and visual localization. Although recent methods have made progress in pruning outliers from dense correspondence sets, they often assume consistent visual domains and overlook the challenges posed by diverse scene structures. In this paper, we propose CorrMoE, a novel correspondence pruning framework that enhances robustness under cross-domain and cross-scene variations. To address domain shift, we introduce a De-stylization Dual Branch, performing style mixing on both implicit and explicit graph features to mitigate the adverse influence of domain-specific representations. For scene diversity, we design a Bi-Fusion Mixture of Experts module that adaptively integrates multi-perspective features through linear-complexity attention and dynamic expert routing. Extensive experiments on benchmark datasets demonstrate that CorrMoE achieves superior accuracy and generalization compared to state-of-the-art methods. The code and pre-trained models are available at https://github.com/peiwenxia/CorrMoE.[64] ProtoConNet: Prototypical Augmentation and Alignment for Open-Set Few-Shot Image Classification
Kexuan Shi,Zhuang Qi,Jingjing Zhu,Lei Meng,Yaochen Zhang,Haibei Huang,Xiangxu Meng
Main category: cs.CV
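TL;DR: This paper proposes ProtoConNet, which integrates contextual information and prototype alignment to better distinguish known from unknown classes in few-shot image classification.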
Details
Motivation: Existing methods rely mainly on visual information from a single image and overlook the benefits of integrating rich contextual information, which limits their ability to distinguish known from unknown classes. Method: ProtoConNet comprises three modules (clustering-based data selection, contextual-enhanced semantic refinement, and prototypical alignment) that increase feature-space diversity and reduce spurious associations. Result: Experiments on two datasets verify ProtoConNet's effectiveness, showing better performance in few-shot scenarios. Conclusion: By combining background information with prototype alignment, ProtoConNet improves representation learning in few-shot scenarios and effectively identifies open-set samples, outperforming existing methods. Abstract: Open-set few-shot image classification aims to train models using a small amount of labeled data, enabling them to achieve good generalization when confronted with unknown environments. Existing methods mainly use visual information from a single image to learn class representations to distinguish known from unknown categories. However, these methods often overlook the benefits of integrating rich contextual information. To address this issue, this paper proposes a prototypical augmentation and alignment method, termed ProtoConNet, which incorporates background information from different samples to enhance the diversity of the feature space, breaking the spurious associations between context and image subjects in few-shot scenarios. Specifically, it consists of three main modules: the clustering-based data selection (CDS) module mines diverse data patterns while preserving core features; the contextual-enhanced semantic refinement (CSR) module builds a context dictionary to integrate into image representations, which boosts the model's robustness in various scenarios; and the prototypical alignment (PA) module reduces the gap between image representations and class prototypes, amplifying feature distances for known and unknown classes. Experimental results from two datasets verified that ProtoConNet enhances the effectiveness of representation learning in few-shot scenarios and identifies open-set samples, making it superior to existing methods.[65] From Coarse to Nuanced: Cross-Modal Alignment of Fine-Grained Linguistic Cues and Visual Salient Regions for Dynamic Emotion Recognition
Yu Liu,Leyuan Qu,Hanlei Shi,Di Gao,Yuhua Zheng,Taihao Li
Main category: cs.CV
TL;DR: This paper proposes GRACE, a method for improved Dynamic Facial Expression Recognition through cross-modal alignment of enhanced semantic and visual signals.
Details
Motivation: To overcome limitations in existing vision-language approaches for emotion recognition that underutilize subtle emotional cues and lack effective mechanisms to filter irrelevant facial dynamics. Method: GRACE utilizes a Coarse-to-fine Affective Text Enhancement module, motion-difference weighting mechanism, and aligns semantic and visual signals using entropy-regularized optimal transport. Result: The proposed method significantly improves recognition performance on three benchmark datasets, especially in challenging scenarios, achieving new SOTA results in UAR and WAR. Conclusion: GRACE achieves state-of-the-art results in Dynamic Facial Expression Recognition by effectively integrating dynamic motion modeling, semantic text refinement, and token-level cross-modal alignment. Abstract: Dynamic Facial Expression Recognition (DFER) aims to identify human emotions from temporally evolving facial movements and plays a critical role in affective computing. While recent vision-language approaches have introduced semantic textual descriptions to guide expression recognition, existing methods still face two key limitations: they often underutilize the subtle emotional cues embedded in generated text, and they have yet to incorporate sufficiently effective mechanisms for filtering out facial dynamics that are irrelevant to emotional expression. To address these gaps, we propose GRACE (Granular Representation Alignment for Cross-modal Emotion recognition), which integrates dynamic motion modeling, semantic text refinement, and token-level cross-modal alignment to facilitate the precise localization of emotionally salient spatiotemporal features. Our method constructs emotion-aware textual descriptions via a Coarse-to-fine Affective Text Enhancement (CATE) module and highlights expression-relevant facial motion through a motion-difference weighting mechanism. 
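The method summary above names entropy-regularized optimal transport as the token-level alignment step. As a rough illustration of that general technique (not the authors' implementation), the sketch below runs plain Sinkhorn iterations between two small sets of token embeddings. The embeddings, their sizes, the cosine cost, the uniform token masses, and the values of `eps` and `n_iters` are all assumptions chosen for the example.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.5, n_iters=500):
    """Entropy-regularized OT plan between histograms a and b (Sinkhorn iterations)."""
    kernel = np.exp(-cost / eps)      # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)        # rescale to match column marginals
        u = a / (kernel @ v)          # rescale to match row marginals
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(4, 8))    # hypothetical text-token embeddings
visual_tokens = rng.normal(size=(6, 8))  # hypothetical visual-token embeddings

# Cosine distance as the transport cost (an assumption, not taken from the paper).
t_unit = text_tokens / np.linalg.norm(text_tokens, axis=1, keepdims=True)
v_unit = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
cost = 1.0 - t_unit @ v_unit.T

a = np.full(4, 1 / 4)   # uniform mass on text tokens (assumed)
b = np.full(6, 1 / 6)   # uniform mass on visual tokens (assumed)
plan = sinkhorn(cost, a, b)

# plan[i, j] is the soft alignment weight between text token i and visual token j.
best_visual_per_text = plan.argmax(axis=1)
```

A larger `eps` gives a smoother, more diffuse plan; a smaller one approaches hard matching but converges more slowly and can underflow without log-domain updates.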
These refined semantic and visual signals are aligned at the token level using entropy-regularized optimal transport. Experiments on three benchmark datasets demonstrate that our method significantly improves recognition performance, particularly in challenging settings with ambiguous or imbalanced emotion classes, establishing new state-of-the-art (SOTA) results in terms of both UAR and WAR.[66] Spatial Frequency Modulation for Semantic Segmentation
Linwei Chen,Ying Fu,Lin Gu,Dezhi Zheng,Jifeng Dai
Main category: cs.CV
TL;DR: This paper proposes a Spatial Frequency Modulation (SFM) method to address the issue of aliasing during downsampling in semantic segmentation, achieving improved performance while maintaining high-frequency details.
Details
Motivation: High spatial frequency information contributes to semantic segmentation accuracy, but according to the Nyquist-Shannon Sampling Theorem, high-frequency components are vulnerable to aliasing or distortion during downsampling. This paper aims to address this issue. Method: The authors propose a Spatial Frequency Modulation (SFM) method that modulates high-frequency features to a lower frequency before downsampling and demodulates them back during upsampling. This is achieved through Adaptive Resampling (ARS) and Multi-Scale Adaptive Upsampling (MSAU). Result: The SFM method successfully retains details after demodulation and improves segmentation performance. The approach is validated through feature visualization and analysis, and its broad applicability is demonstrated across various tasks like image classification, adversarial robustness, instance segmentation, and panoptic segmentation. Conclusion: The proposed SFM method effectively alleviates aliasing while retaining details, and it can be seamlessly integrated with various architectures, extending from CNNs to transformers. Abstract: High spatial frequency information, including fine details like textures, significantly contributes to the accuracy of semantic segmentation. However, according to the Nyquist-Shannon Sampling Theorem, high-frequency components are vulnerable to aliasing or distortion when propagating through downsampling layers such as strided-convolution. Here, we propose a novel Spatial Frequency Modulation (SFM) that modulates high-frequency features to a lower frequency before downsampling and then demodulates them back during upsampling. Specifically, we implement modulation through adaptive resampling (ARS) and design a lightweight add-on that can densely sample the high-frequency areas to scale up the signal, thereby lowering its frequency in accordance with the Frequency Scaling Property. 
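The Frequency Scaling Property invoked above says that stretching a signal by a factor a divides its frequencies by a. The toy 1-D sketch below assumes a pure sinusoid and a uniform stretch, so it only illustrates the principle; the paper's adaptive resampling is non-uniform and is not reproduced here. The signal length, frequency, stretch factor, and stride are all assumed values. Under those assumptions, a component above the post-downsampling Nyquist limit aliases, while its stretched copy survives.

```python
import numpy as np

def dominant_frequency(signal, sample_spacing):
    """Return the frequency (cycles per unit length) of the largest non-DC FFT bin."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(signal.size, d=sample_spacing)
    spectrum[0] = 0.0                 # ignore the DC component
    return freqs[np.argmax(spectrum)]

n = 1024
x = np.arange(n) / n                  # n samples on the unit interval
f0 = 96.0                             # assumed high-frequency component
scale = 2.0                           # assumed uniform stretch factor

original = np.sin(2 * np.pi * f0 * x)
stretched = np.sin(2 * np.pi * f0 * x / scale)   # f(x / a): frequency becomes f0 / a

f_original = dominant_frequency(original, 1 / n)     # 96 cycles
f_stretched = dominant_frequency(stretched, 1 / n)   # 48 cycles

# Strided downsampling by 8 leaves 128 samples, so the Nyquist limit is 64 cycles.
stride = 8
f_aliased = dominant_frequency(original[::stride], stride / n)     # 96 folds to 32
f_preserved = dominant_frequency(stretched[::stride], stride / n)  # 48 is kept
```

In this toy setting the unstretched component is misreported at 32 cycles after striding, whereas the stretched one keeps its 48 cycles and could in principle be mapped back to 96 by undoing the stretch.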
We also propose Multi-Scale Adaptive Upsampling (MSAU) to demodulate the modulated feature and recover high-frequency information through non-uniform upsampling. This module further improves segmentation by explicitly exploiting information interaction between densely and sparsely resampled areas at multiple scales. Both modules can seamlessly integrate with various architectures, extending from convolutional neural networks to transformers. Feature visualization and analysis confirm that our method effectively alleviates aliasing while successfully retaining details after demodulation. Finally, we validate the broad applicability and effectiveness of SFM by extending it to image classification, adversarial robustness, instance segmentation, and panoptic segmentation tasks. The code is available at https://github.com/Linwei-Chen/SFM.[67] SEPose: A Synthetic Event-based Human Pose Estimation Dataset for Pedestrian Monitoring
Kaustav Chanda,Aayush Atul Verma,Arpitsinh Vaghela,Yezhou Yang,Bharatesh Chakravarthi
Main category: cs.CV
TL;DR: SEPose is a large synthetic dataset designed for event-based human pose estimation in pedestrian monitoring systems, offering broad scenario coverage and demonstrating effective real-world applicability.
Details
Motivation: The limited availability of data addressing pedestrian and traffic monitoring under challenging conditions motivated the creation of the SEPose dataset. Method: SEPose was generated using dynamic vision sensors in the CARLA simulator, providing annotated pedestrian body pose keypoints across a range of environments and conditions. Result: SEPose includes nearly 350K annotated pedestrians, covering diverse crowd densities, lighting, weather conditions, and geographic settings. Conclusion: SEPose is a synthetic dataset that demonstrates sim-to-real generalization capabilities for event-based human pose estimation in pedestrian perception. Abstract: Event-based sensors have emerged as a promising solution for addressing challenging conditions in pedestrian and traffic monitoring systems. Their low-latency and high dynamic range allow for improved response time in safety-critical situations caused by distracted walking or other unusual movements. However, the availability of data covering such scenarios remains limited. To address this gap, we present SEPose -- a comprehensive synthetic event-based human pose estimation dataset for fixed pedestrian perception generated using dynamic vision sensors in the CARLA simulator. With nearly 350K annotated pedestrians with body pose keypoints from the perspective of fixed traffic cameras, SEPose is a comprehensive synthetic multi-person pose estimation dataset that spans busy and light crowds and traffic across diverse lighting and weather conditions in 4-way intersections in urban, suburban, and rural environments. We train existing state-of-the-art models such as RVT and YOLOv8 on our dataset and evaluate them on real event-based data to demonstrate the sim-to-real generalization capabilities of the proposed dataset.[68] Dark-EvGS: Event Camera as an Eye for Radiance Field in the Dark
Jingqian Wu,Peiqi Duan,Zongqiang Wang,Changwei Wang,Boxin Shi,Edmund Y. Lam
Main category: cs.CV
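TL;DR: This paper proposes Dark-EvGS, an event-assisted 3D Gaussian Splatting framework that reconstructs bright frames from arbitrary viewpoints in low-light conditions, addressing noisy events, poor frame quality, and inconsistent color tone, and introduces a real-captured dataset for validation.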
Details
Motivation: In low-light environments, conventional cameras struggle to capture clear multi-view images because of dynamic range limitations and motion blur from long exposure. Event cameras, with their high dynamic range and high speed, can mitigate these issues, but existing approaches still face noisy events, poor frame quality, and inconsistent color tone. Method: Dark-EvGS uses triplet-level supervision and a color tone matching block, combining event data with 3D Gaussian Splatting for radiance field reconstruction to synthesize high-quality bright frames from arbitrary viewpoints. Result: Experiments show that Dark-EvGS outperforms existing methods at radiance field reconstruction in low light, and the work introduces the first real-captured dataset for event-guided bright frame synthesis. Conclusion: Dark-EvGS is the first event-assisted 3D Gaussian Splatting framework to address noisy events, poor frame quality, and inconsistent color tone in low light, achieving high-quality bright frame synthesis. Abstract: In low-light environments, conventional cameras often struggle to capture clear multi-view images of objects due to dynamic range limitations and motion blur caused by long exposure. Event cameras, with their high-dynamic range and high-speed properties, have the potential to mitigate these issues. Additionally, 3D Gaussian Splatting (GS) enables radiance field reconstruction, facilitating bright frame synthesis from multiple viewpoints in low-light conditions. However, naively using an event-assisted 3D GS approach still faces challenges because, in low light, events are noisy, frames lack quality, and the color tone may be inconsistent. To address these issues, we propose Dark-EvGS, the first event-assisted 3D GS framework that enables the reconstruction of bright frames from arbitrary viewpoints along the camera trajectory. Triplet-level supervision is proposed to gain holistic knowledge, granular details, and sharp scene rendering. The color tone matching block is proposed to guarantee the color consistency of the rendered frames. Furthermore, we introduce the first real-captured dataset for the event-guided bright frame synthesis task via 3D GS-based radiance field reconstruction. Experiments demonstrate that our method achieves better results than existing methods, conquering radiance field reconstruction under challenging low-light conditions. The code and sample data are included in the supplementary material.[69] Hyperphantasia: A Benchmark for Evaluating the Mental Visualization Capabilities of Multimodal LLMs
Mohammad Shahab Sepehri,Berk Tinaz,Zalan Fabian,Mahdi Soltanolkotabi
Main category: cs.CV
TL;DR: Hyperphantasia is introduced as a synthetic benchmark to evaluate mental visualization in MLLMs, revealing a significant human-model performance gap and highlighting the need for improved visual simulation capabilities.
Details
Motivation: Despite advancements in Multimodal Large Language Models (MLLMs), existing benchmarks primarily assess passive visual perception and do not adequately evaluate the active cognitive ability of mental visualization, which is crucial for complex problem-solving in humans. Method: The researchers introduced Hyperphantasia, a synthetic benchmark consisting of four procedurally generated puzzles at three difficulty levels, to evaluate the mental visualization abilities of MLLMs. They conducted a comprehensive evaluation of state-of-the-art models and explored the potential of reinforcement learning to enhance visual simulation capabilities. Result: The evaluation revealed a substantial performance gap between humans and MLLMs on the Hyperphantasia benchmark, indicating that current models struggle with robust mental visualization despite some capability in visual pattern recognition. Conclusion: The study concludes that although some models show partial competence in recognizing visual patterns, robust mental visualization remains a significant challenge for current MLLMs. Abstract: Mental visualization, the ability to construct and manipulate visual representations internally, is a core component of human cognition and plays a vital role in tasks involving reasoning, prediction, and abstraction. Despite the rapid progress of Multimodal Large Language Models (MLLMs), current benchmarks primarily assess passive visual perception, offering limited insight into the more active capability of internally constructing visual patterns to support problem solving. Yet mental visualization is a critical cognitive skill in humans, supporting abilities such as spatial navigation, predicting physical trajectories, and solving complex visual problems through imaginative simulation. To bridge this gap, we introduce Hyperphantasia, a synthetic benchmark designed to evaluate the mental visualization abilities of MLLMs through four carefully constructed puzzles. 
Each task is procedurally generated and presented at three difficulty levels, enabling controlled analysis of model performance across increasing complexity. Our comprehensive evaluation of state-of-the-art models reveals a substantial gap between the performance of humans and MLLMs. Additionally, we explore the potential of reinforcement learning to improve visual simulation capabilities. Our findings suggest that while some models exhibit partial competence in recognizing visual patterns, robust mental visualization remains an open challenge for current MLLMs.[70] RaDL: Relation-aware Disentangled Learning for Multi-Instance Text-to-Image Generation
Geon Park,Seon Bin Kim,Gunho Jung,Seong-Whan Lee
Main category: cs.CV
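TL;DR: This paper proposes RaDL, a relation-aware disentangled learning framework that accounts for inter-instance relationships and multiple attributes in multi-instance image generation.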
Details
Motivation: When generating multiple instances, existing text-to-image methods struggle with relationship discrepancy and multiple-attribute leakage. Method: RaDL enhances instance-specific attributes through learnable parameters and generates relation-aware image features using action verbs extracted from the global prompt. Result: On benchmarks such as COCO-Position, COCO-MIG, and DrawBench, RaDL significantly outperforms existing methods in positional accuracy, multiple-attribute consideration, and inter-instance relationships. Conclusion: RaDL offers an effective way to account for both the relationships between instances and the multiple attributes of each instance in multi-instance image generation. Abstract: With recent advancements in text-to-image (T2I) models, effectively generating multiple instances within a single image prompt has become a crucial challenge. Existing methods, while successful in generating positions of individual instances, often struggle to account for relationship discrepancy and multiple attributes leakage. To address these limitations, this paper proposes the relation-aware disentangled learning (RaDL) framework. RaDL enhances instance-specific attributes through learnable parameters and generates relation-aware image features via Relation Attention, utilizing action verbs extracted from the global prompt. Through extensive evaluations on benchmarks such as COCO-Position, COCO-MIG, and DrawBench, we demonstrate that RaDL outperforms existing methods, showing significant improvements in positional accuracy, multiple attributes consideration, and the relationships between instances. Our results present RaDL as the solution for generating images that consider both the relationships and multiple attributes of each instance within the multi-instance image.[71] Prototypical Progressive Alignment and Reweighting for Generalizable Semantic Segmentation
Yuhang Zhang,Zhengyu Zhang,Muxin Liao,Shishun Tian,Wenbin Zou,Lu Zhang,Chen Xu
Main category: cs.CV
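TL;DR: PPAR is a new framework for generalizable semantic segmentation that uses CLIP-generated prototypes for progressive alignment and source-data reweighting, improving performance on unseen domains.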
Details
Motivation: To address shortcomings of existing methods in class-prototype alignment strategy, overfitting, and source-sample weighting, and thereby improve the generalization of semantic segmentation models to unseen domains. Method: The PPAR framework comprises two CLIP-generated prototypes (OTP and VTP), a progressive alignment strategy, and a prototypical reweighting mechanism. Result: PPAR achieves state-of-the-art performance on multiple benchmarks, and a theoretical analysis shows that the method is consistent with domain generalization theory. Conclusion: By leveraging the strong generalization ability of CLIP, PPAR defines two prototypes (OTP and VTP) and introduces a progressive alignment strategy and a prototypical reweighting mechanism, achieving state-of-the-art performance on multiple benchmarks and validating its effectiveness. Abstract: Generalizable semantic segmentation aims to perform well on unseen target domains, a critical challenge due to real-world applications requiring high generalizability. Class-wise prototypes, representing class centroids, serve as domain-invariant cues that benefit generalization due to their stability and semantic consistency. However, this approach faces three challenges. First, existing methods often adopt coarse prototypical alignment strategies, which may hinder performance. Second, naive prototypes computed by averaging source batch features are prone to overfitting and may be negatively affected by unrelated source data. Third, most methods treat all source samples equally, ignoring the fact that different features have varying adaptation difficulties. To address these limitations, we propose a novel framework for generalizable semantic segmentation: Prototypical Progressive Alignment and Reweighting (PPAR), leveraging the strong generalization ability of the CLIP model. Specifically, we define two prototypes: the Original Text Prototype (OTP) and Visual Text Prototype (VTP), generated via CLIP to serve as a solid base for alignment. We then introduce a progressive alignment strategy that aligns features in an easy-to-difficult manner, reducing domain gaps gradually. Furthermore, we propose a prototypical reweighting mechanism that estimates the reliability of source data and adjusts its contribution, mitigating the effect of irrelevant or harmful features (i.e., reducing negative transfer). We also provide a theoretical analysis showing the alignment between our method and domain generalization theory. 
Extensive experiments across multiple benchmarks demonstrate that PPAR achieves state-of-the-art performance, validating its effectiveness.[72] Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos
Yuchi Ishikawa,Shota Nakada,Hokuto Munakata,Kazuhiro Saito,Tatsuya Komatsu,Yoshimitsu Aoki
Main category: cs.CV
TL;DR: LG-CAV-MAE improves audio-visual representation learning by incorporating text modality and achieves superior performance on retrieval and classification tasks.
Details
Motivation: To enhance audio-visual representation learning by incorporating textual modality and leveraging unlabeled data for improved performance. Method: LG-CAV-MAE combines contrastive audio-visual masked autoencoders with a pretrained text encoder and uses an automatic method to generate high-quality audio-visual-text triplets from unlabeled videos. Result: The model achieves up to a 5.6% improvement in recall@10 for retrieval tasks and a 3.2% improvement for the classification task compared to existing methods. Conclusion: LG-CAV-MAE significantly improves audio-visual representation learning by integrating a pretrained text encoder and achieving better performance in retrieval and classification tasks. Abstract: In this paper, we propose Language-Guided Contrastive Audio-Visual Masked Autoencoders (LG-CAV-MAE) to improve audio-visual representation learning. LG-CAV-MAE integrates a pretrained text encoder into contrastive audio-visual masked autoencoders, enabling the model to learn across audio, visual and text modalities. To train LG-CAV-MAE, we introduce an automatic method to generate audio-visual-text triplets from unlabeled videos. We first generate frame-level captions using an image captioning model and then apply CLAP-based filtering to ensure strong alignment between audio and captions. This approach yields high-quality audio-visual-text triplets without requiring manual annotations. We evaluate LG-CAV-MAE on audio-visual retrieval tasks, as well as an audio-visual classification task. Our method significantly outperforms existing approaches, achieving up to a 5.6% improvement in recall@10 for retrieval tasks and a 3.2% improvement for the classification task.[73] Watch, Listen, Understand, Mislead: Tri-modal Adversarial Attacks on Short Videos for Content Appropriateness Evaluation
Sahid Hossain Mustakim,S M Jishanul Islam,Ummay Maria Muna,Montasir Chowdhury,Mohammed Jawwadul Islam,Sadia Ahmmed,Tashfia Sikder,Syed Tasdid Azam Dhrubo,Swakkhar Shatabda
Main category: cs.CV
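TL;DR: This paper presents a comprehensive framework for evaluating the tri-modal safety of multimodal large language models on short-video content, and introduces ChimeraBreak, a new tri-modal attack strategy, together with the SVMA adversarial dataset.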
Details
Motivation: Existing safety evaluations rely mainly on unimodal attacks and cannot fully expose the combined vulnerabilities of multimodal large language models in short-video settings, so a more comprehensive evaluation method is needed. Method: The authors first build the Short-Video Multimodal Adversarial (SVMA) dataset of diverse short-form videos, then propose the ChimeraBreak attack strategy, which simultaneously challenges the model's visual, auditory, and semantic reasoning pathways. Attack effectiveness is assessed with an LLM-as-a-judge approach. Result: Experiments show that state-of-the-art multimodal large language models have significant vulnerabilities, with high Attack Success Rates (ASR), and tend to misclassify benign or policy-violating content. Conclusion: The study exposes safety problems of multimodal large language models in short-video settings and provides key insights and data for developing more robust and safer models. Abstract: Multimodal Large Language Models (MLLMs) are increasingly used for content moderation, yet their robustness in short-form video contexts remains underexplored. Current safety evaluations often rely on unimodal attacks, failing to address combined attack vulnerabilities. In this paper, we introduce a comprehensive framework for evaluating the tri-modal safety of MLLMs. First, we present the Short-Video Multimodal Adversarial (SVMA) dataset, comprising diverse short-form videos with human-guided synthetic adversarial attacks. Second, we propose ChimeraBreak, a novel tri-modal attack strategy that simultaneously challenges visual, auditory, and semantic reasoning pathways. Extensive experiments on state-of-the-art MLLMs reveal significant vulnerabilities with high Attack Success Rates (ASR). Our findings uncover distinct failure modes, showing model biases toward misclassifying benign or policy-violating content. We assess results using LLM-as-a-judge, demonstrating attack reasoning efficacy. Our dataset and findings provide crucial insights for developing more robust and safe MLLMs.[74] GS-Bias: Global-Spatial Bias Learner for Single-Image Test-Time Adaptation of Vision-Language Models
Zhaohong Huang,Yuxin Zhang,Jingjing Xie,Fei Chao,Rongrong Ji
Main category: cs.CV
TL;DR: This paper proposes GS-Bias, an efficient test-time adaptation method for Vision-Language Models that introduces global and spatial biases to improve generalization performance with minimal computational overhead.
Details
Motivation: Existing test-time adaptation methods for Vision-Language Models struggle to balance performance and efficiency due to high tuning overhead or unstable feature enhancement. GS-Bias aims to address these issues with a more efficient and stable adaptation mechanism. Method: The paper introduces the Global-Spatial Bias Learner (GS-Bias), which incorporates two learnable biases—the global bias and spatial bias—added directly to the logits of pretrained VLMs to enhance performance without full backpropagation. Result: GS-Bias achieves a 2.23% improvement over TPT in cross-dataset generalization and a 2.72% improvement in domain generalization on ImageNet, while using only 6.5% of TPT's memory. Conclusion: GS-Bias provides a highly efficient and effective TTA paradigm for Vision-Language Models, achieving state-of-the-art performance on multiple datasets with significantly reduced memory usage compared to existing methods. Abstract: Recent advances in test-time adaptation (TTA) for Vision-Language Models (VLMs) have garnered increasing attention, particularly through the use of multiple augmented views of a single image to boost zero-shot generalization. Unfortunately, existing methods fail to strike a satisfactory balance between performance and efficiency, either due to excessive overhead of tuning text prompts or unstable benefits from handcrafted, training-free visual feature enhancement. In this paper, we present Global-Spatial Bias Learner (GS-Bias), an efficient and effective TTA paradigm that incorporates two learnable biases during TTA, unfolded as the global bias and spatial bias. Particularly, the global bias captures the global semantic features of a test image by learning consistency across augmented views, while spatial bias learns the semantic coherence between regions in the image's spatial visual representation. 
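As a rough illustration of the bias-on-logits idea in the Method summary above, the following Python sketch learns a per-class bias that is added to frozen logits. It is a minimal sketch under stated assumptions, not the authors' implementation: the entropy-minimization objective, the averaging over views, and the names `adapt_bias`, `steps`, and `lr` are all hypothetical choices made for the example. The point it shows is that only the bias is updated, so no gradient ever flows through the model that produced the logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy of a probability vector."""
    return -sum(p * math.log(max(p, 1e-12)) for p in probs)


def adapt_bias(view_logits, steps=20, lr=0.5):
    """Learn a per-class bias added to frozen logits (hypothetical objective).

    view_logits holds one logit vector per augmented view, as produced by a
    frozen model. Only the bias is updated; the model itself is never touched.
    """
    num_classes = len(view_logits[0])
    # Average the frozen logits over views once, outside the update loop.
    mean_logits = [
        sum(view[c] for view in view_logits) / len(view_logits)
        for c in range(num_classes)
    ]
    bias = [0.0] * num_classes
    for _ in range(steps):
        probs = softmax([z + b for z, b in zip(mean_logits, bias)])
        h = entropy(probs)
        # Analytic gradient of the entropy with respect to each logit,
        # which equals the gradient with respect to each bias entry.
        grad = [-p * (math.log(max(p, 1e-12)) + h) for p in probs]
        bias = [b - lr * g for b, g in zip(bias, grad)]
    adapted = softmax([z + b for z, b in zip(mean_logits, bias)])
    return bias, adapted
```

Because the update touches only a vector with one entry per class, the cost of each step is independent of the size of the underlying model, which is the efficiency argument the summary makes.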
It is worth highlighting that these two sets of biases are directly added to the logits output by the pretrained VLMs, which circumvents the full backpropagation through the VLM that hinders the efficiency of existing TTA methods. This endows GS-Bias with extremely high efficiency while achieving state-of-the-art performance on 15 benchmark datasets. For example, it achieves a 2.23% improvement over TPT in cross-dataset generalization and a 2.72% improvement in domain generalization, while requiring only 6.5% of TPT's memory usage on ImageNet.
[75] EC-Diff: Fast and High-Quality Edge-Cloud Collaborative Inference for Diffusion Models
Jiajian Xie,Shengyu Zhang,Zhou Zhao,Fan Wu,Fei Wu
Main category: cs.CV
TL;DR: This paper proposes EC-Diff, a method to accelerate diffusion model inference by reducing cloud computation frequency and identifying optimal cloud-edge collaboration points, achieving faster and higher-quality generation.
Details
Motivation: The motivation is to overcome the limitations of model size and latency in diffusion models that affect user experience. Existing hybrid edge-cloud frameworks face challenges such as prolonged inference time due to excessive cloud denoising or semantic ambiguity from insufficient steps, leading to inconsistency in edge model outputs. Method: The paper proposes EC-Diff, which employs a K-step noise approximation strategy based on noise gradients to reduce cloud inference frequency, and a two-stage greedy search algorithm to determine optimal parameters for noise approximation and edge model switching. Result: Experiments show that the proposed method significantly enhances generation quality compared to edge inference alone and achieves up to an average 2x speedup in inference compared to cloud inference. Conclusion: EC-Diff successfully accelerates cloud inference and identifies optimal cloud-edge handoff points, improving generation quality while achieving significant speedup in inference compared to cloud inference alone. Abstract: Diffusion Models have shown remarkable proficiency in image and video synthesis. As growing model size and latency limit user experience, a hybrid edge-cloud collaborative framework was recently proposed to realize fast inference and high-quality generation, where the cloud model initiates high-quality semantic planning and the edge model expedites later-stage refinement. However, excessive cloud denoising prolongs inference time, while insufficient steps cause semantic ambiguity, leading to inconsistency in edge model output. To address these challenges, we propose EC-Diff that accelerates cloud inference through gradient-based noise estimation while identifying the optimal point for cloud-edge handoff to maintain generation quality. 
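The K-step noise approximation mentioned in the Method summary can be pictured with a small Python sketch. This is an illustrative assumption rather than the EC-Diff algorithm itself: noise is reduced to a single scalar per step, the extrapolation is a plain first-order difference, and `cloud_noise_fn`, `num_steps`, and `k` are hypothetical names. It shows the general pattern of calling an expensive predictor only periodically and extrapolating in between.

```python
def approximate_noise_schedule(cloud_noise_fn, num_steps, k):
    """Estimate noise at every step while calling the cloud only periodically.

    cloud_noise_fn(t) stands in for an expensive cloud-side noise prediction
    (hypothetical). The first two steps and every k-th step use the exact
    value; the other steps reuse the most recent step-to-step difference.
    """
    estimates = []
    cloud_calls = 0
    gradient = 0.0
    for t in range(num_steps):
        if t < 2 or t % k == 0:
            noise = cloud_noise_fn(t)  # exact value, corrects accumulated error
            cloud_calls += 1
            if estimates:
                # Refresh the difference used for later extrapolation.
                gradient = noise - estimates[-1]
        else:
            # Cheap extrapolation from the previous estimate.
            noise = estimates[-1] + gradient
        estimates.append(noise)
    return estimates, cloud_calls
```

With `k = 3` and 7 steps, the sketch makes 4 cloud calls instead of 7. How large `k` can be before quality degrades is exactly the kind of parameter the summary says is chosen by search.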
Specifically, we design a K-step noise approximation strategy to reduce cloud inference frequency by using noise gradients between steps and applying cloud inference periodically to adjust errors. Then we design a two-stage greedy search algorithm to efficiently find the optimal parameters for noise approximation and edge model switching. Extensive experiments demonstrate that our method significantly enhances generation quality compared to edge inference, while achieving up to an average $2\times$ speedup in inference compared to cloud inference. Video samples and source code are available at https://ec-diff.github.io/.
[76] Unsupervised Part Discovery via Descriptor-Based Masked Image Restoration with Optimized Constraints
Jiahao Xia,Yike Wu,Wenjian Huang,Jianguo Zhang,Jian Zhang
Main category: cs.CV
TL;DR: The paper proposes a new unsupervised part discovery method called Masked Part Autoencoder (MPAE) that robustly discovers meaningful parts across various categories and scenarios, overcoming the limitations of existing methods.
Details
Motivation: The motivation of the paper is to overcome the limitations of existing unsupervised part discovery methods that struggle with robustness across various categories and scenarios, thereby restricting their application range. Method: The paper introduces a novel method called Masked Part Autoencoder (MPAE) that learns part descriptors and feature maps from inputs, produces patch features from masked images, and fills masked regions using part descriptors based on similarity. It also incorporates looser yet more effective constraints to identify parts across various scenarios and categories in an unsupervised manner. Result: Extensive experiments demonstrate that the MPAE method robustly discovers meaningful parts across various categories and scenarios. It better aligns parts with actual object shapes even in complex scenarios. Conclusion: The paper concludes that MPAE provides a robust and effective paradigm for unsupervised part discovery, offering a solution to challenges posed by occlusion and enabling exploration of part similarity across multiple categories. Abstract: Part-level features are crucial for image understanding, but few studies focus on them because of the lack of fine-grained labels. Although unsupervised part discovery can eliminate the reliance on labels, most of them cannot maintain robustness across various categories and scenarios, which restricts their application range. To overcome this limitation, we present a more effective paradigm for unsupervised part discovery, named Masked Part Autoencoder (MPAE). It first learns part descriptors as well as a feature map from the inputs and produces patch features from a masked version of the original images. Then, the masked regions are filled with the learned part descriptors based on the similarity between the local features and descriptors. 
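The step of filling masked regions with part descriptors according to similarity can be sketched as follows. This is a simplified illustration and not the MPAE code: cosine similarity, the softmax weighting, and the `temperature` parameter are assumptions, and features are plain Python lists rather than tensors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def fill_masked_patches(patch_features, masked, descriptors, temperature=0.1):
    """Replace each masked patch with a similarity-weighted mix of descriptors.

    patch_features: one local feature vector per patch.
    masked: one boolean per patch, True where the patch was masked.
    descriptors: one vector per part (hypothetical learned descriptors).
    """
    filled = []
    for feature, is_masked in zip(patch_features, masked):
        if not is_masked:
            filled.append(list(feature))  # unmasked patches pass through
            continue
        scores = [cosine(feature, d) / temperature for d in descriptors]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of descriptors, computed one dimension at a time.
        filled.append([
            sum(w * d[i] for w, d in zip(weights, descriptors))
            for i in range(len(feature))
        ])
    return filled
```

A low temperature makes each masked patch take on almost exactly one descriptor, which yields a hard part assignment; a higher temperature blends neighbouring parts.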
By restoring these masked patches using the part descriptors, they become better aligned with their part shapes, guided by appearance features from unmasked patches. Finally, MPAE robustly discovers meaningful parts that closely match the actual object shapes, even in complex scenarios. Moreover, several looser yet more effective constraints are proposed to enable MPAE to identify the presence of parts across various scenarios and categories in an unsupervised manner. This provides the foundation for addressing challenges posed by occlusion and for exploring part similarity across multiple categories. Extensive experiments demonstrate that our method robustly discovers meaningful parts across various categories and scenarios. The code is available at the project https://github.com/Jiahao-UTS/MPAE.
[77] Style Composition within Distinct LoRA modules for Traditional Art
Jaehyun Lee,Wonhark Park,Wonsik Shin,Hyunho Lee,Hyoung Min Na,Nojun Kwak
Main category: cs.CV
TL;DR: This study proposes a new diffusion-based method that enables multi-style blending and region-specific style control in text-to-image generation.
Details
Motivation: Existing diffusion-based text-to-image models suffer from an entangled latent space and a lack of smooth interpolation, which makes it difficult to apply styles in a controlled way to local regions. Method: During denoising, the latents from separately trained, style-specialized models are composed and fused with spatial masks, and depth-map conditioning is introduced to maintain structural consistency. Result: Qualitative and quantitative experiments show that the method achieves region-specific style mixing according to the given masks while preserving the fidelity of each style. Conclusion: The paper presents a zero-shot, diffusion-based style blending method that mixes styles precisely across regions and uses depth-map conditioning to ensure structural consistency. Abstract: Diffusion-based text-to-image models have achieved remarkable results in synthesizing diverse images from text prompts and can capture specific artistic styles via style personalization. However, their entangled latent space and lack of smooth interpolation make it difficult to apply distinct painting techniques in a controlled, regional manner, often causing one style to dominate. To overcome this, we propose a zero-shot diffusion pipeline that naturally blends multiple styles by performing style composition on the denoised latents predicted during the flow-matching denoising process of separately trained, style-specialized models. We leverage the fact that lower-noise latents carry stronger stylistic information and fuse them across heterogeneous diffusion pipelines using spatial masks, enabling precise, region-specific style control. This mechanism preserves the fidelity of each individual style while allowing user-guided mixing. Furthermore, to ensure structural coherence across different models, we incorporate depth-map conditioning via ControlNet into the diffusion framework. Qualitative and quantitative experiments demonstrate that our method successfully achieves region-specific style mixing according to the given masks.
[78] ID-EA: Identity-driven Text Enhancement and Adaptation with Textual Inversion for Personalized Text-to-Image Generation
Hyun-Jun Jin,Young-Eun Kim,Seong-Whan Lee
Main category: cs.CV
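TL;DR: This paper proposes ID-EA, a new framework that improves identity preservation in personalized generation by aligning text embeddings with visual identity embeddings. ID-EA comprises two key components, the ID-driven Enhancer (ID-Enhancer) and the ID-conditioned Adapter (ID-Adapter). Experiments show that ID-EA substantially outperforms existing methods on identity preservation metrics while being computationally efficient.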
Details
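Motivation: Current Textual Inversion methods struggle with semantic alignment, which leads to inconsistent facial identity, so a new method is needed to improve identity preservation in personalized generation. Method: The ID-EA framework comprises two key components: the ID-driven Enhancer (ID-Enhancer) and the ID-conditioned Adapter (ID-Adapter). The ID-Enhancer integrates identity embeddings with a textual ID anchor, refining the visual identity embeddings derived from a face recognition model; the ID-Adapter uses the enhanced identity embedding to adapt the text condition, ensuring identity preservation by adjusting the cross-attention module in the pre-trained UNet model. Result: Extensive quantitative and qualitative evaluations show that ID-EA substantially outperforms existing methods on identity preservation metrics while achieving notable computational efficiency, generating personalized portraits approximately 15 times faster than existing approaches. Conclusion: ID-EA is a new framework that effectively improves identity preservation in personalized generation; it substantially outperforms existing methods on identity preservation metrics and is computationally efficient. Abstract: Recently, personalized portrait generation with a text-to-image diffusion model has significantly advanced with Textual Inversion, emerging as a promising approach for creating high-fidelity personalized images. Despite its potential, current Textual Inversion methods struggle to maintain consistent facial identity due to semantic misalignments between textual and visual embedding spaces regarding identity. We introduce ID-EA, a novel framework that guides text embeddings to align with visual identity embeddings, thereby improving identity preservation in personalized generation. ID-EA comprises two key components: the ID-driven Enhancer (ID-Enhancer) and the ID-conditioned Adapter (ID-Adapter). First, the ID-Enhancer integrates identity embeddings with a textual ID anchor, refining visual identity embeddings derived from a face recognition model using representative text embeddings. Then, the ID-Adapter leverages the identity-enhanced embedding to adapt the text condition, ensuring identity preservation by adjusting the cross-attention module in the pre-trained UNet model. This process encourages the text features to find the most related visual clues across the foreground snippets. 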
Extensive quantitative and qualitative evaluations demonstrate that ID-EA substantially outperforms state-of-the-art methods in identity preservation metrics while achieving remarkable computational efficiency, generating personalized portraits approximately 15 times faster than existing approaches.
[79] SAMST: A Transformer framework based on SAM pseudo label filtering for remote sensing semi-supervised semantic segmentation
Jun Yin,Fei Wu,Yupeng Ren,Jisheng Huang,Qiankun Li,Heng jin,Jianhai Fu,Chanjie Cui
Main category: cs.CV
TL;DR: SAMST is a semi-supervised semantic segmentation method designed to improve the accuracy of pseudo-labels and address the limitations of public remote sensing datasets by utilizing the generalization power of large models.
Details
Motivation: Public remote sensing datasets often face limitations in universality due to resolution variability and inconsistent land cover category definitions, necessitating a method to utilize the vast pool of unlabeled data. Method: SAMST uses a semi-supervised semantic segmentation approach that iteratively refines pseudo-labels using two components: supervised model self-training and a SAM-based Pseudo-label Refiner with three modules. Result: Experiments on the Potsdam dataset validate the effectiveness and feasibility of SAMST in addressing challenges posed by limited labeled data in remote sensing semantic segmentation. Conclusion: SAMST is effective in improving pseudo-label accuracy by integrating the generalization power of large models with the training efficiency of small models. Abstract: Public remote sensing datasets often face limitations in universality due to resolution variability and inconsistent land cover category definitions. To harness the vast pool of unlabeled remote sensing data, we propose SAMST, a semi-supervised semantic segmentation method. SAMST leverages the strengths of the Segment Anything Model (SAM) in zero-shot generalization and boundary detection. SAMST iteratively refines pseudo-labels through two main components: supervised model self-training using both labeled and pseudo-labeled data, and a SAM-based Pseudo-label Refiner. The Pseudo-label Refiner comprises three modules: a Threshold Filter Module for preprocessing, a Prompt Generation Module for extracting connected regions and generating prompts for SAM, and a Label Refinement Module for final label stitching. By integrating the generalization power of large models with the training efficiency of small models, SAMST improves pseudo-label accuracy, thereby enhancing overall model performance. 
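The first two stages of the Pseudo-label Refiner described above (threshold filtering, then extracting connected regions to build prompts) can be sketched in plain Python. This is a hedged illustration only: the threshold `tau`, the `ignore` value, 4-connectivity, and the use of bounding boxes as prompts are assumptions, and no call to SAM is made.

```python
from collections import deque


def threshold_filter(confidence, labels, tau=0.8, ignore=-1):
    """Keep a pseudo-label only where its confidence reaches tau."""
    return [
        [lab if conf >= tau else ignore for conf, lab in zip(conf_row, lab_row)]
        for conf_row, lab_row in zip(confidence, labels)
    ]


def box_prompts(filtered, ignore=-1):
    """Return (label, (row_min, col_min, row_max, col_max)) per connected region.

    Regions are 4-connected groups of pixels sharing one label; each bounding
    box is a hypothetical prompt that a segmentation model could consume.
    """
    height, width = len(filtered), len(filtered[0])
    seen = [[False] * width for _ in range(height)]
    prompts = []
    for r in range(height):
        for c in range(width):
            if seen[r][c] or filtered[r][c] == ignore:
                continue
            label = filtered[r][c]
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:  # breadth-first flood fill over the region
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and filtered[ny][nx] == label):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            prompts.append((label, (r0, c0, r1, c1)))
    return prompts
```

Filtering before region extraction keeps low-confidence pixels from merging otherwise separate regions, so each prompt covers an area the self-trained model is reasonably sure about.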
Experiments on the Potsdam dataset validate the effectiveness and feasibility of SAMST, demonstrating its potential to address the challenges posed by limited labeled data in remote sensing semantic segmentation.
[80] AU-Blendshape for Fine-grained Stylized 3D Facial Expression Manipulation
Hao Li,Ju Dai,Feng Zhou,Kaida Ning,Lei Li,Junjun Pan
Main category: cs.CV
TL;DR: This paper introduces AUBlendSet and AUBlendNet, a new dataset and network for fine-grained stylized 3D facial expression manipulation.
Details
Motivation: Fine-grained stylized manipulation of 3D facial expressions remains challenging because appropriate datasets are lacking. Method: The authors present AUBlendSet, a 3D facial dataset based on the AU-Blendshape representation, and AUBlendNet, a network that learns AU-Blendshape basis vectors for different character styles. Result: A series of qualitative and quantitative experiments demonstrates the potential and importance of AUBlendSet and AUBlendNet in 3D facial animation tasks. Conclusion: The authors describe AUBlendSet as the first dataset, and AUBlendNet as the first network, for continuous 3D facial expression manipulation for any identity through facial action units. Abstract: While 3D facial animation has made impressive progress, challenges still exist in realizing fine-grained stylized 3D facial expression manipulation due to the lack of appropriate datasets. In this paper, we introduce the AUBlendSet, a 3D facial dataset based on AU-Blendshape representation for fine-grained facial expression manipulation across identities. AUBlendSet is a blendshape data collection based on 32 standard facial action units (AUs) across 500 identities, along with an additional set of facial postures annotated with detailed AUs. Based on AUBlendSet, we propose AUBlendNet to learn AU-Blendshape basis vectors for different character styles. AUBlendNet predicts, in parallel, the AU-Blendshape basis vectors of the corresponding style for a given identity mesh, thereby achieving stylized 3D emotional facial manipulation. We comprehensively validate the effectiveness of AUBlendSet and AUBlendNet through tasks such as stylized facial expression manipulation, speech-driven emotional facial animation, and emotion recognition data augmentation. Through a series of qualitative and quantitative experiments, we demonstrate the potential and importance of AUBlendSet and AUBlendNet in 3D facial animation tasks. To the best of our knowledge, AUBlendSet is the first dataset, and AUBlendNet is the first network for continuous 3D facial expression manipulation for any identity through facial AUs. Our source code is available at https://github.com/wslh852/AUBlendNet.git.
[81] Frequency-Dynamic Attention Modulation for Dense Prediction
Linwei Chen,Lin Gu,Ying Fu
Main category: cs.CV
TL;DR: The paper introduces a strategy called Frequency-Dynamic Attention Modulation (FDAM) for Vision Transformers to address the issue of detail and texture loss due to the attention mechanism acting as a low-pass filter. FDAM, incorporating Attention Inversion and Frequency Dynamic Scaling, enhances performance across various tasks and models, and achieves state-of-the-art results in remote sensing detection.
Details
Motivation: The motivation behind the research is to overcome the limitation of the attention mechanism in Vision Transformers (ViTs), which functions as a low-pass filter, leading to the loss of critical details and textures when stacked. The goal is to enhance the performance of ViTs on various tasks by addressing this limitation. Method: The paper proposes a new strategy called Frequency-Dynamic Attention Modulation (FDAM) for Vision Transformers (ViTs). FDAM includes two techniques: Attention Inversion (AttInv) and Frequency Dynamic Scaling (FreqScale). AttInv creates a complementary high-pass filter by inverting the low-pass filter found in the attention matrix, while FreqScale allows for fine-grained adjustments to the target response function by weighting different frequency components. Result: The result of the proposed method, FDAM, is that it can be effectively integrated into ViTs, avoiding representation collapse and resulting in performance improvements across various models (such as SegFormer, DeiT, and MaskDINO) and tasks (including semantic segmentation, object detection, and instance segmentation). The method also achieves state-of-the-art results in remote sensing detection within single-scale settings. Conclusion: The paper concludes that by implementing FDAM, which includes AttInv and FreqScale, into ViTs, there is a consistent improvement in performance across various models and tasks, including semantic segmentation, object detection, and instance segmentation. The method also achieves state-of-the-art results in remote sensing detection. Abstract: Vision Transformers (ViTs) have significantly advanced computer vision, demonstrating strong performance across various tasks. However, the attention mechanism in ViTs makes each layer function as a low-pass filter, and the stacked-layer architecture in existing transformers suffers from frequency vanishing. This leads to the loss of critical details and textures. 
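The Attention Inversion idea in the Method summary, where a high-pass filter is obtained by inverting the low-pass attention matrix, can be illustrated with a toy example. The sketch below is an assumption-laden simplification: it takes the complementary filter to be the identity minus the attention matrix, and it uses two fixed scalars, `low_weight` and `high_weight`, where the paper describes a dynamic combination.

```python
def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def attention_inversion(attn, low_weight, high_weight):
    """Blend a low-pass attention matrix with its high-pass complement.

    attn is a row-stochastic matrix, so applying it averages tokens (low-pass).
    The complement (identity minus attn) keeps what the averaging removed.
    The fixed scalar weights are a hypothetical stand-in for learned ones.
    """
    n = len(attn)
    return [
        [
            low_weight * attn[i][j]
            + high_weight * ((1.0 if i == j else 0.0) - attn[i][j])
            for j in range(n)
        ]
        for i in range(n)
    ]
```

With uniform attention over four tokens and the signal `[1, 2, 3, 6]`, the low-pass branch returns the mean `3` at every position, while the high-pass branch returns the deviations `[-2, -1, 0, 3]`, which is the detail that plain attention would smooth away.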
We propose a novel, circuit-theory-inspired strategy called Frequency-Dynamic Attention Modulation (FDAM), which can be easily plugged into ViTs. FDAM directly modulates the overall frequency response of ViTs and consists of two techniques: Attention Inversion (AttInv) and Frequency Dynamic Scaling (FreqScale). Since circuit theory uses low-pass filters as fundamental elements, we introduce AttInv, a method that generates complementary high-pass filtering by inverting the low-pass filter in the attention matrix, and dynamically combining the two. We further design FreqScale to weight different frequency components for fine-grained adjustments to the target response function. Through feature similarity analysis and effective rank evaluation, we demonstrate that our approach avoids representation collapse, leading to consistent performance improvements across various models, including SegFormer, DeiT, and MaskDINO. These improvements are evident in tasks such as semantic segmentation, object detection, and instance segmentation. Additionally, we apply our method to remote sensing detection, achieving state-of-the-art results in single-scale settings. The code is available at https://github.com/Linwei-Chen/FDAM.
[82] Dual form Complementary Masking for Domain-Adaptive Image Segmentation
Jiawen Wang,Yinda Chen,Xiaoyu Liu,Che Liu,Dong Liu,Jianqing Gao,Zhiwei Xiong
Main category: cs.CV
TL;DR: The paper proposes MaskTwins, a UDA framework that integrates masked reconstruction into training, enhancing domain generalization for image segmentation.
Details
Motivation: Previous works correlate Masked Image Modeling with consistency regularization in Unsupervised Domain Adaptation but neglect theoretical analysis, leading to a superficial understanding of masked reconstruction. Method: MaskTwins integrates masked reconstruction directly into the main training pipeline and enforces consistency between predictions of images masked in complementary ways. Result: Extensive experiments verify the superiority of MaskTwins over baseline methods in natural and biological image segmentation, demonstrating its advantages in extracting domain-invariant features. Conclusion: MaskTwins offers a new paradigm for domain-adaptive segmentation by extracting domain-invariant features without separate pre-training. Abstract: Recent works have correlated Masked Image Modeling (MIM) with consistency regularization in Unsupervised Domain Adaptation (UDA). However, they merely treat masking as a special form of deformation on the input images and neglect the theoretical analysis, which leads to a superficial understanding of masked reconstruction and insufficient exploitation of its potential in enhancing feature extraction and representation learning. In this paper, we reframe masked reconstruction as a sparse signal reconstruction problem and theoretically prove that the dual form of complementary masks possesses superior capabilities in extracting domain-agnostic image features. Based on this compelling insight, we propose MaskTwins, a simple yet effective UDA framework that integrates masked reconstruction directly into the main training pipeline. MaskTwins uncovers intrinsic structural patterns that persist across disparate domains by enforcing consistency between predictions of images masked in complementary ways, enabling domain generalization in an end-to-end manner. Extensive experiments verify the superiority of MaskTwins over baseline methods in natural and biological image segmentation. 
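The complementary-masking consistency described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the MaskTwins training code: patches are scalars, the mask ratio and seed are arbitrary, and the mean squared difference is a hypothetical choice of consistency measure.

```python
import random


def complementary_masks(num_patches, ratio=0.5, seed=0):
    """Return two binary masks that hide disjoint, exhaustive sets of patches."""
    rng = random.Random(seed)
    hidden = set(rng.sample(range(num_patches), int(num_patches * ratio)))
    mask_a = [0 if i in hidden else 1 for i in range(num_patches)]
    mask_b = [1 - m for m in mask_a]  # exactly the patches mask_a hides
    return mask_a, mask_b


def apply_mask(patches, mask):
    """Zero out the patches where the mask is 0."""
    return [p * m for p, m in zip(patches, mask)]


def consistency_loss(pred_a, pred_b):
    """Mean squared difference between predictions from the two masked views."""
    return sum((a - b) ** 2 for a, b in zip(pred_a, pred_b)) / len(pred_a)
```

In use, the same network would predict from `apply_mask(x, mask_a)` and `apply_mask(x, mask_b)`, and `consistency_loss` would penalize disagreement. Because the two views share no visible patch, agreement has to come from structure that survives in both halves of the image.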
These results demonstrate the significant advantages of MaskTwins in extracting domain-invariant features without the need for separate pre-training, offering a new paradigm for domain-adaptive segmentation.
[83] Deep Neural Encoder-Decoder Model to Relate fMRI Brain Activity with Naturalistic Stimuli
Florian David,Michael Chan,Elenor Morgenroth,Patrik Vuilleumier,Dimitri Van De Ville
Main category: cs.CV
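TL;DR: This paper proposes a deep neural network model that can decode fMRI responses of the brain to naturalistic movie stimuli and reconstruct the visual input.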
Details
Motivation: The study aims to bridge the temporal resolution gap between natural movie stimuli and functional magnetic resonance imaging (fMRI) acquisitions, to predict voxel activity in and around the visual cortex, and to reconstruct the corresponding visual inputs from neural activity. Method: Leveraging temporally correlated input from consecutive film frames and an architecture with temporal convolutional layers, the authors propose an end-to-end deep neural encoder-decoder model to encode and decode brain responses to naturalistic stimuli. Result: The model successfully predicts voxel activity in visual cortex regions, and saliency-map analysis identifies the brain regions that contribute most to visual decoding: the middle occipital area, the fusiform area, and the calcarine. Conclusion: The study shows that deep learning models can be used to probe our understanding of visual processing in films. Abstract: We propose an end-to-end deep neural encoder-decoder model to encode and decode brain activity in response to naturalistic stimuli using functional magnetic resonance imaging (fMRI) data. Leveraging temporally correlated input from consecutive film frames, we employ temporal convolutional layers in our architecture, which effectively allows us to bridge the temporal resolution gap between natural movie stimuli and fMRI acquisitions. Our model predicts activity of voxels in and around the visual cortex and performs reconstruction of corresponding visual inputs from neural activity. Finally, we investigate brain regions contributing to visual decoding through saliency maps. We find that the most contributing regions are the middle occipital area, the fusiform area, and the calcarine, respectively employed in shape perception, complex recognition (in particular face perception), and basic visual features such as edges and contrasts. These functions being strongly solicited are in line with the decoder's capability to reconstruct edges, faces, and contrasts. All in all, this suggests the possibility to probe our understanding of visual processing in films using as a proxy the behaviour of deep learning models such as the one proposed in this paper.
[84] SS-DC: Spatial-Spectral Decoupling and Coupling Across Visible-Infrared Gap for Domain Adaptive Object Detection
Xiwei Zhang,Chunjin Yang,Yiming Xiao,Runtong Zhang,Fanman Meng
Main category: cs.CV
TL;DR: This paper introduces the SS-DC framework for RGB-IR unsupervised domain adaptive object detection, effectively decoupling and coupling domain-invariant and domain-specific features to reduce domain bias and achieve superior performance.
Details
Motivation: Existing UDAOD methods treat the RGB domain as a single domain, neglecting the presence of multiple subdomains like daytime, nighttime, and foggy scenes. This work aims to improve adaptation by decoupling domain-invariant and domain-specific features across these subdomains. Method: The paper proposes a Spectral Adaptive Idempotent Decoupling (SAID) module for spectral decomposition, a filter bank-based spectral processing paradigm, a self-distillation-driven decoupling loss, and a spatial-spectral coupling method to enhance domain adaptation performance. Result: Extensive experiments show that the proposed method significantly improves baseline performance and outperforms existing UDAOD approaches on multiple RGB-IR datasets, including under a new experimental protocol based on the FLIR-ADAS dataset. Conclusion: The proposed SS-DC framework improves RGB-IR domain adaptation by decoupling and coupling domain-invariant and domain-specific features, outperforming existing UDAOD methods on multiple datasets. Abstract: Unsupervised domain adaptive object detection (UDAOD) from the visible domain to the infrared (RGB-IR) domain is challenging. Existing methods regard the RGB domain as a unified domain and neglect the multiple subdomains within it, such as daytime, nighttime, and foggy scenes. We argue that decoupling the domain-invariant (DI) and domain-specific (DS) features across these multiple subdomains is beneficial for RGB-IR domain adaptation. To this end, this paper proposes a new SS-DC framework based on a decoupling-coupling strategy. In terms of decoupling, we design a Spectral Adaptive Idempotent Decoupling (SAID) module in the aspect of spectral decomposition. Due to the style and content information being highly embedded in different frequency bands, this module can decouple DI and DS components more accurately and interpretably. 
A novel filter bank-based spectral processing paradigm and a self-distillation-driven decoupling loss are proposed to improve the spectral domain decoupling. In terms of coupling, a new spatial-spectral coupling method is proposed, which realizes joint coupling through spatial and spectral DI feature pyramids. Meanwhile, this paper introduces DS from decoupling to reduce the domain bias. Extensive experiments demonstrate that our method can significantly improve the baseline performance and outperform existing UDAOD methods on multiple RGB-IR datasets, including a new experimental protocol proposed in this paper based on the FLIR-ADAS dataset.
[85] Dataset Ownership Verification for Pre-trained Masked Models
Yuechen Xie,Jie Song,Yicheng Shan,Xiaoyan Zhang,Yuanyu Wan,Shengxuming Zhang,Jiarui Duan,Mingli Song
Main category: cs.CV
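TL;DR: DOV4MM is the first dataset ownership verification method for masked models; it protects dataset ownership by analyzing a model's ability to reconstruct masked information in the embedding space.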
Details
Motivation: Open-source datasets need protection against potential misuse, but existing verification techniques mainly target supervised models and contrastive pre-trained models and are not suited to masked models. Method: Dataset ownership is verified by observing how difficult it is for a model to reconstruct masked information in the embedding space after it has been pre-trained on the target dataset. Result: Experiments on ImageNet-1K and WikiText-103 demonstrate the effectiveness of DOV4MM: it rejects the null hypothesis with a p-value far below 0.05 and surpasses all prior methods. Conclusion: DOV4MM is an effective dataset ownership verification method, particularly suited to masked models, that helps dataset owners protect their rights. Abstract: High-quality open-source datasets have emerged as a pivotal catalyst driving the swift advancement of deep learning, while facing the looming threat of potential exploitation. Protecting these datasets is of paramount importance for the interests of their owners. The verification of dataset ownership has evolved into a crucial approach in this domain; however, existing verification techniques are predominantly tailored to supervised models and contrastive pre-trained models, rendering them ill-suited for direct application to the increasingly prevalent masked models. In this work, we introduce the inaugural methodology addressing this critical, yet unresolved challenge, termed Dataset Ownership Verification for Masked Modeling (DOV4MM). The central objective is to ascertain whether a suspicious black-box model has been pre-trained on a particular unlabeled dataset, thereby assisting dataset owners in safeguarding their rights. DOV4MM is grounded in our empirical observation that when a model is pre-trained on the target dataset, the difficulty of reconstructing masked information within the embedding space exhibits a marked contrast to models not pre-trained on that dataset. We validated the efficacy of DOV4MM through ten masked image models on ImageNet-1K and four masked language models on WikiText-103. The results demonstrate that DOV4MM rejects the null hypothesis, with a $p$-value considerably below 0.05, surpassing all prior approaches. Code is available at https://github.com/xieyc99/DOV4MM.
[86] MVAR: MultiVariate AutoRegressive Air Pollutants Forecasting Model
Xu Fan,Zhihao Wang,Yuetan Lin,Yan Zhang,Yang Xiang,Hao Li
Main category: cs.CV
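TL;DR: This paper proposes MVAR, a new multivariate air pollutant forecasting model that, through a new architecture and training paradigm, outperforms existing methods in long-term forecasting and data utilization efficiency.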
Details
Motivation: Existing studies focus mainly on single-pollutant forecasting and neglect the interactions among pollutants and their diverse spatial responses, whereas practice calls for multivariate air pollutant forecasting. Method: The authors propose the MultiVariate AutoRegressive air pollutants forecasting model (MVAR) and design a Multivariate Autoregressive Training Paradigm and a Meteorological Coupled Spatial Transformer block. Result: MVAR achieves 120-hour long-term sequential forecasting, and the authors construct a comprehensive dataset covering 6 major pollutants. Conclusion: MVAR performs well in multi-pollutant forecasting and outperforms existing state-of-the-art methods, and the experiments validate the effectiveness of the proposed architecture. Abstract: Air pollutants pose a significant threat to the environment and human health, thus forecasting accurate pollutant concentrations is essential for pollution warnings and policy-making. Existing studies predominantly focus on single-pollutant forecasting, neglecting the interactions among different pollutants and their diverse spatial responses. To address the practical needs of forecasting multivariate air pollutants, we propose the MultiVariate AutoRegressive air pollutants forecasting model (MVAR), which reduces the dependency on long-time-window inputs and boosts the data utilization efficiency. We also design the Multivariate Autoregressive Training Paradigm, enabling MVAR to achieve 120-hour long-term sequential forecasting. Additionally, MVAR develops the Meteorological Coupled Spatial Transformer block, enabling the flexible coupling of AI-based meteorological forecasts while learning the interactions among pollutants and their diverse spatial responses. As for the lack of standardized datasets in air pollutants forecasting, we construct a comprehensive dataset covering 6 major pollutants across 75 cities in North China from 2018 to 2023, including ERA5 reanalysis data and FuXi-2.0 forecast data. Experimental results demonstrate that the proposed model outperforms state-of-the-art methods and validate the effectiveness of the proposed architecture.
[87] 3D-MoRe: Unified Modal-Contextual Reasoning for Embodied Question Answering
Rongtao Xu,Han Gao,Mingming Yu,Dong An,Shunpeng Chen,Changwei Wang,Li Guo,Xiaodan Liang,Shibiao Xu
Main category: cs.CV
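TL;DR: This paper proposes 3D-MoRe, a new method for generating large-scale 3D-language datasets, which significantly outperforms existing methods on both evaluated tasks.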
Details
Motivation: Indoor scene tasks such as question answering and dense captioning have a growing need for diverse and scalable data. Method: 3D-MoRe integrates key components including multi-modal embedding, cross-modal interaction, and a language model decoder, and uses the ScanNet 3D scene dataset together with text annotations from ScanQA and ScanRefer to generate large numbers of QA pairs and object descriptions. Result: On ScanQA, 3D-MoRe improves the CIDEr score by 2.15%; on ScanRefer, it improves CIDEr@0.5 by 1.84%. Conclusion: By leveraging the strengths of foundation models, 3D-MoRe successfully generates large-scale 3D-language datasets and significantly outperforms state-of-the-art baselines on ScanQA and ScanRefer. Abstract: With the growing need for diverse and scalable data in indoor scene tasks, such as question answering and dense captioning, we propose 3D-MoRe, a novel paradigm designed to generate large-scale 3D-language datasets by leveraging the strengths of foundational models. The framework integrates key components, including multi-modal embedding, cross-modal interaction, and a language model decoder, to process natural language instructions and 3D scene data. This approach facilitates enhanced reasoning and response generation in complex 3D environments. Using the ScanNet 3D scene dataset, along with text annotations from ScanQA and ScanRefer, 3D-MoRe generates 62,000 question-answer (QA) pairs and 73,000 object descriptions across 1,513 scenes. We also employ various data augmentation techniques and implement semantic filtering to ensure high-quality data. Experiments on ScanQA demonstrate that 3D-MoRe significantly outperforms state-of-the-art baselines, with the CIDEr score improving by 2.15%. Similarly, on ScanRefer, our approach achieves a notable increase in CIDEr@0.5 by 1.84%, highlighting its effectiveness in both tasks. Our code and generated datasets will be publicly released to benefit the community, and both can be accessed at https://3D-MoRe.github.io.
[88] SGLoc: Semantic Localization System for Camera Pose Estimation from 3D Gaussian Splatting Representation
Beining Xu,Siting Zhu,Hesheng Wang
Main category: cs.CV
TL;DR: SGLoc is a novel localization system that uses semantic information and 3D Gaussian Splatting to accurately estimate camera poses without prior information.
Details
Motivation: The motivation is to develop a localization system that can estimate 6DoF camera poses directly from 3DGS representation without relying on prior pose information. Method: SGLoc introduces a multi-level pose regression strategy and a semantic-based global retrieval algorithm to estimate and refine camera poses using 3D Gaussian Splatting (3DGS) representation. Result: SGLoc outperforms baselines on 12scenes and 7scenes datasets, achieving accurate global localization without initial pose estimation. Conclusion: SGLoc demonstrates superior performance in global localization without requiring initial pose priors, showing excellent capabilities on 12scenes and 7scenes datasets. Abstract: We propose SGLoc, a novel localization system that directly regresses camera poses from 3D Gaussian Splatting (3DGS) representation by leveraging semantic information. Our method utilizes the semantic relationship between 2D image and 3D scene representation to estimate the 6DoF pose without prior pose information. In this system, we introduce a multi-level pose regression strategy that progressively estimates and refines the pose of query image from the global 3DGS map, without requiring initial pose priors. Moreover, we introduce a semantic-based global retrieval algorithm that establishes correspondences between 2D (image) and 3D (3DGS map). By matching the extracted scene semantic descriptors of 2D query image and 3DGS semantic representation, we align the image with the local region of the global 3DGS map, thereby obtaining a coarse pose estimation. Subsequently, we refine the coarse pose by iteratively optimizing the difference between the query image and the rendered image from 3DGS. Our SGLoc demonstrates superior performance over baselines on 12scenes and 7scenes datasets, showing excellent capabilities in global localization without initial pose prior. 
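The coarse localization step described above matches a semantic descriptor of the query image against descriptors of local regions of the 3DGS map. A minimal sketch of that kind of retrieval follows, assuming descriptors are plain vectors compared by cosine similarity; the abstract does not specify the descriptor or the metric, and `retrieve_region` and the toy data are hypothetical.

```python
import numpy as np

def retrieve_region(query_desc, region_descs):
    """Return the index of the map region most similar to the query, plus all scores."""
    q = query_desc / np.linalg.norm(query_desc)
    r = region_descs / np.linalg.norm(region_descs, axis=1, keepdims=True)
    scores = r @ q  # cosine similarity of each region to the query
    return int(np.argmax(scores)), scores

# Three toy region descriptors and one query descriptor.
regions = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.6, 0.8, 0.0]])
query = np.array([0.1, 0.9, 0.0])
best, scores = retrieve_region(query, regions)
print(best)  # 1
```

The pose associated with the retrieved region would then serve as the coarse estimate that the iterative render-and-compare refinement in the abstract starts from.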
Code will be available at https://github.com/IRMVLab/SGLoc.
[89] Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery
Xinhang Wan,Jiyuan Liu,Qian Qu,Suyuan Liu,Chuyu Zhang,Fangdi Wang,Xinwang Liu,En Zhu,Kunlun He
Main category: cs.CV
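TL;DR: This study proposes IICMVNCD, a new multi-view novel class discovery framework that addresses the limitations of existing methods in handling multi-view data and their reliance on pseudo-labels.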
Details
Motivation: Existing novel class discovery methods focus mainly on single-view data and rely on pseudo-labels, which leads to unstable performance on multi-view data. Method: At the intra-view level, matrix factorization decomposes features into view-specific shared base matrices and factor matrices; at the inter-view level, view relationships among known classes guide the clustering of novel classes. Result: Experimental results validate the effectiveness of the proposed method on multi-view novel class discovery. Conclusion: The paper proposes IICMVNCD, a new multi-view novel class discovery framework that clusters novel classes by leveraging knowledge from known classes and achieves effective results on multi-view data. Abstract: In this paper, we address the problem of novel class discovery (NCD), which aims to cluster novel classes by leveraging knowledge from disjoint known classes. While recent advances have made significant progress in this area, existing NCD methods face two major limitations. First, they primarily focus on single-view data (e.g., images), overlooking the increasingly common multi-view data, such as multi-omics datasets used in disease diagnosis. Second, their reliance on pseudo-labels to supervise novel class clustering often results in unstable performance, as pseudo-label quality is highly sensitive to factors such as data noise and feature dimensionality. To address these challenges, we propose a novel framework named Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery (IICMVNCD), which is the first attempt to explore NCD in multi-view setting so far. Specifically, at the intra-view level, leveraging the distributional similarity between known and novel classes, we employ matrix factorization to decompose features into view-specific shared base matrices and factor matrices. The base matrices capture distributional consistency among the two datasets, while the factor matrices model pairwise relationships between samples. At the inter-view level, we utilize view relationships among known classes to guide the clustering of novel classes. This includes generating predicted labels through the weighted fusion of factor matrices and dynamically adjusting view weights of known classes based on the supervision loss, which are then transferred to novel class learning. Experimental results validate the effectiveness of our proposed approach.
[90] MoViAD: Modular Visual Anomaly Detection
Manuel Barusco,Francesco Borsatti,Arianna Stropeni,Davide Dalle Pezze,Gian Antonio Susto
Main category: cs.CV
TL;DR: The paper introduces MoViAD, a highly modular library for Visual Anomaly Detection that provides fast access to advanced models, supports diverse scenarios, and facilitates efficient deployment on Edge and IoT devices.
Details
Motivation: The motivation is to accelerate research and deployment in the field of Visual Anomaly Detection (VAD) by providing a comprehensive, modular library that simplifies access to advanced models and tools. Method: The paper introduces MoViAD, a modular library integrating state-of-the-art VAD models, trainers, datasets, and utilities. It supports various scenarios, including continual, semi-supervised, few-shots, and noisy settings, and addresses deployment challenges with optimized models and quantization/compression tools. Result: MoViAD offers fast and easy access to VAD models, supports a wide range of scenarios, addresses deployment challenges for Edge and IoT settings, and provides flexibility for researchers and engineers. Conclusion: MoViAD is a comprehensive and modular library that accelerates research and deployment in the VAD domain by providing easy access to state-of-the-art models, efficient deployment tools, and support for various scenarios. Abstract: VAD is a critical field in machine learning focused on identifying deviations from normal patterns in images, often challenged by the scarcity of anomalous data and the need for unsupervised training. To accelerate research and deployment in this domain, we introduce MoViAD, a comprehensive and highly modular library designed to provide fast and easy access to state-of-the-art VAD models, trainers, datasets, and VAD utilities. MoViAD supports a wide array of scenarios, including continual, semi-supervised, few-shots, noisy, and many more. In addition, it addresses practical deployment challenges through dedicated Edge and IoT settings, offering optimized models and backbones, along with quantization and compression utilities for efficient on-device execution and distributed inference. MoViAD integrates a selection of backbones, robust evaluation VAD metrics (pixel-level and image-level) and useful profiling tools for efficiency analysis. 
The library is designed for fast, effortless deployment, enabling machine learning engineers to easily use it for their specific setup with custom models, datasets, and backbones. At the same time, it offers the flexibility and extensibility researchers need to develop and experiment with new methods.
[91] InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing
Kun-Hsiang Lin,Yu-Wen Tseng,Kang-Yang Huang,Jhih-Ciang Wu,Wen-Huang Cheng
Main category: cs.CV
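TL;DR: InstructFLIP is a new instruction-tuned framework that integrates vision-language models with a meta-domain strategy to improve the cross-domain generalization of face anti-spoofing systems and reduce training redundancy.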
Details
Motivation: The paper aims to address two main challenges in FAS: limited semantic understanding of attack types and training redundancy across domains. Method: InstructFLIP uses vision-language models (VLMs) to enhance the perception of visual input and adopts a meta-domain strategy to learn a unified model that generalizes across multiple domains. In addition, the framework explicitly decouples instructions into content and style components, which focus respectively on the essential semantics of spoofing and on variations related to environment and camera characteristics. Result: Experiments show that InstructFLIP surpasses SOTA models in accuracy and substantially reduces training redundancy across domains in FAS. Conclusion: By integrating vision-language models with a meta-domain strategy, InstructFLIP effectively addresses training redundancy and limited semantic understanding in FAS and shows strong cross-domain generalization. Abstract: Face anti-spoofing (FAS) aims to construct a robust system that can withstand diverse attacks. While recent efforts have concentrated mainly on cross-domain generalization, two significant challenges persist: limited semantic understanding of attack types and training redundancy across domains. We address the first by integrating vision-language models (VLMs) to enhance the perception of visual input. For the second challenge, we employ a meta-domain strategy to learn a unified model that generalizes well across multiple domains. Our proposed InstructFLIP is a novel instruction-tuned framework that leverages VLMs to enhance generalization via textual guidance trained solely on a single domain. At its core, InstructFLIP explicitly decouples instructions into content and style components, where content-based instructions focus on the essential semantics of spoofing, and style-based instructions consider variations related to the environment and camera characteristics. Extensive experiments demonstrate the effectiveness of InstructFLIP by outperforming SOTA models in accuracy and substantially reducing training redundancy across diverse domains in FAS. Project website is available at https://kunkunlin1221.github.io/InstructFLIP.
[92] MS-DETR: Towards Effective Video Moment Retrieval and Highlight Detection by Joint Motion-Semantic Learning
Hongxu Ma,Guanshuo Wang,Fufu Yu,Qiong Jia,Shouhong Ding
Main category: cs.CV
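TL;DR: This paper proposes MS-DETR, which jointly models the temporal motion and spatial semantic features of video to improve video moment retrieval and highlight detection, and also addresses data sparsity.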
Details
Motivation: Existing DETR-based joint frameworks do not fully exploit the relationships between temporal motion and spatial semantics in video content; the authors aim to improve video moment retrieval and highlight detection by modeling these intricate relationships. Method: The paper proposes the DETR-based Motion-Semantics DETR (MS-DETR) framework, which uses an encoder-decoder structure to model motion and semantic features separately and introduces contrastive denoising learning to address data sparsity. Result: Extensive experiments on four video moment retrieval and highlight detection benchmarks show that MS-DETR clearly outperforms existing state-of-the-art methods. Conclusion: Through unified learning of spatio-temporal semantic features, MS-DETR performs strongly on video moment retrieval and highlight detection and surpasses existing state-of-the-art models. Abstract: Video Moment Retrieval (MR) and Highlight Detection (HD) aim to pinpoint specific moments and assess clip-wise relevance based on the text query. While DETR-based joint frameworks have made significant strides, there remains untapped potential in harnessing the intricate relationships between temporal motion and spatial semantics within video content. In this paper, we propose the Motion-Semantics DETR (MS-DETR), a framework that captures rich motion-semantics features through unified learning for MR/HD tasks. The encoder first explicitly models disentangled intra-modal correlations within motion and semantics dimensions, guided by the given text queries. Subsequently, the decoder utilizes the task-wise correlation across temporal motion and spatial semantics dimensions to enable precise query-guided localization for MR and refined highlight boundary delineation for HD. Furthermore, we observe the inherent sparsity dilemma within the motion and semantics dimensions of MR/HD datasets. To address this issue, we enrich the corpus from both dimensions by generation strategies and propose contrastive denoising learning to ensure the above components learn robustly and effectively. Extensive experiments on four MR/HD benchmarks demonstrate that our method outperforms existing state-of-the-art models by a margin. Our code is available at https://github.com/snailma0229/MS-DETR.git.
[93] Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics
Muleilan Pei,Shaoshuai Shi,Xuesong Chen,Xu Liu,Shaojie Shen
Main category: cs.CV
TL;DR: This paper proposes a new strategy for motion forecasting in autonomous driving by first reasoning about behavior intentions using Inverse Reinforcement Learning, then using these intentions to guide trajectory prediction, resulting in improved prediction confidence and performance.
Details
Motivation: The motivation is to improve the safety of autonomous driving systems by enhancing the accuracy and confidence of motion forecasting for on-road traffic agents. Method: The paper proposes a 'First Reasoning, Then Forecasting' strategy that incorporates behavior intentions as spatial guidance for trajectory prediction. It introduces an interpretable, reward-driven intention reasoner based on Inverse Reinforcement Learning (IRL), uses a unified vectorized representation for traffic agents and scene elements, aggregates contextual features through a query-centric paradigm, performs policy rollouts to reason about intentions, and employs a hierarchical DETR-like decoder with bidirectional selective state space models for trajectory generation. Result: Extensive experiments on the Argoverse and nuScenes datasets show that the approach significantly improves trajectory prediction confidence and performs competitively with state-of-the-art methods. Conclusion: The paper concludes that the proposed method significantly enhances trajectory prediction confidence and achieves competitive performance compared to state-of-the-art methods. Abstract: Motion forecasting for on-road traffic agents presents both a significant challenge and a critical necessity for ensuring safety in autonomous driving systems. In contrast to most existing data-driven approaches that directly predict future trajectories, we rethink this task from a planning perspective, advocating a "First Reasoning, Then Forecasting" strategy that explicitly incorporates behavior intentions as spatial guidance for trajectory prediction. To achieve this, we introduce an interpretable, reward-driven intention reasoner grounded in a novel query-centric Inverse Reinforcement Learning (IRL) scheme. Our method first encodes traffic agents and scene elements into a unified vectorized representation, then aggregates contextual features through a query-centric paradigm. 
This enables the derivation of a reward distribution, a compact yet informative representation of the target agent's behavior within the given scene context via IRL. Guided by this reward heuristic, we perform policy rollouts to reason about multiple plausible intentions, providing valuable priors for subsequent trajectory generation. Finally, we develop a hierarchical DETR-like decoder integrated with bidirectional selective state space models to produce accurate future trajectories along with their associated probabilities. Extensive experiments on the large-scale Argoverse and nuScenes motion forecasting datasets demonstrate that our approach significantly enhances trajectory prediction confidence, achieving highly competitive performance relative to state-of-the-art methods.
[94] YOLOv8-SMOT: An Efficient and Robust Framework for Real-Time Small Object Tracking via Slice-Assisted Training and Adaptive Association
Xiang Yu,Xinyao Liu,Guang Liang
Main category: cs.CV
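TL;DR: This paper proposes a new method for small multi-object tracking, with innovations at both the detection and tracking levels, and achieves state-of-the-art performance in a real-world application.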
Details
Motivation: Tracking small, agile multi-objects (such as birds) from a UAV perspective is a highly challenging computer vision task; the main difficulties are scarce target appearance features, complex motion entanglement caused by the combined dynamics of camera and targets, and frequent occlusions and identity ambiguity caused by dense flocking behavior. Method: The paper adopts the tracking-by-detection paradigm with innovations at the detection and association levels. For detection, it proposes a systematic training enhancement framework named SliceTrain, which uses "deterministic full-coverage slicing" and "slice-level stochastic augmentation" to address insufficient learning of small objects in high-resolution images. For tracking, it designs a robust tracker that is completely independent of appearance information, integrating a "motion direction maintenance (EMA) mechanism" and an adaptive similarity metric combining "bounding box expansion and distance penalty" into the OC-SORT framework. Result: The method reaches an SO-HOTA score of 55.205 on the SMOT4SB public test set, which is state-of-the-art, and the code will be open-sourced on GitHub. Conclusion: The paper presents an innovative solution to the small multi-object tracking (SMOT) problem and achieves state-of-the-art performance in the MVA 2025 "Finding Birds" challenge, demonstrating its effectiveness. Abstract: Tracking small, agile multi-objects (SMOT), such as birds, from an Unmanned Aerial Vehicle (UAV) perspective is a highly challenging computer vision task. The difficulty stems from three main sources: the extreme scarcity of target appearance features, the complex motion entanglement caused by the combined dynamics of the camera and the targets themselves, and the frequent occlusions and identity ambiguity arising from dense flocking behavior. This paper details our championship-winning solution in the MVA 2025 "Finding Birds" Small Multi-Object Tracking Challenge (SMOT4SB), which adopts the tracking-by-detection paradigm with targeted innovations at both the detection and association levels. On the detection side, we propose a systematic training enhancement framework named SliceTrain. This framework, through the synergy of 'deterministic full-coverage slicing' and 'slice-level stochastic augmentation', effectively addresses the problem of insufficient learning for small objects in high-resolution image training. On the tracking side, we designed a robust tracker that is completely independent of appearance information. By integrating a motion direction maintenance (EMA) mechanism and an adaptive similarity metric combining bounding box expansion and distance penalty into the OC-SORT framework, our tracker can stably handle irregular motion and maintain target identities.
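As an illustration of what "deterministic full-coverage slicing" could look like, the sketch below tiles a high-resolution frame with fixed-size, overlapping windows so that every pixel falls in at least one slice. The tile size, overlap, and function names are assumptions made for illustration; the actual SliceTrain settings are not given in this summary.

```python
def slice_starts(length, tile, overlap):
    """Start offsets of fixed-size tiles that together cover [0, length)."""
    if length <= tile:
        return [0]
    stride = tile - overlap  # assumes overlap < tile
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # shift the last tile back so it ends at the border
    return starts

def full_coverage_slices(width, height, tile=640, overlap=128):
    """Return (x0, y0, x1, y1) windows that together cover the whole image."""
    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in slice_starts(height, tile, overlap)
            for x in slice_starts(width, tile, overlap)]

boxes = full_coverage_slices(3840, 2160)
print(len(boxes))  # 32
```

Because the windows depend only on the image size, the same slices are produced on every epoch; any randomness would come from augmentation applied to each slice afterwards.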
Our method achieves state-of-the-art performance on the SMOT4SB public test set, reaching an SO-HOTA score of 55.205, which fully validates the effectiveness and advancement of our framework in solving complex real-world SMOT problems. The source code will be made available at https://github.com/Salvatore-Love/YOLOv8-SMOT.
[95] BRUM: Robust 3D Vehicle Reconstruction from 360 Sparse Images
Davide Di Nucci,Matteo Tomei,Guido Borghi,Luca Ciuffreda,Roberto Vezzani,Rita Cucchiara
Main category: cs.CV
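TL;DR: This study improves Gaussian Splatting by combining depth maps with robust pose estimation to achieve high-quality 3D vehicle reconstruction from sparse views, and introduces a new vehicle dataset containing both synthetic and real data.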
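One ingredient named in the abstract below is a selective photometric loss applied only to high-confidence pixels. A minimal sketch of such a masked loss follows, assuming an L1 photometric term and a per-pixel confidence map with a fixed threshold; the threshold value and the function name are hypothetical, and the paper may define both differently.

```python
import numpy as np

def selective_photometric_loss(rendered, target, confidence, threshold=0.5):
    """Mean absolute color error over pixels whose confidence exceeds the threshold."""
    mask = confidence > threshold                          # (H, W) boolean
    if not mask.any():
        return 0.0
    per_pixel = np.abs(rendered - target).mean(axis=-1)    # (H, W)
    return float(per_pixel[mask].mean())

rendered = np.zeros((2, 2, 3))
target = np.zeros((2, 2, 3))
target[0, 0] = 1.0   # error at a high-confidence pixel: counted
target[1, 1] = 1.0   # error at a low-confidence pixel: ignored
confidence = np.array([[0.9, 0.9],
                       [0.9, 0.1]])
loss = selective_photometric_loss(rendered, target, confidence)
```

In this toy case three pixels pass the threshold and only one of them has an error, so the loss averages over those three and the unreliable pixel contributes nothing.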
Details
Motivation: Existing 3D reconstruction techniques such as NeRF and Gaussian Splatting rely on dense input views, which limits their usability in practice. The paper therefore addresses the problem of reconstructing vehicles from sparse-view inputs. Method: The traditional Gaussian Splatting method is improved by introducing a selective photometric loss and replacing the standard Structure-from-Motion pipeline with the DUSt3R architecture. In addition, depth maps and training-data augmentation are used to improve the accuracy of camera pose estimation. Result: Experiments show state-of-the-art performance across multiple benchmarks, with high-quality reconstruction even under constrained inputs. Conclusion: The paper proposes an improved Gaussian Splatting method that combines depth maps with a robust pose estimation architecture to achieve high-quality 3D vehicle reconstruction from sparse-view inputs, with broad application prospects. Abstract: Accurate 3D reconstruction of vehicles is vital for applications such as vehicle inspection, predictive maintenance, and urban planning. Existing methods like Neural Radiance Fields and Gaussian Splatting have shown impressive results but remain limited by their reliance on dense input views, which hinders real-world applicability. This paper addresses the challenge of reconstructing vehicles from sparse-view inputs, leveraging depth maps and a robust pose estimation architecture to synthesize novel views and augment training data. Specifically, we enhance Gaussian Splatting by integrating a selective photometric loss, applied only to high-confidence pixels, and replacing standard Structure-from-Motion pipelines with the DUSt3R architecture to improve camera pose estimation. Furthermore, we present a novel dataset featuring both synthetic and real-world public transportation vehicles, enabling extensive evaluation of our approach. Experimental results demonstrate state-of-the-art performance across multiple benchmarks, showcasing the method's ability to achieve high-quality reconstructions even under constrained input conditions.
[96] DeepShade: Enable Shade Simulation by Text-conditioned Image Generation
Longchao Da,Xiangrui Liu,Mithun Shivakoti,Thirulogasankar Pranav Kutralingam,Yezhou Yang,Hua Wei
Main category: cs.CV
TL;DR: This paper introduces DeepShade, a model that generates shade images using a novel dataset and 3D simulations, aiming to improve urban planning and route decisions during heatwaves.
Details
Motivation: Current routing systems do not incorporate shade information, which is important for public health during heatwaves. Estimating shade from satellite imagery is challenging due to noise and limited training data. Method: The paper proposes DeepShade, a diffusion-based model that uses RGB and Canny edge layers along with contrastive learning to synthesize shade variations over time, conditioned on textual descriptions of environmental conditions. Result: The authors created a comprehensive dataset using Blender-based 3D simulations aligned with satellite images and demonstrated improved performance in generating shade images, which were used to calculate shade ratios for route planning in Tempe, Arizona. Conclusion: The work will benefit society by providing a reference for urban planning in extreme heat weather and its potential practical applications in the environment. Abstract: Heatwaves pose a significant threat to public health, especially as global warming intensifies. However, current routing systems (e.g., online maps) fail to incorporate shade information due to the difficulty of estimating shades directly from noisy satellite imagery and the limited availability of training data for generative models. In this paper, we address these challenges through two main contributions. First, we build an extensive dataset covering diverse longitude-latitude regions, varying levels of building density, and different urban layouts. Leveraging Blender-based 3D simulations alongside building outlines, we capture building shadows under various solar zenith angles throughout the year and at different times of day. These simulated shadows are aligned with satellite images, providing a rich resource for learning shade patterns. Second, we propose the DeepShade, a diffusion-based model designed to learn and synthesize shade variations over time. 
It emphasizes the nuance of edge features by jointly considering RGB with the Canny edge layer, and incorporates contrastive learning to capture the temporal change rules of shade. Then, by conditioning on textual descriptions of known conditions (e.g., time of day, solar angles), our framework provides improved performance in generating shade images. We demonstrate the utility of our approach by using our shade predictions to calculate shade ratios for real-world route planning in Tempe, Arizona. We believe this work will benefit society by providing a reference for urban planning in extreme heat weather and its potential practical applications in the environment.
[97] Out-of-distribution data supervision towards biomedical semantic segmentation
Yiquan Gao,Duohui Xu
Main category: cs.CV
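TL;DR: Med-OoD is a biomedical segmentation framework that requires no external data sources, feature regularization objectives, or additional annotations; by introducing OoD data supervision, it addresses unexpected misclassification between foreground and background objects when learning on limited and imperfect medical datasets.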
Details
Motivation: Biomedical segmentation networks are prone to unexpected misclassification between foreground and background objects when learning on limited and imperfect medical datasets. Inspired by the power of OoD data in other visual tasks, the authors aim to address this problem. Method: The paper proposes Med-OoD, a data-centric framework that introduces OoD data supervision into fully supervised biomedical segmentation without requiring external data sources, feature regularization objectives, or additional annotations. The method integrates seamlessly into segmentation networks without any architectural modification. Result: Extensive experiments show that Med-OoD largely prevents pixel misclassification on medical images across various segmentation networks and achieves considerable performance improvements on the Lizard dataset. In addition, training a medical segmentation network entirely on OoD data with no foreground class labels yields a test result of 76.1% mIoU. Conclusion: Med-OoD effectively addresses foreground-background misclassification in biomedical segmentation and shows strong performance on limited and imperfect medical datasets. The authors hope this new learning paradigm will prompt a rethinking of the role of OoD data. Abstract: Biomedical segmentation networks easily suffer from the unexpected misclassification between foreground and background objects when learning on limited and imperfect medical datasets. Inspired by the strong power of Out-of-Distribution (OoD) data on other visual tasks, we propose a data-centric framework, Med-OoD, to address this issue by introducing OoD data supervision into fully-supervised biomedical segmentation with none of the following needs: (i) external data sources, (ii) feature regularization objectives, (iii) additional annotations. Our method can be seamlessly integrated into segmentation networks without any modification on the architectures. Extensive experiments show that Med-OoD largely prevents various segmentation networks from the pixel misclassification on medical images and achieves considerable performance improvements on the Lizard dataset. We also present an emerging learning paradigm of training a medical segmentation network completely using OoD data devoid of foreground class labels, surprisingly turning out 76.1% mIoU as test result. We hope this learning paradigm will attract people to rethink the roles of OoD data. Code is made available at https://github.com/StudioYG/Med-OoD.
[98] Non-Adaptive Adversarial Face Generation
Sunpill Kim,Seunghun Paik,Chanwoo Hwang,Minsu Kim,Jae Hong Seo
Main category: cs.CV
TL;DR: This paper introduces a novel method to generate adversarial facial images that can bypass face recognition systems using attributed subspheres, achieving high success rates with minimal queries.
Details
Motivation: The motivation is to address security and privacy threats posed by adversarial attacks on face recognition systems, especially when these systems are used for identity verification. Method: The method utilizes the structural characteristics of the FRS feature space by leveraging attributed subspheres formed by individuals sharing the same attribute, enabling non-adaptive adversarial face generation. Result: The method achieves a success rate of over 93% against AWS's CompareFaces API with only a single non-adaptive query of 100 face images. Conclusion: The paper concludes that the proposed method effectively generates adversarial faces that can deceive face recognition systems with high success rates while requiring minimal queries. Abstract: Adversarial attacks on face recognition systems (FRSs) pose serious security and privacy threats, especially when these systems are used for identity verification. In this paper, we propose a novel method for generating adversarial faces-synthetic facial images that are visually distinct yet recognized as a target identity by the FRS. Unlike iterative optimization-based approaches (e.g., gradient descent or other iterative solvers), our method leverages the structural characteristics of the FRS feature space. We figure out that individuals sharing the same attribute (e.g., gender or race) form an attributed subsphere. By utilizing such subspheres, our method achieves both non-adaptiveness and a remarkably small number of queries. This eliminates the need for relying on transferability and open-source surrogate models, which have been a typical strategy when repeated adaptive queries to commercial FRSs are impossible. Despite requiring only a single non-adaptive query consisting of 100 face images, our method achieves a high success rate of over 93% against AWS's CompareFaces API at its default threshold. 
Furthermore, unlike many existing attacks that perturb a given image, our method can deliberately produce adversarial faces that impersonate the target identity while exhibiting high-level attributes chosen by the adversary.
[99] LidarPainter: One-Step Away From Any Lidar View To Novel Guidance
Yuzhou Ji,Ke Ma,Hong Cai,Anchun Zhang,Lizhuang Ma,Xin Tan
Main category: cs.CV
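TL;DR: This paper proposes LidarPainter, an efficient one-step diffusion model for high-quality, real-time driving scene reconstruction that also supports stylized generation.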
Details
Motivation: Dynamic driving scene reconstruction is important for digital twin systems and autonomous driving simulation, but existing methods suffer from degraded backgrounds and vehicle models when the view deviates from the input trajectory, and are limited by inconsistency, deformation, and time consumption. Method: The paper proposes LidarPainter, a one-step diffusion model that recovers consistent driving views from sparse LiDAR conditions and artifact-corrupted renderings. Result: LidarPainter outperforms existing state-of-the-art methods in speed, quality, and resource efficiency; it is 7 times faster than StreetCrafter while requiring only one fifth of its GPU memory. LidarPainter also supports generating images in specific styles from text prompts such as "foggy" and "night". Conclusion: LidarPainter is a one-step diffusion model that recovers consistent driving views from sparse LiDAR conditions and artifact-corrupted renderings in real time, enabling high-fidelity lane shifts in driving scene reconstruction. Abstract: Dynamic driving scene reconstruction is of great importance in fields like digital twin systems and autonomous driving simulation. However, unacceptable degradation occurs when the view deviates from the input trajectory, leading to corrupted background and vehicle models. To improve reconstruction quality on novel trajectory, existing methods are subject to various limitations including inconsistency, deformation, and time consumption. This paper proposes LidarPainter, a one-step diffusion model that recovers consistent driving views from sparse LiDAR condition and artifact-corrupted renderings in real-time, enabling high-fidelity lane shifts in driving scene reconstruction. Extensive experiments show that LidarPainter outperforms state-of-the-art methods in speed, quality and resource efficiency, specifically 7x faster than StreetCrafter with only one fifth of GPU memory required. LidarPainter also supports stylized generation using text prompts such as "foggy" and "night", allowing for a diverse expansion of the existing asset library.
[100] Open-Vocabulary Indoor Object Grounding with 3D Hierarchical Scene Graph
Sergey Linok,Gleb Naumov
Main category: cs.CV
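TL;DR: This paper proposes OVIGo-3DHSG, a new method for open-vocabulary grounding of indoor objects. By combining a hierarchical scene graph with a large language model, it handles complex spatial queries and improves spatial contextual understanding.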
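A hierarchical scene graph of the kind summarized above can be pictured as nested floor, room, and object nodes, with inter-layer edges (room-to-object) given by containment and intra-layer edges (object-to-object) stored explicitly. The sketch below is a toy data structure with hypothetical class and field names, not the representation used by OVIGo-3DHSG; the "location" layer mentioned in the abstract is omitted for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    relations: dict = field(default_factory=dict)  # e.g. {"next to": ["sofa"]}

@dataclass
class Room:
    name: str
    objects: list = field(default_factory=list)

@dataclass
class Floor:
    level: int
    rooms: list = field(default_factory=list)

def find_objects(floors, name, room=None, relation=None, anchor=None):
    """Return (floor level, room name) for every object matching a spatial query."""
    hits = []
    for fl in floors:
        for rm in fl.rooms:
            if room is not None and rm.name != room:
                continue
            for obj in rm.objects:
                if obj.name != name:
                    continue
                if relation is not None and anchor not in obj.relations.get(relation, []):
                    continue
                hits.append((fl.level, rm.name))
    return hits

floors = [
    Floor(0, [Room("living room", [SceneObject("lamp", {"next to": ["sofa"]}),
                                   SceneObject("sofa")])]),
    Floor(1, [Room("bedroom", [SceneObject("lamp", {"next to": ["bed"]})])]),
]
print(find_objects(floors, "lamp", relation="next to", anchor="sofa"))
# [(0, 'living room')]
```

In the described system a large language model performs the multistep reasoning over the graph; here the structured query is simply passed in by hand.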
Details
Motivation: Handling queries that involve complex spatial relations more effectively requires a method that enhances spatial contextual understanding, especially in multi-floor indoor environments. Method: OVIGo-3DHSG uses sequences of RGB-D frames, a set of open-vocabulary foundation models, and sensor data processing to build a hierarchical scene graph, and combines it with a large language model for multistep reasoning to resolve complex queries involving spatial references to other objects. Result: On Habitat Matterport 3D Semantic multi-floor scenes, OVIGo-3DHSG demonstrates the semantic and geometric accuracy of its hierarchical representation and shows more efficient scene comprehension and object grounding than existing methods. Conclusion: By combining a hierarchical scene graph with a large language model, OVIGo-3DHSG achieves efficient scene comprehension and robust object grounding, showing strong potential for applications that require spatial reasoning and understanding of indoor environments. Abstract: We propose the OVIGo-3DHSG method: Open-Vocabulary Indoor Grounding of objects using 3D Hierarchical Scene Graph. OVIGo-3DHSG represents an extensive indoor environment over a Hierarchical Scene Graph derived from sequences of RGB-D frames utilizing a set of open-vocabulary foundation models and sensor data processing. The hierarchical representation explicitly models spatial relations across floors, rooms, locations, and objects. To effectively address complex queries involving spatial reference to other objects, we integrate the hierarchical scene graph with a Large Language Model for multistep reasoning. This integration leverages inter-layer (e.g., room-to-object) and intra-layer (e.g., object-to-object) connections, enhancing spatial contextual understanding. We investigate the semantic and geometric accuracy of the hierarchical representation on Habitat Matterport 3D Semantic multi-floor scenes. Our approach demonstrates efficient scene comprehension and robust object grounding compared to existing methods. Overall, OVIGo-3DHSG demonstrates strong potential for applications requiring spatial reasoning and understanding of indoor environments. Related materials can be found at https://github.com/linukc/OVIGo-3DHSG.
[101] Block-based Symmetric Pruning and Fusion for Efficient Vision Transformers
Yi-Kuan Hsieh,Jun-Wei Hsieh,Xin Li,Yu-Ming Chang,Yu-Chee Tseng
Main category: cs.CV
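TL;DR: This paper proposes a new block-based symmetric pruning and fusion method (BSPF-ViT) to improve the efficiency of Vision Transformers (ViT); it jointly optimizes the pruning of query and key tokens by accounting for token interactions, reducing computational cost while maintaining high accuracy.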
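The symmetry mentioned in the abstract below follows from sharing the Q and K projection: if Q = K = XW, then the score matrix QK^T equals its own transpose, so only the upper triangle needs to be computed. The sketch below checks this numerically; the shapes and random weights are arbitrary, and it illustrates only the symmetry argument, not the paper's pruning or fusion steps.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, dim = 8, 16
x = rng.normal(size=(n_tokens, dim))
w_shared = rng.normal(size=(dim, dim))   # one projection used for both Q and K

q = k = x @ w_shared
scores = q @ k.T / np.sqrt(dim)
assert np.allclose(scores, scores.T)     # symmetric because Q == K

# Compute only the upper triangle (including the diagonal), then mirror it.
iu = np.triu_indices(n_tokens)
upper = np.zeros_like(scores)
upper[iu] = np.einsum("ij,ij->i", q[iu[0]], k[iu[1]]) / np.sqrt(dim)
mirrored = upper + np.triu(upper, 1).T
assert np.allclose(mirrored, scores)
```

For n tokens this evaluates n(n+1)/2 dot products instead of n^2, which is the source of the saving that a symmetric attention matrix makes possible.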
Details
Motivation: The high computational cost of Vision Transformer (ViT) limits its practical application, and current pruning methods often sacrifice accuracy because they prune query and key tokens independently. Method: The paper introduces block-based symmetric pruning and fusion (BSPF-ViT), which jointly considers the importance of each token and its neighbors and compresses the retained tokens through a similarity fusion step. Result: BSPF-ViT consistently outperforms state-of-the-art ViT methods at all pruning levels, improving accuracy by 1.3% on DeiT-T and 2.0% on DeiT-S while reducing computational overhead by 50%. Conclusion: By jointly optimizing the pruning of query and key tokens, BSPF-ViT reduces computational cost while maintaining high accuracy and significantly outperforms existing ViT methods. Abstract: Vision Transformer (ViT) has achieved impressive results across various vision tasks, yet its high computational cost limits practical applications. Recent methods have aimed to reduce ViT's $O(n^2)$ complexity by pruning unimportant tokens. However, these techniques often sacrifice accuracy by independently pruning query (Q) and key (K) tokens, leading to performance degradation due to overlooked token interactions. To address this limitation, we introduce a novel Block-based Symmetric Pruning and Fusion for efficient ViT (BSPF-ViT) that optimizes the pruning of Q/K tokens jointly. Unlike previous methods that consider only a single direction, our approach evaluates each token and its neighbors to decide which tokens to retain by taking token interaction into account. The retained tokens are compressed through a similarity fusion step, preserving key information while reducing computational costs. The shared weights of Q/K tokens create a symmetric attention matrix, allowing pruning only the upper triangular part for speed up. BSPF-ViT consistently outperforms state-of-the-art ViT methods at all pruning levels, increasing ImageNet classification accuracy by 1.3% on DeiT-T and 2.0% on DeiT-S, while reducing computational overhead by 50%. It achieves 40% speedup with improved accuracy across various ViTs.
[102] Learning Pixel-adaptive Multi-layer Perceptrons for Real-time Image Enhancement
Junyu Lou,Xiaorui Zhao,Kexuan Shi,Shuhang Gu
Main category: cs.CV
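TL;DR: This paper proposes an image enhancement method that combines bilateral grids with multi-layer perceptrons; by dynamically generating pixel-level non-linear transformation parameters, it addresses the weaknesses of existing methods in modeling complex color relationships and local variations.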
Details
Motivation: Existing bilateral-grid-based methods are restricted to linear affine transformations and struggle to model complex color relationships, while traditional MLP methods use globally shared parameters and struggle with local variations. Method: The paper proposes a Bilateral Grid-based Pixel-Adaptive Multi-layer Perceptron (BPAM) framework that combines the spatial modeling of bilateral grids with the non-linear mapping capability of MLPs, and uses a grid decomposition strategy and multi-channel guidance maps to improve parameter generation and the use of color information. Result: Extensive experiments on public datasets show that the method outperforms existing state-of-the-art methods while maintaining real-time processing capability. Conclusion: The experiments indicate that the proposed BPAM framework outperforms prior methods on public datasets while retaining real-time processing capability. Abstract: Deep learning-based bilateral grid processing has emerged as a promising solution for image enhancement, inherently encoding spatial and intensity information while enabling efficient full-resolution processing through slicing operations. However, existing approaches are limited to linear affine transformations, hindering their ability to model complex color relationships. Meanwhile, while multi-layer perceptrons (MLPs) excel at non-linear mappings, traditional MLP-based methods employ globally shared parameters, which makes it hard to deal with localized variations. To overcome these dual challenges, we propose a Bilateral Grid-based Pixel-Adaptive Multi-layer Perceptron (BPAM) framework. Our approach synergizes the spatial modeling of bilateral grids with the non-linear capabilities of MLPs. Specifically, we generate bilateral grids containing MLP parameters, where each pixel dynamically retrieves its unique transformation parameters and obtains a distinct MLP for color mapping based on spatial coordinates and intensity values. In addition, we propose a novel grid decomposition strategy that categorizes MLP parameters into distinct types stored in separate subgrids. Multi-channel guidance maps are used to extract category-specific parameters from corresponding subgrids, ensuring effective utilization of color information during slicing while guiding precise parameter generation. Extensive experiments on public datasets demonstrate that our method outperforms state-of-the-art methods in performance while maintaining real-time processing capabilities.
[103] AD-GS: Object-Aware B-Spline Gaussian Splatting for Self-Supervised Autonomous Driving
Jiawei Xu,Kai Deng,Zexin Fan,Shenlong Wang,Jin Xie,Jian Yang
Main category: cs.CV
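TL;DR: AD-GS is an annotation-free framework for high-quality free-viewpoint rendering of driving scenes; it models dynamic objects precisely through a novel learnable motion model and automatic scene segmentation.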
Details
Motivation: Current high-quality methods for modeling and rendering dynamic urban driving scenes typically rely on costly manual object tracklet annotations, while self-supervised approaches fail to capture dynamic object motion accurately or decompose scenes properly, which leads to rendering artifacts. Method: AD-GS uses a learnable motion model that combines locality-aware B-spline curves with global-aware trigonometric functions, automatically segments scenes into objects and background using simplified pseudo 2D segmentation, represents objects with dynamic Gaussians and bidirectional temporal visibility masks, and adds visibility reasoning and physically rigid regularization to improve robustness. Result: In extensive evaluations, AD-GS significantly outperforms current state-of-the-art annotation-free methods and is competitive with annotation-dependent methods. Conclusion: AD-GS is a novel self-supervised framework that achieves high-quality free-viewpoint rendering of driving scenes without comprehensive semantic labeling; it outperforms existing annotation-free methods and is competitive with annotation-dependent ones. Abstract: Modeling and rendering dynamic urban driving scenes is crucial for self-driving simulation. Current high-quality methods typically rely on costly manual object tracklet annotations, while self-supervised approaches fail to capture dynamic object motions accurately and decompose scenes properly, resulting in rendering artifacts. We introduce AD-GS, a novel self-supervised framework for high-quality free-viewpoint rendering of driving scenes from a single log. At its core is a novel learnable motion model that integrates locality-aware B-spline curves with global-aware trigonometric functions, enabling flexible yet precise dynamic object modeling. Rather than requiring comprehensive semantic labeling, AD-GS automatically segments scenes into objects and background with the simplified pseudo 2D segmentation, representing objects using dynamic Gaussians and bidirectional temporal visibility masks. Further, our model incorporates visibility reasoning and physically rigid regularization to enhance robustness. Extensive evaluations demonstrate that our annotation-free model significantly outperforms current state-of-the-art annotation-free methods and is competitive with annotation-dependent approaches.
[104] Neural Human Pose Prior
Michal Heker,Sefy Kararlitsky,David Tolpin
Main category: cs.CV
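TL;DR: This paper proposes a normalizing-flow-based approach to modeling human body poses; it learns a flexible density over poses in the 6D rotation format to provide a strong pose prior.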
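For readers unfamiliar with the 6D rotation format this entry refers to: a 6D vector is read as two 3D vectors, and Gram-Schmidt orthonormalization turns them into the first two columns of a rotation matrix, with the third column given by a cross product. A minimal sketch in plain Python follows. The function names are hypothetical, and the reverse map shown (keeping the first two columns) is only one simple reading of "inverting" the process; the abstract does not spell out the authors' exact procedure.

```python
import math


def dot(a, b):
    """Dot product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross(a, b):
    """Cross product of two 3D vectors."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def six_d_to_columns(six_d):
    """Gram-Schmidt: map a 6D vector to the three columns of a rotation matrix."""
    a1, a2 = six_d[:3], six_d[3:]
    b1 = normalize(a1)
    overlap = dot(b1, a2)
    # Remove the component of a2 along b1, then normalize what is left.
    b2 = normalize([y - overlap * x for x, y in zip(b1, a2)])
    b3 = cross(b1, b2)
    return [b1, b2, b3]


def columns_to_six_d(columns):
    """Assumed reverse map: keep the first two columns as a valid 6D vector."""
    return list(columns[0]) + list(columns[1])


# Hypothetical, non-orthonormal 6D input.
columns = six_d_to_columns([1.0, 1.0, 0.0, 0.0, 2.0, 0.0])
round_trip = six_d_to_columns(columns_to_six_d(columns))
```

Any 6D vector whose two halves are not parallel maps to a valid rotation, while a 6D vector taken from an existing rotation maps back to that same rotation, which is why the format is convenient for learned densities.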
Details
Motivation: To overcome the limitations of existing approaches (heuristic or low-expressivity), a principled, data-driven way of modeling human body poses is needed. Method: RealNVP is used to learn a flexible density over poses represented in the 6D rotation format, and the Gram-Schmidt process is inverted to address the challenge of modeling distributions on the manifold of valid 6D rotations. Result: The paper demonstrates the effectiveness of the learned prior through qualitative and quantitative evaluations and analyzes its impact via ablation studies. Conclusion: The paper proposes a neural prior based on normalizing flows for modeling human body poses, providing a sound probabilistic foundation for integrating pose priors into human motion capture and reconstruction pipelines. Abstract: We introduce a principled, data-driven approach for modeling a neural prior over human body poses using normalizing flows. Unlike heuristic or low-expressivity alternatives, our method leverages RealNVP to learn a flexible density over poses represented in the 6D rotation format. We address the challenge of modeling distributions on the manifold of valid 6D rotations by inverting the Gram-Schmidt process during training, enabling stable learning while preserving downstream compatibility with rotation-based frameworks. Our architecture and training pipeline are framework-agnostic and easily reproducible. We demonstrate the effectiveness of the learned prior through both qualitative and quantitative evaluations, and we analyze its impact via ablation studies. This work provides a sound probabilistic foundation for integrating pose priors into human motion capture and reconstruction pipelines.
[105] Fine-Grained Image Recognition from Scratch with Teacher-Guided Data Augmentation
Edwin Arkel Rios,Fernando Mikael,Oswin Gosal,Femiloye Oyerinde,Hao-Chun Liang,Bo-Cheng Lai,Min-Chun Hu
Main category: cs.CV
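TL;DR: This paper challenges the conventional reliance on pretrained models: by introducing the TGDA framework, it trains high-performance fine-grained image recognition systems from scratch.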
Details
Motivation: Existing FGIR methods rely on backbones pretrained on large-scale datasets such as ImageNet; this dependence limits adaptability to resource-constrained environments and hinders the development of task-specific architectures tailored to the unique challenges of FGIR. Method: The paper introduces TGDA, a new training framework that combines data-aware augmentation with weak supervision through knowledge distillation. Result: Extensive experiments on three FGIR benchmarks show that the method consistently matches or surpasses state-of-the-art pretrained models across settings with low-resolution and high-resolution inputs. Conclusion: TGDA offers an adaptable alternative to pretraining for fine-grained image recognition, making it possible to design task-specific and hardware-aware architectures and thereby advancing efficient fine-grained vision systems. Abstract: Fine-grained image recognition (FGIR) aims to distinguish visually similar sub-categories within a broader class, such as identifying bird species. While most existing FGIR methods rely on backbones pretrained on large-scale datasets like ImageNet, this dependence limits adaptability to resource-constrained environments and hinders the development of task-specific architectures tailored to the unique challenges of FGIR. In this work, we challenge the conventional reliance on pretrained models by demonstrating that high-performance FGIR systems can be trained entirely from scratch. We introduce a novel training framework, TGDA, that integrates data-aware augmentation with weak supervision via a fine-grained-aware teacher model, implemented through knowledge distillation. This framework unlocks the design of task-specific and hardware-aware architectures, including LRNets for low-resolution FGIR and ViTFS, a family of Vision Transformers optimized for efficient inference. Extensive experiments across three FGIR benchmarks over diverse settings involving low-resolution and high-resolution inputs show that our method consistently matches or surpasses state-of-the-art pretrained counterparts. In particular, in the low-resolution setting, LRNets trained with TGDA improve accuracy by up to 23% over prior methods while requiring up to 20.6x fewer parameters, lower FLOPs, and significantly less training data. Similarly, ViTFS-T can match the performance of a ViT B-16 pretrained on ImageNet-21k while using 15.3x fewer trainable parameters and requiring orders of magnitude less data. 
These results highlight TGDA's potential as an adaptable alternative to pretraining, paving the way for more efficient fine-grained vision systems.
[106] Hybrid Ensemble Approaches: Optimal Deep Feature Fusion and Hyperparameter-Tuned Classifier Ensembling for Enhanced Brain Tumor Classification
Zahid Ullah,Dragan Pamucar,Jihie Kim
Main category: cs.CV
TL;DR: This study proposes a novel double ensembling framework using deep learning and machine learning models to improve the accuracy of brain tumor detection in MRI scans, outperforming current methods.
Details
Motivation: The motivation stems from the limitations of human specialists in accurately diagnosing tumors from MRI images due to factors like fatigue, limited expertise, and insufficient image detail. This can lead to missed or misidentified tumors, necessitating a more reliable and precise diagnostic approach. Method: The study introduces a double ensembling framework that uses pre-trained deep convolutional neural networks and vision transformer networks for feature extraction, followed by fine-tuned machine learning classifiers. The method includes preprocessing, data augmentation, transfer learning, and hyperparameter tuning. Feature fusion and classifier fusion techniques are applied to enhance classification accuracy. Result: The proposed framework outperforms the state-of-the-art methods in brain tumor classification. The results show that feature fusion and classifier fusion, along with hyperparameter fine-tuning, significantly enhance diagnostic accuracy. An ablation study also demonstrates the contribution of each component to the overall performance. Conclusion: The proposed double ensembling framework, which combines pre-trained deep learning models for feature extraction and fine-tuned machine learning models for classification, significantly improves the accuracy of brain tumor detection in MRI scans compared to existing methods. Abstract: Magnetic Resonance Imaging (MRI) is widely recognized as the most reliable tool for detecting tumors due to its capability to produce detailed images that reveal their presence. However, the accuracy of diagnosis can be compromised when human specialists evaluate these images. Factors such as fatigue, limited expertise, and insufficient image detail can lead to errors. For example, small tumors might go unnoticed, or overlap with healthy brain regions could result in misidentification. 
To address these challenges and enhance diagnostic precision, this study proposes a novel double ensembling framework, consisting of ensembled pre-trained deep learning (DL) models for feature extraction and ensembled fine-tuned hyperparameter machine learning (ML) models to efficiently classify brain tumors. Specifically, our method includes extensive preprocessing and augmentation, transfer learning concepts by utilizing various pre-trained deep convolutional neural networks and vision transformer networks to extract deep features from brain MRI, and fine-tune hyperparameters of ML classifiers. Our experiments utilized three different publicly available Kaggle MRI brain tumor datasets to evaluate the pre-trained DL feature extractor models, ML classifiers, and the effectiveness of an ensemble of deep features along with an ensemble of ML classifiers for brain tumor classification. Our results indicate that the proposed feature fusion and classifier fusion improve upon the state of the art, with hyperparameter fine-tuning providing a significant enhancement over the ensemble method. Additionally, we present an ablation study to illustrate how each component contributes to accurate brain tumor classification.
[107] Wavelet-based Decoupling Framework for low-light Stereo Image Enhancement
Shuangli Du,Siming Yan,Zhenghao Shi,Zhenzhen You,Lu Sun
Main category: cs.CV
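TL;DR: This paper proposes a new low-light stereo image enhancement method that uses the wavelet transform to decouple the feature space and handles illumination and texture with dedicated modules, improving enhancement quality.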
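The abstract of this entry rests on the observation that a wavelet transform separates low-frequency content (illumination) from high-frequency content (texture), so brightness can be adjusted in the low-frequency band alone. A minimal sketch of that idea in plain Python follows, using a single-level 2D Haar transform in its averaging form. The toy image, the `gain` value, and the function names are illustrative assumptions; the paper's network operates on learned features and is not reproduced here.

```python
def haar_decompose(image):
    """One-level 2D Haar transform (averaging form) of an even-sized image."""
    low, detail_a, detail_b, detail_c = [], [], [], []
    for r in range(0, len(image), 2):
        rows = ([], [], [], [])
        for c in range(0, len(image[0]), 2):
            p, q = image[r][c], image[r][c + 1]
            s, t = image[r + 1][c], image[r + 1][c + 1]
            rows[0].append((p + q + s + t) / 4)  # block mean (low frequency)
            rows[1].append((p - q + s - t) / 4)  # left/right difference
            rows[2].append((p + q - s - t) / 4)  # top/bottom difference
            rows[3].append((p - q - s + t) / 4)  # diagonal difference
        low.append(rows[0])
        detail_a.append(rows[1])
        detail_b.append(rows[2])
        detail_c.append(rows[3])
    return low, detail_a, detail_b, detail_c


def haar_reconstruct(low, detail_a, detail_b, detail_c):
    """Invert haar_decompose exactly."""
    height, width = len(low), len(low[0])
    image = [[0.0] * (2 * width) for _ in range(2 * height)]
    for i in range(height):
        for j in range(width):
            m, a, b, c = low[i][j], detail_a[i][j], detail_b[i][j], detail_c[i][j]
            image[2 * i][2 * j] = m + a + b + c
            image[2 * i][2 * j + 1] = m - a + b - c
            image[2 * i + 1][2 * j] = m + a - b - c
            image[2 * i + 1][2 * j + 1] = m - a - b + c
    return image


# Hypothetical dark 4x4 grayscale image with values in [0, 1].
dark = [
    [0.125, 0.25, 0.125, 0.125],
    [0.125, 0.25, 0.375, 0.125],
    [0.0, 0.0, 0.25, 0.25],
    [0.0, 0.0, 0.25, 0.25],
]
low, detail_a, detail_b, detail_c = haar_decompose(dark)

gain = 2.0  # hypothetical illumination gain applied to the low band only
bright_low = [[gain * value for value in row] for row in low]
bright = haar_reconstruct(bright_low, detail_a, detail_b, detail_c)
restored = haar_reconstruct(low, detail_a, detail_b, detail_c)
```

Scaling only the low band multiplies the mean brightness by `gain` while leaving pixel-to-pixel differences inside each 2x2 block unchanged, which is the sense in which illumination and texture are handled separately.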
Details
Motivation: Existing low-light image enhancement methods usually encode all degradation factors in a single latent space, which entangles features and makes models prone to shortcut learning. To address this, the paper proposes decoupling the feature space. Method: The paper uses the wavelet transform to decompose the feature space into a low-frequency branch for illumination adjustment and high-frequency branches for texture enhancement, and proposes a high-frequency guided cross-view interaction module (HF-CIM) and a detail and texture enhancement module (DTEM) that operate on the high-frequency information. Result: Experiments show that the proposed algorithm offers significant advantages in illumination adjustment and high-frequency information recovery; the code and dataset are publicly available. Conclusion: The paper proposes a wavelet-based low-light stereo image enhancement method that decouples the feature space to process low-frequency and high-frequency information separately, improving illumination adjustment and texture enhancement. Abstract: Low-light images suffer from complex degradation, and existing enhancement methods often encode all degradation factors within a single latent space. This leads to highly entangled features and strong black-box characteristics, making the model prone to shortcut learning. To mitigate the above issues, this paper proposes a wavelet-based low-light stereo image enhancement method with feature space decoupling. Our insight comes from the following findings: (1) Wavelet transform enables the independent processing of low-frequency and high-frequency information. (2) Illumination adjustment can be achieved by adjusting the low-frequency component of a low-light image, extracted through multi-level wavelet decomposition. Thus, by using wavelet transform the feature space is decomposed into a low-frequency branch for illumination adjustment and multiple high-frequency branches for texture enhancement. Additionally, stereo low-light image enhancement can extract useful cues from another view to improve enhancement. To this end, we propose a novel high-frequency guided cross-view interaction module (HF-CIM) that operates within high-frequency branches rather than across the entire feature space, effectively extracting valuable image details from the other view. Furthermore, to enhance the high-frequency information, a detail and texture enhancement module (DTEM) is proposed based on cross-attention mechanism. The model is trained on a dataset consisting of images with uniform illumination and images with non-uniform illumination. 
Experimental results on both real and synthetic images indicate that our algorithm offers significant advantages in light adjustment while effectively recovering high-frequency information. The code and dataset are publicly available at: https://github.com/Cherisherr/WDCI-Net.git.
[108] Revealing the Ancient Beauty: Digital Reconstruction of Temple Tiles using Computer Vision
Arkaprabha Basu
Main category: cs.CV
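TL;DR: This study proposes three new machine learning and computer vision techniques for cultural heritage preservation, improving the efficiency and aesthetic quality of Indian monument conservation.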
Details
Motivation: Indian monuments are renowned for their architectural skill and aesthetic appeal and call for specialized preservation techniques and methods. Method: The study proposes three techniques: the Fractal Convolution methodology, the Self-Sensitive Tile Filling (SSTF) method, and a Super Resolution strategy, combined with the MosaicSlice data augmentation technique. Result: Seamless region-filling and highly detailed tile techniques were developed that maintain authenticity at affordable cost and improve image quality. Conclusion: By introducing automation and a novel data augmentation strategy, the study improves efficiency and aesthetic quality in cultural heritage preservation while maintaining the delicate balance between tradition and innovation. Abstract: Modern digitised approaches have dramatically changed the preservation and restoration of cultural treasures, integrating computer scientists into multidisciplinary projects with ease. Machine learning, deep learning, and computer vision techniques have revolutionised developing sectors like 3D reconstruction, picture inpainting, IoT-based methods, genetic algorithms, and image processing with the integration of computer scientists into multidisciplinary initiatives. We suggest three cutting-edge techniques in recognition of the special qualities of Indian monuments, which are famous for their architectural skill and aesthetic appeal. First is the Fractal Convolution methodology, a segmentation method based on image processing that successfully reveals subtle architectural patterns within these irreplaceable cultural buildings. The second is a revolutionary Self-Sensitive Tile Filling (SSTF) method created especially for West Bengal's mesmerising Bankura Terracotta Temples with a brand-new data augmentation method called MosaicSlice on the third. Furthermore, we delve deeper into the Super Resolution strategy to upscale the images without losing a significant amount of quality. Our methods allow for the development of seamless region-filling and highly detailed tiles while maintaining authenticity using a novel data augmentation strategy within affordable costs introducing automation. By providing effective solutions that preserve the delicate balance between tradition and innovation, this study improves the subject and eventually ensures unrivalled efficiency and aesthetic excellence in cultural heritage protection. 
The suggested approaches advance the field into an era of unmatched efficiency and aesthetic quality while carefully upholding the delicate equilibrium between tradition and innovation.
[109] RODS: Robust Optimization Inspired Diffusion Sampling for Detecting and Reducing Hallucination in Generative Models
Yiqi Tian,Pengfei Jin,Mingze Yuan,Na Li,Bo Zeng,Quanzheng Li
Main category: cs.CV
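TL;DR: RODS improves the sampling process of diffusion models from an optimization perspective, effectively reducing hallucinations and improving generation quality and stability.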
Details
Motivation: The sampling process of diffusion models is prone to hallucinations that harm the accuracy of generated results, so a solution is needed that requires no retraining and has low computational cost. Method: RODS uses geometric cues from the loss landscape to detect and correct high-risk sampling steps, enforcing smoother sampling trajectories and adaptively adjusting perturbations. Result: Validated on AFHQv2, FFHQ, and 11k-hands, RODS detects over 70% of hallucinated samples and corrects more than 25% without introducing new artifacts. Conclusion: RODS is a new diffusion sampler that reduces hallucinations during sampling from an optimization perspective, improving sampling fidelity and robustness. Abstract: Diffusion models have achieved state-of-the-art performance in generative modeling, yet their sampling procedures remain vulnerable to hallucinations, often stemming from inaccuracies in score approximation. In this work, we reinterpret diffusion sampling through the lens of optimization and introduce RODS (Robust Optimization-inspired Diffusion Sampler), a novel method that detects and corrects high-risk sampling steps using geometric cues from the loss landscape. RODS enforces smoother sampling trajectories and adaptively adjusts perturbations, reducing hallucinations without retraining and at minimal additional inference cost. Experiments on AFHQv2, FFHQ, and 11k-hands demonstrate that RODS improves both sampling fidelity and robustness, detecting over 70% of hallucinated samples and correcting more than 25%, all while avoiding the introduction of new artifacts.
[110] MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM
Tao Chen,Jingyi Zhang,Decheng Liu,Chunlei Peng
Main category: cs.CV
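TL;DR: This paper proposes MGFFD-VLM, a new framework that combines an attribute-driven hybrid LoRA strategy with multi-granularity prompt learning to improve the accuracy and interpretability of forgery detection.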
Details
Motivation: Existing forgery detection methods based on visual large language models do not fully exploit face quality-related attributes and lack effective training strategies. Method: The method integrates an attribute-driven hybrid LoRA strategy, multi-granularity prompt learning, and a forgery-aware training strategy, and designs several forgery-related auxiliary losses. Result: Experiments show that the proposed method outperforms existing methods in both text-based forgery judgment and analysis, achieving higher accuracy. Conclusion: The paper proposes MGFFD-VLM, a new forgery detection framework, and extends the VQA dataset to DD-VQA+ to improve the accuracy and interpretability of forgery classification and analysis. Abstract: Recent studies have utilized visual large language models (VLMs) to answer not only "Is this face a forgery?" but also "Why is the face a forgery?" These studies introduced forgery-related attributes, such as forgery location and type, to construct deepfake VQA datasets and train VLMs, achieving high accuracy while providing human-understandable explanatory text descriptions. However, these methods still have limitations. For example, they do not fully leverage face quality-related attributes, which are often abnormal in forged faces, and they lack effective training strategies for forgery-aware VLMs. In this paper, we extend the VQA dataset to create DD-VQA+, which features a richer set of attributes and a more diverse range of samples. Furthermore, we introduce a novel forgery detection framework, MGFFD-VLM, which integrates an Attribute-Driven Hybrid LoRA Strategy to enhance the capabilities of Visual Large Language Models (VLMs). Additionally, our framework incorporates Multi-Granularity Prompt Learning and a Forgery-Aware Training Strategy. By transforming classification and forgery segmentation results into prompts, our method not only improves forgery classification but also enhances interpretability. To further boost detection performance, we design multiple forgery-related auxiliary losses. Experimental results demonstrate that our approach surpasses existing methods in both text-based forgery judgment and analysis, achieving superior accuracy.
[111] Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models
Felix Nützel,Mischa Dombrowski,Bernhard Kainz
Main category: cs.CV
TL;DR: This paper proposes a new approach based on generative models and Bimodal Bias Merging that achieves more accurate phrase grounding in medical images.
Details
Motivation: Current state-of-the-art methods rely on discriminative, self-supervised contrastive models, while the potential of generative models for zero-shot phrase grounding remains underexplored. Method: The approach uses a generative text-to-image diffusion model combined with the CXR-BERT language model, and introduces a new post-processing technique, Bimodal Bias Merging (BBM). Result: The generative model achieves mIoU scores double those of current discriminative methods, and BBM improves performance further. Conclusion: Generative models perform strongly on phrase grounding in medical images, and the BBM post-processing technique further improves localization accuracy, paving the way for more robust and interpretable applications in clinical practice. Abstract: Phrase grounding, i.e., mapping natural language phrases to specific image regions, holds significant potential for disease localization in medical imaging through clinical reports. While current state-of-the-art methods rely on discriminative, self-supervised contrastive models, we demonstrate that generative text-to-image diffusion models, leveraging cross-attention maps, can achieve superior zero-shot phrase grounding performance. Contrary to prior assumptions, we show that fine-tuning diffusion models with a frozen, domain-specific language model, such as CXR-BERT, substantially outperforms domain-agnostic counterparts. This setup achieves remarkable improvements, with mIoU scores doubling those of current discriminative methods. These findings highlight the underexplored potential of generative models for phrase grounding tasks. To further enhance performance, we introduce Bimodal Bias Merging (BBM), a novel post-processing technique that aligns text and image biases to identify regions of high certainty. BBM refines cross-attention maps, achieving even greater localization accuracy. Our results establish generative approaches as a more effective paradigm for phrase grounding in the medical imaging domain, paving the way for more robust and interpretable applications in clinical practice. The source code and model weights are available at https://github.com/Felix-012/generate_to_ground.
[112] Calisthenics Skills Temporal Video Segmentation
Antonio Finocchiaro,Giovanni Maria Farinella,Antonino Furnari
Main category: cs.CV
TL;DR: This paper introduces a dataset and baseline method for automated segmentation of static calisthenics skills in videos, aiming to aid athletes and judges.
Details
Motivation: The motivation stems from the need for automated tools in calisthenics to recognize and evaluate isometric skills from videos, which can assist athletes and competition judges. Method: The researchers proposed a dataset of video footage of static calisthenics skills, annotated with temporal segmentation, and applied a baseline approach to address the problem of skill temporal segmentation. Result: The results demonstrate the feasibility of the proposed approach for temporal segmentation of calisthenics skills but indicate potential for enhancement. Conclusion: The study concludes that the proposed dataset and baseline approach provide an initial step towards the automation of calisthenics skill evaluation, with room for further improvement. Abstract: Calisthenics is a fast-growing bodyweight discipline that consists of different categories, one of which is focused on skills. Skills in calisthenics encompass both static and dynamic elements performed by athletes. The evaluation of static skills is based on their difficulty level and the duration of the hold. Automated tools able to recognize isometric skills from a video by segmenting them to estimate their duration would be desirable to assist athletes in their training and judges during competitions. Although the video understanding literature on action recognition through body pose analysis is rich, no previous work has specifically addressed the problem of calisthenics skill temporal video segmentation. This study aims to provide an initial step towards the implementation of automated tools within the field of Calisthenics. To advance knowledge in this context, we propose a dataset of video footage of static calisthenics skills performed by athletes. Each video is annotated with a temporal segmentation which determines the extent of each skill. We hence report the results of a baseline approach to address the problem of skill temporal segmentation on the proposed dataset. 
The results highlight the feasibility of the proposed problem, while there is still room for improvement.
[113] Comparative Analysis of CNN Performance in Keras, PyTorch and JAX on PathMNIST
Anida Nezović,Jalal Romano,Nada Marić,Medina Kapo,Amila Akagić
Main category: cs.CV
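TL;DR: This paper studies how different deep learning frameworks perform on medical image classification, focusing on training efficiency, classification accuracy, and inference speed for convolutional neural network (CNN) implementations on the PathMNIST dataset.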
Details
Motivation: Although deep learning frameworks each offer distinct advantages in model development and deployment, their comparative performance on medical imaging tasks remains underexplored. Method: Using the PathMNIST dataset as a benchmark, the study analyzes CNN implementations across Keras, PyTorch and JAX and evaluates training efficiency, classification accuracy, and inference speed. Result: The findings reveal trade-offs between computational speed and model accuracy across frameworks and offer insights applicable to real-world use. Conclusion: The study highlights the trade-off between computational speed and model accuracy, offering valuable insights for researchers and practitioners in medical image analysis. Abstract: Deep learning has significantly advanced the field of medical image classification, particularly with the adoption of Convolutional Neural Networks (CNNs). Various deep learning frameworks such as Keras, PyTorch and JAX offer unique advantages in model development and deployment. However, their comparative performance in medical imaging tasks remains underexplored. This study presents a comprehensive analysis of CNN implementations across these frameworks, using the PathMNIST dataset as a benchmark. We evaluate training efficiency, classification accuracy and inference speed to assess their suitability for real-world applications. Our findings highlight the trade-offs between computational speed and model accuracy, offering valuable insights for researchers and practitioners in medical image analysis.
[114] Site-Level Fine-Tuning with Progressive Layer Freezing: Towards Robust Prediction of Bronchopulmonary Dysplasia from Day-1 Chest Radiographs in Extremely Preterm Infants
Sybelle Goedicke-Fritz,Michelle Bous,Annika Engel,Matthias Flotho,Pascal Hirsch,Hannah Wittig,Dino Milanovic,Dominik Mohr,Mathias Kaspar,Sogand Nemat,Dorothea Kerner,Arno Bücker,Andreas Keller,Sascha Meyer,Michael Zemlin,Philipp Flotho
Main category: cs.CV
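TL;DR: This study develops a deep learning model based on day-1 chest X-rays that, using domain-specific pretraining and progressive layer freezing, predicts bronchopulmonary dysplasia (BPD) outcomes in extremely low birth weight infants.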
Details
Motivation: Bronchopulmonary dysplasia (BPD) is a chronic lung disease that affects 35% of extremely low birth weight infants and causes lifelong respiratory complications. Current preventive interventions carry severe risks, so early prognosis and prediction of BPD outcome are crucial to avoid unnecessary toxicity in low-risk infants. Method: The authors fine-tuned a ResNet-50 pretrained specifically on adult chest radiographs, using progressive layer freezing and discriminative learning rates to prevent overfitting, and evaluated CutMix augmentation and linear probing. Result: For moderate/severe BPD outcome prediction, the best model, which combines progressive freezing, linear probing and CutMix, achieved an AUROC of 0.78 ± 0.10, a balanced accuracy of 0.69 ± 0.10, and an F1-score of 0.67 ± 0.11. In-domain pretraining significantly outperformed ImageNet initialization (p = 0.031). Conclusion: With domain-specific pretraining, BPD outcomes can be predicted from routine day-1 radiographs. Through progressive freezing and linear probing, the method remains computationally feasible for site-level implementation and future federated learning deployments. Abstract: Bronchopulmonary dysplasia (BPD) is a chronic lung disease affecting 35% of extremely low birth weight infants. Defined by oxygen dependence at 36 weeks postmenstrual age, it causes lifelong respiratory complications. However, preventive interventions carry severe risks, including neurodevelopmental impairment, ventilator-induced lung injury, and systemic complications. Therefore, early BPD prognosis and prediction of BPD outcome is crucial to avoid unnecessary toxicity in low risk infants. Admission radiographs of extremely preterm infants are routinely acquired within 24h of life and could serve as a non-invasive prognostic tool. In this work, we developed and investigated a deep learning approach using chest X-rays from 163 extremely low-birth-weight infants (≤32 weeks gestation, 401-999g) obtained within 24 hours of birth. We fine-tuned a ResNet-50 pretrained specifically on adult chest radiographs, employing progressive layer freezing with discriminative learning rates to prevent overfitting and evaluated a CutMix augmentation and linear probing. For moderate/severe BPD outcome prediction, our best performing model with progressive freezing, linear probing and CutMix achieved an AUROC of 0.78 ± 0.10, balanced accuracy of 0.69 ± 0.10, and an F1-score of 0.67 ± 0.11. In-domain pre-training significantly outperformed ImageNet initialization (p = 0.031) which confirms domain-specific pretraining to be important for BPD outcome prediction. 
Routine IRDS grades showed limited prognostic value (AUROC 0.57 ± 0.11), confirming the need of learned markers. Our approach demonstrates that domain-specific pretraining enables accurate BPD prediction from routine day-1 radiographs. Through progressive freezing and linear probing, the method remains computationally feasible for site-level implementation and future federated learning deployments.
[115] FADE: Adversarial Concept Erasure in Flow Models
Zixuan Fu,Yan Ren,Finn Carter,Chenyue Wang,Ze Niu,Dacheng Yu,Emily Davis,Bo Zhang
Main category: cs.CV
TL;DR: FADE is a novel method for removing sensitive or harmful concepts in text-to-image diffusion models, offering strong privacy and fairness guarantees while maintaining high image quality.
Details
Motivation: Diffusion models pose privacy and fairness risks by memorizing sensitive concepts or perpetuating biases, necessitating a method to remove such concepts. Method: FADE combines trajectory-aware fine-tuning and an adversarial objective to erase specified concepts while preserving model performance. Result: FADE achieves state-of-the-art concept removal performance, improving the harmonic mean of removal and fidelity by 5-10% over existing methods like ESD, UCE, MACE, and ANT. Conclusion: FADE provides a novel concept erasure method for text-to-image diffusion models, ensuring privacy, fairness, and model fidelity without retraining from scratch. Abstract: Diffusion models have demonstrated remarkable image generation capabilities, but also pose risks in privacy and fairness by memorizing sensitive concepts or perpetuating biases. We propose a novel concept erasure method for text-to-image diffusion models, designed to remove specified concepts (e.g., a private individual or a harmful stereotype) from the model's generative repertoire. Our method, termed FADE (Fair Adversarial Diffusion Erasure), combines a trajectory-aware fine-tuning strategy with an adversarial objective to ensure the concept is reliably removed while preserving overall model fidelity. Theoretically, we prove a formal guarantee that our approach minimizes the mutual information between the erased concept and the model's outputs, ensuring privacy and fairness. Empirically, we evaluate FADE on Stable Diffusion and FLUX, using benchmarks from prior work (e.g., object, celebrity, explicit content, and style erasure tasks from MACE). FADE achieves state-of-the-art concept removal performance, surpassing recent baselines like ESD, UCE, MACE, and ANT in terms of removal efficacy and image quality. Notably, FADE improves the harmonic mean of concept removal and fidelity by 5-10% over the best prior method. 
We also conduct an ablation study to validate each component of FADE, confirming that our adversarial and trajectory-preserving objectives each contribute to its superior performance. Our work sets a new standard for safe and fair generative modeling by unlearning specified concepts without retraining from scratch.
[116] Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation
Antonio Finocchiaro,Giovanni Maria Farinella,Antonino Furnari
Main category: cs.CV
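TL;DR: This study proposes an efficient calisthenics skill recognition method that replaces traditional pose estimation with depth estimation and athlete localization; it substantially improves inference speed and classification accuracy, and its flexible modular design eases future improvement and practical deployment.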
Details
Motivation: Traditional calisthenics skill recognition methods are based on pose estimation, which has high computational cost, long inference times, and complex setups, limiting applicability to real-time applications or mobile devices. A more efficient and faster method is therefore needed. Method: Depth Anything V2 is used for depth estimation and YOLOv10 for athlete localization, segmenting the subject from the background rather than relying on traditional human pose estimation. Result: The method achieves 38.3x faster inference than skeleton-based methods with RGB image patches, and improved classification accuracy with depth patches (0.837 vs. 0.815). Conclusion: The paper proposes a direct approach to calisthenics skill recognition that uses depth estimation and athlete patch retrieval to avoid the computationally expensive human pose estimation module, increasing efficiency, reducing inference time, and improving classification accuracy. Its modular design also allows flexible replacement of components, supporting future enhancements and real-world applications. Abstract: Calisthenics skill classification is the computer vision task of inferring the skill performed by an athlete from images, enabling automatic performance assessment and personalized analytics. Traditional methods for calisthenics skill recognition are based on pose estimation methods to determine the position of skeletal data from images, which is later fed to a classification algorithm to infer the performed skill. Despite the progress in human pose estimation algorithms, they still involve high computational costs, long inference times, and complex setups, which limit the applicability of such approaches in real-time applications or mobile devices. This work proposes a direct approach to calisthenics skill recognition, which leverages depth estimation and athlete patch retrieval to avoid the computationally expensive human pose estimation module. Using Depth Anything V2 for depth estimation and YOLOv10 for athlete localization, we segment the subject from the background rather than relying on traditional pose estimation techniques. This strategy increases efficiency, reduces inference time, and improves classification accuracy. Our approach significantly outperforms skeleton-based methods, achieving 38.3x faster inference with RGB image patches and improved classification accuracy with depth patches (0.837 vs. 0.815). Beyond these performance gains, the modular design of our pipeline allows for flexible replacement of components, enabling future enhancements and adaptation to real-world applications.
[117] Compositional Discrete Latent Code for High Fidelity, Productive Diffusion Models
Samuel Lavoie,Michael Noukhovitch,Aaron Courville
Main category: cs.CV
TL;DR: This paper introduces Discrete Latent Codes (DLCs) for diffusion models, improving image generation quality and enabling novel, compositional sample creation beyond training data.
Details
Motivation: To find ideal representations for conditioning diffusion models that improve fidelity, ease of generation, and compositionality. Method: Introduced Discrete Latent Code (DLC) derived from Simplicial Embeddings with self-supervised learning, applied to diffusion models for image generation. Result: Diffusion models trained with DLCs achieved state-of-the-art results on ImageNet and enabled novel out-of-distribution image generation. Conclusion: DLCs enable efficient and compositional representation for diffusion models, improving image generation fidelity and enabling out-of-distribution sample creation. Abstract: We argue that diffusion models' success in modeling complex distributions is, for the most part, coming from their input conditioning. This paper investigates the representation used to condition diffusion models from the perspective that ideal representations should improve sample fidelity, be easy to generate, and be compositional to allow out-of-training samples generation. We introduce Discrete Latent Code (DLC), an image representation derived from Simplicial Embeddings trained with a self-supervised learning objective. DLCs are sequences of discrete tokens, as opposed to the standard continuous image embeddings. They are easy to generate and their compositionality enables sampling of novel images beyond the training distribution. Diffusion models trained with DLCs have improved generation fidelity, establishing a new state-of-the-art for unconditional image generation on ImageNet. Additionally, we show that composing DLCs allows the image generator to produce out-of-distribution samples that coherently combine the semantics of images in diverse ways. Finally, we showcase how DLCs can enable text-to-image generation by leveraging large-scale pretrained language models. 
We efficiently finetune a text diffusion language model to generate DLCs that produce novel samples outside of the image generator training distribution.
[118] Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors
Subin Jeon,In Cho,Junyoung Hong,Seon Joo Kim
Main category: cs.CV
TL;DR: KeyDiff3D is a new framework for 3D keypoint estimation from a single image, leveraging a pretrained multi-view diffusion model to provide 3D geometric cues and feature extraction, enabling accurate keypoint estimation and 3D object manipulation without manual annotations or calibrated multi-view images.
Details
Motivation: The motivation is to overcome the limitations of previous methods that rely on manual annotations or calibrated multi-view images, which are expensive to collect. Method: The method uses a pretrained multi-view diffusion model to generate multi-view images from a single image, serving as a supervision signal to provide 3D geometric cues. It also uses the diffusion model as a powerful 2D multi-view feature extractor and constructs 3D feature volumes from its intermediate representations. Result: The experimental results show that the method is effective in terms of accuracy, generalization, and its ability to enable manipulation of 3D objects generated by the diffusion model from a single image. Conclusion: KeyDiff3D is a framework for unsupervised monocular 3D keypoints estimation that accurately predicts 3D keypoints from a single image without manual annotations or calibrated multi-view images. Abstract: This paper introduces KeyDiff3D, a framework for unsupervised monocular 3D keypoints estimation that accurately predicts 3D keypoints from a single image. While previous methods rely on manual annotations or calibrated multi-view images, both of which are expensive to collect, our method enables monocular 3D keypoints estimation using only a collection of single-view images. To achieve this, we leverage powerful geometric priors embedded in a pretrained multi-view diffusion model. In our framework, this model generates multi-view images from a single image, serving as a supervision signal to provide 3D geometric cues to our model. We also use the diffusion model as a powerful 2D multi-view feature extractor and construct 3D feature volumes from its intermediate representations. This transforms implicit 3D priors learned by the diffusion model into explicit 3D features. Beyond accurate keypoints estimation, we further introduce a pipeline that enables manipulation of 3D objects generated by the diffusion model. 
Experimental results on diverse aspects and datasets, including Human3.6M, Stanford Dogs, and several in-the-wild and out-of-domain datasets, highlight the effectiveness of our method in terms of accuracy, generalization, and its ability to enable manipulation of 3D objects generated by the diffusion model from a single image.
[119] Improving Lightweight Weed Detection via Knowledge Distillation
Ahmet Oğuz Saltık,Max Voigt,Sourav Modak,Mike Beckworth,Anthony Stein
Main category: cs.CV
TL;DR: This paper proposes CWD and MGD to enhance lightweight models for real-time weed detection in precision agriculture, achieving better accuracy without increasing model complexity.
Details
Motivation: Weed detection is critical for precision agriculture, but deploying accurate models on resource-limited platforms is challenging, especially for visually similar weed species. Method: Channel-wise Knowledge Distillation (CWD) and Masked Generative Distillation (MGD) were used to improve lightweight models using YOLO11x as the teacher model and YOLO11n as the student model. Result: Both CWD and MGD improved detection performance, with CWD achieving a 2.5% improvement and MGD achieving a 1.9% improvement in mAP50 over the baseline. Conclusion: CWD and MGD methods are effective, efficient, and practical for improving weed detection accuracy in precision agriculture. Abstract: Weed detection is a critical component of precision agriculture, facilitating targeted herbicide application and reducing environmental impact. However, deploying accurate object detection models on resource-limited platforms remains challenging, particularly when differentiating visually similar weed species commonly encountered in plant phenotyping applications. In this work, we investigate Channel-wise Knowledge Distillation (CWD) and Masked Generative Distillation (MGD) to enhance the performance of lightweight models for real-time smart spraying systems. Utilizing YOLO11x as the teacher model and YOLO11n as both reference and student, both CWD and MGD effectively transfer knowledge from the teacher to the student model. Our experiments, conducted on a real-world dataset comprising sugar beet crops and four weed types (Cirsium, Convolvulus, Fallopia, and Echinochloa), consistently show increased AP50 across all classes. The distilled CWD student model achieves a notable improvement of 2.5% and MGD achieves 1.9% in mAP50 over the baseline without increasing model complexity. 
Additionally, we validate real-time deployment feasibility by evaluating the student YOLO11n model on Jetson Orin Nano and Raspberry Pi 5 embedded devices, performing five independent runs to evaluate performance stability across random seeds. These findings confirm CWD and MGD as an effective, efficient, and practical approach for improving deep learning-based weed detection accuracy in precision agriculture and plant phenotyping scenarios.
[120] Cluster Contrast for Unsupervised Visual Representation Learning
Nikolaos Giakoumoglou,Tania Stathaki
Main category: cs.CV
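TL;DR: CueCo is a novel unsupervised visual representation learning method that combines contrastive learning and clustering, improving classification accuracy by simultaneously scattering and aligning feature representations.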
Details
Motivation: To combine the strengths of contrastive learning and clustering methods in order to improve unsupervised visual representation learning. Method: CueCo uses two neural networks, a query and a key, and applies a contrastive loss and a clustering objective to simultaneously scatter and align feature representations within the feature space. Result: CueCo achieves 91.40% top-1 classification accuracy on CIFAR-10, 68.56% on CIFAR-100, and 78.65% on ImageNet-100. Conclusion: By integrating contrastive learning with clustering, CueCo sets a new direction for unsupervised visual representation learning. Abstract: We introduce Cluster Contrast (CueCo), a novel approach to unsupervised visual representation learning that effectively combines the strengths of contrastive learning and clustering methods. Inspired by recent advancements, CueCo is designed to simultaneously scatter and align feature representations within the feature space. This method utilizes two neural networks, a query and a key, where the key network is updated through a slow-moving average of the query outputs. CueCo employs a contrastive loss to push dissimilar features apart, enhancing inter-class separation, and a clustering objective to pull together features of the same cluster, promoting intra-class compactness. Our method achieves 91.40% top-1 classification accuracy on CIFAR-10, 68.56% on CIFAR-100, and 78.65% on ImageNet-100 using linear evaluation with a ResNet-18 backbone. By integrating contrastive learning with clustering, CueCo sets a new direction for advancing unsupervised visual representation learning.
[121] Text-driven Multiplanar Visual Interaction for Semi-supervised Medical Image Segmentation
Kaiwen Huang,Yi Zhou,Huazhu Fu,Yizhe Zhang,Chen Gong,Tao Zhou
Main category: cs.CV
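TL;DR: This paper proposes Text-SemiSeg, a new method that incorporates textual information into semi-supervised medical image segmentation, and shows that it outperforms existing techniques.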
Details
Motivation: In semi-supervised medical image segmentation, labeled data is limited and costly, while textual information can provide additional context to enhance visual semantic understanding. However, research on using textual data to enhance visual semantic embeddings in 3D medical imaging tasks remains scarce. Method: A novel text-driven multiplanar visual interaction framework is proposed, consisting of three modules: Text-enhanced Multiplanar Representation (TMR), Category-aware Semantic Alignment (CSA), and Dynamic Cognitive Augmentation (DCA). Result: Experiments show that the model effectively uses textual information to enhance visual features and outperforms other methods on three public datasets. Conclusion: The proposed Text-SemiSeg framework effectively enhances visual features with textual information in semi-supervised medical image segmentation and outperforms other methods. Abstract: Semi-supervised medical image segmentation is a crucial technique for alleviating the high cost of data annotation. When labeled data is limited, textual information can provide additional context to enhance visual semantic understanding. However, research exploring the use of textual data to enhance visual semantic embeddings in 3D medical imaging tasks remains scarce. In this paper, we propose a novel text-driven multiplanar visual interaction framework for semi-supervised medical image segmentation (termed Text-SemiSeg), which consists of three main modules: Text-enhanced Multiplanar Representation (TMR), Category-aware Semantic Alignment (CSA), and Dynamic Cognitive Augmentation (DCA). Specifically, TMR facilitates text-visual interaction through planar mapping, thereby enhancing the category awareness of visual features. CSA performs cross-modal semantic alignment between the text features with introduced learnable variables and the intermediate layer of visual features. DCA reduces the distribution discrepancy between labeled and unlabeled data through their interaction, thus improving the model's robustness. Finally, experiments on three public datasets demonstrate that our model effectively enhances visual features with textual information and outperforms other methods. Our code is available at https://github.com/taozh2017/Text-SemiSeg.
[122] OD-VIRAT: A Large-Scale Benchmark for Object Detection in Realistic Surveillance Environments
Hayat Ullah,Abbas Khan,Arslan Munir,Hari Kalva
Main category: cs.CV
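TL;DR: The paper presents two new large-scale surveillance datasets, OD-VIRAT Large and OD-VIRAT Tiny, and benchmarks recent object detection architectures on them, aiming to improve object detection in complex environments.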
Details
Motivation: Diverse and challenging datasets are needed to assess model performance comprehensively and to develop more reliable surveillance systems for public safety. Existing datasets do not meet these needs, so the authors propose new benchmark datasets. Method: The paper provides large-scale realistic surveillance datasets and benchmarks recent object detection architectures (RETMDET, YOLOX, RetinaNet, DETR, and Deformable-DETR), evaluating their performance under challenging conditions such as complex backgrounds, occlusion, and small-scale objects. Result: The paper presents two new object detection benchmark datasets: OD-VIRAT Large, with 8.7 million annotated instances, and OD-VIRAT Tiny, with 288,901 annotated instances. It benchmarks several state-of-the-art object detection architectures and reports their performance under challenging conditions. Conclusion: The paper presents two visual object detection benchmarks, OD-VIRAT Large and OD-VIRAT Tiny, aimed at advancing visual understanding tasks in surveillance imagery and providing a basis for developing more efficient and robust object detection architectures. Abstract: Realistic human surveillance datasets are crucial for training and evaluating computer vision models under real-world conditions, facilitating the development of robust algorithms for human and human-interacting object detection in complex environments. These datasets need to offer diverse and challenging data to enable a comprehensive assessment of model performance and the creation of more reliable surveillance systems for public safety. To this end, we present two visual object detection benchmarks named OD-VIRAT Large and OD-VIRAT Tiny, aiming at advancing visual understanding tasks in surveillance imagery. The video sequences in both benchmarks cover 10 different scenes of human surveillance recorded from significant height and distance. The proposed benchmarks offer rich annotations of bounding boxes and categories, where OD-VIRAT Large has 8.7 million annotated instances in 599,996 images and OD-VIRAT Tiny has 288,901 annotated instances in 19,860 images. This work also focuses on benchmarking state-of-the-art object detection architectures, including RETMDET, YOLOX, RetinaNet, DETR, and Deformable-DETR on this object detection-specific variant of VIRAT dataset. To the best of our knowledge, it is the first work to examine the performance of these recently published state-of-the-art object detection architectures on realistic surveillance imagery under challenging conditions such as complex backgrounds, occluded objects, and small-scale objects.
The proposed benchmarking and experimental settings will help in providing insights concerning the performance of selected object detection models and set the base for developing more efficient and robust object detection architectures.
[123] AutoVDC: Automated Vision Data Cleaning Using Vision-Language Models
Santosh Vasa,Aditi Ramadwar,Jnana Rama Krishna Darabattula,Md Zafar Anwar,Stanislaw Antol,Andrei Vatavu,Thomas Monninger,Sihao Ding
Main category: cs.CV
TL;DR: This paper proposes the AutoVDC framework, leveraging Vision-Language Models to automatically detect erroneous annotations in autonomous driving datasets, significantly improving data quality and reducing manual effort.
Details
Motivation: The need for high-quality, precisely annotated datasets for training autonomous driving systems is critical. Manual annotation is labor-intensive, expensive, and often imperfect. This work aims to automate the detection of erroneous annotations to reduce effort and improve data quality. Method: The paper introduces the AutoVDC framework, which uses Vision-Language Models (VLMs) to detect erroneous annotations in vision datasets. The framework is validated using the KITTI and nuImages datasets, with artificially injected errors to test error detection rates. Different VLMs and the impact of their fine-tuning are also analyzed. Result: The AutoVDC framework demonstrated high effectiveness in identifying injected errors in datasets. Comparisons between different VLMs and the impact of fine-tuning were explored, showing promising results for automated data cleaning in autonomous driving applications. Conclusion: The AutoVDC framework has shown high performance in detecting errors and cleaning data, indicating its potential to enhance the reliability and accuracy of large-scale datasets in autonomous driving. Abstract: Training of autonomous driving systems requires extensive datasets with precise annotations to attain robust performance. Human annotations suffer from imperfections, and multiple iterations are often needed to produce high-quality datasets. However, manually reviewing large datasets is laborious and expensive. In this paper, we introduce AutoVDC (Automated Vision Data Cleaning) framework and investigate the utilization of Vision-Language Models (VLMs) to automatically identify erroneous annotations in vision datasets, thereby enabling users to eliminate these errors and enhance data quality. We validate our approach using the KITTI and nuImages datasets, which contain object detection benchmarks for autonomous driving. 
To test the effectiveness of AutoVDC, we create dataset variants with intentionally injected erroneous annotations and observe the error detection rate of our approach. Additionally, we compare the detection rates using different VLMs and explore the impact of VLM fine-tuning on our pipeline. The results demonstrate our method's high performance in error detection and data cleaning experiments, indicating its potential to significantly improve the reliability and accuracy of large-scale production datasets in autonomous driving.
[124] QuRe: Query-Relevant Retrieval through Hard Negative Sampling in Composed Image Retrieval
Jaehyun Kwak,Ramahdani Muhammad Izaaz Inhar,Se-Young Yun,Sung-Ju Lee
Main category: cs.CV
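TL;DR: This paper proposes QuRe, which improves user satisfaction in image retrieval by reducing false negatives, and releases a new dataset, HP-FashionIQ, for evaluating how well models align with human preferences.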
Details
Motivation: Existing CIR methods disregard the relevance of the other images in the retrieval results, so irrelevant images may be retrieved, reducing user satisfaction. Method: QuRe optimizes a reward model objective and introduces a hard negative sampling strategy to address the false negative problem in contrastive learning. Result: QuRe performs strongly on the FashionIQ and CIRR datasets and shows the strongest alignment with user satisfaction on the HP-FashionIQ dataset. Conclusion: QuRe effectively reduces false negatives, improves the alignment between image retrieval and user satisfaction, and achieves state-of-the-art performance on multiple datasets. Abstract: Composed Image Retrieval (CIR) retrieves relevant images based on a reference image and accompanying text describing desired modifications. However, existing CIR methods only focus on retrieving the target image and disregard the relevance of other images. This limitation arises because most methods employing contrastive learning, which treats the target image as positive and all other images in the batch as negatives, can inadvertently include false negatives. This may result in retrieving irrelevant images, reducing user satisfaction even when the target image is retrieved. To address this issue, we propose Query-Relevant Retrieval through Hard Negative Sampling (QuRe), which optimizes a reward model objective to reduce false negatives. Additionally, we introduce a hard negative sampling strategy that selects images positioned between two steep drops in relevance scores following the target image, to effectively filter false negatives. In order to evaluate CIR models on their alignment with human satisfaction, we create Human-Preference FashionIQ (HP-FashionIQ), a new dataset that explicitly captures user preferences beyond target retrieval. Extensive experiments demonstrate that QuRe achieves state-of-the-art performance on FashionIQ and CIRR datasets while exhibiting the strongest alignment with human preferences on the HP-FashionIQ dataset. The source code is available at https://github.com/jackwaky/QuRe.
[125] InterpIoU: Rethinking Bounding Box Regression with Interpolation-Based IoU Optimization
Haoyuan Liu,Hiroshi Watanabe
Main category: cs.CV
TL;DR: This paper proposes InterpIoU and Dynamic InterpIoU, novel loss functions for object detection that improve upon existing IoU-based losses by providing meaningful gradients in non-overlapping cases and avoiding issues such as bounding box enlargement.
Details
Motivation: Existing IoU-based losses for BBR are sensitive to box shape, size, and distribution, leading to suboptimal optimization for small objects and bounding box enlargement due to misalignment with the IoU objective. Method: InterpIoU replaces handcrafted geometric penalties with a term based on the IoU between interpolated boxes and the target, while Dynamic InterpIoU dynamically adjusts interpolation coefficients based on IoU values. Result: Experiments on COCO, VisDrone, and PASCAL VOC demonstrate that the proposed methods consistently outperform state-of-the-art IoU-based losses across various detection frameworks, particularly improving small object detection. Conclusion: InterpIoU and Dynamic InterpIoU offer superior performance in object detection compared to existing IoU-based losses, particularly for small object detection. Abstract: Bounding box regression (BBR) is fundamental to object detection, where the regression loss is crucial for accurate localization. Existing IoU-based losses often incorporate handcrafted geometric penalties to address IoU's non-differentiability in non-overlapping cases and enhance BBR performance. However, these penalties are sensitive to box shape, size, and distribution, often leading to suboptimal optimization for small objects and undesired behaviors such as bounding box enlargement due to misalignment with the IoU objective. To address these limitations, we propose InterpIoU, a novel loss function that replaces handcrafted geometric penalties with a term based on the IoU between interpolated boxes and the target. By using interpolated boxes to bridge the gap between predictions and ground truth, InterpIoU provides meaningful gradients in non-overlapping cases and inherently avoids the box enlargement issue caused by misaligned penalties. Simulation results further show that IoU itself serves as an ideal regression target, while existing geometric penalties are both unnecessary and suboptimal. 
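The idea of scoring an interpolated box against the target can be illustrated with a small sketch. This is not the authors' formulation: the linear interpolation of corner coordinates, the fixed coefficient `alpha`, and the way the two IoU terms are summed are all assumptions made for illustration. It only shows why an interpolated box can still give a useful signal when the prediction and the target do not overlap.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2, y1 < y2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def interpolate(pred: Box, target: Box, alpha: float) -> Box:
    """Move each coordinate of pred a fraction alpha of the way to target.

    Linear interpolation is an assumption of this sketch.
    """
    return tuple((1.0 - alpha) * p + alpha * t for p, t in zip(pred, target))


def interp_iou_loss(pred: Box, target: Box, alpha: float = 0.5) -> float:
    """Hypothetical loss: plain IoU term plus an interpolated-box IoU term."""
    plain = 1.0 - iou(pred, target)
    bridged = 1.0 - iou(interpolate(pred, target, alpha), target)
    return plain + bridged


target = (0.0, 0.0, 2.0, 2.0)
near = (3.0, 0.0, 5.0, 2.0)   # does not overlap the target
far = (3.5, 0.0, 5.5, 2.0)    # does not overlap the target, slightly farther

# Plain IoU is zero for both predictions, so it cannot rank them,
# while the interpolated term still distinguishes near from far.
print(iou(near, target), iou(far, target))
print(interp_iou_loss(near, target), interp_iou_loss(far, target))
```

With a fixed `alpha`, a prediction that is far enough away still yields an interpolated box that misses the target, so the signal vanishes again. That gives one intuition for why a coefficient that adapts to the current IoU could help.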
Building on InterpIoU, we introduce Dynamic InterpIoU, which dynamically adjusts interpolation coefficients based on IoU values, enhancing adaptability to scenarios with diverse object distributions. Experiments on COCO, VisDrone, and PASCAL VOC show that our methods consistently outperform state-of-the-art IoU-based losses across various detection frameworks, with particularly notable improvements in small object detection, confirming their effectiveness.
[126] DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition
Hayat Ullah,Muhammad Ali Shafique,Abbas Khan,Arslan Munir
Main category: cs.CV
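TL;DR: This paper proposes DVFL-Net, an efficient video recognition model that uses knowledge distillation and spatio-temporal focal modulation to significantly reduce computational cost while maintaining high performance, making it suitable for real-time human action recognition.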
Details
Motivation: Although Transformer models achieve state-of-the-art performance on video recognition tasks, they are computationally expensive, particularly on dense video data. An efficient video recognition method is therefore needed to support on-device deployment. Method: A lightweight video focal modulation network, DVFL-Net, is proposed. It combines forward KL divergence with spatio-temporal focal modulation to distill spatiotemporal knowledge from a pre-trained teacher model, reducing computation while preserving high recognition performance. Result: DVFL-Net performs well on several benchmark datasets (UCF50, UCF101, HMDB51, SSV2, and Kinetics-400), with lower memory usage, fewer GFLOPs, and strong accuracy. Conclusion: Through spatio-temporal focal modulation and knowledge distillation, DVFL-Net achieves an optimal balance between performance and efficiency, making it a practical solution for real-time human action recognition. Abstract: The landscape of video recognition has evolved significantly, shifting from traditional Convolutional Neural Networks (CNNs) to Transformer-based architectures for improved accuracy. While 3D CNNs have been effective at capturing spatiotemporal dynamics, recent Transformer models leverage self-attention to model long-range spatial and temporal dependencies. Despite achieving state-of-the-art performance on major benchmarks, Transformers remain computationally expensive, particularly with dense video data. To address this, we propose a lightweight Video Focal Modulation Network, DVFL-Net, which distills spatiotemporal knowledge from a large pre-trained teacher into a compact nano student model, enabling efficient on-device deployment. DVFL-Net utilizes knowledge distillation and spatial-temporal feature modulation to significantly reduce computation while preserving high recognition performance. We employ forward Kullback-Leibler (KL) divergence alongside spatio-temporal focal modulation to effectively transfer both local and global context from the Video-FocalNet Base (teacher) to the proposed VFL-Net (student). We evaluate DVFL-Net on UCF50, UCF101, HMDB51, SSV2, and Kinetics-400, benchmarking it against recent state-of-the-art methods in Human Action Recognition (HAR). Additionally, we conduct a detailed ablation study analyzing the impact of forward KL divergence.
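For readers unfamiliar with the term, forward KL divergence in distillation is usually taken to mean KL(teacher || student): the expectation is under the teacher's distribution, so the student is penalized wherever it gives low probability to classes the teacher favors. The sketch below computes it from raw class logits in plain Python. The `temperature` parameter and the example logits are assumptions for illustration; the abstract does not specify how the authors configure the loss.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(teacher_logits: Sequence[float],
               student_logits: Sequence[float],
               temperature: float = 1.0) -> float:
    """KL(teacher || student) = sum_i p_i * log(p_i / q_i)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    # Terms with p_i == 0 contribute nothing by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


teacher = [4.0, 1.0, 0.0]   # confident teacher (hypothetical logits)
student = [1.0, 1.0, 1.0]   # uniform student (hypothetical logits)

print(forward_kl(teacher, student))   # forward direction: teacher || student
print(forward_kl(student, teacher))   # reverse direction gives a different value
```

The two printed values differ because KL divergence is asymmetric, which is why the direction is worth ablating. In a training framework one would normally compute this from log-probabilities over a batch rather than from Python lists.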
The results confirm the superiority of DVFL-Net in achieving an optimal balance between performance and efficiency, demonstrating lower memory usage, reduced GFLOPs, and strong accuracy, making it a practical solution for real-time HAR applications.
[127] Traffic-Aware Pedestrian Intention Prediction
Fahimeh Orvati Nia,Hai Lin
Main category: cs.CV
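TL;DR: This paper proposes TA-STGCN, a new pedestrian intention prediction model that integrates dynamic traffic signals and scene information, significantly improving prediction accuracy.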
Details
Motivation: Current models fail to adequately consider dynamic traffic signals and contextual scene information, which are critical for real-world applications. Method: Dynamic traffic signal states and bounding box size are introduced as key features, and traffic signs and their states (Red, Yellow, Green) are integrated into pedestrian intention prediction. Result: On the PIE dataset, TA-STGCN achieves 4.75% higher accuracy than the baseline model. Conclusion: TA-STGCN achieves higher accuracy in pedestrian intention prediction, improving on the baseline model by 4.75%. Abstract: Accurate pedestrian intention estimation is crucial for the safe navigation of autonomous vehicles (AVs) and hence attracts a lot of research attention. However, current models often fail to adequately consider dynamic traffic signals and contextual scene information, which are critical for real-world applications. This paper presents a Traffic-Aware Spatio-Temporal Graph Convolutional Network (TA-STGCN) that integrates traffic signs and their states (Red, Yellow, Green) into pedestrian intention prediction. Our approach introduces the integration of dynamic traffic signal states and bounding box size as key features, allowing the model to capture both spatial and temporal dependencies in complex urban environments. The model surpasses existing methods in accuracy. Specifically, TA-STGCN achieves a 4.75% higher accuracy compared to the baseline model on the PIE dataset, demonstrating its effectiveness in improving pedestrian intention prediction.
[128] Describe Anything Model for Visual Question Answering on Text-rich Images
Yen-Linh Vu,Dinh-Thang Duong,Truong-Binh Duong,Anh-Khoi Nguyen,Thanh-Huy Nguyen,Le Thien Phuc Nguyen,Jianhua Xing,Xingjian Li,Tianyang Wang,Ulas Bagci,Min Xu
Main category: cs.CV
TL;DR: DAM-QA leverages DAM's region-aware vision-language modeling to improve text-rich VQA tasks, showing significant performance gains, especially on DocVQA.
Details
Motivation: The region-level descriptive capability of DAM is hypothesized to benefit Visual Question Answering (VQA), particularly for images with dense text where fine-grained textual extraction is crucial. Method: DAM-QA introduces a mechanism to aggregate answers from multiple regional views of image content, focusing on text-based reasoning. Result: Experiments on six VQA benchmarks show that DAM-QA outperforms baseline DAM by over 7 points on DocVQA and achieves the best performance among region-aware models with fewer parameters. Conclusion: DAM-QA effectively utilizes region-aware capabilities of DAM for text-rich VQA tasks, especially showing good performance on DocVQA and narrowing the gap with generalist VLMs. Abstract: Recent progress has been made in region-aware vision-language modeling, particularly with the emergence of the Describe Anything Model (DAM). DAM is capable of generating detailed descriptions of any specific image areas or objects without the need for additional localized image-text alignment supervision. We hypothesize that such region-level descriptive capability is beneficial for the task of Visual Question Answering (VQA), especially in challenging scenarios involving images with dense text. In such settings, the fine-grained extraction of textual information is crucial to producing correct answers. Motivated by this, we introduce DAM-QA, a framework with a tailored evaluation protocol, developed to investigate and harness the region-aware capabilities from DAM for the text-rich VQA problem that requires reasoning over text-based information within images. DAM-QA incorporates a mechanism that aggregates answers from multiple regional views of image content, enabling more effective identification of evidence that may be tied to text-related elements. Experiments on six VQA benchmarks show that our approach consistently outperforms the baseline DAM, with a notable 7+ point gain on DocVQA. 
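The abstract does not say how answers from the regional views are combined. One simple possibility, shown purely as an illustration, is a majority vote over normalized answer strings that ignores views which abstain. The function name `aggregate_answers`, the `"unanswerable"` abstain token, and the normalization rule are hypothetical and are not taken from the DAM-QA code.

```python
from collections import Counter
from typing import Iterable, Optional


def aggregate_answers(view_answers: Iterable[str],
                      abstain_token: str = "unanswerable") -> Optional[str]:
    """Majority vote over per-view answers (hypothetical aggregation rule).

    Answers are lowercased and whitespace-normalized before counting.
    Empty answers and the abstain token are skipped. Ties go to the
    answer that was seen first.
    """
    votes: Counter = Counter()
    for answer in view_answers:
        normalized = " ".join(answer.lower().split())
        if normalized and normalized != abstain_token:
            votes[normalized] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Answers produced for the same question from four image regions (made up).
answers = ["Total: 42", "total:  42", "unanswerable", "17"]
print(aggregate_answers(answers))
```

A vote of this kind lets a region that contains the relevant text outweigh regions that see nothing useful, which matches the intuition in the abstract that evidence may be tied to a specific text-bearing area.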
DAM-QA also achieves the best overall performance among region-aware models with fewer parameters, significantly narrowing the gap with strong generalist VLMs. These results highlight the potential of DAM-like models for text-rich and broader VQA tasks when paired with efficient usage and integration strategies. Our code is publicly available at https://github.com/Linvyl/DAM-QA.git.
[129] Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios
Van-Hoang-Anh Phan,Chi-Tam Nguyen,Doan-Trung Au,Thanh-Danh Phan,Minh-Thien Duong,My-Ha Le
Main category: cs.CV
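TL;DR: This paper proposes an efficient camera-only obstacle avoidance pipeline that uses YOLOv11 and Depth Anything V2 for obstacle detection and distance estimation, combined with a Frenet-Pure Pursuit planning strategy, to navigate safely in complex environments.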
Details
Motivation: Obstacle avoidance is a key capability for ensuring the safety of autonomous vehicles, and accurate perception and motion planning are crucial for collision-free navigation in complex environments. Method: The study uses a camera-only perception module, with YOLOv11 for object detection and Depth Anything V2 for monocular depth estimation, followed by Frenet-Pure Pursuit path planning, forming a complete obstacle avoidance pipeline. Result: The system was evaluated in diverse scenarios on a university campus, demonstrating its effectiveness in handling various obstacles and enhancing autonomous navigation. Conclusion: The proposed obstacle avoidance pipeline performs well in accuracy, efficiency, and robustness, offering a viable camera-only solution for autonomous driving. Abstract: Obstacle avoidance is essential for ensuring the safety of autonomous vehicles. Accurate perception and motion planning are crucial to enabling vehicles to navigate complex environments while avoiding collisions. In this paper, we propose an efficient obstacle avoidance pipeline that leverages a camera-only perception module and a Frenet-Pure Pursuit-based planning strategy. By integrating advancements in computer vision, the system utilizes YOLOv11 for object detection and state-of-the-art monocular depth estimation models, such as Depth Anything V2, to estimate object distances. A comparative analysis of these models provides valuable insights into their accuracy, efficiency, and robustness in real-world conditions. The system is evaluated in diverse scenarios on a university campus, demonstrating its effectiveness in handling various obstacles and enhancing autonomous navigation. The video presenting the results of the obstacle avoidance experiments is available at: https://www.youtube.com/watch?v=FoXiO5S_tA8
[130] Mitigating Object Hallucinations via Sentence-Level Early Intervention
Shangpin Peng,Senqiao Yang,Li Jiang,Zhuotao Tian
Main category: cs.CV
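TL;DR: SENTINEL is a new framework that uses sentence-level early intervention and in-domain preference learning to effectively reduce hallucinations in multimodal large language models, outperforming existing methods.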
Details
Motivation: Existing hallucination mitigation methods either incur prohibitive computational costs or introduce distribution mismatches between training data and model outputs. Moreover, hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. Method: SENTINEL applies sentence-level early intervention and uses in-domain preference learning to eliminate the dependency on human annotations. It iteratively samples model outputs, validates object existence by cross-checking with two open-vocabulary detectors, and classifies sentences as hallucinated or non-hallucinated to guide model training. Result: SENTINEL reduces hallucinations by over 90% compared to the original model and performs well across various benchmarks. Conclusion: SENTINEL effectively reduces hallucinations in multimodal large language models and outperforms the previous state-of-the-art method on both hallucination benchmarks and general capabilities benchmarks, demonstrating its superiority and generalization ability. Abstract: Multimodal large language models (MLLMs) have revolutionized cross-modal understanding but continue to struggle with hallucinations - fabricated content contradicting visual inputs. Existing hallucination mitigation methods either incur prohibitive computational costs or introduce distribution mismatches between training data and model outputs. We identify a critical insight: hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. To address this, we propose SENTINEL (Sentence-level Early iNtervention Through IN-domain prEference Learning), a framework that eliminates dependency on human annotations. Specifically, we first bootstrap high-quality in-domain preference pairs by iteratively sampling model outputs, validating object existence through cross-checking with two open-vocabulary detectors, and classifying sentences into hallucinated/non-hallucinated categories. Subsequently, we use context-coherent positive samples and hallucinated negative samples to build context-aware preference data iteratively. Finally, we train models using a context-aware preference loss (C-DPO) that emphasizes discriminative learning at the sentence level where hallucinations initially manifest. Experimental results show that SENTINEL can reduce hallucinations by over 90% compared to the original model and outperforms the previous state-of-the-art method on both hallucination benchmarks and general capabilities benchmarks, demonstrating its superiority and generalization ability.
The models, datasets, and code are available at https://github.com/pspdada/SENTINEL.
[131] Interpreting Radiologist's Intention from Eye Movements in Chest X-ray Diagnosis
Trong-Thang Pham,Anh Nguyen,Zhigang Deng,Carol C. Wu,Hien Van Nguyen,Ngan Le
Main category: cs.CV
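TL;DR: This paper proposes RadGazeIntent, a deep learning-based model that captures the diagnostic intent behind radiologists' eye movements, and shows that it outperforms existing methods.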
Details
Motivation: Radiologists rely on eye movements to navigate and interpret medical images, yet existing models fail to capture the underlying intent behind each fixation. Method: The authors introduce RadGazeIntent, a deep learning-based approach designed to model the intention-driven behavior of radiologists who look for and actively search for diseases. Result: Experiments show that RadGazeIntent outperforms baseline methods at predicting which findings radiologists are examining. Conclusion: RadGazeIntent effectively predicts which findings radiologists are examining at specific moments, outperforming baseline methods across all intention-labeled datasets. Abstract: Radiologists rely on eye movements to navigate and interpret medical images. A trained radiologist possesses knowledge about the potential diseases that may be present in the images and, when searching, follows a mental checklist to locate them using their gaze. This is a key observation, yet existing models fail to capture the underlying intent behind each fixation. In this paper, we introduce a deep learning-based approach, RadGazeIntent, designed to model this behavior: having an intention to find something and actively searching for it. Our transformer-based architecture processes both the temporal and spatial dimensions of gaze data, transforming fine-grained fixation features into coarse, meaningful representations of diagnostic intent to interpret radiologists' goals. To capture the nuances of radiologists' varied intention-driven behaviors, we process existing medical eye-tracking datasets to create three intention-labeled subsets: RadSeq (Systematic Sequential Search), RadExplore (Uncertainty-driven Exploration), and RadHybrid (Hybrid Pattern). Experimental results demonstrate RadGazeIntent's ability to predict which findings radiologists are examining at specific moments, outperforming baseline methods across all intention-labeled datasets.
[132] SpatialTrackerV2: 3D Point Tracking Made Easy
Yuxi Xiao,Jianyuan Wang,Nan Xue,Nikita Karaev,Yuri Makarov,Bingyi Kang,Xing Zhu,Hujun Bao,Yujun Shen,Xiaowei Zhou
Main category: cs.CV
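TL;DR: SpatialTrackerV2 is a high-performing 3D point tracking method that unifies the intrinsic connections between point tracking, monocular depth, and camera pose estimation. It outperforms existing 3D tracking methods by 30% and runs 50 times faster than leading dynamic 3D reconstruction approaches of comparable accuracy.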
Details
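Motivation: Conventional modular 3D tracking pipelines rely on off-the-shelf components, which limits their performance. To address this, SpatialTrackerV2 unifies the intrinsic connections between point tracking, monocular depth, and camera pose estimation to improve 3D tracking performance. Method: SpatialTrackerV2 uses a fully differentiable, end-to-end architecture that decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion, enabling 3D point tracking for monocular videos. Result: SpatialTrackerV2 is trained scalably across a wide range of datasets, including synthetic sequences, posed RGB-D videos, and unlabeled in-the-wild footage. It outperforms existing 3D tracking methods by 30% and matches the accuracy of leading dynamic 3D reconstruction approaches while running 50 times faster. Conclusion: SpatialTrackerV2 is a high-performing 3D point tracking method that, by unifying the intrinsic connections between point tracking, monocular depth, and camera pose estimation, outperforms existing 3D tracking methods by 30% and matches the accuracy of leading dynamic 3D reconstruction approaches while running 50 times faster. Abstract: We present SpatialTrackerV2, a feed-forward 3D point tracking method for monocular videos. Going beyond modular pipelines built on off-the-shelf components for 3D tracking, our approach unifies the intrinsic connections between point tracking, monocular depth, and camera pose estimation into a high-performing and feed-forward 3D point tracker. It decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion, with a fully differentiable and end-to-end architecture, allowing scalable training across a wide range of datasets, including synthetic sequences, posed RGB-D videos, and unlabeled in-the-wild footage. By learning geometry and motion jointly from such heterogeneous data, SpatialTrackerV2 outperforms existing 3D tracking methods by 30%, and matches the accuracy of leading dynamic 3D reconstruction approaches while running 50$\times$ faster.
[133] MMHU: A Massive-Scale Multimodal Benchmark for Human Behavior Understanding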
Renjie Li,Ruijie Ye,Mingyang Wu,Hao Frank Yang,Zhiwen Fan,Hezhen Hu,Zhengzhong Tu
Main category: cs.CV
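TL;DR: This paper proposes MMHU, a large-scale benchmark dataset for human behavior analysis in autonomous driving, featuring rich annotations and diverse data sources, along with a human-in-the-loop annotation pipeline that supports evaluation on multiple tasks.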
Details
Motivation: Although recent work has explored various aspects of human behavior, such as motion, trajectories, and intention, a comprehensive benchmark for evaluating human behavior understanding in autonomous driving is still lacking. Method: A human-in-the-loop annotation pipeline is developed to generate rich behavior captions, and multiple tasks are benchmarked, including motion prediction, motion generation, and human behavior question answering. Result: The paper presents MMHU, a large-scale benchmark dataset containing 57k human motion clips and 1.73M frames, covering human motion and trajectories, text descriptions of human motion, human intention, and critical behavior labels relevant to driving safety. Conclusion: The release of the MMHU-Benchmark.github.io project page further supports the dataset's broad adoption and continued development. Abstract: Humans are integral components of the transportation ecosystem, and understanding their behaviors is crucial to facilitating the development of safe driving systems. Although recent progress has explored various aspects of human behavior (such as motion, trajectories, and intention), a comprehensive benchmark for evaluating human behavior understanding in autonomous driving remains unavailable. In this work, we propose MMHU, a large-scale benchmark for human behavior analysis featuring rich annotations, such as human motion and trajectories, text descriptions of human motions, human intention, and critical behavior labels relevant to driving safety. Our dataset encompasses 57k human motion clips and 1.73M frames gathered from diverse sources, including established driving datasets such as Waymo, in-the-wild videos from YouTube, and self-collected data. A human-in-the-loop annotation pipeline is developed to generate rich behavior captions. We provide a thorough dataset analysis and benchmark multiple tasks, ranging from motion prediction to motion generation and human behavior question answering, thereby offering a broad evaluation suite. Project page: https://MMHU-Benchmark.github.io.
[134] CytoSAE: Interpretable Cell Embeddings for Hematology
Muhammed Furkan Dasdelen,Hyesu Lim,Michele Buck,Katharina S. Götze,Carsten Marr,Steffen Schneider
Main category: cs.CV
TL;DR: CytoSAE applies sparse autoencoder techniques to hematology for explainable identification of morphologically relevant concepts and achieves strong performance in AML subtype classification.
Details
Motivation: Sparse autoencoders (SAEs) have shown promise in interpreting transformer-based models in other domains, and their application to medical imaging, specifically hematology, aims to explain model inferences and discover relevant visual concepts. Method: CytoSAE was trained on over 40,000 peripheral blood single-cell images and applied to diverse datasets, including bone marrow cytology, to identify visual concepts linked to medical expert validation. Result: CytoSAE generalizes well to diverse datasets, identifies disease-specific and patient-specific concepts, detects pathognomonic cells and localized abnormalities, and achieves performance comparable to state-of-the-art methods on AML subtype classification. Conclusion: CytoSAE, a sparse autoencoder, demonstrates applicability in hematology by identifying morphologically relevant concepts and offering explainability on the sub-cellular level for AML subtype classification. Abstract: Sparse autoencoders (SAEs) emerged as a promising tool for mechanistic interpretability of transformer-based foundation models. Very recently, SAEs were also adopted for the visual domain, enabling the discovery of visual concepts and their patch-wise attribution to tokens in the transformer model. While a growing number of foundation models emerged for medical imaging, tools for explaining their inferences are still lacking. In this work, we show the applicability of SAEs for hematology. We propose CytoSAE, a sparse autoencoder which is trained on over 40,000 peripheral blood single-cell images. CytoSAE generalizes to diverse and out-of-domain datasets, including bone marrow cytology, where it identifies morphologically relevant concepts which we validated with medical experts. Furthermore, we demonstrate scenarios in which CytoSAE can generate patient-specific and disease-specific concepts, enabling the detection of pathognomonic cells and localized cellular abnormalities at the patch level. 
We quantified the effect of concepts on a patient-level AML subtype classification task and show that CytoSAE concepts reach performance comparable to the state-of-the-art, while offering explainability on the sub-cellular level. Source code and model weights are available at https://github.com/dynamical-inference/cytosae.
[135] PhysX: Physical-Grounded 3D Asset Generation
Ziang Cao,Zhaoxi Chen,Liang Pan,Ziwei Liu
Main category: cs.CV
TL;DR: PhysX introduces PhysXNet and PhysXGen to enable physical-grounded 3D asset generation, bridging the gap in physics-annotated datasets and enhancing real-world applicability.
Details
Motivation: 3D modeling needs to shift focus from virtual to physical properties to improve real-world applications, as current generative models neglect physical-grounded modeling. Method: PhysX proposes PhysXNet, a physics-grounded 3D dataset, and PhysXGen, a feed-forward framework that injects physical knowledge into pre-trained 3D structural space using a dual-branch architecture. Result: The experiments demonstrate that PhysXGen achieves superior performance and generalization in physics-grounded image-to-3D asset generation. Conclusion: PhysX provides an end-to-end paradigm for generating physical-grounded 3D assets, enhancing real-world applications in simulation and embodied AI. Abstract: 3D modeling is moving from virtual to physical. Existing 3D generation primarily emphasizes geometries and textures while neglecting physical-grounded modeling. Consequently, despite the rapid development of 3D generative models, the synthesized 3D assets often overlook rich and important physical properties, hampering their real-world application in physical domains like simulation and embodied AI. As an initial attempt to address this challenge, we propose PhysX, an end-to-end paradigm for physical-grounded 3D asset generation. 1) To bridge the critical gap in physics-annotated 3D datasets, we present PhysXNet - the first physics-grounded 3D dataset systematically annotated across five foundational dimensions: absolute scale, material, affordance, kinematics, and function description. In particular, we devise a scalable human-in-the-loop annotation pipeline based on vision-language models, which enables efficient creation of physics-first assets from raw 3D assets. 2) Furthermore, we propose PhysXGen, a feed-forward framework for physics-grounded image-to-3D asset generation, injecting physical knowledge into the pre-trained 3D structural space.
Specifically, PhysXGen employs a dual-branch architecture to explicitly model the latent correlations between 3D structures and physical properties, thereby producing 3D assets with plausible physical predictions while preserving the native geometry quality. Extensive experiments validate the superior performance and promising generalization capability of our framework. All the code, data, and models will be released to facilitate future research in generative physical AI.
eess.IV [Back]
[136] CompressedVQA-HDR: Generalized Full-reference and No-reference Quality Assessment Models for Compressed High Dynamic Range Videos
Wei Sun,Linhan Cao,Kang Fu,Dandan Zhu,Jun Jia,Menghan Hu,Xiongkuo Min,Guangtao Zhai
Main category: eess.IV
TL;DR: This paper proposes CompressedVQA-HDR, a new framework for video quality assessment tailored to HDR content, achieving top performance in benchmark evaluations.
Details
Motivation: Existing VQA methods struggle with generalization, especially for high dynamic range (HDR) videos, necessitating a more effective and adaptable solution. Method: The paper proposes two models (FR and NR) using Swin Transformer and SigLip 2 as backbones, incorporating transfer learning and iterative training to overcome limited HDR data. Result: Experimental results show that both models outperform existing methods, achieving superior performance on HDR video quality assessment. Conclusion: The proposed CompressedVQA-HDR framework achieves state-of-the-art performance in assessing the quality of HDR videos, with the FR model winning first place in a relevant challenge. Abstract: Video compression is a standard procedure applied to all videos to minimize storage and transmission demands while preserving visual quality as much as possible. Therefore, evaluating the visual quality of compressed videos is crucial for guiding the practical usage and further development of video compression algorithms. Although numerous compressed video quality assessment (VQA) methods have been proposed, they often lack the generalization capability needed to handle the increasing diversity of video types, particularly high dynamic range (HDR) content. In this paper, we introduce CompressedVQA-HDR, an effective VQA framework designed to address the challenges of HDR video quality assessment. Specifically, we adopt the Swin Transformer and SigLip 2 as the backbone networks for the proposed full-reference (FR) and no-reference (NR) VQA models, respectively. For the FR model, we compute deep structural and textural similarities between reference and distorted frames using intermediate-layer features extracted from the Swin Transformer as its quality-aware feature representation. For the NR model, we extract the global mean of the final-layer feature maps from SigLip 2 as its quality-aware representation. 
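The NR feature described above, the global mean of the final-layer feature maps, amounts to averaging each channel over all of its spatial positions. A minimal sketch, assuming a feature map stored as nested lists of shape (channels, height, width); the function name `global_mean_pool` and the toy values are illustrative and are not taken from the CompressedVQA-HDR code, which operates on SigLip 2 features rather than hand-written lists.

```python
from typing import List

# A feature map indexed as [channel][row][column].
FeatureMap = List[List[List[float]]]


def global_mean_pool(feature_map: FeatureMap) -> List[float]:
    """Average every channel over all of its spatial positions."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        if not values:
            raise ValueError("each channel needs at least one spatial position")
        pooled.append(sum(values) / len(values))
    return pooled


# Toy feature map with 2 channels of size 2x2 (hypothetical values).
toy_features = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]

print(global_mean_pool(toy_features))  # one value per channel
```

The result is a vector with one entry per channel, independent of the spatial resolution of the input frame.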
To mitigate the issue of limited HDR training data, we pre-train the FR model on a large-scale standard dynamic range (SDR) VQA dataset and fine-tune it on the HDRSDR-VQA dataset. For the NR model, we employ an iterative mixed-dataset training strategy across multiple compressed VQA datasets, followed by fine-tuning on the HDRSDR-VQA dataset. Experimental results show that our models achieve state-of-the-art performance compared to existing FR and NR VQA models. Moreover, CompressedVQA-HDR-FR won first place in the FR track of the Generalizable HDR & SDR Video Quality Measurement Grand Challenge at IEEE ICME 2025. The code is available at https://github.com/sunwei925/CompressedVQA-HDR.
[137] Identifying Signatures of Image Phenotypes to Track Treatment Response in Liver Disease
Matthias Perkonigg,Nina Bastati,Ahmed Ba-Ssalamah,Peter Mesenbrink,Alexander Goehler,Miljen Martic,Xiaofei Zhou,Michael Trauner,Georg Langs
Main category: eess.IV
TL;DR: Unsupervised machine learning creates a liver tissue vocabulary from MRI scans that quantifies treatment response and predicts biopsy features non-invasively.
Details
Motivation: The motivation is to develop quantifiable image patterns that guide individual treatment and aid in developing novel therapies for liver disease, particularly by leveraging unsupervised machine learning for non-invasive assessment. Method: Deep clustering networks were used to encode and cluster patches of medical images into a low-dimensional latent space, establishing a tissue vocabulary. This method was applied to a cohort of non-alcoholic steatohepatitis patients to compare longitudinal liver changes between placebo and treatment groups. Result: The method identified specific liver tissue change pathways associated with treatment and achieved better separation between treatment groups than established non-imaging measures. Additionally, the tissue vocabulary could predict biopsy-derived features and was validated on a replication cohort. Conclusion: The study concludes that unsupervised machine learning can effectively identify image-based pattern vocabularies in liver tissue from MRI scans, which can quantify treatment response in diffuse liver disease and predict biopsy-derived features non-invasively. Abstract: Quantifiable image patterns associated with disease progression and treatment response are critical tools for guiding individual treatment, and for developing novel therapies. Here, we show that unsupervised machine learning can identify a pattern vocabulary of liver tissue in magnetic resonance images that quantifies treatment response in diffuse liver disease. Deep clustering networks simultaneously encode and cluster patches of medical images into a low-dimensional latent space to establish a tissue vocabulary. The resulting tissue types capture differential tissue change and its location in the liver associated with treatment response. We demonstrate the utility of the vocabulary on a randomized controlled trial cohort of non-alcoholic steatohepatitis patients. 
First, we use the vocabulary to compare longitudinal liver change in a placebo and a treatment cohort. Results show that the method identifies specific liver tissue change pathways associated with treatment, and enables a better separation between treatment groups than established non-imaging measures. Moreover, we show that the vocabulary can predict biopsy-derived features from non-invasive imaging data. We validate the method on a separate replication cohort to demonstrate the applicability of the proposed method.
[138] Benchmarking and Explaining Deep Learning Cortical Lesion MRI Segmentation in Multiple Sclerosis
Nataliia Molchanova,Alessandro Cagol,Mario Ocampo-Pineda,Po-Jui Lu,Matthias Weigel,Xinjie Chen,Erin Beck,Charidimos Tsagkas,Daniel Reich,Colin Vanden Bulcke,Anna Stolting,Serena Borrelli,Pietro Maggi,Adrien Depeursinge,Cristina Granziera,Henning Mueller,Pedro M. Gordaliza,Meritxell Bach Cuadra
Main category: eess.IV
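TL;DR: This paper presents an automated method for cortical lesion detection and segmentation in multiple sclerosis. Using multi-centre MRI data and a modified nnU-Net framework, it demonstrates good detection performance and analyzes the factors that affect model performance in order to facilitate clinical adoption.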