
cs.CL

[1] DrDiff: Dynamic Routing Diffusion with Hierarchical Attention for Breaking the Efficiency-Quality Trade-off

Jusheng Zhang,Yijia Fan,Kaitong Cai,Zimeng Huang,Xiaofei Sun,Jian Wang,Chengpei Tang,Keze Wang

Main category: cs.CL

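TL;DR: This paper proposes DrDiff, a long-text generation framework that addresses the efficiency-quality trade-off through dynamic expert scheduling, hierarchical sparse attention, and soft absorption guidance optimization, and performs strongly on multiple benchmarks.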

Details Motivation: To address the efficiency-quality trade-off in long-text generation and improve both generation speed and model performance. Method: Proposes the DrDiff framework, built on three core techniques: a dynamic expert scheduling mechanism, a hierarchical sparse attention mechanism, and a soft absorption guidance optimization strategy. Result: Comprehensive experiments on various long-text generation benchmarks show that DrDiff outperforms existing SOTA methods in both performance and efficiency. Conclusion: DrDiff outperforms existing state-of-the-art methods in long-text generation, addressing the efficiency-quality trade-off through its three core techniques. Abstract: This paper introduces DrDiff, a novel framework for long-text generation that overcomes the efficiency-quality trade-off through three core technologies. First, we design a dynamic expert scheduling mechanism that intelligently allocates computational resources during the diffusion process based on text complexity, enabling more efficient handling of text generation tasks of varying difficulty. Second, we introduce a Hierarchical Sparse Attention (HSA) mechanism that adaptively adjusts attention patterns according to a variety of input lengths, reducing computational complexity from O($n^2$) to O($n$) while maintaining model performance. Finally, we propose a soft absorption guidance optimization strategy that combines with DPM-solver++ to reduce diffusion steps, significantly improving generation speed. Comprehensive experiments on various long-text generation benchmarks demonstrate the superiority of our DrDiff over the existing SOTA methods.

[2] SSVD: Structured SVD for Parameter-Efficient Fine-Tuning and Benchmarking under Domain Shift in ASR

Pu Wang,Shinji Watanabe,Hugo Van hamme

Main category: cs.CL

TL;DR: This paper introduces and evaluates parameter-efficient fine-tuning methods, including a new structured SVD-guided approach, for domain adaptation in speech recognition tasks.

Details Motivation: Current state-of-the-art PEFT methods are mainly developed for language and vision tasks, with limited validation in speech, motivating the need for a comprehensive evaluation and new solutions tailored for speech applications. Method: The paper benchmarks existing PEFT methods, such as VeRA, DoRA, PiSSA, and SVFT, and introduces a new method called structured SVD-guided (SSVD) fine-tuning. Result: All PEFT methods, including the proposed SSVD, are evaluated on domain-shifted speech recognition tasks such as child speech and dialectal variation across multiple model scales. Conclusion: The paper concludes that SSVD is a more efficient and effective method for domain adaptation in speech applications, and the implementations of all methods are released to support future work. Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as a scalable solution for adapting large foundation models. While low-rank adaptation (LoRA) is widely used in speech applications, its state-of-the-art variants, e.g., VeRA, DoRA, PiSSA, and SVFT, are developed mainly for language and vision tasks, with limited validation in speech. This work presents the first comprehensive integration and benchmarking of these PEFT methods within ESPnet. We further introduce structured SVD-guided (SSVD) fine-tuning, which selectively rotates input-associated right singular vectors while keeping output-associated vectors fixed to preserve semantic mappings. This design enables robust domain adaptation with minimal trainable parameters and improved efficiency. We evaluate all methods on domain-shifted speech recognition tasks, including child speech and dialectal variation, across model scales from 0.1B to 2B. All implementations are released in ESPnet to support reproducibility and future work.

[3] Clustering Discourses: Racial Biases in Short Stories about Women Generated by Large Language Models

Gustavo Bonil,João Gondim,Marina dos Santos,Simone Hashiguti,Helena Maia,Nadia Silva,Helio Pedrini,Sandra Avila

Main category: cs.CL

TL;DR: This paper explores how LLaMA 3.2-3B generates narratives about Black and white women in Portuguese, revealing underlying colonial and gender biases.

Details Motivation: To understand how large language models may reinforce historical and racial inequalities through narrative generation. Method: Generated 2100 texts and used computational methods for semantic grouping, followed by qualitative discourse analysis. Result: Identified three discursive representations: social overcoming, ancestral mythification, and subjective self-realization; showed how texts reinforce colonial framing of female bodies. Conclusion: Language models can perpetuate systemic inequalities, and an integrated approach combining machine learning with qualitative analysis is essential to uncover such biases. Abstract: This study investigates how large language models, in particular LLaMA 3.2-3B, construct narratives about Black and white women in short stories generated in Portuguese. From 2100 texts, we applied computational methods to group semantically similar stories, allowing a selection for qualitative analysis. Three main discursive representations emerge: social overcoming, ancestral mythification and subjective self-realization. The analysis uncovers how grammatically coherent, seemingly neutral texts materialize a crystallized, colonially structured framing of the female body, reinforcing historical inequalities. The study proposes an integrated approach, that combines machine learning techniques with qualitative, manual discourse analysis.

[4] IDEAlign: Comparing Large Language Models to Human Experts in Open-ended Interpretive Annotations

Hyunji Nam,Lucia Langlois,James Malamut,Mei Tan,Dorottya Demszky

Main category: cs.CL

TL;DR: This paper introduces IDEAlign, a scalable method for evaluating LLM-generated annotations against expert human judgments, showing that LLMs prompted with IDEAlign align significantly better with experts than traditional metrics.

Details Motivation: The motivation is to address the lack of validated, scalable measures for evaluating whether LLM-generated annotations align with expert human judgments in interpretive annotation tasks, such as thematic analysis or student feedback. Method: The study introduces IDEAlign, a benchmarking paradigm for capturing expert similarity ratings using a 'pick-the-odd-one-out' triplet judgment task. It evaluates various similarity metrics, including vector-based metrics and LLM-as-a-judge via IDEAlign, against human benchmarks on two real-world educational datasets. Result: Vector-based similarity metrics largely fail to capture the nuanced dimensions of similarity meaningful to experts. Prompting LLMs via IDEAlign significantly improves alignment with expert judgments (9-30% increase) compared to traditional lexical and vector-based metrics. Conclusion: IDEAlign is a promising paradigm for evaluating LLMs against open-ended expert annotations at scale, which can inform the responsible deployment of LLMs in education and beyond. Abstract: Large language models (LLMs) are increasingly applied to open-ended, interpretive annotation tasks, such as thematic analysis by researchers or generating feedback on student work by teachers. These tasks involve free-text annotations requiring expert-level judgments grounded in specific objectives (e.g., research questions or instructional goals). Evaluating whether LLM-generated annotations align with those generated by expert humans is challenging to do at scale, and currently, no validated, scalable measure of similarity in ideas exists.
In this paper, we (i) introduce the scalable evaluation of interpretive annotation by LLMs as a critical and understudied task, (ii) propose IDEAlign, an intuitive benchmarking paradigm for capturing expert similarity ratings via a "pick-the-odd-one-out" triplet judgment task, and (iii) evaluate various similarity metrics, including vector-based ones (topic models, embeddings) and LLM-as-a-judge via IDEAlign, against these human benchmarks. Applying this approach to two real-world educational datasets (interpretive analysis and feedback generation), we find that vector-based metrics largely fail to capture the nuanced dimensions of similarity meaningful to experts. Prompting LLMs via IDEAlign significantly improves alignment with expert judgments (9-30% increase) compared to traditional lexical and vector-based metrics. These results establish IDEAlign as a promising paradigm for evaluating LLMs against open-ended expert annotations at scale, informing responsible deployment of LLMs in education and beyond.
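
The "pick-the-odd-one-out" comparison described above can be illustrated with a minimal sketch. It assumes each annotation is already represented as a numeric vector and that a metric picks as "odd" the item left out of the most similar pair; the function names and the cosine-similarity choice are illustrative assumptions, not the paper's protocol.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def odd_one_out(vectors):
    """Return the index (0, 1, or 2) of the item excluded from the most similar pair."""
    # Each tuple is (i, j, index left out of the pair).
    pairs = [(0, 1, 2), (0, 2, 1), (1, 2, 0)]
    best = max(pairs, key=lambda p: cosine(vectors[p[0]], vectors[p[1]]))
    return best[2]

def triplet_agreement(triplets, expert_choices):
    """Fraction of triplets where the metric's odd item matches the expert's pick."""
    hits = sum(odd_one_out(t) == c for t, c in zip(triplets, expert_choices))
    return hits / len(triplets)

# Toy data: two triplets of 2-d vectors and hypothetical expert picks.
triplets = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0], [0.1, 0.9]],
]
expert_choices = [2, 0]
print(triplet_agreement(triplets, expert_choices))
```

The same agreement score could be computed for any other similarity function, or for an LLM judge asked the triplet question directly, which makes the metrics comparable on one scale.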

[5] A-SEA3L-QA: A Fully Automated Self-Evolving, Adversarial Workflow for Arabic Long-Context Question-Answer Generation

Kesen Wang,Daulet Toibazar,Pedro J. Moreno

Main category: cs.CL

TL;DR: This paper introduces a self-evolving adversarial workflow for enhancing long-context QA generation in Arabic, using an automated system of LVLMs that improves performance without human input.

Details Motivation: The motivation is to improve long-context Question-Answer (QA) Generation in Arabic by creating a system that continuously learns and enhances question difficulty and relevance. Method: The method involves an end-to-end, self-evolving adversarial workflow that uses a question generator, an evaluator, and a swarm of answer generators to iteratively refine performance without human intervention. Result: The result is a substantial improvement in the performance of Arabic LVLMs on long-context comprehension tasks, demonstrated on a new benchmark called AraLongBench. Conclusion: The paper concludes that their self-evolving adversarial workflow significantly enhances the long-context comprehension capabilities of Arabic Large Vision Language Models (LVLMs) and outperforms static pipelines. Abstract: We present an end-to-end, self-evolving adversarial workflow for long-context Question-Answer (QA) Generation in Arabic. By orchestrating multiple specialized LVLMs: a question generator, an evaluator, and a swarm of answer generators, our system iteratively refines its own performance without any human intervention. Starting from raw, multi-page Arabic documents across diverse domains, the question generator produces fine-grained, context-aware queries to be tackled by the answer generator swarm, and the evaluator assesses and feeds back quality metrics. This closed-loop cycle enables continuous learning: low-confidence outputs trigger automated re-generation and model updates, progressively enhancing question difficulty and relevance. Moreover, we set the quality metrics as a tunable hyperparameter, enabling question generation at controllable and customizable difficulty levels. 
We release AraLongBench, a large-scale Arabic benchmark of single- and multi-page challenges spanning hundreds of pages, and demonstrate that our self-evolving workflow substantially outperforms static pipelines, markedly boosting the long-context comprehension capabilities of leading Arabic Large Vision Language Models (LVLMs). Lastly, we also meticulously architect a fully automated agentic workflow for long-context Arabic document collection.

[6] Advancing Minority Stress Detection with Transformers: Insights from the Social Media Datasets

Santosh Chapagain,Cory J Cascalheira,Shah Muhammad Hamdi,Soukaina Filali Boubrahimi,Jillian R. Scheer

Main category: cs.CL

TL;DR: This study evaluates transformer-based models for detecting minority stress in online discourse, showing that graph-augmented approaches outperform traditional methods and offer potential for digital health interventions.

Details Motivation: Sexual and gender minorities face disproportionately poor health outcomes due to minority stress. This study aims to develop effective computational tools for detecting minority stress in online discourse, enabling better digital health interventions and public health policy. Method: The researchers benchmarked transformer models (ELECTRA, BERT, RoBERTa, BART) against traditional machine learning baselines and graph-augmented variants. They assessed zero-shot and few-shot learning paradigms and conducted experiments on two large Reddit datasets with 12,645 and 5,789 posts, repeated over five random seeds. Result: Graph-augmented transformer models showed consistently improved performance in minority stress detection compared to non-augmented models. Supervised fine-tuning with relational context outperformed zero and few-shot learning approaches. Graph structures enhanced the identification of linguistic markers like identity concealment, internalized stigma, and calls for support. Conclusion: The study concludes that graph-enhanced transformers offer the most reliable foundation for detecting minority stress in online discourse, outperforming traditional models and zero/few-shot approaches. Abstract: Individuals from sexual and gender minority groups experience disproportionately high rates of poor health outcomes and mental disorders compared to their heterosexual and cisgender counterparts, largely as a consequence of minority stress as described by Meyer's (2003) model. This study presents the first comprehensive evaluation of transformer-based architectures for detecting minority stress in online discourse. We benchmark multiple transformer models including ELECTRA, BERT, RoBERTa, and BART against traditional machine learning baselines and graph-augmented variants. We further assess zero-shot and few-shot learning paradigms to assess their applicability on underrepresented datasets. 
Experiments are conducted on the two largest publicly available Reddit corpora for minority stress detection, comprising 12,645 and 5,789 posts, and are repeated over five random seeds to ensure robustness. Our results demonstrate that integrating graph structure consistently improves detection performance across transformer-only models and that supervised fine-tuning with relational context outperforms zero and few-shot approaches. Theoretical analysis reveals that modeling social connectivity and conversational context via graph augmentation sharpens the models' ability to identify key linguistic markers such as identity concealment, internalized stigma, and calls for support, suggesting that graph-enhanced transformers offer the most reliable foundation for digital health interventions and public health policy.

[7] English Pronunciation Evaluation without Complex Joint Training: LoRA Fine-tuned Speech Multimodal LLM

Taekyung Ahn,Hosung Nam

Main category: cs.CL

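TL;DR: This study presents a LoRA-based fine-tuning method that lets a multimodal large language model perform automatic pronunciation assessment and mispronunciation detection simultaneously, without complex architectural changes, simplifying training and improving efficiency.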

Details Motivation: Traditional Automatic Pronunciation Assessment (APA) and Mispronunciation Detection and Diagnosis (MDD) tasks usually require complex architectural changes or separate training pipelines, which motivates a simpler and more efficient training approach. Method: Fine-tunes Microsoft's Phi-4-multimodal-instruct with Low-Rank Adaptation (LoRA), training and evaluating on the Speechocean762 dataset. Result: The model's predicted pronunciation scores show a strong Pearson correlation with human scores (PCC > 0.7), while achieving low Word Error Rate (WER) and Phoneme Error Rate (PER) (both < 0.15). Fine-tuning only the LoRA layers achieves performance comparable to fine-tuning all audio layers. Conclusion: Fine-tuning a Multimodal Large Language Model (MLLM) with LoRA makes it possible to build an integrated pronunciation assessment system without full fine-tuning, offering more accessible, integrated, and effective Computer-Assisted Pronunciation Training (CAPT) technology for English L2 learners. Abstract: This study demonstrates that a Multimodal Large Language Model (MLLM) adapted via Low-Rank Adaptation (LoRA) can perform both Automatic Pronunciation Assessment (APA) and Mispronunciation Detection and Diagnosis (MDD) simultaneously. Leveraging Microsoft's Phi-4-multimodal-instruct, our fine-tuning method eliminates the need for complex architectural changes or separate training procedures conventionally required for these distinct tasks. Fine-tuned on the Speechocean762 dataset, the pronunciation evaluation scores predicted by the model exhibited a strong Pearson Correlation Coefficient (PCC > 0.7) with human-assigned scores, while achieving low Word Error Rate (WER) and Phoneme Error Rate (PER) (both < 0.15). Notably, fine-tuning only the LoRA layers was sufficient to achieve performance levels comparable to those achieved by fine-tuning all audio layers. This research highlights that an integrated pronunciation assessment system can be established by adapting large multimodal models without full fine-tuning, utilizing a significantly simpler training methodology compared to previous joint models designed for simultaneous APA and MDD. This efficient LoRA-based approach paves the way for more accessible, integrated, and effective Computer-Assisted Pronunciation Training (CAPT) technologies for English L2 learners.
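
For readers unfamiliar with LoRA, the following is a minimal sketch of the general idea: a frozen weight matrix W is augmented with a trainable low-rank product B·A, scaled by alpha / r. It uses plain Python lists and toy dimensions; the function names are hypothetical, and nothing here reflects the specific layers or hyperparameters used in the paper.

```python
def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x).

    W: frozen weights, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    r = len(A)  # the rank is the number of rows of A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def trainable_params(d_out, d_in, r):
    """Parameters trained by LoRA versus by updating the full matrix."""
    return r * (d_in + d_out), d_out * d_in

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights
A = [[1.0, 1.0]]               # rank r = 1
B_zero = [[0.0], [0.0]]        # a zero B leaves the base output unchanged
print(lora_forward(W, A, B_zero, [2.0, 3.0]))
print(trainable_params(d_out=1024, d_in=1024, r=8))
```

Initializing B to zero is the usual convention, so training starts from the unmodified base model and only the small A and B matrices receive gradient updates.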

[8] Decoding the Rule Book: Extracting Hidden Moderation Criteria from Reddit Communities

Youngwoo Kim,Himanshu Beniwal,Steven L. Johnson,Thomas Hartvigsen

Main category: cs.CL

TL;DR: This paper introduces an interpretable architecture to extract implicit content moderation criteria from historical data, revealing transparent insights and undocumented patterns in community-specific moderation practices.

Details Motivation: The motivation stems from the need for transparency and explicit classification criteria in content moderation systems, especially since many online communities operate with implicit and diverse standards. Method: The method involves creating a criteria matrix by representing moderation standards as score tables of lexical expressions linked to content removal, enabling systematic comparisons between communities. Result: The experiments show that the extracted lexical patterns can replicate the performance of neural moderation models while uncovering variations in the enforcement of norms and previously undocumented moderation patterns. Conclusion: This paper concludes that the proposed interpretable architecture successfully identifies and extracts implicit moderation criteria from historical data, providing transparent insights into content moderation practices across different online communities. Abstract: Effective content moderation systems require explicit classification criteria, yet online communities like subreddits often operate with diverse, implicit standards. This work introduces a novel approach to identify and extract these implicit criteria from historical moderation data using an interpretable architecture. We represent moderation criteria as score tables of lexical expressions associated with content removal, enabling systematic comparison across different communities. Our experiments demonstrate that these extracted lexical patterns effectively replicate the performance of neural moderation models while providing transparent insights into decision-making processes. The resulting criteria matrix reveals significant variations in how seemingly shared norms are actually enforced, uncovering previously undocumented moderation patterns including community-specific tolerances for language, features for topical restrictions, and underlying subcategories of the toxic speech classification.
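
As a rough illustration of what a "score table of lexical expressions associated with content removal" might look like, the sketch below assigns each token a smoothed log-odds of appearing in removed versus kept posts and scores new text by summing the entries. This is a toy stand-in under stated assumptions: single-word tokens, whitespace tokenization, add-one smoothing, and hypothetical function names. The paper's interpretable architecture is not described in the abstract and may work quite differently.

```python
import math
from collections import Counter

def build_score_table(posts, smoothing=1.0):
    """Map each token to log((removed_count + s) / (kept_count + s)).

    posts: iterable of (text, was_removed) pairs from historical moderation data.
    """
    removed, kept = Counter(), Counter()
    for text, was_removed in posts:
        # Count each token once per post.
        for token in set(text.lower().split()):
            (removed if was_removed else kept)[token] += 1
    vocab = set(removed) | set(kept)
    return {
        t: math.log((removed[t] + smoothing) / (kept[t] + smoothing))
        for t in vocab
    }

def score(text, table):
    """Sum table entries for the tokens in text; unseen tokens contribute 0."""
    return sum(table.get(t, 0.0) for t in set(text.lower().split()))

# Toy moderation history (invented for illustration).
history = [
    ("buy cheap pills", True),
    ("cheap pills here", True),
    ("nice photo", False),
    ("nice cheap camera", False),
]
table = build_score_table(history)
print(sorted(table.items(), key=lambda kv: -kv[1])[:3])
```

Building one such table per community gives directly comparable rows, which is the kind of cross-community comparison the criteria matrix in the abstract is meant to support.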

[9] ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly

Kimihiro Hasegawa,Wiradee Imrattanatrai,Masaki Asada,Susan Holm,Yuran Wang,Vincent Zhou,Ken Fukuda,Teruko Mitamura

Main category: cs.CL

TL;DR: The paper introduces ProMQA-Assembly, a new multimodal QA dataset for evaluating assistants in assembly tasks, highlighting the need for better models in this domain.

Details Motivation: The motivation stems from the lack of practical testbeds for evaluating assistants in assembly tasks, which limits the development and application of such systems in both everyday and industrial settings. Method: The paper describes the creation of a new multimodal QA dataset called ProMQA-Assembly. This dataset was developed using a semi-automated QA annotation approach enhanced with fine-grained action labels. Instruction task graphs were also created for benchmarking and verification purposes. Result: The result is the ProMQA-Assembly dataset consisting of 391 QA pairs, which supports the evaluation of multimodal models in understanding assembly activities. Benchmarking shows that current models have significant room for improvement. Conclusion: The paper concludes that there is significant room for improvement in current multimodal models when applied to procedural-activity assistance, particularly in assembly tasks. Abstract: Assistants on assembly tasks have a large potential to benefit humans from everyday tasks to industrial settings. However, no testbeds support application-oriented system evaluation in a practical setting, especially in assembly. To foster the development, we propose a new multimodal QA dataset on assembly activities. Our dataset, ProMQA-Assembly, consists of 391 QA pairs that require the multimodal understanding of human-activity recordings and their instruction manuals in an online-style manner. In the development, we adopt a semi-automated QA annotation approach, where LLMs generate candidates and humans verify them, as a cost-effective method, and further improve it by integrating fine-grained action labels to diversify question types. Furthermore, we create instruction task graphs for the target tasks of assembling toy vehicles. These newly created task graphs are used in our benchmarking experiment, as well as to facilitate the human verification process in the QA annotation. 
Utilizing our dataset, we benchmark models, including competitive proprietary multimodal models. Our results suggest great room for improvement for the current models. We believe our new evaluation dataset can contribute to the further development of procedural-activity assistants.

[10] DiaCBT: A Long-Periodic Dialogue Corpus Guided by Cognitive Conceptualization Diagram for CBT-based Psychological Counseling

Yougen Zhou,Ningning Zhou,Qin Chen,Jie Zhou,Aimin Zhou,Liang He

Main category: cs.CL

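TL;DR: This study builds a dialogue dataset based on cognitive behavioral therapy and validates its effectiveness in improving the psychological counseling abilities of LLMs.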

Details Motivation: Psychotherapy reaches too few people because of social stigma and the limited availability of therapists; LLMs offer a possible way to expand access to mental health services. Method: Constructs a long-periodic dialogue corpus based on cognitive behavioral therapy and develops a comprehensive evaluation framework. Result: DiaCBT improves LLMs' performance in emulating psychologists, demonstrating its effectiveness for training counseling agents. Conclusion: DiaCBT effectively enhances LLMs' ability to emulate CBT experts and has potential for training more professional counseling agents. Abstract: Psychotherapy reaches only a small fraction of individuals suffering from mental disorders due to social stigma and the limited availability of therapists. Large language models (LLMs), when equipped with professional psychotherapeutic skills, offer a promising solution to expand access to mental health services. However, the lack of psychological conversation datasets presents significant challenges in developing effective psychotherapy-guided conversational agents. In this paper, we construct a long-periodic dialogue corpus for counseling based on cognitive behavioral therapy (CBT). Our curated dataset includes multiple sessions for each counseling and incorporates cognitive conceptualization diagrams (CCDs) to guide client simulation across diverse scenarios. To evaluate the utility of our dataset, we train an in-depth counseling model and present a comprehensive evaluation framework to benchmark it against established psychological criteria for CBT-based counseling. Results demonstrate that DiaCBT effectively enhances LLMs' ability to emulate psychologists with CBT expertise, underscoring its potential for training more professional counseling agents.

[11] Mitigating Data Imbalance in Automated Speaking Assessment

Fong-Chun Tsai,Kuan-Tang Huang,Bi-Cheng Yan,Tien-Hong Lo,Berlin Chen

Main category: cs.CL

TL;DR: This paper proposes the BLV loss to enhance fairness and accuracy in automated speaking assessment models by addressing class imbalance.

Details Motivation: ASA models often produce biased predictions due to class imbalance, which needs to be addressed. Method: A novel training objective called BLV loss is proposed to address class imbalance in ASA models. Result: Experiments on the ICNALE dataset show that the BLV loss significantly improves classification accuracy and fairness when applied to a BERT-based model. Conclusion: The BLV loss improves the performance and fairness of ASA models on imbalanced data. Abstract: Automated Speaking Assessment (ASA) plays a crucial role in evaluating second-language (L2) learners' proficiency. However, ASA models often suffer from class imbalance, leading to biased predictions. To address this, we introduce a novel objective for training ASA models, dubbed the Balancing Logit Variation (BLV) loss, which perturbs model predictions to improve feature representation for minority classes without modifying the dataset. Evaluations on the ICNALE benchmark dataset show that integrating the BLV loss into a celebrated text-based (BERT) model significantly enhances classification accuracy and fairness, making automated speech evaluation more robust for diverse learners.
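
The abstract says only that the BLV loss "perturbs model predictions" in favor of minority classes. One plausible reading, sketched below, adds non-negative Gaussian noise to each class logit during training, with a noise scale that grows as the class gets rarer. The scaling rule (normalized log inverse frequency), the function names, and the toy counts are all assumptions for illustration; the paper's exact formulation may differ.

```python
import math
import random

def class_scales(counts):
    """Per-class noise scale in [0, 1]; the rarest class gets 1.0.

    Assumed rule: log(total) - log(count), normalized by its maximum.
    """
    total = sum(counts)
    raw = [math.log(total) - math.log(n) for n in counts]
    top = max(raw)
    return [r / top if top > 0 else 0.0 for r in raw]

def perturb_logits(logits, counts, sigma=1.0, rng=None):
    """Add scale_c * |N(0, sigma)| to each class logit (training time only)."""
    rng = rng or random.Random()
    scales = class_scales(counts)
    return [z + s * abs(rng.gauss(0.0, sigma)) for z, s in zip(logits, scales)]

counts = [900, 90, 10]            # hypothetical samples per proficiency level
logits = [2.0, 0.5, -1.0]         # hypothetical model outputs for one example
noisy = perturb_logits(logits, counts, sigma=0.5, rng=random.Random(0))
print(class_scales(counts))
print(noisy)
```

The perturbed logits would then be passed to an ordinary cross-entropy loss; at inference time the unperturbed logits are used, so the dataset and the model architecture stay unchanged.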

[12] Training LLMs to be Better Text Embedders through Bidirectional Reconstruction

Chang Su,Dengliang Shi,Siyuan Huang,Jintao Du,Changhua Meng,Yu Cheng,Weiqiang Wang,Zhouhan Lin

Main category: cs.CL

TL;DR: A new training stage for LLMs improves text embedding performance, achieving state-of-the-art results on the MTEB benchmark.

Details Motivation: Existing LLM-based text embeddings using final tokens like [EOS] are limited because these tokens are not trained to capture whole context semantics, especially affecting retrieval and re-ranking tasks. Method: The method introduces a new training stage with bidirectional generative reconstruction tasks, EBQ2D and EBD2Q, to enrich the final token embedding semantics. Result: Experimental results show significant improvements on the Massive Text Embedding Benchmark (MTEB) across various LLM base models and scales. Conclusion: The proposed training stage significantly improves LLM performance on text embedding tasks, establishing new state-of-the-art results. Abstract: Large language models (LLMs) have increasingly been explored as powerful text embedders. Existing LLM-based text embedding approaches often leverage the embedding of the final token, typically a reserved special token such as [EOS]. However, these tokens have not been intentionally trained to capture the semantics of the whole context, limiting their capacity as text embeddings, especially for retrieval and re-ranking tasks. We propose to add a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs bidirectional generative reconstruction tasks, namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query), which interleave to anchor the [EOS] embedding and reconstruct either side of Query-Document pairs. Experimental results demonstrate that our additional training stage significantly improves LLM performance on the Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.

[13] Structure-Learnable Adapter Fine-Tuning for Parameter-Efficient Large Language Models

Ming Gong,Yingnan Deng,Nia Qi,Yujun Zou,Zhihao Xue,Yun Zi

Main category: cs.CL

TL;DR: This paper proposes an adapter-based fine-tuning approach that dynamically constructs efficient task-specific substructures, improving parameter efficiency and model adaptability across multiple natural language understanding tasks.

Details Motivation: The paper addresses the challenges of parameter redundancy, rigid structure, and limited task adaptability in fine-tuning large language models. Method: The method introduces differentiable gating functions and structural sparsity control variables to automatically optimize adapter insertion points, activation paths, and module combinations. It employs a structure search mechanism to dynamically construct task-specific efficient substructures during training while keeping the backbone parameters frozen. Result: Experimental results show that the proposed method outperforms mainstream parameter-efficient tuning techniques on multiple tasks. It achieves improved performance with better accuracy, compression rate, and robustness to noise and perturbation. Conclusion: The proposed adapter-based fine-tuning method improves parameter utilization and representational capacity by enabling flexible model structure adjustment in multi-task settings. It outperforms mainstream parameter-efficient tuning techniques, achieving a better balance among accuracy, compression rate, and robustness. Abstract: This paper addresses the issues of parameter redundancy, rigid structure, and limited task adaptability in the fine-tuning of large language models. It proposes an adapter-based fine-tuning method built on a structure-learnable mechanism. By introducing differentiable gating functions and structural sparsity control variables, the method enables automatic optimization of adapter insertion points, activation paths, and module combinations. This allows the model to adjust its structure flexibly in multi-task settings to match different task characteristics. With the backbone parameters kept frozen, the method uses a structure search mechanism to guide the dynamic construction of task-specific efficient substructures during training. This significantly improves parameter utilization and representational capacity. 
In addition, the paper designs a set of sensitivity analysis experiments to systematically evaluate the effects of sparsity weight, noise injection ratio, and data perturbation on model performance. These experiments verify the stability and robustness of the proposed method across various multi-task natural language understanding tasks. The experimental results show that the proposed method outperforms mainstream parameter-efficient tuning techniques on multiple tasks. It achieves a better balance among accuracy, compression rate, and robustness to noise and perturbation.

[14] A Long Short-Term Memory (LSTM) Model for Business Sentiment Analysis Based on Recurrent Neural Network

Md. Jahidul Islam Razin,Md. Abdul Karim,M. F. Mridha,S M Rafiuddin,Tahira Alam

Main category: cs.CL

TL;DR: This paper proposes a modified LSTM-based RNN model for business sentiment analysis that achieves 91.33% accuracy, helping businesses better understand customer feedback and improve marketing strategies.

Details Motivation: The motivation of this paper is to improve sentiment analysis techniques for business purposes, particularly to address the limitations of traditional RNNs in capturing long-term dependencies due to the vanishing gradient problem. Method: A modified long short-term memory (LSTM) model, a type of recurrent neural network (RNN), is used to address the vanishing gradient problem in traditional RNNs. The model is trained and tested using a product review dataset, with 70% of the data used for training and 30% for testing. Result: The proposed modified RNN model (LSTM) achieved an accuracy of 91.33%, demonstrating superior performance compared to conventional RNN models in analyzing customer product reviews. Conclusion: The modified RNN model (LSTM) outperforms conventional RNN models in business sentiment analysis, achieving an accuracy of 91.33%. This approach can help businesses and e-commerce platforms assess customer feedback and improve their marketing strategies. Abstract: Business sentiment analysis (BSA) is one of the significant and popular topics of natural language processing. It is one kind of sentiment analysis techniques for business purposes. Different categories of sentiment analysis techniques like lexicon-based techniques and different types of machine learning algorithms are applied for sentiment analysis on different languages like English, Hindi, Spanish, etc. In this paper, long short-term memory (LSTM) is applied for business sentiment analysis, where a recurrent neural network is used. An LSTM model is used in a modified approach to prevent the vanishing gradient problem rather than applying the conventional recurrent neural network (RNN). To apply the modified RNN model, product review dataset is used. In this experiment, 70% of the data is trained for the LSTM and the rest 30% of the data is used for testing.
The result of this modified RNN model is compared with other conventional RNN models, and a comparison is made among the results. It is noted that the proposed model performs better than the other conventional RNN models. Here, the proposed model, i.e., the modified RNN model approach has achieved around 91.33% of accuracy. By applying this model, any business company or e-commerce business site can identify the feedback from their customers about different types of products that customers like or dislike. Based on the customer reviews, a business company or e-commerce platform can evaluate its marketing strategy.

[15] Measuring Scalar Constructs in Social Science with LLMs

Hauke Licht,Rupak Sarkar,Patrick Y. Wu,Pranav Goel,Niklas Stoehr,Elliott Ash,Alexander Miserlis Hoyle

Main category: cs.CL

TL;DR: This paper evaluates methods for measuring continuous language constructs using large language models and finds that approaches like token-probability-weighted scoring and finetuning smaller models can improve measurement accuracy.

Details Motivation: The motivation behind this study is the need to accurately measure continuous scalar constructs in language, such as complexity or emotionality, using large language models (LLMs) in the field of social science. Method: The researchers evaluated four approaches to scalar construct measurement using LLMs: unweighted direct pointwise scoring, aggregation of pairwise comparisons, token-probability-weighted pointwise scoring, and finetuning. They used multiple datasets from the political science literature for their evaluation. Result: The results showed that direct pointwise scoring by LLMs produces discontinuous distributions. Pairwise comparisons improved measurement quality, but the best results came from weighting pointwise scores by token probability. Finetuning smaller models with a limited number of training pairs was also found to match or exceed the performance of prompted LLMs. Conclusion: The study concludes that while LLMs can be used to measure scalar constructs in social science, their effectiveness can be significantly enhanced by using techniques like token-probability-weighted pointwise scoring or finetuning smaller models. Abstract: Many constructs that characterize language, like its complexity or emotionality, have a naturally continuous semantic structure; a public speech is not just "simple" or "complex," but exists on a continuum between extremes. Although large language models (LLMs) are an attractive tool for measuring scalar constructs, their idiosyncratic treatment of numerical outputs raises questions of how to best apply them. We address these questions with a comprehensive evaluation of LLM-based approaches to scalar construct measurement in social science. Using multiple datasets sourced from the political science literature, we evaluate four approaches: unweighted direct pointwise scoring, aggregation of pairwise comparisons, token-probability-weighted pointwise scoring, and finetuning. 
Our study yields actionable findings for applied researchers. First, LLMs prompted to generate pointwise scores directly from texts produce discontinuous distributions with bunching at arbitrary numbers. The quality of the measurements improves with pairwise comparisons made by LLMs, but it improves even more by taking pointwise scores and weighting them by token probability. Finally, finetuning smaller models with as few as 1,000 training pairs can match or exceed the performance of prompted LLMs.
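The token-probability-weighted pointwise scoring mentioned above can be illustrated with a minimal sketch. It assumes the LLM API returns log-probabilities for the candidate score tokens at the answer position; the function name `weighted_score`, the 1-9 scale, and the input dictionary are hypothetical and not taken from the paper.

```python
import math


def weighted_score(token_logprobs, scale=range(1, 10)):
    """Expected score under the model's distribution over score tokens.

    token_logprobs: dict mapping candidate tokens (e.g. "3", " 4") to
    log-probabilities. Tokens that are not on the scale are ignored and
    the remaining probability mass is renormalised (an assumption).
    """
    allowed = {str(s): float(s) for s in scale}
    mass = {}
    for token, logprob in token_logprobs.items():
        key = token.strip()
        if key in allowed:
            value = allowed[key]
            mass[value] = mass.get(value, 0.0) + math.exp(logprob)
    total = sum(mass.values())
    if total == 0.0:
        raise ValueError("no score tokens found among the candidates")
    return sum(value * p for value, p in mass.items()) / total


# Hypothetical top-token distribution for one text.
example = {"3": math.log(0.5), "4": math.log(0.25), "5": math.log(0.25)}
print(weighted_score(example))  # close to 3.75, versus a greedy answer of 3
```

Taking the expectation rather than the most likely token yields a continuous value, which is one plausible reason this approach avoids the bunching at arbitrary numbers that the abstract reports for direct scoring.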

[16] From Evaluation to Defense: Constructing Persistent Edit-Based Fingerprints for Large Language Models

Yue Li,Xin Yi,Dongsheng Shi,Yongyi Cui,Gerard de Melo,Xiaoling Wang,Linlin Wang

Main category: cs.CL

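TL;DR: This paper proposes FSFT and uses knowledge editing to inject fingerprints into large language models, effectively reducing fingerprint degradation and improving performance.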

Details Motivation: Conventional fingerprint injection through instruction tuning degrades model performance, consumes substantial resources, and persists poorly after model modifications, so a more lightweight and persistent approach is needed. Method: Knowledge editing is applied to fingerprint injection for the first time, and Fingerprint Subspace-aware Fine-Tuning (FSFT) is proposed to reduce fingerprint degradation. Result: FSFT outperforms conventional fine-tuning by 10% even in the worst case, but the high feature similarity between fingerprints and similar texts remains a challenge. Conclusion: Knowledge editing offers a lightweight and efficient approach to fingerprint injection for large language models, but more robust and fine-grained methods are needed. Abstract: The intellectual property (IP) protection of Large Language Models (LLMs) is increasingly critical. Injecting specialized fingerprints into LLMs through instruction tuning is a common IP protection technique. However, this may significantly degrade model performance, requires substantial computational resources, and exhibits poor persistence under model modifications. We argue that knowledge editing offers a lightweight alternative that is more suitable for fingerprint injection. Accordingly, we apply knowledge editing to fingerprint injection for the first time and demonstrate its strong capability. Despite using scrambled text as fingerprints to prevent them from being overwritten during fine-tuning, degradation still occurs under large-scale fine-tuning. To address this, we propose Fingerprint Subspace-aware Fine-Tuning (FSFT), which reduces fingerprint degradation by constraining the update of the fingerprint subspace. The performance of FSFT exceeds fine-tuning by 10% even in the worst-case scenario. Additionally, we observe that the fingerprint-injected models struggle to distinguish between fingerprints and similar texts due to the high similarity of their features. This finding underscores the urgent need for more robust and fine-grained fingerprinting injection methods for LLMs.

[17] An experimental and computational study of an Estonian single-person word naming

Kaidi Lõo,Arvi Tavast,Maria Heitmeier,Harald Baayen

Main category: cs.CL

TL;DR: The study shows that DLM-based measures predict lexical processing well, indicating the significant role of meaning in word naming tasks.

Details Motivation: To determine if DLM-based measures are predictive of lexical processing response variables and how they compare to classical predictors. Method: A large-scale single-subject experiment combining the word naming task with eye-tracking and analyzed using a generalized additive model. Result: DLM-based measures are powerful predictors for lexical processing, classical predictors often provide more precise fits, and meaning is heavily involved in the naming task. Conclusion: DLM-based measures are powerful predictors for lexical processing, and the study reveals that meaning is heavily involved in the word naming task. Abstract: This study investigates lexical processing in Estonian. A large-scale single-subject experiment is reported that combines the word naming task with eye-tracking. Five response variables (first fixation duration, total fixation duration, number of fixations, word naming latency, and spoken word duration) are analyzed with the generalized additive model. Of central interest is the question of whether measures for lexical processing generated by a computational model of the mental lexicon (the Discriminative Lexicon Model, DLM) are predictive for these response variables, and how they compare to classical predictors such as word frequency, neighborhood size, and inflectional paradigm size. Computational models were implemented both with linear and deep mappings. 
Central findings are, first, that DLM-based measures are powerful predictors for lexical processing, second, that DLM-measures using deep learning are not necessarily more precise predictors of lexical processing than DLM-measures using linear mappings, third, that classical predictors tend to provide somewhat more precise fits compared to DLM-based predictors (except for total fixation duration, where the two provide equivalent goodness of fit), and fourth, that in the naming task lexical variables are not predictive for first fixation duration and the total number of fixations. As the DLM works with mappings from form to meaning, the predictivity of DLM-based measures for total fixation duration, naming latencies, and spoken word duration indicates that meaning is heavily involved in the present word naming task.

[18] Expanding the WMT24++ Benchmark with Rumantsch Grischun, Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader

Jannis Vamvas,Ignacio Pérez Prat,Not Battesta Soliva,Sandra Baltermia-Guetg,Andrina Beeli,Simona Beeli,Madlaina Capeder,Laura Decurtins,Gian Peder Gregori,Flavia Hobi,Gabriela Holderegger,Arina Lazzarini,Viviana Lazzarini,Walter Rosselli,Bettina Vital,Anna Rutkiewicz,Rico Sennrich

Main category: cs.CL

TL;DR: This paper introduces a benchmark for machine translation evaluation in Romansh, showing that translation into Romansh is still a challenge.

Details Motivation: The Romansh language has limited resources for evaluating machine translation, necessitating the development of a specific benchmark. Method: The study used the WMT24++ benchmark to create human reference translations for six varieties of Romansh. Existing MT systems and LLMs were evaluated automatically. Result: Existing MT systems and LLMs perform relatively well on translating from Romansh into German, but struggle with translation into Romansh. Conclusion: Translation into Romansh remains challenging, even though translation from Romansh to German is relatively well managed. Abstract: The Romansh language, spoken in Switzerland, has limited resources for machine translation evaluation. In this paper, we present a benchmark for six varieties of Romansh: Rumantsch Grischun, a supra-regional variety, and five regional varieties: Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader. Our reference translations were created by human translators based on the WMT24++ benchmark, which ensures parallelism with more than 55 other languages. An automatic evaluation of existing MT systems and LLMs shows that translation out of Romansh into German is handled relatively well for all the varieties, but translation into Romansh is still challenging.

[19] Domain Adaptation of LLMs for Process Data

Rafael Seidi Oyamada,Jari Peeperkorn,Jochen De Weerdt,Johannes De Smedt

Main category: cs.CL

TL;DR: This paper explores the use of pretrained Large Language Models (LLMs) in Process Mining without reformulating data into natural language, showing improved performance and efficiency in Predictive Process Monitoring tasks.

Details Motivation: The motivation stems from the observation that current applications of LLMs in Process Mining focus on prompt engineering or transforming event logs into narrative-style datasets, while LLMs are inherently good at generating sequences of tokens, which aligns with the objectives in Process Mining. Method: The study focuses on parameter-efficient fine-tuning techniques to adapt pretrained LLMs to process data without reformulating it into natural language. The experimental setup involves Predictive Process Monitoring (PPM) with both single- and multi-task predictions. Result: The results show that the fine-tuned LLMs outperform state-of-the-art RNN approaches and recent narrative-style-based solutions, particularly in multi-task prediction. Additionally, the models converge faster and require less hyperparameter optimization. Conclusion: This study concludes that direct adaptation of pretrained LLMs to process data without natural language reformulation can significantly improve predictive performance in Predictive Process Monitoring, especially in multi-task settings. Abstract: In recent years, Large Language Models (LLMs) have emerged as a prominent area of interest across various research domains, including Process Mining (PM). Current applications in PM have predominantly centered on prompt engineering strategies or the transformation of event logs into narrative-style datasets, thereby exploiting the semantic capabilities of LLMs to address diverse tasks. In contrast, this study investigates the direct adaptation of pretrained LLMs to process data without natural language reformulation, motivated by the fact that these models excel in generating sequences of tokens, similar to the objective in PM. More specifically, we focus on parameter-efficient fine-tuning techniques to mitigate the computational overhead typically associated with such models. 
Our experimental setup focuses on Predictive Process Monitoring (PPM), and considers both single- and multi-task predictions. The results demonstrate a potential improvement in predictive performance over state-of-the-art recurrent neural network (RNN) approaches and recent narrative-style-based solutions, particularly in the multi-task setting. Additionally, our fine-tuned models exhibit faster convergence and require significantly less hyperparameter optimization.

[20] SinhalaMMLU: A Comprehensive Benchmark for Evaluating Multitask Language Understanding in Sinhala

Ashmari Pramodya,Nirasha Nelki,Heshan Shalinda,Chamila Liyanage,Yusuke Sakai,Randil Pushpananda,Ruvan Weerasinghe,Hidetaka Kamigaito,Taro Watanabe

Main category: cs.CL

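TL;DR: This paper introduces SinhalaMMLU, the first multiple-choice question answering benchmark for the low-resource language Sinhala, and evaluates large language models on it; the results show that current models still have substantial room for improvement on culturally specific content.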

Details Motivation: Existing evaluations of large language models focus mainly on global or English-centric subjects and neglect low-resource languages and culturally specific content, while existing multilingual benchmarks often rely on automatic translation, which can introduce errors and misrepresent the original cultural context. Method: A multiple-choice question answering benchmark for Sinhala, SinhalaMMLU, is constructed, and 26 large language models are evaluated on it. Result: On SinhalaMMLU, Claude 3.5 Sonnet and GPT-4o perform best, with average accuracies of 67% and 62% respectively, but overall model performance remains limited, especially in culturally rich domains such as the Humanities. Conclusion: Current large language models still have substantial room for improvement in adapting to low-resource and culturally specific contexts, particularly in culturally rich domains such as the Humanities. Abstract: Large Language Models (LLMs) demonstrate impressive general knowledge and reasoning abilities, yet their evaluation has predominantly focused on global or anglocentric subjects, often neglecting low-resource languages and culturally specific content. While recent multilingual benchmarks attempt to bridge this gap, many rely on automatic translation, which can introduce errors and misrepresent the original cultural context. To address this, we introduce SinhalaMMLU, the first multiple-choice question answering benchmark designed specifically for Sinhala, a low-resource language. The dataset includes over 7,000 questions spanning secondary to collegiate education levels, aligned with the Sri Lankan national curriculum, and covers six domains and 30 subjects, encompassing both general academic topics and culturally grounded knowledge. We evaluate 26 LLMs on SinhalaMMLU and observe that, while Claude 3.5 Sonnet and GPT-4o achieve the highest average accuracies at 67% and 62% respectively, overall model performance remains limited. In particular, models struggle in culturally rich domains such as the Humanities, revealing substantial room for improvement in adapting LLMs to low-resource and culturally specific contexts.

[21] Comparison of End-to-end Speech Assessment Models for the NOCASA 2025 Challenge

Aleksei Žavoronkov,Tanel Alumäe

Main category: cs.CL

TL;DR: This paper introduces three models for assessing children's Norwegian pronunciation, with the GOP-CTC-based model achieving the best results.

Details Motivation: The paper is motivated by the need for automatic word-level pronunciation assessment for children learning Norwegian as a second language, as part of the NOCASA 2025 Challenge. Method: The paper explores three models: an encoder-decoder Siamese architecture (E2E-R), a prefix-tuned direct classification model using pretrained wav2vec2.0 representations, and a novel model integrating alignment-free GOP features via CTC. A weighted ordinal cross-entropy loss was also introduced. Result: The proposed GOP-CTC-based model achieved the highest performance in pronunciation assessment, outperforming other methods and baseline models. Conclusion: Among the three models, the GOP-CTC-based model performed best, surpassing challenge baselines and achieving top leaderboard scores. Abstract: This paper presents an analysis of three end-to-end models developed for the NOCASA 2025 Challenge, aimed at automatic word-level pronunciation assessment for children learning Norwegian as a second language. Our models include an encoder-decoder Siamese architecture (E2E-R), a prefix-tuned direct classification model leveraging pretrained wav2vec2.0 representations, and a novel model integrating alignment-free goodness-of-pronunciation (GOP) features computed via CTC. We introduce a weighted ordinal cross-entropy loss tailored for optimizing metrics such as unweighted average recall and mean absolute error. Among the explored methods, our GOP-CTC-based model achieved the highest performance, substantially surpassing challenge baselines and attaining top leaderboard scores.
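The two metrics named in the abstract, unweighted average recall (UAR) and mean absolute error (MAE), can be sketched in a few lines. This is an illustrative computation only: the integer score labels below are made up, and the challenge's official scoring script may differ in details such as which classes are included.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall; every class counts equally regardless of size."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, label in enumerate(y_true) if label == cls]
        hits = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)


def mean_absolute_error(y_true, y_pred):
    """Average distance between predicted and reference ordinal scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical pronunciation scores with an imbalanced class distribution.
reference = [1, 1, 1, 1, 2, 3]
predicted = [1, 1, 1, 2, 2, 1]
uar = unweighted_average_recall(reference, predicted)  # (0.75 + 1.0 + 0.0) / 3
mae = mean_absolute_error(reference, predicted)        # (0+0+0+1+0+2) / 6 = 0.5
print(uar, mae)
```

In this toy case plain accuracy is 4/6, while UAR is lower because the rare class 3 is never recovered. MAE additionally penalises the 3-to-1 confusion more than an adjacent-class error, which is presumably why a loss that is both class-weighted and ordinal suits these metrics.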

[22] LatPhon: Lightweight Multilingual G2P for Romance Languages and English

Luis Felipe Chary,Miguel Arjona Ramirez

Main category: cs.CL

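TL;DR: LatPhon is an efficient multilingual G2P conversion model for speech processing in Latin-script languages, offering a low error rate and a small memory footprint.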

Details Motivation: Grapheme-to-phoneme (G2P) conversion is a key front-end for text-to-speech (TTS), automatic speech recognition (ASR), and related systems, and a solution that works across multiple Latin-script languages is needed. Method: A Transformer model is jointly trained on six Latin-script languages and evaluated on the public ipa-dict corpus. Result: LatPhon attains a mean phoneme error rate of 3.5% on the public ipa-dict corpus, outperforming the baseline model, with a memory footprint of only 30 MB. Conclusion: LatPhon is a compact multilingual G2P model that can serve as a universal front-end for Latin-language speech pipelines. Abstract: Grapheme-to-phoneme (G2P) conversion is a key front-end for text-to-speech (TTS), automatic speech recognition (ASR), speech-to-speech translation (S2ST) and alignment systems, especially across multiple Latin-script languages. We present LatPhon, a 7.5M-parameter Transformer jointly trained on six such languages--English, Spanish, French, Italian, Portuguese, and Romanian. On the public ipa-dict corpus, it attains a mean phoneme error rate (PER) of 3.5%, outperforming the byte-level ByT5 baseline (5.4%) and approaching language-specific WFSTs (3.2%) while occupying 30 MB of memory, which makes on-device deployment feasible when needed. These results indicate that compact multilingual G2P can serve as a universal front-end for Latin-language speech pipelines.
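For readers unfamiliar with the metric, phoneme error rate is commonly computed as the Levenshtein distance between predicted and reference phoneme sequences divided by the reference length. The sketch below follows that common definition and pools errors over all words; whether the paper pools errors this way or averages per word or per language is not stated in the abstract, and the example pronunciations are invented.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two phoneme sequences (lists of symbols)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_symbol in enumerate(reference, start=1):
        current = [i]
        for j, hyp_symbol in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_symbol != hyp_symbol)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def phoneme_error_rate(pairs):
    """Total edit errors divided by total reference phonemes."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return errors / total


# Invented (reference, prediction) pairs: one substitution, one deletion.
pairs = [
    (["k", "æ", "t"], ["k", "a", "t"]),
    (["d", "ɔ", "g"], ["d", "ɔ"]),
]
print(phoneme_error_rate(pairs))  # 2 errors over 6 reference phonemes
```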

[23] AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?

Guibin Zhang,Junhao Wang,Junjie Chen,Wangchunshu Zhou,Kun Wang,Shuicheng Yan

Main category: cs.CL

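TL;DR: The paper proposes AgenTracer, a new automated framework, and AgenTracer-8B, a lightweight failure tracer trained with multi-granular reinforcement learning that efficiently diagnoses errors in multi-agent interactions, markedly improving the accuracy of failure attribution for LLM agents and boosting the performance of multi-agent systems.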

Details Motivation: LLM-based agentic systems are prone to failure because of their complexity, and current state-of-the-art reasoning LLMs generally achieve below 10% accuracy on failure attribution, so a new method is needed to improve the accuracy of failure attribution for LLM agents. Method: An automated framework called AgenTracer is proposed for annotating failed multi-agent trajectories via counterfactual replay and programmed fault injection, producing the curated dataset TracerTraj. Based on this resource, AgenTracer-8B, a lightweight failure tracer, is developed. Result: On the Who&When benchmark, AgenTracer-8B outperforms large proprietary LLMs such as Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%, and delivers 4.8-14.2% performance gains for multi-agent systems such as MetaGPT and MaAS. Conclusion: Trained with multi-granular reinforcement learning, AgenTracer-8B efficiently diagnoses errors in multi-agent interactions, clearly outperforms existing large proprietary LLMs, and provides actionable feedback to off-the-shelf multi-agent systems, setting a new standard for LLM agentic failure attribution. Abstract: Large Language Model (LLM)-based agentic systems, often comprising multiple models, complex tool invocations, and orchestration protocols, substantially outperform monolithic agents. Yet this very sophistication amplifies their fragility, making them more prone to system failure. Pinpointing the specific agent or step responsible for an error within long execution traces defines the task of agentic system failure attribution. Current state-of-the-art reasoning LLMs, however, remain strikingly inadequate for this challenge, with accuracy generally below 10%. To address this gap, we propose AgenTracer, the first automated framework for annotating failed multi-agent trajectories via counterfactual replay and programmed fault injection, producing the curated dataset TracerTraj. Leveraging this resource, we develop AgenTracer-8B, a lightweight failure tracer trained with multi-granular reinforcement learning, capable of efficiently diagnosing errors in verbose multi-agent interactions. On the Who&When benchmark, AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%, setting a new standard in LLM agentic failure attribution. More importantly, AgenTracer-8B delivers actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS with 4.8-14.2% performance gains, empowering self-correcting and self-evolving agentic AI.

[24] LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations

Daniela Gottesman,Alon Gilae-Dotan,Ido Cohen,Yoav Gur-Arieh,Marius Mosbach,Ori Yoran,Mor Geva

Main category: cs.CL

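TL;DR: LMEnt is a suite for studying knowledge acquisition in language models, comprising a knowledge-rich corpus, an efficient entity retrieval method, and multiple pretrained models.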

Details Motivation: Language models are increasingly used in real-world applications that require world knowledge, but the internal processes by which they turn data into knowledge and beliefs about the world remain unclear. Method: LMEnt provides a knowledge-rich pretraining corpus, an entity-based retrieval method, and 12 pretrained models with 4K intermediate checkpoints. Result: LMEnt's entity retrieval method outperforms previous approaches by as much as 80.4%, and the pretrained models perform comparably to popular open-source models on knowledge benchmarks. Conclusion: LMEnt is a suite for studying knowledge acquisition in language models during pretraining, supporting research on knowledge representations, plasticity, editing, and learning dynamics. Abstract: Language models (LMs) increasingly drive real-world applications that require world knowledge. However, the internal processes through which models turn data into representations of knowledge and beliefs about the world, are poorly understood. Insights into these processes could pave the way for developing LMs with knowledge representations that are more consistent, robust, and complete. To facilitate studying these questions, we present LMEnt, a suite for analyzing knowledge acquisition in LMs during pretraining. LMEnt introduces: (1) a knowledge-rich pretraining corpus, fully annotated with entity mentions, based on Wikipedia, (2) an entity-based retrieval method over pretraining data that outperforms previous approaches by as much as 80.4%, and (3) 12 pretrained models with up to 1B parameters and 4K intermediate checkpoints, with comparable performance to popular open-sourced models on knowledge benchmarks. Together, these resources provide a controlled environment for analyzing connections between entity mentions in pretraining and downstream performance, and the effects of causal interventions in pretraining data. We show the utility of LMEnt by studying knowledge acquisition across checkpoints, finding that fact frequency is key, but does not fully explain learning trends. We release LMEnt to support studies of knowledge in LMs, including knowledge representations, plasticity, editing, attribution, and learning dynamics.

[25] Learning Mechanism Underlying NLP Pre-Training and Fine-Tuning

Yarden Tzach,Ronit D. Gross,Ella Koresh,Shalom Rosner,Or Shpringer,Tal Halevi,Ido Kanter

Main category: cs.CL

TL;DR: The paper investigates how pre-training enhances NLP models, revealing that it improves token clustering and model accuracy through transformer layers, even though the training objective focuses on individual token identification.

Details Motivation: The motivation was to explore the mechanisms behind successful pre-training in NLP models and understand how pre-training impacts the fine-tuning process for specific classification tasks. Method: The research utilized the BERT-6 architecture, pre-trained on the Wikipedia dataset and fine-tuned on the FewRel and DBpedia classification tasks, analyzing the token-level performance and clustering effects. Result: Key findings include the increase in accuracy per token with its frequency, quantification of pre-training success through average token accuracy, clustering of tokens into strong match groups, enhancement of performance through transformer layers, and the independence of prediction confidence from input accuracy. Conclusion: The study concludes that pre-training in NLP models, similar to processes in image classification, enhances the model's ability to understand and generate language by creating higher-order language structures and improving fine-tuning accuracy. Abstract: Natural language processing (NLP) enables the understanding and generation of meaningful human language, typically using a pre-trained complex architecture on a large dataset to learn the language and next fine-tune its weights to implement a specific task. Twofold goals are examined; to understand the mechanism underlying successful pre-training and to determine the interplay between the pre-training accuracy and the fine-tuning of classification tasks. The following main results were obtained; the accuracy per token (APT) increased with its appearance frequency in the dataset, and its average over all tokens served as an order parameter to quantify pre-training success, which increased along the transformer blocks. Pre-training broke the symmetry among tokens and grouped them into finite, small, strong match token clusters, as inferred from the presented token confusion matrix. 
This feature was sharpened along the transformer blocks toward the output layer, enhancing its performance considerably compared with that of the embedding layer. Consequently, higher-order language structures were generated by pre-training, even though the learning cost function was directed solely at identifying a single token. These pre-training findings were reflected by the improved fine-tuning accuracy along the transformer blocks. Additionally, the output label prediction confidence was found to be independent of the average input APT, as the input meaning was preserved since the tokens are replaced primarily by strong match tokens. Finally, although pre-training is commonly absent in image classification tasks, its underlying mechanism is similar to that used in fine-tuning NLP classification tasks, hinting at its universality. The results were based on the BERT-6 architecture pre-trained on the Wikipedia dataset and fine-tuned on the FewRel and DBpedia classification tasks.
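One plausible reading of the accuracy per token (APT) and its average, used above as an order parameter, is sketched below. It assumes APT is the fraction of occurrences of each target token that the model predicts correctly, and that the order parameter is the unweighted mean over distinct tokens; the paper's exact definition may differ, and the function names and toy data are hypothetical.

```python
from collections import defaultdict


def accuracy_per_token(targets, predictions):
    """Map each distinct target token to its prediction accuracy."""
    seen = defaultdict(int)
    correct = defaultdict(int)
    for target, predicted in zip(targets, predictions):
        seen[target] += 1
        correct[target] += int(target == predicted)
    return {token: correct[token] / seen[token] for token in seen}


def average_apt(apt):
    """Unweighted mean over distinct tokens (assumed order parameter)."""
    return sum(apt.values()) / len(apt)


# Toy data: the frequent token is predicted well, the rare one is not.
targets = ["the", "the", "the", "cat"]
predictions = ["the", "the", "a", "dog"]
apt = accuracy_per_token(targets, predictions)
print(apt, average_apt(apt))  # "the" -> 2/3, "cat" -> 0.0, mean 1/3
```

Because each distinct token contributes equally to the mean, rare tokens with low accuracy pull the average down, which is consistent with the reported dependence of APT on appearance frequency.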

[26] Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges

Weiyuan Li,Xintao Wang,Siyu Yuan,Rui Xu,Jiangjie Chen,Qingqing Dong,Yanghua Xiao,Deqing Yang

Main category: cs.CL

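TL;DR: The paper studies how reliably large language models act as evaluators on complex tasks, constructs the ComplexEval benchmark, and reveals that models are significantly susceptible to biases induced by auxiliary information.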

Details Motivation: As large language models (LLMs) face increasingly complex and diverse tasks, evaluating them becomes challenging. The LLM-as-a-judge paradigm has emerged as a scalable solution, but its reliability on complex tasks remains understudied. Method: ComplexEval, a challenge benchmark, is constructed to systematically expose and quantify Auxiliary Information Induced Biases. Result: Six previously unexplored biases are validated across 12 basic and 3 advanced scenarios; models are found to be susceptible to these biases, with bias magnitude correlated with task complexity. Conclusion: All evaluated models are susceptible to these biases, bias magnitude increases with task complexity, and Large Reasoning Models show paradoxical vulnerability. Abstract: As large language models (LLMs) grow more capable, they face increasingly diverse and complex tasks, making reliable evaluation challenging. The paradigm of LLMs as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. Their reliability in complex tasks--where multi-faceted rubrics, unstructured reference answers, and nuanced criteria are critical--remains understudied. In this paper, we constructed ComplexEval, a challenge benchmark designed to systematically expose and quantify Auxiliary Information Induced Biases. We systematically investigated and validated 6 previously unexplored biases across 12 basic and 3 advanced scenarios. Key findings reveal: (1) all evaluated models exhibit significant susceptibility to these biases, with bias magnitude scaling with task complexity; (2) notably, Large Reasoning Models (LRMs) show paradoxical vulnerability. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals, paving the way for more general and robust evaluation models.

[27] Continuous Saudi Sign Language Recognition: A Vision Transformer Approach

Soukeina Elhassen,Lama Al Khuzayem,Areej Alhothali,Ohoud Alzamzami,Nahed Alowaidi

Main category: cs.CL

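TL;DR: This paper introduces KAU-CSSL, the first continuous Saudi Sign Language dataset, and proposes an efficient recognition model, contributing to the Saudi deaf community and to sign language research.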

Details Motivation: More than 84,000 people in Saudi Arabia rely on Saudi Sign Language as their primary form of communication, but limited awareness of sign language leads to unequal access to educational and professional opportunities and thus to social exclusion. In addition, state-of-the-art solutions focus mainly on non-Arabic sign languages, leaving Arabic sign language severely under-resourced. Method: A new continuous Saudi Sign Language dataset, KAU-CSSL, is introduced, and a transformer-based model is proposed that uses a pretrained ResNet-18 for spatial feature extraction and a Transformer Encoder with Bidirectional LSTM for temporal dependencies. Result: The proposed model achieves 99.02% accuracy in signer-dependent mode and 77.71% accuracy in signer-independent mode. Conclusion: The study provides an important resource for the Saudi deaf community and advances sign language recognition and translation technology by developing the first continuous Saudi Sign Language dataset, KAU-CSSL, and proposing a transformer-based model. Abstract: Sign language (SL) is an essential communication form for hearing-impaired and deaf people, enabling engagement within the broader society. Despite its significance, limited public awareness of SL often leads to inequitable access to educational and professional opportunities, thereby contributing to social exclusion, particularly in Saudi Arabia, where over 84,000 individuals depend on Saudi Sign Language (SSL) as their primary form of communication. Although certain technological approaches have helped to improve communication for individuals with hearing impairments, there continues to be an urgent requirement for more precise and dependable translation techniques, especially for Arabic sign language variants like SSL. Most state-of-the-art solutions have primarily focused on non-Arabic sign languages, resulting in a considerable absence of resources dedicated to Arabic sign language, specifically SSL. The complexity of the Arabic language and the prevalence of isolated sign language datasets that concentrate on individual words instead of continuous speech contribute to this issue. To address this gap, our research represents an important step in developing SSL resources. To address this, we introduce the first continuous Saudi Sign Language dataset called KAU-CSSL, focusing on complete sentences to facilitate further research and enable sophisticated recognition systems for SSL recognition and translation. 
Additionally, we propose a transformer-based model, utilizing a pretrained ResNet-18 for spatial feature extraction and a Transformer Encoder with Bidirectional LSTM for temporal dependencies, achieving 99.02% accuracy in signer-dependent mode and 77.71% accuracy in signer-independent mode. This development leads the way to not only improving communication tools for the SSL community but also making a substantial contribution to the wider field of sign language.

[28] Design and Optimization of Reinforcement Learning-Based Agents in Text-Based Games

Haonan Wang,Mingjia Zhao,Junfeng Sun,Wei Liu

Main category: cs.CL

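TL;DR: This paper presents a new approach to designing and training reinforcement learning agents for text-based games, using a deep learning model to process game text and a policy gradient method for learning, and achieves higher game completion and win rates than previous agents.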

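Details Motivation: As AI technology advances, research on playing text-based games with agents is becoming increasingly popular, which calls for a new approach to agent design and learning. Method: A deep learning model is first applied to process game text and build a world model; the agent is then trained with a policy gradient-based deep reinforcement learning method to convert state values into an optimal policy. Result: The enhanced agent performs better in several text-based games and significantly surpasses previous agents on game completion ratio and win rate. Conclusion: The study offers new understanding and an empirical basis for using reinforcement learning in text games, and lays the groundwork for developing and optimizing reinforcement learning agents for broader domains and problems. Abstract: As AI technology advances, research in playing text-based games with agents has become progressively popular. In this paper, a novel approach to agent design and agent learning is presented within the context of reinforcement learning. A deep learning model is first applied to process game text and build a world model. Next, the agent is trained through a policy gradient-based deep reinforcement learning method to facilitate conversion from state value to optimal policy. The enhanced agent works better in several text-based game experiments and significantly surpasses previous agents on game completion ratio and win rate. Our study introduces novel understanding and empirical ground for using reinforcement learning for text games and sets the stage for developing and optimizing reinforcement learning agents for more general domains and problems.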

cs.CV [Back]

[29] 2nd Place Solution for CVPR2024 E2E Challenge: End-to-End Autonomous Driving Using Vision Language Model

Zilong Guo,Yi Luo,Long Sha,Dongxu Wang,Panqu Wang,Chenyang Xu,Yi Yang

Main category: cs.CV

TL;DR: This paper shows that integrating Vision Language Models (VLMs) into an end-to-end autonomous driving system using a single camera can significantly improve driving task performance, highlighting the potential of vision-based approaches.

Details Motivation: The motivation behind this work is to explore whether powerful large language models, particularly multi-modality Vision Language Models, can enhance the performance of end-to-end autonomous driving systems. Method: The research employs a single camera vision-based end-to-end autonomous driving system integrated with multi-modality Vision Language Models (VLMs) to assess their effectiveness in driving tasks. Result: The proposed method demonstrated impressive performance on driving tasks, emerging as the best camera-only solution across the leaderboard. Conclusion: The study concludes that combining end-to-end architectural design with knowledgeable Vision Language Models (VLMs) can yield impressive performance in end-to-end autonomous driving tasks, demonstrating the effectiveness of a vision-based approach. Abstract: End-to-end autonomous driving has drawn tremendous attention recently. Many works focus on using modular deep neural networks to construct the end-to-end architecture. However, whether using powerful large language models (LLM), especially multi-modality Vision Language Models (VLM), could benefit the end-to-end driving tasks remains a question. In our work, we demonstrate that combining end-to-end architectural design and knowledgeable VLMs yields impressive performance on the driving tasks. It is worth noting that our method only uses a single camera and is the best camera-only solution across the leaderboard, demonstrating the effectiveness of the vision-based driving approach and the potential for end-to-end driving tasks.

[30] PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?

Mennatullah Siam

Main category: cs.CV

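TL;DR: This study examines how video multi-modal large language models perform on pixel-level visual grounding, particularly when handling natural language expressions that describe objects' motion patterns.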

Details Motivation: Although multi-modal large language models have shown impressive generalization using image and text modalities, their pixel-level visual grounding abilities are less studied, especially in their extension to video. Method: The author identifies shortcomings in current benchmarks and introduces four motion-centric probing techniques to evaluate video MLLMs' ability to identify true motion and to grasp motion order. Result: The author establishes strong single-image baselines that are on par with or outperform prior methods, and explores simple motion-centric adaptation techniques that provide state-of-the-art performance on MoCentric-Bench. Conclusion: MoCentric-Bench challenges future models to improve dense spatiotemporal grounding and pixel-level understanding within videos. Abstract: Multi-modal large language models (MLLMs) have shown impressive generalization across tasks using images and text modalities. While their extension to video has enabled tasks such as video question answering and video captioning, their pixel-level visual grounding abilities are less studied. In this work, we raise the pertinent question of whether motion is used in pixel-level visual grounding and whether video MLLMs can segment objects based on natural language expressions describing their motion patterns. We identify the shortcomings in the current benchmarks, where we show that a single frame can often suffice for capturing the motion referring expression without any temporal reasoning. To address this, we introduce four motion-centric probing techniques, particularly designed for the visual grounding task, to study video MLLMs' ability to identify true motion from a fake one and their ability to grasp the motion order. Consequently, we provide a motion-centric benchmark, MoCentric-Bench. It ensures that video MLLMs are evaluated towards leveraging the interaction between motion and language rather than being dominated by static appearance cues emphasized in existing visual grounding datasets. We further establish strong single-image baselines that are on par with or outperform prior methods. Finally, we explore simple motion-centric adaptation techniques that provide state-of-the-art performance on our MoCentric-Bench. Our motion-centric benchmark, evaluation and findings challenge future models to improve dense spatiotemporal grounding and pixel-level understanding within videos.
Code and datasets will be made publicly available at https://github.com/MSiam/PixFoundation-2.0.git.

[31] Multi-Scale Deep Learning for Colon Histopathology: A Hybrid Graph-Transformer Approach

Sadra Saremi,Amirhossein Ahmadkhan Kordbacheh

Main category: cs.CV

TL;DR: This paper proposes a hybrid deep learning model combining transformers, CNNs, and capsule networks to improve colon cancer classification on histopathological images.

Details Motivation: Early detection of colon cancer is critical for preventing its deterioration, and current methods can be improved by capturing multi-scale features more effectively. Method: The research utilizes a hybrid architecture (HG-TNet) combining transformers for global contextual understanding and CNNs for local feature extraction, with capsule networks and a self-supervised rotation prediction objective. Result: The proposed model achieved better performance in accuracy, loss function, and spatial order preservation compared to standard architectures. Conclusion: The proposed hybrid multi-scale deep learning architecture effectively improves colon cancer classification performance by leveraging the strengths of transformers and CNNs, along with capsule networks to preserve spatial orders. Abstract: Colon cancer, also known as colorectal cancer, is one of the most malignant types of cancer worldwide. Early-stage detection of colon cancer is crucial to prevent its deterioration. This research presents a hybrid multi-scale deep learning architecture that synergizes capsule networks, graph attention mechanisms, transformer modules, and residual learning to advance colon cancer classification on the Lung and Colon Cancer Histopathological Image Dataset (LC25000). The proposed HG-TNet model introduces a hybrid architecture that joins the strengths of transformers and convolutional neural networks to capture multi-scale features in histopathological images. Mainly, a transformer branch extracts global contextual relationships by partitioning the image into patches through convolution-based patch embedding and then processing these patches through a transformer encoder.
Analogously, a dedicated CNN branch captures fine-grained, local details through successive convolutional layers. Incorporating these diverse features, combined with a self-supervised rotation prediction objective, produces a robust diagnostic representation that surpasses standard architectures in performance. Results show better performance not only in accuracy and loss but also in how the model uses capsule networks to preserve spatial order and capture how individual elements combine to form whole structures.
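As a rough illustration of the self-supervised rotation prediction objective mentioned in the abstract, the sketch below builds the (rotated image, rotation label) pairs that such a pretext task is usually trained on. The helper names `rotate90` and `make_rotation_batch` are hypothetical, and the four-way 0/90/180/270 degree labelling is an assumption based on the common formulation of this task, not a detail confirmed by the paper.

```python
def rotate90(img, k):
    """Rotate a 2D list-of-lists image counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transposing and then reversing the row order is one
        # counter-clockwise quarter turn.
        img = [list(row) for row in zip(*img)][::-1]
    return img


def make_rotation_batch(img):
    """Return (rotated_image, label) pairs for labels 0..3.

    Assumption: the pretext classifier predicts which of four quarter-turn
    rotations was applied; the label is the number of quarter turns.
    """
    return [(rotate90(img, k), k) for k in range(4)]


patch = [[1, 2],
         [3, 4]]
for rotated, label in make_rotation_batch(patch):
    print(label, rotated)
```

The labels come for free from the transformation itself, which is why no manual annotation is needed for this auxiliary objective.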

[32] PRECISE-AS: Personalized Reinforcement Learning for Efficient Point-of-Care Echocardiography in Aortic Stenosis Diagnosis

Armin Saadat,Nima Hashemi,Hooman Vaseli,Michael Y. Tsang,Christina Luong,Michiel Van de Panne,Teresa S. M. Tsang,Purang Abolmaesumi

Main category: cs.CV

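TL;DR: This study proposes a reinforcement learning-based active video acquisition technique that dynamically selects key ultrasound videos, improving the efficiency and accuracy of aortic stenosis diagnosis.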

Details Motivation: Access to echocardiography is limited by resource constraints, particularly in rural and underserved areas, so a more efficient and scalable diagnostic approach is needed. Method: A reinforcement learning-driven active video acquisition framework dynamically selects the most informative ultrasound videos rather than using a fixed number of videos. Result: Tested on data from 2,572 patients, the method achieves 80.6% classification accuracy while using only 47% of the video data. Conclusion: The study proposes a reinforcement learning-based active video acquisition framework for aortic stenosis diagnosis that improves diagnostic efficiency and personalization. Abstract: Aortic stenosis (AS) is a life-threatening condition caused by a narrowing of the aortic valve, leading to impaired blood flow. Despite its high prevalence, access to echocardiography (echo), the gold-standard diagnostic tool, is often limited due to resource constraints, particularly in rural and underserved areas. Point-of-care ultrasound (POCUS) offers a more accessible alternative but is restricted by operator expertise and the challenge of selecting the most relevant imaging views. To address this, we propose a reinforcement learning (RL)-driven active video acquisition framework that dynamically selects each patient's most informative echo videos. Unlike traditional methods that rely on a fixed set of videos, our approach continuously evaluates whether additional imaging is needed, optimizing both accuracy and efficiency. Tested on data from 2,572 patients, our method achieves 80.6% classification accuracy while using only 47% of the echo videos compared to a full acquisition. These results demonstrate the potential of active feature acquisition to enhance AS diagnosis, making echocardiographic assessments more efficient, scalable, and personalized. Our source code is available at: https://github.com/Armin-Saadat/PRECISE-AS.
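The idea of continuously evaluating whether additional imaging is needed can be sketched as an acquire-then-check loop. The example below is not the paper's RL policy: it swaps in a fixed confidence threshold as the stopping rule, and the function name `acquire_until_confident`, the per-video scores, and the `threshold` parameter are all hypothetical.

```python
def acquire_until_confident(video_scores, threshold=0.9):
    """Acquire videos one at a time and stop once the running estimate is confident.

    video_scores: per-video probability of the positive class, in acquisition
    order (a stand-in for a real per-video classifier).
    Returns (prediction, number_of_videos_used).

    Assumption: a simple confidence threshold replaces the learned RL policy
    described in the abstract.
    """
    if not video_scores:
        raise ValueError("at least one video score is required")
    used = []
    for score in video_scores:
        used.append(score)
        mean = sum(used) / len(used)
        # Confidence in whichever class the running mean currently favours.
        confidence = max(mean, 1.0 - mean)
        if confidence >= threshold:
            break
    return mean >= 0.5, len(used)


prediction, n_used = acquire_until_confident([0.95, 0.2, 0.3])
print(prediction, n_used)
```

A learned policy would replace the threshold test with a decision that also accounts for which view to request next, which this sketch leaves out.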

[33] LiGuard: A Streamlined Open-Source Framework for Rapid & Interactive Lidar Research

Muhammad Shahbaz,Shaurya Agarwal

Main category: cs.CV

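TL;DR: LiGuard is an open-source framework designed for lidar research that simplifies code development, parameter tuning, and result visualization, substantially reducing duplicated work and improving code reuse.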

Details Motivation: Research on and applications of lidar data involve a great deal of duplicated development effort: many studies overlap in methodological steps such as data input/output, pre/post-processing, and common algorithms, while small changes in data or algorithms can force major code revisions. Method: The authors developed an open-source software framework called LiGuard that provides built-in support for data input/output, pre/post-processing, and commonly used algorithms, and allows custom algorithms to be interactively added, removed, or reordered and their parameters adjusted. Result: LiGuard generates code files in structured directories, making it easy to share and reuse entire projects or individual components, and its effectiveness is validated through case studies. Conclusion: LiGuard is an open-source software framework intended to accelerate lidar-based research and applications, improving code reuse and research efficiency by providing a structured code framework and built-in support. Abstract: There is a growing interest in the development of lidar-based autonomous mobility and Intelligent Transportation Systems (ITS). To work with and conduct research on lidar data, researchers often develop code specific to their application niche. This approach leads to duplication of effort across studies that, in many cases, share multiple methodological steps such as data input/output (I/O), pre/post processing, and common algorithms in multi-stage solutions. Moreover, slight changes in data, algorithms, and/or research focus may force major revisions in the code. To address these challenges, we present LiGuard, an open-source software framework that allows researchers to: 1) rapidly develop code for their lidar-based projects by providing built-in support for data I/O, pre/post processing, and commonly used algorithms, 2) interactively add/remove/reorder custom algorithms and adjust their parameters, and 3) visualize results for classification, detection, segmentation, and tracking tasks. Moreover, because it creates all the code files in structured directories, it allows easy sharing of entire projects or even the individual components to be reused by other researchers. The effectiveness of LiGuard is demonstrated via case studies.

[34] PercepTwin: Modeling High-Fidelity Digital Twins for Sim2Real LiDAR-based Perception for Intelligent Transportation Systems

Muhammad Shahbaz,Shaurya Agarwal

Main category: cs.CV

TL;DR: This paper presents a methodology for generating high-quality synthetic datasets using High-Fidelity Digital Twins (HiFi DTs) to enable scalable and cost-effective Sim2Real learning for LiDAR-based perception in intelligent transportation systems.

Details Motivation: The motivation is to address the challenge of creating large-scale labeled datasets for LiDAR-based perception systems in ITS, which are costly and time-consuming. This leads to a need for scalable alternatives like Sim2Real learning, which depends on high-fidelity simulations. Method: The paper proposes a methodology for creating synthetic datasets using High-Fidelity Digital Twins (HiFi DTs), incorporating static geometry modeling, road infrastructure replication, and dynamic traffic scenario generation. It uses open-source tools, satellite imagery, and OpenStreetMap data. Result: The result is a detailed workflow for generating high-quality synthetic datasets using HiFi DTs, enabling robust Sim2Real learning for LiDAR-based perception systems in ITS. Conclusion: The paper concludes that leveraging high-fidelity digital twins can create scalable, cost-effective, and diverse synthetic datasets that enable effective Sim2Real learning for LiDAR-based perception systems in intelligent transportation systems. Abstract: LiDAR-based perception in intelligent transportation systems (ITS), for tasks such as object detection, tracking, and semantic and instance segmentation, is predominantly solved by deep neural network models which often require large-scale labeled datasets during training to achieve generalization. However, creating these datasets is costly, time-consuming, and requires human labor before the datasets are ready for training models. This hinders the scalability of LiDAR-based perception systems in ITS. Sim2Real learning offers a scalable alternative; however, its effectiveness depends on the fidelity of the source simulation(s) to the real world in terms of environment structure, actor dynamics, and sensor emulation. In response, this paper introduces a rigorous and reproducible methodology for creating large-scale, high-quality synthetic datasets using High-Fidelity Digital Twins (HiFi DTs).
The proposed workflow outlines the steps, tools, and best practices for digitally replicating real-world environments, encompassing static geometry modeling, road infrastructure replication, and dynamic traffic scenario generation. Leveraging open-source and readily available resources such as satellite imagery and OpenStreetMap data, alongside specific sensor configurations, this paper provides practical, detailed guidance for constructing robust synthetic environments. These environments subsequently facilitate scalable, cost-effective, and diverse dataset generation, forming a reliable foundation for robust Sim2Real learning.

[35] High-Fidelity Digital Twins for Bridging the Sim2Real Gap in LiDAR-Based ITS Perception

Muhammad Shahbaz,Shaurya Agarwal

Main category: cs.CV

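TL;DR: This paper proposes a high-fidelity digital twin framework that effectively addresses distribution shift in Sim2Real domain transfer and improves the real-world performance of LiDAR perception models.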

Details Motivation: Because LiDAR perception models trained in simulation perform poorly on real-world data, the paper aims to address the Sim2Real domain transfer problem. Method: The paper builds a high-fidelity digital twin framework by incorporating real-world background geometry, lane-level road topology, and sensor-specific specifications and placement, and uses multiple metrics to evaluate the distributional alignment between synthetic and real data. Result: Experiments show that a model trained on HiFi DT-generated synthetic data outperforms the model trained on real data by 4.8%, and that HiFi DTs significantly reduce domain shift and improve generalization across evaluation scenarios. Conclusion: The paper proposes a high-fidelity digital twin (HiFi DT) framework for addressing distribution shift in Sim2Real domain transfer and validates its effectiveness for real-world ITS applications. Abstract: Sim2Real domain transfer offers a cost-effective and scalable approach for developing LiDAR-based perception (e.g., object detection, tracking, segmentation) in Intelligent Transportation Systems (ITS). However, perception models trained in simulation often underperform on real-world data due to distributional shifts. To address this Sim2Real gap, this paper proposes a high-fidelity digital twin (HiFi DT) framework that incorporates real-world background geometry, lane-level road topology, and sensor-specific specifications and placement. We formalize the domain adaptation challenge underlying Sim2Real learning and present a systematic method for constructing simulation environments that yield in-domain synthetic data. An off-the-shelf 3D object detector is trained on HiFi DT-generated synthetic data and evaluated on real data. Our experiments show that the DT-trained model outperforms the equivalent model trained on real data by 4.8%. To understand this gain, we quantify distributional alignment between synthetic and real data using multiple metrics, including Chamfer Distance (CD), Maximum Mean Discrepancy (MMD), Earth Mover's Distance (EMD), and Fréchet Distance (FD), at both raw-input and latent-feature levels. Results demonstrate that HiFi DTs substantially reduce domain shift and improve generalization across diverse evaluation scenarios. These findings underscore the significant role of digital twins in enabling reliable, simulation-based LiDAR perception for real-world ITS applications.
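Of the alignment metrics listed, Chamfer Distance is the simplest to show concretely. The sketch below is a brute-force version for small point clouds; it assumes the variant that averages unsquared nearest-neighbour distances in each direction and sums the two directions. Other definitions (squared distances, sums instead of means) are also in use, and the abstract does not say which one the paper adopts.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two point sets.

    a, b: non-empty lists of equal-dimension points (tuples of floats).
    Assumption: mean of unsquared nearest-neighbour distances in each
    direction, summed over both directions.
    """
    def one_way(src, dst):
        # For every source point, distance to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


synthetic = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
real = [(0.0, 0.1, 0.0), (1.0, 0.0, 0.2)]
print(chamfer_distance(synthetic, real))
```

This brute-force form costs O(|a| * |b|) distance evaluations, so real LiDAR sweeps would normally use a spatial index such as a k-d tree instead.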

[36] Single Domain Generalization in Diabetic Retinopathy: A Neuro-Symbolic Learning Approach

Midhat Urooj,Ayan Banerjee,Farhat Shaikh,Kuntal Thakur,Sandeep Gupta

Main category: cs.CV

TL;DR: KG-DG, a neuro-symbolic framework combining deep learning with symbolic reasoning using clinical knowledge, significantly improves domain generalization for diabetic retinopathy classification.

Details Motivation: Domain generalization in medical imaging remains a challenge due to distribution shifts in real-world scenarios. Existing models trained on single or limited sources often fail to generalize, necessitating a more robust solution. Method: KG-DG framework was developed, integrating vision transformers with symbolic reasoning using clinical lesion ontologies and retinal vessel segmentation. Domain embeddings were aligned by minimizing KL divergence, and a confidence-weighted integration strategy was employed. Result: KG-DG achieved up to a 5.2% accuracy gain in cross-domain settings and a 6% improvement over baseline ViT models. The symbolic-only model achieved 63.67% average accuracy in multi-domain generalization, while the complete neuro-symbolic model outperformed existing baselines in single-domain generalization. Lesion-based features showed 84.65% accuracy, proving their effectiveness. Conclusion: The study concludes that neuro-symbolic integration, combining vision transformers with expert-guided symbolic reasoning, is a promising approach for achieving domain-invariant and clinically robust medical AI systems in diabetic retinopathy classification. Abstract: Domain generalization remains a critical challenge in medical imaging, where models trained on single sources often fail under real-world distribution shifts. We propose KG-DG, a neuro-symbolic framework for diabetic retinopathy (DR) classification that integrates vision transformers with expert-guided symbolic reasoning to enable robust generalization across unseen domains. Our approach leverages clinical lesion ontologies through structured, rule-based features and retinal vessel segmentation, fusing them with deep visual representations via a confidence-weighted integration strategy. 
The framework addresses both single-domain generalization (SDG) and multi-domain generalization (MDG) by minimizing the KL divergence between domain embeddings, thereby enforcing alignment of high-level clinical semantics. Extensive experiments across four public datasets (APTOS, EyePACS, Messidor-1, Messidor-2) demonstrate significant improvements: up to a 5.2% accuracy gain in cross-domain settings and a 6% improvement over baseline ViT models. Notably, our symbolic-only model achieves a 63.67% average accuracy in MDG, while the complete neuro-symbolic integration achieves the highest accuracy compared to existing published baselines and benchmarks in challenging SDG scenarios. Ablation studies reveal that lesion-based features (84.65% accuracy) substantially outperform purely neural approaches, confirming that symbolic components act as effective regularizers beyond merely enhancing interpretability. Our findings establish neuro-symbolic integration as a promising paradigm for building clinically robust and domain-invariant medical AI systems.
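To make the alignment term concrete, the sketch below computes a KL divergence between two domain embeddings. How KG-DG turns embeddings into probability distributions is not stated in the abstract, so the softmax normalization here is an assumption, and the function names and example vectors are hypothetical.

```python
import math


def softmax(x):
    """Numerically stable softmax over a list of floats."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.

    eps guards against log(0) when a probability is exactly zero.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Hypothetical embeddings of the same clinical concept from two domains.
source_embedding = [2.0, 0.5, -1.0]
target_embedding = [1.5, 0.7, -0.8]
loss = kl_divergence(softmax(source_embedding), softmax(target_embedding))
print(loss)
```

Minimizing this value during training pushes the two domains toward the same distribution over embedding dimensions; note that KL divergence is asymmetric, so the argument order matters.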

[37] A Data-Driven RetinaNet Model for Small Object Detection in Aerial Images

Zhicheng Tang,Jinwen Tang,Yi Shang

Main category: cs.CV

TL;DR: DDR-Net is a data-driven deep learning model designed to enhance the detection of small objects in aerial imaging, offering efficient performance even with limited data, and surpassing RetinaNet and other contemporary models.

Details Motivation: In the realm of aerial imaging, the ability to detect small objects is pivotal for a myriad of applications, encompassing environmental surveillance, urban design, and crisis management. Method: DDR-Net introduces novel, data-driven techniques to autonomously ascertain optimal feature maps and anchor estimations, cultivating a tailored and proficient training process while maintaining precision. Additionally, this paper presents an innovative sampling technique to bolster model efficacy under limited data training constraints. Result: Empirical assessments over assorted aerial avian imagery datasets demonstrate that DDR-Net markedly surpasses RetinaNet and alternative contemporary models. Conclusion: DDR-Net significantly reduces the cost and time required for data collection and training, offering efficient performance even with limited data. Abstract: In the realm of aerial imaging, the ability to detect small objects is pivotal for a myriad of applications, encompassing environmental surveillance, urban design, and crisis management. Leveraging RetinaNet, this work unveils DDR-Net: a data-driven, deep-learning model devised to enhance the detection of diminutive objects. DDR-Net introduces novel, data-driven techniques to autonomously ascertain optimal feature maps and anchor estimations, cultivating a tailored and proficient training process while maintaining precision. Additionally, this paper presents an innovative sampling technique to bolster model efficacy under limited data training constraints. The model's enhanced detection capabilities support critical applications including wildlife and habitat monitoring, traffic flow optimization, and public safety improvements through accurate identification of small objects like vehicles and pedestrians. DDR-Net significantly reduces the cost and time required for data collection and training, offering efficient performance even with limited data. 
Empirical assessments over assorted aerial avian imagery datasets demonstrate that DDR-Net markedly surpasses RetinaNet and alternative contemporary models. These innovations advance current aerial image analysis technologies and promise wide-ranging impacts across multiple sectors including agriculture, security, and archaeology.

[38] STAR: A Fast and Robust Rigid Registration Framework for Serial Histopathological Images

Zeyu Liu,Shengwei Ding

Main category: cs.CV

TL;DR: STAR is a fast, open-source framework for rigid registration of serial whole-slide histopathological images, enabling efficient cross-stain alignment and supporting AI workflows in computational pathology.

Details Motivation: The need for a lightweight, fast, and reproducible method for rigid registration of consecutive-section histopathological images to enable cross-stain comparison and AI-driven workflows. Method: STAR uses stain-conditioned preprocessing, a hierarchical coarse-to-fine correlation strategy, adaptive kernel scaling, and built-in quality control to achieve rigid registration. Result: STAR achieves reliable rigid registration across various tissue types and staining protocols, with fast processing times and robustness to cross-stain variability and partial tissue overlap. Conclusion: STAR provides a reproducible baseline for rigid registration of serial whole-slide histopathological images, offering speed, robustness, and open accessibility for clinical and computational pathology applications. Abstract: Registration of serial whole-slide histopathological images (WSIs) is critical for enabling direct comparison across diverse stains and for preparing paired datasets in artificial intelligence (AI) workflows such as virtual staining and biomarker prediction. While existing methods often rely on complex deformable or deep learning approaches that are computationally intensive and difficult to reproduce, lightweight rigid frameworks, which are sufficient for many consecutive-section scenarios, remain underdeveloped. We introduce STAR (Serial Tissue Alignment for Rigid registration), a fast and robust open-source framework for multi-WSI alignment. STAR integrates stain-conditioned preprocessing with a hierarchical coarse-to-fine correlation strategy, adaptive kernel scaling, and built-in quality control, achieving reliable rigid registration across heterogeneous tissue types and staining protocols, including hematoxylin-eosin (H&E), special histochemical stains (e.g., PAS, PASM, Masson's), and immunohistochemical (IHC) markers (e.g., CD31, KI67).
Evaluated on the ANHIR 2019 and ACROBAT 2022 datasets spanning multiple organs and scanning conditions, STAR consistently produced stable alignments within minutes per slide, demonstrating robustness to cross-stain variability and partial tissue overlap. Beyond benchmarks, we present case studies on H&E-IHC alignment, construction of multi-IHC panels, and typical failure modes, underscoring both utility and limitations. Released as an open and lightweight tool, STAR provides a reproducible baseline that lowers the barrier for clinical adoption and enables large-scale paired data preparation for next-generation computational pathology.
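A coarse-to-fine correlation search can be sketched in a few lines. The example below is a deliberately simplified stand-in, not STAR's implementation: it estimates translation only (no rotation), uses unnormalized correlation on tiny list-of-lists images, and has a single 2x coarse level. All function names and parameters are hypothetical.

```python
def downsample(img, factor=2):
    """Keep every `factor`-th row and column."""
    return [row[::factor] for row in img[::factor]]


def score(fixed, moving, dy, dx):
    """Unnormalized correlation of `fixed` with `moving` shifted by (dy, dx)."""
    h, w = len(fixed), len(fixed[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            my, mx = y - dy, x - dx
            if 0 <= my < h and 0 <= mx < w:
                total += fixed[y][x] * moving[my][mx]
    return total


def best_shift(fixed, moving, center, radius):
    """Exhaustively search shifts within `radius` of `center`."""
    cy, cx = center
    candidates = [(cy + dy, cx + dx)
                  for dy in range(-radius, radius + 1)
                  for dx in range(-radius, radius + 1)]
    return max(candidates, key=lambda s: score(fixed, moving, s[0], s[1]))


def coarse_to_fine_shift(fixed, moving, coarse_radius=2):
    """Estimate the (dy, dx) that moves `moving` onto `fixed`."""
    # Coarse stage: wide search on 2x-downsampled images.
    coarse = best_shift(downsample(fixed), downsample(moving), (0, 0), coarse_radius)
    # Fine stage: narrow search at full resolution around the upscaled estimate.
    return best_shift(fixed, moving, (2 * coarse[0], 2 * coarse[1]), 1)
```

The point of the hierarchy is cost: the wide search runs on a quarter of the pixels, and the full-resolution stage only checks a 3x3 neighbourhood. A practical version would normalize the correlation and add a rotation search.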

[39] Resilient Multimodal Industrial Surface Defect Detection with Uncertain Sensors Availability

Shuai Jiang,Yunfeng Ma,Jingyu Zhou,Yuan Bian,Yaonan Wang,Min Liu

Main category: cs.CV

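TL;DR: This paper introduces a new method for the modality-missing problem in multimodal industrial surface defect detection, improving detection performance through cross-modal prompt learning and symmetric contrastive learning.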

Details Motivation: To address the modality-missing problem caused by sensor unavailability, which leads to information vacancy and learning mode transformation in multimodal fusion. Method: The paper proposes cross-modal prompt learning and symmetric contrastive learning, using the text modality as a bridge for fusing the dual visual modalities, and designs paired antithetical text prompts and triple-modal contrastive pre-training. Result: Experiments show that the proposed method achieves 73.83% I-AUROC and 93.05% P-AUROC with a total missing rate of 0.7 for RGB and 3D modalities, exceeding existing state-of-the-art methods. Conclusion: The paper proposes a new method for the modality-missing problem in multimodal industrial surface defect detection, improving detection performance through cross-modal prompt learning and symmetric contrastive learning. Abstract: Multimodal industrial surface defect detection (MISDD) aims to identify and locate defects in industrial products by fusing RGB and 3D modalities. This article focuses on modality-missing problems caused by uncertain sensor availability in MISDD. In this context, the fusion of multiple modalities encounters several difficulties, including learning mode transformation and information vacancy. To this end, we first propose cross-modal prompt learning, which includes: i) a cross-modal consistency prompt that establishes information consistency between the dual visual modalities; ii) a modality-specific prompt inserted to adapt to different input patterns; iii) a missing-aware prompt attached to compensate for the information vacancy caused by dynamic modality missing. In addition, we propose symmetric contrastive learning, which utilizes the text modality as a bridge for fusion of the dual vision modalities. Specifically, a paired antithetical text prompt is designed to generate binary text semantics, and triple-modal contrastive pre-training is offered to accomplish multimodal learning. Experimental results show that our proposed method achieves 73.83% I-AUROC and 93.05% P-AUROC with a total missing rate of 0.7 for RGB and 3D modalities (exceeding state-of-the-art methods by 3.84% and 5.58%, respectively), and outperforms existing approaches to varying degrees under different missing types and rates. The source code will be available at https://github.com/SvyJ/MISDD-MM.
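For readers unfamiliar with the reported metrics, I-AUROC and P-AUROC conventionally denote the area under the ROC curve computed over per-image and per-pixel anomaly scores respectively. The sketch below computes AUROC from its pairwise-ranking definition; it is a generic illustration with made-up scores, not the paper's evaluation code.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    labels: 1 for defective, 0 for normal.
    scores: anomaly scores, higher meaning more likely defective.
    Equals the probability that a random positive outranks a random
    negative, with ties counted as half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Image-level use: one score per image. Pixel-level use would flatten
# every pixel's label and score into the same two lists.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise form is quadratic in the number of samples, so pixel-level evaluation on real images would normally use a sort-based implementation.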

[40] EdgeAttNet: Towards Barb-Aware Filament Segmentation

Victor Solomon,Piet Martens,Jingyu Liu,Rafal Angryk

Main category: cs.CV

TL;DR: EdgeAttNet is a new segmentation architecture that introduces a learnable edge map to improve segmentation accuracy for solar filaments and their barbs.

Details Motivation: Existing methods often fail to capture fine filament structures, especially barbs, because of their limited ability to model long-range dependencies and spatial detail. Method: Edge map information is integrated into the model by linearly transforming the Key and Query matrices in the attention mechanism, so that filament boundaries and barbs are captured more effectively. Result: EdgeAttNet outperforms U-Net and other U-Net-based transformer baselines on the MAGFILO dataset, achieving higher segmentation accuracy, significantly better recognition of filament barbs, and faster inference suitable for practical deployment. Conclusion: EdgeAttNet is a segmentation architecture built on a U-Net backbone that, by introducing a learnable edge map, improves spatial sensitivity and segmentation accuracy while reducing the number of trainable parameters. Abstract: Accurate segmentation of solar filaments in H-alpha observations is critical for determining filament chirality, a key factor in the behavior of Coronal Mass Ejections (CMEs). However, existing methods often fail to capture fine-scale filament structures, particularly barbs, due to a limited ability to model long-range dependencies and spatial detail. We propose EdgeAttNet, a segmentation architecture built on a U-Net backbone that introduces a novel, learnable edge map derived directly from the input image. This edge map is incorporated into the model by linearly transforming the attention Key and Query matrices with the edge information, thereby guiding the self-attention mechanism at the network's bottleneck to more effectively capture filament boundaries and barbs. By explicitly integrating this structural prior into the attention computations, EdgeAttNet enhances spatial sensitivity and segmentation accuracy while reducing the number of trainable parameters. Trained end-to-end, EdgeAttNet outperforms U-Net and other U-Net-based transformer baselines on the MAGFILO dataset. It achieves higher segmentation accuracy and significantly better recognition of filament barbs, with faster inference performance suitable for practical deployment.
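One way to picture edge information entering the attention computation is to scale each token's Query and Key by a function of its edge strength before the usual scaled dot-product step. The sketch below does exactly that with fixed scalars `alpha` and `beta`; the specific transform `(1 + alpha * e)`, the function name, and the toy inputs are assumptions for illustration, since the abstract does not give the exact linear transformation or say how its parameters are learned.

```python
import math


def edge_guided_attention(Q, K, V, edge, alpha=1.0, beta=1.0):
    """Scaled dot-product attention with edge-scaled queries and keys.

    Q, K, V: lists of token vectors (lists of floats), one per token.
    edge: per-token edge strength in [0, 1].
    Assumption: tokens on edges get larger queries/keys, which raises
    their attention logits; alpha and beta would be learnable in practice.
    """
    Qe = [[(1.0 + alpha * e) * q for q in row] for row, e in zip(Q, edge)]
    Ke = [[(1.0 + beta * e) * k for k in row] for row, e in zip(K, edge)]
    d = len(Q[0])
    out = []
    for q in Qe:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in Ke]
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0]]
plain = edge_guided_attention(tokens, tokens, tokens, edge=[0.0, 0.0])
guided = edge_guided_attention(tokens, tokens, tokens, edge=[0.0, 1.0])
print(plain[1], guided[1])
```

With all edge values at zero the function reduces to standard attention, so the edge map acts purely as a structural prior on where attention concentrates.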

[41] KEPT: Knowledge-Enhanced Prediction of Trajectories from Consecutive Driving Frames with Vision-Language Models

Yujin Wang,Tianyi Wang,Quanfeng Liu,Wenxian Fan,Junfeng Jiao,Christian Claudel,Yunbing Yan,Bingzhao Gao,Jianqiang Wang,Hong Chen

Main category: cs.CV

TL;DR: This paper proposes KEPT, a knowledge-enhanced vision-language model (VLM) framework for autonomous driving that predicts vehicle trajectories by integrating a novel video encoder and a retrieval system, achieving state-of-the-art results on the nuScenes dataset.

Details Motivation: The motivation stems from the challenge that existing vision-language models (VLMs) struggle to ground their reasoning in scene dynamics and domain knowledge, which is critical for accurate and safe trajectory prediction in autonomous driving. Method: The paper introduces the KEPT framework, which combines a TFSF video encoder trained with self-supervised learning and a k-means + HNSW retrieval stack. This system uses retrieved exemplars in CoT prompts with planning constraints, while a triple-stage fine-tuning aligns the language head to spatial and motion cues. It was evaluated on the nuScenes dataset. Result: KEPT achieved state-of-the-art performance on nuScenes dataset, with 0.70m average L2 and 0.21% collision rate under NoAvg protocol, and 0.31m average L2 with 0.07% collision rate under TemAvg. Ablation studies showed the effectiveness of its components, and the retrieval system achieved sub-millisecond latency. Conclusion: The KEPT framework, which integrates a temporal frequency-spatial fusion video encoder and a scalable retrieval stack, demonstrates promising results for autonomous driving by offering interpretable and trustworthy trajectory predictions. Abstract: Accurate short-horizon trajectory prediction is pivotal for safe and reliable autonomous driving, yet existing vision-language models (VLMs) often fail to effectively ground their reasoning in scene dynamics and domain knowledge. To address this challenge, this paper introduces KEPT, a knowledge-enhanced VLM framework that predicts ego trajectories directly from consecutive front-view driving frames. KEPT couples a temporal frequency-spatial fusion (TFSF) video encoder, trained via self-supervised learning with hard-negative mining, with a scalable k-means + HNSW retrieval stack that supplies scene-aligned exemplars. 
Retrieved priors are embedded into chain-of-thought (CoT) prompts with explicit planning constraints, while a triple-stage fine-tuning schedule incrementally aligns the language head to metric spatial cues, physically feasible motion, and temporally conditioned front-view planning. Evaluated on the nuScenes dataset, KEPT achieves state-of-the-art performance across open-loop protocols: under NoAvg, it achieves 0.70m average L2 with a 0.21% collision rate; under TemAvg with lightweight ego status, it attains 0.31m average L2 and a 0.07% collision rate. Ablation studies show that all three fine-tuning stages contribute complementary benefits, and that using Top-2 retrieved exemplars yields the best accuracy-safety trade-off. The k-means-clustered HNSW index delivers sub-millisecond retrieval latency, supporting practical deployment. These results indicate that retrieval-augmented, CoT-guided VLMs offer a promising, data-efficient pathway toward interpretable and trustworthy autonomous driving.
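The two-level retrieval idea (route the query to a cluster, then search inside it) can be sketched as follows. This is not KEPT's retrieval stack: centroids are assumed to come from a prior k-means run, and an exact brute-force search within the cluster stands in for the HNSW graph search. All names and the toy vectors are hypothetical.

```python
import math


def assign(vec, centroids):
    """Index of the nearest centroid (the coarse k-means routing step)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))


def build_index(vectors, centroids):
    """Group vector ids by their nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for idx, v in enumerate(vectors):
        buckets[assign(v, centroids)].append(idx)
    return buckets


def retrieve(query, vectors, centroids, buckets, k=2):
    """Return ids of the top-k nearest exemplars within the query's cluster.

    Assumption: exact search inside the bucket replaces the approximate
    HNSW search a production index would use.
    """
    candidates = buckets[assign(query, centroids)]
    return sorted(candidates, key=lambda i: math.dist(query, vectors[i]))[:k]


# Toy scene embeddings and two precomputed centroids.
vectors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
centroids = [(0.3, 0.3), (10.0, 10.5)]
buckets = build_index(vectors, centroids)
print(retrieve((0.1, 0.9), vectors, centroids, buckets, k=2))
```

The default `k=2` mirrors the Top-2 setting the abstract reports as the best accuracy-safety trade-off; restricting the search to one cluster is what keeps lookup cost low as the exemplar bank grows.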

[42] VQualA 2025 Challenge on Engagement Prediction for Short Videos: Methods and Results

Dasong Li,Sizhuo Ma,Hang Hua,Wenjie Li,Jian Wang,Chris Wei Zhou,Fengbin Guan,Xin Li,Zihao Yu,Yiting Lu,Ru-Ling Liao,Yan Ye,Zhibo Chen,Wei Sun,Linhan Cao,Yuqin Cao,Weixia Zhang,Wen Wen,Kaiwei Zhang,Zijian Chen,Fangfang Lu,Xiongkuo Min,Guangtao Zhai,Erjia Xiao,Lingfeng Zhang,Zhenjie Su,Hao Cheng,Yu Liu,Renjing Xu,Long Chen,Xiaoshuai Hao,Zhenpeng Zeng,Jianqin Wu,Xuxu Wang,Qian Yu,Bo Hu,Weiwei Wang,Pinxin Liu,Yunlong Tang,Luchuan Song,Jinxi He,Jiaru Wu,Hanjia Lyu

Main category: cs.CV

TL;DR: The VQualA 2025 Challenge focused on predicting engagement for short UGC videos using a new dataset and multi-modal features, attracting 97 participants and producing 15 valid submissions.

Details Motivation: The challenge aimed to understand and model the popularity of UGC short videos on social media platforms, promoting robust modeling strategies that capture complex influencing factors. Method: The challenge utilized a new short-form UGC dataset with real-world engagement metrics, encouraging participants to explore various multi-modal features such as visual content, audio, and metadata. Result: The challenge attracted 97 participants and received 15 valid test submissions, demonstrating significant interest and contribution to the field of engagement prediction for short-form UGC videos. Conclusion: The VQualA 2025 Challenge significantly contributed to the progress in predicting engagement for short-form UGC videos, providing a platform for exploring multi-modal features and robust modeling strategies. Abstract: This paper presents an overview of the VQualA 2025 Challenge on Engagement Prediction for Short Videos, held in conjunction with ICCV 2025. The challenge focuses on understanding and modeling the popularity of user-generated content (UGC) short videos on social media platforms. To support this goal, the challenge uses a new short-form UGC dataset featuring engagement metrics derived from real-world user interactions. The objective of the challenge is to promote robust modeling strategies that capture the complex factors influencing user engagement. Participants explored a variety of multi-modal features, including visual content, audio, and metadata provided by creators. The challenge attracted 97 participants and received 15 valid test submissions, contributing significantly to progress in short-form UGC video engagement prediction.

[43] InstaDA: Augmenting Instance Segmentation Data with Dual-Agent System

Xianbao Hou,Yonghao He,Zeyd Boukhers,John See,Hu Su,Wei Sui,Cong Yang

Main category: cs.CV

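TL;DR: InstaDA combines large language models and diffusion models, introducing a Prompt Rethink mechanism and image-conditioned generation to significantly improve the quality and diversity of instance segmentation datasets.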

Details Motivation: To address the difficulty of acquiring instance segmentation data and the class imbalance within it, and to overcome two limitations of existing methods: the lack of deep collaboration between large language models and diffusion models, and the underuse of training data. Method: Proposes InstaDA, which combines large language models and diffusion models, refines prompts through a Prompt Rethink mechanism to generate diverse data, and generates new instances from training images to enrich the data distribution. Result: On the LVIS 1.0 validation set, InstaDA improves box average precision (AP) by +4.0 and mask AP by +3.3, and outperforms the DiverGen model on common and frequent categories. Conclusion: InstaDA is a training-free dual-agent system that performs well at augmenting instance segmentation datasets; its text agent and image agent run as independent automated workflows, yielding higher average precision and practicality. Abstract: Acquiring high-quality instance segmentation data is challenging due to the labor-intensive nature of the annotation process and significant class imbalances within datasets. Recent studies have utilized the integration of Copy-Paste and diffusion models to create more diverse datasets. However, these studies often lack deep collaboration between large language models (LLMs) and diffusion models, and underutilize the rich information within the existing training data. To address these limitations, we propose InstaDA, a novel, training-free Dual-Agent system designed to augment instance segmentation datasets. First, we introduce a Text-Agent (T-Agent) that enhances data diversity through collaboration between LLMs and diffusion models. This agent features a novel Prompt Rethink mechanism, which iteratively refines prompts based on the generated images. This process not only fosters collaboration but also increases image utilization and optimizes the prompts themselves. Additionally, we present an Image-Agent (I-Agent) aimed at enriching the overall data distribution. This agent augments the training set by generating new instances conditioned on the training images. To ensure practicality and efficiency, both agents operate as independent and automated workflows, enhancing usability. Experiments conducted on the LVIS 1.0 validation set indicate that InstaDA achieves significant improvements, with an increase of +4.0 in box average precision (AP) and +3.3 in mask AP compared to the baseline.
Furthermore, it outperforms the leading model, DiverGen, by +0.3 in box AP and +0.1 in mask AP, with a notable +0.7 gain in box AP on common categories and mask AP gains of +0.2 on common categories and +0.5 on frequent categories.

[44] SPENet: Self-guided Prototype Enhancement Network for Few-shot Medical Image Segmentation

Chao Fan,Xibin Jia,Anqi Xiao,Hongyuan Yu,Zhenghan Yang,Dawei Yang,Hui Xu,Yan Huang,Liang Wang

Main category: cs.CV

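TL;DR: This paper proposes SPENet, a new few-shot medical image segmentation method that addresses intra-class variation through multi-level prototype generation and query-guided prototype enhancement, and demonstrates superior performance on three public medical datasets.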

Details Motivation: Prototype-based methods have made progress in FSMIS, but they typically generate a single global prototype and overlook intra-class variation. Method: Proposes a Self-guided Prototype Enhancement Network (SPENet) comprising a Multi-level Prototype Generation (MPG) module and a Query-guided Local Prototype Enhancement (QLPE) module. Result: Experiments on three public medical datasets demonstrate SPENet's superior performance. Conclusion: SPENet effectively addresses intra-class variation in prototype generation and improves FSMIS performance. Abstract: Few-Shot Medical Image Segmentation (FSMIS) aims to segment novel classes of medical objects using only a few labeled images. Prototype-based methods have made significant progress in addressing FSMIS. However, they typically generate a single global prototype for the support image to match with the query image, overlooking intra-class variations. To address this issue, we propose a Self-guided Prototype Enhancement Network (SPENet). Specifically, we introduce a Multi-level Prototype Generation (MPG) module, which enables multi-granularity measurement between the support and query images by simultaneously generating a global prototype and an adaptive number of local prototypes. Additionally, we observe that not all local prototypes in the support image are beneficial for matching, especially when there are substantial discrepancies between the support and query images. To alleviate this issue, we propose a Query-guided Local Prototype Enhancement (QLPE) module, which adaptively refines support prototypes by incorporating guidance from the query image, thus mitigating the negative effects of such discrepancies. Extensive experiments on three public medical datasets demonstrate that SPENet outperforms existing state-of-the-art methods.
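
For readers unfamiliar with the single-global-prototype baseline that the abstract criticizes, the following is a minimal sketch of it, not SPENet itself. It assumes features are already extracted as one vector per spatial position; the function names and the similarity threshold are hypothetical choices made for illustration.

```python
import math


def masked_average_prototype(features, mask):
    """Average the support feature vectors at positions where mask == 1."""
    selected = [f for f, m in zip(features, mask) if m == 1]
    if not selected:
        raise ValueError("mask selects no positions")
    dim = len(selected[0])
    return [sum(f[i] for f in selected) / len(selected) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_query(query_features, prototype, threshold=0.5):
    """Label a query position as foreground when it is similar to the prototype."""
    return [1 if cosine_similarity(q, prototype) > threshold else 0
            for q in query_features]


# Toy example: three support positions (two foreground), two query positions.
support = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
support_mask = [1, 1, 0]
prototype = masked_average_prototype(support, support_mask)
prediction = segment_query([[2.0, 0.0], [0.0, 3.0]], prototype)
```

Because one vector has to summarize the whole foreground, any intra-class variation is averaged away, which is the limitation the MPG module's additional local prototypes are meant to address.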

[45] SOPSeg: Prompt-based Small Object Instance Segmentation in Remote Sensing Imagery

Chenhao Wang,Yingrui Ji,Yu Meng,Yunjian Zhang,Yao Zhu

Main category: cs.CV

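TL;DR: This paper proposes SOPSeg, a framework designed for small object segmentation in remote sensing imagery that addresses the shortcomings of existing methods on small objects, and builds a comprehensive small object instance segmentation dataset.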

Details Motivation: Small object segmentation has important applications in remote sensing imagery, but dedicated datasets and methods are lacking. The existing Segment Anything Model (SAM) performs poorly on small objects, mainly because its low feature resolution causes loss of detail. Method: Proposes the SOPSeg framework, which includes a region-adaptive magnification strategy, a customized decoder that combines edge prediction and progressive refinement, and an oriented-bounding-box prompting mechanism for remote sensing applications. Result: SOPSeg outperforms existing methods on small object segmentation, and a comprehensive small object instance segmentation dataset was constructed, facilitating efficient dataset construction for remote sensing tasks. Conclusion: The SOPSeg framework effectively addresses the challenges of small object segmentation and supports future related research. Abstract: Extracting small objects from remote sensing imagery plays a vital role in various applications, including urban planning, environmental monitoring, and disaster management. While current research primarily focuses on small object detection, instance segmentation for small objects remains underexplored, with no dedicated datasets available. This gap stems from the technical challenges and high costs of pixel-level annotation for small objects. While the Segment Anything Model (SAM) demonstrates impressive zero-shot generalization, its performance on small-object segmentation deteriorates significantly, largely due to the coarse 1/16 feature resolution that causes severe loss of fine spatial details. To this end, we propose SOPSeg, a prompt-based framework specifically designed for small object segmentation in remote sensing imagery. It incorporates a region-adaptive magnification strategy to preserve fine-grained details, and employs a customized decoder that integrates edge prediction and progressive refinement for accurate boundary delineation. Moreover, we introduce a novel prompting mechanism tailored to the oriented bounding boxes widely adopted in remote sensing applications. SOPSeg outperforms existing methods in small object segmentation and facilitates efficient dataset construction for remote sensing tasks. We further construct a comprehensive small object instance segmentation dataset based on SODA-A, and will release both the model and dataset to support future research.

[46] Enhancing Robustness in Post-Processing Watermarking: An Ensemble Attack Network Using CNNs and Transformers

Tzuhsuan Huang,Cheng Yu Yeo,Tsai-Ling Huang,Hong-Han Shuai,Wen-Huang Cheng,Jun-Cheng Chen

Main category: cs.CV

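TL;DR: This paper studies post-processing watermarking, improves watermark robustness by introducing an ensemble attack network, and validates the approach on the WAVES benchmark.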

Details Motivation: Post-processing watermarking offers more flexibility: it can be applied to the output of any generative model without access to the model's internal structure, and it lets users embed unique watermarks into individual images. Method: Constructs various versions of attack networks using CNNs and Transformers in the spatial and frequency domains to investigate how each combination affects the robustness of the watermarking model. Result: Experiments on the WAVES benchmark show that the proposed ensemble attack network significantly enhances the robustness of baseline watermarking methods under various stress tests. In particular, for the Regeneration Attack defined in WAVES, the method improves StegaStamp by 18.743%. Conclusion: Incorporating an ensemble attack network during training improves the robustness of post-processing watermarking; combining a CNN attack network in the spatial domain with a Transformer attack network in the frequency domain yields the most robust watermarking model. Abstract: Recent studies on deep watermarking have predominantly focused on in-processing watermarking, which integrates the watermarking process into image generation. However, post-processing watermarking, which embeds watermarks after image generation, offers more flexibility. It can be applied to outputs from any generative model (e.g. GANs, diffusion models) without needing access to the model's internal structure. It also allows users to embed unique watermarks into individual images. Therefore, this study focuses on post-processing watermarking and enhances its robustness by incorporating an ensemble attack network during training. We construct various versions of attack networks using CNN and Transformer in both spatial and frequency domains to investigate how each combination influences the robustness of the watermarking model. Our results demonstrate that combining a CNN-based attack network in the spatial domain with a Transformer-based attack network in the frequency domain yields the highest robustness in watermarking models. Extensive evaluation on the WAVES benchmark, using average bit accuracy as the metric, demonstrates that our ensemble attack network significantly enhances the robustness of baseline watermarking methods under various stress tests. In particular, for the Regeneration Attack defined in WAVES, our method improves StegaStamp by 18.743%. The code is released at: https://github.com/aiiu-lab/DeepRobustWatermark.
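
The evaluation metric named in the abstract, average bit accuracy, can be illustrated with a small sketch. This is not taken from the released code: it assumes watermarks are represented as equal-length strings of `0`/`1` characters, and both function names are hypothetical.

```python
def bit_accuracy(embedded: str, decoded: str) -> float:
    """Fraction of positions where the decoded bit equals the embedded bit."""
    if len(embedded) != len(decoded) or not embedded:
        raise ValueError("bit strings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(embedded, decoded))
    return matches / len(embedded)


def average_bit_accuracy(pairs):
    """Mean bit accuracy over (embedded, decoded) pairs, one pair per image."""
    scores = [bit_accuracy(e, d) for e, d in pairs]
    return sum(scores) / len(scores)


# One image decoded perfectly, one decoded with every bit flipped.
result = average_bit_accuracy([("1010", "1010"), ("1010", "0101")])
```

For a binary message, an accuracy near 0.5 corresponds to chance-level decoding, so robustness shows up as accuracy staying well above that level after an attack.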

[47] Lesion-Aware Visual-Language Fusion for Automated Image Captioning of Ulcerative Colitis Endoscopic Examinations

Alexis Ivan Lopez Escamilla,Gilberto Ochoa,Sharib Al

Main category: cs.CV

TL;DR: The paper proposes a lesion-aware image captioning framework for ulcerative colitis that integrates deep learning models and clinical metadata to enhance caption quality and support endoscopic reporting.

Details Motivation: To produce structured, interpretable descriptions aligned with clinical practice for ulcerative colitis (UC) endoscopic reporting. Method: The model integrates ResNet embeddings, Grad-CAM heatmaps, and CBAM-enhanced attention with a T5 decoder, using clinical metadata as natural-language prompts to guide caption generation. Result: The system provides improved caption quality and MES classification accuracy compared to baselines. Conclusion: The lesion-aware image captioning framework for ulcerative colitis enhances caption quality and MES classification accuracy, supporting reliable endoscopic reporting. Abstract: We present a lesion-aware image captioning framework for ulcerative colitis (UC). The model integrates ResNet embeddings, Grad-CAM heatmaps, and CBAM-enhanced attention with a T5 decoder. Clinical metadata (MES score 0-3, vascular pattern, bleeding, erythema, friability, ulceration) is injected as natural-language prompts to guide caption generation. The system produces structured, interpretable descriptions aligned with clinical practice and provides MES classification and lesion tags. Compared with baselines, our approach improves caption quality and MES classification accuracy, supporting reliable endoscopic reporting.

[48] Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens

Sohee Kim,Soohyun Ryu,Joonhyung Park,Eunho Yang

Main category: cs.CV

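TL;DR: This paper proposes a new method for detecting and correcting the erroneous responses that large vision-language models (LVLMs) produce when given text inputs that lack visual evidence.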

Details Motivation: The authors find that large vision-language models (LVLMs) tend to mistake text inputs lacking visual evidence for part of the image, leading to erroneous responses. They therefore probe whether LVLMs have an internal capability to determine whether textual concepts are grounded in the image. Method: The researchers identify a specific subset of feed-forward network (FFN) neurons, termed Visual Absence-aware (VA) neurons, that signal visual absence through a distinctive activation pattern. Based on these patterns, they develop a detection module that classifies whether an input token is visually grounded. Guided by its predictions, they propose a method that reinterprets question prompts or replaces detected absent tokens during generation. Result: Experiments show that the method effectively reduces the models' false assumption that text input is visually present, and that it generalizes well across multiple LVLMs. Conclusion: The study reveals neuron activity in LVLMs related to visual absence awareness and proposes an effective module that improves the models' understanding of visual and textual inputs. Abstract: Large Vision-Language Models (LVLMs) generate contextually relevant responses by jointly interpreting visual and textual inputs. However, our findings reveal that they often mistakenly perceive text inputs lacking visual evidence as being part of the image, leading to erroneous responses. In light of this finding, we probe whether LVLMs possess an internal capability to determine if textual concepts are grounded in the image, and discover a specific subset of Feed-Forward Network (FFN) neurons, termed Visual Absence-aware (VA) neurons, that consistently signal the visual absence through a distinctive activation pattern. Leveraging these patterns, we develop a detection module that systematically classifies whether an input token is visually grounded. Guided by its prediction, we propose a method to refine the outputs by reinterpreting question prompts or replacing the detected absent tokens during generation. Extensive experiments show that our method effectively mitigates the models' tendency to falsely presume the visual presence of text input and generalizes across various LVLMs.

[49] Background Matters Too: A Language-Enhanced Adversarial Framework for Person Re-Identification

Kaicong Huang,Talha Azfar,Jack M. Reilly,Thomas Guggisberg,Ruimin Ke

Main category: cs.CV

TL;DR: This paper proposes a novel end-to-end multimodal framework for person re-identification that leverages both foreground and background semantics, outperforming existing methods on multiple benchmarks.

Details Motivation: Existing visual-only methods rely on costly annotations and struggle with occlusions, while recent multimodal approaches focus only on foreground information, neglecting the importance of background cues. Inspired by human perception, the authors argue that both foreground and background semantics are critical for ReID. Method: The authors propose an end-to-end framework with a dual-branch cross-modal feature extraction pipeline. They introduce intra-semantic alignment and inter-semantic adversarial learning to align visual and textual features while penalizing similarity between foreground and background features. Result: The proposed method achieves results that match or surpass current state-of-the-art approaches on both holistic and occluded ReID benchmarks, demonstrating its effectiveness and generality. Conclusion: The paper concludes that modeling both foreground and background semantics enhances the performance of person re-identification, as it enables the network to suppress noisy background regions while focusing on identity-relevant cues. Abstract: Person re-identification faces two core challenges: precisely locating the foreground target while suppressing background noise and extracting fine-grained features from the target region. Numerous visual-only approaches address these issues by partitioning an image and applying attention modules, yet they rely on costly manual annotations and struggle with complex occlusions. Recent multimodal methods, motivated by CLIP, introduce semantic cues to guide visual understanding. However, they focus solely on foreground information and overlook the potential value of background cues. Inspired by human perception, we argue that background semantics are as important as the foreground semantics in ReID, as humans tend to eliminate background distractions while focusing on target appearance.
Therefore, this paper proposes an end-to-end framework that jointly models foreground and background information within a dual-branch cross-modal feature extraction pipeline. To help the network distinguish between the two domains, we propose an intra-semantic alignment and inter-semantic adversarial learning strategy. Specifically, we align visual and textual features that share the same semantics across domains, while simultaneously penalizing similarity between foreground and background features to enhance the network's discriminative power. This strategy drives the model to actively suppress noisy background regions and enhance attention toward identity-relevant foreground cues. Comprehensive experiments on two holistic and two occluded ReID benchmarks demonstrate the effectiveness and generality of the proposed method, with results that match or surpass those of current state-of-the-art approaches.

[50] MedLiteNet: Lightweight Hybrid Medical Image Segmentation Model

Pengyang Yu,Haoquan Wang,Gerard Marks,Tahar Kechadi,Laurence T. Yang,Sahraoui Dhelim,Nyothiri Aung

Main category: cs.CV

TL;DR: MedLiteNet is a hybrid model combining CNNs and Transformers for efficient and accurate skin-lesion segmentation in dermatology.

Details Motivation: Accurate skin-lesion segmentation is crucial for diagnosing skin cancer. Existing methods like CNNs and Vision Transformers have limitations in handling long-range dependencies and computational efficiency. Method: The paper introduces MedLiteNet, which combines CNNs and Vision Transformers. It uses Mobile Inverted Bottleneck blocks, a cross-scale token-mixing unit, and a boundary-aware self-attention module. Result: MedLiteNet improves segmentation accuracy while being computationally efficient, making it suitable for small-sample medical datasets in dermatology. Conclusion: MedLiteNet is a lightweight CNN-Transformer hybrid that achieves high precision in dermoscopic segmentation through hierarchical feature extraction and multi-scale context aggregation. Abstract: Accurate skin-lesion segmentation remains a key technical challenge for computer-aided diagnosis of skin cancer. Convolutional neural networks, while effective, are constrained by limited receptive fields and thus struggle to model long-range dependencies. Vision Transformers capture global context, yet their quadratic complexity and large parameter budgets hinder use on the small-sample medical datasets common in dermatology. We introduce MedLiteNet, a lightweight CNN-Transformer hybrid tailored for dermoscopic segmentation that achieves high precision through hierarchical feature extraction and multi-scale context aggregation. The encoder stacks depth-wise Mobile Inverted Bottleneck blocks to curb computation, inserts a bottleneck-level cross-scale token-mixing unit to exchange information between resolutions, and embeds a boundary-aware self-attention module to sharpen lesion contours.

[51] DCDB: Dynamic Conditional Dual Diffusion Bridge for Ill-posed Multi-Tasks

Chengjie Huang,Jiafeng Yan,Jing Li,Lu Bai

Main category: cs.CV

TL;DR: This paper proposes a dynamic conditional double diffusion bridge paradigm to address challenges in multi-task image processing scenarios, particularly for ill-posed tasks with limited data, achieving superior performance in experiments.

Details Motivation: Conditional diffusion models face challenges in multi-task scenarios due to the difficulty in exploiting intrinsic task correlations and the limitations of static condition control, particularly in ill-posed tasks with insufficient training data. Method: The paper introduces a dynamic conditional double diffusion bridge paradigm that decouples the diffusion and condition generation processes, utilizing dynamic conditions generated by the same noise schedule to adjust statistical characteristics and embed time-related information. Result: The proposed method achieves the best performance across multiple metrics on public datasets for tasks such as dehazing and visible-infrared fusion, demonstrating its effectiveness through comparisons of network learning objectives and attention weights. Conclusion: The proposed dynamic conditional double diffusion bridge training paradigm provides a general framework for addressing ill-posed multi-task scenarios in image processing, achieving superior performance in tasks like dehazing and visible-infrared fusion. Abstract: Conditional diffusion models have made impressive progress in the field of image processing, but the characteristics of constructing data distribution pathways make it difficult to exploit the intrinsic correlation between tasks in multi-task scenarios, which is even worse in ill-posed tasks with a lack of training data. In addition, traditional static condition control makes it difficult for networks to learn in multi-task scenarios with its dynamically evolving characteristics. To address these challenges, we propose a dynamic conditional double diffusion bridge training paradigm to build a general framework for ill-posed multi-tasks. Firstly, this paradigm decouples the diffusion and condition generation processes, avoiding the dependence of the diffusion model on supervised data in ill-posed tasks. 
Secondly, generated by the same noise schedule, dynamic conditions are used to gradually adjust their statistical characteristics, naturally embed time-related information, and reduce the difficulty of network learning. We analyze the learning objectives of the network under different conditional forms in the single-step denoising process and compare the changes in its attention weights in the network, demonstrating the superiority of our dynamic conditions. Taking dehazing and visible-infrared fusion as typical ill-posed multi-task scenarios, we achieve the best performance in multiple indicators on public datasets. The code has been publicly released at: https://anonymous.4open.science/r/DCDB-D3C2.

[52] Isolated Bangla Handwritten Character Classification using Transfer Learning

Abdul Karim,S M Rafiuddin,Jahidul Islam Razin,Tahira Alam

Main category: cs.CV

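TL;DR: This paper uses transfer learning and deep neural networks (3DCNN, ResNet, MobileNet) to classify Bangla handwritten characters, reaching 99.82% accuracy on training data and 99.46% on test data, outperforming existing methods.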

Details Motivation: Bangla has 50 basic characters and many compound characters; prior work has made progress in recognizing them, but there is still room for improvement. Method: Uses deep neural network techniques, including a 3D Convolutional Neural Network (3DCNN), Residual Network (ResNet), and MobileNet, combined with transfer learning for classification. Result: The model achieves 99.82% accuracy on training data and 99.46% on test data, outperforming current state-of-the-art benchmarks. Conclusion: The paper proposes a transfer-learning-based classification method that avoids the vanishing gradient problem and achieves high accuracy on Bangla handwritten character classification. Abstract: The Bangla language consists of fifty distinct characters and many compound characters. Several notable studies have been performed to recognize Bangla characters, both handwritten and optical. Our approach uses transfer learning to classify the basic, distinct, as well as compound Bangla handwritten characters while avoiding the vanishing gradient problem. Deep Neural Network techniques such as 3D Convolutional Neural Network (3DCNN), Residual Neural Network (ResNet), and MobileNet are applied to generate an end-to-end classification of all possible standard formations of handwritten characters in the Bangla language. The Bangla Lekha Isolated dataset, which contains 166,105 Bangla character image samples categorized into 84 distinct classes, is used for this classification model. The model achieved 99.82% accuracy on training data and 99.46% accuracy on test data. Comparisons with various state-of-the-art benchmarks of Bangla handwritten character classification show that the proposed model achieves better accuracy in classifying the data.

[53] High Cursive Complex Character Recognition using GAN External Classifier

S M Rafiuddin

Main category: cs.CV

TL;DR: ADA-GAN improves classification of complex and cursive handwritten characters by using a GAN-based approach, outperforming traditional convolutional neural networks.

Details Motivation: Classifying complex and cursive handwritten characters is challenging for traditional methods, which perform worse as complexity increases. Method: An external classifier combined with a Generative Adversarial Network (GAN) is used, where the generator produces fake handwritten character images and the discriminator evaluates them, augmenting training data with adversarially perturbed noise. Result: The accuracy of traditional convolutional neural networks decreases with increased character complexity, while the ADA-GAN model maintains higher performance. Conclusion: The proposed ADA-GAN model is more robust and effective in classifying both cursive and complex handwritten characters compared to traditional convolutional neural networks. Abstract: Handwritten characters can be trickier to classify due to their complex and cursive nature compared to simple and non-cursive characters. We present an external classifier along with a Generative Adversarial Network that can classify highly cursive and complex characters. The generator network produces fake handwritten character images, which are then used to augment the training data after adding adversarially perturbed noise and achieving a confidence score above a threshold with the discriminator network. The results show that the accuracy of convolutional neural networks decreases as character complexity increases, but our proposed model, ADA-GAN, remains more robust and effective for both cursive and complex characters.
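
The augmentation step described in the abstract keeps a generated image only when the discriminator's confidence exceeds a threshold. A minimal sketch of that filtering step follows, assuming the discriminator scores are already computed; the function name and the default threshold of 0.8 are hypothetical, since the abstract does not state the value used.

```python
def select_confident_samples(samples, discriminator_scores, threshold=0.8):
    """Keep generated samples whose discriminator confidence exceeds threshold."""
    if len(samples) != len(discriminator_scores):
        raise ValueError("each sample needs exactly one score")
    return [s for s, score in zip(samples, discriminator_scores)
            if score > threshold]


# Placeholder identifiers stand in for generated character images.
kept = select_confident_samples(["img_a", "img_b", "img_c"], [0.9, 0.5, 0.85])
```

Only the retained samples would then be added to the classifier's training set.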

[54] Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection

Shan Wang,Maying Shen,Nadine Chang,Chuong Nguyen,Hongdong Li,Jose M. Alvarez

Main category: cs.CV

TL;DR: This paper proposes a resource-efficient method for reducing hallucinations in multimodal language models by addressing text-visual and co-occurrence biases through a gradient-based self-reflection approach and contrastive decoding framework.

Details Motivation: Existing methods address hallucinations in multimodal models heuristically without considering fluctuating bias levels across instances. This work aims to propose a more effective and resource-efficient solution. Method: A gradient-based self-reflection method is used to estimate token influence, enabling the detection of object-related visual tokens integrated into an influence-aware contrastive decoding framework to mitigate biases. Result: The experiments show that the method significantly reduces hallucinations, achieving up to a 92% accuracy increase on LLaVA-QA90. Conclusion: The proposed method effectively reduces hallucinations in multimodal large language models by mitigating text-visual and co-occurrence biases without requiring additional resources. Abstract: Hallucinations in multimodal large language models are caused by the text-visual bias and the co-occurrence bias. The former reflects an over-reliance on text information in the decision-making process, while the latter arises from the statistical object-pairing patterns abstracted from the training data. Existing mitigation methods heuristically address these biases without understanding the fluctuating bias level across the instances. We first propose estimating the influence of respective token types (visual, prompt, and previous outputs) using a gradient-based self-reflection method. The estimated token influence further enables the detection of object-related visual tokens and their integration into an influence-aware contrastive decoding framework to mitigate both types of biases simultaneously. Our method operates without the need for additional resources, such as costly fine-tuning, extra models, or data statistics. Extensive experiments show it effectively reduces hallucinations, achieving up to a 92% accuracy increase on LLaVA-QA90.
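
As background on contrastive decoding in general, the sketch below shows one commonly used way to combine two sets of next-token logits. It is not the paper's influence-aware variant, whose weighting the abstract does not specify; the names `contrastive_logits` and `alpha`, and the idea of a "biased" pass, are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(full_logits, biased_logits, alpha=1.0):
    """Amplify what the full input supports and subtract what the biased pass supports."""
    return [(1 + alpha) * f - alpha * b
            for f, b in zip(full_logits, biased_logits)]


# Token 0 is favored by both passes; token 1 only by the full-input pass.
adjusted = contrastive_logits([2.0, 1.0], [2.0, 0.0], alpha=1.0)
probabilities = softmax(adjusted)
```

In this toy case the adjustment removes the advantage of token 0, which the biased pass favored just as strongly, so both tokens end up equally likely.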

[55] TRELLIS-Enhanced Surface Features for Comprehensive Intracranial Aneurysm Analysis

Clément Hervé,Paul Garnier,Jonathan Viquerat,Elie Hachem

Main category: cs.CV

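TL;DR: This study proposes a cross-domain feature-transfer approach that uses latent geometric embeddings from TRELLIS, a generative model trained on large-scale non-medical 3D datasets, to augment neural networks for aneurysm analysis.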

Details Motivation: Intracranial aneurysms pose significant clinical risk, but they are difficult to detect, delineate, and model because annotated 3D data is scarce. Method: Replacing conventional point normals or mesh descriptors with TRELLIS surface features systematically improves three downstream tasks of aneurysm analysis: classifying aneurysms versus healthy vessels, segmenting aneurysm and vessel regions on 3D meshes, and predicting time-evolving blood-flow fields with a graph neural network. Result: Experiments show that these features yield significant gains in accuracy, F1-score, and segmentation quality over state-of-the-art baselines, and reduce simulation error by 15%. Conclusion: The approach illustrates the broad potential of transferring 3D representations from general-purpose generative models to specialized medical tasks. Abstract: Intracranial aneurysms pose a significant clinical risk yet are difficult to detect, delineate and model due to limited annotated 3D data. We propose a cross-domain feature-transfer approach that leverages the latent geometric embeddings learned by TRELLIS, a generative model trained on large-scale non-medical 3D datasets, to augment neural networks for aneurysm analysis. By replacing conventional point normals or mesh descriptors with TRELLIS surface features, we systematically enhance three downstream tasks: (i) classifying aneurysms versus healthy vessels in the Intra3D dataset, (ii) segmenting aneurysm and vessel regions on 3D meshes, and (iii) predicting time-evolving blood-flow fields using a graph neural network on the AnXplore dataset. Our experiments show that the inclusion of these features yields strong gains in accuracy, F1-score and segmentation quality over state-of-the-art baselines, and reduces simulation error by 15%. These results illustrate the broader potential of transferring 3D representations from general-purpose generative models to specialized medical tasks.

[56] Backdoor Poisoning Attack Against Face Spoofing Attack Detection Methods

Shota Iwamatsu,Koichi Ito,Takafumi Aoki

Main category: cs.CV

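TL;DR: This paper proposes a novel backdoor poisoning attack that lets spoofing attacks bypass face anti-spoofing detection systems; experiments show that it poses a realistic threat to existing systems.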

Details Motivation: To protect face recognition systems from spoofing attacks (such as illegal authentication attempts using photos of a user's face), the system must determine before recognition whether the input is a live user image or a spoofed image. However, existing deep learning methods require large amounts of training data and may be exposed to malicious data, so studying the latent threat of backdoor poisoning attacks is important. Method: The paper embeds features of the spoofing attack's face image into live face images to realize a novel backdoor poisoning attack, and validates it experimentally on public datasets. Result: Experiments show that the method lets specific spoofing attacks bypass detection without causing visual changes, posing a realistic threat to existing anti-spoofing detection systems. Conclusion: The paper proposes a novel backdoor poisoning attack, demonstrates the latent threat in face anti-spoofing detection, and shows experimentally that the method is a realistic threat to existing anti-spoofing detection systems. Abstract: Face recognition systems are robust against environmental changes and noise, and thus may be vulnerable to illegal authentication attempts using user face photos, such as spoofing attacks. To prevent such spoofing attacks, it is crucial to discriminate whether the input image is a live user image or a spoofed image prior to the face recognition process. Most existing spoofing attack detection methods utilize deep learning, which necessitates a substantial amount of training data. Consequently, if malicious data is injected into a portion of the training dataset, a specific spoofing attack may be erroneously classified as live, leading to false positives. In this paper, we propose a novel backdoor poisoning attack method to demonstrate the latent threat of backdoor poisoning within face anti-spoofing detection. The proposed method enables certain spoofing attacks to bypass detection by embedding features extracted from the spoofing attack's face image into a live face image without inducing any perceptible visual alterations. Through experiments conducted on public datasets, we demonstrate that the proposed method constitutes a realistic threat to existing spoofing attack detection systems.

[57] Information transmission: Inferring change area from change moment in time series remote sensing images

Jialu Li,Chen Wu,Meiqi Hu

Main category: cs.CV

TL;DR: This paper proposes a time series change detection network CAIM-Net, which infers change area from change moment to ensure consistency between change area and change moment results.

Details Motivation: Change area can be inferred from change moment. CAIM-Net is proposed to ensure consistency between change area and change moment results. Method: A new time series change detection network CAIM-Net is proposed, which infers change area from change moment. The network mainly includes three steps: Difference Extraction and Enhancement, Coarse Change Moment Extraction, and Fine Change Moment Extraction and Change Area Inference. Result: The experimental results show that the proposed CAIM-Net can effectively ensure the consistency between change area and change moment results, and has a good performance in time series change detection. Conclusion: CAIM-Net can effectively ensure consistency between change area and change moment results, and has a good performance in time series change detection. Abstract: Time series change detection is a critical task for exploring ecosystem dynamics using time series remote sensing images, because it can simultaneously indicate where and when change occur. While deep learning has shown excellent performance in this domain, it continues to approach change area detection and change moment identification as distinct tasks. Given that change area can be inferred from change moment, we propose a time series change detection network, named CAIM-Net (Change Area Inference from Moment Network), to ensure consistency between change area and change moment results. CAIM-Net infers change area from change moment based on the intrinsic relationship between time series analysis and spatial change detection. The CAIM-Net comprises three key steps: Difference Extraction and Enhancement, Coarse Change Moment Extraction, and Fine Change Moment Extraction and Change Area Inference. In the Difference Extraction and Enhancement, a lightweight encoder with batch dimension stacking is designed to rapidly extract difference features. Subsequently, boundary enhancement convolution is applied to amplify these difference features. 
In the Coarse Change Moment Extraction, the enhanced difference features from the first step are used for spatiotemporal correlation analysis, and then two distinct methods are employed to determine coarse change moments. In the Fine Change Moment Extraction and Change Area Inference, a multiscale temporal Class Activation Mapping (CAM) module first increases the weight of the change-occurring moment from coarse change moments. Then the weighted change moment is used to infer change area based on the fact that pixels with the change moment must have undergone a change.
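
The final observation, that any pixel with a detected change moment must belong to the change area, can be sketched in a few lines. This is a simplified illustration rather than CAIM-Net's actual inference step: it assumes the network's output is available as per-pixel, per-time-step scores, and the function name and the 0.5 threshold are hypothetical.

```python
def infer_change_area(moment_scores, threshold=0.5):
    """Derive a change-area mask and change-moment map from per-step scores.

    moment_scores[t][y][x] is the score that pixel (y, x) changed at step t.
    Returns (area, moment): area[y][x] is 1 if the pixel changed, and
    moment[y][x] is the index of the change step, or -1 if unchanged.
    """
    steps = len(moment_scores)
    height = len(moment_scores[0])
    width = len(moment_scores[0][0])
    area = [[0] * width for _ in range(height)]
    moment = [[-1] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            scores = [moment_scores[t][y][x] for t in range(steps)]
            best = max(range(steps), key=scores.__getitem__)
            if scores[best] > threshold:
                area[y][x] = 1      # a change moment exists, so the pixel changed
                moment[y][x] = best
    return area, moment


# Two time steps over a 1x2 image: only the second pixel changes, at step 0.
area, moment = infer_change_area([[[0.1, 0.9]], [[0.2, 0.3]]])
```

Deriving the area mask from the moment map, rather than predicting the two separately, makes the results consistent by construction: a pixel is in the change area if and only if it has a change moment.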

[58] Towards Realistic Hand-Object Interaction with Gravity-Field Based Diffusion Bridge

Miao Xu,Xiangyu Zhu,Xusheng Liang,Zidu Wang,Jinlin Wu,Zhen Lei

Main category: cs.CV

TL;DR: The paper proposes GravityDB, a novel method for simulating hand-object interactions that effectively addresses issues like interpenetration, gaps, and unrealistic hand deformations, offering more physically plausible and semantically meaningful results.

Details Motivation: Existing methods for hand-object pose estimation or reconstruction produce coarse interaction states due to the complex and diverse geometry of human hands and objects, leading to interpenetration or gaps in contact regions. Additionally, previous methods struggle to capture the non-negligible deformations of a real human hand during interaction. Method: The authors formulated hand-object interaction as an attraction-driven process and proposed a Gravity-Field Based Diffusion Bridge (GravityDB) to simulate interactions between a deformable hand surface and rigid objects, incorporating semantic information from textual descriptions to guide the construction of the gravitational field. Result: The GravityDB approach effectively resolves the issues of interpenetration and gaps while capturing realistic hand deformations, demonstrated through extensive qualitative and quantitative experiments on multiple datasets. Conclusion: The proposed GravityDB method successfully addresses the challenges of hand-object interaction modeling by generating physically plausible interactions that are free of interpenetration, ensuring stable grasping, and capturing realistic hand deformations. Abstract: Existing reconstruction or hand-object pose estimation methods are capable of producing coarse interaction states. However, due to the complex and diverse geometry of both human hands and objects, these approaches often suffer from interpenetration or leave noticeable gaps in regions that are supposed to be in contact. Moreover, the surface of a real human hand undergoes non-negligible deformations during interaction, which are difficult to capture and represent with previous methods. To tackle these challenges, we formulate hand-object interaction as an attraction-driven process and propose a Gravity-Field Based Diffusion Bridge (GravityDB) to simulate interactions between a deformable hand surface and rigid objects. 
Our approach effectively resolves the aforementioned issues by generating physically plausible interactions that are free of interpenetration, ensure stable grasping, and capture realistic hand deformations. Furthermore, we incorporate semantic information from textual descriptions to guide the construction of the gravitational field, enabling more semantically meaningful interaction regions. Extensive qualitative and quantitative experiments on multiple datasets demonstrate the effectiveness of our method.

[59] Temporally-Aware Diffusion Model for Brain Progression Modelling with Bidirectional Temporal Regularisation

Mattia Litrico,Francesco Guarnera,Mario Valerio Giuffrida,Daniele Ravì,Sebastiano Battiato

Main category: cs.CV

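TL;DR: This study proposes a new 3D Temporally-Aware Diffusion Model (TADM-3D) that combines a pre-trained Brain-Age Estimator (BAE) with Back-In-Time Regularisation (BITR) to predict longitudinal changes in brain structure on MRI more accurately.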

Details Motivation: Generating realistic MRI images that accurately predict future changes in brain structure is essential for clinicians assessing clinical outcomes and analysing disease progression. However, existing methods have limitations, such as failing to accurately capture the relationship between time intervals and structural changes, and relying on 2D slice-based architectures that ignore the full 3D anatomical context. Method: The paper proposes a 3D Temporally-Aware Diffusion Model (TADM-3D) that combines a pre-trained Brain-Age Estimator (BAE) with Back-In-Time Regularisation (BITR) to better model the relationship between time interval and brain changes. Result: TADM-3D is trained and evaluated on the OASIS-3 dataset, and its generalisation performance is validated on an external test set from the NACC dataset. Conclusion: Using a pre-trained BAE and BITR, TADM-3D predicts longitudinal changes in brain structure on MRI more accurately, providing a valuable tool for clinical assessment and analysis of disease progression. Abstract: Generating realistic MRIs to accurately predict future changes in the structure of the brain is an invaluable tool for clinicians in assessing clinical outcomes and analysing the disease progression at the patient level. However, existing methods present some limitations: (i) some approaches fail to explicitly capture the relationship between structural changes and time intervals, especially when trained on age-imbalanced datasets; (ii) others rely only on scan interpolation, which lacks clinical utility, as they generate intermediate images between timepoints rather than future pathological progression; and (iii) most approaches rely on 2D slice-based architectures, thereby disregarding full 3D anatomical context, which is essential for accurate longitudinal predictions. We propose a 3D Temporally-Aware Diffusion Model (TADM-3D), which accurately predicts brain progression on MRI volumes. To better model the relationship between time interval and brain changes, TADM-3D uses a pre-trained Brain-Age Estimator (BAE) that guides the diffusion model in the generation of MRIs that accurately reflect the expected age difference between baseline and generated follow-up scans. Additionally, to further improve the temporal awareness of TADM-3D, we propose the Back-In-Time Regularisation (BITR), by training TADM-3D to predict bidirectionally from the baseline to follow-up (forward), as well as from the follow-up to baseline (backward).
Although predicting past scans has limited clinical applications, this regularisation helps the model generate temporally more accurate scans. We train and evaluate TADM-3D on the OASIS-3 dataset, and we validate the generalisation performance on an external test set from the NACC dataset. The code will be available upon acceptance.

[60] Preserving instance continuity and length in segmentation through connectivity-aware loss computation

Karol Szustakowski,Luk Frank,Julia Esser,Jan Gründemann,Marie Piraud

Main category: cs.CV

TL;DR: This paper proposes novel loss functions for biomedical segmentation tasks that better preserve the continuity of output instances.

Details Motivation: In many biomedical segmentation tasks, preserving the continuity and length of structures matters more than voxel-wise accuracy. Method: The paper proposes two new loss functions, Negative Centerline Loss and Simplified Topology Loss, and discusses characteristics of experiment design such as downscaling and spacing correction. Result: Compared to standard CNNs and existing topology-aware losses, the method reduces the number of segmentation discontinuities, particularly in regions with missing input signal. Conclusion: Structural priors embedded in the loss design can significantly improve the reliability of segmentation for biological applications. Abstract: In many biomedical segmentation tasks, the preservation of elongated structure continuity and length is more important than voxel-wise accuracy. We propose two novel loss functions, Negative Centerline Loss and Simplified Topology Loss, that, applied to Convolutional Neural Networks (CNNs), help preserve connectivity of output instances. Moreover, we discuss characteristics of experiment design, such as downscaling and spacing correction, that help obtain continuous segmentation masks. We evaluate our approach on a 3D light-sheet fluorescence microscopy dataset of axon initial segments (AIS), a task prone to discontinuity due to signal dropout. Compared to standard CNNs and existing topology-aware losses, our methods reduce the number of segmentation discontinuities per instance, particularly in regions with missing input signal, resulting in improved instance length calculation in downstream applications. Our findings demonstrate that structural priors embedded in the loss design can significantly enhance the reliability of segmentation for biological applications.

[61] Count2Density: Crowd Density Estimation without Location-level Annotations

Mattia Litrico,Feng Chen,Michael Pound,Sotirios A Tsaftaris,Sebastiano Battiato,Mario Valerio Giuffrida

Main category: cs.CV

TL;DR: Count2Density is a novel pipeline that predicts meaningful density maps using only count-level annotations during training, significantly outperforming existing methods in semi-supervised settings.

Details Motivation: The main challenge in crowd density estimation is the reliance on fine-grained location-level annotations to train deep networks. Collecting such detailed annotations is tedious, time-consuming, and poses a barrier to scalability for real-world applications. Method: Count2Density generates pseudo-density maps leveraging past predictions stored in a Historical Map Bank, initialized using an unsupervised saliency estimator, and iteratively updated with an EMA of predicted density maps. It samples locations from estimated crowd areas using a hypergeometric distribution and adds a self-supervised contrastive spatial regularizer to enhance the model's spatial awareness. Result: Experimental results demonstrate that Count2Density significantly outperforms cross-domain adaptation methods and achieves better results than recent state-of-the-art approaches in semi-supervised settings across several datasets. Additional analyses validate the effectiveness of each individual component of the pipeline. Conclusion: Count2Density effectively retrieves spatial information from count-level annotations and enables accurate subregion counting, outperforming cross-domain adaptation methods and recent state-of-the-art approaches in semi-supervised settings. Abstract: Crowd density estimation is a well-known computer vision task aimed at estimating the density distribution of people in an image. The main challenge in this domain is the reliance on fine-grained location-level annotations (i.e., points placed on top of each individual) to train deep networks. Collecting such detailed annotations is tedious and time-consuming, and poses a significant barrier to scalability for real-world applications. To alleviate this burden, we present Count2Density: a novel pipeline designed to predict meaningful density maps containing quantitative spatial information using only count-level annotations (i.e., total number of people) during training.
To achieve this, Count2Density generates pseudo-density maps leveraging past predictions stored in a Historical Map Bank, thereby reducing confirmation bias. This bank is initialised using an unsupervised saliency estimator to provide an initial spatial prior and is iteratively updated with an EMA of predicted density maps. These pseudo-density maps are obtained by sampling locations from estimated crowd areas using a hypergeometric distribution, with the number of samplings determined by the count-level annotations. To further enhance the spatial awareness of the model, we add a self-supervised contrastive spatial regulariser to encourage similar feature representations within crowded regions while maximising dissimilarity with background regions. Experimental results demonstrate that our approach significantly outperforms cross-domain adaptation methods and achieves better results than recent state-of-the-art approaches in semi-supervised settings across several datasets. Additional analyses validate the effectiveness of each individual component of our pipeline, confirming the ability of Count2Density to effectively retrieve spatial information from count-level annotations and enabling accurate subregion counting.
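
The two mechanisms named in the abstract, the EMA update of the Historical Map Bank and count-constrained sampling of locations, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: maps are flat lists, `momentum` and `tickets` are hypothetical parameters, and sampling without replacement from an urn of per-pixel tickets stands in for the hypergeometric sampling step.

```python
import random


def ema_update(bank, prediction, momentum=0.9):
    """Blend a newly predicted density map into the stored map (flat lists)."""
    return [momentum * b + (1.0 - momentum) * p
            for b, p in zip(bank, prediction)]


def sample_pseudo_density(bank, count, tickets=100, seed=0):
    """Draw `count` locations without replacement from an urn in which each
    pixel holds tickets proportional to its stored density (a multivariate
    hypergeometric draw). Assumes sum(bank) > 0 and count <= urn size."""
    total = sum(bank)
    urn = []
    for idx, value in enumerate(bank):
        urn.extend([idx] * int(round(tickets * value / total)))
    picks = random.Random(seed).sample(urn, count)
    pseudo = [0] * len(bank)
    for idx in picks:
        pseudo[idx] += 1
    return pseudo


bank = ema_update([0.0, 0.5, 0.5, 0.0], [0.0, 1.0, 0.0, 0.0], momentum=0.5)
pseudo = sample_pseudo_density(bank, count=10)
```

By construction the pseudo-density map sums to the annotated count, and pixels with zero stored density never receive a sample.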

[62] AutoDetect: Designing an Autoencoder-based Detection Method for Poisoning Attacks on Object Detection Applications in the Military Domain

Alma M. Liezenga,Stefan Wijnja,Puck de Haan,Niels W. T. Brink,Jip J. van Stijn,Yori Kamphuis,Klamer Schutte

Main category: cs.CV

TL;DR: This study investigates the effect of poisoning attacks on military object detection systems and proposes a new detection method, AutoDetect.

Details Motivation: Because poisoning attacks threaten the security and robustness of AI systems in the military domain, and the widespread use of open-source datasets and pretrained models exacerbates this risk, it is necessary to study the effect of poisoning attacks on military object detection systems and the best way to detect them. Method: The researchers create a military vehicle dataset, MilCivVeh, and implement a modified version of the BadDet attack, a patch-based poisoning attack. They also propose AutoDetect, a simple and fast autoencoder-based detection method. Result: A high attack success rate is achievable but requires a substantial portion of the data to be poisoned. Specialized poisoning detection methods and anomaly detection methods from the industrial inspection domain were tested but performed poorly, so AutoDetect was proposed; it performs well at separating clean from poisoned samples. Conclusion: Object detection systems in the military domain are vulnerable to poisoning attacks, but a successful attack in practice requires poisoning a large amount of data. Existing detection methods are insufficient, and AutoDetect performs well at detecting poisoning attacks. Abstract: Poisoning attacks pose an increasing threat to the security and robustness of Artificial Intelligence systems in the military domain. The widespread use of open-source datasets and pretrained models exacerbates this risk. Despite the severity of this threat, there is limited research on the application and detection of poisoning attacks on object detection systems. This is especially problematic in the military domain, where attacks can have grave consequences. In this work, we both investigate the effect of poisoning attacks on military object detectors in practice, and the best approach to detect these attacks. To support this research, we create a small, custom dataset featuring military vehicles: MilCivVeh. We explore the vulnerability of military object detectors to poisoning attacks by implementing a modified version of the BadDet attack: a patch-based poisoning attack. We then assess its impact, finding that while a positive attack success rate is achievable, it requires a substantial portion of the data to be poisoned -- raising questions about its practical applicability. To address the detection challenge, we test both specialized poisoning detection methods and anomaly detection methods from the visual industrial inspection domain. Since our research shows that both classes of methods are lacking, we introduce our own patch detection method: AutoDetect, a simple, fast, and lightweight autoencoder-based method.
Our method shows promising results in separating clean from poisoned samples using the reconstruction error of image slices, outperforming existing methods, while being less time- and memory-intensive. We stress that the availability of large, representative datasets in the military domain is a prerequisite to further evaluate the risks of poisoning attacks and the opportunities for patch detection.
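
The idea of scoring images by the reconstruction error of image slices can be sketched as follows. This is only an illustration under assumptions: the autoencoder is not modelled, so `reconstruction` is taken as given; the slice size, the max-over-slices rule, and the threshold are hypothetical choices rather than details from the paper.

```python
def slice_errors(image, reconstruction, size):
    """Mean squared reconstruction error for each non-overlapping
    size x size slice of a 2D grayscale image (nested lists)."""
    errors = []
    for top in range(0, len(image) - size + 1, size):
        for left in range(0, len(image[0]) - size + 1, size):
            squared = 0.0
            for r in range(top, top + size):
                for c in range(left, left + size):
                    diff = image[r][c] - reconstruction[r][c]
                    squared += diff * diff
            errors.append(squared / (size * size))
    return errors


def looks_poisoned(image, reconstruction, size, threshold):
    """Flag the image if any slice reconstructs worse than the threshold."""
    return max(slice_errors(image, reconstruction, size)) > threshold


# A bright 2x2 patch that the (assumed) autoencoder fails to reproduce.
patched = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
clean_recon = [[0, 0, 0, 0] for _ in range(4)]
flag = looks_poisoned(patched, clean_recon, size=2, threshold=0.5)
```

Scoring per slice rather than per image keeps a small patch from being averaged away by the clean background around it.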

[63] PPORLD-EDNetLDCT: A Proximal Policy Optimization-Based Reinforcement Learning Framework for Adaptive Low-Dose CT Denoising

Debopom Sutradhar,Ripon Kumar Debnath,Mohaimenul Azam Khan Raiaan,Yan Zhang,Reem E. Mohamed,Sami Azam

Main category: cs.CV

TL;DR: The paper introduces PPORLD-EDNetLDCT, an RL-based encoder-decoder approach for low-dose CT image denoising, outperforming traditional and other deep learning methods in preserving image quality while reducing noise.

Details Motivation: Low-dose computed tomography (LDCT) is essential for minimizing radiation exposure but often results in increased noise and reduced image quality. Traditional denoising methods such as iterative optimization or supervised learning struggle to maintain image quality while reducing noise. Method: PPORLD-EDNetLDCT, a reinforcement learning-based (RL) approach with Encoder-Decoder for LDCT, was developed. It uses a dynamic RL-based approach with the proximal policy optimization (PPO) algorithm to optimize denoising policies in real time based on image quality feedback, trained via a custom gym environment. Result: Experimental results showed that PPORLD-EDNetLDCT outperformed traditional denoising techniques and other deep learning-based methods. It achieved a peak signal-to-noise ratio (PSNR) of 41.87, a structural similarity index measure (SSIM) of 0.9814, and a root mean squared error (RMSE) of 0.00236 on the low dose CT image and projection dataset. On the NIH-AAPM-Mayo Clinic Low Dose CT Challenge dataset, it achieved a PSNR of 41.52, SSIM of 0.9723, and RMSE of 0.0051. Additionally, on the COVID-19 LDCT dataset, the method improved classification accuracy to 94%, 4% higher than without RL-based denoising. Conclusion: PPORLD-EDNetLDCT offers a promising solution for safer and more accurate low-dose CT imaging by effectively reducing noise while preserving image quality, outperforming traditional and other deep learning-based denoising methods. Abstract: Low-dose computed tomography (LDCT) is critical for minimizing radiation exposure, but it often leads to increased noise and reduced image quality. Traditional denoising methods, such as iterative optimization or supervised learning, often fail to preserve image quality. To address these challenges, we introduce PPORLD-EDNetLDCT, a reinforcement learning-based (RL) approach with Encoder-Decoder for LDCT.
Our method utilizes a dynamic RL-based approach in which an advanced proximal policy optimization (PPO) algorithm is used to optimize denoising policies in real time, based on image quality feedback, trained via a custom gym environment. The experimental results on the low dose CT image and projection dataset demonstrate that the proposed PPORLD-EDNetLDCT model outperforms traditional denoising techniques and other DL-based methods, achieving a peak signal-to-noise ratio of 41.87, a structural similarity index measure of 0.9814 and a root mean squared error of 0.00236. Moreover, on the NIH-AAPM-Mayo Clinic Low Dose CT Challenge dataset our method achieved a PSNR of 41.52, SSIM of 0.9723 and RMSE of 0.0051. Furthermore, we validated the quality of denoising using a classification task on the COVID-19 LDCT dataset, where the images processed by our method improved the classification accuracy to 94%, 4% higher than without RL-based denoising. This method offers a promising solution for safer and more accurate LDCT imaging.
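
For readers unfamiliar with the reported metrics, the standard definitions of RMSE and PSNR can be sketched as below. The intensity range (`max_value`) is an assumption; the abstract does not state how the images were normalised, so this sketch is not expected to reproduce the paper's figures.

```python
import math


def rmse(reference, estimate):
    """Root mean squared error between two equally sized flat images."""
    squared = [(r - e) ** 2 for r, e in zip(reference, estimate)]
    return math.sqrt(sum(squared) / len(squared))


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 20 * log10(max_value / RMSE)."""
    error = rmse(reference, estimate)
    if error == 0:
        return float("inf")  # identical images
    return 20.0 * math.log10(max_value / error)


reference = [0.0, 0.0, 0.0, 0.0]
denoised = [0.1, 0.1, 0.1, 0.1]
score = psnr(reference, denoised)  # roughly 20 dB for a uniform 0.1 error
```

An RL reward built on image quality feedback would typically combine scores like these, though the exact reward used by the authors is not given in the abstract.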

[64] AIVA: An AI-based Virtual Companion for Emotion-aware Interaction

Chenxi Li

Main category: cs.CV

TL;DR: This paper presents AIVA, an AI virtual companion that integrates multimodal sentiment perception into Large Language Models to enhance Human-Computer Interaction with emotional cues.

Details Motivation: The motivation stems from the limitations of Large Language Models (LLMs) in interpreting emotional cues from non-verbal signals, which hinders more immersive and empathetic Human-Computer Interactions (HCI). Method: The paper proposes a framework called AIVA, which includes a Multimodal Sentiment Perception Network (MSPN) using a cross-modal fusion transformer and supervised contrastive learning, along with an emotion-aware prompt engineering strategy. Additionally, a Text-to-Speech (TTS) system and an animated avatar module are integrated for expressive interactions. Result: The result is the development of an AI-based virtual companion named AIVA, capable of capturing multimodal sentiment cues, thereby enabling emotionally aligned and animated HCI. Conclusion: The paper concludes that integrating multimodal sentiment perception into LLMs can lead to emotion-aware agents, enhancing Human-Computer Interaction in various fields such as companion robotics, social care, and mental health. Abstract: Recent advances in Large Language Models (LLMs) have significantly improved natural language understanding and generation, enhancing Human-Computer Interaction (HCI). However, LLMs are limited to unimodal text processing and lack the ability to interpret emotional cues from non-verbal signals, hindering more immersive and empathetic interactions. This work explores integrating multimodal sentiment perception into LLMs to create emotion-aware agents. We propose AIVA, an AI-based virtual companion that captures multimodal sentiment cues, enabling emotionally aligned and animated HCI. AIVA introduces a Multimodal Sentiment Perception Network (MSPN) using a cross-modal fusion transformer and supervised contrastive learning to provide emotional cues. Additionally, we develop an emotion-aware prompt engineering strategy for generating empathetic responses and integrate a Text-to-Speech (TTS) system and animated avatar module for expressive interactions.
AIVA provides a framework for emotion-aware agents with applications in companion robotics, social care, mental health, and human-centered AI.

[65] RTGMFF: Enhanced fMRI-based Brain Disorder Diagnosis via ROI-driven Text Generation and Multimodal Feature Fusion

Junhao Jia,Yifei Sun,Yunyou Liu,Cheng Yang,Changmiao Wang,Feiwei Qin,Yong Peng,Wenwen Min

Main category: cs.CV

TL;DR: This paper proposes RTGMFF, a new framework for diagnosing brain disorders using fMRI data by combining text generation, feature fusion, and semantic alignment techniques, showing improved diagnostic accuracy.

Details Motivation: The motivation is to enhance clinical diagnosis reliability by addressing issues such as low signal-to-noise ratios, inter-subject variability, and lack of textual annotations in existing fMRI datasets. Method: The study introduces RTGMFF, which combines ROI-level text generation, multimodal feature fusion, and semantic alignment techniques to analyze fMRI data. Result: Experiments on ADHD-200 and ABIDE benchmarks showed that RTGMFF outperforms current methods in diagnostic accuracy, sensitivity, specificity, and area under the ROC curve. Conclusion: RTGMFF provides a promising approach for improving the accuracy of brain-disorder diagnosis using fMRI data. Abstract: Functional magnetic resonance imaging (fMRI) is a powerful tool for probing brain function, yet reliable clinical diagnosis is hampered by low signal-to-noise ratios, inter-subject variability, and the limited frequency awareness of prevailing CNN- and Transformer-based models. Moreover, most fMRI datasets lack textual annotations that could contextualize regional activation and connectivity patterns. We introduce RTGMFF, a framework that unifies automatic ROI-level text generation with multimodal feature fusion for brain-disorder diagnosis. RTGMFF consists of three components: (i) ROI-driven fMRI text generation deterministically condenses each subject's activation, connectivity, age, and sex into reproducible text tokens; (ii) Hybrid frequency-spatial encoder fuses a hierarchical wavelet-mamba branch with a cross-scale Transformer encoder to capture frequency-domain structure alongside long-range spatial dependencies; and (iii) Adaptive semantic alignment module embeds the ROI token sequence and visual features in a shared space, using a regularized cosine-similarity loss to narrow the modality gap. 
Extensive experiments on the ADHD-200 and ABIDE benchmarks show that RTGMFF surpasses current methods in diagnostic accuracy, achieving notable gains in sensitivity, specificity, and area under the ROC curve. Code is available at https://github.com/BeistMedAI/RTGMFF.

[66] LGBP-OrgaNet: Learnable Gaussian Band Pass Fusion of CNN and Transformer Features for Robust Organoid Segmentation and Tracking

Jing Zhang,Siying Tao,Jiao Li,Tianhe Wang,Junchen Wu,Ruqian Hao,Xiaohui Du,Ruirong Tan,Rui Li

Main category: cs.CV

TL;DR: This paper presents LGBP-OrgaNet, a deep learning method for non-destructive segmentation and tracking of organoids by combining CNN and Transformer modules with innovative fusion techniques.

Details Motivation: Traditional fluorescence labeling methods can damage organoid structure, so there is a need for an automated, non-destructive segmentation and tracking approach. Method: LGBP-OrgaNet combines CNN and Transformer modules with a novel feature fusion module and a Bidirectional Cross Fusion Block for effective segmentation. Result: SROrga achieves good segmentation accuracy and robustness on organoid segmentation datasets. Conclusion: The proposed LGBP-OrgaNet offers a promising solution for organoid segmentation and tracking. Abstract: Organoids replicate organ structure and function, playing a crucial role in fields such as tumor treatment and drug screening. Their shape and size can indicate their developmental status, but traditional fluorescence labeling methods risk compromising their structure. Therefore, this paper proposes an automated, non-destructive approach to organoid segmentation and tracking. We introduced the LGBP-OrgaNet, a deep learning-based system proficient in accurately segmenting, tracking, and quantifying organoids. The model leverages complementary information extracted from CNN and Transformer modules and introduces the innovative feature fusion module, Learnable Gaussian Band Pass Fusion, to merge data from two branches. Additionally, in the decoder, the model proposes a Bidirectional Cross Fusion Block to fuse multi-scale features, and finally completes the decoding through progressive concatenation and upsampling. SROrga demonstrates satisfactory segmentation accuracy and robustness on organoids segmentation datasets, providing a potent tool for organoid research.

[67] PI3DETR: Parametric Instance Detection of 3D Point Cloud Edges with a Geometry-Aware 3DETR

Fabio F. Oberweger,Michael Schwingshackl,Vanessa Staderini

Main category: cs.CV

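TL;DR: PI3DETR is an end-to-end framework that predicts 3D parametric curve instances directly from raw point clouds, avoiding intermediate representations and multi-stage processing; it introduces a geometry-aware matching strategy and specialized loss functions to enable unified detection of differently parameterized curve types.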

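Details Motivation: To avoid the intermediate representations and multi-stage processing common in traditional methods, improve robustness to noise and varying sampling densities, and address key challenges in real-world LiDAR and 3D sensing scenarios. Method: Building on 3DETR, the model introduces a geometry-aware matching strategy and specialized loss functions that support unified detection of multiple parameterized curve types, such as cubic Bézier curves, line segments, circles, and arcs; optional post-processing steps further refine the predictions. Result: PI3DETR achieves state-of-the-art performance on the ABC dataset and generalizes effectively to real sensor data, offering a simple yet powerful solution for 3D edge and curve estimation. Conclusion: PI3DETR is an efficient framework for predicting 3D parametric curves, with good robustness and generalization, suitable for practical LiDAR and 3D sensing applications. Abstract: We present PI3DETR, an end-to-end framework that directly predicts 3D parametric curve instances from raw point clouds, avoiding the intermediate representations and multi-stage processing common in prior work. Extending 3DETR, our model introduces a geometry-aware matching strategy and specialized loss functions that enable unified detection of differently parameterized curve types, including cubic Bézier curves, line segments, circles, and arcs, in a single forward pass. Optional post-processing steps further refine predictions without adding complexity. This streamlined design improves robustness to noise and varying sampling densities, addressing critical challenges in real-world LiDAR and 3D sensing scenarios. PI3DETR sets a new state-of-the-art on the ABC dataset and generalizes effectively to real sensor data, offering a simple yet powerful solution for 3D edge and curve estimation.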

[68] SynBT: High-quality Tumor Synthesis for Breast Tumor Segmentation by 3D Diffusion Model

Hongxu Yang,Edina Timko,Levente Lippenszky,Vanda Czipczer,Lehel Ferenczi

Main category: cs.CV

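TL;DR: This paper proposes SynBT, a 3D medical diffusion model for generating high-quality breast tumor images; it compresses high-resolution MRI while preserving the resolution of large field-of-view volumes, thereby improving tumor segmentation performance in MRI images.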

Details Motivation: Existing tumor synthesis methods perform poorly when the tumor occupies a large spatial volume, particularly for breast tumor segmentation in large field-of-view MRI. Method: The paper proposes SynBT, a 3D medical diffusion model with a patch-to-volume autoencoder that compresses high-resolution MRI into a compact latent space, and then uses a mask-conditioned diffusion model to synthesize tumors. Result: On a tumor segmentation task, the method improves the Dice Score of common segmentation models by 2-3%. Conclusion: SynBT generates high-quality breast tumor images and yields a notable performance improvement for tumor segmentation in MRI images. Abstract: Synthetic tumors in medical images offer controllable characteristics that facilitate the training of machine learning models, leading to an improved segmentation performance. However, the existing methods of tumor synthesis yield suboptimal performances when the tumor occupies a large spatial volume, such as breast tumor segmentation in MRI with a large field-of-view (FOV), while commonly used tumor generation methods are based on small patches. In this paper, we propose a 3D medical diffusion model, called SynBT, to generate high-quality breast tumor (BT) in contrast-enhanced MRI images. The proposed model consists of a patch-to-volume autoencoder, which is able to compress the high-resolution MRIs into compact latent space, while preserving the resolution of volumes with large FOV. Using the obtained latent space feature vector, a mask-conditioned diffusion model is used to synthesize breast tumors within selected regions of breast tissue, resulting in realistic tumor appearances. We evaluated the proposed method on a tumor segmentation task, which demonstrated that the proposed high-quality tumor synthesis method can improve common segmentation models by 2-3% Dice Score on a large public dataset, and therefore provides benefits for tumor segmentation in MRI images.
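
The Dice Score used to report the 2-3% improvement follows the standard overlap definition, sketched here for binary masks. Flattened lists and the convention of returning 1.0 for two empty masks are assumptions of this sketch, not details from the paper.

```python
def dice_score(prediction, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary masks."""
    intersection = sum(1 for p, t in zip(prediction, target) if p and t)
    total = sum(1 for p in prediction if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * intersection / total


# One of two predicted voxels overlaps the single ground-truth voxel.
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])  # 2 * 1 / (2 + 1)
```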

[69] PointAD+: Learning Hierarchical Representations for Zero-shot 3D Anomaly Detection

Qihang Zhou,Shibo He,Jiangtao Yan,Wenchao Meng,Jiming Chen

Main category: cs.CV

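TL;DR: This paper proposes PointAD+, a framework for zero-shot (ZS) 3D anomaly detection on unseen objects with highly diverse class semantics. By combining rendering pixel abnormality with spatial abnormality, PointAD+ achieves a holistic understanding of 3D anomalies, and it can incorporate RGB information at test time to improve detection performance.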

Details Motivation: The paper aims to extend the strong generalization that CLIP shows on 2D data to 3D anomaly detection, in order to identify 3D anomalies in unseen objects with highly diverse class semantics. Method: The authors propose PointAD, which uses point-pixel correspondence to identify 3D anomalies through pixel representations, and then PointAD+, which introduces explicit 3D representation to capture abnormal spatial relationships. PointAD+ also brings in geometry information through G-aggregation and uses hierarchical representation learning with cross-hierarchy contrastive alignment to combine anomaly semantics from the rendering and geometry layers. Result: Experiments show that PointAD+ performs strongly in ZS 3D anomaly detection on unseen objects, achieves a holistic understanding of abnormality, and improves further when RGB information is included. Conclusion: PointAD+ is an effective zero-shot 3D anomaly detection framework that handles unseen objects with highly diverse class semantics and improves detection by combining rendering pixel abnormality with spatial abnormality. Abstract: In this paper, we aim to transfer CLIP's robust 2D generalization capabilities to identify 3D anomalies across unseen objects of highly diverse class semantics. To this end, we propose a unified framework to comprehensively detect and segment 3D anomalies by leveraging both point- and pixel-level information. We first design PointAD, which leverages point-pixel correspondence to represent 3D anomalies through their associated rendering pixel representations. This approach is referred to as implicit 3D representation, as it focuses solely on rendering pixel anomalies but neglects the inherent spatial relationships within point clouds. Then, we propose PointAD+ to further broaden the interpretation of 3D anomalies by introducing explicit 3D representation, emphasizing spatial abnormality to uncover abnormal spatial relationships. Hence, we propose G-aggregation to involve geometry information to make the aggregated point representations spatially aware. To simultaneously capture rendering and spatial abnormality, PointAD+ proposes hierarchical representation learning, incorporating implicit and explicit anomaly semantics into hierarchical text prompts: rendering prompts for the rendering layer and geometry prompts for the geometry layer. A cross-hierarchy contrastive alignment is further introduced to promote the interaction between the rendering and geometry layers, facilitating mutual anomaly learning. Finally, PointAD+ integrates anomaly semantics from both layers to capture the generalized anomaly semantics.
During the test, PointAD+ can integrate RGB information in a plug-and-play manner and further improve its detection performance. Extensive experiments demonstrate the superiority of PointAD+ in ZS 3D anomaly detection across unseen objects with highly diverse class semantics, achieving a holistic understanding of abnormality.

[70] Empowering Lightweight MLLMs with Reasoning via Long CoT SFT

Linyu Ou

Main category: cs.CV

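TL;DR: This study finds that supervised fine-tuning on long Chain-of-Thought data is a key step in improving the reasoning ability of lightweight multimodal language models.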

Details Motivation: Although Reinforcement Learning with Verifiable Rewards has enhanced the reasoning of large-scale language models (LLMs), its effectiveness for lightweight multimodal language models (MLLMs) remains underexplored. Method: The study runs Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) experiments on lightweight MLLMs with fewer than seven billion parameters to evaluate the effect of long Chain-of-Thought (long CoT) data on reasoning ability. Result: SFT with long CoT data significantly improves MLLM reasoning, and a subsequent RL stage yields further gains. Conclusion: An SFT stage with long CoT data is a critical prerequisite for developing the reasoning capabilities of lightweight MLLMs. Abstract: While Reinforcement Learning with Verifiable Rewards has enhanced the reasoning of large-scale language models (LLMs), its efficacy for lightweight multimodal language models (MLLMs) with fewer than seven billion parameters remains underexplored. This paper investigates the role of long Chain-of-Thought (long CoT) data in enhancing the reasoning abilities of such MLLMs. Our findings demonstrate that Supervised Fine-Tuning (SFT) with long CoT data significantly improves MLLM reasoning. Furthermore, we observe that after this initial SFT phase, MLLMs can achieve additional performance gains through a subsequent RL stage. We conclude that an SFT stage with long CoT data is a critical prerequisite for developing the reasoning capabilities of lightweight MLLMs.

[71] Heatmap Guided Query Transformers for Robust Astrocyte Detection across Immunostains and Resolutions

Xizhe Zhang,Jiayang Zhu

Main category: cs.CV

TL;DR: This paper introduces a hybrid CNN Transformer detector for improved automated detection of astrocytes in histological images, showing better performance than existing methods.

Details Motivation: Astrocytes are important glial cells affected in many neurological disorders. Their complex morphology and variability across stains make automated detection challenging, necessitating a more effective detection method. Method: A hybrid CNN Transformer detector was developed, integrating local feature extraction and global contextual reasoning. It uses a heatmap-guided query mechanism and a lightweight Transformer module to improve detection accuracy. Result: The model outperformed Faster R-CNN, YOLOv11, and DETR on ALDH1L1 and GFAP stained datasets, showing higher sensitivity and fewer false positives based on FROC analysis. Conclusion: The study concludes that the proposed hybrid CNN Transformer detector demonstrates high performance in detecting astrocytes in histological images, offering potential for robust detection and advanced computational pathology tools. Abstract: Astrocytes are critical glial cells whose altered morphology and density are hallmarks of many neurological disorders. However, their intricate branching and stain-dependent variability make automated detection in histological images a highly challenging task. To address these challenges, we propose a hybrid CNN Transformer detector that combines local feature extraction with global contextual reasoning. A heatmap-guided query mechanism generates spatially grounded anchors for small and faint astrocytes, while a lightweight Transformer module improves discrimination in dense clusters. Evaluated on ALDH1L1 and GFAP stained astrocyte datasets, the model consistently outperformed Faster R-CNN, YOLOv11 and DETR, achieving higher sensitivity with fewer false positives, as confirmed by FROC analysis. These results highlight the potential of hybrid CNN Transformer architectures for robust astrocyte detection and provide a foundation for advanced computational pathology tools.

[72] InfraDiffusion: zero-shot depth map restoration with diffusion models and prompted segmentation from sparse infrastructure point clouds

Yixiong Jing,Cheng Zhang,Haibing Wu,Guangming Wang,Olaf Wysocki,Brian Sheil

Main category: cs.CV

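TL;DR: This paper proposes InfraDiffusion, a zero-shot framework that improves the visual clarity and geometric consistency of depth maps derived from masonry point clouds without task-specific training.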

Details Motivation: Point clouds are better suited than high-resolution images to infrastructure monitoring in low-light environments, but their unstructured, sparse, and noisy nature limits fine-grained brick-level segmentation. Method: Masonry point clouds are projected into depth maps using virtual cameras and restored with the Denoising Diffusion Null-space Model (DDNM). Result: Experiments show that InfraDiffusion significantly improves brick-level segmentation on masonry bridge and tunnel point cloud datasets. Conclusion: InfraDiffusion is a zero-shot framework that improves brick-level segmentation, showing potential for the automated inspection of masonry assets. Abstract: Point clouds are widely used for infrastructure monitoring by providing geometric information, where segmentation is required for downstream tasks such as defect detection. Existing research has automated semantic segmentation of structural components, while brick-level segmentation (identifying defects such as spalling and mortar loss) has been primarily conducted from RGB images. However, acquiring high-resolution images is impractical in low-light environments like masonry tunnels. Point clouds, though robust to dim lighting, are typically unstructured, sparse, and noisy, limiting fine-grained segmentation. We present InfraDiffusion, a zero-shot framework that projects masonry point clouds into depth maps using virtual cameras and restores them by adapting the Denoising Diffusion Null-space Model (DDNM). Without task-specific training, InfraDiffusion enhances visual clarity and geometric consistency of depth maps. Experiments on masonry bridge and tunnel point cloud datasets show significant improvements in brick-level segmentation using the Segment Anything Model (SAM), underscoring its potential for automated inspection of masonry assets. Our code and data are available at https://github.com/Jingyixiong/InfraDiffusion-official-implement.

[73] Transformer-Guided Content-Adaptive Graph Learning for Hyperspectral Unmixing

Hui Chen,Liangyu Liu,Xianchao Xiu,Wanquan Liu

Main category: cs.CV

TL;DR: T-CAGU improves hyperspectral unmixing by combining transformers and graph networks to better capture global and local features, outperforming existing methods.

Details Motivation: Existing deep learning methods for hyperspectral unmixing struggle to simultaneously capture global dependencies and local consistency, affecting the preservation of long-range interactions and boundary details. Method: T-CAGU combines a transformer to capture global dependencies and a content-adaptive graph neural network for enhancing local relationships, integrating multiple propagation orders and a graph residual mechanism. Result: Experimental results demonstrate that T-CAGU outperforms state-of-the-art methods in hyperspectral unmixing. Conclusion: T-CAGU effectively overcomes the limitations of existing HU methods by simultaneously capturing global dependencies and local consistency, showing superior performance. Abstract: Hyperspectral unmixing (HU) aims to decompose each mixed pixel in remote sensing images into a set of endmembers and their corresponding abundances. Despite significant progress in this field using deep learning, most methods fail to simultaneously characterize global dependencies and local consistency, making it difficult to preserve both long-range interactions and boundary details. This letter proposes a novel transformer-guided content-adaptive graph unmixing framework (T-CAGU), which overcomes these challenges by employing a transformer to capture global dependencies and introducing a content-adaptive graph neural network to enhance local relationships. Unlike previous work, T-CAGU integrates multiple propagation orders to dynamically learn the graph structure, ensuring robustness against noise. Furthermore, T-CAGU leverages a graph residual mechanism to preserve global information and stabilize training. Experimental results demonstrate its superiority over the state-of-the-art methods. Our code is available at https://github.com/xianchaoxiu/T-CAGU.

[74] TinyDrop: Tiny Model Guided Token Dropping for Vision Transformers

Guoxin Wang,Qingyuan Wang,Binhua Huang,Shaowu Chen,Deepu John

Main category: cs.CV

TL;DR: TinyDrop is an efficient token dropping framework that substantially reduces the inference cost of ViTs without significantly affecting accuracy.

Details Motivation: Vision Transformers (ViTs) perform well in image classification but are computationally expensive because they process all image tokens. TinyDrop is proposed to reduce inference cost without compromising accuracy. Method: A lightweight vision model serves as a guidance model that estimates the importance of image tokens during inference and selectively discards low-importance tokens, reducing the computation the ViT spends on attention. Result: Experiments show that TinyDrop reduces ViT FLOPs by up to 80% on standard image classification benchmarks with only minimal accuracy degradation, demonstrating its generalization capability and practicality. Conclusion: TinyDrop is a training-free token dropping framework that significantly reduces ViT inference cost while maintaining high classification accuracy. Abstract: Vision Transformers (ViTs) achieve strong performance in image classification but incur high computational costs from processing all image tokens. To reduce inference costs in large ViTs without compromising accuracy, we propose TinyDrop, a training-free token dropping framework guided by a lightweight vision model. The guidance model estimates the importance of tokens while performing inference, thereby selectively discarding low-importance tokens when the large ViT model performs attention calculations. The framework operates plug-and-play, requires no architectural modifications, and is compatible with diverse ViT architectures. Evaluations on standard image classification benchmarks demonstrate that our framework reduces FLOPs by up to 80% for ViTs with minimal accuracy degradation, highlighting its generalization capability and practical utility for efficient ViT-based classification.
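The token dropping step can be illustrated with a minimal sketch. It assumes that importance scores have already been produced by some guidance model; the function name `drop_tokens`, the `keep_ratio` parameter, and the example scores are hypothetical and are not taken from the paper, whose actual selection rule is not given in the abstract.

```python
def drop_tokens(tokens, importance, keep_ratio=0.5):
    """Keep the highest-importance tokens, preserving their original order.

    `importance` stands in for scores from a guidance model (an assumption).
    """
    if len(tokens) != len(importance):
        raise ValueError("tokens and importance must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Rank indices by descending score; ties go to the earlier position.
    ranked = sorted(range(len(tokens)), key=lambda i: (-importance[i], i))
    # Restore spatial order so downstream layers see tokens in sequence.
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept], kept


tokens = ["p0", "p1", "p2", "p3", "p4", "p5"]
scores = [0.05, 0.40, 0.10, 0.30, 0.02, 0.13]  # made-up importance values
kept_tokens, kept_idx = drop_tokens(tokens, scores, keep_ratio=0.5)
print(kept_tokens, kept_idx)
```

Since self-attention cost grows roughly quadratically with the number of tokens, halving the token count in this way would cut the attention cost by more than half, which is the kind of saving the entry describes.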

[75] Human Preference-Aligned Concept Customization Benchmark via Decomposed Evaluation

Reina Ishikawa,Ryo Fujii,Hideo Saito,Ryo Hachiuma

Main category: cs.CV

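TL;DR: This paper proposes D-GPTScore, a new method for evaluating concept customization, and releases CC-AlignBench, a new benchmark dataset with single- and multi-concept tasks for stage-wise evaluation.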

Details Motivation: Evaluating concept customization is challenging because it requires a comprehensive assessment of fidelity to both generative prompts and concept images; moreover, existing metrics often give overly narrow or overly generalized evaluations, leading to misalignment with human preference. Method: The paper proposes Decomposed GPT Score (D-GPTScore), a new method that decomposes evaluation criteria into finer-grained aspects and uses a Multimodal Large Language Model (MLLM) for aspect-wise assessment. Result: The method significantly outperforms existing approaches on the CC-AlignBench benchmark and correlates more strongly with human preferences. Conclusion: This work proposes a new method for evaluating concept customization and highlights key challenges for future research. Abstract: Evaluating concept customization is challenging, as it requires a comprehensive assessment of fidelity to generative prompts and concept images. Moreover, evaluating multiple concepts is considerably more difficult than evaluating a single concept, as it demands detailed assessment not only for each individual concept but also for the interactions among concepts. While humans can intuitively assess generated images, existing metrics often provide either overly narrow or overly generalized evaluations, resulting in misalignment with human preference. To address this, we propose Decomposed GPT Score (D-GPTScore), a novel human-aligned evaluation method that decomposes evaluation criteria into finer aspects and incorporates aspect-wise assessments using Multimodal Large Language Model (MLLM). Additionally, we release Human Preference-Aligned Concept Customization Benchmark (CC-AlignBench), a benchmark dataset containing both single- and multi-concept tasks, enabling stage-wise evaluation across a wide difficulty range -- from individual actions to multi-person interactions. Our method significantly outperforms existing approaches on this benchmark, exhibiting higher correlation with human preferences. This work establishes a new standard for evaluating concept customization and highlights key challenges for future research. The benchmark and associated materials are available at https://github.com/ReinaIshikawa/D-GPTScore.

[76] Scalable and Loosely-Coupled Multimodal Deep Learning for Breast Cancer Subtyping

Mohammed Amer,Mohamed A. Suliman,Tu Bui,Nuria Garcia,Serban Georgescu

Main category: cs.CV

TL;DR: This paper proposes a scalable multimodal framework for breast cancer subtyping that integrates diverse data sources and demonstrates significant performance improvements.

Details Motivation: Breast cancer molecular subtyping is crucial for personalized treatment, and multimodal integration of diverse data sources can significantly improve clinical outcomes. Method: A scalable and loosely-coupled multimodal framework combining image-based and graph-based representations for WSIs, along with a new multimodal fusion strategy. Result: The dual-based WSI representation combined with CNV and clinical records outperformed state-of-the-art methods in breast cancer subtyping. Conclusion: The proposed multimodal framework enhances breast cancer subtyping and can be extended to other cancers with minimal adjustments. Abstract: Healthcare applications are inherently multimodal, benefiting greatly from the integration of diverse data sources. However, the modalities available in clinical settings can vary across different locations and patients. A key area that stands to gain from multimodal integration is breast cancer molecular subtyping, an important clinical task that can facilitate personalized treatment and improve patient prognosis. In this work, we propose a scalable and loosely-coupled multimodal framework that seamlessly integrates data from various modalities, including copy number variation (CNV), clinical records, and histopathology images, to enhance breast cancer subtyping. While our primary focus is on breast cancer, our framework is designed to easily accommodate additional modalities, offering the flexibility to scale up or down with minimal overhead without requiring re-training of existing modalities, making it applicable to other types of cancers as well. We introduce a dual-based representation for whole slide images (WSIs), combining traditional image-based and graph-based WSI representations. This novel dual approach results in significant performance improvements. Moreover, we present a new multimodal fusion strategy, demonstrating its ability to enhance performance across a range of multimodal conditions. 
Our comprehensive results show that integrating our dual-based WSI representation with CNV and clinical health records, along with our pipeline and fusion strategy, outperforms state-of-the-art methods in breast cancer subtyping.

[77] Time-Scaling State-Space Models for Dense Video Captioning

AJ Piergiovanni,Ganesh Satish Mallya,Dahun Kim,Anelia Angelova

Main category: cs.CV

TL;DR: This paper introduces a modified State-Space Model that efficiently handles long videos for dense captioning, enabling online processing with reduced computational cost.

Details Motivation: The motivation is to address the challenges of computational complexity, memory limitations, and the inability to process videos online in existing methods for dense video captioning. Method: The method involves enhancing State-Space Models (SSMs) by incorporating Transfer State, which enables the model to sustain its state over longer contexts, making it suitable for long video processing and online caption generation. Result: The proposed approach scales well with video lengths, uses 7x fewer FLOPs, and enables on-the-fly caption generation without requiring the full video as input. Conclusion: The paper concludes that the proposed State-Space Model with Transfer State effectively overcomes the limitations of traditional methods in handling long videos for dense video captioning, allowing for more efficient and practical online processing. Abstract: Dense video captioning is a challenging video understanding task which aims to simultaneously segment the video into a sequence of meaningful consecutive events and to generate detailed captions to accurately describe each event. Existing methods often encounter difficulties when working with the long videos associated with dense video captioning, due to the computational complexity and memory limitations. Furthermore, traditional approaches require the entire video as input, in order to produce an answer, which precludes online processing of the video. We address these challenges by time-scaling State-Space Models (SSMs) to even longer sequences than before. Our approach, State-Space Models with Transfer State, combines both the long-sequence and recurrent properties of SSMs and addresses the main limitation of SSMs which are otherwise not able to sustain their state for very long contexts, effectively scaling SSMs further in time. 
The proposed model is particularly suitable for generating captions on-the-fly, in an online or streaming manner, without having to wait for the full video to be processed, which is more beneficial in practice. When applied to dense video captioning, our approach scales well with video lengths and uses 7x fewer FLOPs.
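To make the online-processing idea concrete, here is a minimal sketch of a scalar linear recurrence whose state is carried from one chunk of input to the next. It only illustrates the generic recurrent property of state-space models; it is not the paper's Transfer State mechanism, which the abstract does not specify. The names `scan_chunk` and `scan_stream` and the coefficients `a` and `b` are assumptions chosen for illustration.

```python
def scan_chunk(xs, state, a=0.9, b=0.1):
    """Run h_t = a * h_{t-1} + b * x_t over one chunk.

    Returns the per-step outputs and the final state for the next chunk.
    """
    outputs = []
    for x in xs:
        state = a * state + b * x
        outputs.append(state)
    return outputs, state


def scan_stream(chunks, a=0.9, b=0.1):
    """Process chunks as they arrive, carrying the state across boundaries."""
    state = 0.0
    outputs = []
    for chunk in chunks:
        ys, state = scan_chunk(chunk, state, a, b)
        outputs.extend(ys)
    return outputs


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
full, _ = scan_chunk(signal, 0.0)  # whole sequence at once
streamed = scan_stream([signal[:2], signal[2:4], signal[4:]])  # chunk by chunk
print(streamed == full)
```

Because only the current state is kept between chunks, memory use does not grow with video length, and outputs can be emitted before the full input has been seen.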

[78] Decoding Visual Neural Representations by Multimodal with Dynamic Balancing

Kaili sun,Xingyu Miao,Bing Zhai,Haoran Duan,Yang Long

Main category: cs.CV

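TL;DR: This paper proposes a new framework that fuses EEG, image, and text data, introducing a text modality to strengthen semantic correspondence, and improves the accuracy of decoding visual content from low signal-to-noise ratio EEG signals.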

Details Motivation: To address the challenge of decoding visual neural representations from low signal-to-noise ratio EEG signals and to improve cross-modal semantic consistency by introducing a text modality. Method: A text modality is introduced to strengthen the semantic consistency between EEG signals and visual content, together with an adapter module, a Modal Consistency Dynamic Balance strategy, and a stochastic perturbation regularization term that improve the alignment, fusion, and generalization of cross-modal features. Result: On the ThingsEEG dataset, the method improves Top-1 and Top-5 accuracy by 2.0% and 4.7%, respectively, surpassing previous state-of-the-art methods. Conclusion: The paper proposes an innovative framework that introduces a text modality to strengthen the semantic correspondence between EEG signals and visual content, effectively decoding visual neural representations from low signal-to-noise ratio EEG signals. Abstract: In this work, we propose an innovative framework that integrates EEG, image, and text data, aiming to decode visual neural representations from low signal-to-noise ratio EEG signals. Specifically, we introduce text modality to enhance the semantic correspondence between EEG signals and visual content. With the explicit semantic labels provided by text, image and EEG features of the same category can be more closely aligned with the corresponding text representations in a shared multimodal space. To fully utilize pre-trained visual and textual representations, we propose an adapter module that alleviates the instability of high-dimensional representation while facilitating the alignment and fusion of cross-modal features. Additionally, to alleviate the imbalance in multimodal feature contributions introduced by the textual representations, we propose a Modal Consistency Dynamic Balance (MCDB) strategy that dynamically adjusts the contribution weights of each modality. We further propose a stochastic perturbation regularization (SPR) term to enhance the generalization ability of semantic perturbation-based models by introducing dynamic Gaussian noise in the modality optimization process. The evaluation results on the ThingsEEG dataset show that our method surpasses previous state-of-the-art methods in both Top-1 and Top-5 accuracy metrics, improving by 2.0% and 4.7% respectively.

[79] Joint Training of Image Generator and Detector for Road Defect Detection

Kuan-Chuan Peng

Main category: cs.CV

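TL;DR: This paper proposes JTGD, a method that jointly trains an image generator and a detector to improve road defect detection; it is suited to deployment on edge devices and outperforms the existing state of the art.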

Details Motivation: Road defect detection is important for reducing vehicle damage, but in practice edge devices have limited memory and computational resources, so an efficient detection method that does not rely on ensembles or test-time augmentation is needed. Method: JTGD augments data by jointly training an image generator and a detector; the generator uses dual discriminators to make both the synthesized defect patches and the overall images more realistic, and a CLIP-based Fréchet Inception Distance loss improves image quality. Result: JTGD outperforms the state of the art on the RDD2022 road defect detection benchmark while using less than 20% of the parameters of the competing baseline. Conclusion: JTGD is an efficient road defect detection method suited to deployment on edge devices; its joint training of generator and detector outperforms existing methods without ensembles or test-time augmentation. Abstract: Road defect detection is important for road authorities to reduce the vehicle damage caused by road defects. Considering the practical scenarios where the defect detectors are typically deployed on edge devices with limited memory and computational resources, we aim at performing road defect detection without using ensemble-based methods or test-time augmentation (TTA). To this end, we propose to Jointly Train the image Generator and Detector for road defect detection (dubbed as JTGD). We design the dual discriminators for the generative model to enforce both the synthesized defect patches and overall images to look plausible. The synthesized image quality is improved by our proposed CLIP-based Fréchet Inception Distance loss. The generative model in JTGD is trained jointly with the detector to encourage the generative model to synthesize harder examples for the detector. Since harder synthesized images of better quality caused by the aforesaid design are used in the data augmentation, JTGD outperforms the state-of-the-art method in the RDD2022 road defect detection benchmark across various countries under the condition of no ensemble and TTA. JTGD uses less than 20% of the number of parameters compared with the competing baseline, which makes it more suitable for deployment on edge devices in practice.

[80] Parameter-Efficient Adaptation of mPLUG-Owl2 via Pixel-Level Visual Prompts for NR-IQA

Yahya Benmahane,Mohammed El Hassouni

Main category: cs.CV

TL;DR: This paper introduces a novel, parameter-efficient method for No-Reference Image Quality Assessment using pixel-space visual prompts that train very few parameters while achieving strong performance across multiple datasets.

Details Motivation: The motivation is to develop an efficient adaptation method for No-Reference Image Quality Assessment that significantly reduces the number of parameters needed for training, thereby making the process more efficient and less resource-intensive. Method: The method involves a parameter-efficient adaptation technique that trains only a small number of parameters (up to 600K) while keeping the base model frozen. Visual prompts are combined with images during inference to be processed by mPLUG-Owl2 with a specific textual query. Result: The proposed method achieved competitive performance against fully fine-tuned methods and specialized NR-IQA models, with a result of 0.93 SRCC on the KADID-10k dataset across various distortion types. Conclusion: The paper concludes that their proposed method of using pixel-space visual prompts is effective for No-Reference Image Quality Assessment, offering efficient adaptation of Multimodal Large Language Models for low-level vision tasks. Abstract: In this paper, we propose a novel parameter-efficient adaptation method for No-Reference Image Quality Assessment (NR-IQA) using visual prompts optimized in pixel-space. Unlike full fine-tuning of Multimodal Large Language Models (MLLMs), our approach trains only 600K parameters at most (< 0.01% of the base model), while keeping the underlying model fully frozen. During inference, these visual prompts are combined with images via addition and processed by mPLUG-Owl2 with the textual query "Rate the technical quality of the image." Evaluations across distortion types (synthetic, realistic, AI-generated) on KADID-10k, KonIQ-10k, and AGIQA-3k demonstrate competitive performance against fully fine-tuned methods and specialized NR-IQA models, achieving 0.93 SRCC on KADID-10k. To our knowledge, this is the first work to leverage pixel-space visual prompts for NR-IQA, enabling efficient MLLM adaptation for low-level vision tasks. 
The source code is publicly available at https://github.com/yahya-ben/mplug2-vp-for-nriqa.
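For readers unfamiliar with the SRCC figure quoted in this entry, the following minimal sketch computes the Spearman rank correlation coefficient between predicted quality scores and reference opinion scores. It is a generic, standard-library implementation of the textbook definition (Pearson correlation of average ranks), not the authors' evaluation code; the function names and the example scores are made up for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srcc(predicted, reference):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(predicted), average_ranks(reference)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("SRCC is undefined when one input is constant")
    return cov / (vx * vy) ** 0.5


predicted = [0.2, 0.5, 0.4, 0.9]  # hypothetical model scores
mos = [1.0, 3.0, 2.5, 4.5]        # hypothetical mean opinion scores
print(srcc(predicted, mos))
```

SRCC depends only on the ordering of the scores, so a model can reach a high value without matching the absolute scale of the human ratings.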

[81] OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation

Han Li,Xinyu Peng,Yaoming Wang,Zelin Peng,Xin Chen,Rongxiang Weng,Jingang Wang,Xunliang Cai,Wenrui Dai,Hongkai Xiong

Main category: cs.CV

TL;DR: OneCAT is a pure decoder-only transformer model that integrates multimodal understanding, generation, and editing efficiently through a modality-specific MoE structure and a multi-scale visual autoregressive mechanism, setting new performance standards in unified multimodal intelligence.

Details Motivation: The motivation is to create a unified multimodal model that integrates understanding, generation, and editing efficiently without relying on external components like Vision Transformers (ViT) or vision tokenizers. Method: OneCAT utilizes a modality-specific Mixture-of-Experts (MoE) structure trained with a single autoregressive (AR) objective and introduces a multi-scale visual autoregressive mechanism within the Large Language Model (LLM). Result: The model achieves significant efficiency gains, especially for high-resolution inputs, drastically reduces decoding steps compared to diffusion-based methods, and maintains state-of-the-art performance. Conclusion: OneCAT sets a new performance standard, outperforming existing open-source unified multimodal models across benchmarks for multimodal generation, editing, and understanding. Abstract: We introduce OneCAT, a unified multimodal model that seamlessly integrates understanding, generation, and editing within a novel, pure decoder-only transformer architecture. Our framework uniquely eliminates the need for external components such as Vision Transformers (ViT) or vision tokenizer during inference, leading to significant efficiency gains, especially for high-resolution inputs. This is achieved through a modality-specific Mixture-of-Experts (MoE) structure trained with a single autoregressive (AR) objective, which also natively supports dynamic resolutions. Furthermore, we pioneer a multi-scale visual autoregressive mechanism within the Large Language Model (LLM) that drastically reduces decoding steps compared to diffusion-based methods while maintaining state-of-the-art performance. Our findings demonstrate the powerful potential of pure autoregressive modeling as a sufficient and elegant foundation for unified multimodal intelligence. 
As a result, OneCAT sets a new performance standard, outperforming existing open-source unified multimodal models across benchmarks for multimodal generation, editing, and understanding.

[82] DeepSea MOT: A benchmark dataset for multi-object tracking on deep-sea video

Kevin Barnard,Elaine Liu,Kristine Walz,Brian Schlining,Nancy Jacobsen Stout,Lonny Lundsten

Main category: cs.CV

TL;DR: This study introduces the first benchmark dataset for evaluating multi-object tracking models on deep-sea video footage, along with tools and workflows for performance assessment.

Details Motivation: Benchmarking model performance is crucial for consistent comparisons and optimization in machine learning. The lack of publicly available benchmarks for multi-object tracking in deep-sea video footage motivated this study. Method: The researchers created a benchmark dataset comprising four video sequences representing deep-sea environments. They evaluated model performance using Higher Order Tracking Accuracy, a comprehensive metric that balances detection, localization, and association accuracy. Result: The study resulted in the first publicly available benchmark dataset for multi-object tracking in deep-sea video footage. It includes a documented workflow for generating additional benchmark videos and example Python notebooks for metric computation. Conclusion: The study successfully developed a novel benchmark video dataset for evaluating multi-object tracking and object detection models, providing a foundation for consistent comparisons and performance optimization in deep-sea video analysis. Abstract: Benchmarking multi-object tracking and object detection model performance is an essential step in machine learning model development, as it allows researchers to evaluate model detection and tracker performance on human-generated 'test' data, facilitating consistent comparisons between models and trackers and aiding performance optimization. In this study, a novel benchmark video dataset was developed and used to assess the performance of several Monterey Bay Aquarium Research Institute object detection models and a FathomNet single-class object detection model together with several trackers. The dataset consists of four video sequences representing midwater and benthic deep-sea habitats. Performance was evaluated using Higher Order Tracking Accuracy, a metric that balances detection, localization, and association accuracy. 
To the best of our knowledge, this is the first publicly available benchmark for multi-object tracking in deep-sea video footage. We provide the benchmark data, a clearly documented workflow for generating additional benchmark videos, as well as example Python notebooks for computing metrics.

[83] Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data

Honglu Zhou,Xiangyu Peng,Shrikant Kendre,Michael S. Ryoo,Silvio Savarese,Caiming Xiong,Juan Carlos Niebles

Main category: cs.CV

TL;DR: Strefer is a framework designed to enhance Video LLMs with spatiotemporal reasoning by generating synthetic instruction-tuning data without costly annotations.

Details Motivation: Existing Video LLMs struggle with fine-grained spatiotemporal reasoning, which is essential for real-world AI companions. Method: A synthetic instruction data generation framework called Strefer was developed to produce diverse instruction-tuning data by pseudo-annotating fine-grained video metadata. Result: Models trained with Strefer outperformed baselines in tasks requiring spatial and temporal disambiguation and showed improved space-time-aware reasoning. Conclusion: Strefer provides a framework to enhance Video LLMs with spatiotemporal reasoning capabilities without costly annotations or models. Abstract: Next-generation AI companions must go beyond general video understanding to resolve spatial and temporal references in dynamic, real-world environments. Existing Video Large Language Models (Video LLMs), while capable of coarse-level comprehension, struggle with fine-grained, spatiotemporal reasoning, especially when user queries rely on time-based event references for temporal anchoring, or gestural cues for spatial anchoring to clarify object references and positions. To bridge this critical gap, we introduce Strefer, a synthetic instruction data generation framework designed to equip Video LLMs with spatiotemporal referring and reasoning capabilities. Strefer produces diverse instruction-tuning data using a data engine that pseudo-annotates temporally dense, fine-grained video metadata, capturing rich spatial and temporal information in a structured manner, including subjects, objects, their locations as masklets, and their action descriptions and timelines. Our approach enhances the ability of Video LLMs to interpret spatial and temporal references, fostering more versatile, space-time-aware reasoning essential for real-world AI companions. 
Without using proprietary models, costly human annotation, or the need to annotate large volumes of new videos, experimental evaluations show that models trained with data produced by Strefer outperform baselines on tasks requiring spatial and temporal disambiguation. Additionally, these models exhibit enhanced space-time-aware reasoning, establishing a new foundation for perceptually grounded, instruction-tuned Video LLMs.

[84] A comprehensive Persian offline handwritten database for investigating the effects of heritability and family relationships on handwriting

Abbas Zohrevand,Javad Sadri,Zahra Imani

Main category: cs.CV

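TL;DR: This paper introduces a comprehensive database for studying the effect of heredity on handwriting and finds similarities in handwriting features and styles among family members.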

Details Motivation: To explore whether handwriting has a genetic component and whether family relationships affect handwriting style. Method: A comprehensive database was created by collecting handwritten samples (digits, letters, shapes, and free paragraphs) and recording the family relationships of all writers using specially designed forms. Result: The handwriting features and styles of family members show similarities, and the database is, to the authors' knowledge, the only resource of its kind currently available. Conclusion: By building a database of handwriting samples from members of 210 families, the study finds similarities in handwriting features and styles among family members; the database is expected to pave the way for research on the effects of heredity and family relationships on handwriting. Abstract: This paper introduces a comprehensive database for research and investigation on the effects of inheritance on handwriting. A database has been created that can be used to answer questions such as: Is there a genetic component to handwriting? Is handwriting inherited? Do family relationships affect handwriting? A variety of handwritten samples, such as digits, letters, shapes and free paragraphs, from 210 families (including grandparents, parents, uncles, aunts, siblings, cousins, nephews and nieces) have been collected using specially designed forms, and the family relationships of all writers are captured. To the best of our knowledge, no such database is presently available. Based on comparisons and investigation of the handwriting features of family members, similarities among their features and writing styles are detected. Our database is freely available to the pattern recognition community, and we hope it will pave the way for investigations on the effects of inheritance and family relationships on handwriting.

[85] Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

Ouxiang Li,Yuan Wang,Xinting Hu,Huijuan Huang,Rui Chen,Jiarong Ou,Xin Tao,Pengfei Wan,Fuli Feng

Main category: cs.CV

TL;DR: T2I-CoReBench is a new comprehensive benchmark that evaluates both composition and reasoning capabilities of text-to-image generation models, revealing that current models struggle with complex scenes and reasoning tasks.

Details Motivation: Existing benchmarks are limited in comprehensively evaluating the composition and reasoning capabilities of T2I models, especially as these models advance and can handle more complex prompts. There is a need for a benchmark that reflects real-world complexities and evaluates both core capabilities effectively. Method: T2I-CoReBench structures composition around scene graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), creating a 12-dimensional taxonomy. Prompts are curated with high compositional density and multi-step inference, and each prompt is paired with a checklist for fine-grained evaluation. Result: Experiments across 27 T2I models show that while composition capability remains limited in complex high-density scenarios, reasoning capability is a critical bottleneck, with all models struggling to infer implicit elements from prompts. Conclusion: The proposed T2I-CoReBench addresses the limitations of existing benchmarks by offering a comprehensive and complex evaluation of both composition and reasoning capabilities in T2I models, revealing that current models struggle with complex, high-density scenarios and reasoning tasks. Abstract: Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred, thereby corresponding to two core capabilities: composition and reasoning. However, with the emerging advances of T2I models in reasoning beyond composition, existing benchmarks reveal clear limitations in providing comprehensive evaluations across and within these capabilities. Meanwhile, these advances also enable models to handle more complex prompts, whereas current benchmarks remain limited to low scene density and simplified one-to-one reasoning. 
To address these limitations, we propose T2I-CoReBench, a comprehensive and complex benchmark that evaluates both composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), formulating a 12-dimensional evaluation taxonomy. To increase complexity, driven by the inherent complexities of real-world scenarios, we curate each prompt with high compositional density for composition and multi-step inference for reasoning. We also pair each prompt with a checklist that specifies individual yes/no questions to assess each intended element independently to facilitate fine-grained and reliable evaluation. In total, our benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments across 27 current T2I models reveal that their composition capability remains limited in complex high-density scenarios, while the reasoning capability lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts. Our project page: https://t2i-corebench.github.io/.
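The checklist-based evaluation described in this entry can be sketched as follows. The abstract says only that each prompt is paired with independent yes/no questions; the aggregation used here (fraction of "yes" answers per prompt, then an unweighted mean over prompts) is an assumption made for illustration, as are the function names and the example answers.

```python
def checklist_score(answers):
    """Fraction of checklist questions answered 'yes' for one generated image."""
    if not answers:
        raise ValueError("a checklist needs at least one question")
    return sum(1 for answer in answers if answer) / len(answers)


def benchmark_score(per_prompt_answers):
    """Unweighted mean of per-prompt checklist scores (assumed aggregation)."""
    if not per_prompt_answers:
        raise ValueError("at least one prompt is required")
    scores = [checklist_score(answers) for answers in per_prompt_answers]
    return sum(scores) / len(scores)


# Hypothetical yes/no judgments for two prompts with four questions each.
judgments = [
    [True, True, False, True],
    [False, False, True, True],
]
overall = benchmark_score(judgments)
print(overall)
```

Scoring each intended element separately means a partially correct image earns partial credit, which is what makes the evaluation fine-grained compared with a single pass/fail judgment per prompt.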