Table of Contents
cs.CL
[1] Adverse Event Extraction from Discharge Summaries: A New Dataset, Annotation Scheme, and Initial Findings
Imane Guellil,Salomé Andres,Atul Anand,Bruce Guthrie,Huayu Zhang,Abul Hasan,Honghan Wu,Beatrice Alex
Main category: cs.CL
TL;DR: A manually annotated corpus for Adverse Event (AE) extraction from discharge summaries of elderly patients is presented, featuring clinically significant AEs and contextual attributes. Transformer-based models perform well on coarse-grained tasks but struggle with fine-grained entity-level tasks, highlighting challenges in detecting underrepresented AEs.
Details
Motivation: To address the underrepresentation of elderly patients in clinical NLP resources and create a robust benchmark for evaluating AE extraction methods. Method: The dataset includes 14 clinically significant AEs along with contextual attributes like negation, diagnosis type, and in-hospital occurrence. The annotation schema supports both discontinuous and overlapping entities. Multiple models were evaluated using FlairNLP across three annotation granularities: fine-grained, coarse-grained, and coarse-grained with negation. Result: Transformer-based models achieved strong performance on document-level coarse-grained extraction (F1 = 0.943), but performance dropped significantly for fine-grained entity-level tasks (e.g., F1 = 0.675), especially for rare events and complex attributes. Conclusion: Despite high-level scores, significant challenges remain in detecting underrepresented AEs and capturing nuanced clinical language. Abstract: In this work, we present a manually annotated corpus for Adverse Event (AE) extraction from discharge summaries of elderly patients, a population often underrepresented in clinical NLP resources. The dataset includes 14 clinically significant AEs, such as falls, delirium, and intracranial haemorrhage, along with contextual attributes like negation, diagnosis type, and in-hospital occurrence. Uniquely, the annotation schema supports both discontinuous and overlapping entities, addressing challenges rarely tackled in prior work. We evaluate multiple models using FlairNLP across three annotation granularities: fine-grained, coarse-grained, and coarse-grained with negation. While transformer-based models (e.g., BERT-cased) achieve strong performance on document-level coarse-grained extraction (F1 = 0.943), performance drops notably for fine-grained entity-level tasks (e.g., F1 = 0.675), particularly for rare events and complex attributes.
These results demonstrate that despite high-level scores, significant challenges remain in detecting underrepresented AEs and capturing nuanced clinical language. Developed within a Trusted Research Environment (TRE), the dataset is available upon request via DataLoch and serves as a robust benchmark for evaluating AE extraction methods and supporting future cross-dataset generalisation.
[2] Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction
Marija Šakota,Robert West
Main category: cs.CL
TL;DR: Boosted Constrained Decoding (BoostCD) is proposed to improve the quality of constrained decoding in structured NLP tasks by combining constrained and unconstrained decoding, which leads to better performance in closed information extraction.
Details
Motivation: Many recent structured NLP approaches use autoregressive language models with constrained decoding for mapping unstructured input text to structured output. However, during test time, this can lead to low-quality outputs due to the model's lack of awareness of constraints during training. Method: The method involves two phases: Phase 1 decodes from the base model twice, once in constrained mode and once in unconstrained mode, to obtain two weak predictions. Phase 2 uses a learned autoregressive boosted model to combine these weak predictions into one final prediction. Result: BoostCD was applied to closed information extraction, resulting in a model called BoostIE. BoostIE outperforms prior approaches both in-distribution and out-of-distribution, addressing several common errors found in previous methods. Conclusion: BoostCD effectively improves the quality of constrained decoding in structured NLP tasks by leveraging the complementary mistakes made by the base model in constrained and unconstrained modes. Abstract: Many recent approaches to structured NLP tasks use an autoregressive language model $M$ to map unstructured input text $x$ to output text $y$ representing structured objects (such as tuples, lists, trees, code, etc.), where the desired output structure is enforced via constrained decoding. During training, these approaches do not require the model to be aware of the constraints, which are merely implicit in the training outputs $y$. This is advantageous as it allows for dynamic constraints without requiring retraining, but can lead to low-quality output during constrained decoding at test time. We overcome this problem with Boosted Constrained Decoding (BoostCD), which combines constrained and unconstrained decoding in two phases: Phase 1 decodes from the base model $M$ twice, in constrained and unconstrained mode, obtaining two weak predictions. 
In phase 2, a learned autoregressive boosted model combines the two weak predictions into one final prediction. The mistakes made by the base model with vs. without constraints tend to be complementary, which the boosted model learns to exploit for improved performance. We demonstrate the power of BoostCD by applying it to closed information extraction. Our model, BoostIE, outperforms prior approaches both in and out of distribution, addressing several common errors identified in those approaches.
[3] CrEst: Credibility Estimation for Contexts in LLMs via Weak Supervision
Dyah Adila,Shuai Zhang,Boran Han,Bonan Min,Yuyang Wang
Main category: cs.CL
TL;DR: This paper presents CrEst, a weakly supervised framework that assesses the credibility of context documents during LLM inference without manual annotations. It proposes two integration strategies and reports improvements of up to 26.86% in accuracy and 3.49% in F1 score over strong baselines.
Details
Motivation: Current methods often ignore the credibility of context documents, which can vary widely, potentially leading to the propagation of unreliable information in LLMs. Method: CrEst is a novel weakly supervised framework that assesses the credibility of context documents during LLM inference. The approach relies on the insight that credible documents tend to exhibit higher semantic coherence with other credible documents. Two integration strategies are also proposed: a black-box approach for models without access to internal weights or activations, and a white-box method that directly modifies attention mechanisms. Result: Extensive experiments across three model architectures and five datasets demonstrate that CrEst consistently outperforms strong baselines, achieving up to a 26.86% improvement in accuracy and a 3.49% increase in F1 score. Further analysis shows that CrEst maintains robust performance even under high-noise conditions. Conclusion: CrEst is an effective framework for enhancing the performance of LLMs by assessing the credibility of context documents. Abstract: The integration of contextual information has significantly enhanced the performance of large language models (LLMs) on knowledge-intensive tasks. However, existing methods often overlook a critical challenge: the credibility of context documents can vary widely, potentially leading to the propagation of unreliable information. In this paper, we introduce CrEst, a novel weakly supervised framework for assessing the credibility of context documents during LLM inference, without requiring manual annotations. Our approach is grounded in the insight that credible documents tend to exhibit higher semantic coherence with other credible documents, enabling automated credibility estimation through inter-document agreement.
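As a rough illustration of credibility estimation through inter-document agreement, the sketch below scores each context document by its mean cosine similarity to the other documents. This is a simplified stand-in, not the CrEst estimator itself: the function names, the toy 2-D embeddings, and the choice of mean cosine similarity are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def agreement_scores(embeddings):
    """Score each document by its mean similarity to all other documents.

    Assumption: a higher mean similarity is treated as a proxy for credibility.
    """
    n = len(embeddings)
    scores = []
    for i in range(n):
        sims = [cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i]
        scores.append(sum(sims) / len(sims) if sims else 0.0)
    return scores


# Hypothetical embeddings: the first two documents agree, the third is an outlier.
doc_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
scores = agreement_scores(doc_embeddings)
least_credible = scores.index(min(scores))
```

Under this heuristic the outlier document receives the lowest score; how such a score is then fed into a black-box or white-box integration is not shown here.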
To incorporate credibility into LLM inference, we propose two integration strategies: a black-box approach for models without access to internal weights or activations, and a white-box method that directly modifies attention mechanisms. Extensive experiments across three model architectures and five datasets demonstrate that CrEst consistently outperforms strong baselines, achieving up to a 26.86% improvement in accuracy and a 3.49% increase in F1 score. Further analysis shows that CrEst maintains robust performance even under high-noise conditions.
[4] MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance
Joseph J. Peper,Wenzhao Qiu,Ali Payani,Lu Wang
Main category: cs.CL
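TL;DR: MDBench is a new dataset for evaluating large language models on multi-document reasoning. Its novel synthetic generation process efficiently creates challenging document sets and associated question-answer examples, revealing the significant difficulties that existing models and techniques face in multi-document reasoning.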
Details
Motivation: Natural language processing evaluation has progressed significantly with the development of powerful large language models. However, few benchmarks target multi-document reasoning in these models, and creating such benchmarks has historically been difficult because annotating long inputs is expensive. A new dataset is therefore needed to rigorously evaluate the multi-document reasoning capabilities of large language models. Method: MDBench is created through a novel synthetic generation process that operates on condensed structured seed knowledge and modifies it through LLM-assisted edits to introduce multi-document-specific reasoning challenges. The structured knowledge is then converted into natural text, yielding document sets and corresponding question-answer examples. Result: MDBench poses significant challenges for all methods, even with relatively short document sets. The knowledge-guided generation technique enables targeted analysis of multi-document-specific reasoning capabilities and can be adapted quickly to new challenges and future modeling improvements. Conclusion: MDBench provides an effective way to evaluate and analyze the performance of large language models on multi-document reasoning, revealing the limitations of current models and demonstrating adaptability to future improvements. Abstract: Natural language processing evaluation has made significant progress, largely driven by the proliferation of powerful large language models (LLMs). New evaluation benchmarks are of increasing priority as the reasoning capabilities of LLMs are expanding at a rapid pace. In particular, while multi-document (MD) reasoning is an area of extreme relevance given LLM capabilities in handling longer-context inputs, few benchmarks exist to rigorously examine model behavior in this setting. Moreover, the multi-document setting is historically challenging for benchmark creation due to the expensive cost of annotating long inputs. In this work, we introduce MDBench, a new dataset for evaluating LLMs on the task of multi-document reasoning. Notably, MDBench is created through a novel synthetic generation process, allowing us to controllably and efficiently generate challenging document sets and the corresponding question-answer (QA) examples. Our novel technique operates on condensed structured seed knowledge, modifying it through LLM-assisted edits to induce MD-specific reasoning challenges. We then convert this structured knowledge into a natural text surface form, generating a document set and corresponding QA example. We analyze the behavior of popular LLMs and prompting techniques, finding that MDBench poses significant challenges for all methods, even with relatively short document sets.
We also find that our knowledge-guided generation technique (1) allows us to readily perform targeted analysis of MD-specific reasoning capabilities and (2) can be adapted quickly to account for new challenges and future modeling improvements.
[5] From Chat to Checkup: Can Large Language Models Assist in Diabetes Prediction?
Shadman Sakib,Oishy Fatema Akhand,Ajwad Abrar
Main category: cs.CL
TL;DR: This paper explores the use of LLMs for diabetes prediction using structured numerical data, comparing open-source and proprietary models against traditional ML methods. Proprietary LLMs outperform open-source ones, with Gemma-2-27B showing notable results. Issues like prompting strategy variation exist, but LLMs show potential for medical predictions.
Details
Motivation: To investigate the effectiveness of Large Language Models (LLMs) in predicting diabetes using structured numerical data, an area not yet well explored, compared to traditional machine learning and deep learning approaches. Method: Empirical analysis was conducted on the Pima Indian Diabetes Database (PIDD) using zero-shot, one-shot, and three-shot prompting methods. Six LLMs were evaluated, including four open-source models and two proprietary models, alongside three traditional machine learning models. Evaluation metrics included accuracy, precision, recall, and F1-score. Result: Proprietary LLMs performed better than open-source ones, with GPT-4o and Gemma-2-27B achieving the highest accuracy in few-shot settings. Gemma-2-27B also outperformed traditional ML models in terms of F1-score. However, performance varied across prompting strategies and domain-specific fine-tuning was needed. Conclusion: LLMs can be useful for medical prediction tasks, suggesting potential future work in prompt engineering and hybrid approaches to improve healthcare predictions. Abstract: While Machine Learning (ML) and Deep Learning (DL) models have been widely used for diabetes prediction, the use of Large Language Models (LLMs) for structured numerical data is still not well explored. In this study, we test the effectiveness of LLMs in predicting diabetes using zero-shot, one-shot, and three-shot prompting methods. We conduct an empirical analysis using the Pima Indian Diabetes Database (PIDD). We evaluate six LLMs, including four open-source models: Gemma-2-27B, Mistral-7B, Llama-3.1-8B, and Llama-3.2-2B. We also test two proprietary models: GPT-4o and Gemini Flash 2.0. In addition, we compare their performance with three traditional machine learning models: Random Forest, Logistic Regression, and Support Vector Machine (SVM). We use accuracy, precision, recall, and F1-score as evaluation metrics. 
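For reference, the four evaluation metrics named above can be computed from binary predictions as in the following minimal sketch. The label vectors are made up for illustration and are not taken from PIDD or from the paper's results; the function name `binary_metrics` is likewise an assumption.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical gold labels and model outputs (1 = diabetic, 0 = non-diabetic).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

With an LLM, `y_pred` would be obtained by parsing each generated answer into a 0/1 label before scoring; that parsing step is not shown.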
Our results show that proprietary LLMs perform better than open-source ones, with GPT-4o and Gemma-2-27B achieving the highest accuracy in few-shot settings. Notably, Gemma-2-27B also outperforms the traditional ML models in terms of F1-score. However, there are still issues such as performance variation across prompting strategies and the need for domain-specific fine-tuning. This study shows that LLMs can be useful for medical prediction tasks and encourages future work on prompt engineering and hybrid approaches to improve healthcare predictions.
[6] Memory Tokens: Large Language Models Can Generate Reversible Sentence Embeddings
Ignacio Sastre,Aiala Rosá
Main category: cs.CL
TL;DR: The paper explores reversible sentence embeddings enabling LLMs to reconstruct original text exactly using a special memory token, without altering model weights. This is evaluated across languages, sequence lengths, and model scales, with notable success from Llama 3.1 8B. The findings suggest applications in retrieval, compression, and controlled text generation.
Details
Motivation: To investigate the capability of large language models (LLMs) to reconstruct original text exactly from sentence embeddings without modifying the model's weights. Method: Introduce a special memory token whose embedding is optimized through training on a fixed sequence. When prompted with this embedding, the model reconstructs the fixed sequence exactly. Result: Successful reconstruction of sequences up to approximately 240 tokens in English and Spanish datasets, across model scales ranging from 100M to 8B parameters. Notably, Llama 3.1 8B successfully reconstructs all tested sequences. Conclusion: This study reveals an intriguing capability of LLMs and proposes potential applications in memory-based retrieval, compression, and controlled text generation. Abstract: In this work, we observe an interesting phenomenon: it is possible to generate reversible sentence embeddings that allow an LLM to reconstruct the original text exactly, without modifying the model's weights. This is achieved by introducing a special memory token, whose embedding is optimized through training on a fixed sequence. When prompted with this embedding, the model reconstructs the fixed sequence exactly. We evaluate this phenomenon across English and Spanish datasets, sequences of up to approximately 240 tokens, and model scales ranging from 100M to 8B parameters. Notably, Llama 3.1 8B successfully reconstructs all tested sequences. Our findings highlight an interesting capability of LLMs and suggest potential applications in memory-based retrieval, compression, and controlled text generation.
[7] Identifying social isolation themes in NVDRS text narratives using topic modeling and text-classification methods
Drew Walker,Swati Rajwal,Sudeshna Das,Snigdha Peddireddy,Abeed Sarker
Main category: cs.CL
TL;DR: The paper uses NLP techniques to identify social isolation and loneliness in narratives related to suicide cases, developing high-quality classifiers and finding significant predictors for social isolation.
Details
Motivation: Social isolation and loneliness contribute significantly to suicide rates and are not currently recorded within the US NVDRS structured variables, so there is a need to identify these constructs in narratives. Method: Used topic modeling for lexicon development and supervised learning classifiers to identify social isolation and loneliness in law enforcement and coroner/medical examiner narratives. Result: Developed high-quality classifiers (average F1: .86, accuracy: .82) and identified 1,198 suicides mentioning chronic social isolation from over 300,000 cases. Found significant predictors such as being male, gay, or divorced. Conclusion: The methods can improve surveillance and prevention of social isolation and loneliness in the United States. Abstract: Social isolation and loneliness, which have been increasing in recent years, strongly contribute toward suicide rates. Although social isolation and loneliness are not currently recorded within the US National Violent Death Reporting System's (NVDRS) structured variables, natural language processing (NLP) techniques can be used to identify these constructs in law enforcement and coroner/medical examiner narratives. Using topic modeling for lexicon development and supervised learning classifiers, we developed high-quality classifiers (average F1: .86, accuracy: .82). Evaluating over 300,000 suicides from 2002 to 2020, we identified 1,198 mentioning chronic social isolation. Decedents had higher odds of chronic social isolation classification if they were men (OR = 1.44; CI: 1.24, 1.69; p < .0001), gay (OR = 3.68; CI: 1.97, 6.33; p < .0001), or divorced (OR = 3.34; CI: 2.68, 4.19; p < .0001). We found significant predictors for other social isolation topics of recent or impending divorce, child custody loss, eviction or recent move, and break-up.
Our methods can improve surveillance and prevention of social isolation and loneliness in the United States.
[8] Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation
Zongxia Li,Yapei Chang,Yuhang Zhou,Xiyang Wu,Zichao Liang,Yoo Yeon Sung,Jordan Lee Boyd-Graber
Main category: cs.CL
TL;DR: PrefBERT is a scoring model designed for evaluating open-ended long-form generation and guiding training with rewards. It surpasses traditional metrics in providing semantic feedback, aligns well with human preferences, and proves reliable through various evaluations.
Details
Motivation: Evaluating open-ended long-form generation is challenging due to the difficulty in distinguishing good from bad outputs and the limitations of existing methods which often miss key aspects or are biased. Method: PrefBERT is trained on two response evaluation datasets with diverse styles and Likert-rated quality. It evaluates open-ended long-form generation and guides GRPO training with distinct rewards for different quality outputs. Result: PrefBERT provides better semantic reward feedback than ROUGE-L and BERTScore, remains reliable across varied long passages, and aligns well with verifiable rewards GRPO needs. Human evaluations confirm its effectiveness in producing responses aligned with human preferences. Conclusion: PrefBERT effectively supports GRPO by offering superior semantic feedback compared to traditional metrics, proving reliable and aligning well with human preferences. Abstract: Evaluating open-ended long-form generation is challenging because it is hard to define what clearly separates good from bad outputs. Existing methods often miss key aspects like coherence, style, or relevance, or are biased by pretraining data, making open-ended long-form evaluation an underexplored problem. To address this gap, we propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO and guiding its training with distinct rewards for good and bad outputs. Trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality, PrefBERT effectively supports GRPO by offering better semantic reward feedback than traditional metrics ROUGE-L and BERTScore do. Through comprehensive evaluations, including LLM-as-a-judge, human ratings, and qualitative analysis, we show that PrefBERT, trained on multi-sentence and paragraph-length responses, remains reliable across varied long passages and aligns well with the verifiable rewards GRPO needs. 
Human evaluations confirm that using PrefBERT as the reward signal to train policy models yields responses better aligned with human preferences than those trained with traditional metrics. Our code is available at https://github.com/zli12321/long_form_rl.
[9] Learning-Time Encoding Shapes Unlearning in LLMs
Ruihan Wu,Konstantin Garov,Kamalika Chaudhuri
Main category: cs.CL
TL;DR: Large language models' ability to 'unlearn' knowledge is crucial for various reasons. This study explores how learning-time choices in knowledge encoding affect unlearning effectiveness, revealing that paraphrased descriptions improve unlearning and removing individual pieces of knowledge is difficult.
Details
Motivation: Unlearning, or removing specific knowledge from large language models, is becoming essential due to privacy regulations and the need to correct outdated or harmful content. Method: The researchers empirically investigate the impact of learning-time choices in knowledge encoding on the effectiveness of unlearning factual knowledge through experiments. Result: Experiments showed that learning with paraphrased descriptions enhances unlearning performance, but unlearning individual pieces of knowledge from a text chunk is challenging. Conclusion: Learning-time knowledge encoding may be crucial for enabling reliable post-hoc unlearning in large language models. Abstract: As large language models (LLMs) are increasingly deployed in the real world, the ability to "unlearn", or remove specific pieces of knowledge post hoc, has become essential for a variety of reasons ranging from privacy regulations to correcting outdated or harmful content. Prior work has proposed unlearning benchmarks and algorithms, and has typically assumed that the training process and the target model are fixed. In this work, we empirically investigate how learning-time choices in knowledge encoding impact the effectiveness of unlearning factual knowledge. Our experiments reveal two key findings: (1) learning with paraphrased descriptions improves unlearning performance and (2) unlearning an individual piece of knowledge from a chunk of text is challenging. Our results suggest that learning-time knowledge encoding may play a central role in enabling reliable post-hoc unlearning.
[10] Improving Dialogue Discourse Parsing through Discourse-aware Utterance Clarification
Yaxin Fan,Peifeng Li,Qiaoming Zhu
Main category: cs.CL
TL;DR: A new Discourse-aware Clarification Module (DCM) and Contribution-aware Preference Optimization (CPO) are proposed to improve dialogue discourse parsing by resolving ambiguities caused by linguistic features. Experiments show significant improvements over SOTA baselines.
Details
Motivation: Dialogue discourse parsing faces challenges due to ambiguities introduced by linguistic features such as omission and idiom. Method: Propose DCM with two reasoning processes (clarification type reasoning and discourse goal reasoning) and CPO for optimizing clarifications and reducing errors. Result: Experiments on STAC and Molweni datasets show effective ambiguity resolution and significant outperformance of SOTA baselines. Conclusion: The approach using DCM and CPO successfully enhances dialogue discourse parsing. Abstract: Dialogue discourse parsing aims to identify and analyze discourse relations between the utterances within dialogues. However, linguistic features in dialogues, such as omission and idiom, frequently introduce ambiguities that obscure the intended discourse relations, posing significant challenges for parsers. To address this issue, we propose a Discourse-aware Clarification Module (DCM) to enhance the performance of the dialogue discourse parser. DCM employs two distinct reasoning processes: clarification type reasoning and discourse goal reasoning. The former analyzes linguistic features, while the latter distinguishes the intended relation from the ambiguous one. Furthermore, we introduce Contribution-aware Preference Optimization (CPO) to mitigate the risk of erroneous clarifications, thereby reducing cascading errors. CPO enables the parser to assess the contributions of the clarifications from DCM and provide feedback to optimize the DCM, enhancing its adaptability and alignment with the parser's requirements. Extensive experiments on the STAC and Molweni datasets demonstrate that our approach effectively resolves ambiguities and significantly outperforms the state-of-the-art (SOTA) baselines.
[11] CKD-EHR: Clinical Knowledge Distillation for Electronic Health Records
Junke Wang,Hongshun Ling,Li Zhang,Longqian Zhang,Fang Wang,Yuan Gao,Zhi Li
Main category: cs.CL
TL;DR: The paper proposes CKD-EHR framework which uses knowledge distillation techniques to enhance disease risk prediction from EHR data, significantly improving accuracy and efficiency.
Details
Motivation: Existing large language models have insufficient representation of medical knowledge and low efficiency in clinical deployment when applied to EHR-based disease prediction. Method: CKD-EHR framework fine-tunes Qwen2.5-7B on medical knowledge-enhanced data as teacher model, generates interpretable soft labels via multi-granularity attention distillation mechanism, and transfers distilled knowledge to a lightweight BERT student model. Result: On MIMIC-III dataset, CKD-EHR increases diagnostic accuracy by 9%, F1-score by 27%, and achieves 22.2 times inference speedup compared to baseline model. Conclusion: CKD-EHR greatly improves resource utilization efficiency and enhances the accuracy and timeliness of diagnosis, providing a practical technical approach for clinical settings. Abstract: Electronic Health Records (EHR)-based disease prediction models have demonstrated significant clinical value in promoting precision medicine and enabling early intervention. However, existing large language models face two major challenges: insufficient representation of medical knowledge and low efficiency in clinical deployment. To address these challenges, this study proposes the CKD-EHR (Clinical Knowledge Distillation for EHR) framework, which achieves efficient and accurate disease risk prediction through knowledge distillation techniques. Specifically, the large language model Qwen2.5-7B is first fine-tuned on medical knowledge-enhanced data to serve as the teacher model. It then generates interpretable soft labels through a multi-granularity attention distillation mechanism. Finally, the distilled knowledge is transferred to a lightweight BERT student model. Experimental results show that on the MIMIC-III dataset, CKD-EHR significantly outperforms the baseline model: diagnostic accuracy is increased by 9%, F1-score is improved by 27%, and a 22.2 times inference speedup is achieved.
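The soft-label transfer from teacher to student can be illustrated with the generic temperature-scaled distillation loss below. This is the standard softened-softmax KL formulation, swapped in for illustration only; it is not the multi-granularity attention distillation mechanism of CKD-EHR, whose details are not given here. The logits and the temperature value are made-up assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The teacher's softened probabilities play the role of soft labels.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


# Hypothetical two-class logits (e.g. disease vs. no disease).
matched = distillation_loss([2.0, 0.5], [2.0, 0.5])
mismatched = distillation_loss([2.0, 0.5], [0.5, 2.0])
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In practice it is usually combined with a cross-entropy term on the hard labels.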
This innovative solution not only greatly improves resource utilization efficiency but also significantly enhances the accuracy and timeliness of diagnosis, providing a practical technical approach for resource optimization in clinical settings. The code and data for this research are available at https://github.com/209506702/CKD_EHR.
[12] Modeling the One-to-Many Property in Open-Domain Dialogue with LLMs
Jing Yang Lee,Kong-Aik Lee,Woon-Seng Gan
Main category: cs.CL
TL;DR: The paper addresses the one-to-many (o2m) property in open-domain dialogue by proposing a two-stage framework involving Multi-Response Generation (MRG) and Preference-based Selection (PS). It introduces o2mDial, a new dialogue corpus, and strategies for in-context learning and instruction-tuning to improve response diversity and quality in smaller LLMs.
Details
Motivation: To enhance response diversity in open-domain dialogue systems by explicitly modeling the one-to-many property that allows multiple appropriate responses for a single dialogue context. Method: Decomposing open-domain dialogue generation into Multi-Response Generation (MRG) and Preference-based Selection (PS). MRG generates a set of diverse high-quality responses, while PS selects a single response based on human preference. The approach leverages o2mDial, a new dialogue corpus designed to capture the o2m property, with new in-context learning and instruction-tuning strategies as well as novel evaluation metrics. Result: Empirical results indicate that the two-stage framework applied to smaller LLMs improves overall response diversity while maintaining contextual coherence, enhancing response quality by up to 90%. This brings smaller models closer to the performance of larger ones. Conclusion: Modeling the one-to-many property through the proposed framework significantly boosts response diversity and quality in smaller LLMs for open-domain dialogue, potentially reducing the need for larger models. Abstract: Open-domain Dialogue (OD) exhibits a one-to-many (o2m) property, whereby multiple appropriate responses exist for a single dialogue context. Despite prior research showing that modeling this property boosts response diversity, most modern LLM-based dialogue agents do not explicitly do so. In this work, we model the o2m property of OD in LLMs by decomposing OD generation into two key tasks: Multi-Response Generation (MRG) and Preference-based Selection (PS), which entail generating a set of n semantically and lexically diverse high-quality responses for a given dialogue context, followed by selecting a single response based on human preference, respectively. To facilitate MRG and PS, we introduce o2mDial, a dialogue corpus explicitly designed to capture the o2m property by featuring multiple plausible responses for each context. 
Leveraging o2mDial, we propose new in-context learning and instruction-tuning strategies, as well as novel evaluation metrics for MRG, alongside a model-based approach for PS. Empirical results demonstrate that applying the proposed two-stage framework to smaller LLMs for OD generation enhances overall response diversity while maintaining contextual coherence, improving response quality by up to 90%, bringing them closer to the performance of larger models.
[13] Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models
Gyeongje Cho,Yeonkyoun So,Chanwoo Park,Sangmin Lee,Sungmok Jung,Jaejin Lee
Main category: cs.CL
TL;DR: This paper presents Thunder-Tok, a Korean tokenizer that reduces token fertility by 10% without sacrificing model performance across various downstream tasks.
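Token fertility, the quantity in the TL;DR above, is the average number of tokens a tokenizer produces per word. The sketch below computes it for two toy tokenizers; both tokenizers and the romanized placeholder words are hypothetical and have nothing to do with Thunder-Tok's actual vocabulary or algorithm.

```python
def fertility(words, tokenize):
    """Average number of tokens per word for a given tokenizer function."""
    if not words:
        return 0.0
    total_tokens = sum(len(tokenize(word)) for word in words)
    return total_tokens / len(words)


def char_tokenizer(word):
    """Toy tokenizer: one token per character (very high fertility)."""
    return list(word)


def pair_tokenizer(word):
    """Toy tokenizer: two-character chunks (longer tokens, lower fertility)."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]


# Romanized placeholder words, used only to keep the example ASCII-safe.
words = ["hakgyoeseo", "gongbuhaetda"]
char_fertility = fertility(words, char_tokenizer)  # (10 + 12) / 2 = 11.0
pair_fertility = fertility(words, pair_tokenizer)  # (5 + 6) / 2 = 5.5
```

Longer average tokens mean fewer tokens per word, so the same text costs fewer decoding steps.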
Details
Motivation: To design an efficient tokenizer for the Korean language which can reduce token fertility without compromising model performance. Method: Using a rule-based pre-tokenization method aligned with Korean linguistic structure, creating a seed vocabulary of linguistically relevant tokens and employing a branching entropy-based selection algorithm to increase average token length thus reducing fertility. Result: Thunder-Tok reduces token fertility by approximately 10%, equating to a 10% reduction in the number of tokens and a 10% improvement in inference speed compared to BPE, all while maintaining performance across various downstream tasks. Conclusion: The linguistically informed approach used in designing Thunder-Tok is effective and practical for creating efficient tokenizers for language models. Abstract: This paper introduces Thunder-Tok, a new Korean tokenizer designed to reduce token fertility without compromising model performance. Our approach uses a rule-based pre-tokenization method that aligns with the linguistic structure of the Korean language. We also create a seed vocabulary containing tokens that resemble linguistic units and employ a branching entropy-based selection algorithm. These techniques increase the average token length, thus lowering fertility while preserving linguistic information. Experimental results indicate that Thunder-Tok reduces fertility by approximately 10% (i.e., reduces the number of tokens by 10%, improving the inference speed by 10%) compared to BPE without compromising performance across various downstream tasks. These findings demonstrate that our linguistically informed approach is effective and practical for designing efficient tokenizers for language models.
[14] Emergence of Primacy and Recency Effect in Mamba: A Mechanistic Point of View
Muhammad Cendekia Airlangga,Hilal AlQuabeh,Munachiso S Nwadike,Kentaro Inui
Main category: cs.CL
TL;DR: The paper explores memory in state-space language models using primacy and recency effects, analyzing the Mamba architecture through structured recall tasks.
Details
Motivation: To uncover how information is retained and forgotten over time in state-space language models by studying memory through behavioral tools like primacy and recency effects. Method: Applying structured recall tasks to the Mamba architecture, identifying mechanisms of long-term and short-term memory as well as memory allocation modulation by semantic regularity. Result: Observation of a U-shaped accuracy profile indicating strong performance at the beginning and end of input sequences, identification of three memory mechanisms including sparse channel encoding, delta-modulated recurrence, and dynamic memory allocation based on semantic regularity. Conclusion: Memory in state-space language models is governed by specific mechanisms that support both long-term and short-term retention, with limits defined by distractor items and semantic regularity. Abstract: We study memory in state-space language models using primacy and recency effects as behavioral tools to uncover how information is retained and forgotten over time. Applying structured recall tasks to the Mamba architecture, we observe a consistent U-shaped accuracy profile, indicating strong performance at the beginning and end of input sequences. We identify three mechanisms that give rise to this pattern. First, long-term memory is supported by a sparse subset of channels within the model's selective state space block, which persistently encode early input tokens and are causally linked to primacy effects. Second, short-term memory is governed by delta-modulated recurrence: recent inputs receive more weight due to exponential decay, but this recency advantage collapses when distractor items are introduced, revealing a clear limit to memory depth. Third, we find that memory allocation is dynamically modulated by semantic regularity: repeated relations in the input sequence shift the delta gating behavior, increasing the tendency to forget intermediate items. 
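The recency mechanism described above can be pictured with a scalar linear recurrence. This is a deliberately simplified sketch, not Mamba's selective state space block: the fixed step size `delta` and the names `input_weights` and `final_state` are assumptions made for illustration. With h_t = exp(-delta) * h_(t-1) + x_t, the input at position i reaches the final state with weight exp(-delta * (T - 1 - i)), so later inputs dominate.

```python
import math
from typing import List


def input_weights(seq_len: int, delta: float) -> List[float]:
    """Weight of each input position in the final state of the recurrence."""
    decay = math.exp(-delta)
    return [decay ** (seq_len - 1 - i) for i in range(seq_len)]


def final_state(inputs: List[float], delta: float) -> float:
    """Run h_t = exp(-delta) * h_(t-1) + x_t and return the last state."""
    decay = math.exp(-delta)
    h = 0.0
    for x in inputs:
        h = decay * h + x
    return h


weights = input_weights(5, delta=0.5)
print([round(w, 3) for w in weights])  # weights grow toward the end of the sequence
```

In this toy model, appending k distractor items multiplies the weight of every earlier item by exp(-delta * k), which is one way to read the collapse of the recency advantage. The sketch covers recency only; it says nothing about the sparse channels associated with primacy.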
We validate these findings via targeted ablations and input perturbations on two large-scale Mamba-based language models: one with 1.4B and another with 7B parameters.
[15] A Comparative Study of Task Adaptation Techniques of Large Language Models for Identifying Sustainable Development Goals
Andrea Cadeddu,Alessandro Chessa,Vincenzo De Leo,Gianni Fenu,Enrico Motta,Francesco Osborne,Diego Reforgiato Recupero,Angelo Salatino,Luca Secchi
Main category: cs.CL
TL;DR: This paper explores the use of proprietary and open-source large language models (LLMs) for single-label, multi-class text classification related to the UN's Sustainable Development Goals (SDGs). It evaluates Zero-Shot, Few-Shot Learning, and Fine-Tuning methods, revealing that smaller models optimized via prompt engineering can match the performance of larger models like GPT.
Details
Motivation: The motivation behind this study is the difficulty in tracking progress towards the UN's 17 Sustainable Development Goals due to the extensive scale and complexity of the data involved. Text classification models are seen as vital tools to automate the analysis of vast amounts of text from various sources, making the exploration of LLMs' effectiveness in this area important. Method: The method involves analyzing different proprietary and open-source LLMs for a single-label, multi-class text classification task focused on the SDGs. The study also assesses the effectiveness of task adaptation techniques such as Zero-Shot and Few-Shot Learning, and Fine-Tuning within this domain. Result: The results indicate that smaller models, when optimized through prompt engineering, can perform comparably to larger models like OpenAI's GPT. Conclusion: In conclusion, the study highlights the potential of LLMs and the importance of prompt engineering in optimizing smaller models for effective text classification tasks related to the SDGs. Abstract: In 2012, the United Nations introduced 17 Sustainable Development Goals (SDGs) aimed at creating a more sustainable and improved future by 2030. However, tracking progress toward these goals is difficult because of the extensive scale and complexity of the data involved. Text classification models have become vital tools in this area, automating the analysis of vast amounts of text from a variety of sources. Additionally, large language models (LLMs) have recently proven indispensable for many natural language processing tasks, including text classification, thanks to their ability to recognize complex linguistic patterns and semantics. This study analyzes various proprietary and open-source LLMs for a single-label, multi-class text classification task focused on the SDGs. 
Then, it also evaluates the effectiveness of task adaptation techniques (i.e., in-context learning approaches), namely Zero-Shot and Few-Shot Learning, as well as Fine-Tuning within this domain. The results reveal that smaller models, when optimized through prompt engineering, can perform on par with larger models like OpenAI's GPT (Generative Pre-trained Transformer).
[16] ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs
Feng He,Zijun Chen,Xinnian Liang,Tingting Ma,Yunqi Qiu,Shuangzhi Wu,Junchi Yan
Main category: cs.CL
TL;DR: ProtoReasoning, a framework that enhances LLMs' reasoning ability via scalable and verifiable prototypical representations (Prolog for logic, PDDL for planning), demonstrates significant improvements across various reasoning tasks.
Details
Motivation: The motivation is to understand and enhance the cross-domain generalization capabilities of Large Reasoning Models by leveraging shared abstract reasoning prototypes, which are fundamental reasoning patterns capturing problem essence across domains. Method: ProtoReasoning proposes an automated prototype construction pipeline transforming problems into prototype representations, a verification system using Prolog/PDDL interpreters for feedback, and scalability to synthesize problems within prototype space while ensuring correctness. Result: ProtoReasoning achieves 4.7% improvement on logical reasoning, 6.3% on planning tasks, 4.0% on general reasoning, and 1.0% on mathematics compared to baseline models. Ablation studies confirm enhanced generalization in prototype space. Conclusion: Reasoning prototypes serve as the foundation for generalizable reasoning in large language models, validating the hypothesis that they support cross-domain generalization. Abstract: Recent advances in Large Reasoning Models (LRMs) trained with Long Chain-of-Thought (Long CoT) reasoning have demonstrated remarkable cross-domain generalization capabilities. However, the underlying mechanisms supporting such transfer remain poorly understood. We hypothesize that cross-domain generalization arises from shared abstract reasoning prototypes -- fundamental reasoning patterns that capture the essence of problems across domains. 
These prototypes minimize the nuances of the representation, revealing that seemingly diverse tasks are grounded in shared reasoning structures. Based on this hypothesis, we propose ProtoReasoning, a framework that enhances the reasoning ability of LLMs by leveraging scalable and verifiable prototypical representations (Prolog for logical reasoning, PDDL for planning). ProtoReasoning features: (1) an automated prototype construction pipeline that transforms problems into corresponding prototype representations; (2) a comprehensive verification system providing reliable feedback through Prolog/PDDL interpreters; (3) the scalability to synthesize problems arbitrarily within prototype space while ensuring correctness. Extensive experiments show that ProtoReasoning achieves 4.7% improvement over baseline models on logical reasoning (Enigmata-Eval), 6.3% improvement on planning tasks, 4.0% improvement on general reasoning (MMLU) and 1.0% on mathematics (AIME24). Significantly, our ablation studies confirm that learning in prototype space also demonstrates enhanced generalization to structurally similar problems compared to training solely on natural language representations, validating our hypothesis that reasoning prototypes serve as the foundation for generalizable reasoning in large language models.
[17] MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs
Yongqi Fan,Yating Wang,Guandong Wang,Jie Zhai,Jingping Liu,Qi Ye,Tong Ruan
Main category: cs.CL
TL;DR: The paper proposes MinosEval, a new method for evaluating open-ended QA in LLMs that addresses shortcomings of existing methods by distinguishing factoid and non-factoid questions.
Details
Motivation: Current evaluation methods for open-ended QA in LLMs fail to capture semantic similarities, lack interpretability, and do not distinguish between factoid and non-factoid questions. Method: MinosEval first distinguishes open-ended questions into factoid and non-factoid categories. For factoid questions, it applies an adaptive key-point scoring strategy; for non-factoid questions, it uses an instance-aware listwise ranking strategy. Result: Experiments on multiple datasets demonstrate that MinosEval better aligns with human annotations and provides more interpretable results compared to existing methods. Conclusion: MinosEval is a novel and effective approach for the automatic evaluation of open-ended QA in LLMs, offering improvements in alignment with human judgments and interpretability. Abstract: Open-ended question answering (QA) is a key task for evaluating the capabilities of large language models (LLMs). Compared to closed-ended QA, it demands longer answer statements, more nuanced reasoning processes, and diverse expressions, making refined and interpretable automatic evaluation both crucial and challenging. Traditional metrics like ROUGE and BERTScore struggle to capture semantic similarities due to different patterns between model responses and reference answers. Current LLM-based evaluation approaches, such as pairwise or listwise comparisons of candidate answers, lack intuitive interpretability. While pointwise scoring of each response provides some descriptions, it fails to adapt across different question contents. Most notably, existing methods overlook the distinction between factoid and non-factoid questions. To address these challenges, we propose MinosEval, a novel evaluation method that first distinguishes open-ended questions and then ranks candidate answers using different evaluation strategies. 
For factoid questions, it applies an adaptive key-point scoring strategy, while for non-factoid questions, it uses an instance-aware listwise ranking strategy. Experiments on multiple open-ended QA datasets, including self-built ones with more candidate responses to complement community resources, show that MinosEval better aligns with human annotations and offers more interpretable results.
[18] Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants
Jaione Bengoetxea,Itziar Gonzalez-Dios,Rodrigo Agerri
Main category: cs.CL
TL;DR: This paper evaluates how well current language technologies understand Basque and Spanish language varieties, using NLI as a pivot task and introducing a new parallel dataset. Experiments reveal that LLMs perform worse when dealing with linguistic variation, particularly in Basque.
Details
Motivation: To assess how well current language technologies can handle Basque and Spanish language varieties, especially focusing on their capacity to deal with linguistic variations. Method: Using Natural Language Inference (NLI) as a key task, the authors created a new manually-curated parallel dataset in Basque and Spanish. They conducted empirical analysis via crosslingual and in-context learning experiments using both encoder-only and decoder-based Large Language Models (LLMs). Result: The results indicate a decline in model performance when processing linguistic variation, especially within the Basque language. Error analysis shows this decline is not due to lexical overlap but rather the inherent linguistic variation. Encoder-only models have particular difficulty with Western Basque. Conclusion: Linguistic variation poses a significant challenge for current language technologies, especially for less resourced languages like Basque. The findings align with linguistic theory regarding peripheral dialects being more distant from the standard. Abstract: In this paper, we evaluate the capacity of current language technologies to understand Basque and Spanish language varieties. We use Natural Language Inference (NLI) as a pivot task and introduce a novel, manually-curated parallel dataset in Basque and Spanish, along with their respective variants. Our empirical analysis of crosslingual and in-context learning experiments using encoder-only and decoder-based Large Language Models (LLMs) shows a performance drop when handling linguistic variation, especially in Basque. Error analysis suggests that this decline is not due to lexical overlap, but rather to the linguistic variation itself. Further ablation experiments indicate that encoder-only models particularly struggle with Western Basque, which aligns with linguistic theory that identifies peripheral dialects (e.g., Western) as more distant from the standard. 
All data and code are publicly available.
[19] Research on Graph-Retrieval Augmented Generation Based on Historical Text Knowledge Graphs
Yang Fan,Zhang Qi,Xing Wenqian,Liu Chang,Liu Liu
Main category: cs.CL
TL;DR: This paper proposes the Graph RAG framework which combines chain-of-thought prompting, self-instruction generation, and process supervision for creating a historical character relationship dataset with minimal manual annotation. It introduces a collaborative mechanism between knowledge graphs and retrieval-augmented generation to align general models with historical knowledge. Experiments show that Xunzi-Qwen1.5-14B performs optimally in relation extraction while DeepSeek integrated with GraphRAG improves F1 significantly on the C-CLUE dataset, reducing hallucinations and improving interpretability.
Details
Motivation: To address domain knowledge gaps in large language models for historical text analysis, aiming to reduce labor costs and improve automated historical knowledge extraction. Method: The Graph RAG framework was developed, incorporating chain-of-thought prompting, self-instruction generation, and process supervision to create a historical character relationship dataset with minimal manual annotation. A collaborative mechanism between knowledge graphs and retrieval-augmented generation was also introduced. Result: Xunzi-Qwen1.5-14B achieved optimal performance in relation extraction (F1 = 0.68). The DeepSeek model integrated with GraphRAG improved F1 by 11 percentage points (from 0.08 to 0.19) on the C-CLUE dataset, surpassing Xunzi-Qwen1.5-14B's F1 value of 0.12, effectively alleviating hallucinations and improving interpretability. Conclusion: The Graph RAG framework offers a low-resource solution for classical text knowledge extraction, advancing historical knowledge services and humanities research. Abstract: This article addresses domain knowledge gaps in general large language models for historical text analysis in the context of computational humanities and AIGC technology. We propose the Graph RAG framework, combining chain-of-thought prompting, self-instruction generation, and process supervision to create a character relationship dataset for The First Four Histories with minimal manual annotation. This dataset supports automated historical knowledge extraction, reducing labor costs. In the graph-augmented generation phase, we introduce a collaborative mechanism between knowledge graphs and retrieval-augmented generation, improving the alignment of general models with historical knowledge. Experiments show that the domain-specific model Xunzi-Qwen1.5-14B, with Simplified Chinese input and chain-of-thought prompting, achieves optimal performance in relation extraction (F1 = 0.68). 
The DeepSeek model integrated with GraphRAG improves F1 by 11 percentage points (from 0.08 to 0.19) on the open-domain C-CLUE relation extraction dataset, surpassing the F1 value of Xunzi-Qwen1.5-14B (0.12), effectively alleviating hallucinations and improving interpretability. This framework offers a low-resource solution for classical text knowledge extraction, advancing historical knowledge services and humanities research.
[20] TopClustRAG at SIGIR 2025 LiveRAG Challenge
Juli Bakagianni,John Pavlopoulos,Aristidis Likas
Main category: cs.CL
TL;DR: The paper introduces TopClustRAG, a RAG system using hybrid retrieval and clustering for question answering over large web corpora. It achieved 2nd in faithfulness and 7th in correctness on the leaderboard.
Details
Motivation: To improve answer diversity, relevance, and faithfulness in end-to-end question answering systems over large-scale web corpora by employing clustering-based context filtering and prompt aggregation. Method: TopClustRAG uses a hybrid retrieval strategy combining sparse and dense indices, K-Means clustering to group similar passages, cluster-specific prompts for LLMs, and a multi-stage pipeline for filtering, reranking, and synthesizing answers. Result: On the FineWeb Sample-10BT dataset, TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the official leaderboard. Conclusion: Clustering-based context filtering and prompt aggregation are effective approaches in large-scale RAG systems for enhancing answer quality. Abstract: We present TopClustRAG, a retrieval-augmented generation (RAG) system developed for the LiveRAG Challenge, which evaluates end-to-end question answering over large-scale web corpora. Our system employs a hybrid retrieval strategy combining sparse and dense indices, followed by K-Means clustering to group semantically similar passages. Representative passages from each cluster are used to construct cluster-specific prompts for a large language model (LLM), generating intermediate answers that are filtered, reranked, and finally synthesized into a single, comprehensive response. This multi-stage pipeline enhances answer diversity, relevance, and faithfulness to retrieved evidence. Evaluated on the FineWeb Sample-10BT dataset, TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the official leaderboard, demonstrating the effectiveness of clustering-based context filtering and prompt aggregation in large-scale RAG systems.
[21] Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments
Sungen Hahm,Heejin Kim,Gyuseong Lee,Hyunji Park,Jaejin Lee
Main category: cs.CL
TL;DR: To address the challenges in de-identifying court judgments in South Korea, this paper proposes Thunder-DeID, a framework that includes constructing a Korean legal dataset, categorizing PII, and developing a DNN-based de-identification pipeline.
Details
Motivation: The current de-identification process for court judgments in South Korea is inadequate for handling large volumes of data while adhering to strict legal requirements. Legal definitions and categorizations of personal identifiers are vague, making it difficult to implement technical solutions effectively. Method: The authors propose Thunder-DeID, which includes: constructing and releasing the first Korean legal dataset with annotated judgments and entity mentions; introducing a systematic categorization of Personally Identifiable Information (PII); and developing an end-to-end deep neural network (DNN)-based de-identification pipeline. Result: Experimental results show that the proposed model achieves state-of-the-art performance in de-identifying court judgments. Conclusion: Thunder-DeID provides an effective solution for de-identifying court judgments in South Korea, aligning with relevant laws and practices. Abstract: To ensure a balance between open access to justice and personal data protection, the South Korean judiciary mandates the de-identification of court judgments before they can be publicly disclosed. However, the current de-identification process is inadequate for handling court judgments at scale while adhering to strict legal requirements. Additionally, the legal definitions and categorizations of personal identifiers are vague and not well-suited for technical solutions. To tackle these challenges, we propose a de-identification framework called Thunder-DeID, which aligns with relevant laws and practices. Specifically, we (i) construct and release the first Korean legal dataset containing annotated judgments along with corresponding lists of entity mentions, (ii) introduce a systematic categorization of Personally Identifiable Information (PII), and (iii) develop an end-to-end deep neural network (DNN)-based de-identification pipeline. 
Our experimental results demonstrate that our model achieves state-of-the-art performance in the de-identification of court judgments.
[22] Cohort Discovery: A Survey on LLM-Assisted Clinical Trial Recruitment
Shrestha Ghosh,Moritz Schneider,Carina Reinicke,Carsten Eickhoff
Main category: cs.CL
TL;DR: Recent advances in LLMs have greatly improved general-domain NLP tasks, but their use in critical domains like clinical trial recruitment is still limited. This survey analyzes the task of trial-patient matching and contextualizes emerging LLM-based approaches in this field.
Details
Motivation: To explore the potential of LLMs in the critical domain of clinical trial recruitment by analyzing the task of trial-patient matching. Method: Surveying existing benchmarks, approaches and evaluation frameworks for LLM-based methods in clinical trial recruitment, examining challenges and suggesting future directions. Result: Provides a comprehensive overview of LLM applications in clinical trial recruitment, identifying challenges and proposing future research directions. Conclusion: LLMs hold promise for improving trial-patient matching in clinical trial recruitment, but challenges remain in adopting these technologies. Abstract: Recent advances in LLMs have greatly improved general-domain NLP tasks. Yet, their adoption in critical domains, such as clinical trial recruitment, remains limited. As trials are designed in natural language and patient data is represented as both structured and unstructured text, the task of matching trials and patients benefits from knowledge aggregation and reasoning abilities of LLMs. Classical approaches are trial-specific and LLMs with their ability to consolidate distributed knowledge hold the potential to build a more general solution. Yet recent applications of LLM-assisted methods rely on proprietary models and weak evaluation benchmarks. In this survey, we are the first to analyze the task of trial-patient matching and contextualize emerging LLM-based approaches in clinical trial recruitment. We critically examine existing benchmarks, approaches and evaluation frameworks, the challenges to adopting LLM technologies in clinical research and exciting future directions.
[23] ConLID: Supervised Contrastive Learning for Low-Resource Language Identification
Negar Foroutan,Jakhongir Saydaliev,Ye Eun Kim,Antoine Bosselut
Main category: cs.CL
TL;DR: The paper proposes a novel supervised contrastive learning (SCL) approach to enhance language identification (LID) for low-resource languages by learning domain-invariant representations, resulting in a 3.2% performance improvement on out-of-domain data.
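For orientation, the generic supervised contrastive (SupCon) objective pulls embeddings with the same label together and pushes the rest apart. The sketch below implements that generic loss in plain Python; it is not the paper's exact training objective, and `supcon_loss`, the toy embeddings, and the temperature value are assumptions made for illustration, with labels standing in for language IDs.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(embeddings: Sequence[Sequence[float]],
                labels: Sequence[int],
                temperature: float = 0.5) -> float:
    """Generic supervised contrastive loss, averaged over anchors with a positive."""
    z = [_normalize(e) for e in embeddings]
    n = len(z)
    total, n_anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without a same-label partner contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        logits = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n) if a != i
        }
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0


# Two tight clusters: the loss is low when labels match the clusters
# and high when the labels cut across them.
emb = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(supcon_loss(emb, [0, 0, 1, 1]), supcon_loss(emb, [0, 1, 0, 1]))
```

Minimizing this loss moves same-label samples closer in embedding space whatever their source. That is the general intuition behind using it to learn representations that depend less on domain.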
Details
Motivation: Current LID models perform poorly for low-resource languages, especially when data is limited to single domains like the Bible, due to class imbalance and bias issues. Method: A novel supervised contrastive learning (SCL) approach is proposed to learn domain-invariant representations for low-resource languages. Result: The SCL approach improves LID performance on out-of-domain data for low-resource languages by 3.2%. Conclusion: The proposed SCL approach effectively enhances LID models for low-resource languages. Abstract: Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages -- often limited to single-domain data, such as the Bible -- continue to perform poorly. To resolve these class imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. Through an extensive analysis, we show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2%, demonstrating its effectiveness in enhancing LID models.
[24] DeVisE: Behavioral Testing of Medical Large Language Models
Camila Zurdo Tagliabue,Heloisa Oss Boll,Aykut Erdem,Erkut Erdem,Iacer Calixto
Main category: cs.CL
TL;DR: Large language models (LLMs) are used in clinical decision support, but current evaluation methods have shortcomings. This paper introduces DeVisE, a framework for testing fine-grained clinical understanding using real-world and synthetic ICU discharge notes. The study evaluates five LLMs and finds that zero-shot models show more coherent counterfactual reasoning, while fine-tuned models are more stable but less responsive to clinically meaningful changes. Demographic factors subtly but consistently influence outputs, underlining the need for fairness-aware evaluation.
Details
Motivation: To address the limitation of current evaluation methods in distinguishing genuine medical reasoning from superficial patterns in clinical decision support using large language models. Method: Construct a dataset of ICU discharge notes from MIMIC-IV with both real-world and synthetic versions, introducing controlled single-variable counterfactuals targeting demographic and vital sign attributes. Evaluate five LLMs under zero-shot and fine-tuned settings, assessing model behavior via input-level sensitivity and downstream reasoning. Result: Zero-shot models exhibit more coherent counterfactual reasoning patterns, while fine-tuned models are more stable yet less responsive to clinically meaningful changes. Demographic factors subtly but consistently influence outputs. Conclusion: Behavioral testing is useful in exposing the reasoning strategies of clinical LLMs and can inform the design of safer, more transparent medical AI systems. Abstract: Large language models (LLMs) are increasingly used in clinical decision support, yet current evaluation methods often fail to distinguish genuine medical reasoning from superficial patterns. We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework for probing fine-grained clinical understanding. We construct a dataset of ICU discharge notes from MIMIC-IV, generating both raw (real-world) and template-based (synthetic) versions with controlled single-variable counterfactuals targeting demographic (age, gender, ethnicity) and vital sign attributes. We evaluate five LLMs spanning general-purpose and medically fine-tuned variants, under both zero-shot and fine-tuned settings. We assess model behavior via (1) input-level sensitivity - how counterfactuals alter the likelihood of a note; and (2) downstream reasoning - how they affect predicted hospital length-of-stay. 
Our results show that zero-shot models exhibit more coherent counterfactual reasoning patterns, while fine-tuned models tend to be more stable yet less responsive to clinically meaningful changes. Notably, demographic factors subtly but consistently influence outputs, emphasizing the importance of fairness-aware evaluation. This work highlights the utility of behavioral testing in exposing the reasoning strategies of clinical LLMs and informing the design of safer, more transparent medical AI systems.
[25] SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models' Knowledge of Indian Culture
Arijit Maji,Raghvendra Kumar,Akash Ghosh,Anushka,Sriparna Saha
Main category: cs.CL
TL;DR: SANSKRITI is a new benchmark for evaluating language models' understanding of Indian cultural diversity, revealing disparities in their performance and setting a new standard.
Details
Motivation: To improve the global effectiveness of language models by enhancing their comprehension of local socio-cultural contexts, particularly India's rich cultural diversity. Method: Developed SANSKRITI, a large dataset with 21,853 question-answer pairs covering sixteen key attributes of Indian culture across 28 states and 8 union territories, used to evaluate various types of language models. Result: Significant disparities were found in the ability of different language models to handle culturally nuanced queries, especially in region-specific contexts. Conclusion: SANSKRITI sets a new standard for assessing and improving the cultural understanding capabilities of language models. Abstract: Language Models (LMs) are indispensable tools shaping modern workflows, but their global effectiveness depends on understanding local socio-cultural contexts. To address this, we introduce SANSKRITI, a benchmark designed to evaluate language models' comprehension of India's rich cultural diversity. Comprising 21,853 meticulously curated question-answer pairs spanning 28 states and 8 union territories, SANSKRITI is the largest dataset for testing Indian cultural knowledge. It covers sixteen key attributes of Indian culture: rituals and ceremonies, history, tourism, cuisine, dance and music, costume, language, art, festivals, religion, medicine, transport, sports, nightlife, and personalities, providing a comprehensive representation of India's cultural tapestry. We evaluate SANSKRITI on leading Large Language Models (LLMs), Indic Language Models (ILMs), and Small Language Models (SLMs), revealing significant disparities in their ability to handle culturally nuanced queries, with many models struggling in region-specific contexts. 
By offering an extensive, culturally rich, and diverse dataset, SANSKRITI sets a new standard for assessing and improving the cultural understanding of LMs.
[26] COSMMIC: Comment-Sensitive Multimodal Multilingual Indian Corpus for Summarization and Headline Generation
Raghvendra Kumar,S. A. Mohammed Salman,Aryan Sahu,Tridib Nandi,Pragathi Y. P.,Sriparna Saha,Jose G. Moreno
Main category: cs.CL
TL;DR: COSMMIC is a new dataset for comment-aware multimodal and multilingual summarization in nine Indian languages, featuring 4,959 article-image pairs and 24,484 reader comments. It explores four configurations to enhance summaries with text, comments, and images while using advanced models like LLama3 and GPT-4. This bridges gaps in Indian language resources.
Details
Motivation: To address the lack of research and resources in comment-aware multimodal and multilingual summarization for Indian languages. Method: Introduced COSMMIC dataset with 4,959 article-image pairs and 24,484 reader comments across nine Indian languages. Explored four configurations for summarization: (1) article text alone, (2) incorporating user comments, (3) utilizing images, and (4) combining all three. Used state-of-the-art models such as LLama3 and GPT-4 along with IndicBERT for comment classification and CLIP-based classifiers for image analysis. Result: Evaluated different component combinations and identified supportive comments, filtered noise, and extracted valuable insights from images. Determined effective configurations for natural language generation tasks. Conclusion: COSMMIC uniquely integrates text, images, and user feedback, bridging gaps in Indian language resources and advancing NLP research for inclusivity. Abstract: Despite progress in comment-aware multimodal and multilingual summarization for English and Chinese, research in Indian languages remains limited. This study addresses this gap by introducing COSMMIC, a pioneering comment-sensitive multimodal, multilingual dataset featuring nine major Indian languages. COSMMIC comprises 4,959 article-image pairs and 24,484 reader comments, with ground-truth summaries available in all included languages. Our approach enhances summaries by integrating reader insights and feedback. We explore summarization and headline generation across four configurations: (1) using article text alone, (2) incorporating user comments, (3) utilizing images, and (4) combining text, comments, and images. To assess the dataset's effectiveness, we employ state-of-the-art language models such as LLama3 and GPT-4. 
We conduct a comprehensive study to evaluate different component combinations, including identifying supportive comments, filtering out noise with a dedicated IndicBERT-based comment classifier, and extracting valuable insights from images with a multilingual CLIP-based classifier. This helps determine the most effective configurations for natural language generation (NLG) tasks. Unlike many existing datasets that are either text-only or lack user comments in multimodal settings, COSMMIC uniquely integrates text, images, and user feedback. This holistic approach bridges gaps in Indian language resources, advancing NLP research and fostering inclusivity.
[27] Targeted Lexical Injection: Unlocking Latent Cross-Lingual Alignment in Lugha-Llama via Early-Layer LoRA Fine-Tuning
Stanley Ngugi
Main category: cs.CL
TL;DR: This paper introduces Targeted Lexical Injection (TLI), a fine-tuning approach that significantly improves lexical alignment for Swahili-English word pairs in a Swahili-centric LLM. TLI uses Low-Rank Adaptation and contrastive learning to leverage the model's strong early-layer cross-lingual knowledge, resulting in improved output-level lexical alignment for both trained and unseen word pairs.
Details
Motivation: Large Language Models often underperform in low-resource languages due to data scarcity and insufficient pre-training representation. Achieving robust cross-lingual lexical alignment remains a key challenge, especially for tasks like translation and information retrieval. Method: The method involves using Targeted Lexical Injection (TLI) which employs Low-Rank Adaptation (LoRA) and a contrastive learning objective. This fine-tunes the model by targeting embeddings from an empirically identified optimal early layer where strong lexical alignment already exists. Result: Experiments show that TLI increases average cosine similarity from 0.3211 to 0.4113 (+28.08%) for trained Swahili-English word pairs and from 0.3143 to 0.4033 (+28.32%) for unseen control word pairs, demonstrating significant improvement and generalization. Conclusion: TLI is an effective and parameter-efficient strategy that enhances the ability of LRL-focused LLMs to preserve and propagate their inherent early-layer cross-lingual knowledge, improving lexical alignment. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their performance in low-resource languages (LRLs), such as Swahili, often lags due to data scarcity and underrepresentation in pre-training. A key challenge is achieving robust cross-lingual lexical alignment, crucial for tasks like translation and cross-lingual information retrieval. This paper introduces Targeted Lexical Injection (TLI), a novel and efficient fine-tuning approach. We first demonstrate that Lugha-Llama-8B-wura, a Swahili-centric LLM, exhibits strong, near-perfect lexical alignment for Swahili-English word pairs in its early internal layers (specifically Layer 2, with ~0.99998 average cosine similarity based on a pilot study), a capability not fully reflected in its final output representations (baseline ~0.32 similarity on our evaluation set). 
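The alignment figures quoted for TLI are average cosine similarities over word pairs. A minimal sketch of that metric follows, assuming the embeddings of each pair are already extracted as row-aligned vectors; the function name and the toy vectors are hypothetical, not taken from the paper.

```python
import numpy as np

def mean_pair_cosine(src, tgt):
    """Average cosine similarity between row-aligned embedding pairs."""
    src = np.asarray(src, dtype=float)
    tgt = np.asarray(tgt, dtype=float)
    dots = np.sum(src * tgt, axis=1)
    norms = np.linalg.norm(src, axis=1) * np.linalg.norm(tgt, axis=1)
    return float(np.mean(dots / norms))

# Toy vectors standing in for Swahili and English word embeddings (hypothetical data).
swahili = np.array([[1.0, 0.0], [0.0, 2.0]])
english = np.array([[2.0, 0.0], [1.0, 0.0]])

# The first pair is parallel (similarity 1), the second is orthogonal (similarity 0).
score = mean_pair_cosine(swahili, english)
print(score)
```

In practice the same function would be applied once to early-layer embeddings and once to output-level embeddings to compare the two.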
TLI leverages this insight by using Low-Rank Adaptation (LoRA) and a contrastive learning objective to fine-tune the model, specifically targeting embeddings from this empirically identified optimal early layer. Our experiments show that TLI significantly improves the output-level lexical alignment for 623 trained Swahili-English word pairs, increasing average cosine similarity from 0.3211 to 0.4113 (+28.08%, p < 1.33 x 10^-240). More importantly, these improvements generalize remarkably well to 63 unseen control word pairs, with similarity increasing from 0.3143 to 0.4033 (+28.32%, p < 7.17 x 10^-27). These findings suggest TLI enhances the model's ability to preserve and propagate its inherent early-layer cross-lingual knowledge, offering a parameter-efficient and effective strategy for improving lexical alignment in LRL-focused LLMs.
[28] Understanding GUI Agent Localization Biases through Logit Sharpness
Xingjian Tao,Yiwei Wang,Yujun Cai,Zhicheng Yang,Jing Tang
Main category: cs.CL
TL;DR: Multimodal large language models (MLLMs) have enabled GUI agents to interact with operating systems, but they often exhibit hallucinations. This paper proposes a fine-grained evaluation framework that categorizes model predictions into four distinct types and introduces the Peak Sharpness Score (PSS) to evaluate model uncertainty. Moreover, it proposes Context-Aware Cropping, a training-free technique to improve model performance.
Details
Motivation: To address the issue of hallucinations in MLLMs, which are systematic localization errors that compromise reliability when interacting with operating systems. Method: Propose a fine-grained evaluation framework categorizing model predictions into four types, introduce Peak Sharpness Score (PSS) for evaluating model uncertainty, and propose Context-Aware Cropping technique to refine input context. Result: Extensive experiments show that the proposed framework and methods provide actionable insights and enhance the interpretability and robustness of GUI agent behavior. Conclusion: The introduced evaluation framework and techniques offer significant improvements in understanding and enhancing the performance of GUI agents using MLLMs. Abstract: Multimodal large language models (MLLMs) have enabled GUI agents to interact with operating systems by grounding language into spatial actions. Despite their promising performance, these models frequently exhibit hallucinations, that is, systematic localization errors that compromise reliability. We propose a fine-grained evaluation framework that categorizes model predictions into four distinct types, revealing nuanced failure modes beyond traditional accuracy metrics. To better quantify model uncertainty, we introduce the Peak Sharpness Score (PSS), a metric that evaluates the alignment between semantic continuity and logits distribution in coordinate prediction. Building on this insight, we further propose Context-Aware Cropping, a training-free technique that improves model performance by adaptively refining input context. Extensive experiments demonstrate that our framework and methods provide actionable insights and enhance the interpretability and robustness of GUI agent behavior.
[29] AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent System Need
Zhouhong Gu,Xiaoxuan Zhu,Yin Cai,Hao Shen,Xingzhou Chen,Qingyi Wang,Jialin Li,Xiaoran Shi,Haoran Guo,Wenxuan Huang,Hongwei Feng,Yanghua Xiao,Zheyu Ye,Yao Hu,Shaosheng Cao
Main category: cs.CL
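TL;DR: AgentGroupChat-V2 addresses the challenges facing multi-agent systems through three core innovations: a divide-and-conquer fully parallel architecture, an adaptive collaboration engine, and agent organization optimization strategies. Experiments show strong performance across multiple domains, with the advantage most pronounced on high-difficulty tasks.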
Details
Motivation: Existing LLM-based multi-agent systems face major challenges in architecture design, cross-domain generalizability, and performance guarantees, particularly as task complexity and the number of agents increase. Method: Proposes AgentGroupChat-V2, a new framework with three core innovations: (1) a divide-and-conquer fully parallel architecture; (2) an adaptive collaboration engine; (3) agent organization optimization strategies. Result: Extensive experiments show that AgentGroupChat-V2 outperforms other methods across multiple domains, especially on high-difficulty tasks, with marked gains on datasets such as GSM8K, AIME, and HumanEval. Conclusion: AgentGroupChat-V2 provides a comprehensive solution for building efficient, general-purpose LLM multi-agent systems, with significant advantages in complex reasoning scenarios. Abstract: Large language model based multi-agent systems have demonstrated significant potential in social simulation and complex task resolution domains. However, current frameworks face critical challenges in system architecture design, cross-domain generalizability, and performance guarantees, particularly as task complexity and the number of agents increase. We introduce AgentGroupChat-V2, a novel framework addressing these challenges through three core innovations: (1) a divide-and-conquer fully parallel architecture that decomposes user queries into hierarchical task forest structures, enabling dependency management and distributed concurrent processing; (2) an adaptive collaboration engine that dynamically selects heterogeneous LLM combinations and interaction modes based on task characteristics; and (3) agent organization optimization strategies combining divide-and-conquer approaches for efficient problem decomposition. Extensive experiments demonstrate AgentGroupChat-V2's superior performance across diverse domains, achieving 91.50% accuracy on GSM8K (exceeding the best baseline by 5.6 percentage points), 30.4% accuracy on competition-level AIME (nearly doubling other methods), and 79.20% pass@1 on HumanEval. Performance advantages become increasingly pronounced with higher task difficulty, particularly on Level 5 MATH problems where improvements exceed 11 percentage points compared to state-of-the-art baselines. These results confirm that AgentGroupChat-V2 provides a comprehensive solution for building efficient, general-purpose LLM multi-agent systems with significant advantages in complex reasoning scenarios. 
Code is available at https://github.com/MikeGu721/AgentGroupChat-V2.
[30] RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation
Xinnuo Xu,Rachel Lawrence,Kshitij Dubey,Atharva Pandey,Risa Ueno,Fabian Falck,Aditya V. Nori,Rahul Sharma,Amit Sharma,Javier Gonzalez
Main category: cs.CL
TL;DR: Recent Large Language Models (LLMs) show high accuracy in reasoning benchmarks, but it is unclear whether this stems from true reasoning or statistical recall. This paper introduces RE-IMAGINE, a framework that characterizes the hierarchy of reasoning abilities in LLMs and generates problem variations to test these abilities beyond memorization.
Details
Motivation: To determine whether the high accuracy of LLMs in reasoning tasks is due to true reasoning capabilities or merely statistical recall of training data. Method: The paper proposes RE-IMAGINE, a framework based on the ladder of causation that includes three levels: associations, interventions, and counterfactuals. It generates problem variations in an intermediate symbolic representation which cannot be solved by memorization alone and applies across various reasoning domains like math, code, and logic. Result: When tested on four widely-used benchmarks, LLMs showed reduced performance when faced with problem variations generated by RE-IMAGINE, indicating reliance on statistical recall for previous results. Conclusion: The assessments reveal that LLMs depend on statistical recall for certain reasoning tasks, suggesting a need for further research into improving their reasoning skills across the hierarchy. Abstract: Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true reasoning or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. 
These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.
[31] Context-Informed Grounding Supervision
Hyunji Lee,Seunghyun Yoon,Yunjae Won,Hanseok Oh,Geewook Kim,Trung Bui,Franck Dernoncourt,Elias Stengel-Eskin,Mohit Bansal,Minjoon Seo
Main category: cs.CL
TL;DR: CINGS is a post-training supervision method that improves the grounding of LLMs in both text and visual domains without sacrificing general performance.
Details
Motivation: To ensure large language models generate responses grounded in external context, addressing the issue where simply appending context at inference time does not guarantee grounded generation. Method: Propose Context-INformed Grounding Supervision (CINGS), which involves training the model with relevant context prepended to the response, computing loss only over the response tokens, and masking out the context. Result: Models trained with CINGS show stronger grounding in textual and visual domains, outperform other methods across datasets, reduce hallucinations in vision-language tasks, and maintain factual consistency. Conclusion: CINGS enhances model grounding by shifting the model's prior knowledge and behavior, encouraging greater reliance on external context. Abstract: Large language models (LLMs) are often supplemented with external knowledge to provide information not encoded in their parameters or to reduce hallucination. In such cases, we expect the model to generate responses by grounding its response in the provided external context. However, prior work has shown that simply appending context at inference time does not ensure grounded generation. To address this, we propose Context-INformed Grounding Supervision (CINGS), a post-training supervision in which the model is trained with relevant context prepended to the response, while computing the loss only over the response tokens and masking out the context. Our experiments demonstrate that models trained with CINGS exhibit stronger grounding in both textual and visual domains compared to standard instruction-tuned models. In the text domain, CINGS outperforms other training methods across 11 information-seeking datasets and is complementary to inference-time grounding techniques. 
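The CINGS objective described above (context prepended, loss computed only over response tokens) can be illustrated with a masked negative log-likelihood. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name, the toy logits, and the 0/1 mask layout are all hypothetical.

```python
import numpy as np

def response_only_loss(logits, targets, response_mask):
    """Mean negative log-likelihood over positions where response_mask is 1."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    mask = np.asarray(response_mask, dtype=float)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    # Context positions are multiplied by 0 and so contribute nothing.
    return float((token_nll * mask).sum() / mask.sum())

# Toy sequence: 2 context positions followed by 2 response positions, vocabulary of 3.
logits = np.zeros((4, 3))
logits[0] = [5.0, 0.0, 0.0]          # a context position, masked out of the loss
targets = np.array([1, 2, 0, 1])
mask = np.array([0, 0, 1, 1])        # 1 marks response tokens

loss = response_only_loss(logits, targets, mask)
print(loss)
```

Because the mask zeroes the context positions, changing the logits at those positions leaves the loss unchanged, which is the behaviour the abstract describes as masking out the context.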
In the vision-language domain, replacing a vision-language model's LLM backbone with a CINGS-trained model reduces hallucinations across four benchmarks and maintains factual consistency throughout the generated response. This improved grounding comes without degradation in general downstream performance. Finally, we analyze the mechanism underlying the enhanced grounding in CINGS and find that it induces a shift in the model's prior knowledge and behavior, implicitly encouraging greater reliance on the external context.
[32] SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
Md Imbesat Hassan Rizvi,Xiaodan Zhu,Iryna Gurevych
Main category: cs.CL
TL;DR: SPARE is a new framework for efficient, high-quality automated process annotation that improves reasoning performance in LLMs.
Details
Motivation: Process or step-wise supervision is important for advancing complex multi-step reasoning capabilities of LLMs, but efficient, high-quality automated process annotation remains challenging. Method: SPARE enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution, accompanied by explicit reasoning for evaluation. Result: SPARE improves reasoning performance when used for fine-tuning models and training reward models. It achieves competitive performance on mathematical datasets while being 2.6 times more efficient than tree search-based automatic annotation. Conclusion: SPARE offers an effective solution for process supervision across various domains and improves the efficiency of automated process annotation. Abstract: Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution, accompanied by explicit reasoning for evaluation. We show that reference-guided step-level evaluation effectively facilitates process supervision on four datasets spanning three domains: mathematical reasoning, multi-hop compositional question answering, and spatial reasoning. We demonstrate that SPARE, when compared to baselines, improves reasoning performance when used for: (1) fine-tuning models in an offline RL setup for inference-time greedy-decoding, and (2) training reward models for ranking/aggregating multiple LLM-generated outputs. 
Additionally, SPARE achieves competitive performance on challenging mathematical datasets while offering 2.6 times greater efficiency, requiring only 38% of the runtime, compared to tree search-based automatic annotation. The codebase, along with a trained SPARE-PRM model, is publicly released to facilitate further research and reproducibility.
[33] Enhancing Hyperbole and Metaphor Detection with Their Bidirectional Dynamic Interaction and Emotion Knowledge
Li Zheng,Sihang Wang,Hao Fei,Zuquan Peng,Fei Li,Jianming Fu,Chong Teng,Donghong Ji
Main category: cs.CL
TL;DR: The paper presents EmoBi, a framework for detecting hyperbole and metaphor using emotion-guided bidirectional dynamic interaction, showing superior performance on multiple datasets.
Details
Motivation: Existing methods for hyperbole and metaphor detection mainly focus on superficial text features and neglect the associations between these rhetorical devices as well as the influence of implicit emotions. Method: Propose EmoBi, which includes an emotion analysis module to mine emotion connotations, a domain mapping module based on emotion to identify target and source domains, and a bidirectional dynamic interaction module to promote mutual enhancement between hyperbole and metaphor. A verification mechanism is also designed. Result: Experiments indicate that EmoBi surpasses all baseline methods across four datasets, with significant improvements in F1 scores (28.1% for hyperbole detection on TroFi and 23.1% for metaphor detection on HYPO-L compared to the current SoTA). Conclusion: EmoBi effectively detects hyperbole and metaphor through emotion-guided bidirectional dynamic interaction, demonstrating its potential to advance related NLP tasks. Abstract: Text-based hyperbole and metaphor detection are of great significance for natural language processing (NLP) tasks. However, due to their semantic obscurity and expressive diversity, it is rather challenging to identify them. Existing methods mostly focus on superficial text features, ignoring the associations of hyperbole and metaphor as well as the effect of implicit emotion on perceiving these rhetorical devices. To implement these hypotheses, we propose an emotion-guided hyperbole and metaphor detection framework based on bidirectional dynamic interaction (EmoBi). Firstly, the emotion analysis module deeply mines the emotion connotations behind hyperbole and metaphor. Next, the emotion-based domain mapping module identifies the target and source domains to gain a deeper understanding of the implicit meanings of hyperbole and metaphor. Finally, the bidirectional dynamic interaction module enables the mutual promotion between hyperbole and metaphor. 
Meanwhile, a verification mechanism is designed to ensure detection accuracy and reliability. Experiments show that EmoBi outperforms all baseline methods on four datasets. Specifically, compared to the current SoTA, the F1 score increased by 28.1% for hyperbole detection on the TroFi dataset and 23.1% for metaphor detection on the HYPO-L dataset. These results, underpinned by in-depth analyses, underscore the effectiveness and potential of our approach for advancing hyperbole and metaphor detection.
[34] Lessons from Training Grounded LLMs with Verifiable Rewards
Shang Hong Sim,Tej Deep Pala,Vernon Toh,Hai Leong Chieu,Amir Zadeh,Chuan Li,Navonil Majumder,Soujanya Poria
Main category: cs.CL
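TL;DR: Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs). This paper explores how reinforcement learning and internal reasoning can enhance grounding in LLMs, and shows through comprehensive experiments that reasoning-augmented models significantly outperform instruction-only models.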
Details
Motivation: Although retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models still frequently fail in straightforward scenarios, for example by missing explicitly stated answers, citing incorrectly, or refusing to respond. Method: Trains models with GRPO, using verifiable outcome-based rewards to optimize answer correctness, citation sufficiency, and refusal quality. A two-stage training setup first optimizes answer and citation behavior and then refusal behavior, to stabilize the learning signal. The paper also revisits instruction tuning via GPT-4 distillation and combines it with GRPO to improve performance on long-form generative QA tasks. Result: Reasoning-augmented models significantly outperform instruction-only models in handling unanswerable queries and generating well-cited responses. Two-stage training further improves grounding. Combining GRPO with instruction tuning enhances performance on long-form generative QA tasks. Conclusion: The findings highlight the value of reasoning, stage-wise optimization, and outcome-driven reinforcement learning for building more verifiable and reliable LLMs. Abstract: Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs). While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available. In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs. We use the GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations. Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA we show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses. A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal. Additionally, we revisit instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance on long-form, generative QA tasks. Overall, our findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs.
[35] RATTENTION: Towards the Minimal Sliding Window Size in Local-Global Attention Models
Bailin Wang,Chang Lan,Chong Wang,Ruoming Pang
Main category: cs.CL
TL;DR: RATTENTION introduces a specialized linear attention mechanism to local-global attention models, achieving better performance-efficiency tradeoff without compromising training efficiency.
Details
Motivation: To overcome the limitation of local attention which completely ignores tokens outside the defined window. Method: Integrates a specialized linear attention mechanism with local attention, forming RATTENTION, designed to capture information from out-of-window tokens. Result: RATTENTION achieves superior Pareto tradeoff between performance and efficiency; a window size of 512 matches full-attention performance across settings, and enhances long-context performance without reducing training speed. Conclusion: RATTENTION successfully shifts the Pareto frontier for local-global attention models, providing efficiency gains in short-context regimes while maintaining or improving performance. Abstract: Local-global attention models have recently emerged as compelling alternatives to standard Transformers, promising improvements in both training and inference efficiency. However, the crucial choice of window size presents a Pareto tradeoff: larger windows maintain performance akin to full attention but offer minimal efficiency gains in short-context scenarios, while smaller windows can lead to performance degradation. Current models, such as Gemma2 and Mistral, adopt conservative window sizes (e.g., 4096 out of an 8192 pretraining length) to preserve performance. This work investigates strategies to shift this Pareto frontier, enabling local-global models to achieve efficiency gains even in short-context regimes. Our core motivation is to address the intrinsic limitation of local attention -- its complete disregard for tokens outside the defined window. We explore RATTENTION, a variant of local attention integrated with a specialized linear attention mechanism designed to capture information from these out-of-window tokens. Pretraining experiments at the 3B and 12B scales demonstrate that RATTENTION achieves a superior Pareto tradeoff between performance and efficiency. 
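The limitation of local attention that motivates RATTENTION can be made concrete with a causal sliding-window mask. The sketch below only builds the mask and counts the out-of-window tokens that a plain local layer ignores; it does not implement the linear attention component, and the function name and sizes are hypothetical.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask: entry [i, j] is True when query i may attend to key j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    # Causal (j <= i) and restricted to the most recent `window` positions.
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(6, 3)

# Number of keys each query can see under local attention.
visible = mask.sum(axis=1)

# Past tokens that are causal but fall outside the window, per query.
causal = np.tril(np.ones((6, 6), dtype=bool))
out_of_window = (causal & ~mask).sum(axis=1)

print(visible.tolist(), out_of_window.tolist())
```

The `out_of_window` counts grow with position, which is the information the abstract says the added linear attention mechanism is designed to capture.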
As a sweet spot, RATTENTION with a window size of just 512 consistently matches the performance of full-attention models across diverse settings. Furthermore, the recurrent nature inherent in the linear attention component of RATTENTION contributes to enhanced long-context performance, as validated on the RULER benchmark. Crucially, these improvements do not compromise training efficiency; thanks to a specialized kernel implementation and the reduced window size, RATTENTION maintains training speeds comparable to existing state-of-the-art approaches.
[36] Approximating Language Model Training Data from Weights
John X. Morris,Junjie Oscar Yin,Woojeong Kim,Vitaly Shmatikov,Alexander M. Rush
Main category: cs.CL
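TL;DR: Modern language models often have open weights but closed training data. This paper formalizes the problem of data approximation from model weights and proposes several baselines and metrics. A gradient-based approach selects the best-matching data from a large public text corpus and recovers useful data even when none of the true training data is known. On the AG News classification task, the method improves performance from 65% to 80%, approaching the expert benchmark of 88%; for a model trained with SFT on MSMARCO documents, it reduces perplexity from 3.3 to 2.3, close to the 2.0 of an expert LLAMA model.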
Details
Motivation: Modern language models often have public weights while their training data remains closed. This motivates exploring how to approximate the original training data from model weights alone, addressing the lack of data transparency. Method: Proposes a gradient-based approach that uses model weights to select the best-matching data from a large public text corpus. The approach applies to both original and fine-tuned models. Its effectiveness is validated by comparing the results of different data subsets. Result: 1. On the AG News classification task, accuracy is 65% with randomly selected data and rises to 80% with the proposed method, approaching the expert benchmark of 88%. 2. For a model trained with SFT on MSMARCO documents, perplexity drops from 3.3 to 2.3, close to the 2.0 of an expert LLAMA model. Conclusion: The proposed gradient-based approach can locate a small subset of public Web documents that trains a model to close to the original model's performance, and it works well on both classification and supervised fine-tuning tasks. Abstract: Modern language models often have open weights but closed training data. We formalize the problem of data approximation from model weights and propose several baselines and metrics. We develop a gradient-based approach that selects the highest-matching data from a large public text corpus and show its effectiveness at recovering useful data given only weights of the original and finetuned models. Even when none of the true training data is known, our method is able to locate a small subset of public Web documents that can be used to train a model to close to the original model's performance, for models trained for both classification and supervised fine-tuning. On the AG News classification task, our method improves performance from 65% (using randomly selected data) to 80%, approaching the expert benchmark of 88%. When applied to a model trained with SFT on MSMARCO web documents, our method reduces perplexity from 3.3 to 2.3, compared to an expert LLAMA model's perplexity of 2.0.
[37] PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction
Shufan Li,Aditya Grover
Main category: cs.CL
TL;DR: Predictive Generation (PredGen) is a novel framework that can significantly reduce latency in real-time voice chat applications by generating candidate responses while the user is still speaking, thereby enabling the system to start TTS processing with minimal delay. This method reduces latency by around 2x across various use cases with only a small additional computation cost.
Details
Motivation: Large Language Models (LLMs) often suffer from noticeable latency between the end of user input and the start of audio output when combined with text-to-speech (TTS) systems in real-time voice chat applications, especially on consumer-grade hardware with limited computing capacity. Method: The proposed method, Predictive Generation (PredGen), uses speculative decoding at input time to generate candidate responses while the user is still speaking. This allows the system to begin TTS processing immediately after the user finishes speaking, thus reducing the overall latency. Result: Simulated experiments on the Lmsys and MT-Bench datasets demonstrated that PredGen effectively reduced latency by around 2x across a wide range of use cases, with only minimal additional computation cost at input time. Conclusion: PredGen successfully mitigates or eliminates the latency bottleneck in real-time voice chat applications by leveraging speculative decoding, providing a smoother user experience. Abstract: Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses. However, their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences. This latency is particularly evident when LLMs are deployed as single-user voice assistants on consumer-grade hardware with limited computing capacity. We discovered that this latency is primarily dominated by the time it takes for the LLMs to generate the first sentence, which is required as input by the TTS systems that synthesize audio responses on a sentence-by-sentence basis. To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates, or even eliminates, this delay through speculative decoding at input time. 
PredGen generates candidate responses while the user is still speaking, enabling the system to begin TTS processing with minimal delay. Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method can effectively reduce the latency by around 2x across a wide range of use cases, while incurring only minimal additional computation cost at input time, computation that would otherwise go unused.
[38] Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models
Zhengyang Shan,Emily Ruth Diana,Jiawei Zhou
Main category: cs.CL
TL;DR: The paper introduces Gender Inclusivity Fairness Index (GIFI), a new metric to evaluate gender fairness in LLMs, and conducts evaluations on 22 LLMs, revealing significant variations in gender inclusivity.
Details
Motivation: Existing studies mainly focus on binary gender distinctions, lacking comprehensive evaluation for non-binary genders in LLMs. Method: Developed GIFI which includes various evaluations from probing models with gender pronouns to testing model generation under different gender assumptions. Result: Found significant variations in gender inclusivity among the evaluated LLMs. Conclusion: Improving gender inclusivity in LLMs is crucial, and GIFI provides a benchmark for future advancements. Abstract: We present a comprehensive evaluation of gender fairness in large language models (LLMs), focusing on their ability to handle both binary and non-binary genders. While previous studies primarily focus on binary gender distinctions, we introduce the Gender Inclusivity Fairness Index (GIFI), a novel and comprehensive metric that quantifies the diverse gender inclusivity of LLMs. GIFI consists of a wide range of evaluations at different levels, from simply probing the model with respect to provided gender pronouns to testing various aspects of model generation and cognitive behaviors under different gender assumptions, revealing biases associated with varying gender identifiers. We conduct extensive evaluations with GIFI on 22 prominent open-source and proprietary LLMs of varying sizes and capabilities, discovering significant variations in LLMs' gender inclusivity. Our study highlights the importance of improving LLMs' inclusivity, providing a critical benchmark for future advancements in gender fairness in generative models.
[39] SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification
Chengye Wang,Yifei Shen,Zexi Kuang,Arman Cohan,Yilun Zhao
Main category: cs.CL
TL;DR: The paper introduces SciVer, a benchmark for evaluating foundation models' ability to verify claims in multimodal scientific contexts. It contains 3,000 annotated examples across 1,113 papers and assesses 21 state-of-the-art models, revealing significant performance gaps compared to human experts.
Details
Motivation: To evaluate and improve the claim verification capabilities of foundation models within a multimodal scientific context. Method: Developed SciVer, a benchmark with 3,000 expert-annotated examples covering four reasoning types from 1,113 scientific papers. Evaluated 21 advanced multimodal models including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL using this benchmark. Result: Found a considerable performance gap between the evaluated models and human experts on SciVer. Through RAG analysis and human error evaluations, identified key limitations in current open-source models. Conclusion: SciVer highlights critical areas for improvement in models' comprehension and reasoning abilities for multimodal scientific literature tasks. Abstract: We introduce SciVer, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SciVer consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiment reveals a substantial performance gap between these models and human experts on SciVer. Through an in-depth analysis of retrieval-augmented generation (RAG) and human-conducted error evaluations, we identify critical limitations in current open-source models, offering key insights to advance models' comprehension and reasoning in multimodal scientific literature tasks.
[40] DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement
Shaoqing Lin,Chong Teng,Fei Li,Donghong Ji,Lizhen Qu,Zhuang Li
Main category: cs.CL
TL;DR: Vision-Language Models (VLMs) now generate discourse-level, multi-sentence visual descriptions. To address limitations in current text scene graph parsers, the paper introduces Discourse-level text Scene Graph parsing (DiscoSG), a new task supported by the DiscoSG-DS dataset. While fine-tuning large PLMs on DiscoSG-DS shows significant improvements, high costs and licensing issues limit their use. The proposed solution, DiscoSG-Refiner, uses two smaller PLMs to iteratively refine graphs, achieving faster inference and consistent improvements in downstream VLM tasks.
Details
Motivation: Current approaches for merging sentence-level parsing outputs for discourse input often miss phenomena like cross-sentence coreference, leading to fragmented graphs and degraded performance in downstream VLM tasks. Method: Introduced DiscoSG, a new task for Discourse-level text Scene Graph parsing, and its supporting dataset, DiscoSG-DS. Proposed DiscoSG-Refiner which drafts a base graph using one small PLM and employs a second PLM to iteratively propose graph edits, reducing full-graph generation overhead. Result: Fine-tuning large PLMs on DiscoSG-DS improves SPICE by approximately 48% over the best sentence-merging baseline, but with high inference cost and licensing restrictions. DiscoSG-Refiner improves SPICE by approximately 30% over the best baseline while achieving 86 times faster inference than GPT-4, and consistently improves downstream VLM tasks. Conclusion: The paper successfully introduces DiscoSG and DiscoSG-DS to tackle limitations in current text scene graph parsers. DiscoSG-Refiner provides an efficient solution for discourse-level graph parsing, improving both performance and speed. Abstract: Vision-Language Models (VLMs) now generate discourse-level, multi-sentence visual descriptions, challenging text scene graph parsers originally designed for single-sentence caption-to-graph mapping. Current approaches typically merge sentence-level parsing outputs for discourse input, often missing phenomena like cross-sentence coreference, resulting in fragmented graphs and degraded downstream VLM task performance. To address this, we introduce a new task, Discourse-level text Scene Graph parsing (DiscoSG), supported by our dataset DiscoSG-DS, which comprises 400 expert-annotated and 8,430 synthesised multi-sentence caption-graph pairs for images. Each caption averages 9 sentences, and each graph contains at least 3 times more triples than those in existing datasets. 
While fine-tuning large PLMs (i.e., GPT-4) on DiscoSG-DS improves SPICE by approximately 48% over the best sentence-merging baseline, high inference cost and restrictive licensing hinder its open-source use, and smaller fine-tuned PLMs struggle with complex graphs. We propose DiscoSG-Refiner, which drafts a base graph using one small PLM, then employs a second PLM to iteratively propose graph edits, reducing full-graph generation overhead. Using two Flan-T5-Base models, DiscoSG-Refiner still improves SPICE by approximately 30% over the best baseline while achieving 86 times faster inference than GPT-4. It also consistently improves downstream VLM tasks like discourse-level caption evaluation and hallucination detection. Code and data are available at: https://github.com/ShaoqLin/DiscoSG
[41] WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts
Negar Foroutan,Angelika Romanou,Matin Ansaripour,Julian Martin Eisenschlos,Karl Aberer,Rémi Lebret
Main category: cs.CL
TL;DR: The paper presents WikiMixQA, a new benchmark for evaluating cross-modal reasoning in document understanding, revealing challenges in long-context multi-modal reasoning.
Details
Motivation: To address the lack of evaluation for long-context vision inputs in vision-language large models. Method: Introduced WikiMixQA, a benchmark with 1,000 MCQs evaluating cross-modal reasoning over tables and charts from Wikipedia pages. Evaluated 12 state-of-the-art vision-language models on this benchmark. Result: Proprietary models achieved ~70% accuracy with direct context but performed worse with retrieval from long documents. GPT-4-o was the only model exceeding 50% accuracy in this setting, while open-source models had a maximum accuracy of 27%. Conclusion: WikiMixQA highlights the challenges in long-context, multi-modal reasoning and serves as an important benchmark for advancing document understanding research. Abstract: Documents are fundamental to preserving and disseminating information, often incorporating complex layouts, tables, and charts that pose significant challenges for automatic document understanding (DU). While vision-language large models (VLLMs) have demonstrated improvements across various tasks, their effectiveness in processing long-context vision inputs remains unclear. This paper introduces WikiMixQA, a benchmark comprising 1,000 multiple-choice questions (MCQs) designed to evaluate cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages spanning seven distinct topics. Unlike existing benchmarks, WikiMixQA emphasizes complex reasoning by requiring models to synthesize information from multiple modalities. We evaluate 12 state-of-the-art vision-language models, revealing that while proprietary models achieve ~70% accuracy when provided with direct context, their performance deteriorates significantly when retrieval from long documents is required. Among these, GPT-4-o is the only model exceeding 50% accuracy in this setting, whereas open-source models perform considerably worse, with a maximum accuracy of 27%. 
These findings underscore the challenges of long-context, multi-modal reasoning and establish WikiMixQA as a crucial benchmark for advancing document understanding research.
[42] From Model to Classroom: Evaluating Generated MCQs for Portuguese with Narrative and Difficulty Concerns
Bernardo Leite,Henrique Lopes Cardoso,Pedro Pinto,Abel Ferreira,Luís Abreu,Isabel Rangel,Sandra Monteiro
Main category: cs.CL
TL;DR: This paper explores the ability of generative AI models to create multiple-choice questions (MCQs) for Portuguese reading comprehension, targeting elementary school students. The study finds that AI-generated MCQs can match human-created ones in quality but face challenges in semantic clarity, answerability, and effective distractor generation.
Details
Motivation: The motivation is to address the time-consuming and costly nature of manually creating MCQs with varying difficulty levels and targeted reading skills, especially in languages other than English, such as Portuguese. Method: The method involves using generative AI models to produce MCQs aligned with curriculum-relevant narrative elements and different difficulty levels for Portuguese reading comprehension. These MCQs are then evaluated through expert review and psychometric analysis based on student responses. Result: The results indicate that current generative models can produce MCQs of similar quality to those authored by humans. However, issues related to semantic clarity, answerability, and the design of engaging and high-quality distractors persist. Conclusion: Generative AI models have the potential to automate MCQ creation effectively for Portuguese reading comprehension, though improvements are needed in ensuring semantic clarity, answerability, and the quality of distractors. Abstract: While MCQs are valuable for learning and evaluation, manually creating them with varying difficulty levels and targeted reading skills remains a time-consuming and costly task. Recent advances in generative AI provide an opportunity to automate MCQ generation efficiently. However, assessing the actual quality and reliability of generated MCQs has received limited attention -- particularly regarding cases where generation fails. This aspect becomes particularly important when the generated MCQs are meant to be applied in real-world settings. Additionally, most MCQ generation studies focus on English, leaving other languages underexplored. This paper investigates the capabilities of current generative models in producing MCQs for reading comprehension in Portuguese, a morphologically rich language. Our study focuses on generating MCQs that align with curriculum-relevant narrative elements and span different difficulty levels. 
We evaluate these MCQs through expert review and by analyzing the psychometric properties extracted from student responses to assess their suitability for elementary school students. Our results show that current models can generate MCQs of comparable quality to human-authored ones. However, we identify issues related to semantic clarity and answerability. Also, challenges remain in generating distractors that engage students and meet established criteria for high-quality MCQ option design.
[43] The Compositional Architecture of Regret in Large Language Models
Xiangxiang Cui,Shu Yang,Tianjin Huang,Wanyu Lin,Lijie Hu,Di Wang
Main category: cs.CL
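TL;DR: This paper studies the regret mechanism in large language models, proposing a workflow for constructing a regret dataset, the S-CDI metric for identifying the optimal regret representation layer, the RDS metric for identifying regret neurons, and the GIC metric for analyzing activation patterns. Experiments identify the optimal regret representation layer, reveal an M-shaped decoupling pattern, and categorize neurons into three groups: regret neurons, non-regret neurons, and dual neurons.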
Details
Motivation: Studying the regret mechanism in large language models is crucial for improving model reliability and for revealing how cognition is encoded in neural networks. Method: Proposed a workflow for constructing a regret dataset, the S-CDI metric for identifying the optimal regret representation layer, the RDS metric for identifying regret neurons, and the GIC metric for analyzing activation patterns. Result: Identified the optimal regret representation layer, which significantly improved performance in probe classification experiments; found an M-shaped decoupling pattern across model layers; and categorized neurons into three groups: regret neurons, non-regret neurons, and dual neurons. Conclusion: The proposed metrics and methods can effectively identify and analyze the regret mechanism in large language models, providing a foundation for future research. Abstract: Regret in Large Language Models refers to their explicit regret expression when presented with evidence contradicting their previously generated misinformation. Studying the regret mechanism is crucial for enhancing model reliability and helps in revealing how cognition is coded in neural networks. To understand this mechanism, we need to first identify regret expressions in model outputs, then analyze their internal representation. This analysis requires examining the model's hidden states, where information processing occurs at the neuron level. However, this faces three key challenges: (1) the absence of specialized datasets capturing regret expressions, (2) the lack of metrics to find the optimal regret representation layer, and (3) the lack of metrics for identifying and analyzing regret neurons. Addressing these limitations, we propose: (1) a workflow for constructing a comprehensive regret dataset through strategically designed prompting scenarios, (2) the Supervised Compression-Decoupling Index (S-CDI) metric to identify optimal regret representation layers, and (3) the Regret Dominance Score (RDS) metric to identify regret neurons and the Group Impact Coefficient (GIC) to analyze activation patterns. Our experimental results successfully identified the optimal regret representation layer using the S-CDI metric, which significantly enhanced performance in probe classification experiments. Additionally, we discovered an M-shaped decoupling pattern across model layers, revealing how information processing alternates between coupling and decoupling phases. 
Through the RDS metric, we categorized neurons into three distinct functional groups: regret neurons, non-regret neurons, and dual neurons.
[44] Minding the Politeness Gap in Cross-cultural Communication
Yuka Machino,Matthias Hofer,Max Siegel,Joshua B. Tenenbaum,Robert D. Hawkins
Main category: cs.CL
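TL;DR: This paper studies how speakers of British and American English differ in interpreting intensifiers such as 'quite' and 'very', finding that the cross-cultural differences stem from different literal meanings and different weights on utterance cost.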
Details
Motivation: To investigate whether subtle differences in interpretation in cross-cultural communication arise from the literal meanings of words or from more general pragmatic factors such as politeness and brevity. Method: Conducted three experiments and developed a computational cognitive model in which listeners recursively reason about speakers who balance informativity, politeness, and utterance cost. Result: Cross-cultural differences in intensifier interpretation stem from different literal meanings and different weights on utterance cost. Conclusion: Cross-cultural differences in interpretation emerge from an intricate interplay between semantic variation and politeness norms, challenging accounts based purely on one or the other. Abstract: Misunderstandings in cross-cultural communication often arise from subtle differences in interpretation, but it is unclear whether these differences arise from the literal meanings assigned to words or from more general pragmatic factors such as norms around politeness and brevity. In this paper, we report three experiments examining how speakers of British and American English interpret intensifiers like "quite" and "very." To better understand these cross-cultural differences, we developed a computational cognitive model where listeners recursively reason about speakers who balance informativity, politeness, and utterance cost. Our model comparisons suggested that cross-cultural differences in intensifier interpretation stem from a combination of (1) different literal meanings and (2) different weights on utterance cost. These findings challenge accounts based purely on semantic variation or politeness norms, demonstrating that cross-cultural differences in interpretation emerge from an intricate interplay between the two.
[45] Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability
Yusuke Sakai,Hidetaka Kamigaito,Taro Watanabe
Main category: cs.CL
TL;DR: In generative commonsense reasoning tasks, LLMs can include all given concepts but may not follow a specified concept order well. This paper proposes Ordered CommonGen to evaluate LLMs' instruction-following and compositional generalization abilities by measuring ordered coverage. Analysis of 36 LLMs showed biases toward specific concept orders, leading to low-diversity outputs, with even the best model achieving only about 75% ordered coverage.
Details
Motivation: To better evaluate LLMs' ability to follow instructions and generate sentences with concepts in a specified order, improving both instruction-following and compositional generalization capabilities. Method: Proposed Ordered CommonGen benchmark which measures ordered coverage to assess whether concepts are generated in the specified order. Comprehensive analysis conducted using 36 LLMs. Result: LLMs generally understand instruction intent but show biases toward specific concept order patterns, resulting in low-diversity outputs or identical results despite changes in concept order. The best-performing model achieved about 75% ordered coverage. Conclusion: There is a need for improvements in both instruction-following and compositional generalization capabilities of LLMs. Abstract: In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts. However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered. Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities.
[46] Oldies but Goldies: The Potential of Character N-grams for Romanian Texts
Dana Lupsa,Sanda-Maria Avram
Main category: cs.CL
TL;DR: This study evaluates six machine learning techniques for Romanian authorship attribution using the ROST corpus, finding that the ANN model with 5-gram features performs the best.
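To make the feature type concrete, here is a minimal sketch of character n-gram extraction. The function names and the choice of relative frequencies are assumptions for illustration; the study's actual preprocessing and feature weighting are not described in this summary.

```python
from collections import Counter


def char_ngrams(text, n=5):
    """Count overlapping character n-grams (spaces and punctuation included)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Texts shorter than n yield an empty range and therefore an empty Counter.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def relative_frequencies(counts):
    """Normalize raw counts so texts of different lengths are comparable."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


if __name__ == "__main__":
    profile = char_ngrams("abcabc", n=3)
    print(profile.most_common())
    print(relative_frequencies(profile))
```

Feature vectors like these can then be passed to any of the classifiers named in the summary.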
Details
Motivation: To address the problem of authorship attribution for Romanian texts and evaluate the effectiveness of different machine learning techniques using character n-gram features. Method: Systematically evaluate six machine learning techniques (SVM, LR, k-NN, DT, RF, ANN) using character n-gram features on the ROST corpus for authorship attribution. Result: The ANN model achieved the highest performance, including perfect classification in four out of fifteen runs when using 5-gram features. Conclusion: Lightweight, interpretable character n-gram approaches can deliver state-of-the-art accuracy for Romanian authorship attribution, highlighting the potential of simple stylometric features in resource-constrained or under-studied language settings. Abstract: This study addresses the problem of authorship attribution for Romanian texts using the ROST corpus, a standard benchmark in the field. We systematically evaluate six machine learning techniques: Support Vector Machine (SVM), Logistic Regression (LR), k-Nearest Neighbors (k-NN), Decision Trees (DT), Random Forests (RF), and Artificial Neural Networks (ANN), employing character n-gram features for classification. Among these, the ANN model achieved the highest performance, including perfect classification in four out of fifteen runs when using 5-gram features. These results demonstrate that lightweight, interpretable character n-gram approaches can deliver state-of-the-art accuracy for Romanian authorship attribution, rivaling more complex methods. Our findings highlight the potential of simple stylometric features in resource-constrained or under-studied language settings.
[47] CC-LEARN: Cohort-based Consistency Learning
Xiao Ye,Shaswat Shrivastava,Zhaonan Li,Jacob Dineen,Shijie Lu,Avneet Ahuja,Ming Shen,Zhikun Xu,Ben Zhou
Main category: cs.CL
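TL;DR: Large language models excel at many tasks but still struggle with consistent, robust reasoning. This paper proposes Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning. Experiments show that CC-Learn improves accuracy and reasoning stability on challenging reasoning benchmarks.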
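The abstract for this entry describes a composite objective built from cohort accuracy, a retrieval bonus, and a rejection penalty. The sketch below shows one way such a cohort-level reward could be combined. The additive form, the weights, and all names are hypothetical; the paper's actual formulation is not given in this summary.

```python
def cohort_reward(correct, retrieval_hits, rejected_lookups,
                  bonus_weight=0.1, penalty_weight=0.1):
    """Hypothetical cohort-level reward for one cohort of similar questions.

    correct          -- one boolean per cohort member (answer right or wrong)
    retrieval_hits   -- number of lookups judged useful (assumed definition)
    rejected_lookups -- number of trivial or invalid lookups (assumed definition)
    """
    if not correct:
        raise ValueError("cohort must contain at least one question")
    # Accuracy is shared by the whole cohort, so the reward favors
    # reasoning that works across all members, not just one question.
    accuracy = sum(correct) / len(correct)
    return accuracy + bonus_weight * retrieval_hits - penalty_weight * rejected_lookups


if __name__ == "__main__":
    print(cohort_reward([True, True, False, True], retrieval_hits=2, rejected_lookups=1))
```

Scoring the cohort as a unit is the point of the design: a policy that solves one variant by luck but fails its siblings earns a lower reward than one that reasons uniformly.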
Details
Motivation: Although large language models perform well on many tasks, they still fall short in consistent, robust reasoning, so a new method is needed to improve the reliability of LLM reasoning. Method: Proposed Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. Defined a composite objective combining cohort accuracy, a retrieval bonus, and a rejection penalty to enforce cohort-level consistency. Result: Experiments on challenging reasoning benchmarks (such as ARC-Challenge and StrategyQA) show that CC-Learn significantly improves accuracy and reasoning stability over pretrained and supervised fine-tuning baselines. Conclusion: Cohort-based reinforcement learning effectively enhances the reasoning consistency of large language models. Abstract: Large language models excel at many tasks but still struggle with consistent, robust reasoning. We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. To enforce cohort-level consistency, we define a composite objective combining cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups that reinforcement learning can directly optimize, unlike supervised fine-tuning. Optimizing this reward guides the model to adopt uniform reasoning patterns across all cohort members. Experiments on challenging reasoning benchmarks (including ARC-Challenge and StrategyQA) show that CC-Learn boosts both accuracy and reasoning stability over pretrained and SFT baselines. These results demonstrate that cohort-level RL effectively enhances reasoning consistency in LLMs.
[48] Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers
Tommaso Green,Martin Gubri,Haritz Puerto,Sangdoo Yun,Seong Joon Oh
Main category: cs.CL
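TL;DR: The study finds privacy leakage in the reasoning traces of large reasoning models. Although a larger compute budget makes models more cautious in their final outputs, it also makes their reasoning more verbose and leaks more information, so safety efforts must cover the model's internal thinking and not just its outputs.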
Details
Motivation: To examine privacy leakage in the reasoning traces of large reasoning models used as personal agents, particularly through prompt injection or accidental leakage. Method: Through probing and agentic evaluations, showed how test-time compute approaches (especially increased reasoning steps) amplify privacy leakage, and analyzed the effect of a larger compute budget on the model's final answers. Result: Increasing reasoning steps and compute budget makes models more cautious in their final outputs, but also leads to more verbose reasoning and more privacy leakage. Conclusion: Reasoning improves model utility but also enlarges the privacy attack surface; safety efforts need to extend to the model's internal thinking. Abstract: We study privacy leakage in the reasoning traces of large reasoning models used as personal agents. Unlike final outputs, reasoning traces are often assumed to be internal and safe. We challenge this assumption by showing that reasoning traces frequently contain sensitive user data, which can be extracted via prompt injections or accidentally leak into outputs. Through probing and agentic evaluations, we demonstrate that test-time compute approaches, particularly increased reasoning steps, amplify such leakage. While increasing the budget of those test-time compute approaches makes models more cautious in their final answers, it also leads them to reason more verbosely and leak more in their own thinking. This reveals a core tension: reasoning improves utility but enlarges the privacy attack surface. We argue that safety efforts must extend to the model's internal thinking, not just its outputs.
[49] Gender-Neutral Machine Translation Strategies in Practice
Hillary Dawkins,Isar Nejadgholi,Chi-kiu Lo
Main category: cs.CL
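TL;DR: This paper studies the sensitivity of 21 machine translation systems to gender ambiguity, finding an overall lack of gender-neutral translation, although a few systems use specific strategies for gender-neutral translation depending on the target language.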
Details
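Motivation: To examine whether machine translation systems preserve gender neutrality when faced with gender ambiguity, so as to avoid misgendering and representational harms, and to analyze the gender-neutral strategies used in practice and the effect of binary gender stereotypes. Method: Assessed how 21 machine translation systems respond to the need for gender neutrality in three translation directions of varying difficulty, categorized and discussed the specific gender-neutral strategies observed in practice, and examined the effect of binary gender stereotypes on the use of gender-neutral translation. Result: Most machine translation systems fail to provide gender-neutral translations under gender ambiguity, but a few systems use specific strategies to achieve gender-neutral translation depending on the target language. Conclusion: Machine translation systems generally lack sensitivity to gender neutrality when handling gender ambiguity and need further improvement to better meet the need for gender-neutral translation. Abstract: Gender-inclusive machine translation (MT) should preserve gender ambiguity in the source to avoid misgendering and representational harms. While gender ambiguity often occurs naturally in notional gender languages such as English, maintaining that gender neutrality in grammatical gender languages is a challenge. Here we assess the sensitivity of 21 MT systems to the need for gender neutrality in response to gender ambiguity in three translation directions of varying difficulty. The specific gender-neutral strategies that are observed in practice are categorized and discussed. Additionally, we examine the effect of binary gender stereotypes on the use of gender-neutral translation. In general, we report a disappointing absence of gender-neutral translations in response to gender ambiguity. However, we observe a small handful of MT systems that switch to gender-neutral translation using specific strategies, depending on the target language.
[50] GenRecal: Generation after Recalibration from Large to Small Vision-Language Models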
Byung-Kwan Lee,Ryo Hachiuma,Yong Man Ro,Yu-Chiang Frank Wang,Yueh-Hua Wu
Main category: cs.CL
TL;DR: Recent advancements in vision-language models (VLMs) have achieved performance on par with closed-source systems like GPT-4V, but deploying them in real-world scenarios remains challenging due to their substantial computational demands. To address this challenge, the paper presents Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs that significantly improves baseline performances.
Details
Motivation: The motivation of this paper is to address the challenge of deploying large VLMs in real-world scenarios, particularly on resource-constrained devices, by distilling knowledge from large VLMs into smaller, more efficient counterparts. Method: The method introduced in this paper is called Generation after Recalibration (GenRecal). It incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. Result: Through extensive experiments on multiple challenging benchmarks, GenRecal significantly improves baseline performances, eventually outperforming large-scale open- and closed-source VLMs. Conclusion: The conclusion of this paper is that GenRecal, a novel distillation framework for VLMs, effectively addresses the challenges posed by the diversity of VLM architectures and enables efficient knowledge transfer, leading to improved performances. Abstract: Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge arises here from the diversity of VLM architectures, which are built on different LLMs and employ varying token types, differing in vocabulary size, token splits, and token index ordering. To overcome this limitation to a specific VLM type, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. 
Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performances, eventually outperforming large-scale open- and closed-source VLMs.
[51] PhantomHunter: Detecting Unseen Privately-Tuned LLM-Generated Text via Family-Aware Learning
Yuhui Shi,Yehan Yang,Qiang Sheng,Hao Mi,Beizhe Hu,Chaoxi Xu,Juan Cao
Main category: cs.CL
TL;DR: The paper introduces PhantomHunter, a detector for identifying text generated by privately-tuned LLMs. It uses a family-aware learning framework to capture shared traits among LLM families, showing superior performance with F1 scores over 96%.
Details
Motivation: Existing methods for detecting LLM-generated text struggle when faced with text from privately tuned LLMs, leading to significant performance drops. Method: PhantomHunter is proposed as an LLM-generated text detector specialized in identifying text from unseen, privately-tuned LLMs through a family-aware learning framework that captures family-level traits rather than individual characteristics. Result: Experiments conducted on data from LLaMA, Gemma, and Mistral families demonstrate PhantomHunter's superiority over 7 baselines and 3 industrial services, achieving F1 scores exceeding 96%. Conclusion: PhantomHunter effectively addresses the challenge of detecting text from privately-tuned LLMs by leveraging a family-aware learning approach. Abstract: With the popularity of large language models (LLMs), undesirable societal problems like misinformation production and academic misconduct have become more severe, making LLM-generated text detection now of unprecedented importance. Although existing methods have made remarkable progress, a new challenge posed by text from privately tuned LLMs remains underexplored. Users could easily possess private LLMs by fine-tuning an open-source one with private corpora, resulting in a significant performance drop of existing detectors in practice. To address this issue, we propose PhantomHunter, an LLM-generated text detector specialized for detecting text from unseen, privately-tuned LLMs. Its family-aware learning framework captures family-level traits shared across the base models and their derivatives, instead of memorizing individual characteristics. Experiments on data from LLaMA, Gemma, and Mistral families show its superiority over 7 baselines and 3 industrial services, with F1 scores of over 96%.
cs.CV [Back]
[52] SemIRNet: A Semantic Irony Recognition Network for Multimodal Sarcasm Detection
Jingxuan Zhou,Yuehao Wu,Yibo Zhang,Yeyubei Zhang,Yunchong Liu,Bolin Huang,Chunhong Yuan
Main category: cs.CV
TL;DR: This paper proposes a Semantic Irony Recognition Network (SemIRNet) to accurately identify graphical implicit correlations in multimodal irony detection tasks. It introduces ConceptNet knowledge base, designs two cross-modal semantic similarity detection modules and uses contrastive learning loss function.
Details
Motivation: To solve the problem of difficulty in accurately identifying graphical implicit correlations in multimodal irony detection tasks. Method: Propose SemIRNet which includes introducing ConceptNet knowledge base for common-sense reasoning ability, designing two cross-modal semantic similarity detection modules at word level and sample level, and introducing contrastive learning loss function to optimize the spatial distribution of sample features. Result: Experiments show that the accuracy and F1 value of this model are improved by 1.64% and 2.88% respectively compared with existing optimal methods. Ablation experiments verify the importance of knowledge fusion and semantic similarity detection. Conclusion: The proposed SemIRNet effectively improves the performance in multimodal irony detection tasks. Abstract: Aiming at the problem of difficulty in accurately identifying graphical implicit correlations in multimodal irony detection tasks, this paper proposes a Semantic Irony Recognition Network (SemIRNet). The model contains three main innovations: (1) The ConceptNet knowledge base is introduced for the first time to acquire conceptual knowledge, which enhances the model's common-sense reasoning ability; (2) Two cross-modal semantic similarity detection modules at the word level and sample level are designed to model graphic-textual correlations at different granularities; and (3) A contrastive learning loss function is introduced to optimize the spatial distribution of the sample features, which improves the separability of positive and negative samples. Experiments on a publicly available multimodal irony detection benchmark dataset show that the accuracy and F1 value of this model are improved by 1.64% and 2.88% to 88.87% and 86.33%, respectively, compared with the existing optimal methods. 
Further ablation experiments verify the important role of knowledge fusion and semantic similarity detection in improving the model performance.
[53] Argus Inspection: Do Multimodal Large Language Models Possess the Eye of Panoptes?
Yang Yao,Lingyu Li,Jiaxin Song,Chiyu Chen,Zhenqi He,Yixu Wang,Xin Wang,Tianle Gu,Jie Li,Yan Teng,Yingchun Wang
Main category: cs.CV
TL;DR: Multimodal Large Language Models (MLLMs) have made significant progress in cognitive and reasoning capabilities, but still face challenges in visual fine-grained perception and commonsense causal inference. This paper presents Argus Inspection, a benchmark for evaluating MLLMs' detailed visual recognition and causal reasoning abilities, and Eye of Panoptes, a framework for assessing opinion-based reasoning tasks. Experiments on 26 MLLMs indicate substantial room for improvement.
Details
Motivation: To address the persisting challenges in visual fine-grained perception and commonsense causal inference in MLLMs. Method: Introduced Argus Inspection, a multimodal benchmark with two difficulty levels for evaluating detailed visual recognition and causal reasoning. Proposed the Eye of Panoptes framework, integrating a binary parametric Sigmoid metric with an indicator function for holistic evaluation of opinion-based reasoning tasks. Result: Experiments on 26 mainstream MLLMs showed that the highest performance in visual fine-grained reasoning was only 0.46. Conclusion: The research provides valuable insights for the continued advancement and refinement of MLLMs. Abstract: As Multimodal Large Language Models (MLLMs) continue to evolve, their cognitive and reasoning capabilities have seen remarkable progress. However, challenges in visual fine-grained perception and commonsense causal inference persist. This paper introduces Argus Inspection, a multimodal benchmark with two levels of difficulty, emphasizing detailed visual recognition while incorporating real-world commonsense understanding to evaluate causal reasoning abilities. Expanding on it, we present the Eye of Panoptes framework, which integrates a binary parametric Sigmoid metric with an indicator function, enabling a more holistic evaluation of MLLMs' responses in opinion-based reasoning tasks. Experiments conducted on 26 mainstream MLLMs reveal that the highest performance in visual fine-grained reasoning reaches only 0.46, highlighting considerable potential for enhancement. Our research offers valuable perspectives for the continued refinement of MLLMs.
[54] A Hybrid ConvNeXt-EfficientNet AI Solution for Precise Falcon Disease Detection
Alavikunhu Panthakkan,Zubair Medammal,S M Anzar,Fatma Taher,Hussain Al-Ahmad
Main category: cs.CV
TL;DR: The paper presents a hybrid AI model using ConvNeXt and EfficientNet for classifying falcon diseases, showing superior performance compared to traditional methods.
Details
Motivation: To improve the health surveillance of falcons by developing an accurate method for disease classification. Method: A hybrid AI model combining ConvNeXt and EfficientNet was used for classifying three conditions in falcons: Normal, Liver Disease, and 'Aspergillosis'. The model was trained and validated using a large dataset. Result: The hybrid AI model outperformed traditional diagnostic methods and individual model architectures based on metrics such as accuracy, precision, recall, and F1-score. Conclusion: The implementation of the hybrid AI model represents a significant advancement in falcon disease detection and opens opportunities for future AI-driven avian healthcare solutions. Abstract: Falconry, a revered tradition involving the training and hunting with falcons, requires meticulous health surveillance to ensure the health and safety of these prized birds, particularly in hunting scenarios. This paper presents an innovative method employing a hybrid of ConvNeXt and EfficientNet AI models for the classification of falcon diseases. The study focuses on accurately identifying three conditions: Normal, Liver Disease and 'Aspergillosis'. A substantial dataset was utilized for training and validating the model, with an emphasis on key performance metrics such as accuracy, precision, recall, and F1-score. Extensive testing and analysis have shown that our concatenated AI model outperforms traditional diagnostic methods and individual model architectures. The successful implementation of this hybrid AI model marks a significant step forward in precise falcon disease detection and paves the way for future developments in AI-powered avian healthcare solutions.
[55] ViLLa: A Neuro-Symbolic approach for Animal Monitoring
Harsha Koduri
Main category: cs.CV
TL;DR: A neuro-symbolic framework named ViLLa is introduced for interpretable animal monitoring, integrating visual detection, language parsing and symbolic reasoning. It separates perception, understanding and reasoning to offer modularity and transparency.
Details
Motivation: Current systems lack the ability to effectively interpret both visual data and human language queries in animal monitoring scenarios. Method: ViLLa integrates three core components: a visual detection module for identifying animals and their locations, a language parser for understanding natural language queries, and a symbolic reasoning layer that applies logic-based inference to answer those queries. Result: The system was evaluated on a range of animal imagery tasks and demonstrates the ability to bridge visual content with structured, human-interpretable queries. Conclusion: ViLLa provides a modular and transparent approach to animal monitoring by separating perception, understanding and reasoning. Abstract: Monitoring animal populations in natural environments requires systems that can interpret both visual data and human language queries. This work introduces ViLLa (Vision-Language-Logic Approach), a neuro-symbolic framework designed for interpretable animal monitoring. ViLLa integrates three core components: a visual detection module for identifying animals and their spatial locations in images, a language parser for understanding natural language queries, and a symbolic reasoning layer that applies logic-based inference to answer those queries. Given an image and a question such as "How many dogs are in the scene?" or "Where is the buffalo?", the system grounds visual detections into symbolic facts and uses predefined rules to compute accurate answers related to count, presence, and location. Unlike end-to-end black-box models, ViLLa separates perception, understanding, and reasoning, offering modularity and transparency. The system was evaluated on a range of animal imagery tasks and demonstrates the ability to bridge visual content with structured, human-interpretable queries.
[56] GraphGSOcc: Semantic and Geometric Graph Transformer for 3D Gaussian Splating-based Occupancy Prediction
Ke Song,Yunhe Wu,Chunchit Siu,Huiyuan Xiong
Main category: cs.CV
TL;DR: The paper introduces the GraphGSOcc model, which improves 3D Gaussian Splating methods by addressing feature-aggregation and boundary-ambiguity issues, achieving better mIoU and memory efficiency.
Details
Motivation: Existing 3D Gaussian Splating (3DGS) methods suffer from unified feature aggregation, which neglects semantic correlations, and from boundary ambiguities due to a lack of geometric constraints. Method: The authors propose the GraphGSOcc model with Dual Gaussians Graph Attention, which dynamically constructs dual graph structures: a geometric graph for adaptive KNN search radii calculation and a semantic graph for encoding relationships. In addition, a Multi-scale Graph Attention framework applies fine-grained and coarse-grained attention. Result: Experiments on the SurroundOcc dataset show an mIoU of 24.10%, with GPU memory reduced to 6.1 GB, indicating a 1.97% mIoU improvement and 13.7% memory reduction compared to GaussianWorld. Conclusion: The GraphGSOcc model effectively combines semantic and geometric information for improved 3D Gaussian Splating-based Occupancy Prediction. Abstract: Addressing the task of 3D semantic occupancy prediction for autonomous driving, we tackle two key issues in existing 3D Gaussian Splating (3DGS) methods: (1) unified feature aggregation neglecting semantic correlations among similar categories and across regions, and (2) boundary ambiguities caused by the lack of geometric constraints in MLP iterative optimization. We propose the GraphGSOcc model, a novel framework that combines semantic and geometric graph Transformer for 3D Gaussian Splating-based Occupancy Prediction. We propose the Dual Gaussians Graph Attention, which dynamically constructs dual graph structures: a geometric graph adaptively calculating KNN search radii based on Gaussian poses, enabling large-scale Gaussians to aggregate features from broader neighborhoods while compact Gaussians focus on local geometric consistency; a semantic graph retaining top-M highly correlated nodes via cosine similarity to explicitly encode semantic relationships within and across instances. 
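The semantic-graph step described above (keeping the top-M most correlated nodes by cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the plain-list feature vectors, and the tie-breaking rule are all assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as uncorrelated with everything
    return dot / (norm_a * norm_b)

def top_m_semantic_neighbors(features, m):
    """For each node, return the indices of its m most similar other nodes.

    Hypothetical helper: ties are broken by the lower node index.
    """
    neighbors = []
    for i, f_i in enumerate(features):
        scored = [(cosine_similarity(f_i, f_j), j)
                  for j, f_j in enumerate(features) if j != i]
        scored.sort(key=lambda s: (-s[0], s[1]))
        neighbors.append([j for _, j in scored[:m]])
    return neighbors

# Toy features: nodes 0 and 1 point one way, nodes 2 and 3 another.
features = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
]
graph = top_m_semantic_neighbors(features, m=1)
print(graph)
```

With these toy features each node is linked to the other member of its own cluster. A real system would presumably operate on learned Gaussian features in batched tensor form rather than Python lists.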
Coupled with the Multi-scale Graph Attention framework, fine-grained attention at lower layers optimizes boundary details, while coarse-grained attention at higher layers models object-level topology. Experiments on the SurroundOcc dataset achieve an mIoU of 24.10%, reducing GPU memory to 6.1 GB, demonstrating a 1.97% mIoU improvement and 13.7% memory reduction compared to GaussianWorld.
[57] DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning
Yifeng Gao,Yifan Ding,Hongyu Su,Juncheng Li,Yunhan Zhao,Lin Luo,Zixing Chen,Li Wang,Xin Wang,Yixu Wang,Xingjun Ma,Yu-Gang Jiang
Main category: cs.CV
TL;DR: The paper presents DAVID-X, a dataset of AI-generated videos with detailed annotations and written rationales. It also introduces DAVID-XR1, a video-language model that provides interpretable detection of AI-generated content.
Details
Motivation: To address the challenge of distinguishing AI-generated videos from authentic ones by providing fine-grained, persuasive evidence beyond simple binary classification. Method: Introduction of DAVID-X dataset with defect-level annotations and DAVID-XR1, a video-language model that offers an interpretable chain of visual reasoning including defect categorization, temporal-spatial localization, and natural language explanations. Result: A general-purpose backbone, fine-tuned on the compact DAVID-X dataset and enhanced with chain-of-thought distillation, shows strong generalization across various generators and generation modes. Conclusion: Explainable detection methods hold promise for trustworthy identification of AI-generated video content. Abstract: As AI-generated video becomes increasingly pervasive across media platforms, the ability to reliably distinguish synthetic content from authentic footage has become both urgent and essential. Existing approaches have primarily treated this challenge as a binary classification task, offering limited insight into where or why a model identifies a video as AI-generated. However, the core challenge extends beyond simply detecting subtle artifacts; it requires providing fine-grained, persuasive evidence that can convince auditors and end-users alike. To address this critical gap, we introduce DAVID-X, the first dataset to pair AI-generated videos with detailed defect-level, temporal-spatial annotations and written rationales. Leveraging these rich annotations, we present DAVID-XR1, a video-language model designed to deliver an interpretable chain of visual reasoning, including defect categorization, temporal-spatial localization, and natural language explanations. This approach fundamentally transforms AI-generated video detection from an opaque black-box decision into a transparent and verifiable diagnostic process. 
We demonstrate that a general-purpose backbone, fine-tuned on our compact dataset and enhanced with chain-of-thought distillation, achieves strong generalization across a variety of generators and generation modes. Our results highlight the promise of explainable detection methods for trustworthy identification of AI-generated video content.
[58] Recent Advances in Multi-Agent Human Trajectory Prediction: A Comprehensive Review
Céline Finet,Stephane Da Silva Martins,Jean-Bernard Hayet,Ioannis Karamouzas,Javad Amirian,Sylvie Le Hégarat-Mascle,Julien Pettré,Emanuel Aldea
Main category: cs.CV
TL;DR: This paper reviews recent advancements in deep learning-based multi-agent trajectory prediction from 2020 to 2024, categorizing methods by architecture, input representation, and prediction strategy, with a focus on models evaluated using the ETH/UCY benchmark. It also highlights challenges and future research directions.
Details
Motivation: To gain a finer understanding of multi-agent interactions through reviewing recent advancements in human trajectory prediction using deep learning-based methods. Method: Categorizing existing methods based on architectural design, input representations, and overall prediction strategies, with an emphasis on models evaluated using the ETH/UCY benchmark. Result: Provides a comprehensive review of the most recent advancements in deep learning-based multi-agent trajectory prediction, highlighting key challenges and suggesting future research directions. Conclusion: The field of multi-agent HTP has seen significant progress due to powerful data-driven methods, but there are still important challenges that need to be addressed for further advancement. Abstract: With the emergence of powerful data-driven methods in human trajectory prediction (HTP), gaining a finer understanding of multi-agent interactions lies within hand's reach, with important implications in areas such as autonomous navigation and crowd modeling. This survey reviews some of the most recent advancements in deep learning-based multi-agent trajectory prediction, focusing on studies published between 2020 and 2024. We categorize the existing methods based on their architectural design, their input representations, and their overall prediction strategies, placing a particular emphasis on models evaluated using the ETH/UCY benchmark. Furthermore, we highlight key challenges and future research directions in the field of multi-agent HTP.
[59] ArchShapeNet: An Interpretable 3D-CNN Framework for Evaluating Architectural Shapes
Jun Yin,Jing Zhong,Pengyu Zeng,Peilin Li,Zixuan Dai,Miao Zhang,Shuai Lu
Main category: cs.CV
TL;DR: This paper presents ArchForms-4000, a dataset of human and machine-generated 3D architectural forms, proposes ArchShapeNet for classification and analysis, and conducts experiments revealing insights into the strengths of both design approaches.
Details
Motivation: To objectively analyze the differences between human-designed and machine-generated 3D forms in architecture and overcome the limitations in understanding their respective strengths. Method: Constructed ArchForms-4000 with 2,000 architect-designed and 2,000 Evomass-generated forms. Proposed ArchShapeNet, a 3D CNN with a saliency module for classification and feature highlighting. Conducted comparative experiments to evaluate model performance against human experts. Result: The model achieved 94.29% accuracy, 96.2% precision, and 98.51% recall in distinguishing form origins, outperforming human experts. The analysis also revealed that human-designed forms excel in spatial organization, proportional harmony, and detail refinement. Conclusion: The study emphasizes the unique advantages of human-designed forms while providing insights to improve generative design tools. Abstract: In contemporary architectural design, the growing complexity and diversity of design demands have made generative plugin tools essential for quickly producing initial concepts and exploring novel 3D forms. However, objectively analyzing the differences between human-designed and machine-generated 3D forms remains a challenge, limiting our understanding of their respective strengths and hindering the advancement of generative tools. To address this, we built ArchForms-4000, a dataset containing 2,000 architect-designed and 2,000 Evomass-generated 3D forms; proposed ArchShapeNet, a 3D convolutional neural network tailored for classifying and analyzing architectural forms, incorporating a saliency module to highlight key spatial features aligned with architectural reasoning; and conducted comparative experiments showing our model outperforms human experts in distinguishing form origins, achieving 94.29% accuracy, 96.2% precision, and 98.51% recall. 
This study not only highlights the distinctive advantages of human-designed forms in spatial organization, proportional harmony, and detail refinement but also provides valuable insights for enhancing generative design tools in the future.
[60] Real-Time, Low-Latency Surveillance Using Entropy-Based Adaptive Buffering and MobileNetV2 on Edge Devices
Poojashree Chandrashekar,Pankaj M Sajjanar
Main category: cs.CV
TL;DR: This paper presents a high-performance, low-latency video surveillance system for resource-constrained environments. It integrates an entropy-based adaptive frame buffering algorithm with MobileNetV2, achieving sub-50ms end-to-end inference latency on devices like Raspberry Pi and maintaining over 92% detection accuracy.
Details
Motivation: To create an efficient video surveillance system that operates effectively in resource-constrained environments without compromising performance or accuracy. Method: Proposes an entropy-based adaptive frame buffering algorithm integrated with MobileNetV2 to optimize throughput and reduce latency. The system is designed for processing live video streams on embedded platforms such as Raspberry Pi and NVIDIA Jetson Nano. Result: The system achieves sub-50ms end-to-end inference latency while maintaining over 92% detection accuracy on standard datasets. Comparative and ablation experiments validate the design's effectiveness. Conclusion: The architecture is scalable, cost-effective, and adheres to stricter data privacy regulations, making it suitable for smart city or embedded security systems. Abstract: This paper describes a high-performance, low-latency video surveillance system designed for resource-constrained environments. We have proposed a formal entropy-based adaptive frame buffering algorithm and integrated that with MobileNetV2 to achieve high throughput with low latency. The system is capable of processing live streams of video with sub-50ms end-to-end inference latency on resource-constrained devices (embedded platforms) such as Raspberry Pi, Amazon, and NVIDIA Jetson Nano. Our method maintains over 92% detection accuracy on standard datasets focused on video surveillance and exhibits robustness to varying lighting, backgrounds, and speeds. A number of comparative and ablation experiments validate the effectiveness of our design. Finally, our architecture is scalable, inexpensive, and compliant with stricter data privacy regulations than common surveillance systems, so that the system could coexist in a smart city or embedded security architecture.
[61] MonoVQD: Monocular 3D Object Detection with Variational Query Denoising and Self-Distillation
Kiet Dang Vu,Trung Thai Tran,Duc Dung Nguyen
Main category: cs.CV
TL;DR: MonoVQD is a new framework for monocular 3D detection which improves upon DETR-based architectures through three key contributions: Mask Separated Self-Attention, Variational Query Denoising, and a self-distillation strategy. These innovations enhance performance on benchmarks like KITTI and nuScenes.
Details
Motivation: Monocular 3D object detection from a single image is a challenging task due to the limitations of directly applying DETR-like architectures, which face issues such as unstable Hungarian matching, gradient vanishing in denoising processes, and suboptimal query refinement. Method: The paper introduces MonoVQD with three main components: 1) Mask Separated Self-Attention that integrates denoising into DETR architecture for stable optimization; 2) Variational Query Denoising to address gradient vanishing problem by introducing stochastic properties; 3) A self-distillation strategy leveraging insights from later decoder layers to improve earlier queries. Result: MonoVQD achieves superior performance on the KITTI monocular benchmark and demonstrates significant improvements when integrated into other architectures for multi-view 3D detection on the nuScenes dataset. Conclusion: MonoVQD fundamentally advances DETR-based monocular 3D detection and shows robust generalization capabilities across different datasets and scenarios. Abstract: Precisely localizing 3D objects from a single image constitutes a central challenge in monocular 3D detection. While DETR-like architectures offer a powerful paradigm, their direct application in this domain encounters inherent limitations, preventing optimal performance. Our work addresses these challenges by introducing MonoVQD, a novel framework designed to fundamentally advance DETR-based monocular 3D detection. We propose three main contributions. First, we propose the Mask Separated Self-Attention mechanism that enables the integration of the denoising process into a DETR architecture. This improves the stability of Hungarian matching to achieve a consistent optimization objective. Second, we present the Variational Query Denoising technique to address the gradient vanishing problem of conventional denoising methods, which severely restricts the efficiency of the denoising process. 
This explicitly introduces stochastic properties to mitigate this fundamental limitation and unlock substantial performance gains. Finally, we introduce a sophisticated self-distillation strategy, leveraging insights from later decoder layers to synergistically improve query quality in earlier layers, thereby amplifying the iterative refinement process. Rigorous experimentation demonstrates that MonoVQD achieves superior performance on the challenging KITTI monocular benchmark. Highlighting its broad applicability, MonoVQD's core components seamlessly integrate into other architectures, delivering significant performance gains even in multi-view 3D detection scenarios on the nuScenes dataset and underscoring its robust generalization capabilities.
[62] Improved Iterative Refinement for Chart-to-Code Generation via Structured Instruction
Chengzhi Xu,Yuyang Wang,Lai Wei,Lichao Sun,Weiran Huang
Main category: cs.CV
TL;DR: The paper proposes ChartIR, an iterative refinement method for chart-to-code generation using multimodal large language models (MLLMs). It divides the task into visual understanding and code translation, introducing structured instructions to enhance MLLMs' performance. Experiments show superior results on both Qwen2-VL and GPT-4o models.
Details
Motivation: Existing MLLMs struggle with chart-to-code generation due to its complexity, which requires both precise visual understanding and accurate translation of visual elements into structured code. Method: ChartIR is an iterative refinement method that distinguishes two tasks: visual understanding and code translation. For visual understanding, it uses structured instructions of description and difference types. The overall pipeline is decomposed into initial code generation and iterative refinement stages. Result: ChartIR outperforms other methods in chart-to-code generation tasks when applied to both the open-source model Qwen2-VL and the closed-source model GPT-4o. Conclusion: ChartIR effectively improves MLLMs' performance in chart-to-code generation by breaking down the task and utilizing structured instructions, as evidenced by its superior experimental results. Abstract: Recently, multimodal large language models (MLLMs) have attracted increasing research attention due to their powerful visual understanding capabilities. While they have achieved impressive results on various vision tasks, their performance on chart-to-code generation remains suboptimal. This task requires MLLMs to generate executable code that can reproduce a given chart, demanding not only precise visual understanding but also accurate translation of visual elements into structured code. Directly prompting MLLMs to perform this complex task often yields unsatisfactory results. To address this challenge, we propose ChartIR, an iterative refinement method based on structured instruction. First, we distinguish two tasks: visual understanding and code translation. To accomplish the visual understanding component, we design two types of structured instructions: description and difference. The description instruction captures the visual elements of the reference chart, while the difference instruction characterizes the discrepancies between the reference chart and the generated chart. 
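The two-stage pipeline summarized above (initial code generation, then iterative refinement driven by description and difference instructions) can be sketched as a simple control loop. Everything here is a stand-in chosen for illustration: the callables `describe`, `generate`, `render`, `difference`, and `refine` are hypothetical placeholders for model calls, and charts and code are modeled as plain dicts so the loop runs without any model.

```python
def refine_chart_code(reference, describe, generate, render,
                      difference, refine, max_rounds=3):
    """Generic generate-then-refine loop (an assumed reading of the pipeline)."""
    description = describe(reference)      # "description" instruction
    code = generate(description)           # stage 1: initial code generation
    for _ in range(max_rounds):            # stage 2: iterative refinement
        generated = render(code)
        diff = difference(reference, generated)  # "difference" instruction
        if not diff:
            break                          # nothing left to correct
        code = refine(code, diff)
    return code

# Toy stand-ins: a "chart" and its "code" are both plain dicts here.
reference_chart = {"type": "bar", "color": "red", "title": "Sales"}

def describe(chart):
    return dict(chart)

def generate(description):
    # Deliberately imperfect first draft: keeps only the chart type.
    return {"type": description["type"]}

def render(code):
    return dict(code)

def difference(reference, generated):
    # Properties whose value differs from the reference chart.
    return {k: v for k, v in reference.items() if generated.get(k) != v}

def refine(code, diff):
    # Fix one discrepancy per round (alphabetically first key).
    key = sorted(diff)[0]
    fixed = dict(code)
    fixed[key] = diff[key]
    return fixed

final_code = refine_chart_code(reference_chart, describe, generate,
                               render, difference, refine)
print(final_code)
```

In this toy run the first draft lacks the color and title, and the loop repairs one property per round until the difference is empty. The stopping rule and round budget are assumptions, not details taken from the paper.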
These instructions effectively transform visual features into language representations, thereby facilitating the subsequent code translation process. Second, we decompose the overall chart generation pipeline into two stages: initial code generation and iterative refinement, enabling progressive enhancement of the final output. Experimental results show that, compared to other methods, our method achieves superior performance on both the open-source model Qwen2-VL and the closed-source model GPT-4o.
[63] PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers
Lukas Schiesser,Cornelius Wolff,Sophie Haas,Simon Pukrop
Main category: cs.CV
TL;DR: The paper addresses the challenge of building image classification models in data-scarce domains. It introduces PictSure, an in-context learning (ICL) framework that focuses on the role of image embeddings in few-shot image classification. Systematic experiments reveal that embedding model pretraining strongly affects performance. PictSure excels on out-of-domain benchmarks while matching existing models on in-domain tasks.
Details
Motivation: The motivation stems from the difficulty of constructing image classification models in scenarios with limited data, where acquiring large labeled datasets is not feasible. ICL has shown potential in few-shot image classification but the influence of image embeddings in this context has been largely ignored. Method: The method involves developing PictSure, an ICL framework emphasizing the embedding model's architecture, pretraining, and training dynamics. It investigates various visual encoder types, pretraining objectives, and fine-tuning strategies to understand their impact on few-shot image classification performance. Result: The experiments demonstrate that the success of training and out-of-domain performance are significantly influenced by how embedding models are pretrained. PictSure surpasses existing ICL-based models in out-of-domain benchmarks while showing similar results in in-domain tasks. Conclusion: PictSure offers a novel approach to few-shot image classification by focusing on the importance of image embeddings. It achieves better out-of-domain performance compared to current methods while maintaining competitive in-domain results. Abstract: Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) has emerged as a promising paradigm for few-shot image classification (FSIC), enabling models to generalize across domains without gradient-based adaptation. However, prior work has largely overlooked a critical component of ICL-based FSIC pipelines: the role of image embeddings. In this work, we present PictSure, an ICL framework that places the embedding model -- its architecture, pretraining, and training dynamics -- at the center of analysis. We systematically examine the effects of different visual encoder types, pretraining objectives, and fine-tuning strategies on downstream FSIC performance. 
Our experiments show that the training success and the out-of-domain performance are highly dependent on how the embedding models are pretrained. Consequently, PictSure manages to outperform existing ICL-based FSIC models on out-of-domain benchmarks that differ significantly from the training distribution, while maintaining comparable results on in-domain tasks. Code can be found at https://github.com/PictSure/pictsure-library.
[64] Finding Optimal Kernel Size and Dimension in Convolutional Neural Networks An Architecture Optimization Approach
Shreyas Rajeev,B Sathish Babu
Main category: cs.CV
TL;DR: This paper proposes BKSEF for optimal kernel size selection in CNNs, improving accuracy and efficiency.
Details
Motivation: Kernel size in CNNs impacts multiple performance factors but lacks systematic optimization. Method: BKSEF integrates theories to determine layer-wise kernel sizes balancing information gain, efficiency, and accuracy. Result: Achieves up to 3.1% accuracy improvement and 42.8% FLOPs reduction on various datasets. Conclusion: Kernel size can be actively optimized, offering practical heuristics for efficient CNN designs. Abstract: Kernel size selection in Convolutional Neural Networks (CNNs) is a critical but often overlooked design decision that affects receptive field, feature extraction, computational cost, and model accuracy. This paper proposes the Best Kernel Size Estimation Function (BKSEF), a mathematically grounded and empirically validated framework for optimal, layer-wise kernel size determination. BKSEF balances information gain, computational efficiency, and accuracy improvements by integrating principles from information theory, signal processing, and learning theory. Extensive experiments on CIFAR-10, CIFAR-100, ImageNet-lite, ChestX-ray14, and GTSRB datasets demonstrate that BKSEF-guided architectures achieve up to 3.1 percent accuracy improvement and 42.8 percent reduction in FLOPs compared to traditional models using uniform 3x3 kernels. Two real-world case studies further validate the approach: one for medical image classification in a cloud-based setup, and another for traffic sign recognition on edge devices. The former achieved enhanced interpretability and accuracy, while the latter reduced latency and model size significantly, with minimal accuracy trade-off. These results show that kernel size can be an active, optimizable parameter rather than a fixed heuristic. BKSEF provides practical heuristics and theoretical support for researchers and developers seeking efficient and application-aware CNN designs. 
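To make the trade-off between kernel size, receptive field, and compute concrete, here is a small sketch using standard convolution arithmetic. It is not BKSEF itself, whose scoring function is not given in this summary; the function names and the example layer shape are assumptions for illustration.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, kernel_size):
    """Multiply-accumulate operations of one standard k x k convolution layer."""
    return h_out * w_out * c_in * c_out * kernel_size * kernel_size

def receptive_field(kernel_sizes, strides=None):
    """Receptive field (in input pixels) of a stack of convolution layers."""
    if strides is None:
        strides = [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump   # each layer widens the field by (k - 1) * jump
        jump *= s              # stride compounds the spacing between outputs
    return rf

# Hypothetical layer: 32 x 32 output map, 64 input and 64 output channels.
macs_3x3 = conv2d_macs(32, 32, 64, 64, 3)
macs_5x5 = conv2d_macs(32, 32, 64, 64, 5)

print(macs_3x3, macs_5x5)            # cost grows with kernel_size ** 2
print(receptive_field([3, 3]))       # two stacked 3x3 layers
print(receptive_field([5]))          # one 5x5 layer
```

Two stacked 3x3 layers cover the same 5-pixel receptive field as a single 5x5 layer at 18/25 of its multiply-accumulate cost for equal channel widths, which is the kind of per-layer trade-off a kernel-size selection method has to weigh.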
It is suitable for integration into neural architecture search pipelines and real-time systems, offering a new perspective on CNN optimization.
[65] Efficient Retail Video Annotation: A Robust Key Frame Generation Approach for Product and Customer Interaction Analysis
Varun Mannam,Zhenyu Shi
Main category: cs.CV
TL;DR: Automated key-frame identification and annotation in retail videos via deep learning, reducing annotation costs by 2x and requiring human verification for fewer than 5% of frames.
Details
Motivation: Accurate video annotation is crucial in modern retail applications but traditional methods are time-consuming and costly due to manual labeling. Method: A deep learning-based approach automates key-frame identification using deep neural networks that embed video frames and incorporate object detection techniques specific to retail environments. Result: The method achieves accuracy comparable to human annotators, reduces costs by 2 times, and requires human verification for less than 5% of frames. Conclusion: This automated approach significantly reduces operational costs and enhances efficiency in retail video annotation tasks. Abstract: Accurate video annotation plays a vital role in modern retail applications, including customer behavior analysis, product interaction detection, and in-store activity recognition. However, conventional annotation methods heavily rely on time-consuming manual labeling by human annotators, introducing non-robust frame selection and increasing operational costs. To address these challenges in the retail domain, we propose a deep learning-based approach that automates key-frame identification in retail videos and provides automatic annotations of products and customers. Our method leverages deep neural networks to learn discriminative features by embedding video frames and incorporating object detection-based techniques tailored for retail environments. Experimental results showcase the superiority of our approach over traditional methods, achieving accuracy comparable to human annotator labeling while enhancing the overall efficiency of retail video annotation. Remarkably, our approach leads to an average of 2 times cost savings in video annotation. By allowing human annotators to verify/adjust less than 5% of detected frames in the video dataset, while automating the annotation process for the remaining frames without reducing annotation quality, retailers can significantly reduce operational costs. 
The automation of key-frame detection enables substantial time and effort savings in retail video labeling tasks, proving highly valuable for diverse retail applications such as shopper journey analysis, product interaction detection, and in-store security monitoring.
[66] Peering into the Unknown: Active View Selection with Neural Uncertainty Maps for 3D Reconstruction
Zhengquan Zhang,Feng Xu,Mengmi Zhang
Main category: cs.CV
TL;DR: An AI system for active view selection (AVS) in 3D reconstruction is developed, using a lightweight neural network called UPNet to predict uncertainty maps and select the most informative viewpoints. This approach achieves comparable accuracy with fewer viewpoints and significantly reduced computational resources.
Details
Motivation: Current AVS methods for 3D reconstruction often rely on learning radiance fields or computing uncertainties for each viewpoint, which can be computationally expensive. There's a need for a more efficient method that can reduce the number of viewpoints while maintaining reconstruction accuracy. Method: A novel AVS approach is introduced, utilizing UPNet, a lightweight feedforward deep neural network, to predict uncertainty maps from single input images. These maps help in selecting the most informative viewpoints by suppressing redundant ones. The selected viewpoints are then used to train 3D neural rendering models. Result: The method achieves comparable 3D reconstruction accuracy using only half of the viewpoints compared to upper bound methods. It also offers a significant reduction in computational resources, achieving up to 400 times speedup and over 50% reductions in CPU, RAM, and GPU usage. Additionally, it generalizes well to novel object categories without additional training. Conclusion: UPNet provides an efficient solution for AVS in 3D reconstruction, reducing computational overhead while maintaining accuracy. Its ability to generalize to new object categories makes it a promising advancement in the field. Abstract: Some perspectives naturally provide more information than others. How can an AI system determine which viewpoint offers the most valuable insight for accurate and efficient 3D object reconstruction? Active view selection (AVS) for 3D reconstruction remains a fundamental challenge in computer vision. The aim is to identify the minimal set of views that yields the most accurate 3D reconstruction. Instead of learning radiance fields, like NeRF or 3D Gaussian Splatting, from a current observation and computing uncertainty for each candidate viewpoint, we introduce a novel AVS approach guided by neural uncertainty maps predicted by a lightweight feedforward deep neural network, named UPNet. 
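The selection loop described in the Method summary (predict an uncertainty map per observed view, combine the maps, pick the most informative candidate) can be sketched as a greedy procedure. This is only an assumed reading: the element-wise minimum used to combine maps, the tie-breaking rule, and the toy `ring_uncertainty` predictor standing in for UPNet are all hypothetical.

```python
def select_views(start_view, num_candidates, predict_uncertainty, budget):
    """Greedy active view selection over a fixed set of candidate viewpoints."""
    selected = [start_view]
    aggregated = list(predict_uncertainty(start_view))
    while len(selected) < budget:
        for v in selected:
            aggregated[v] = 0.0  # never re-select an already chosen view
        # Most uncertain candidate; ties go to the lower index.
        next_view = max(range(num_candidates),
                        key=lambda c: (aggregated[c], -c))
        selected.append(next_view)
        new_map = predict_uncertainty(next_view)
        # Assumed aggregation: a candidate stays uncertain only if every
        # observed view finds it uncertain (element-wise minimum).
        aggregated = [min(a, b) for a, b in zip(aggregated, new_map)]
    return selected

NUM_VIEWS = 8

def ring_uncertainty(view):
    """Toy predictor: uncertainty grows with angular distance on a ring."""
    half = NUM_VIEWS // 2
    return [min(abs(c - view), NUM_VIEWS - abs(c - view)) / half
            for c in range(NUM_VIEWS)]

views = select_views(0, NUM_VIEWS, ring_uncertainty, budget=3)
print(views)
```

On the toy ring the loop first jumps to the opposite viewpoint and then to one midway between the two observed views, which matches the intuition of suppressing redundant candidates.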
UPNet takes a single input image of a 3D object and outputs a predicted uncertainty map, representing uncertainty values across all possible candidate viewpoints. By leveraging heuristics derived from observing many natural objects and their associated uncertainty patterns, we train UPNet to learn a direct mapping from viewpoint appearance to uncertainty in the underlying volumetric representations. Next, our approach aggregates all previously predicted neural uncertainty maps to suppress redundant candidate viewpoints and effectively select the most informative one. Using these selected viewpoints, we train 3D neural rendering models and evaluate the quality of novel view synthesis against other competitive AVS methods. Remarkably, despite using only half as many viewpoints as the upper bound, our method achieves comparable reconstruction accuracy. In addition, it significantly reduces computational overhead during AVS, achieving up to a 400 times speedup along with over 50% reductions in CPU, RAM, and GPU usage compared to baseline methods. Notably, our approach generalizes effectively to AVS tasks involving novel object categories, without requiring any additional training.
[67] DETONATE: A Benchmark for Text-to-Image Alignment and Kernelized Direct Preference Optimization
Renjith Prasad,Abhilekh Borah,Hasnat Md Abdullah,Chathurangi Shyalika,Gurpreet Singh,Ritvik Garimella,Rajarshi Roy,Harshul Surana,Nasrin Imanpour,Suranjana Trivedy,Amit Sheth,Amitava Das
Main category: cs.CV
TL;DR: The paper introduces DPO-Kernels for T2I models with a novel extension enhancing alignment across three dimensions and DETONATE, a large-scale benchmark for evaluating social bias and discrimination.
Details
Motivation: To ensure generated images capture user intent faithfully while maintaining safety and fairness. Method: DPO-Kernels for T2I models include Hybrid Loss, Kernelized Representations, and Divergence Selection. Also introduced is DETONATE, a large-scale benchmark, and AQI, a geometric measure. Result: Empirical results show that DPO-Kernels maintain strong generalization bounds via Heavy-Tailed Self-Regularization (HT-SR). Conclusion: The advancements in alignment techniques and the introduction of DETONATE and AQI contribute significantly to improving safety and fairness in T2I models. Abstract: Alignment is crucial for text-to-image (T2I) models to ensure that generated images faithfully capture user intent while maintaining safety and fairness. Direct Preference Optimization (DPO), prominent in large language models (LLMs), is extending its influence to T2I systems. This paper introduces DPO-Kernels for T2I models, a novel extension enhancing alignment across three dimensions: (i) Hybrid Loss, integrating embedding-based objectives with traditional probability-based loss for improved optimization; (ii) Kernelized Representations, employing Radial Basis Function (RBF), Polynomial, and Wavelet kernels for richer feature transformations and better separation between safe and unsafe inputs; and (iii) Divergence Selection, expanding beyond DPO's default Kullback-Leibler (KL) regularizer by incorporating Wasserstein and Rényi divergences for enhanced stability and robustness. We introduce DETONATE, the first large-scale benchmark of its kind, comprising approximately 100K curated image pairs categorized as chosen and rejected. DETONATE encapsulates three axes of social bias and discrimination: Race, Gender, and Disability. Prompts are sourced from hate speech datasets, with images generated by leading T2I models including Stable Diffusion 3.5 Large, Stable Diffusion XL, and Midjourney. 
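As a rough illustration of two of the ingredients named above, the following sketch shows the standard DPO loss and an RBF kernel in NumPy. It is not the DPO-Kernels formulation: how the paper combines the probability-based loss with kernelized embedding terms is not stated in this summary, so the two pieces are shown separately. The function names and the default values `beta=0.1` and `sigma=1.0` are assumptions made for this example.

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) = -log(1 + exp(-z)).
    return -np.logaddexp(0.0, -z)

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss averaged over a batch of (chosen, rejected) pairs.

    Inputs are log-probabilities under the policy and a frozen reference model.
    beta is an assumed default, not a value taken from the paper.
    """
    chosen_margin = np.asarray(logp_chosen) - np.asarray(ref_logp_chosen)
    rejected_margin = np.asarray(logp_rejected) - np.asarray(ref_logp_rejected)
    return float(np.mean(-log_sigmoid(beta * (chosen_margin - rejected_margin))))

def rbf_kernel(a, b, sigma=1.0):
    """RBF kernel k(a, b) = exp(-||a - b||^2 / (2 * sigma^2)), computed row-wise."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    sq_dist = np.sum((a - b) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))
```

When the policy and the reference agree on both samples, the loss equals log 2; it falls as the policy favors the chosen sample more strongly than the reference does.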
Additionally, we propose the Alignment Quality Index (AQI), a novel geometric measure quantifying latent-space separability of safe/unsafe image activations, revealing hidden vulnerabilities. Empirically, we demonstrate that DPO-Kernels maintain strong generalization bounds via Heavy-Tailed Self-Regularization (HT-SR). DETONATE and complete code are publicly released.
[68] PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning
Yizhen Zhang,Yang Ding,Shuoshuo Zhang,Xinchen Zhang,Haoling Li,Zhong-zhi Li,Peijie Wang,Jie Wu,Lei Ji,Yelong Shen,Yujiu Yang,Yeyun Gong
Main category: cs.CV
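TL;DR: PeRL is a reinforcement learning approach for interleaved multi-image vision-language reasoning that permutes image sequences and filters rollouts for resampling. It achieves state-of-the-art performance on multi-image benchmarks while remaining comparable on single-image tasks.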
Details
Motivation: Existing multimodal reinforcement learning approaches are limited to spatial reasoning within single-image contexts and struggle with multi-image positional reasoning. Method: Propose PeRL, a general reinforcement learning approach for interleaved multimodal tasks, and a multi-stage strategy including permutation of image sequences and rollout filtering mechanism for resampling. Result: Experiments on 5 multi-image benchmarks and 3 single-image benchmarks show that PeRL surpasses R1-related and interleaved VLM baselines, achieving state-of-the-art performance on multi-image benchmarks while maintaining comparable performance on single-image tasks. Conclusion: PeRL improves learning efficiency and task performance for multimodal reasoning tasks. Abstract: Inspired by the impressive reasoning capabilities demonstrated by reinforcement learning approaches like DeepSeek-R1, recent emerging research has begun exploring the use of reinforcement learning (RL) to enhance vision-language models (VLMs) for multimodal reasoning tasks. However, most existing multimodal reinforcement learning approaches remain limited to spatial reasoning within single-image contexts, yet still struggle to generalize to more complex and real-world scenarios involving multi-image positional reasoning, where understanding the relationships across images is crucial. To address this challenge, we propose a general reinforcement learning approach PeRL tailored for interleaved multimodal tasks, and a multi-stage strategy designed to enhance the exploration-exploitation trade-off, thereby improving learning efficiency and task performance. Specifically, we introduce permutation of image sequences to simulate varied positional relationships to explore more spatial and positional diversity. Furthermore, we design a rollout filtering mechanism for resampling to focus on trajectories that contribute most to learning optimal behaviors to exploit learned policies effectively. 
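The two mechanisms described above can be sketched in a few lines. This is a minimal illustration, not PeRL's implementation. It assumes that the answer to a question is the index of one of the images, so that it must be remapped after shuffling, and that rollout filtering keeps only prompts whose sampled rewards are neither all failures nor all successes. Both assumptions, and all function names, are hypothetical.

```python
import random

def permute_images(images, answer_index, rng):
    """Shuffle the image order and remap an index-valued answer to match.

    Assumes the ground-truth answer refers to an image by position.
    """
    order = list(range(len(images)))
    rng.shuffle(order)
    shuffled = [images[i] for i in order]
    new_answer = order.index(answer_index)  # new position of the answer image
    return shuffled, new_answer

def filter_rollout_groups(groups, low=0.0, high=1.0):
    """Return prompt ids whose mean rollout reward lies strictly between low and high.

    Groups where every rollout fails or every rollout succeeds carry little
    learning signal under this assumed criterion, so they are dropped.
    """
    kept = []
    for prompt_id, rewards in groups.items():
        mean_reward = sum(rewards) / len(rewards)
        if low < mean_reward < high:
            kept.append(prompt_id)
    return kept
```

A usage example: `permute_images(["a.png", "b.png", "c.png"], 1, random.Random(0))` returns the shuffled list together with the new position of `"b.png"`.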
We evaluate our model on 5 widely-used multi-image benchmarks and 3 single-image benchmarks. Our experiments confirm that the PeRL-trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin, achieving state-of-the-art performance on multi-image benchmarks, while preserving comparable performance on single-image tasks.
[69] Frequency-Calibrated Membership Inference Attacks on Medical Image Diffusion Models
Xinkai Zhao,Yuta Tokuoka,Junichiro Iwasawa,Keita Oda
Main category: cs.CV
TL;DR: A new method named Frequency-Calibrated Reconstruction Error (FCRE) is proposed to improve Membership Inference Attack (MIA) on medical image diffusion models by focusing on mid-frequency reconstruction errors. It outperforms existing MIA methods in experiments on medical image datasets.
Details
Motivation: The motivation of this paper is to address the privacy concerns related to the use of diffusion models for image generation in sensitive areas like medical imaging. Existing MIA methods relying on diffusion reconstruction errors face challenges when applied to medical images due to the influence of inherent image difficulty and the struggle of diffusion models with high-frequency detail reconstruction. Method: The proposed method, Frequency-Calibrated Reconstruction Error (FCRE), focuses on reconstruction errors within a specific mid-frequency range. It excludes high-frequency and low-frequency regions to mitigate the confounding factor of inherent image difficulty. The approach involves analyzing the reverse diffusion process, obtaining the mid-frequency reconstruction error, and computing the structural similarity index score between the reconstructed and original images. Result: Experiments conducted on several medical image datasets show that the FCRE method outperforms existing MIA methods in determining membership based on the computed scores. Conclusion: The FCRE method provides an effective solution to enhance MIAs on medical image diffusion models by concentrating on mid-frequency reconstruction errors, leading to better performance compared to existing approaches. Abstract: The increasing use of diffusion models for image generation, especially in sensitive areas like medical imaging, has raised significant privacy concerns. Membership Inference Attack (MIA) has emerged as a potential approach to determine if a specific image was used to train a diffusion model, thus quantifying privacy risks. Existing MIA methods often rely on diffusion reconstruction errors, where member images are expected to have lower reconstruction errors than non-member images. However, applying these methods directly to medical images faces challenges. 
Reconstruction error is influenced by inherent image difficulty, and diffusion models struggle with high-frequency detail reconstruction. To address these issues, we propose a Frequency-Calibrated Reconstruction Error (FCRE) method for MIAs on medical image diffusion models. By focusing on reconstruction errors within a specific mid-frequency range and excluding both high-frequency (difficult to reconstruct) and low-frequency (less informative) regions, our frequency-selective approach mitigates the confounding factor of inherent image difficulty. Specifically, we analyze the reverse diffusion process, obtain the mid-frequency reconstruction error, and compute the structural similarity index score between the reconstructed and original images. Membership is determined by comparing this score to a threshold. Experiments on several medical image datasets demonstrate that our FCRE method outperforms existing MIA methods.
[70] Vision Transformers for End-to-End Quark-Gluon Jet Classification from Calorimeter Images
Md Abrar Jahin,Shahriar Soudeep,Arian Rahman Aditta,M. F. Mridha,Nafiz Fahad,Md. Jakir Hossen
Main category: cs.CV
TL;DR: The paper explores the use of Vision Transformer (ViT) architectures for quark-gluon jet classification using calorimeter images, showing that ViT-based models outperform traditional CNNs.
Details
Motivation: Distinguishing between quark- and gluon-initiated jets is crucial in high-energy physics, yet it remains a challenging task. Current deep learning methods, particularly CNNs, have shown promise but the potential of ViTs for this task has not been fully explored, especially under realistic conditions. Method: The authors construct multi-channel jet-view images from detector-level energy deposits and tracks, then systematically evaluate ViT and hybrid ViT-CNN models for quark-gluon jet classification using simulated CMS Open Data from 2012. Result: ViT-based models, including hybrids like ViT+MaxViT and ViT+ConvNeXt, consistently surpass traditional CNN baselines in F1-score, ROC-AUC, and accuracy metrics. Conclusion: This study establishes a systematic framework and robust performance benchmarks for applying ViT architectures to calorimeter image-based jet classification, providing valuable insights and resources for further deep learning research in this area. Abstract: Distinguishing between quark- and gluon-initiated jets is a critical and challenging task in high-energy physics, pivotal for improving new physics searches and precision measurements at the Large Hadron Collider. While deep learning, particularly Convolutional Neural Networks (CNNs), has advanced jet tagging using image-based representations, the potential of Vision Transformer (ViT) architectures, renowned for modeling global contextual information, remains largely underexplored for direct calorimeter image analysis, especially under realistic detector and pileup conditions. This paper presents a systematic evaluation of ViTs and ViT-CNN hybrid models for quark-gluon jet classification using simulated 2012 CMS Open Data. We construct multi-channel jet-view images from detector-level energy deposits (ECAL, HCAL) and reconstructed tracks, enabling an end-to-end learning approach. 
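The construction of multi-channel jet-view images can be sketched as a 2D histogram per detector component. This is an illustrative sketch only: the 32x32 grid, the half-width of 0.4 around the jet axis, and the input layout of (d_eta, d_phi, value) rows are assumptions made for the example, not the paper's settings.

```python
import numpy as np

def jet_image(ecal, hcal, tracks, bins=32, half_width=0.4):
    """Stack ECAL, HCAL, and track deposits into a (3, bins, bins) image.

    Each input is a sequence of rows (d_eta, d_phi, value), with coordinates
    given relative to the jet axis. Deposits outside the window are ignored.
    """
    edges = np.linspace(-half_width, half_width, bins + 1)
    channels = []
    for hits in (ecal, hcal, tracks):
        hits = np.asarray(hits, dtype=float).reshape(-1, 3)
        # Sum the deposited value falling into each (eta, phi) pixel.
        img, _, _ = np.histogram2d(
            hits[:, 0], hits[:, 1], bins=(edges, edges), weights=hits[:, 2]
        )
        channels.append(img)
    return np.stack(channels)
```

The resulting array has one channel per detector component, which matches the channels-first input layout that image classifiers such as ViTs or CNNs commonly expect.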
Our comprehensive benchmarking demonstrates that ViT-based models, notably ViT+MaxViT and ViT+ConvNeXt hybrids, consistently outperform established CNN baselines in F1-score, ROC-AUC, and accuracy, highlighting the advantage of capturing long-range spatial correlations within jet substructure. This work establishes the first systematic framework and robust performance baselines for applying ViT architectures to calorimeter image-based jet classification using public collider data, alongside a structured dataset suitable for further deep learning research in this domain.
[71] Advances in Compliance Detection: Novel Models Using Vision-Based Tactile Sensors
Ziteng Li,Malte Kuhlmann,Ilana Nisky,Nicolás Navarro-Guerrero
Main category: cs.CV
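TL;DR: This paper proposes two models, based on LRCNs and Transformer architectures, that use RGB tactile images and other information captured by the GelSight sensor to accurately predict object compliance metrics. Experiments show that the proposed models improve significantly over the baseline, reveal a correlation between sensor compliance and object compliance estimation, and find that harder objects are more difficult to estimate.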
Details
Motivation: Traditional compliance detection methods suffer from poor portability and scalability, rely on expensive equipment, and are unsuitable for robotic applications. Existing neural network-based methods that use vision-based tactile sensors still lack sufficient prediction accuracy. Method: Two models based on Long-term Recurrent Convolutional Networks (LRCNs) and Transformer architectures are proposed, combining RGB tactile images and other information captured by the GelSight sensor to predict compliance metrics. Result: The models were validated with multiple metrics, demonstrating their effectiveness in accurately estimating compliance and a significant improvement over the baseline. The study also found a correlation between sensor compliance and object compliance estimation, and that harder objects are more difficult to estimate. Conclusion: The proposed models effectively improve the accuracy of compliance prediction, offer a new solution for engineering, agriculture, and biomedical applications, and reveal the influence of sensor compliance on object compliance estimation. Abstract: Compliance is a critical parameter for describing objects in engineering, agriculture, and biomedical applications. Traditional compliance detection methods are limited by their lack of portability and scalability, rely on specialized, often expensive equipment, and are unsuitable for robotic applications. Moreover, existing neural network-based approaches using vision-based tactile sensors still suffer from insufficient prediction accuracy. In this paper, we propose two models based on Long-term Recurrent Convolutional Networks (LRCNs) and Transformer architectures that leverage RGB tactile images and other information captured by the vision-based sensor GelSight to predict compliance metrics accurately. We validate the performance of these models using multiple metrics and demonstrate their effectiveness in accurately estimating compliance. The proposed models exhibit significant performance improvement over the baseline. Additionally, we investigated the correlation between sensor compliance and object compliance estimation, which revealed that objects that are harder than the sensor are more challenging to estimate.
[72] Hyper-Local Deformable Transformers for Text Spotting on Historical Maps
Yijun Lin,Yao-Yi Chiang
Main category: cs.CV
TL;DR: The paper introduces PALETTE, an end-to-end text spotter for historical maps that includes a hyper-local sampling module and positional embeddings to improve detection and recognition of map text. It also presents SynthMap+, a method for generating synthetic training data. Experiments show PALETTE outperforms existing methods on new benchmark datasets, especially for long and angled text.
Details
Motivation: Text extraction from historical maps is challenging due to lack of effective methods and training data, with previous approaches limited to specific map styles and recent machine learning-based methods struggling with precise feature extraction for complex map text. Method: PALETTE is proposed as an end-to-end text spotter featuring a hyper-local sampling module for localized image feature learning around boundary points and characters, and hyper-local positional embeddings for spatial interactions within and across text instances. SynthMap+ is introduced to automatically generate synthetic map images for training. Result: Experiments on two new benchmark datasets demonstrate that PALETTE with SynthMap+ outperforms state-of-the-art text spotters, particularly in handling long and angled text. The system has been used to process over 60,000 maps, generating over 100 million text labels for map searching. Conclusion: PALETTE, along with SynthMap+, provides a robust solution for text extraction from scanned historical maps, significantly improving performance on challenging text instances and supporting large-scale map processing. Abstract: Text on historical maps contains valuable information providing georeferenced historical, political, and cultural contexts. However, text extraction from historical maps is challenging due to the lack of (1) effective methods and (2) training data. Previous approaches use ad-hoc steps tailored to only specific map styles. Recent machine learning-based text spotters (e.g., for scene images) have the potential to solve these challenges because of their flexibility in supporting various types of text instances. However, these methods still face challenges in extracting precise image features for predicting every sub-component (boundary points and characters) in a text instance. 
This is critical because map text can be lengthy and highly rotated with complex backgrounds, posing difficulties in detecting relevant image features from a rough text region. This paper proposes PALETTE, an end-to-end text spotter for a wide variety of scanned historical maps. PALETTE introduces a novel hyper-local sampling module to explicitly learn localized image features around the target boundary points and characters of a text instance for detection and recognition. PALETTE also enables hyper-local positional embeddings to learn spatial interactions between boundary points and characters within and across text instances. In addition, this paper presents a novel approach to automatically generate synthetic map images, SynthMap+, for training text spotters for historical maps. Experiments show that PALETTE with SynthMap+ outperforms SOTA text spotters on two new benchmark datasets of historical maps, particularly for long and angled text. We have deployed PALETTE with SynthMap+ to process over 60,000 maps in the David Rumsey Historical Map collection and generated over 100 million text labels to support map searching. The project is released at https://github.com/kartta-foundation/mapkurator-palette-doc.
[73] Break Stylistic Sophon: Are We Really Meant to Confine the Imagination in Style Transfer?
Gary Song Yan,Yusen Zhang,Jinyu Zhao,Hao Zhang,Zhangping Yang,Guanye Xiong,Yanfei Liu,Tao Zhang,Yujie He,Siyuan Tian,Yao Gou,Min Li
Main category: cs.CV
TL;DR: This study introduces StyleWallfacer, a novel framework that improves style transfer by introducing semantic-based style injection, human feedback-driven data augmentation, and a training-free triple diffusion process. It achieves artist-level style transfer with preserved content and introduces image color editing during the process.
Details
Motivation: Traditional methods for style transfer encounter various issues, prompting the need for a unified and improved framework to handle different tasks effectively while maintaining high quality and control over the results. Method: 1. Semantic-based style injection using BLIP and CLIP space to generate aligned text descriptions and create a semantic gap for efficient knowledge injection. 2. Data augmentation strategy based on human feedback to reduce overfitting by incorporating high-quality samples into the training set. 3. Training-free triple diffusion process manipulating self-attention layers features similar to cross-attention mechanism, preserving original content with query preservation. Result: Achieved high-quality image-driven style transfer and text-driven stylization with artist-level results while preserving original content. Also, achieved image color editing during the style transfer process for the first time. Conclusion: StyleWallfacer offers a revolutionary approach to style transfer by unifying the framework for different tasks and providing superior quality with preserved content and new capabilities such as image color editing. Abstract: In this pioneering study, we introduce StyleWallfacer, a groundbreaking unified training and inference framework, which not only addresses various issues encountered in the style transfer process of traditional methods but also unifies the framework for different tasks. This framework is designed to revolutionize the field by enabling artist level style transfer and text driven stylization. First, we propose a semantic-based style injection method that uses BLIP to generate text descriptions strictly aligned with the semantics of the style image in CLIP space. By leveraging a large language model to remove style-related descriptions from these descriptions, we create a semantic gap. This gap is then used to fine-tune the model, enabling efficient and drift-free injection of style knowledge. 
Second, we propose a data augmentation strategy based on human feedback, incorporating high-quality samples generated early in the fine-tuning process into the training set to facilitate progressive learning and significantly reduce overfitting. Finally, we design a training-free triple diffusion process using the fine-tuned model, which manipulates the features of self-attention layers in a manner similar to the cross-attention mechanism. Specifically, in the generation process, the key and value of the content-related process are replaced with those of the style-related process to inject style while maintaining text control over the model. We also introduce query preservation to mitigate disruptions to the original content. Under such a design, we have achieved high-quality image-driven style transfer and text-driven stylization, delivering artist-level style transfer results while preserving the original image content. Moreover, we achieve image color editing during the style transfer process for the first time.
[74] Enhancing Vector Quantization with Distributional Matching: A Theoretical and Empirical Study
Xianghong Fang,Litao Guo,Hengchao Chen,Yuxuan Zhang,Xiaofan Xia,Dingjie Song,Yexin Liu,Hao Wang,Harry Yang,Yuan Yuan,Qiang Sun
Main category: cs.CV
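TL;DR: By aligning the feature distribution with the codebook distribution using the Wasserstein distance, the method addresses training instability and codebook collapse, achieving near 100% codebook utilization and significantly reducing quantization error.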
Details
Motivation: The success of autoregressive models depends heavily on the effectiveness of vector quantization, but existing vector quantization methods suffer from two key problems: training instability and codebook collapse. These problems stem mainly from a mismatch between the feature distribution and the codebook distribution, which leads to unrepresentative code vectors and substantial loss of data information during compression. Method: The Wasserstein distance is used to align the feature distribution with the codebook distribution, addressing training instability and codebook collapse. Result: The method achieves near 100% codebook utilization and significantly reduces quantization error. Both empirical and theoretical analyses validate its effectiveness. Conclusion: The proposed Wasserstein distance-based method effectively resolves training instability and codebook collapse in vector quantization and improves its performance. Abstract: The success of autoregressive models largely depends on the effectiveness of vector quantization, a technique that discretizes continuous features by mapping them to the nearest code vectors within a learnable codebook. Two critical issues in existing vector quantization methods are training instability and codebook collapse. Training instability arises from the gradient discrepancy introduced by the straight-through estimator, especially in the presence of significant quantization errors, while codebook collapse occurs when only a small subset of code vectors are utilized during training. A closer examination of these issues reveals that they are primarily driven by a mismatch between the distributions of the features and code vectors, leading to unrepresentative code vectors and significant data information loss during compression. To address this, we employ the Wasserstein distance to align these two distributions, achieving near 100% codebook utilization and significantly reducing the quantization error. Both empirical and theoretical analyses validate the effectiveness of the proposed approach.
[75] SynPo: Boosting Training-Free Few-Shot Medical Segmentation via High-Quality Negative Prompts
Yufei Liu,Haoke Xiao,Jiaxing Chai,Yongcun Zhang,Rong Wang,Zijie Meng,Zhiming Luo
Main category: cs.CV
TL;DR: The paper proposes SynPo, a training-free few-shot method based on LVMs to improve the quality of negative prompts for better performance on low-contrast medical image segmentation.
Details
Motivation: Existing training-free methods using LVMs for few-shot medical image segmentation do not effectively utilize negative prompts, resulting in poor performance on low-contrast images. Method: Propose SynPo which improves the quality of negative prompts by designing a Confidence Map Synergy Module combining DINOv2 and SAM. It selects point prompts using top-k pixels as positive points and Gaussian distribution with K-means clustering for negative points. Result: Extensive experiments show that SynPo performs comparably to state-of-the-art training-based few-shot methods in medical image segmentation. Conclusion: SynPo is an effective training-free few-shot method that leverages high-quality prompts to enhance medical image segmentation, particularly for low-contrast images. Abstract: The advent of Large Vision Models (LVMs) offers new opportunities for few-shot medical image segmentation. However, existing training-free methods based on LVMs fail to effectively utilize negative prompts, leading to poor performance on low-contrast medical images. To address this issue, we propose SynPo, a training-free few-shot method based on LVMs (e.g., SAM), with the core insight: improving the quality of negative prompts. To select point prompts in a more reliable confidence map, we design a novel Confidence Map Synergy Module by combining the strengths of DINOv2 and SAM. Based on the confidence map, we select the top-k pixels as the positive points set and choose the negative points set using a Gaussian distribution, followed by independent K-means clustering for both sets. Then, these selected points are leveraged as high-quality prompts for SAM to get the segmentation results. Extensive experiments demonstrate that SynPo achieves performance comparable to state-of-the-art training-based few-shot methods.
[76] Enhancing point cloud analysis via neighbor aggregation correction based on cross-stage structure correlation
Jiaqi Shi,Jin Xiao,Xiaoguang Hu,Boyang Song,Hao Jiang,Tianyou Chen,Baochang Zhang
Main category: cs.CV
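TL;DR: A new point cloud analysis module, PDSA, is proposed that uses correlation in high-dimensional space to correct feature distributions, improving computational efficiency and robustness. It delivers significant performance gains on semantic segmentation and classification tasks at a lower parameter cost.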
Details
Motivation: Current point cloud analysis methods suffer from irrelevant-point interference and feature hierarchy gaps when aggregating local structures, while enhancement methods such as direct geometric structure encoding incur high computational overhead and are sensitive to noise. Method: The Point Distribution Set Abstraction (PDSA) module is proposed, which uses correlation in high-dimensional space to correct the feature distribution during aggregation. Specifically, it distinguishes point correlation with a lightweight cross-stage structural descriptor, enhances structural homogeneity by reducing the variance of the neighbor feature matrix and increasing class separability, and introduces a key point mechanism to reduce computational overhead. Result: On semantic segmentation and classification tasks, the method's generalization was verified across different baseline models, achieving significant performance improvement at lower parameter cost. Ablation experiments and visualization results further demonstrate the effectiveness and rationality of the method. Conclusion: By improving feature aggregation, the PDSA module effectively increases the computational efficiency and robustness of point cloud analysis while reducing parameter requirements, giving it broad application prospects. Abstract: Point cloud analysis is the cornerstone of many downstream tasks, among which aggregating local structures is the basis for understanding point cloud data. While numerous works aggregate neighbors using three-dimensional relative coordinates, they suffer from irrelevant-point interference and feature hierarchy gaps due to the limitations of local coordinates. Although some works address this limitation by refining spatial description through explicit modeling of cross-stage structure, these enhancement methods based on direct geometric structure encoding have problems of high computational overhead and noise sensitivity. To overcome these problems, we propose the Point Distribution Set Abstraction module (PDSA) that utilizes the correlation in the high-dimensional space to correct the feature distribution during aggregation, which improves the computational efficiency and robustness. PDSA distinguishes the point correlation based on a lightweight cross-stage structural descriptor, and enhances structural homogeneity by reducing the variance of the neighbor feature matrix and increasing class separability through long-distance modeling. Additionally, we introduce a key point mechanism to optimize the computational overhead. Experimental results on semantic segmentation and classification tasks based on different baselines verify the generalization of the proposed method, which achieves significant performance improvement with lower parameter cost. The corresponding ablation and visualization results demonstrate the effectiveness and rationality of our method. 
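To make the idea of correcting neighbor aggregation with feature-space correlation more concrete, here is a small NumPy sketch. It is one possible reading of the general idea and not the PDSA module: it replaces the paper's cross-stage structural descriptor with plain cosine similarity between a point's feature and its neighbors' features, and all names and parameters (`k`, `temperature`) are assumptions for illustration.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest points (the point itself included) for each point."""
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    return np.argsort(d2, axis=1)[:, :k]

def correlation_weighted_aggregate(points, feats, k=4, temperature=0.1):
    """Aggregate neighbor features, down-weighting neighbors with dissimilar features.

    Spatial neighbors are found with kNN on coordinates; each neighbor is then
    weighted by a softmax over its cosine similarity to the center feature.
    """
    idx = knn_indices(points, k)                          # (N, k)
    neigh = feats[idx]                                    # (N, k, C)
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    corr = np.einsum("nc,nkc->nk", unit, unit[idx])       # cosine similarity
    w = np.exp((corr - corr.max(axis=1, keepdims=True)) / temperature)
    w /= w.sum(axis=1, keepdims=True)                     # softmax over neighbors
    return np.einsum("nk,nkc->nc", w, neigh)              # (N, C)
```

Compared with plain max or mean pooling over spatial neighbors, this weighting suppresses nearby points whose features are unrelated to the center point, which is the kind of irrelevant-point interference the abstract describes.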
The code and trained weights are available at: https://github.com/AGENT9717/PointDistribution
[77] Echo-DND: A dual noise diffusion model for robust and precise left ventricle segmentation in echocardiography
Abdur Rahman,Keerthiveena Balraj,Manojkumar Ramteke,Anurag Singh Rathore
Main category: cs.CV
TL;DR: Recent advancements in diffusion probabilistic models (DPMs) have led to Echo-DND, a dual-noise diffusion model for accurate left ventricle segmentation in echocardiograms. It uses Gaussian and Bernoulli noises, multi-scale fusion conditioning, and spatial coherence calibration. Evaluated on CAMUS and EchoNet-Dynamic datasets, it outperforms SOTA models with high Dice scores.
Details
Motivation: Accurate segmentation of the left ventricle in echocardiograms is essential for diagnosis and treatment but is complicated by noise, low contrast, and ambiguous boundaries in ultrasound images. Method: Echo-DND leverages a combination of Gaussian and Bernoulli noises within a dual-noise diffusion framework. It also incorporates a multi-scale fusion conditioning module and spatial coherence calibration to improve segmentation precision and maintain spatial integrity. Result: The proposed Echo-DND model achieves Dice scores of 0.962 and 0.939 on the CAMUS and EchoNet-Dynamic datasets respectively, outperforming existing state-of-the-art models. Conclusion: Echo-DND establishes a new standard in echocardiogram segmentation and shows potential for broader applicability in other medical imaging tasks, potentially improving diagnostic accuracy across various medical domains. Abstract: Recent advancements in diffusion probabilistic models (DPMs) have revolutionized image processing, demonstrating significant potential in medical applications. Accurate segmentation of the left ventricle (LV) in echocardiograms is crucial for diagnostic procedures and necessary treatments. However, ultrasound images are notoriously noisy with low contrast and ambiguous LV boundaries, thereby complicating the segmentation process. To address these challenges, this paper introduces Echo-DND, a novel dual-noise diffusion model specifically designed for this task. Echo-DND leverages a unique combination of Gaussian and Bernoulli noises. It also incorporates a multi-scale fusion conditioning module to improve segmentation precision. Furthermore, it utilizes spatial coherence calibration to maintain spatial integrity in segmentation masks. The model's performance was rigorously validated on the CAMUS and EchoNet-Dynamic datasets. Extensive evaluations demonstrate that the proposed framework outperforms existing SOTA models. 
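The two noise types mentioned above can be illustrated with their standard closed-form forward processes. This is a sketch under stated assumptions, not Echo-DND's implementation: how the model combines the Gaussian and Bernoulli branches is not described in this summary, and the function names and the use of a single cumulative schedule value `alpha_bar_t` for both processes are choices made for the example.

```python
import numpy as np

def gaussian_forward(x0, alpha_bar_t, rng):
    """Sample x_t ~ N(sqrt(a) * x0, (1 - a) * I), the usual Gaussian forward step."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def bernoulli_forward(mask0, alpha_bar_t, rng):
    """Sample a noisy binary mask with P(x_t = 1) = a * mask0 + (1 - a) / 2.

    At a = 1 the mask is unchanged; at a = 0 every pixel is a fair coin flip.
    """
    prob_one = alpha_bar_t * mask0 + (1.0 - alpha_bar_t) * 0.5
    return (rng.random(mask0.shape) < prob_one).astype(mask0.dtype)
```

Bernoulli noise keeps the sample binary at every step, which suits segmentation masks, whereas Gaussian noise produces continuous values.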
It achieves high Dice scores of 0.962 and 0.939 on these datasets, respectively. The proposed Echo-DND model establishes a new standard in echocardiogram segmentation, and its architecture holds promise for broader applicability in other medical imaging tasks, potentially improving diagnostic accuracy across various medical domains. Project page: https://abdur75648.github.io/Echo-DND
[78] ReSeDis: A Dataset for Referring-based Object Search across Large-Scale Image Collections
Ziling Huang,Yidan Zhang,Shin'ichi Satoh
Main category: cs.CV
TL;DR: The paper introduces Referring Search and Discovery (ReSeDis), a task that combines corpus-level retrieval with pixel-level grounding. It provides a benchmark, a task-specific metric, and a zero-shot baseline for future research.
Details
Motivation: Current techniques in visual search engines only address one side of the challenge - either locating every image containing an object or identifying the object's exact location within an image. There is a need for a unified approach to solve both problems simultaneously. Method: The proposed method, ReSeDis, unifies corpus-level retrieval with pixel-level grounding. Given a free-form description, it decides if the queried object appears in each image and identifies its location through bounding boxes or segmentation masks. Result: A benchmark was curated where each description uniquely maps to object instances across a large corpus. A task-specific metric was designed to jointly score retrieval recall and localization precision. A zero-shot baseline using a frozen vision-language model revealed significant potential for further research. Conclusion: ReSeDis offers a realistic, end-to-end testbed for developing robust and scalable multimodal search systems, marking an advancement in the field of visual search engines. Abstract: Large-scale visual search engines are expected to solve a dual problem at once: (i) locate every image that truly contains the object described by a sentence and (ii) identify the object's bounding box or exact pixels within each hit. Existing techniques address only one side of this challenge. Visual grounding yields tight boxes and masks but rests on the unrealistic assumption that the object is present in every test image, producing a flood of false alarms when applied to web-scale collections. Text-to-image retrieval excels at sifting through massive databases to rank relevant images, yet it stops at whole-image matches and offers no fine-grained localization. We introduce Referring Search and Discovery (ReSeDis), the first task that unifies corpus-level retrieval with pixel-level grounding. 
Given a free-form description, a ReSeDis model must decide whether the queried object appears in each image and, if so, where it is, returning bounding boxes or segmentation masks. To enable rigorous study, we curate a benchmark in which every description maps uniquely to object instances scattered across a large, diverse corpus, eliminating unintended matches. We further design a task-specific metric that jointly scores retrieval recall and localization precision. Finally, we provide a straightforward zero-shot baseline using a frozen vision-language model, revealing significant headroom for future study. ReSeDis offers a realistic, end-to-end testbed for building the next generation of robust and scalable multimodal search systems.
[79] Conquering the Retina: Bringing Visual in-Context Learning to OCT
Alessio Negrini,Simon Reiß
Main category: cs.CV
TL;DR: This paper investigates the training of generalist models for retinal optical coherence tomography using visual in-context learning (VICL) and proposes an evaluation protocol tailored to VICL in OCT.
Details
Motivation: The motivation is to overcome the limitations of specialized models which are restricted to predefined tasks, by developing generalist models that allow medical practitioners to define tasks on the fly without needing task-specific model development. Method: The method involves training generalist models for retinal optical coherence tomography using visual in-context learning (VICL), where models generalize across tasks based on a few examples provided at inference time. A broad evaluation protocol tailored to VICL in OCT is also proposed. Result: A state-of-the-art medical VICL approach was extensively evaluated on multiple retinal OCT datasets, establishing a baseline that highlights both the potential and current limitations of in-context learning for OCT. Conclusion: The authors conclude by openly releasing their code to foster further research and practical adoption of VICL for retinal OCT. Abstract: Recent advancements in medical image analysis have led to the development of highly specialized models tailored to specific clinical tasks. These models have demonstrated exceptional performance and remain a crucial research direction. Yet, their applicability is limited to predefined tasks, requiring expertise and extensive resources for development and adaptation. In contrast, generalist models offer a different form of utility: allowing medical practitioners to define tasks on the fly without the need for task-specific model development. In this work, we explore how to train generalist models for the domain of retinal optical coherence tomography using visual in-context learning (VICL), i.e., training models to generalize across tasks based on a few examples provided at inference time. To facilitate rigorous assessment, we propose a broad evaluation protocol tailored to VICL in OCT. 
We extensively evaluate a state-of-the-art medical VICL approach on multiple retinal OCT datasets, establishing a first baseline to highlight the potential and current limitations of in-context learning for OCT. To foster further research and practical adoption, we openly release our code.
[80] Privacy-Shielded Image Compression: Defending Against Exploitation from Vision-Language Pretrained Models
Xuelin Shen,Jiayin Xu,Kangsheng Yin,Wenhan Yang
Main category: cs.CV
TL;DR: This paper proposes Privacy-Shielded Image Compression (PSIC) to protect privacy by implementing defenses at the image compression stage, producing bitstreams with multiple decoding options that can prevent interpretation by vision-language pretrained models while preserving perceptual quality and original image compression functionality.
Details
Motivation: The improved semantic understanding of vision-language pretrained (VLP) models has made it increasingly difficult to protect publicly posted images from being exploited by search engines and other similar tools. Method: A flexible coding method termed Privacy-Shielded Image Compression (PSIC) is proposed. It includes a Conditional Latent Trigger Generation (CLTG) module to produce bias information based on customizable conditions and an Uncertainty-Aware Encryption-Oriented (UAEO) optimization function designed to leverage soft labels inferred from the target VLP model's uncertainty on the training data. An adaptive multi-objective optimization strategy is also incorporated to obtain improved encrypting performance and perceptual quality simultaneously within a unified training process. Result: Extensive experiments across multiple downstream tasks have demonstrated the effectiveness of the design. Conclusion: PSIC can produce bitstreams with multiple decoding options, preserve satisfactory perceptual quality while preventing interpretation by VLP models, retain original image compression functionality, and reconstruct the image that preserves its full semantic information with a customizable input condition. Abstract: The improved semantic understanding of vision-language pretrained (VLP) models has made it increasingly difficult to protect publicly posted images from being exploited by search engines and other similar tools. In this context, this paper seeks to protect users' privacy by implementing defenses at the image compression stage to prevent exploitation. Specifically, we propose a flexible coding method, termed Privacy-Shielded Image Compression (PSIC), that can produce bitstreams with multiple decoding options. By default, the bitstream is decoded to preserve satisfactory perceptual quality while preventing interpretation by VLP models. Our method also retains the original image compression functionality. 
With a customizable input condition, the proposed scheme can reconstruct the image that preserves its full semantic information. A Conditional Latent Trigger Generation (CLTG) module is proposed to produce bias information based on customizable conditions to guide the decoding process into different reconstructed versions, and an Uncertainty-Aware Encryption-Oriented (UAEO) optimization function is designed to leverage the soft labels inferred from the target VLP model's uncertainty on the training data. This paper further incorporates an adaptive multi-objective optimization strategy to obtain improved encrypting performance and perceptual quality simultaneously within a unified training process. The proposed scheme is plug-and-play and can be seamlessly integrated into most existing Learned Image Compression (LIC) models. Extensive experiments across multiple downstream tasks have demonstrated the effectiveness of our design.
[81] DM-FNet: Unified multimodal medical image fusion via diffusion process-trained encoder-decoder
Dan He,Weisheng Li,Guofen Wang,Yuping Huang,Shiqiang Liu
Main category: cs.CV
TL;DR: A two-stage diffusion model-based fusion network (DM-FNet) is proposed for multimodal medical image fusion, which integrates a diffusion process and UNet for detailed feature capture and cross-modal interaction, resulting in high-quality fused images with balanced brightness, color, contrast, and detail.
Details
Motivation: Existing MMIF methods have limited capacity to capture detailed features and insufficient cross-modal feature interaction, leading to suboptimal fused image quality. Method: The method involves a two-stage diffusion model-based fusion network (DM-FNet). In Stage I, a diffusion process trains UNet for image reconstruction, capturing detailed information through progressive denoising. In Stage II, noisy images are input into the fusion network to enhance feature recognition capability, with three key fusion modules integrated for adaptive processing of different modalities. Result: The experimental results demonstrate exceptional performance regarding objective evaluation metrics. The fused image preserves appropriate brightness, a comprehensive distribution of radioactive tracers, rich textures, and clear edges. Conclusion: The proposed DM-FNet achieves high-quality multimodal medical image fusion by effectively balancing brightness, color, contrast, and detail, thus enhancing fused image quality and information density. Abstract: Multimodal medical image fusion (MMIF) extracts the most meaningful information from multiple source images, enabling a more comprehensive and accurate diagnosis. Achieving high-quality fusion results requires a careful balance of brightness, color, contrast, and detail; this ensures that the fused images effectively display relevant anatomical structures and reflect the functional status of the tissues. However, existing MMIF methods have limited capacity to capture detailed features during conventional training and suffer from insufficient cross-modal feature interaction, leading to suboptimal fused image quality. To address these issues, this study proposes a two-stage diffusion model-based fusion network (DM-FNet) to achieve unified MMIF. In Stage I, a diffusion process trains UNet for image reconstruction. 
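For readers unfamiliar with how a diffusion process produces training inputs for a denoising UNet, the following minimal sketch shows the standard DDPM-style forward noising step on a flat list of pixel values. The abstract does not state which noise schedule or parameterisation DM-FNet uses, so the linear schedule, the default values, and every function name here are assumptions for illustration only.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed, commonly used schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta_t), usually written alpha-bar_t."""
    products, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        products.append(running)
    return products


def add_noise(pixels, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [
        math.sqrt(a) * x + math.sqrt(1.0 - a) * e
        for x, e in zip(pixels, noise)
    ]
    return noisy, noise
```

A denoising network is then typically trained to recover either the clean pixels or the sampled noise from the noisy input at a randomly chosen step `t`.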
UNet captures detailed information through progressive denoising and represents multilevel data, providing a rich set of feature representations for the subsequent fusion network. In Stage II, noisy images at various steps are input into the fusion network to enhance the model's feature recognition capability. Three key fusion modules are also integrated to process medical images from different modalities adaptively. Ultimately, the robust network structure and a hybrid loss function are integrated to harmonize the fused image's brightness, color, contrast, and detail, enhancing its quality and information density. The experimental results across various medical image types demonstrate that the proposed method performs exceptionally well regarding objective evaluation metrics. The fused image preserves appropriate brightness, a comprehensive distribution of radioactive tracers, rich textures, and clear edges. The code is available at https://github.com/HeDan-11/DM-FNet.
[82] video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models
Changli Tang,Yixuan Li,Yudong Yang,Jimin Zhuang,Guangzhi Sun,Wei Li,Zejun Ma,Chao Zhang
Main category: cs.CV
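TL;DR: video-SALMONN 2 is an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) that improves captioning of video (with paired audio) through directed preference optimisation (DPO). The paper proposes new evaluation metrics and a multi-round DPO method; experiments show significantly higher captioning accuracy, with the error rate reduced by 28%. The final 7-billion-parameter model surpasses GPT-4o and Gemini-1.5-Pro on video captioning and performs strongly on widely used video question-answering benchmarks.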
Details
Motivation: Videos contain a wealth of information, and generating detailed and accurate natural-language descriptions is a key aspect of video understanding. Method: The paper proposes video-SALMONN 2, which combines low-rank adaptation (LoRA) with directed preference optimisation (DPO). It introduces new metrics to evaluate the completeness and accuracy of video descriptions and adopts a multi-round DPO (MrDPO) approach to improve training, which periodically updates the DPO reference model, merges and re-initialises the LoRA module, and adds guidance from ground-truth video captions to stabilise training. Result: Experiments show that MrDPO significantly improves video-SALMONN 2's captioning accuracy, reducing the captioning error rate by 28%. The final model surpasses leading models such as GPT-4o and Gemini-1.5-Pro on video captioning and remains highly competitive on widely used video question-answering benchmarks among models of similar size. Conclusion: video-SALMONN 2 improves the accuracy and completeness of video descriptions through LoRA and DPO, performs well across multiple tasks, and its code is open-sourced. Abstract: Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimisation (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using DPO. To further improve training, we propose a novel multi-round DPO (MrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initialising the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilise the process. Experimental results show that MrDPO significantly enhances video-SALMONN 2's captioning accuracy, reducing the captioning error rates by 28%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining highly competitive performance to the state-of-the-art on widely used video question-answering benchmarks among models of similar size. Codes are available at https://github.com/bytedance/video-SALMONN-2.
[83] Convolutional Feature Enhancement and Attention Fusion BiFPN for Ship Detection in SAR Images
Liangjie Meng,Danxia Li,Jinrong He,Lili Ma,Zhixin Li
Main category: cs.CV
TL;DR: The paper proposes C-AFBiFPN, a novel framework for SAR ship detection that enhances feature representation and fusion to address challenges like scale variations and complex backgrounds.
Details
Motivation: SAR ship detection encounters difficulties such as significant scale variations among ships, small offshore vessels mixed with noise, and complex backgrounds for large nearshore ships. Method: The method involves constructing a Convolutional Feature Enhancement (CFE) module and integrating BiFormer attention within the fusion strategy of BiFPN to create the AFBiFPN network. Result: Experimental results on the SAR Ship Detection Dataset (SSDD) show improved detection accuracy for small targets, robustness against occlusions, and adaptability to multi-scale features. Conclusion: C-AFBiFPN effectively addresses the challenges in SAR ship detection, enhancing performance in terms of accuracy, robustness, and adaptability. Abstract: Synthetic Aperture Radar (SAR) enables submeter-resolution imaging and all-weather monitoring via active microwave and advanced signal processing. Currently, SAR has found extensive applications in critical maritime domains such as ship detection. However, SAR ship detection faces several challenges, including significant scale variations among ships, the presence of small offshore vessels mixed with noise, and complex backgrounds for large nearshore ships. To address these issues, this paper proposes a novel feature enhancement and fusion framework named C-AFBiFPN. C-AFBiFPN constructs a Convolutional Feature Enhancement (CFE) module following the backbone network, aiming to enrich feature representation and enhance the ability to capture and represent local details and contextual information. Furthermore, C-AFBiFPN innovatively integrates BiFormer attention within the fusion strategy of BiFPN, creating the AFBiFPN network. AFBiFPN improves the global modeling capability of cross-scale feature fusion and can adaptively focus on critical feature regions. 
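To make the cross-scale fusion step more concrete, the sketch below shows the fast normalized weighted fusion used by the original BiFPN design, on which AFBiFPN builds. It does not include the BiFormer attention the paper adds, and it treats each feature map as a flat list of values that has already been resized to a common shape; the function name and the `eps` default are assumptions for the example.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature vectors with learnable, non-negative weights.

    Each weight is clipped at zero (a ReLU) and divided by the sum of all
    clipped weights plus eps, so the fused output is approximately a convex
    combination of the inputs.
    """
    clipped = [max(0.0, w) for w in weights]
    total = sum(clipped) + eps
    length = len(features[0])
    return [
        sum(w * feature[i] for w, feature in zip(clipped, features)) / total
        for i in range(length)
    ]
```

In a trained network the weights would be learned per fusion node, letting the model decide how much each resolution level contributes; an attention module like the one described above would further reweight features by spatial region.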
The experimental results on SAR Ship Detection Dataset (SSDD) indicate that the proposed approach substantially enhances detection accuracy for small targets, robustness against occlusions, and adaptability to multi-scale features.
[84] RA-NeRF: Robust Neural Radiance Field Reconstruction with Accurate Camera Pose Estimation under Complex Trajectories
Qingsong Yan,Qiang Wang,Kaiyong Zhao,Jie Chen,Bo Li,Xiaowen Chu,Fei Deng
Main category: cs.CV
TL;DR: RA-NeRF is a new method that can predict accurate camera poses even with complex trajectories by using photometric consistency, flow-driven pose regulation, and an implicit pose filter.
Details
Motivation: Existing methods for 3D reconstruction and SLAM tasks like NeRF and 3DGS depend on accurate camera pose priors but fail to achieve satisfactory accuracy, especially with complex camera trajectories. Method: RA-NeRF uses an incremental pipeline to reconstruct scenes with NeRF based on photometric consistency, incorporates flow-driven pose regulation for robustness in initialization and localization, and employs an implicit pose filter to capture camera movement patterns and reduce noise in pose estimation. Result: Extensive experiments on Tanks&Temple and NeRFBuster datasets show RA-NeRF achieves state-of-the-art results in camera pose estimation and visual quality, proving its effectiveness and robustness under complex pose trajectories. Conclusion: RA-NeRF successfully predicts highly accurate camera poses even with complex camera trajectories through the use of photometric consistency, flow-driven pose regulation, and an implicit pose filter. Abstract: Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have emerged as powerful tools for 3D reconstruction and SLAM tasks. However, their performance depends heavily on accurate camera pose priors. Existing approaches attempt to address this issue by introducing external constraints but fall short of achieving satisfactory accuracy, particularly when camera trajectories are complex. In this paper, we propose a novel method, RA-NeRF, capable of predicting highly accurate camera poses even with complex camera trajectories. Following the incremental pipeline, RA-NeRF reconstructs the scene using NeRF with photometric consistency and incorporates flow-driven pose regulation to enhance robustness during initialization and localization. Additionally, RA-NeRF employs an implicit pose filter to capture the camera movement pattern and eliminate the noise for pose estimation. 
To validate our method, we conduct extensive experiments on the Tanks&Temple dataset for standard evaluation, as well as the NeRFBuster dataset, which presents challenging camera pose trajectories. On both datasets, RA-NeRF achieves state-of-the-art results in both camera pose estimation and visual quality, demonstrating its effectiveness and robustness in scene reconstruction under complex pose trajectories.
[85] Retrospective Memory for Camouflaged Object Detection
Chenxi Zhang,Jiayun Wu,Qing Zhang,Yazhe Zhai,Youwei Pang
Main category: cs.CV
TL;DR: The paper introduces RetroMem, a recall-augmented architecture for camouflaged object detection (COD). It uses a two-stage training paradigm with dense multi-scale adapter (DMA) and dynamic memory mechanism (DMM) to integrate historical knowledge, improving the model's understanding of camouflage scenes.
Details
Motivation: Existing COD methods based on static visual representation modeling lack explicit mechanisms for acquiring historical context, limiting their effectiveness in challenging scenes. Method: RetroMem is proposed with a two-stage training paradigm: 1) Learning stage with DMA to capture multi-scale visual information; 2) Recall stage with DMM and IPR to reconstruct inference of camouflage patterns using latent relationships between learned knowledge and current sample context. Result: Extensive experiments on widely used datasets show that RetroMem significantly outperforms existing state-of-the-art methods in COD. Conclusion: RetroMem, by integrating historical knowledge through its unique architecture and mechanisms, enhances the model's ability to handle challenging camouflage scenes. Abstract: Camouflaged object detection (COD) primarily focuses on learning subtle yet discriminative representations from complex scenes. Existing methods predominantly follow the parametric feedforward architecture based on static visual representation modeling. However, they lack explicit mechanisms for acquiring historical context, limiting their adaptation and effectiveness in handling challenging camouflage scenes. In this paper, we propose a recall-augmented COD architecture, namely RetroMem, which dynamically modulates camouflage pattern perception and inference by integrating relevant historical knowledge into the process. Specifically, RetroMem employs a two-stage training paradigm consisting of a learning stage and a recall stage to construct, update, and utilize memory representations effectively. During the learning stage, we design a dense multi-scale adapter (DMA) to improve the pretrained encoder's capability to capture rich multi-scale visual information with very few trainable parameters, thereby providing foundational inferences. In the recall stage, we propose a dynamic memory mechanism (DMM) and an inference pattern reconstruction (IPR). 
These components fully leverage the latent relationships between learned knowledge and current sample context to reconstruct the inference of camouflage patterns, thereby significantly improving the model's understanding of camouflage scenes. Extensive experiments on several widely used datasets demonstrate that our RetroMem significantly outperforms existing state-of-the-art methods.
[86] Domain Adaptation for Image Classification of Defects in Semiconductor Manufacturing
Adrian Poniatowski,Natalie Gentner,Manuel Barusco,Davide Dalle Pezze,Samuele Salti,Gian Antonio Susto
Main category: cs.CV
TL;DR: In the semiconductor field, Domain Adaptation (DA) techniques are crucial for defect classification. This study explores DA in semi-supervised and unsupervised contexts, proposing DBACS, a CycleGAN-inspired model with additional loss terms to improve performance. Real-world Electron Microscope images validate the approach.
Details
Motivation: To address the challenge of adapting models effectively across different but related domains in the semiconductor field, reducing the need for manual re-labeling or re-training while improving robustness and scalability. Method: The paper proposes DBACS, a CycleGAN-inspired model enhanced with additional loss terms, to advance DA techniques in semi-supervised and unsupervised settings within the semiconductor industry. Result: DBACS was studied and validated on real-world Electron Microscope images, demonstrating its effectiveness in both unsupervised and semi-supervised settings. Conclusion: Domain Adaptation techniques, particularly the proposed DBACS approach, show promise in advancing defect classification in the semiconductor field, enhancing efficiency and reducing costs. Abstract: In the semiconductor sector, due to high demand but also strong and increasing competition, time to market and quality are key factors in securing significant market share in various application areas. Thanks to the success of deep learning methods in recent years in the computer vision domain, Industry 4.0 and 5.0 applications, such as defect classification, have achieved remarkable success. In particular, Domain Adaptation (DA) has proven highly effective since it focuses on using the knowledge learned on a (source) domain to adapt and perform effectively on a different but related (target) domain. By improving robustness and scalability, DA minimizes the need for extensive manual re-labeling or re-training of models. This not only reduces computational and resource costs but also allows human experts to focus on high-value tasks. Therefore, we tested the efficacy of DA techniques in semi-supervised and unsupervised settings within the context of the semiconductor field. Moreover, we propose the DBACS approach, a CycleGAN-inspired model enhanced with additional loss terms to improve performance. 
All the approaches are studied and validated on real-world Electron Microscope images considering the unsupervised and semi-supervised settings, proving the usefulness of our method in advancing DA techniques for the semiconductor field.
[87] MSNeRV: Neural Video Representation with Multi-Scale Feature Fusion
Jun Zhu,Xinfeng Zhang,Lv Tang,JunHao Jiang
Main category: cs.CV
TL;DR: The paper proposes MSNeRV, a multi-scale feature fusion framework for neural video representation that surpasses VTM-23.7 in dynamic scenarios for compression efficiency.
Details
Motivation: Existing INR-based methods have difficulty representing detail-intensive and fast-changing video content due to underutilization of internal network features and lack of video-specific considerations in network design. Method: The method involves enhancing temporal consistency using temporal windows, dividing videos into Groups of Pictures (GoPs) with a GoP-level grid for background representation, designing a multi-scale spatial decoder with a scale-adaptive loss function, and introducing a multi-scale feature block for better feature extraction. Result: MSNeRV shows superior representation capability among INR-based approaches and surpasses VTM-23.7 (Random Access) in dynamic scenarios in terms of compression efficiency. Conclusion: MSNeRV is a promising approach for neural video representation and compression, offering improved performance in dynamic scenarios. Abstract: Implicit Neural representations (INRs) have emerged as a promising approach for video compression, and have achieved comparable performance to the state-of-the-art codecs such as H.266/VVC. However, existing INR-based methods struggle to effectively represent detail-intensive and fast-changing video content. This limitation mainly stems from the underutilization of internal network features and the absence of video-specific considerations in network design. To address these challenges, we propose a multi-scale feature fusion framework, MSNeRV, for neural video representation. In the encoding stage, we enhance temporal consistency by employing temporal windows, and divide the video into multiple Groups of Pictures (GoPs), where a GoP-level grid is used for background representation. Additionally, we design a multi-scale spatial decoder with a scale-adaptive loss function to integrate multi-resolution and multi-frequency information. To further improve feature extraction, we introduce a multi-scale feature block that fully leverages hidden features. 
We evaluate MSNeRV on HEVC ClassB and UVG datasets for video representation and compression. Experimental results demonstrate that our model exhibits superior representation capability among INR-based approaches and surpasses VTM-23.7 (Random Access) in dynamic scenarios in terms of compression efficiency.
[88] BCRNet: Enhancing Landmark Detection in Laparoscopic Liver Surgery via Bezier Curve Refinement
Qian Li,Feng Liu,Shuojue Yang,Daiyun Shen,Yueming Jin
Main category: cs.CV
TL;DR: This paper presents BCRNet, a novel framework for improving landmark detection in laparoscopic liver surgery by using Bezier curve refinement strategy. It includes MFE module, ACPI and HCR mechanism. Evaluations show it outperforms current methods.
Details
Motivation: Accurate identification of anatomical structures in laparoscopic liver surgery is challenging. AR systems integrating MRI/CT with laparoscopic images need better ways to detect curvilinear anatomical landmarks. Method: Propose BCRNet which starts with Multi-modal Feature Extraction (MFE) module, then Adaptive Curve Proposal Initialization (ACPI) generates Bezier curves and confidence scores, and Hierarchical Curve Refinement (HCR) enhances proposals iteratively. Result: BCRNet outperforms state-of-the-art methods on L3D and P2ILF datasets, achieving significant performance improvements. Conclusion: BCRNet significantly enhances landmark detection in laparoscopic liver surgery via the Bezier curve refinement strategy. Abstract: Laparoscopic liver surgery, while minimally invasive, poses significant challenges in accurately identifying critical anatomical structures. Augmented reality (AR) systems, integrating MRI/CT with laparoscopic images based on 2D-3D registration, offer a promising solution for enhancing surgical navigation. A vital aspect of the registration progress is the precise detection of curvilinear anatomical landmarks in laparoscopic images. In this paper, we propose BCRNet (Bezier Curve Refinement Net), a novel framework that significantly enhances landmark detection in laparoscopic liver surgery primarily via the Bezier curve refinement strategy. The framework starts with a Multi-modal Feature Extraction (MFE) module designed to robustly capture semantic features. Then we propose Adaptive Curve Proposal Initialization (ACPI) to generate pixel-aligned Bezier curves and confidence scores for reliable initial proposals. Additionally, we design the Hierarchical Curve Refinement (HCR) mechanism to enhance these proposals iteratively through a multi-stage process, capturing fine-grained contextual details from multi-scale pixel-level features for precise Bezier curve adjustment. 
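Since the framework represents landmarks as Bezier curves, it may help to see how a handful of control points defines a full curve. The sketch below evaluates a Bezier curve of any degree with Bernstein polynomials; it illustrates the curve parameterisation only, not BCRNet's proposal or refinement modules, and the function names and sample count are assumptions.

```python
from math import comb


def bezier_point(control_points, t):
    """Evaluate a 2D Bezier curve at parameter t in [0, 1]."""
    n = len(control_points) - 1
    x = y = 0.0
    for i, (px, py) in enumerate(control_points):
        # Bernstein basis polynomial B_{i,n}(t)
        basis = comb(n, i) * (t ** i) * ((1 - t) ** (n - i))
        x += basis * px
        y += basis * py
    return (x, y)


def sample_curve(control_points, num_samples=5):
    """Return num_samples (>= 2) points evenly spaced in t along the curve."""
    return [
        bezier_point(control_points, k / (num_samples - 1))
        for k in range(num_samples)
    ]
```

Refining a curve proposal then amounts to adjusting a small set of control points rather than every pixel along the landmark, which is one reason this representation suits thin curvilinear structures.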
Extensive evaluations on the L3D and P2ILF datasets demonstrate that BCRNet outperforms state-of-the-art methods, achieving significant performance improvements. Code will be available.
[89] AI-driven visual monitoring of industrial assembly tasks
Mattia Nardon,Stefano Messelodi,Antonio Granata,Fabio Poiesi,Alberto Danese,Davide Boscaini
Main category: cs.CV
TL;DR: Visual monitoring of industrial assembly tasks is critical. ViMAT, a novel AI-driven system for real-time visual monitoring of assembly tasks without rigid workspace setups or visual markers, combines a perception module with a reasoning module to infer the most likely action being performed. It was validated on two assembly tasks.
Details
Motivation: Visual monitoring of industrial assembly tasks is important for preventing equipment damage and ensuring worker safety. Current solutions typically require rigid workspace setups or visual markers. Method: ViMAT combines a perception module that extracts visual observations from multi-view video streams with a reasoning module that infers the most likely action being performed based on the observed assembly state and prior task knowledge. Result: ViMAT was validated on two assembly tasks involving the replacement of LEGO components and the reconfiguration of hydraulic press molds, demonstrating its effectiveness through quantitative and qualitative analysis in challenging real-world scenarios characterized by partial and uncertain visual observations. Conclusion: ViMAT is an effective AI-driven system for real-time visual monitoring of assembly tasks without the need for rigid workspace setups or visual markers. Abstract: Visual monitoring of industrial assembly tasks is critical for preventing equipment damage due to procedural errors and ensuring worker safety. Although commercial solutions exist, they typically require rigid workspace setups or the application of visual markers to simplify the problem. We introduce ViMAT, a novel AI-driven system for real-time visual monitoring of assembly tasks that operates without these constraints. ViMAT combines a perception module that extracts visual observations from multi-view video streams with a reasoning module that infers the most likely action being performed based on the observed assembly state and prior task knowledge. We validate ViMAT on two assembly tasks, involving the replacement of LEGO components and the reconfiguration of hydraulic press molds, demonstrating its effectiveness through quantitative and qualitative analysis in challenging real-world scenarios characterized by partial and uncertain visual observations. 
Project page: https://tev-fbk.github.io/ViMAT
[90] MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering
Xinqi Fan,Jingting Li,John See,Moi Hoon Yap,Wen-Huang Cheng,Xiaobai Li,Xiaopeng Hong,Su-Jing Wang,Adrian K. Davision
Main category: cs.CV
TL;DR: Facial micro-expression (ME) research is advancing with new tasks like ME spot-then-recognize (ME-STR) and ME visual question answering (ME-VQA), utilizing multimodal large models to improve analysis.
Details
Motivation: Current methods for facial micro-expression spotting and recognition are suboptimal, especially in long-duration videos. The emergence of powerful multimodal models provides new opportunities to enhance ME analysis. Method: The ME grand challenge 2025 introduces two tasks: ME-STR, which combines spotting and recognition in a sequential pipeline, and ME-VQA, which uses visual question answering to understand MEs through multimodal models. Result: Algorithms will be tested on a specific dataset and ranked on a leaderboard to evaluate their performance in the new tasks. Conclusion: The integration of multimodal models in ME analysis offers promising directions for more effective and comprehensive understanding of facial micro-expressions. Abstract: Facial micro-expressions (MEs) are involuntary movements of the face that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression, typically found in a high-stakes environment. In recent years, substantial advancements have been made in the areas of ME recognition, spotting, and generation. However, conventional approaches that treat spotting and recognition as separate tasks are suboptimal, particularly for analyzing long-duration videos in realistic settings. Concurrently, the emergence of multimodal large language models (MLLMs) and large vision-language models (LVLMs) offers promising new avenues for enhancing ME analysis through their powerful multimodal reasoning capabilities. The ME grand challenge (MEGC) 2025 introduces two tasks that reflect these evolving research directions: (1) ME spot-then-recognize (ME-STR), which integrates ME spotting and subsequent recognition in a unified sequential pipeline; and (2) ME visual question answering (ME-VQA), which explores ME understanding through visual question answering, leveraging MLLMs or LVLMs to address diverse question types related to MEs. 
All participating algorithms are required to run on this test set and submit their results on a leaderboard. More details are available at https://megc2025.github.io.
[91] MapFM: Foundation Model-Driven HD Mapping with Multi-Task Contextual Learning
Leonid Ivanov,Vasily Yuryev,Dmitry Yudin
Main category: cs.CV
TL;DR: An enhanced End-to-End model MapFM is introduced for online vectorized HD map generation in autonomous driving. It incorporates a foundation model for encoding camera images and integrates auxiliary prediction heads for semantic segmentation, leading to higher accuracy and improved quality of the predicted maps.
Details
Motivation: High-definition (HD) maps and semantic maps in bird's-eye view (BEV) are crucial for accurate localization, planning, and decision-making in autonomous driving. The motivation of this paper is to develop an advanced model that can generate these maps efficiently and accurately. Method: The method involves creating an enhanced End-to-End model named MapFM. This model boosts feature representation quality by using a powerful foundation model for encoding camera images. Additionally, it integrates auxiliary prediction heads for semantic segmentation in the BEV representation to enrich the model's understanding of the environment and improve prediction quality. Result: The multi-task learning approach provides richer contextual supervision, resulting in a more comprehensive scene representation. This leads to higher accuracy and improved quality of the predicted vectorized HD maps. Conclusion: MapFM successfully generates high-quality vectorized HD maps for autonomous driving by leveraging advanced feature representation and multi-task learning strategies. Abstract: In autonomous driving, high-definition (HD) maps and semantic maps in bird's-eye view (BEV) are essential for accurate localization, planning, and decision-making. This paper introduces an enhanced End-to-End model named MapFM for online vectorized HD map generation. We significantly boost feature representation quality by incorporating a powerful foundation model for encoding camera images. To further enrich the model's understanding of the environment and improve prediction quality, we integrate auxiliary prediction heads for semantic segmentation in the BEV representation. This multi-task learning approach provides richer contextual supervision, leading to a more comprehensive scene representation and ultimately resulting in higher accuracy and improved quality of the predicted vectorized HD maps.
The source code is available at https://github.com/LIvanoff/MapFM.
[92] OpenPath: Open-Set Active Learning for Pathology Image Classification via Pre-trained Vision-Language Models
Lanfeng Zhong,Xin Liao,Shichuan Zhang,Shaoting Zhang,Guotai Wang
Main category: cs.CV
TL;DR: This paper proposes OpenPath, an open-set active learning approach for pathology image classification that leverages a pre-trained Vision-Language Model (VLM) to enhance the efficiency and effectiveness of sample selection.
Details
Motivation: Training high-performance models for pathology image classification typically requires large-scale annotated datasets which are costly. Active Learning (AL) can reduce labeling efforts but existing methods have limitations in handling Out-Of-Distribution (OOD) data and often start with random selection leading to inefficiencies. Method: OpenPath uses task-specific prompts in the first query round combining target and non-target class information to select In-Distribution samples from unlabeled data. In subsequent rounds, it employs Diverse Informative ID Sampling (DIS), including Prototype-based ID candidate Selection (PIS) and Entropy-Guided Stochastic Sampling (EGSS), ensuring purity and informativeness while avoiding OOD samples. Result: Experiments on two public pathology image datasets demonstrate that OpenPath significantly improves model performance due to its high purity of selected samples, outperforming several state-of-the-art open-set AL methods. Conclusion: OpenPath is a novel approach that addresses the challenges of traditional AL methods in pathology image classification by effectively handling OOD data and reducing labeling costs. Abstract: Pathology image classification plays a crucial role in accurate medical diagnosis and treatment planning. Training high-performance models for this task typically requires large-scale annotated datasets, which are both expensive and time-consuming to acquire. Active Learning (AL) offers a solution by iteratively selecting the most informative samples for annotation, thereby reducing the labeling effort. However, most AL methods are designed under the assumption of a closed-set scenario, where all the unannotated images belong to target classes. In real-world clinical environments, the unlabeled pool often contains a substantial amount of Out-Of-Distribution (OOD) data, leading to low efficiency of annotation in traditional AL methods. 
Furthermore, most existing AL methods start with random selection in the first query round, leading to a significant waste of labeling costs in open-set scenarios. To address these challenges, we propose OpenPath, a novel open-set active learning approach for pathological image classification leveraging a pre-trained Vision-Language Model (VLM). In the first query, we propose task-specific prompts that combine target and relevant non-target class prompts to effectively select In-Distribution (ID) and informative samples from the unlabeled pool. In subsequent queries, Diverse Informative ID Sampling (DIS) that includes Prototype-based ID candidate Selection (PIS) and Entropy-Guided Stochastic Sampling (EGSS) is proposed to ensure both purity and informativeness in a query, avoiding the selection of OOD samples. Experiments on two public pathology image datasets show that OpenPath significantly enhances the model's performance due to its high purity of selected samples, and outperforms several state-of-the-art open-set AL methods. The code is available at https://github.com/HiLab-git/OpenPath.
[93] Open-World Object Counting in Videos
Niki Amini-Naieni,Andrew Zisserman
Main category: cs.CV
TL;DR: An open-world object counting task in videos is introduced, with a model CountVid that leverages an image-based counting model and a promptable video segmentation and tracking model. A new dataset VideoCount is also introduced for evaluation.
Details
Motivation: The motivation of this paper is to address the challenge of accurately counting unique instances of target objects in videos, especially in crowded scenes with occlusions and similar objects. Method: The method involves introducing a model called CountVid which uses an image-based counting model and a promptable video segmentation and tracking model to enable automated, open-world object counting across video frames. Result: The results show that CountVid provides accurate object counts and significantly outperforms strong baselines when evaluated using the new VideoCount dataset. Conclusion: In conclusion, the paper introduces a new task of open-world object counting in videos, a model CountVid for this task, and a new dataset VideoCount for evaluation. Abstract: We introduce a new task of open-world object counting in videos: given a text description, or an image example, that specifies the target object, the objective is to enumerate all the unique instances of the target objects in the video. This task is especially challenging in crowded scenes with occlusions and similar objects, where avoiding double counting and identifying reappearances is crucial. To this end, we make the following contributions: we introduce a model, CountVid, for this task. It leverages an image-based counting model, and a promptable video segmentation and tracking model to enable automated, open-world object counting across video frames. To evaluate its performance, we introduce VideoCount, a new dataset for our novel task built from the TAO and MOT20 tracking datasets, as well as from videos of penguins and metal alloy crystallization captured by x-rays. Using this dataset, we demonstrate that CountVid provides accurate object counts, and significantly outperforms strong baselines. 
The VideoCount dataset, the CountVid model, and all the code are available at https://github.com/niki-amini-naieni/CountVid/.
[94] Unsupervised Pelage Pattern Unwrapping for Animal Re-identification
Aleksandr Algasov,Ekaterina Nepovinnykh,Fedor Zolotarev,Tuomas Eerola,Heikki Kälviäinen,Pavel Zemčík,Charles V. Stewart
Main category: cs.CV
TL;DR: The paper proposes a geometry-aware texture mapping approach for re-identifying animals with deformable fur patterns, demonstrating improved accuracy.
Details
Motivation: Existing re-identification methods struggle with the deformable nature of animal fur or skin patterns due to geometric distortions from body movement and posture changes. Method: A geometry-aware texture mapping approach that unwraps pelage patterns into a canonical UV space using surface normal estimation to preserve geometric consistency between 3D surface and 2D texture space. Result: Experiments on seal and leopard datasets show up to a 5.4% improvement in re-identification accuracy. Conclusion: The proposed framework improves re-identification accuracy across diverse poses and viewing angles without requiring ground truth UV annotations and can be trained in a self-supervised manner. Abstract: Existing individual re-identification methods often struggle with the deformable nature of animal fur or skin patterns which undergo geometric distortions due to body movement and posture changes. In this paper, we propose a geometry-aware texture mapping approach that unwarps pelage patterns, the unique markings found on an animal's skin or fur, into a canonical UV space, enabling more robust feature matching. Our method uses surface normal estimation to guide the unwrapping process while preserving the geometric consistency between the 3D surface and the 2D texture space. We focus on two challenging species: Saimaa ringed seals (Pusa hispida saimensis) and leopards (Panthera pardus). Both species have distinctive yet highly deformable fur patterns. By integrating our pattern-preserving UV mapping with existing re-identification techniques, we demonstrate improved accuracy across diverse poses and viewing angles. Our framework does not require ground truth UV annotations and can be trained in a self-supervised manner. Experiments on seal and leopard datasets show up to a 5.4% improvement in re-identification accuracy.
[95] When Model Knowledge meets Diffusion Model: Diffusion-assisted Data-free Image Synthesis with Alignment of Domain and Class
Yujin Kim,Hyunsoo Kim,Hyunwoo J. Kim,Suhyun Kim
Main category: cs.CV
TL;DR: The paper proposes DDIS, a Diffusion-assisted Data-free Image Synthesis method that improves synthetic image quality by leveraging a text-to-image diffusion model as an image prior and introduces Domain Alignment Guidance (DAG) and Class Alignment Token (CAT) for better alignment with the training data distribution.
Details
Motivation: Existing DFIS methods produce samples that deviate from the training data distribution due to the lack of prior knowledge about natural images. Method: DDIS leverages a text-to-image diffusion model as a powerful image prior, and introduces Domain Alignment Guidance (DAG) and Class Alignment Token (CAT) embedding to align the synthetic data domain with the training data domain during the diffusion sampling process. Result: Experiments on PACS and ImageNet demonstrate that DDIS outperforms prior DFIS methods by generating samples that better reflect the training data distribution, achieving SOTA performance in data-free applications. Conclusion: DDIS is the first Diffusion-assisted Data-free Image Synthesis method that successfully generates images accurately aligning with the training data distribution. Abstract: Open-source pre-trained models hold great potential for diverse applications, but their utility declines when their training data is unavailable. Data-Free Image Synthesis (DFIS) aims to generate images that approximate the learned data distribution of a pre-trained model without accessing the original data. However, existing DFIS methods produce samples that deviate from the training data distribution due to the lack of prior knowledge about natural images. To overcome this limitation, we propose DDIS, the first Diffusion-assisted Data-free Image Synthesis method that leverages a text-to-image diffusion model as a powerful image prior, improving synthetic image quality. DDIS extracts knowledge about the learned distribution from the given model and uses it to guide the diffusion model, enabling the generation of images that accurately align with the training data distribution. To achieve this, we introduce Domain Alignment Guidance (DAG) that aligns the synthetic data domain with the training data domain during the diffusion sampling process.
Furthermore, we optimize a single Class Alignment Token (CAT) embedding to effectively capture class-specific attributes in the training dataset. Experiments on PACS and ImageNet demonstrate that DDIS outperforms prior DFIS methods by generating samples that better reflect the training data distribution, achieving SOTA performance in data-free applications.
[96] NERO: Explainable Out-of-Distribution Detection with Neuron-level Relevance
Anju Chhetri,Jari Korhonen,Prashnna Gyawali,Binod Bhattarai
Main category: cs.CV
TL;DR: A new OOD scoring mechanism named NERO is proposed to enhance the reliability of deep learning models in medical imaging by improving OOD detection.
Details
Motivation: In medical imaging, it's crucial to identify out-of-distribution (OOD) samples which can indicate potential anomalies. Current OOD detection methods may not fully capture OOD diversity. Method: NERO leverages neuron-level relevance at the feature layer, clusters this relevance for each ID class to form centroids, and introduces a relevance distance metric to measure deviation from these centroids. It also incorporates scaled relevance in the bias term and combines feature norms. Result: NERO improves performance over state-of-the-art OOD detection methods when validated across multiple deep learning architectures on gastrointestinal imaging benchmarks Kvasir and GastroVision. Conclusion: NERO enhances OOD separability and enables explainable OOD detection, contributing to more reliable deep learning models in medical imaging. Abstract: Ensuring reliability is paramount in deep learning, particularly within the domain of medical imaging, where diagnostic decisions often hinge on model outputs. The capacity to separate out-of-distribution (OOD) samples has proven to be a valuable indicator of a model's reliability in research. In medical imaging, this is especially critical, as identifying OOD inputs can help flag potential anomalies that might otherwise go undetected. While many OOD detection methods rely on feature or logit space representations, recent works suggest these approaches may not fully capture OOD diversity. To address this, we propose a novel OOD scoring mechanism, called NERO, that leverages neuron-level relevance at the feature layer. Specifically, we cluster neuron-level relevance for each in-distribution (ID) class to form representative centroids and introduce a relevance distance metric to quantify a new sample's deviation from these centroids, enhancing OOD separability. Additionally, we refine performance by incorporating scaled relevance in the bias term and combining feature norms. 
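As a rough illustration of the centroid-and-distance idea described above (not the authors' implementation), the following sketch assumes that per-sample relevance vectors have already been produced by some attribution method. It makes two simplifying assumptions: a single mean centroid per ID class stands in for the clustering step, and Euclidean distance stands in for the relevance distance metric. The function names are hypothetical, and the bias-term relevance and feature-norm components are omitted.

```python
import numpy as np


def class_centroids(relevance, labels):
    """Return one mean relevance vector per ID class.

    Assumption: a single mean centroid per class replaces the clustering
    step mentioned in the abstract.
    """
    relevance = np.asarray(relevance, dtype=float)
    labels = np.asarray(labels)
    return {int(c): relevance[labels == c].mean(axis=0) for c in np.unique(labels)}


def relevance_distance(sample, centroids):
    """Distance from a sample's relevance vector to the nearest class centroid.

    Assumption: Euclidean distance. A larger value suggests the sample is
    further from every ID class and therefore more likely to be OOD.
    """
    sample = np.asarray(sample, dtype=float)
    return min(float(np.linalg.norm(sample - mu)) for mu in centroids.values())


# Toy relevance vectors for two ID classes (illustrative values only).
train_relevance = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
train_labels = [0, 0, 1, 1]
centroids = class_centroids(train_relevance, train_labels)

near_score = relevance_distance([0.0, 1.0], centroids)  # sits on the class-0 centroid
far_score = relevance_distance([5.0, 6.0], centroids)   # between the two classes
```

In this toy setup a sample is flagged as OOD when its score exceeds a threshold chosen on held-out ID data; the thresholding rule is likewise an assumption rather than something stated in the abstract.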
Our framework also enables explainable OOD detection. We validate its effectiveness across multiple deep learning architectures on the gastrointestinal imaging benchmarks Kvasir and GastroVision, achieving improvements over state-of-the-art OOD detection methods.
[97] Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material
Team Hunyuan3D,Shuhui Yang,Mingxin Yang,Yifei Feng,Xin Huang,Sheng Zhang,Zebin He,Di Luo,Haolin Liu,Yunfei Zhao,Qingxiang Lin,Zeqiang Lai,Xianghui Yang,Huiwen Shi,Zibo Zhao,Bowen Zhang,Hongyu Yan,Lifu Wang,Sicong Liu,Jihong Zhang,Meng Chen,Liang Dong,Yiwen Jia,Yulin Cai,Jiaao Yu,Yixuan Tang,Dongyuan Guo,Junlin Yu,Hao Zhang,Zheng Ye,Peng He,Runzhou Wu,Shida Wei,Chao Zhang,Yonghao Tan,Yifu Sun,Lin Niu,Shirui Huang,Bojian Zheng,Shu Liu,Shilin Chen,Xiang Yuan,Xiaofeng Yang,Kai Liu,Jianchen Zhu,Peng Chen,Tian Liu,Di Wang,Yuhong Liu,Linus,Jie Jiang,Jingwei Huang,Chunchao Guo
Main category: cs.CV
TL;DR: This paper presents Hunyuan3D 2.1, a system for generating high-resolution textured 3D assets, aiming to democratize access to 3D AI-generated content.
Details
Motivation: To make the field of 3D AI-generated content more accessible beyond researchers and developers by addressing complexities in collecting, processing, and training 3D models. Method: Introduces Hunyuan3D 2.1 with two core components - Hunyuan3D-DiT for shape generation and Hunyuan3D-Paint for texture synthesis. Provides a step-by-step guide on data processing, model training, evaluation metrics, and deployment strategies. Result: Users will gain knowledge on finetuning or developing robust 3D generative models suitable for gaming, virtual reality, and industrial design applications. Conclusion: Hunyuan3D 2.1 offers an advanced system for producing high-quality 3D assets and provides comprehensive guidance for utilizing and customizing the model. Abstract: 3D AI-generated content (AIGC) is a passionate field that has significantly accelerated the creation of 3D models in gaming, film, and design. Despite the development of several groundbreaking models that have revolutionized 3D generation, the field remains largely accessible only to researchers, developers, and designers due to the complexities involved in collecting, processing, and training 3D models. To address these challenges, we introduce Hunyuan3D 2.1 as a case study in this tutorial. This tutorial offers a comprehensive, step-by-step guide on processing 3D data, training a 3D generative model, and evaluating its performance using Hunyuan3D 2.1, an advanced system for producing high-resolution, textured 3D assets. The system comprises two core components: the Hunyuan3D-DiT for shape generation and the Hunyuan3D-Paint for texture synthesis. We will explore the entire workflow, including data preparation, model architecture, training strategies, evaluation metrics, and deployment. 
By the conclusion of this tutorial, you will have the knowledge to finetune or develop a robust 3D generative model suitable for applications in gaming, virtual reality, and industrial design.
[98] Multimodal Large Language Models for Medical Report Generation via Customized Prompt Tuning
Chunlei Li,Jingyang Hou,Yilei Shi,Jingliang Hu,Xiao Xiang Zhu,Lichao Mou
Main category: cs.CV
TL;DR: Medical report generation from imaging data is challenging. This paper introduces MRG-LLM, a multimodal large language model that combines a frozen LLM with a learnable visual encoder and uses dynamic prompt customization for medical report generation. Experiments show it performs well.
Details
Motivation: Medical report generation from imaging data remains a challenge in clinical practice, despite the promise of large language models (LLMs). There is a need for better integration of LLMs with medical imaging data. Method: The paper presents MRG-LLM, which combines a frozen LLM with a learnable visual encoder and introduces a dynamic prompt customization mechanism using conditional affine transformations derived from visual features. Two implementations are proposed: prompt-wise and promptbook-wise customization. Result: Extensive experiments on IU X-ray and MIMIC-CXR datasets show that MRG-LLM achieves state-of-the-art performance in medical report generation. Conclusion: MRG-LLM effectively integrates LLMs with medical imaging data through dynamic prompt customization, achieving outstanding results in medical report generation. Abstract: Medical report generation from imaging data remains a challenging task in clinical practice. While large language models (LLMs) show great promise in addressing this challenge, their effective integration with medical imaging data still deserves in-depth exploration. In this paper, we present MRG-LLM, a novel multimodal large language model (MLLM) that combines a frozen LLM with a learnable visual encoder and introduces a dynamic prompt customization mechanism. Our key innovation lies in generating instance-specific prompts tailored to individual medical images through conditional affine transformations derived from visual features. We propose two implementations: prompt-wise and promptbook-wise customization, enabling precise and targeted report generation. Extensive experiments on IU X-ray and MIMIC-CXR datasets demonstrate that MRG-LLM achieves state-of-the-art performance in medical report generation. Our code will be made publicly available.
[99] GenHOI: Generalizing Text-driven 4D Human-Object Interaction Synthesis for Unseen Objects
Shujia Li,Haiyu Zhang,Xinyuan Chen,Yaohui Wang,Yutong Ban
Main category: cs.CV
TL;DR: Diffusion models and large-scale motion datasets have advanced text-driven human motion synthesis, but extending these advances to 4D human-object interaction (HOI) is challenging due to limited large-scale 4D HOI datasets. The study introduces GenHOI, a two-stage framework for generalization to unseen objects and high-fidelity 4D HOI sequence synthesis. It uses Object-AnchorNet for sparse 3D HOI keyframe reconstruction and Contact-Aware Diffusion Model (ContactDM) for interpolation into dense sequences. Novel components like Contact-Aware Encoder and Contact-Aware HOI Attention improve the quality of generated sequences. Experiments show state-of-the-art results on OMOMO and 3D-FUTURE datasets.
Details
Motivation: The motivation is to address the challenges in synthesizing high-quality 4D human-object interactions by overcoming the limitation of scarce large-scale 4D HOI datasets. Method: The method involves a two-stage framework named GenHOI. In the first stage, Object-AnchorNet reconstructs sparse 3D HOI keyframes for unseen objects using only 3D HOI datasets. In the second stage, Contact-Aware Diffusion Model (ContactDM) interpolates these keyframes into densely temporally coherent 4D HOI sequences. ContactDM includes a Contact-Aware Encoder and Contact-Aware HOI Attention for better integration of contact signals. Result: GenHOI achieves state-of-the-art results on the OMOMO and 3D-FUTURE datasets, demonstrating strong generalization abilities to unseen objects and enabling high-fidelity 4D HOI generation. Conclusion: GenHOI successfully addresses the challenge of synthesizing high-quality 4D human-object interactions by leveraging a two-stage framework that reduces dependence on large-scale 4D HOI datasets, showing superior performance and generalization capabilities. Abstract: While diffusion models and large-scale motion datasets have advanced text-driven human motion synthesis, extending these advances to 4D human-object interaction (HOI) remains challenging, mainly due to the limited availability of large-scale 4D HOI datasets. In our study, we introduce GenHOI, a novel two-stage framework aimed at achieving two key objectives: 1) generalization to unseen objects and 2) the synthesis of high-fidelity 4D HOI sequences. In the initial stage of our framework, we employ an Object-AnchorNet to reconstruct sparse 3D HOI keyframes for unseen objects, learning solely from 3D HOI datasets, thereby mitigating the dependence on large-scale 4D HOI datasets. Subsequently, we introduce a Contact-Aware Diffusion Model (ContactDM) in the second stage to seamlessly interpolate sparse 3D HOI keyframes into densely temporally coherent 4D HOI sequences. 
To enhance the quality of generated 4D HOI sequences, we propose a novel Contact-Aware Encoder within ContactDM to extract human-object contact patterns and a novel Contact-Aware HOI Attention to effectively integrate the contact signals into diffusion models. Experimental results show that we achieve state-of-the-art results on the publicly available OMOMO and 3D-FUTURE datasets, demonstrating strong generalization abilities to unseen objects, while enabling high-fidelity 4D HOI generation.
[100] NTIRE 2025 Image Shadow Removal Challenge Report
Florin-Alexandru Vasluianu,Tim Seizinger,Zhuyun Zhou,Cailian Chen,Zongwei Wu,Radu Timofte,Mingjia Li,Jin Hu,Hainuo Wang,Hengxing Liu,Jiarui Wang,Qiming Hu,Xiaojie Guo,Xin Lu,Jiarong Yang,Yuanfei Bao,Anya Hu,Zihao Fan,Kunyu Wang,Jie Xiao,Xi Wang,Xueyang Fu,Zheng-Jun Zha,Yu-Fan Lin,Chia-Ming Lee,Chih-Chung Hsu,Xingbo Wang,Dong Li,Yuxu Chen,Bin Chen,Yuanbo Zhou,Yuanbin Chen,Hongwei Wang,Jiannan Lin,Qinquan Gao,Tong Tong,Zhao Zhang,Yanyan Wei,Wei Dong,Han Zhou,Seyed Amirreza Mousavi,Jun Chen,Haobo Liang,Jiajie Jing,Junyu Li,Yan Yang,Seoyeon Lee,Chaewon Kim,Ziyu Feng,Shidi Chen,Bowen Luan,Zewen Chen,Vijayalaxmi Ashok Aralikatti,G Gyaneshwar Rao,Nikhil Akalwadi,Chaitra Desai,Ramesh Ashok Tabib,Uma Mudenagudi,Anas M. Ali,Bilel Benjdira,Wadii Boulila,Alexandru Brateanu,Cosmin Ancuti,Tanmay Chaturvedi,Manish Kumar,Anmol Srivastav,Daksh Trivedi,Shashwat Thakur,Kishor Upla,Zeyu Xiao,Zhuoyuan Li,Boda Zhou,Shashank Shekhar,Kele Xu,Qisheng Xu,Zijian Gao,Tianjiao Wan,Suiyi Zhao,Bo Wang,Yan Luo,Mingshen Wang,Yilin Zhang
Main category: cs.CV
TL;DR: This paper examines the results of the NTIRE 2025 Shadow Removal Challenge, which had two evaluation tracks and used the WSRD+ dataset.
Details
Motivation: To assess the current state-of-the-art in shadow removal techniques and encourage further research in this area. Method: The challenge included two evaluation tracks: one focusing on reconstruction fidelity and the other on visual perception through a user study. Participants used the WSRD+ dataset for testing their solutions. Result: A total of 306 participants registered, with 17 teams successfully submitting their solutions during the final evaluation phase. Conclusion: The NTIRE 2025 Shadow Removal Challenge successfully evaluated the latest advancements in shadow removal techniques using diverse objects, textures, and materials. Abstract: This work examines the findings of the NTIRE 2025 Shadow Removal Challenge. A total of 306 participants have registered, with 17 teams successfully submitting their solutions during the final evaluation phase. Following the last two editions, this challenge had two evaluation tracks: one focusing on reconstruction fidelity and the other on visual perception through a user study. Both tracks were evaluated with images from the WSRD+ dataset, simulating interactions between self- and cast-shadows with a large number of diverse objects, textures, and materials.
[101] CLAIM: Clinically-Guided LGE Augmentation for Realistic and Diverse Myocardial Scar Synthesis and Segmentation
Farheen Ramzan,Yusuf Kiberu,Nikesh Jathanna,Shahnaz Jamil-Copley,Richard H. Clayton,Chen Chen
Main category: cs.CV
TL;DR: The paper introduces CLAIM, a framework for myocardial scar synthesis and segmentation from LGE cardiac MRI using deep learning. It includes the SMILE module for clinically guided scar mask generation and uses joint training to improve realism and segmentation accuracy.
Details
Motivation: The motivation is the limited availability and variability of high-quality LGE images with scar labels, which restricts the development of robust segmentation models for myocardial scars. Method: The method involves the CLAIM framework with the SMILE module, which synthesizes anatomically consistent scar patterns based on clinical knowledge and the AHA 17-segment model. A joint training strategy is used to optimize both the generator and the segmentation network. Result: Experimental results indicate that CLAIM produces anatomically coherent scar patterns with higher Dice similarity compared to baseline models. Conclusion: CLAIM enables controllable and realistic myocardial scar synthesis and shows utility for downstream medical imaging tasks. Abstract: Deep learning-based myocardial scar segmentation from late gadolinium enhancement (LGE) cardiac MRI has shown great potential for accurate and timely diagnosis and treatment planning for structural cardiac diseases. However, the limited availability and variability of LGE images with high-quality scar labels restrict the development of robust segmentation models. To address this, we introduce CLAIM: Clinically-Guided LGE Augmentation for Realistic and Diverse Myocardial Scar Synthesis and Segmentation, a framework for anatomically grounded scar generation and segmentation. At its core is the SMILE module (Scar Mask generation guided by cLinical knowledgE), which conditions a diffusion-based generator on the clinically adopted AHA 17-segment model to synthesize images with anatomically consistent and spatially diverse scar patterns. In addition, CLAIM employs a joint training strategy in which the scar segmentation network is optimized alongside the generator, aiming to enhance both the realism of synthesized scars and the accuracy of the scar segmentation performance.
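For readers unfamiliar with the Dice similarity mentioned in the results, the sketch below shows the standard Dice coefficient between two binary masks. It is a generic definition rather than the paper's evaluation code: the function name `dice_coefficient`, the toy masks, and the convention that two empty masks score 1.0 are all assumptions made for illustration.

```python
import numpy as np


def dice_coefficient(pred_mask, true_mask):
    """Standard Dice coefficient 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    denom = pred.sum() + true.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, true).sum()
    return float(2.0 * intersection / denom)


# Toy 2x2 masks: prediction marks two pixels, ground truth marks one of them.
predicted_scar = [[1, 1], [0, 0]]
reference_scar = [[1, 0], [0, 0]]
score = dice_coefficient(predicted_scar, reference_scar)  # 2 * 1 / (2 + 1)
```

A value of 1.0 means the masks coincide exactly and 0.0 means they do not overlap at all.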
Experimental results show that CLAIM produces anatomically coherent scar patterns and achieves higher Dice similarity with real scar distributions compared to baseline models. Our approach enables controllable and realistic myocardial scar synthesis and has demonstrated utility for downstream medical imaging tasks.
[102] RaCalNet: Radar Calibration Network for Sparse-Supervised Metric Depth Estimation
Xingrui Qin,Wentao Zhao,Chuan Cao,Yihe Niu,Houcheng Jiang,Jingchuan Wang
Main category: cs.CV
TL;DR: RaCalNet is a new framework for dense metric depth estimation using millimeter-wave radar that eliminates the need for dense LiDAR supervision by using sparse LiDAR to supervise the learning of refined radar measurements, resulting in improved structural consistency and fine detail preservation while surpassing state-of-the-art dense-supervised methods.
Details
Motivation: Dense metric depth estimation using millimeter-wave radar typically requires dense LiDAR supervision, which is costly and data-intensive. Method: RaCalNet recalibrates and refines sparse radar points to construct accurate depth priors, which serve as reliable anchors to guide monocular depth prediction without resorting to dense supervision. Result: RaCalNet surpasses state-of-the-art dense-supervised methods, producing depth maps with clear object contours and fine-grained textures. Extensive experiments demonstrate its effectiveness, reducing RMSE by 35.30% and 34.89%, respectively. Conclusion: RaCalNet provides an effective solution for dense metric depth estimation using millimeter-wave radar that does not require dense LiDAR supervision, offering improved performance with reduced cost and data requirements. Abstract: Dense metric depth estimation using millimeter-wave radar typically requires dense LiDAR supervision, generated via multi-frame projection and interpolation, to guide the learning of accurate depth from sparse radar measurements and RGB images. However, this paradigm is both costly and data-intensive. To address this, we propose RaCalNet, a novel framework that eliminates the need for dense supervision by using sparse LiDAR to supervise the learning of refined radar measurements, resulting in a supervision density of merely around 1% compared to dense-supervised methods. Unlike previous approaches that associate radar points with broad image regions and rely heavily on dense labels, RaCalNet first recalibrates and refines sparse radar points to construct accurate depth priors. These priors then serve as reliable anchors to guide monocular depth prediction, enabling metric-scale estimation without resorting to dense supervision. This design improves structural consistency and preserves fine details. 
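To make the notion of sparse supervision concrete, the sketch below computes an RMSE only at pixels where a sparse reference depth exists. It is a generic illustration, not RaCalNet's training or evaluation code: the function name `sparse_rmse`, the toy arrays, and the convention that a reference value of 0 means "no LiDAR return" are assumptions.

```python
import numpy as np


def sparse_rmse(pred_depth, sparse_ref_depth):
    """RMSE over the pixels that carry a reference depth value.

    Assumption: a reference depth of 0 marks a pixel with no measurement,
    so those pixels are excluded from the error.
    """
    pred = np.asarray(pred_depth, dtype=float)
    ref = np.asarray(sparse_ref_depth, dtype=float)
    valid = ref > 0
    if not valid.any():
        raise ValueError("reference depth map contains no valid pixels")
    diff = pred[valid] - ref[valid]
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy 2x2 example: only one of the four pixels has a reference depth.
predicted = [[1.0, 2.0], [3.0, 4.0]]
reference = [[0.0, 0.0], [0.0, 6.0]]
error = sparse_rmse(predicted, reference)           # only the pixel with depth 6.0 counts
supervision_density = float((np.asarray(reference) > 0).mean())  # fraction of labelled pixels
```

The same masking pattern applies whether the error is used as a training loss or as an evaluation metric; only the labelled pixels contribute.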
Despite relying solely on sparse supervision, RaCalNet surpasses state-of-the-art dense-supervised methods, producing depth maps with clear object contours and fine-grained textures. Extensive experiments on the ZJU-4DRadarCam dataset and real-world deployment scenarios demonstrate its effectiveness, reducing RMSE by 35.30% and 34.89%, respectively.[103] Control and Realism: Best of Both Worlds in Layout-to-Image without Training
Bonan Li,Yinhan Hu,Songhua Liu,Xinchao Wang
Main category: cs.CV
TL;DR: WinWinLay is a training-free method that enhances layout-to-image generation by introducing Non-local Attention Energy Function and Adaptive Update to improve control precision and realism, surpassing state-of-the-art methods in experiments.
Details
Motivation: Existing layout-to-image generation methods using pre-trained Text-to-Image models face challenges with imprecise localization and unrealistic artifacts. Method: WinWinLay uses two strategies: Non-local Attention Energy Function to redistribute attention scores for better alignment with spatial conditions, and Adaptive Update based on Langevin dynamics to ensure in-domain updating while respecting layout constraints. Result: Extensive experiments show that WinWinLay improves control over element placement and achieves high photorealistic visual fidelity, outperforming current state-of-the-art methods. Conclusion: WinWinLay successfully addresses the issues of imprecise localization and unrealistic artifacts in layout-to-image generation, setting a new benchmark in this field. Abstract: Layout-to-Image generation aims to create complex scenes with precise control over the placement and arrangement of subjects. Existing works have demonstrated that pre-trained Text-to-Image diffusion models can achieve this goal without training on any specific data; however, they often face challenges with imprecise localization and unrealistic artifacts. Focusing on these drawbacks, we propose a novel training-free method, WinWinLay. At its core, WinWinLay presents two key strategies, Non-local Attention Energy Function and Adaptive Update, that collaboratively enhance control precision and realism. On one hand, we theoretically demonstrate that the commonly used attention energy function introduces inherent spatial distribution biases, hindering objects from being uniformly aligned with layout instructions. To overcome this issue, non-local attention prior is explored to redistribute attention scores, facilitating objects to better conform to the specified spatial conditions. On the other hand, we identify that the vanilla backpropagation update rule can cause deviations from the pre-trained domain, leading to out-of-distribution artifacts. 
We accordingly introduce a Langevin dynamics-based adaptive update scheme as a remedy that promotes in-domain updating while respecting layout constraints. Extensive experiments demonstrate that WinWinLay excels in controlling element placement and achieving photorealistic visual fidelity, outperforming the current state-of-the-art methods.
[104] Show-o2: Improved Native Unified Multimodal Models
Jinheng Xie,Zhenheng Yang,Mike Zheng Shou
Main category: cs.CV
TL;DR: This paper presents Show-o2, an improved multimodal model using autoregressive modeling and flow matching for effective multimodal understanding and generation across text, images, and videos.
Details
Motivation: To advance the capabilities of unified multimodal models in handling diverse modalities like text, images, and videos with scalability and efficiency. Method: The model is built on a 3D causal variational autoencoder space with dual-path spatial (-temporal) fusion for constructing unified visual representations. It applies autoregressive modeling to the language head and flow matching to the flow head, based on a language model. A two-stage training approach is used to scale the model effectively. Result: Show-o2 demonstrates versatility in a wide range of multimodal tasks including understanding and generation across different modalities. Conclusion: Show-o2 successfully leverages autoregressive modeling and flow matching within a unified multimodal framework, achieving scalability and effectiveness in multimodal tasks. Abstract: This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
[105] Baltimore Atlas: FreqWeaver Adapter for Semi-supervised Ultra-high Spatial Resolution Land Cover Classification
Junhao Wu,Aboagye-Ntow Stephen,Chuyuan Wang,Gang Chen,Xin Huang
Main category: cs.CV
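TL;DR: Proposes an efficient semi-supervised segmentation framework for land cover classification on 0.3 m resolution imagery. By incorporating SAM2 knowledge and a FreqWeaver Adapter, it significantly improves segmentation performance with only a small increase in parameters.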
Details
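Motivation: High-resolution land cover classification is essential for fine-grained analysis, but it faces challenges including the high cost of pixel-level annotation, large scale variation, and the limited adaptability of large-scale vision models. Existing methods mostly focus on 1-meter resolution imagery and depend on large amounts of annotated data, which makes it hard to meet practical needs for processing higher-resolution imagery under weak supervision. Method: A parameter-efficient semi-supervised segmentation framework is proposed specifically for 0.3 m resolution imagery. It leverages the knowledge of SAM2 and introduces a remote sensing-specific FreqWeaver Adapter to strengthen fine-grained detail modeling while keeping a lightweight design at only 5.96% of the total model parameters. Result: By effectively using unlabeled data with minimal parameter overhead, the method delivers robust segmentation results with stronger structural consistency. It improves by 1.78% over existing parameter-efficient tuning strategies and by 3.44% over state-of-the-art high-resolution remote sensing segmentation methods. Conclusion: The proposed semi-supervised segmentation framework performs well in both parameter efficiency and accuracy, offering a new solution for land cover classification of high-resolution remote sensing imagery. Abstract: Ultra-high Spatial Resolution Land Cover Classification is essential for fine-grained land cover analysis, yet it remains challenging due to the high cost of pixel-level annotations, significant scale variation, and the limited adaptability of large-scale vision models. Existing methods typically focus on 1-meter spatial resolution imagery and rely heavily on annotated data, whereas practical applications often require processing higher-resolution imagery under weak supervision. To address this, we propose a parameter-efficient semi-supervised segmentation framework for 0.3 m spatial resolution imagery, which leverages the knowledge of SAM2 and introduces a remote sensing-specific FreqWeaver Adapter to enhance fine-grained detail modeling while maintaining a lightweight design at only 5.96% of the total model parameters. By effectively leveraging unlabeled data and maintaining minimal parameter overhead, the proposed method delivers robust segmentation results with superior structural consistency, achieving a 1.78% improvement over existing parameter-efficient tuning strategies and a 3.44% gain compared to state-of-the-art high-resolution remote sensing segmentation approaches.
[106] A Unified Graph-based Framework for Scalable 3D Tree Reconstruction and Non-Destructive Biomass Estimation from Point Clouds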
Di Wang,Shi Li
Main category: cs.CV
TL;DR: The paper presents a novel unified framework for end-to-end processing of large-scale point clouds to estimate forest above-ground biomass (AGB) through 3D tree structural reconstruction. This framework reduces dependency on specialized pre-processing tools and establishes ULS as a viable alternative to TLS, significantly improving the feasibility of QSM-based AGB estimation.
Details
Motivation: Estimating forest above-ground biomass (AGB) is crucial for assessing carbon storage and supporting sustainable forest management. Current Quantitative Structural Model (QSM) methods face limitations such as being primarily designed for individual trees, depending on high-quality point cloud data from terrestrial laser scanning (TLS), and requiring multiple pre-processing steps that hinder scalability and practical deployment. Method: This study proposes a novel unified framework enabling end-to-end processing of large-scale point clouds using an innovative graph-based pipeline. The approach integrates tree segmentation, leaf-wood separation and 3D skeletal reconstruction through dedicated graph operations including pathing and abstracting for tree topology reasoning. Result: Experimental results demonstrate strong performance under challenging conditions, particularly in leaf-on scenarios (~20% relative error) and low-density ULS datasets with partial coverage (~30% relative error). Conclusion: The proposed framework provides a robust and scalable solution for large-scale, non-destructive AGB estimation. It significantly reduces dependency on specialized pre-processing tools and establishes ULS as a viable alternative to TLS, substantially improving the feasibility of QSM-based AGB estimation. Abstract: Estimating forest above-ground biomass (AGB) is crucial for assessing carbon storage and supporting sustainable forest management. Quantitative Structural Model (QSM) offers a non-destructive approach to AGB estimation through 3D tree structural reconstruction. However, current QSM methods face significant limitations, as they are primarily designed for individual trees, depend on high-quality point cloud data from terrestrial laser scanning (TLS), and also require multiple pre-processing steps that hinder scalability and practical deployment. 
This study presents a novel unified framework that enables end-to-end processing of large-scale point clouds using an innovative graph-based pipeline. The proposed approach seamlessly integrates tree segmentation, leaf-wood separation and 3D skeletal reconstruction through dedicated graph operations including pathing and abstracting for tree topology reasoning. Comprehensive validation was conducted on datasets with varying leaf conditions (leaf-on and leaf-off), spatial scales (tree- and plot-level), and data sources (TLS and UAV-based laser scanning, ULS). Experimental results demonstrate strong performance under challenging conditions, particularly in leaf-on scenarios (~20% relative error) and low-density ULS datasets with partial coverage (~30% relative error). These findings indicate that the proposed framework provides a robust and scalable solution for large-scale, non-destructive AGB estimation. It significantly reduces dependency on specialized pre-processing tools and establishes ULS as a viable alternative to TLS. To our knowledge, this is the first method capable of enabling seamless, end-to-end 3D tree reconstruction at operational scales. This advancement substantially improves the feasibility of QSM-based AGB estimation, paving the way for broader applications in forest inventory and climate change research.
[107] One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution
Yujing Sun,Lingchen Sun,Shuaizheng Liu,Rongyuan Wu,Zhengqiang Zhang,Lei Zhang
Main category: cs.CV
TL;DR: This paper proposes Dual LoRA Learning (DLoRAL) paradigm for real-world video super-resolution (Real-VSR), which effectively extracts temporal consistency priors from low-quality input video and enhances video details while maintaining the extracted consistency priors.
Details
Motivation: Reproducing rich spatial details while maintaining temporal consistency in Real-VSR is challenging, especially when using pre-trained generative models like stable diffusion. Existing methods often compromise spatial details for temporal coherence, leading to suboptimal visual quality. Method: The proposed DLoRAL paradigm includes a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, a Consistency-LoRA (C-LoRA) trained to learn robust temporal representations from degraded inputs, and a Detail-LoRA (D-LoRA) trained to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization. Result: Experiments demonstrate that DLoRAL achieves strong performance in both accuracy and speed. Conclusion: DLoRAL effectively addresses the challenge of maintaining temporal consistency while enhancing spatial details in Real-VSR, providing a significant improvement in visual quality. Abstract: It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. 
Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.
[108] Mono-Modalizing Extremely Heterogeneous Multi-Modal Medical Image Registration
Kyobin Choo,Hyunkyung Han,Jinyeong Kim,Chanyong Yoon,Seong Jae Hwang
Main category: cs.CV
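TL;DR: Proposes the M2M-Reg framework for multi-modal deformable image registration, combined with the GradCyCon regularizer. On the ADNI dataset it substantially improves DSC for PET-MRI and FA-MRI registration.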
Details
Motivation: In clinical practice, functional imaging modalities (e.g., PET, FA) must be aligned with a structural reference (e.g., MRI, CT) for accurate interpretation or group analysis, but conventional unsupervised DIR methods struggle to learn reliable spatial mappings and often distort images. Method: The M2M-Reg framework trains multi-modal DIR models using only mono-modal similarity, and the GradCyCon regularizer leverages the cyclic training scheme to promote diffeomorphism. The framework also extends to a semi-supervised setting. Result: Experiments on the ADNI dataset show that M2M-Reg achieves up to 2x higher DSC than prior methods for PET-MRI and FA-MRI registration. Conclusion: M2M-Reg effectively handles highly heterogeneous multi-modal DIR, offering a more accurate solution for clinical applications. Abstract: In clinical practice, imaging modalities with functional characteristics, such as positron emission tomography (PET) and fractional anisotropy (FA), are often aligned with a structural reference (e.g., MRI, CT) for accurate interpretation or group analysis, necessitating multi-modal deformable image registration (DIR). However, due to the extreme heterogeneity of these modalities compared to standard structural scans, conventional unsupervised DIR methods struggle to learn reliable spatial mappings and often distort images. We find that the similarity metrics guiding these models fail to capture alignment between highly disparate modalities. To address this, we propose M2M-Reg (Multi-to-Mono Registration), a novel framework that trains multi-modal DIR models using only mono-modal similarity while preserving the established architectural paradigm for seamless integration into existing models. We also introduce GradCyCon, a regularizer that leverages M2M-Reg's cyclic training scheme to promote diffeomorphism. Furthermore, our framework naturally extends to a semi-supervised setting, integrating pre-aligned and unaligned pairs only, without requiring ground-truth transformations or segmentation masks. Experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset demonstrate that M2M-Reg achieves up to 2x higher DSC than prior methods for PET-MRI and FA-MRI registration, highlighting its effectiveness in handling highly heterogeneous multi-modal DIR. Our code is available at https://github.com/MICV-yonsei/M2M-Reg.
[109] BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion
Yuqing Lan,Chenyang Zhu,Zhirui Gao,Jiazhao Zhang,Yihan Cao,Renjiao Yi,Yijie Wang,Kai Xu
Main category: cs.CV
TL;DR: Open-vocabulary 3D object detection is crucial for autonomous driving and embodied AI, but dense point cloud reconstruction limits real-time deployment. This paper introduces a novel online framework that eliminates the need for reconstruction, enabling memory-efficient and real-time 3D detection using pre-trained models and optimization techniques.
Details
Motivation: To overcome the computational overhead and memory constraints of existing 3D object detection methods that rely on dense point cloud reconstruction. Method: The proposed method uses Cubify Anything as a pre-trained visual foundation model for single-view 3D object detection, combined with CLIP for open-vocabulary semantics. An association module with 3D Non-Maximum Suppression and a box correspondence matching module, along with an optimization module using IoU-guided particle filtering, fuses multi-view detections into a unified 3D bounding box. Result: The method achieves state-of-the-art performance among online methods on ScanNetV2 and CA-1M datasets, demonstrating strong generalization abilities and enabling real-time perception in large environments. Conclusion: This reconstruction-free paradigm offers significant improvements in efficiency and scalability for 3D object detection, paving the way for broader real-world applications. Abstract: Open-vocabulary 3D object detection has gained significant interest due to its critical applications in autonomous driving and embodied AI. Existing detection methods, whether offline or online, typically rely on dense point cloud reconstruction, which imposes substantial computational overhead and memory constraints, hindering real-time deployment in downstream tasks. To address this, we propose a novel reconstruction-free online framework tailored for memory-efficient and real-time 3D detection. Specifically, given streaming posed RGB-D video input, we leverage Cubify Anything as a pre-trained visual foundation model (VFM) for single-view 3D object detection by bounding boxes, coupled with CLIP to capture open-vocabulary semantics of detected objects. To fuse all detected bounding boxes across different views into a unified one, we employ an association module for correspondences of multi-views and an optimization module to fuse the 3D bounding boxes of the same instance predicted in multi-views. 
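As a rough illustration of the 3D Non-Maximum Suppression step named in the method summary, the sketch below runs greedy NMS over axis-aligned 3D boxes. The box layout (xmin, ymin, zmin, xmax, ymax, zmax), the function names iou_3d and nms_3d, and the 0.5 threshold are assumptions made for this example; the paper's association module may use oriented boxes and different settings.

```python
from typing import List, Sequence, Tuple

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def volume(b: Box) -> float:
    """Volume of an axis-aligned box (0 if degenerate)."""
    return max(0.0, b[3] - b[0]) * max(0.0, b[4] - b[1]) * max(0.0, b[5] - b[2])


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    union = volume(a) + volume(b) - inter
    return inter / union if union > 0 else 0.0


def nms_3d(boxes: Sequence[Box], scores: Sequence[float],
           iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou_3d(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 0, 1, 1, 1), (0.1, 0, 0, 1.1, 1, 1), (5, 5, 5, 6, 6, 6)]
scores = [0.9, 0.8, 0.7]
kept = nms_3d(boxes, scores)  # the second box heavily overlaps the first
```

In a multi-view setting, the surviving boxes would then be matched across views and fused; that part is not sketched here.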
The association module utilizes 3D Non-Maximum Suppression (NMS) and a box correspondence matching module, while the optimization module uses an IoU-guided efficient random optimization technique based on particle filtering to enforce multi-view consistency of the 3D bounding boxes while minimizing computational complexity. Extensive experiments on ScanNetV2 and CA-1M datasets demonstrate that our method achieves state-of-the-art performance among online methods. Benefiting from this novel reconstruction-free paradigm for 3D object detection, our method exhibits great generalization abilities in various scenarios, enabling real-time perception even in environments exceeding 1000 square meters.
[110] HOIDiNi: Human-Object Interaction through Diffusion Noise Optimization
Roey Ron,Guy Tevet,Haim Sawdayee,Amit H. Bermano
Main category: cs.CV
TL;DR: HOIDiNi is a text-driven diffusion framework for generating realistic human-object interactions (HOI) using Diffusion Noise Optimization (DNO). It separates the generation process into object-centric and human-centric phases, achieving high contact accuracy and motion naturalness. Experiments on the GRAB dataset show that it outperforms prior works in contact accuracy, physical validity, and overall quality.
Details
Motivation: Current HOI generation methods compromise between realism and physical correctness, failing to achieve both simultaneously. This paper aims to address this issue by introducing HOIDiNi. Method: The method uses Diffusion Noise Optimization (DNO) in the noise space of a pretrained diffusion model. It divides the problem into two phases: an object-centric phase that determines hand-object contact locations and a human-centric phase that refines full-body motion based on the initial choices. Result: HOIDiNi outperforms previous works and baselines in contact accuracy, physical validity, and overall quality as evaluated quantitatively, qualitatively, and subjectively on the GRAB dataset. It can generate complex, controllable interactions like grasping, placing, and full-body coordination using only textual prompts. Conclusion: HOIDiNi successfully generates realistic and plausible human-object interactions by optimizing directly in the noise space of a pretrained diffusion model with DNO, proving effective through evaluations on the GRAB dataset. Abstract: We present HOIDiNi, a text-driven diffusion framework for synthesizing realistic and plausible human-object interaction (HOI). HOI generation is extremely challenging since it induces strict contact accuracies alongside a diverse motion manifold. While current literature trades off between realism and physical correctness, HOIDiNi optimizes directly in the noise space of a pretrained diffusion model using Diffusion Noise Optimization (DNO), achieving both. This is made feasible thanks to our observation that the problem can be separated into two phases: an object-centric phase, primarily making discrete choices of hand-object contact locations, and a human-centric phase that refines the full-body motion to realize this blueprint. This structured approach allows for precise hand-object contact without compromising motion naturalness. 
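The core idea of optimizing in noise space can be sketched with a toy example: the generator stays frozen and only its noise input is updated by gradient descent so that the output satisfies a constraint. Everything here is an assumption for illustration: the 2x2 linear map A stands in for a pretrained diffusion sampler, the squared-error loss stands in for a contact objective, and the names generate, loss, grad, and optimize_noise are hypothetical. It is not the authors' implementation.

```python
from typing import List

# Frozen toy "generator": x = A z. A real system would run a diffusion sampler.
A = [[1.0, 0.5],
     [0.0, 1.0]]


def generate(z: List[float]) -> List[float]:
    """Map a noise vector z to an output x with the frozen generator."""
    return [sum(A[i][j] * z[j] for j in range(2)) for i in range(2)]


def loss(z: List[float], target: List[float]) -> float:
    """Squared distance between the generated output and the target."""
    return sum((xi - ti) ** 2 for xi, ti in zip(generate(z), target))


def grad(z: List[float], target: List[float]) -> List[float]:
    """Analytic gradient of the loss w.r.t. the noise: 2 * A^T (A z - target)."""
    residual = [xi - ti for xi, ti in zip(generate(z), target)]
    return [2.0 * sum(A[i][j] * residual[i] for i in range(2)) for j in range(2)]


def optimize_noise(z0: List[float], target: List[float],
                   lr: float = 0.1, steps: int = 200) -> List[float]:
    """Gradient descent on the noise only; the generator A is never changed."""
    z = list(z0)
    for _ in range(steps):
        g = grad(z, target)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
    return z


target = [1.0, 2.0]
z_opt = optimize_noise([0.0, 0.0], target)
```

Because only the input noise changes, the output stays on whatever manifold the frozen generator defines, which is the property that makes this style of optimization attractive for keeping motion natural.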
Quantitative, qualitative, and subjective evaluations on the GRAB dataset alone clearly indicate HOIDiNi outperforms prior works and baselines in contact accuracy, physical validity, and overall quality. Our results demonstrate the ability to generate complex, controllable interactions, including grasping, placing, and full-body coordination, driven solely by textual prompts. https://hoidini.github.io
[111] FindingDory: A Benchmark to Evaluate Memory in Embodied Agents
Karmesh Yadav,Yusuf Ali,Gunshi Gupta,Yarin Gal,Zsolt Kira
Main category: cs.CV
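TL;DR: To evaluate vision-language models on long-horizon control, this paper proposes a new benchmark comprising 60 tasks that require sustained engagement and contextual awareness. The benchmark evaluates long-range embodied tasks in the Habitat simulator and can be procedurally extended to longer, more challenging versions, enabling scalable evaluation of memory and reasoning. The authors also provide baselines that combine state-of-the-art VLMs with low-level navigation policies, assess their performance on these memory-intensive tasks, and point out room for improvement.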
Details
Motivation: Current large vision-language models perform well on planning and control tasks, but their use in real-world robotics is limited by their ability to handle long-term memory. Existing long-video QA benchmarks overlook embodied challenges such as object manipulation and navigation, while effective memory integration requires both recalling historical information and executing actions based on it. Method: A new benchmark is introduced for evaluating long-range embodied tasks in the Habitat simulator, comprising 60 tasks that require sustained engagement and contextual awareness. The benchmark can be procedurally extended to longer, more challenging versions for scalable evaluation of memory and reasoning. Baselines that combine state-of-the-art VLMs with low-level navigation policies are also provided to assess performance on memory-intensive tasks. Result: The benchmark and baselines make it possible to evaluate current VLMs on memory-intensive tasks and indicate directions for improvement. Conclusion: The new benchmark is a useful tool for evaluating the long-term memory and reasoning abilities of vision-language models in embodied settings; it reveals the limitations of current models and points to directions for future research. Abstract: Large vision-language models have recently demonstrated impressive performance in planning and control tasks, driving interest in their application to real-world robotics. However, deploying these models for reasoning in embodied contexts is limited by their ability to incorporate long-term experience collected across multiple days and represented by vast collections of images. Current VLMs typically struggle to process more than a few hundred images concurrently, highlighting the need for more efficient mechanisms to handle long-term memory in embodied settings. To effectively evaluate these models for long-horizon control, a benchmark must specifically target scenarios where memory is crucial for success. Existing long-video QA benchmarks overlook embodied challenges like object manipulation and navigation, which demand low-level skills and fine-grained reasoning over past interactions. Moreover, effective memory integration in embodied agents involves both recalling relevant historical information and executing actions based on that information, making it essential to study these aspects together rather than in isolation. In this work, we introduce a new benchmark for long-range embodied tasks in the Habitat simulator. This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness in an environment. The tasks can also be procedurally extended to longer and more challenging versions, enabling scalable evaluation of memory and reasoning. 
We also present baselines that integrate state-of-the-art VLMs with low-level navigation policies, assessing their performance on these memory-intensive tasks and highlighting areas for improvement.
[112] Demystifying the Visual Quality Paradox in Multimodal Large Language Models
Shuo Xing,Lanqing Guo,Hongyuan Hua,Seoyoung Lee,Peiran Li,Yufei Wang,Zhangyang Wang,Zhengzhong Tu
Main category: cs.CV
TL;DR: Recent MLLMs perform well on vision-language tasks, but the impact of input visual quality on their responses is unclear. This study reveals a visual-quality paradox where model performance can improve with deviations from human-perceived fidelity. To address this, VQ-TTT is introduced to align input images with model preferences.
Details
Motivation: To understand how input visual quality affects the responses of MLLMs and to find ways to improve their performance without using external models or extra training data. Method: Conduct a systematic study applying controlled degradations and stylistic shifts to images across leading MLLMs and benchmarks. Introduce VQ-TTT, which includes inserting a learnable low-rank kernel before the vision encoder and fine-tuning shallow vision-encoder layers via LoRA. Result: VQ-TTT dynamically adjusts input images in a single forward pass, aligning them with task-specific model preferences and lifting average accuracy significantly across evaluated MLLMs and datasets. Conclusion: The findings redefine what constitutes 'better' visual inputs for MLLMs and emphasize the need for adaptive imagery over universally 'clean' images. Abstract: Recent Multimodal Large Language Models (MLLMs) excel on benchmark vision-language tasks, yet little is known about how input visual quality shapes their responses. Does higher perceptual quality of images already translate to better MLLM understanding? We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks, applying controlled degradations and stylistic shifts to each image. Surprisingly, we uncover a visual-quality paradox: model, task, and even individual-instance performance can improve when images deviate from human-perceived fidelity. Off-the-shelf restoration pipelines fail to reconcile these idiosyncratic preferences. To close the gap, we introduce Visual-Quality Test-Time Tuning (VQ-TTT)-a lightweight adaptation module that: (1) inserts a learnable, low-rank kernel before the frozen vision encoder to modulate frequency content; and (2) fine-tunes only shallow vision-encoder layers via LoRA. VQ-TTT dynamically adjusts each input image in a single forward pass, aligning it with task-specific model preferences. 
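For readers unfamiliar with LoRA, the low-rank update it relies on can be sketched in a few lines: a frozen weight matrix W is augmented with a trainable product B·A of rank r, scaled by alpha / r. This is the generic LoRA formulation, not the VQ-TTT configuration; the helper names matmul and lora_weight, the matrix sizes, and the alpha value are assumptions for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(X: Matrix, Y: Matrix) -> Matrix:
    """Plain matrix product of X (m x k) and Y (k x n)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def lora_weight(W: Matrix, A: Matrix, B: Matrix, alpha: float) -> Matrix:
    """Effective weight W + (alpha / r) * B @ A, with W frozen.

    W is d_out x d_in, B is d_out x r, A is r x d_in; only A and B are trained.
    """
    r = len(A)
    delta = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]


W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]          # frozen base weight (2 x 3)
A = [[1.0, 2.0, 3.0]]          # trainable, rank r = 1 (1 x 3)
B_zero = [[0.0], [0.0]]        # usual initialisation: B = 0, so the update starts at zero
B_trained = [[1.0], [2.0]]     # hypothetical values after some tuning steps

W_start = lora_weight(W, A, B_zero, alpha=2.0)
W_tuned = lora_weight(W, A, B_trained, alpha=2.0)
```

With r much smaller than the layer width, the trainable parameter count r * (d_in + d_out) is far below d_in * d_out, which is why tuning only a few shallow layers this way can be cheap enough for test-time use.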
Across the evaluated MLLMs and all datasets, VQ-TTT significantly lifts average accuracy, with no external models, cached features, or extra training data. These findings redefine "better" visual inputs for MLLMs and highlight the need for adaptive, rather than universally "clean", imagery in the new era of AI being the main data customer.
[113] Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning
Ankan Deria,Adinath Madhavrao Dukre,Feilong Tang,Sara Atito,Sudipta Roy,Muhammad Awais,Muhammad Haris Khan,Imran Razzak
Main category: cs.CV
TL;DR: The paper introduces Value-guided Inference with Margin-based Reward (ViMaR), a two-stage inference framework for vision-language models that enhances efficiency and output fidelity. It combines a temporal-difference value model with margin-aware reward adjustment, leading to more reliable and accurate captions with over 4× speedup compared to existing methods. ViMaR demonstrates robust cross-model generalization and significantly improves caption quality.
Details
Motivation: To address the computational expense and low-confidence generations in current vision-language model inference approaches, which often lead to persistent hallucinations. Method: A two-stage inference framework called ViMaR is proposed. The first stage identifies the highest-value caption among diverse candidates through a single pass. The second stage selectively refines overlooked or weakly grounded segments while discouraging low-confidence continuations using a calibrated margin-based penalty. Result: ViMaR generates captions that are more reliable, factually accurate, detailed, and explanatory. It achieves over 4× speedup compared to existing value-guided methods and effectively generalizes across different VLM architectures, improving caption quality consistently. Conclusion: ViMaR is a scalable and transferable inference-time decoding strategy for vision-language models, enhancing both efficiency and output fidelity. Its potential extends to self-training pipelines, where it leads to substantial gains in visual comprehension benchmarks. Abstract: Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations which often lead to persistent hallucinations. We introduce Value-guided Inference with Margin-based Reward (ViMaR), a two-stage inference framework that improves both efficiency and output fidelity by combining a temporal-difference value model with a margin-aware reward adjustment. In the first stage, we perform a single pass to identify the highest-value caption among diverse candidates. In the second stage, we selectively refine only those segments that were overlooked or exhibit weak visual grounding, thereby eliminating frequently rewarded evaluations. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. 
Extensive experiments across multiple VLM architectures demonstrate that ViMaR generates captions that are significantly more reliable, factually accurate, detailed, and explanatory, while achieving over 4× speedup compared to existing value-guided methods. Specifically, we show that ViMaR, trained solely on LLaVA Mistral-7B, generalizes effectively to guide decoding in a stronger unseen model. To further validate this, we adapt ViMaR to steer generation in LLaVA-OneVision-Qwen2-7B, leading to consistent improvements in caption quality and demonstrating robust cross-model guidance. This cross-model generalization highlights ViMaR's flexibility and modularity, positioning it as a scalable and transferable inference-time decoding strategy. Furthermore, when ViMaR-generated captions are used for self-training, the underlying models achieve substantial gains across a broad suite of visual comprehension benchmarks, underscoring the potential of fast, accurate, and self-improving VLM pipelines.
[114] UniRelight: Learning Joint Decomposition and Synthesis for Video Relighting
Kai He,Ruofan Liang,Jacob Munkberg,Jon Hasselgren,Nandita Vijaykumar,Alexander Keller,Sanja Fidler,Igor Gilitschenski,Zan Gojcic,Zian Wang
Main category: cs.CV
TL;DR: The paper proposes a single-pass method using video diffusion models to estimate albedo and synthesize relit outputs, overcoming limitations of existing models by improving scene understanding and producing realistic lighting effects.
Details
Motivation: Existing relighting models either suffer from limited paired multi-illumination data or error accumulation in two-stage pipelines, leading to challenges in generalizing across scenes and creating realistic outputs under complex conditions. Method: A general-purpose approach is introduced that jointly estimates albedo and synthesizes relit outputs in one pass using the generative capabilities of video diffusion models. This is trained on synthetic multi-illumination data and real-world videos. Result: The model shows strong generalization across various domains with improvements in visual fidelity and temporal consistency compared to previous methods. Conclusion: This joint formulation enhances implicit scene comprehension and enables the creation of realistic lighting effects and material interactions. Abstract: We address the challenge of relighting a single image or video, a task that demands precise scene intrinsic understanding and high-quality light transport synthesis. Existing end-to-end relighting models are often limited by the scarcity of paired multi-illumination data, restricting their ability to generalize across diverse scenes. Conversely, two-stage pipelines that combine inverse and forward rendering can mitigate data requirements but are susceptible to error accumulation and often fail to produce realistic outputs under complex lighting conditions or with sophisticated materials. In this work, we introduce a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in a single pass, harnessing the generative capabilities of video diffusion models. This joint formulation enhances implicit scene comprehension and facilitates the creation of realistic lighting effects and intricate material interactions, such as shadows, reflections, and transparency. 
Trained on synthetic multi-illumination data and extensive automatically labeled real-world videos, our model demonstrates strong generalization across diverse domains and surpasses previous methods in both visual fidelity and temporal consistency.
[115] Sekai: A Video Dataset towards World Exploration
Zhen Li,Chuanhao Li,Xiaofeng Mao,Shaoheng Lin,Ming Li,Shitian Zhao,Zhaopan Xu,Xinyue Li,Yukang Feng,Jianwen Sun,Zizhen Li,Fanrui Zhang,Jiaxin Ai,Zhixiang Wang,Yuwei Wu,Tong He,Jiangmiao Pang,Yu Qiao,Yunde Jia,Kaipeng Zhang
Main category: cs.CV
TL;DR: The paper presents Sekai, a large-scale first-person video dataset with rich annotations for world exploration, and YUME, a model trained on a subset of it for interactive video world exploration.
Details
Motivation: Existing video generation datasets have limitations such as limited locations, short duration, static scenes, and lack of exploration-related annotations. Method: Introduced Sekai, a high-quality first-person view worldwide video dataset with over 5,000 hours of footage from diverse locations. Developed a toolbox to annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Used a subset of Sekai to train YUME, an interactive video world exploration model. Result: Experiments show the high quality of the Sekai dataset. YUME demonstrates potential in interactive video world exploration. Conclusion: Sekai will advance video generation and world exploration research and inspire valuable applications. Abstract: Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well suited to world exploration training because they suffer from several limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone-view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process, and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset. We also use a subset to train an interactive video world exploration model named YUME (meaning "dream" in Japanese). We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications.
[116] Evolutionary Caching to Accelerate Your Off-the-Shelf Diffusion Model
Anirud Aggarwal,Abhinav Shrivastava,Matthew Gwilliam
Main category: cs.CV
TL;DR: Evolutionary Caching to Accelerate Diffusion models (ECAD) is a genetic algorithm that learns efficient caching schedules for diffusion-based image generation models, providing significant inference speedups and fine-grained control over quality-latency trade-offs without modifying network parameters or reference images. Evaluated on PixArt-alpha, PixArt-Sigma, and FLUX-1.dev, ECAD demonstrates consistent improvements over previous methods.