Table of Contents
cs.CL
[1] Adverse Event Extraction from Discharge Summaries: A New Dataset, Annotation Scheme, and Initial Findings
Imane Guellil,Salomé Andres,Atul Anand,Bruce Guthrie,Huayu Zhang,Abul Hasan,Honghan Wu,Beatrice Alex
Main category: cs.CL
TL;DR: The paper presents a manually annotated corpus for Adverse Event (AE) extraction from discharge summaries of elderly patients. It includes 14 clinically significant AEs along with contextual attributes like negation, diagnosis type, and in-hospital occurrence. The annotation schema supports both discontinuous and overlapping entities.
Details
Motivation: Clinical NLP resources rarely focus on elderly patients, a population that is often underrepresented. Additionally, prior work rarely addresses challenges related to discontinuous and overlapping entities in AE extraction. Method: The dataset was developed within a Trusted Research Environment (TRE) and includes multiple clinically significant AEs along with contextual attributes. Multiple models were evaluated using FlairNLP across three annotation granularities: fine-grained, coarse-grained, and coarse-grained with negation. Result: Transformer-based models achieved strong performance on document-level coarse-grained extraction (F1 = 0.943), but performance dropped notably for fine-grained entity-level tasks (e.g., F1 = 0.675), particularly for rare events and complex attributes. Conclusion: Significant challenges remain in detecting underrepresented AEs and capturing nuanced clinical language, despite high scores on document-level coarse-grained extraction. Abstract: In this work, we present a manually annotated corpus for Adverse Event (AE) extraction from discharge summaries of elderly patients, a population often underrepresented in clinical NLP resources. The dataset includes 14 clinically significant AEs (such as falls, delirium, and intracranial haemorrhage), along with contextual attributes like negation, diagnosis type, and in-hospital occurrence. Uniquely, the annotation schema supports both discontinuous and overlapping entities, addressing challenges rarely tackled in prior work. We evaluate multiple models using FlairNLP across three annotation granularities: fine-grained, coarse-grained, and coarse-grained with negation. While transformer-based models (e.g., BERT-cased) achieve strong performance on document-level coarse-grained extraction (F1 = 0.943), performance drops notably for fine-grained entity-level tasks (e.g., F1 = 0.675), particularly for rare events and complex attributes.
These results demonstrate that despite high-level scores, significant challenges remain in detecting underrepresented AEs and capturing nuanced clinical language. Developed within a Trusted Research Environment (TRE), the dataset is available upon request via DataLoch and serves as a robust benchmark for evaluating AE extraction methods and supporting future cross-dataset generalisation.
[2] Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction
Marija Šakota,Robert West
Main category: cs.CL
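TL;DR: BoostCD improves output quality in structured NLP tasks by combining constrained and unconstrained decoding, and performs particularly well on closed information extraction.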
Details
Motivation: Many approaches to structured NLP tasks use an autoregressive language model to map input text to structured output, but these models can produce low-quality output at test time. A method is therefore needed to improve the quality of constrained decoding. Method: The paper proposes Boosted Constrained Decoding (BoostCD), a two-phase method: phase 1 decodes from the base model in both constrained and unconstrained mode to obtain two preliminary predictions; phase 2 uses a learned autoregressive boosted model to combine the two into a final prediction. Result: Applied to closed information extraction, the resulting model, BoostIE, outperforms prior approaches both in and out of distribution and addresses several errors common to those approaches. Conclusion: BoostCD effectively improves the quality of constrained decoding, which matters for structured NLP tasks. Abstract: Many recent approaches to structured NLP tasks use an autoregressive language model $M$ to map unstructured input text $x$ to output text $y$ representing structured objects (such as tuples, lists, trees, code, etc.), where the desired output structure is enforced via constrained decoding. During training, these approaches do not require the model to be aware of the constraints, which are merely implicit in the training outputs $y$. This is advantageous as it allows for dynamic constraints without requiring retraining, but can lead to low-quality output during constrained decoding at test time. We overcome this problem with Boosted Constrained Decoding (BoostCD), which combines constrained and unconstrained decoding in two phases: Phase 1 decodes from the base model $M$ twice, in constrained and unconstrained mode, obtaining two weak predictions. In phase 2, a learned autoregressive boosted model combines the two weak predictions into one final prediction. The mistakes made by the base model with vs. without constraints tend to be complementary, which the boosted model learns to exploit for improved performance. We demonstrate the power of BoostCD by applying it to closed information extraction. Our model, BoostIE, outperforms prior approaches both in and out of distribution, addressing several common errors identified in those approaches.
[3] CrEst: Credibility Estimation for Contexts in LLMs via Weak Supervision
Dyah Adila,Shuai Zhang,Boran Han,Bonan Min,Yuyang Wang
Main category: cs.CL
TL;DR: The paper introduces CrEst, a weakly supervised framework to assess context document credibility for LLMs without manual annotations. It offers two integration strategies and shows consistent improvements in accuracy and F1 score across experiments.
Details
Motivation: Existing methods integrating contextual information into LLMs often overlook the variability in context document credibility, risking unreliable information propagation. Method: CrEst is a weakly supervised framework that leverages inter-document agreement to estimate credibility based on semantic coherence among credible documents. Two integration strategies are proposed: a black-box approach for models with no access to internal parameters and a white-box method modifying attention mechanisms. Result: Experiments across three model architectures and five datasets show CrEst outperforms strong baselines with up to 26.86% improvement in accuracy and 3.49% in F1 score, maintaining robustness under high-noise conditions. Conclusion: CrEst effectively enhances the performance of LLMs on knowledge-intensive tasks by assessing and incorporating context document credibility. Abstract: The integration of contextual information has significantly enhanced the performance of large language models (LLMs) on knowledge-intensive tasks. However, existing methods often overlook a critical challenge: the credibility of context documents can vary widely, potentially leading to the propagation of unreliable information. In this paper, we introduce CrEst, a novel weakly supervised framework for assessing the credibility of context documents during LLM inference--without requiring manual annotations. Our approach is grounded in the insight that credible documents tend to exhibit higher semantic coherence with other credible documents, enabling automated credibility estimation through inter-document agreement. To incorporate credibility into LLM inference, we propose two integration strategies: a black-box approach for models without access to internal weights or activations, and a white-box method that directly modifies attention mechanisms. 
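The inter-document agreement idea described above can be illustrated with a minimal sketch. This is not the CrEst method itself: the bag-of-words cosine similarity is an assumed stand-in for a real semantic similarity model, and the function names (`bow`, `cosine`, `agreement_scores`) and the example documents are hypothetical. It only shows how a document that disagrees with the others ends up with a lower agreement score.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector; a crude stand-in for a semantic embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agreement_scores(docs):
    """Score each document by its mean similarity to all other documents."""
    vecs = [bow(d) for d in docs]
    scores = []
    for i, v in enumerate(vecs):
        others = [cosine(v, w) for j, w in enumerate(vecs) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores


# Hypothetical retrieved contexts: two agree, one contradicts them.
docs = [
    "the eiffel tower is in paris france",
    "paris france is home to the eiffel tower",
    "the eiffel tower is located in berlin germany",
]
scores = agreement_scores(docs)
least_credible = scores.index(min(scores))
```

In a black-box setting, scores like these could be used to filter or reorder context documents before prompting; how CrEst actually does this is not specified in the abstract.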
Extensive experiments across three model architectures and five datasets demonstrate that CrEst consistently outperforms strong baselines, achieving up to a 26.86% improvement in accuracy and a 3.49% increase in F1 score. Further analysis shows that CrEst maintains robust performance even under high-noise conditions.
[4] MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance
Joseph J. Peper,Wenzhao Qiu,Ali Payani,Lu Wang
Main category: cs.CL
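TL;DR: MDBench is a new dataset for evaluating large language models on multi-document reasoning. It is created with a novel synthetic generation method that starts from condensed structured seed knowledge and applies LLM-assisted edits to introduce multi-document-specific reasoning challenges.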
Details
Motivation: NLP evaluation has made significant progress, driven largely by powerful large language models. However, although multi-document reasoning is highly relevant for LLMs that handle long-context inputs, few benchmarks rigorously examine model behavior in this setting. In addition, building benchmarks for the multi-document setting has been difficult because annotating long inputs is expensive. Method: MDBench is created through a novel synthetic generation process that controllably and efficiently produces challenging document sets and corresponding question-answer examples. Specifically, the technique operates on condensed structured seed knowledge and applies LLM-assisted edits to introduce multi-document-specific reasoning challenges. The structured knowledge is then converted into natural text, yielding a document set and a corresponding QA example. Result: MDBench poses significant challenges for all methods, even with relatively short document sets. The knowledge-guided generation technique also makes targeted analysis of multi-document-specific reasoning capabilities straightforward and can be adapted quickly to new challenges and future modeling improvements. Conclusion: MDBench provides an important new tool for evaluating LLMs on multi-document reasoning, and its synthetic generation process is controllable and efficient, which should support future research. Abstract: Natural language processing evaluation has made significant progress, largely driven by the proliferation of powerful large language models (LLMs). New evaluation benchmarks are of increasing priority as the reasoning capabilities of LLMs are expanding at a rapid pace. In particular, while multi-document (MD) reasoning is an area of extreme relevance given LLM capabilities in handling longer-context inputs, few benchmarks exist to rigorously examine model behavior in this setting. Moreover, the multi-document setting is historically challenging for benchmark creation due to the high cost of annotating long inputs. In this work, we introduce MDBench, a new dataset for evaluating LLMs on the task of multi-document reasoning. Notably, MDBench is created through a novel synthetic generation process, allowing us to controllably and efficiently generate challenging document sets and the corresponding question-answer (QA) examples. Our novel technique operates on condensed structured seed knowledge, modifying it through LLM-assisted edits to induce MD-specific reasoning challenges. We then convert this structured knowledge into a natural text surface form, generating a document set and corresponding QA example. We analyze the behavior of popular LLMs and prompting techniques, finding that MDBench poses significant challenges for all methods, even with relatively short document sets.
We also find that our knowledge-guided generation technique (1) allows us to readily perform targeted analysis of MD-specific reasoning capabilities and (2) can be adapted quickly to account for new challenges and future modeling improvements.
[5] From Chat to Checkup: Can Large Language Models Assist in Diabetes Prediction?
Shadman Sakib,Oishy Fatema Akhand,Ajwad Abrar
Main category: cs.CL
TL;DR: This study explores the use of Large Language Models (LLMs) for diabetes prediction using structured numerical data. It compares six LLMs with three traditional machine learning models using the Pima Indian Diabetes Database. The results show that proprietary LLMs perform better than open-source ones, and one LLM outperforms traditional models in terms of F1-score.
Details
Motivation: To investigate the effectiveness of LLMs in predicting diabetes using structured numerical data through zero-shot, one-shot, and three-shot prompting methods. Method: Empirical analysis using the Pima Indian Diabetes Database to evaluate six LLMs (four open-source and two proprietary) and compare their performance with three traditional machine learning models using accuracy, precision, recall, and F1-score as metrics. Result: Proprietary LLMs perform better than open-source ones, with GPT-4o and Gemma-2-27B achieving the highest accuracy in few-shot settings. Gemma-2-27B also outperforms traditional ML models in terms of F1-score. Conclusion: LLMs can be useful for medical prediction tasks, but there are still issues such as performance variation across prompting strategies and the need for domain-specific fine-tuning. Abstract: While Machine Learning (ML) and Deep Learning (DL) models have been widely used for diabetes prediction, the use of Large Language Models (LLMs) for structured numerical data is still not well explored. In this study, we test the effectiveness of LLMs in predicting diabetes using zero-shot, one-shot, and three-shot prompting methods. We conduct an empirical analysis using the Pima Indian Diabetes Database (PIDD). We evaluate six LLMs, including four open-source models: Gemma-2-27B, Mistral-7B, Llama-3.1-8B, and Llama-3.2-2B. We also test two proprietary models: GPT-4o and Gemini Flash 2.0. In addition, we compare their performance with three traditional machine learning models: Random Forest, Logistic Regression, and Support Vector Machine (SVM). We use accuracy, precision, recall, and F1-score as evaluation metrics. Our results show that proprietary LLMs perform better than open-source ones, with GPT-4o and Gemma-2-27B achieving the highest accuracy in few-shot settings. Notably, Gemma-2-27B also outperforms the traditional ML models in terms of F1-score. 
However, there are still issues such as performance variation across prompting strategies and the need for domain-specific fine-tuning. This study shows that LLMs can be useful for medical prediction tasks and encourages future work on prompt engineering and hybrid approaches to improve healthcare predictions.
[6] Memory Tokens: Large Language Models Can Generate Reversible Sentence Embeddings
Ignacio Sastre,Aiala Rosá
Main category: cs.CL
TL;DR: An interesting phenomenon is observed where reversible sentence embeddings can be generated to allow LLMs to reconstruct original text exactly without modifying model weights.
Details
Motivation: To explore whether reversible sentence embeddings can be generated that enable LLMs to reconstruct the original text exactly without altering the model's weights. Method: Introduce a special memory token whose embedding is optimized through training on a fixed sequence. When prompted with this embedding, the model reconstructs the fixed sequence exactly. Result: Evaluated across English and Spanish datasets, sequences of up to ~240 tokens, and model scales from 100M to 8B parameters. Notably, Llama 3.1 8B successfully reconstructs all tested sequences. Conclusion: This phenomenon highlights an interesting capability of LLMs with potential applications in memory-based retrieval, compression, and controlled text generation. Abstract: In this work, we observe an interesting phenomenon: it is possible to generate reversible sentence embeddings that allow an LLM to reconstruct the original text exactly, without modifying the model's weights. This is achieved by introducing a special memory token, whose embedding is optimized through training on a fixed sequence. When prompted with this embedding, the model reconstructs the fixed sequence exactly. We evaluate this phenomenon across English and Spanish datasets, sequences of up to approximately 240 tokens, and model scales ranging from 100M to 8B parameters. Notably, Llama 3.1 8B successfully reconstructs all tested sequences. Our findings highlight an interesting capability of LLMs and suggest potential applications in memory-based retrieval, compression, and controlled text generation.
[7] Identifying social isolation themes in NVDRS text narratives using topic modeling and text-classification methods
Drew Walker,Swati Rajwal,Sudeshna Das,Snigdha Peddireddy,Abeed Sarker
Main category: cs.CL
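TL;DR: Using natural language processing techniques, the study examines how social isolation and loneliness relate to suicide and develops high-quality classifiers to identify the associated factors.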
Details
Motivation: Social isolation and loneliness have been increasing in recent years and contribute significantly to suicide rates, so effective means of surveillance and prevention are needed. Method: Topic modeling is used for lexicon development, and supervised learning classifiers are trained to analyze narrative text from the US National Violent Death Reporting System. Result: Among more than 300,000 suicide cases, 1,198 mentioned chronic social isolation; men, gay decedents, and divorced decedents had higher odds of being classified as chronically socially isolated. Conclusion: The methods can improve surveillance and prevention of social isolation and loneliness in the United States. Abstract: Social isolation and loneliness, which have been increasing in recent years, strongly contribute toward suicide rates. Although social isolation and loneliness are not currently recorded within the US National Violent Death Reporting System's (NVDRS) structured variables, natural language processing (NLP) techniques can be used to identify these constructs in law enforcement and coroner/medical examiner narratives. Using topic modeling for lexicon development and supervised learning classifiers, we developed high-quality classifiers (average F1: .86, accuracy: .82). Evaluating over 300,000 suicides from 2002 to 2020, we identified 1,198 mentioning chronic social isolation. Decedents had higher odds of chronic social isolation classification if they were men (OR = 1.44; CI: 1.24, 1.69, p<.0001), gay (OR = 3.68; 1.97, 6.33, p<.0001), or were divorced (OR = 3.34; 2.68, 4.19, p<.0001). We found significant predictors for other social isolation topics of recent or impending divorce, child custody loss, eviction or recent move, and break-up. Our methods can improve surveillance and prevention of social isolation and loneliness in the United States.
[8] Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation
Zongxia Li,Yapei Chang,Yuhang Zhou,Xiyang Wu,Zichao Liang,Yoo Yeon Sung,Jordan Lee Boyd-Graber
Main category: cs.CL
TL;DR: The paper proposes PrefBERT, a scoring model for evaluating open-ended long-form generation. PrefBERT provides better feedback than traditional metrics and aligns well with the rewards needed for GRPO training. Human evaluations show that using PrefBERT improves policy model responses.
Details
Motivation: Evaluating open-ended long-form generation is difficult due to unclear distinctions between good and bad outputs, and existing methods often miss key aspects or are biased by pretraining data. Method: PrefBERT is trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality. It offers better semantic reward feedback than ROUGE-L and BERTScore, supporting GRPO training effectively. Result: Comprehensive evaluations, including LLM-as-a-judge, human ratings, and qualitative analysis, confirm that PrefBERT remains reliable across varied long passages and aligns well with verifiable rewards GRPO needs. Conclusion: Using PrefBERT as the reward signal to train policy models yields responses better aligned with human preferences than those trained with traditional metrics. Abstract: Evaluating open-ended long-form generation is challenging because it is hard to define what clearly separates good from bad outputs. Existing methods often miss key aspects like coherence, style, or relevance, or are biased by pretraining data, making open-ended long-form evaluation an underexplored problem. To address this gap, we propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO and guiding its training with distinct rewards for good and bad outputs. Trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality, PrefBERT effectively supports GRPO by offering better semantic reward feedback than traditional metrics ROUGE-L and BERTScore do. Through comprehensive evaluations, including LLM-as-a-judge, human ratings, and qualitative analysis, we show that PrefBERT, trained on multi-sentence and paragraph-length responses, remains reliable across varied long passages and aligns well with the verifiable rewards GRPO needs. 
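To make the role of a reward signal in GRPO more concrete, here is a minimal sketch of the group-relative advantage that GRPO computes from the rewards of several sampled responses to the same prompt. The reward values below are made up for illustration and merely stand in for scores a model such as PrefBERT might assign; the function name and the use of the population standard deviation are assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    scoring above the group average get a positive advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four sampled responses to one prompt.
rewards = [0.9, 0.4, 0.6, 0.1]
advantages = group_relative_advantages(rewards)
```

Because advantages depend only on differences within the group, a reward model is useful here to the extent that it separates better responses from worse ones, which is the property the abstract claims for PrefBERT over ROUGE-L and BERTScore.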
Human evaluations confirm that using PrefBERT as the reward signal to train policy models yields responses better aligned with human preferences than those trained with traditional metrics. Our code is available at https://github.com/zli12321/long_form_rl.
[9] Learning-Time Encoding Shapes Unlearning in LLMs
Ruihan Wu,Konstantin Garov,Kamalika Chaudhuri
Main category: cs.CL
TL;DR: The paper explores how learning-time choices in knowledge encoding affect the unlearning of factual knowledge in large language models.
Details
Motivation: As LLMs are increasingly deployed, there is a need to remove specific pieces of knowledge for reasons such as privacy and correcting harmful content. Method: The authors empirically investigate the impact of learning-time knowledge encoding on unlearning effectiveness through experiments. Result: Learning with paraphrased descriptions improves unlearning performance, but unlearning an individual piece of knowledge from a chunk of text is challenging. Conclusion: Learning-time knowledge encoding may be crucial for reliable post-hoc unlearning. Abstract: As large language models (LLMs) are increasingly deployed in the real world, the ability to "unlearn", or remove specific pieces of knowledge post hoc, has become essential for a variety of reasons ranging from privacy regulations to correcting outdated or harmful content. Prior work has proposed unlearning benchmarks and algorithms, and has typically assumed that the training process and the target model are fixed. In this work, we empirically investigate how learning-time choices in knowledge encoding impact the effectiveness of unlearning factual knowledge. Our experiments reveal two key findings: (1) learning with paraphrased descriptions improves unlearning performance and (2) unlearning an individual piece of knowledge from a chunk of text is challenging. Our results suggest that learning-time knowledge encoding may play a central role in enabling reliable post-hoc unlearning.
[10] Improving Dialogue Discourse Parsing through Discourse-aware Utterance Clarification
Yaxin Fan,Peifeng Li,Qiaoming Zhu
Main category: cs.CL
TL;DR: This paper proposes a Discourse-aware Clarification Module (DCM) and Contribution-aware Preference Optimization (CPO) to improve dialogue discourse parsing, resolving ambiguities and outperforming SOTA baselines.
Details
Motivation: Linguistic features in dialogues like omission and idiom introduce ambiguities that obscure the intended discourse relations, making it challenging for parsers. Method: The method involves two components: DCM which uses clarification type reasoning and discourse goal reasoning, and CPO which optimizes clarifications by assessing contributions and providing feedback to enhance adaptability and alignment with parser requirements. Result: Extensive experiments on STAC and Molweni datasets show that the approach effectively resolves ambiguities and significantly outperforms state-of-the-art baselines. Conclusion: The proposed DCM and CPO successfully address the challenges posed by ambiguities in dialogue discourse parsing and achieve superior performance. Abstract: Dialogue discourse parsing aims to identify and analyze discourse relations between the utterances within dialogues. However, linguistic features in dialogues, such as omission and idiom, frequently introduce ambiguities that obscure the intended discourse relations, posing significant challenges for parsers. To address this issue, we propose a Discourse-aware Clarification Module (DCM) to enhance the performance of the dialogue discourse parser. DCM employs two distinct reasoning processes: clarification type reasoning and discourse goal reasoning. The former analyzes linguistic features, while the latter distinguishes the intended relation from the ambiguous one. Furthermore, we introduce Contribution-aware Preference Optimization (CPO) to mitigate the risk of erroneous clarifications, thereby reducing cascading errors. CPO enables the parser to assess the contributions of the clarifications from DCM and provide feedback to optimize the DCM, enhancing its adaptability and alignment with the parser's requirements. 
Extensive experiments on the STAC and Molweni datasets demonstrate that our approach effectively resolves ambiguities and significantly outperforms the state-of-the-art (SOTA) baselines.
[11] CKD-EHR: Clinical Knowledge Distillation for Electronic Health Records
Junke Wang,Hongshun Ling,Li Zhang,Longqian Zhang,Fang Wang,Yuan Gao,Zhi Li
Main category: cs.CL
TL;DR: This paper proposes the CKD-EHR framework, which uses knowledge distillation to enhance EHR-based disease risk prediction. It fine-tunes Qwen2.5-7B as the teacher model and transfers the distilled knowledge to a lightweight BERT student model. On the MIMIC-III dataset, it significantly improves diagnostic accuracy and inference speed.
Details
Motivation: Existing large language models have insufficient medical knowledge representation and low efficiency in clinical deployment for EHR-based disease prediction. Method: The CKD-EHR framework uses Qwen2.5-7B, fine-tuned on medical knowledge-enhanced data, as the teacher model, generates interpretable soft labels through a multi-granularity attention distillation mechanism, and then transfers the distilled knowledge to a lightweight BERT student model. Result: On the MIMIC-III dataset, CKD-EHR increases diagnostic accuracy by 9%, improves F1-score by 27%, and achieves a 22.2 times inference speedup compared to the baseline model. Conclusion: CKD-EHR greatly improves resource utilization efficiency and enhances the accuracy and timeliness of diagnosis, providing a practical technical approach for clinical settings. Abstract: Electronic Health Records (EHR)-based disease prediction models have demonstrated significant clinical value in promoting precision medicine and enabling early intervention. However, existing large language models face two major challenges: insufficient representation of medical knowledge and low efficiency in clinical deployment. To address these challenges, this study proposes the CKD-EHR (Clinical Knowledge Distillation for EHR) framework, which achieves efficient and accurate disease risk prediction through knowledge distillation techniques. Specifically, the large language model Qwen2.5-7B is first fine-tuned on medical knowledge-enhanced data to serve as the teacher model. It then generates interpretable soft labels through a multi-granularity attention distillation mechanism. Finally, the distilled knowledge is transferred to a lightweight BERT student model. Experimental results show that on the MIMIC-III dataset, CKD-EHR significantly outperforms the baseline model: diagnostic accuracy is increased by 9%, F1-score is improved by 27%, and a 22.2 times inference speedup is achieved.
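For readers unfamiliar with soft-label distillation, the sketch below shows the generic temperature-scaled loss commonly used to transfer a teacher's output distribution to a student. It is an assumption-laden illustration, not the paper's multi-granularity attention distillation mechanism, whose details the abstract does not give; the function names, logits, and temperature are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher soft labels
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits over three disease-risk classes.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]
loss = distillation_loss(teacher, student)
```

A higher temperature flattens both distributions, so the student also learns from the relative probabilities the teacher assigns to non-top classes.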
This innovative solution not only greatly improves resource utilization efficiency but also significantly enhances the accuracy and timeliness of diagnosis, providing a practical technical approach for resource optimization in clinical settings. The code and data for this research are available at https://github.com/209506702/CKD_EHR.
[12] Modeling the One-to-Many Property in Open-Domain Dialogue with LLMs
Jing Yang Lee,Kong-Aik Lee,Woon-Seng Gan
Main category: cs.CL
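TL;DR: By decomposing open-domain dialogue generation into multi-response generation and preference-based selection, and introducing the purpose-built o2mDial dataset, the proposed method significantly improves the diversity and quality of responses from smaller language models, bringing them closer to the performance of larger models.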
Details
Motivation: Although prior research shows that modeling the one-to-many property of open-domain dialogue improves response diversity, most modern LLM-based dialogue systems do not do so explicitly. Method: Open-domain dialogue generation is decomposed into two tasks: Multi-Response Generation (MRG) and Preference-based Selection (PS). A set of semantically and lexically diverse, high-quality responses is generated first, and a single response is then selected based on human preference. To support this, the authors build o2mDial, a dialogue corpus designed specifically to capture the one-to-many property. Using o2mDial, they propose new in-context learning and instruction-tuning strategies, new evaluation metrics for MRG, and a model-based approach for PS. Result: Empirical results show that applying the two-stage framework to smaller LLMs increases overall response diversity while maintaining contextual coherence, improving response quality by up to 90%. Conclusion: With a purpose-built corpus and strategies, smaller LLMs can approach the dialogue generation performance of larger models, particularly in response diversity and quality. Abstract: Open-domain Dialogue (OD) exhibits a one-to-many (o2m) property, whereby multiple appropriate responses exist for a single dialogue context. Despite prior research showing that modeling this property boosts response diversity, most modern LLM-based dialogue agents do not explicitly do so. In this work, we model the o2m property of OD in LLMs by decomposing OD generation into two key tasks: Multi-Response Generation (MRG) and Preference-based Selection (PS), which entail generating a set of n semantically and lexically diverse high-quality responses for a given dialogue context, followed by selecting a single response based on human preference, respectively. To facilitate MRG and PS, we introduce o2mDial, a dialogue corpus explicitly designed to capture the o2m property by featuring multiple plausible responses for each context. Leveraging o2mDial, we propose new in-context learning and instruction-tuning strategies, as well as novel evaluation metrics for MRG, alongside a model-based approach for PS. Empirical results demonstrate that applying the proposed two-stage framework to smaller LLMs for OD generation enhances overall response diversity while maintaining contextual coherence, improving response quality by up to 90%, bringing them closer to the performance of larger models.
[13] Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models
Gyeongje Cho,Yeonkyoun So,Chanwoo Park,Sangmin Lee,Sungmok Jung,Jaejin Lee
Main category: cs.CL
TL;DR: The paper presents Thunder-Tok, a Korean tokenizer that reduces token fertility by 10% without sacrificing model performance through rule-based pre-tokenization and branching entropy-based selection.
Details
Motivation: To design an efficient tokenizer for the Korean language that reduces token fertility without compromising model performance. Method: Uses a rule-based pre-tokenization method aligned with Korean linguistic structure and creates a seed vocabulary with tokens resembling linguistic units. Also employs a branching entropy-based selection algorithm to increase average token length. Result: Reduces token fertility by approximately 10% compared to BPE, which is equivalent to reducing the number of tokens by 10% and improving inference speed by 10%, without any drop in performance across various downstream tasks. Conclusion: The linguistically informed approach of Thunder-Tok proves effective and practical for designing efficient tokenizers for language models. Abstract: This paper introduces Thunder-Tok, a new Korean tokenizer designed to reduce token fertility without compromising model performance. Our approach uses a rule-based pre-tokenization method that aligns with the linguistic structure of the Korean language. We also create a seed vocabulary containing tokens that resemble linguistic units and employ a branching entropy-based selection algorithm. These techniques increase the average token length, thus lowering fertility while preserving linguistic information. Experimental results indicate that Thunder-Tok reduces fertility by approximately 10% (i.e., reduces the number of tokens by 10%, improving the inference speed by 10%) compared to BPE without compromising performance across various downstream tasks. These findings demonstrate that our linguistically informed approach is effective and practical for designing efficient tokenizers for language models.
[14] Emergence of Primacy and Recency Effect in Mamba: A Mechanistic Point of View
Muhammad Cendekia Airlangga,Hilal AlQuabeh,Munachiso S Nwadike,Kentaro Inui
Main category: cs.CL
TL;DR: This paper explores memory in state-space language models, specifically the Mamba architecture, using primacy and recency effects. They observe a U-shaped accuracy profile and identify three mechanisms behind this pattern.
Details
Motivation: To understand how information is retained and forgotten over time in state-space language models. Method: Applying structured recall tasks to the Mamba architecture and analyzing the results through targeted ablations and input perturbations on two large-scale Mamba-based language models. Result: A consistent U-shaped accuracy profile was observed, with strong performance at the beginning and end of input sequences. Three mechanisms were identified: long-term memory supported by a sparse subset of channels; short-term memory governed by delta-modulated recurrence; and memory allocation modulated by semantic regularity. Conclusion: The study uncovers key mechanisms of memory retention and forgetting in state-space language models, validated through experiments on large-scale Mamba-based models. Abstract: We study memory in state-space language models using primacy and recency effects as behavioral tools to uncover how information is retained and forgotten over time. Applying structured recall tasks to the Mamba architecture, we observe a consistent U-shaped accuracy profile, indicating strong performance at the beginning and end of input sequences. We identify three mechanisms that give rise to this pattern. First, long-term memory is supported by a sparse subset of channels within the model's selective state space block, which persistently encode early input tokens and are causally linked to primacy effects. Second, short-term memory is governed by delta-modulated recurrence: recent inputs receive more weight due to exponential decay, but this recency advantage collapses when distractor items are introduced, revealing a clear limit to memory depth. Third, we find that memory allocation is dynamically modulated by semantic regularity: repeated relations in the input sequence shift the delta gating behavior, increasing the tendency to forget intermediate items. 
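The recency mechanism described above can be illustrated with a deliberately simplified scalar recurrence. This sketch assumes a one-dimensional state updated as h_t = exp(a * delta_t) * h_{t-1} + x_t with a < 0; a real Mamba block uses learned, input-dependent parameters and also scales the input term, so the names and values here are hypothetical and only show why exponential decay favors recent inputs.

```python
import math


def contribution_weights(deltas, a=-1.0):
    """Weight of each input x_i in the final state of the scalar recurrence.

    Input i is decayed once by every later step j > i, so its weight is
    exp(a * sum of deltas after position i).
    """
    return [math.exp(a * sum(deltas[i + 1:])) for i in range(len(deltas))]


def final_state(xs, deltas, a=-1.0):
    """Run h_t = exp(a * delta_t) * h_{t-1} + x_t and return the last state."""
    h = 0.0
    for x, d in zip(xs, deltas):
        h = math.exp(a * d) * h + x
    return h


# Constant step size: weights grow toward the end of the sequence (recency).
uniform = contribution_weights([0.5] * 5)

# A larger delta at one later step wipes out more of what came before it.
with_spike = contribution_weights([0.5, 0.5, 3.0, 0.5, 0.5])
```

In this toy model, changing the delta values is the only way to change what is forgotten, which mirrors the abstract's observation that shifts in delta gating increase the tendency to forget intermediate items. The primacy effect is not captured here, since the abstract attributes it to a separate sparse subset of channels.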
We validate these findings via targeted ablations and input perturbations on two large-scale Mamba-based language models: one with 1.4B and another with 7B parameters.
[15] A Comparative Study of Task Adaptation Techniques of Large Language Models for Identifying Sustainable Development Goals
Andrea Cadeddu,Alessandro Chessa,Vincenzo De Leo,Gianni Fenu,Enrico Motta,Francesco Osborne,Diego Reforgiato Recupero,Angelo Salatino,Luca Secchi
Main category: cs.CL
TL;DR: The study explores the application of various proprietary and open-source LLMs for single-label, multi-class text classification related to SDGs. It evaluates Zero-Shot, Few-Shot Learning, and Fine-Tuning techniques revealing that smaller models can perform comparably to larger ones with optimized prompt engineering.
Details
Motivation: To address the challenge of tracking progress toward the 17 Sustainable Development Goals (SDGs) by leveraging text classification models and evaluating the effectiveness of different learning approaches with large language models. Method: Analyze various proprietary and open-source LLMs for a single-label, multi-class text classification task focused on the SDGs. Evaluate task adaptation techniques such as Zero-Shot, Few-Shot Learning, and Fine-Tuning within this domain. Result: Smaller models, when optimized through prompt engineering, can perform on par with larger models like OpenAI's GPT. Conclusion: Optimized smaller models are effective alternatives to larger models in text classification tasks for SDGs, making efficient use of resources while maintaining performance. Abstract: In 2012, the United Nations introduced 17 Sustainable Development Goals (SDGs) aimed at creating a more sustainable and improved future by 2030. However, tracking progress toward these goals is difficult because of the extensive scale and complexity of the data involved. Text classification models have become vital tools in this area, automating the analysis of vast amounts of text from a variety of sources. Additionally, large language models (LLMs) have recently proven indispensable for many natural language processing tasks, including text classification, thanks to their ability to recognize complex linguistic patterns and semantics. This study analyzes various proprietary and open-source LLMs for a single-label, multi-class text classification task focused on the SDGs. Then, it also evaluates the effectiveness of task adaptation techniques (i.e., in-context learning approaches), namely Zero-Shot and Few-Shot Learning, as well as Fine-Tuning within this domain. 
The results reveal that smaller models, when optimized through prompt engineering, can perform on par with larger models like OpenAI's GPT (Generative Pre-trained Transformer).
[16] ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs
Feng He,Zijun Chen,Xinnian Liang,Tingting Ma,Yunqi Qiu,Shuangzhi Wu,Junchi Yan
Main category: cs.CL
TL;DR: ProtoReasoning is a framework enhancing LLMs' reasoning ability via scalable and verifiable prototypes, improving performance across various reasoning tasks.
Details
Motivation: To understand the mechanisms behind cross-domain generalization capabilities of Large Reasoning Models trained with Long Chain-of-Thought reasoning and hypothesize that shared abstract reasoning prototypes are responsible for this. Method: Propose ProtoReasoning framework featuring an automated prototype construction pipeline, comprehensive verification system, and scalability within prototype space using Prolog and PDDL representations. Result: ProtoReasoning achieves improvements over baseline models in logical reasoning (4.7%), planning tasks (6.3%), general reasoning (4.0%), and mathematics (1.0%). Ablation studies confirm enhanced generalization when learning in prototype space. Conclusion: Reasoning prototypes serve as the foundation for generalizable reasoning in large language models, validated by the success of the ProtoReasoning framework. Abstract: Recent advances in Large Reasoning Models (LRMs) trained with Long Chain-of-Thought (Long CoT) reasoning have demonstrated remarkable cross-domain generalization capabilities. However, the underlying mechanisms supporting such transfer remain poorly understood. We hypothesize that cross-domain generalization arises from shared abstract reasoning prototypes -- fundamental reasoning patterns that capture the essence of problems across domains. 
These prototypes minimize the nuances of the representation, revealing that seemingly diverse tasks are grounded in shared reasoning structures. Based on this hypothesis, we propose ProtoReasoning, a framework that enhances the reasoning ability of LLMs by leveraging scalable and verifiable prototypical representations (Prolog for logical reasoning, PDDL for planning). ProtoReasoning features: (1) an automated prototype construction pipeline that transforms problems into corresponding prototype representations; (2) a comprehensive verification system providing reliable feedback through Prolog/PDDL interpreters; (3) the scalability to synthesize problems arbitrarily within prototype space while ensuring correctness. Extensive experiments show that ProtoReasoning achieves 4.7% improvement over baseline models on logical reasoning (Enigmata-Eval), 6.3% improvement on planning tasks, 4.0% improvement on general reasoning (MMLU) and 1.0% on mathematics (AIME24). Significantly, our ablation studies confirm that learning in prototype space also demonstrates enhanced generalization to structurally similar problems compared to training solely on natural language representations, validating our hypothesis that reasoning prototypes serve as the foundation for generalizable reasoning in large language models.
[17] MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs
Yongqi Fan,Yating Wang,Guandong Wang,Jie Zhai,Jingping Liu,Qi Ye,Tong Ruan
Main category: cs.CL
TL;DR: The paper proposes MinosEval, a new evaluation method for open-ended QA that distinguishes between factoid and non-factoid questions and uses different strategies to rank candidate answers. Experiments show it aligns better with human annotations.
Details
Motivation: Current evaluation methods for open-ended QA have limitations such as struggling to capture semantic similarities, lacking interpretability, failing to adapt across different question contents, and overlooking the distinction between factoid and non-factoid questions. Method: MinosEval first distinguishes open-ended questions and then ranks candidate answers using different evaluation strategies: an adaptive key-point scoring strategy for factoid questions and an instance-aware listwise ranking strategy for non-factoid questions. Result: Experiments on multiple datasets, including self-built ones, demonstrate that MinosEval better aligns with human annotations and provides more interpretable results. Conclusion: MinosEval is a novel and effective method for evaluating open-ended QA, offering improved alignment with human annotations and enhanced interpretability. Abstract: Open-ended question answering (QA) is a key task for evaluating the capabilities of large language models (LLMs). Compared to closed-ended QA, it demands longer answer statements, more nuanced reasoning processes, and diverse expressions, making refined and interpretable automatic evaluation both crucial and challenging. Traditional metrics like ROUGE and BERTScore struggle to capture semantic similarities due to different patterns between model responses and reference answers. Current LLM-based evaluation approaches, such as pairwise or listwise comparisons of candidate answers, lack intuitive interpretability. While pointwise scoring of each response provides some descriptions, it fails to adapt across different question contents. Most notably, existing methods overlook the distinction between factoid and non-factoid questions. To address these challenges, we propose MinosEval, a novel evaluation method that first distinguishes open-ended questions and then ranks candidate answers using different evaluation strategies. 
For factoid questions, it applies an adaptive key-point scoring strategy, while for non-factoid questions, it uses an instance-aware listwise ranking strategy. Experiments on multiple open-ended QA datasets, including self-built ones with more candidate responses to complement community resources, show that MinosEval better aligns with human annotations and offers more interpretable results.
[18] Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants
Jaione Bengoetxea,Itziar Gonzalez-Dios,Rodrigo Agerri
Main category: cs.CL
TL;DR: The study evaluates current language technologies' understanding of Basque and Spanish linguistic variations through Natural Language Inference (NLI), revealing a performance drop when dealing with these variations, especially in Basque. Encoder-only models show particular difficulty with Western Basque.
Details
Motivation: To assess the capacity of current language technologies to understand Basque and Spanish language varieties using Natural Language Inference (NLI) as a pivot task. Method: Introduced a novel, manually-curated parallel dataset in Basque and Spanish along with their variants. Conducted empirical analysis of crosslingual and in-context learning experiments using encoder-only and decoder-based Large Language Models (LLMs). Performed error and ablation analyses. Result: Showed a performance drop when handling linguistic variation, particularly in Basque. Found that the decline is due to linguistic variation itself rather than lexical overlap. Identified that encoder-only models struggle more with Western Basque. Conclusion: Current language technologies face challenges in understanding linguistic variations, particularly peripheral dialects like Western Basque. Data and code are publicly available. Abstract: In this paper, we evaluate the capacity of current language technologies to understand Basque and Spanish language varieties. We use Natural Language Inference (NLI) as a pivot task and introduce a novel, manually-curated parallel dataset in Basque and Spanish, along with their respective variants. Our empirical analysis of crosslingual and in-context learning experiments using encoder-only and decoder-based Large Language Models (LLMs) shows a performance drop when handling linguistic variation, especially in Basque. Error analysis suggests that this decline is not due to lexical overlap, but rather to the linguistic variation itself. Further ablation experiments indicate that encoder-only models particularly struggle with Western Basque, which aligns with linguistic theory that identifies peripheral dialects (e.g., Western) as more distant from the standard. All data and code are publicly available.
[19] Research on Graph-Retrieval Augmented Generation Based on Historical Text Knowledge Graphs
Yang Fan,Zhang Qi,Xing Wenqian,Liu Chang,Liu Liu
Main category: cs.CL
TL;DR: This paper proposes the Graph RAG framework which combines various techniques to create a character relationship dataset for historical texts, reducing manual labor and enhancing historical knowledge extraction. Experiments show that models using this framework improve performance in relation extraction and alleviate hallucinations, offering a low-resource solution for classical text knowledge extraction.
Details
Motivation: To address domain knowledge gaps in large language models for historical text analysis, providing a more efficient and accurate method for historical knowledge extraction. Method: Propose the Graph RAG framework that integrates chain-of-thought prompting, self-instruction generation, and process supervision to generate a character relationship dataset with minimal manual annotation. Introduce a collaborative mechanism between knowledge graphs and retrieval-augmented generation in the graph-augmented generation phase. Result: The Xunzi-Qwen1.5-14B model achieves optimal performance in relation extraction with an F1 score of 0.68. The DeepSeek model integrated with GraphRAG improves F1 by 11% on the C-CLUE relation extraction dataset, surpassing the F1 value of Xunzi-Qwen1.5-14B and effectively alleviating hallucinations. Conclusion: The Graph RAG framework provides a low-resource solution for classical text knowledge extraction, advancing historical knowledge services and humanities research. Abstract: This article addresses domain knowledge gaps in general large language models for historical text analysis in the context of computational humanities and AIGC technology. We propose the Graph RAG framework, combining chain-of-thought prompting, self-instruction generation, and process supervision to create a character relationship dataset for The First Four Histories with minimal manual annotation. This dataset supports automated historical knowledge extraction, reducing labor costs. In the graph-augmented generation phase, we introduce a collaborative mechanism between knowledge graphs and retrieval-augmented generation, improving the alignment of general models with historical knowledge. Experiments show that the domain-specific model Xunzi-Qwen1.5-14B, with Simplified Chinese input and chain-of-thought prompting, achieves optimal performance in relation extraction (F1 = 0.68). 
The DeepSeek model integrated with GraphRAG improves F1 by 11% (0.08-0.19) on the open-domain C-CLUE relation extraction dataset, surpassing the F1 value of Xunzi-Qwen1.5-14B (0.12), effectively alleviating hallucinations, and improving interpretability. This framework offers a low-resource solution for classical text knowledge extraction, advancing historical knowledge services and humanities research.
[20] TopClustRAG at SIGIR 2025 LiveRAG Challenge
Juli Bakagianni,John Pavlopoulos,Aristidis Likas
Main category: cs.CL
TL;DR: The paper introduces TopClustRAG, a RAG system for question answering over large web corpora. It uses hybrid retrieval and clustering to enhance answer diversity, relevance, and faithfulness.
Details
Motivation: To improve the performance of question answering systems over large-scale web corpora by enhancing answer diversity, relevance, and faithfulness. Method: TopClustRAG employs a hybrid retrieval strategy combining sparse and dense indices, K-Means clustering to group similar passages, and cluster-specific prompts for LLMs to generate answers which are then filtered, reranked, and synthesized. Result: On the FineWeb Sample-10BT dataset, TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the official leaderboard. Conclusion: Clustering-based context filtering and prompt aggregation is effective in large-scale RAG systems. Abstract: We present TopClustRAG, a retrieval-augmented generation (RAG) system developed for the LiveRAG Challenge, which evaluates end-to-end question answering over large-scale web corpora. Our system employs a hybrid retrieval strategy combining sparse and dense indices, followed by K-Means clustering to group semantically similar passages. Representative passages from each cluster are used to construct cluster-specific prompts for a large language model (LLM), generating intermediate answers that are filtered, reranked, and finally synthesized into a single, comprehensive response. This multi-stage pipeline enhances answer diversity, relevance, and faithfulness to retrieved evidence. Evaluated on the FineWeb Sample-10BT dataset, TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the official leaderboard, demonstrating the effectiveness of clustering-based context filtering and prompt aggregation in large-scale RAG systems.
[21] Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments
Sungen Hahm,Heejin Kim,Gyuseong Lee,Hyunji Park,Jaejin Lee
Main category: cs.CL
TL;DR: To address the challenges in de-identifying court judgments, this paper proposes Thunder-DeID, a framework that includes constructing a Korean legal dataset, categorizing PII, and developing a DNN-based de-identification pipeline.
Details
Motivation: The current de-identification process for court judgments is inadequate for handling large volumes while adhering to strict legal requirements. Legal definitions of personal identifiers are vague, making it hard to implement technical solutions. Method: The authors propose Thunder-DeID which involves (i) constructing and releasing the first Korean legal dataset with annotated judgments, (ii) introducing a systematic categorization of PII, and (iii) developing an end-to-end DNN-based de-identification pipeline. Result: Experimental results show that the proposed model achieves state-of-the-art performance in de-identifying court judgments. Conclusion: Thunder-DeID provides an effective solution for de-identifying court judgments while aligning with relevant laws and practices. Abstract: To ensure a balance between open access to justice and personal data protection, the South Korean judiciary mandates the de-identification of court judgments before they can be publicly disclosed. However, the current de-identification process is inadequate for handling court judgments at scale while adhering to strict legal requirements. Additionally, the legal definitions and categorizations of personal identifiers are vague and not well-suited for technical solutions. To tackle these challenges, we propose a de-identification framework called Thunder-DeID, which aligns with relevant laws and practices. Specifically, we (i) construct and release the first Korean legal dataset containing annotated judgments along with corresponding lists of entity mentions, (ii) introduce a systematic categorization of Personally Identifiable Information (PII), and (iii) develop an end-to-end deep neural network (DNN)-based de-identification pipeline. Our experimental results demonstrate that our model achieves state-of-the-art performance in the de-identification of court judgments.
[22] Cohort Discovery: A Survey on LLM-Assisted Clinical Trial Recruitment
Shrestha Ghosh,Moritz Schneider,Carina Reinicke,Carsten Eickhoff
Main category: cs.CL
TL;DR: Recent advances in LLMs have improved general-domain NLP tasks, but their use in clinical trial recruitment is limited. This survey analyzes the task of trial-patient matching and discusses emerging LLM-based approaches, existing benchmarks, challenges, and future directions.
Details
Motivation: The motivation for this paper is to address the limitation of current LLM applications in critical domains like clinical trial recruitment. It aims to explore how LLMs can be utilized more effectively in matching trials with patients by leveraging their knowledge aggregation and reasoning capabilities. Method: The method involves a comprehensive survey and analysis of the task of trial-patient matching within the context of clinical trial recruitment. The authors examine existing benchmarks, approaches, evaluation frameworks, and the challenges associated with adopting LLM technologies in clinical research. Result: The result of this survey is a detailed contextualization of emerging LLM-based approaches in clinical trial recruitment, along with an assessment of current benchmarks and identification of challenges and future directions. Conclusion: The conclusion highlights the potential of LLMs in revolutionizing clinical trial recruitment through effective trial-patient matching while acknowledging the need for better benchmarks and evaluation frameworks to fully realize this potential. Abstract: Recent advances in LLMs have greatly improved general-domain NLP tasks. Yet, their adoption in critical domains, such as clinical trial recruitment, remains limited. As trials are designed in natural language and patient data is represented as both structured and unstructured text, the task of matching trials and patients benefits from knowledge aggregation and reasoning abilities of LLMs. Classical approaches are trial-specific and LLMs with their ability to consolidate distributed knowledge hold the potential to build a more general solution. Yet recent applications of LLM-assisted methods rely on proprietary models and weak evaluation benchmarks. In this survey, we are the first to analyze the task of trial-patient matching and contextualize emerging LLM-based approaches in clinical trial recruitment. 
We critically examine existing benchmarks, approaches and evaluation frameworks, the challenges to adopting LLM technologies in clinical research and exciting future directions.
[23] ConLID: Supervised Contrastive Learning for Low-Resource Language Identification
Negar Foroutan,Jakhongir Saydaliev,Ye Eun Kim,Antoine Bosselut
Main category: cs.CL
TL;DR: The paper proposes a novel supervised contrastive learning (SCL) approach to improve language identification (LID) performance for low-resource languages on out-of-domain data.
Details
Motivation: Low-resource languages perform poorly in LID due to limited and single-domain data, which leads to class imbalance and bias issues. Method: A novel supervised contrastive learning (SCL) approach is proposed to learn domain-invariant representations for low-resource languages. Result: The SCL approach improves LID performance on out-of-domain data for low-resource languages by 3.2%. Conclusion: The proposed SCL approach demonstrates effectiveness in enhancing LID models for low-resource languages. Abstract: Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages -- often limited to single-domain data, such as the Bible -- continue to perform poorly. To resolve these class imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. Through an extensive analysis, we show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2%, demonstrating its effectiveness in enhancing LID models.
[24] DeVisE: Behavioral Testing of Medical Large Language Models
Camila Zurdo Tagliabue,Heloisa Oss Boll,Aykut Erdem,Erkut Erdem,Iacer Calixto
Main category: cs.CL
TL;DR: Large language models (LLMs) are used in clinical decision support, but current evaluation methods often fail to distinguish genuine medical reasoning from superficial patterns. The authors introduce DeVisE, a behavioral testing framework for probing fine-grained clinical understanding.
Details
Motivation: To address the limitation of current evaluation methods that fail to distinguish genuine medical reasoning from superficial patterns in LLMs used for clinical decision support. Method: Construct a dataset of ICU discharge notes from MIMIC-IV with both raw and template-based versions containing controlled single-variable counterfactuals targeting demographic and vital sign attributes. Evaluate five LLMs under zero-shot and fine-tuned settings by assessing input-level sensitivity and downstream reasoning. Result: Zero-shot models show more coherent counterfactual reasoning patterns while fine-tuned models tend to be more stable yet less responsive to clinically meaningful changes. Demographic factors subtly but consistently influence outputs. Conclusion: Behavioral testing like DeVisE can expose the reasoning strategies of clinical LLMs and inform the design of safer, more transparent medical AI systems. Abstract: Large language models (LLMs) are increasingly used in clinical decision support, yet current evaluation methods often fail to distinguish genuine medical reasoning from superficial patterns. We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework for probing fine-grained clinical understanding. We construct a dataset of ICU discharge notes from MIMIC-IV, generating both raw (real-world) and template-based (synthetic) versions with controlled single-variable counterfactuals targeting demographic (age, gender, ethnicity) and vital sign attributes. We evaluate five LLMs spanning general-purpose and medically fine-tuned variants, under both zero-shot and fine-tuned settings. We assess model behavior via (1) input-level sensitivity - how counterfactuals alter the likelihood of a note; and (2) downstream reasoning - how they affect predicted hospital length-of-stay. 
Our results show that zero-shot models exhibit more coherent counterfactual reasoning patterns, while fine-tuned models tend to be more stable yet less responsive to clinically meaningful changes. Notably, demographic factors subtly but consistently influence outputs, emphasizing the importance of fairness-aware evaluation. This work highlights the utility of behavioral testing in exposing the reasoning strategies of clinical LLMs and informing the design of safer, more transparent medical AI systems.
[25] SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models' Knowledge of Indian Culture
Arijit Maji,Raghvendra Kumar,Akash Ghosh,Anushka,Sriparna Saha
Main category: cs.CL
TL;DR: The paper introduces SANSKRITI, a benchmark dataset with 21,853 question-answer pairs to evaluate language models' understanding of Indian cultural diversity across sixteen attributes. It assesses various language models, revealing disparities in handling culturally nuanced queries.
Details
Motivation: There is a need for language models to effectively understand local socio-cultural contexts beyond global effectiveness. India's rich cultural diversity requires specific evaluation. Method: Developed SANSKRITI, a large dataset with question-answer pairs covering 28 states and 8 union territories across sixteen cultural attributes. Used it to evaluate Large Language Models (LLMs), Indic Language Models (ILMs), and Small Language Models (SLMs). Result: Significant disparities were found among the evaluated models in their ability to handle culturally nuanced queries, especially in region-specific contexts. Conclusion: SANSKRITI sets a new standard for assessing and improving cultural understanding in language models through its extensive, culturally rich, and diverse dataset. Abstract: Language Models (LMs) are indispensable tools shaping modern workflows, but their global effectiveness depends on understanding local socio-cultural contexts. To address this, we introduce SANSKRITI, a benchmark designed to evaluate language models' comprehension of India's rich cultural diversity. Comprising 21,853 meticulously curated question-answer pairs spanning 28 states and 8 union territories, SANSKRITI is the largest dataset for testing Indian cultural knowledge. It covers sixteen key attributes of Indian culture: rituals and ceremonies, history, tourism, cuisine, dance and music, costume, language, art, festivals, religion, medicine, transport, sports, nightlife, and personalities, providing a comprehensive representation of India's cultural tapestry. We evaluate SANSKRITI on leading Large Language Models (LLMs), Indic Language Models (ILMs), and Small Language Models (SLMs), revealing significant disparities in their ability to handle culturally nuanced queries, with many models struggling in region-specific contexts. 
By offering an extensive, culturally rich, and diverse dataset, SANSKRITI sets a new standard for assessing and improving the cultural understanding of LMs.
[26] COSMMIC: Comment-Sensitive Multimodal Multilingual Indian Corpus for Summarization and Headline Generation
Raghvendra Kumar,S. A. Mohammed Salman,Aryan Sahu,Tridib Nandi,Pragathi Y. P.,Sriparna Saha,Jose G. Moreno
Main category: cs.CL
TL;DR: The paper presents COSMMIC, a new dataset for comment-aware multimodal and multilingual summarization in nine Indian languages. It includes 4,959 article-image pairs and 24,484 comments with ground-truth summaries. The study explores four configurations of summarization using text, comments, images or all combined, employing models like LLama3 and GPT-4 to identify the best setup for NLG tasks.
Details
Motivation: To address the lack of research in comment-aware multimodal and multilingual summarization for Indian languages and bridge gaps in Indian language resources. Method: Creating the COSMMIC dataset with article-image-comment-summary quadruples in nine Indian languages; exploring four configurations for summarization (text alone, text+comments, text+images, text+comments+images); using state-of-the-art models for evaluation and IndicBERT for comment classification. Result: COSMMIC is introduced as a unique resource integrating text, images, and user feedback for Indian languages. Experiments show that incorporating comments and images can enhance summarization quality, identifying supportive comments and filtering noise effectively. Conclusion: COSMMIC provides a valuable resource for advancing NLP research in Indian languages by incorporating multimodal and multilingual aspects with user feedback, promoting inclusivity. Abstract: Despite progress in comment-aware multimodal and multilingual summarization for English and Chinese, research in Indian languages remains limited. This study addresses this gap by introducing COSMMIC, a pioneering comment-sensitive multimodal, multilingual dataset featuring nine major Indian languages. COSMMIC comprises 4,959 article-image pairs and 24,484 reader comments, with ground-truth summaries available in all included languages. Our approach enhances summaries by integrating reader insights and feedback. We explore summarization and headline generation across four configurations: (1) using article text alone, (2) incorporating user comments, (3) utilizing images, and (4) combining text, comments, and images. To assess the dataset's effectiveness, we employ state-of-the-art language models such as LLama3 and GPT-4. 
We conduct a comprehensive study to evaluate different component combinations, including identifying supportive comments, filtering out noise using a dedicated comment classifier using IndicBERT, and extracting valuable insights from images with a multilingual CLIP-based classifier. This helps determine the most effective configurations for natural language generation (NLG) tasks. Unlike many existing datasets that are either text-only or lack user comments in multimodal settings, COSMMIC uniquely integrates text, images, and user feedback. This holistic approach bridges gaps in Indian language resources, advancing NLP research and fostering inclusivity.
[27] Targeted Lexical Injection: Unlocking Latent Cross-Lingual Alignment in Lugha-Llama via Early-Layer LoRA Fine-Tuning
Stanley Ngugi
Main category: cs.CL
TL;DR: Large Language Models often perform poorly in low-resource languages due to insufficient data. This paper presents Targeted Lexical Injection, a fine-tuning method that significantly enhances lexical alignment for Swahili-English word pairs in a Swahili-centric LLM.
Details
Motivation: To address the issue of poor performance of Large Language Models in low-resource languages such as Swahili, caused by data scarcity and underrepresentation in pre-training. Method: Introduce Targeted Lexical Injection (TLI), a fine-tuning approach using Low-Rank Adaptation and a contrastive learning objective to specifically target embeddings from an optimal early layer where strong lexical alignment is observed. Result: TLI improves output-level lexical alignment for both trained and unseen Swahili-English word pairs, increasing average cosine similarity by about 28% in each case. Conclusion: Targeted Lexical Injection enhances the model's ability to preserve and propagate its inherent cross-lingual knowledge, providing an efficient strategy to improve lexical alignment in low-resource language-focused LLMs. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their performance in low-resource languages (LRLs), such as Swahili, often lags due to data scarcity and underrepresentation in pre-training. A key challenge is achieving robust cross-lingual lexical alignment, crucial for tasks like translation and cross-lingual information retrieval. This paper introduces Targeted Lexical Injection (TLI), a novel and efficient fine-tuning approach. We first demonstrate that Lugha-Llama-8B-wura, a Swahili-centric LLM, exhibits strong, near-perfect lexical alignment for Swahili-English word pairs in its early internal layers (specifically Layer 2, with ~0.99998 average cosine similarity based on a pilot study), a capability not fully reflected in its final output representations (baseline ~0.32 similarity on our evaluation set). TLI leverages this insight by using Low-Rank Adaptation (LoRA) and a contrastive learning objective to fine-tune the model, specifically targeting embeddings from this empirically identified optimal early layer. 
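The alignment figures quoted in this entry are average cosine similarities over word-pair embeddings. A minimal sketch of that measure follows, assuming each word pair is already represented as two plain lists of floats. The helper names and the toy vectors are hypothetical; extracting embeddings from a particular layer of the model is not shown.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pair_similarity(pairs):
    """Average cosine similarity over (source, target) embedding pairs."""
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)

# Toy stand-ins for Swahili-English embedding pairs (not real model outputs):
# one perfectly aligned pair and one orthogonal pair.
toy_pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
mean_similarity = mean_pair_similarity(toy_pairs)
```

A value near 1 means the paired embeddings point in almost the same direction, and a value near 0 means they are close to orthogonal.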
Our experiments show that TLI significantly improves the output-level lexical alignment for 623 trained Swahili-English word pairs, increasing average cosine similarity from 0.3211 to 0.4113 (+28.08%, p < 1.33 x 10^-240). More importantly, these improvements generalize remarkably well to 63 unseen control word pairs, with similarity increasing from 0.3143 to 0.4033 (+28.32%, p < 7.17 x 10^-27). These findings suggest TLI enhances the model's ability to preserve and propagate its inherent early-layer cross-lingual knowledge, offering a parameter-efficient and effective strategy for improving lexical alignment in LRL-focused LLMs.
[28] Understanding GUI Agent Localization Biases through Logit Sharpness
Xingjian Tao,Yiwei Wang,Yujun Cai,Zhicheng Yang,Jing Tang
Main category: cs.CL
TL;DR: Multimodal large language models (MLLMs) have enabled GUI agents to interact with operating systems by grounding language into spatial actions, but they frequently exhibit hallucinations. The authors propose a fine-grained evaluation framework that categorizes model predictions into four distinct types, introduce the Peak Sharpness Score (PSS) to better quantify model uncertainty, and propose Context-Aware Cropping to improve model performance.
Details
Motivation: Hallucinations in multimodal large language models (MLLMs) compromise the reliability of GUI agents that interact with operating systems. Method: The authors propose a fine-grained evaluation framework that categorizes model predictions into four distinct types, introduce the Peak Sharpness Score (PSS) to evaluate the alignment between semantic continuity and logits distribution in coordinate prediction, and propose Context-Aware Cropping, a training-free technique that improves model performance by adaptively refining input context. Result: Extensive experiments demonstrate that the proposed framework and methods provide actionable insights and enhance the interpretability and robustness of GUI agent behavior. Conclusion: The proposed evaluation framework and methods help address hallucinations in MLLMs, providing deeper insights and improving the reliability of GUI agent interactions. Abstract: Multimodal large language models (MLLMs) have enabled GUI agents to interact with operating systems by grounding language into spatial actions. Despite their promising performance, these models frequently exhibit hallucinations, systematic localization errors that compromise reliability. We propose a fine-grained evaluation framework that categorizes model predictions into four distinct types, revealing nuanced failure modes beyond traditional accuracy metrics. To better quantify model uncertainty, we introduce the Peak Sharpness Score (PSS), a metric that evaluates the alignment between semantic continuity and logits distribution in coordinate prediction. Building on this insight, we further propose Context-Aware Cropping, a training-free technique that improves model performance by adaptively refining input context. 
Extensive experiments demonstrate that our framework and methods provide actionable insights and enhance the interpretability and robustness of GUI agent behavior.
[29] AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent System Need
Zhouhong Gu,Xiaoxuan Zhu,Yin Cai,Hao Shen,Xingzhou Chen,Qingyi Wang,Jialin Li,Xiaoran Shi,Haoran Guo,Wenxuan Huang,Hongwei Feng,Yanghua Xiao,Zheyu Ye,Yao Hu,Shaosheng Cao
Main category: cs.CL
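TL;DR: AgentGroupChat-V2 is a new framework that addresses the challenges current multi-agent systems face in architecture design, cross-domain generalizability, and performance guarantees through a divide-and-conquer fully parallel architecture, an adaptive collaboration engine, and agent organization optimization strategies.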
Details
Motivation: Existing LLM-based multi-agent systems show great potential in social simulation and complex task resolution, but face critical challenges in system architecture design, cross-domain generalizability, and performance guarantees, particularly as task complexity and the number of agents increase. Method: Proposes three core innovations: (1) a divide-and-conquer fully parallel architecture that decomposes user queries into hierarchical task forest structures; (2) an adaptive collaboration engine that dynamically selects heterogeneous LLM combinations and interaction modes based on task characteristics; (3) agent organization optimization strategies that combine divide-and-conquer approaches for efficient problem decomposition. Result: Extensive experiments show that AgentGroupChat-V2 performs strongly across multiple domains, reaching 91.50% accuracy on GSM8K, 30.4% accuracy on AIME, and 79.20% pass@1 on HumanEval. Its advantage grows on harder tasks, for example improvements of more than 11 percentage points on Level 5 MATH problems. Conclusion: AgentGroupChat-V2 provides a comprehensive solution for building efficient, general-purpose LLM multi-agent systems, with significant advantages in complex reasoning scenarios. Abstract: Large language model based multi-agent systems have demonstrated significant potential in social simulation and complex task resolution domains. However, current frameworks face critical challenges in system architecture design, cross-domain generalizability, and performance guarantees, particularly as task complexity and the number of agents increase. We introduce AgentGroupChat-V2, a novel framework addressing these challenges through three core innovations: (1) a divide-and-conquer fully parallel architecture that decomposes user queries into hierarchical task forest structures enabling dependency management and distributed concurrent processing; (2) an adaptive collaboration engine that dynamically selects heterogeneous LLM combinations and interaction modes based on task characteristics; and (3) agent organization optimization strategies combining divide-and-conquer approaches for efficient problem decomposition. Extensive experiments demonstrate AgentGroupChat-V2's superior performance across diverse domains, achieving 91.50% accuracy on GSM8K (exceeding the best baseline by 5.6 percentage points), 30.4% accuracy on competition-level AIME (nearly doubling other methods), and 79.20% pass@1 on HumanEval. Performance advantages become increasingly pronounced with higher task difficulty, particularly on Level 5 MATH problems where improvements exceed 11 percentage points compared to state-of-the-art baselines. 
These results confirm that AgentGroupChat-V2 provides a comprehensive solution for building efficient, general-purpose LLM multi-agent systems with significant advantages in complex reasoning scenarios. Code is available at https://github.com/MikeGu721/AgentGroupChat-V2.
[30] RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation
Xinnuo Xu,Rachel Lawrence,Kshitij Dubey,Atharva Pandey,Risa Ueno,Fabian Falck,Aditya V. Nori,Rahul Sharma,Amit Sharma,Javier Gonzalez
Main category: cs.CL
TL;DR: Recent Large Language Models (LLMs) show high accuracy on reasoning benchmarks, but it's unclear if this is due to true reasoning or statistical recall. This paper introduces RE-IMAGINE, a framework that characterizes the hierarchy of reasoning ability in LLMs and generates problem variations to test these abilities.
Details
Motivation: To understand whether the high accuracy of LLMs on reasoning benchmarks is due to true reasoning capabilities or merely statistical recall of the training set. Method: Introduced RE-IMAGINE, a framework that uses the ladder of causation's three levels (associations, interventions, and counterfactuals) to characterize a hierarchy of reasoning ability in LLMs. It also includes an automated pipeline to generate problem variations at different levels of the hierarchy by altering problems in an intermediate symbolic representation. Result: Observed reductions in performance when several families of LLMs were queried with problem variations generated by RE-IMAGINE, indicating reliance on statistical recall for past performance. Conclusion: The assessments open the door to further research targeting skills across the reasoning hierarchy. Abstract: Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true reasoning or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. 
These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.
[31] Context-Informed Grounding Supervision
Hyunji Lee,Seunghyun Yoon,Yunjae Won,Hanseok Oh,Geewook Kim,Trung Bui,Franck Dernoncourt,Elias Stengel-Eskin,Mohit Bansal,Minjoon Seo
Main category: cs.CL
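TL;DR: Large language models (LLMs) are often supplemented with external knowledge at inference time to provide information or reduce hallucination. This paper proposes CINGS, a post-training supervision method that prepends relevant context to the response and computes the loss only over the response tokens, so that generation is grounded in the provided external context. Experiments show that CINGS yields stronger grounding in both textual and visual domains without degrading general downstream performance.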
Details
Motivation: Models are expected to ground their responses in the provided external context, yet simply appending context at inference time does not ensure grounded generation. Method: Proposes Context-INformed Grounding Supervision (CINGS), a post-training supervision method in which the model is trained with relevant context prepended to the response, while the loss is computed only over the response tokens and the context is masked out. Result: In the text domain, CINGS outperforms other training methods across 11 information-seeking datasets and is complementary to inference-time grounding techniques. In the vision-language domain, replacing a vision-language model's LLM backbone with a CINGS-trained model reduces hallucinations across four benchmarks and maintains factual consistency throughout the generated response. Conclusion: CINGS improves grounding in both textual and visual domains without degrading general downstream performance. Analysis shows that CINGS shifts the model's prior knowledge and behavior, implicitly encouraging greater reliance on the external context. Abstract: Large language models (LLMs) are often supplemented with external knowledge to provide information not encoded in their parameters or to reduce hallucination. In such cases, we expect the model to generate responses by grounding its response in the provided external context. However, prior work has shown that simply appending context at inference time does not ensure grounded generation. To address this, we propose Context-INformed Grounding Supervision (CINGS), a post-training supervision in which the model is trained with relevant context prepended to the response, while computing the loss only over the response tokens and masking out the context. Our experiments demonstrate that models trained with CINGS exhibit stronger grounding in both textual and visual domains compared to standard instruction-tuned models. In the text domain, CINGS outperforms other training methods across 11 information-seeking datasets and is complementary to inference-time grounding techniques. In the vision-language domain, replacing a vision-language model's LLM backbone with a CINGS-trained model reduces hallucinations across four benchmarks and maintains factual consistency throughout the generated response. This improved grounding comes without degradation in general downstream performance. 
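The loss masking described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the helper names (`build_example`, `masked_nll`), the toy token ids, and the use of -100 as the ignore label are assumptions made for the example, and the sketch models only the context and the response.

```python
IGNORE_INDEX = -100  # assumed marker for positions excluded from the loss


def build_example(context_ids, response_ids):
    """Prepend context tokens to the response and mask them in the labels."""
    input_ids = list(context_ids) + list(response_ids)
    # Context positions get IGNORE_INDEX so they contribute no loss;
    # response positions keep their token ids as training targets.
    labels = [IGNORE_INDEX] * len(context_ids) + list(response_ids)
    return input_ids, labels


def masked_nll(token_log_probs, labels):
    """Mean negative log-likelihood over unmasked (response) positions.

    token_log_probs[i] is the log-probability the model assigned to the
    correct token at position i.
    """
    kept = [-lp for lp, lab in zip(token_log_probs, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)
```

With this setup the model still sees the context as input, but gradients come only from how well it predicts the response, which is the property the abstract attributes to CINGS.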
Finally, we analyze the mechanism underlying the enhanced grounding in CINGS and find that it induces a shift in the model's prior knowledge and behavior, implicitly encouraging greater reliance on the external context.
[32] SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
Md Imbesat Hassan Rizvi,Xiaodan Zhu,Iryna Gurevych
Main category: cs.CL
TL;DR: SPARE is a new framework for single-pass, step-wise annotation that improves reasoning performance and efficiency in LLMs.
Details
Motivation: Process or step-wise supervision is vital for advancing multi-step reasoning capabilities of LLMs, but efficient, high-quality automated process annotation remains challenging. Method: SPARE aligns each solution step to one or multiple steps in a reference solution with explicit reasoning for evaluation. It enables single-pass, per-step annotation. Result: SPARE improves reasoning performance when used for fine-tuning models in an offline RL setup and training reward models for ranking/aggregating LLM-generated outputs. It achieves competitive performance on mathematical datasets while being 2.6 times more efficient than tree search-based automatic annotation. Conclusion: SPARE offers a novel approach to process supervision that enhances both the quality and efficiency of annotations for LLMs. Abstract: Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution, accompanied by explicit reasoning for evaluation. We show that reference-guided step-level evaluation effectively facilitates process supervision on four datasets spanning three domains: mathematical reasoning, multi-hop compositional question answering, and spatial reasoning. We demonstrate that SPARE, when compared to baselines, improves reasoning performance when used for: (1) fine-tuning models in an offline RL setup for inference-time greedy-decoding, and (2) training reward models for ranking/aggregating multiple LLM-generated outputs. 
Additionally, SPARE achieves competitive performance on challenging mathematical datasets while offering 2.6 times greater efficiency, requiring only 38% of the runtime, compared to tree search-based automatic annotation. The codebase, along with a trained SPARE-PRM model, is publicly released to facilitate further research and reproducibility.
[33] Enhancing Hyperbole and Metaphor Detection with Their Bidirectional Dynamic Interaction and Emotion Knowledge
Li Zheng,Sihang Wang,Hao Fei,Zuquan Peng,Fei Li,Jianming Fu,Chong Teng,Donghong Ji
Main category: cs.CL
TL;DR: The paper presents EmoBi, an emotion-guided framework for hyperbole and metaphor detection using bidirectional dynamic interaction. It outperforms baselines with a 28.1% F1 score increase for hyperbole and 23.1% for metaphor detection.
Details
Motivation: Hyperbole and metaphor detection is crucial in NLP but challenging due to semantic obscurity and expressive diversity. Current methods focus on superficial text features, ignoring associations between these devices and the role of implicit emotion. Method: EmoBi consists of three modules: an emotion analysis module to mine connotations, a domain mapping module for identifying target and source domains, and a bidirectional dynamic interaction module for mutual promotion. A verification mechanism ensures accuracy and reliability. Result: Experiments show EmoBi outperforms all baseline methods across four datasets, achieving significant F1 score improvements over the current SoTA. Conclusion: EmoBi demonstrates effectiveness in advancing hyperbole and metaphor detection, offering potential for further improvements in this area. Abstract: Text-based hyperbole and metaphor detection are of great significance for natural language processing (NLP) tasks. However, due to their semantic obscurity and expressive diversity, it is rather challenging to identify them. Existing methods mostly focus on superficial text features, ignoring the associations of hyperbole and metaphor as well as the effect of implicit emotion on perceiving these rhetorical devices. To implement these hypotheses, we propose an emotion-guided hyperbole and metaphor detection framework based on bidirectional dynamic interaction (EmoBi). Firstly, the emotion analysis module deeply mines the emotion connotations behind hyperbole and metaphor. Next, the emotion-based domain mapping module identifies the target and source domains to gain a deeper understanding of the implicit meanings of hyperbole and metaphor. Finally, the bidirectional dynamic interaction module enables the mutual promotion between hyperbole and metaphor. Meanwhile, a verification mechanism is designed to ensure detection accuracy and reliability. Experiments show that EmoBi outperforms all baseline methods on four datasets. 
Specifically, compared to the current SoTA, the F1 score increased by 28.1% for hyperbole detection on the TroFi dataset and 23.1% for metaphor detection on the HYPO-L dataset. These results, underpinned by in-depth analyses, underscore the effectiveness and potential of our approach for advancing hyperbole and metaphor detection.
[34] Lessons from Training Grounded LLMs with Verifiable Rewards
Shang Hong Sim,Tej Deep Pala,Vernon Toh,Hai Leong Chieu,Amir Zadeh,Chuan Li,Navonil Majumder,Soujanya Poria
Main category: cs.CL
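TL;DR: Generating grounded and trustworthy responses remains a challenge for large language models (LLMs). This paper explores how reinforcement learning and internal reasoning can enhance grounding in LLMs, trains models with GRPO, and shows through comprehensive experiments that reasoning-augmented models outperform instruction-only variants. A two-stage training setup further improves grounding. The paper also revisits instruction tuning via GPT-4 distillation and finds that combining it with GRPO improves performance on long-form, generative QA tasks. Overall, the study highlights the value of reasoning, stage-wise optimization, and outcome-driven reinforcement learning.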
Details
Motivation: Although retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models still frequently fail in straightforward scenarios, such as missing explicitly stated answers, citing incorrectly, or refusing when evidence is available. This motivates exploring reinforcement learning and internal reasoning to enhance grounding in LLMs. Method: Models are trained with GRPO using verifiable outcome-based rewards that target answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations. Training proceeds in two stages: first optimizing answer and citation behavior, then refusal behavior. Instruction tuning via GPT-4 distillation is also revisited and combined with GRPO. Result: Reasoning-augmented models significantly outperform instruction-only variants on ASQA, QAMPARI, ELI5, and ExpertQA, especially in handling unanswerable queries and generating well-cited responses. The staged training setup stabilizes the learning signal and further improves grounding. Combining GRPO with instruction tuning improves performance on long-form, generative QA tasks. Conclusion: The findings show that reasoning, stage-wise optimization, and outcome-driven reinforcement learning are valuable for building more verifiable and reliable LLMs. Abstract: Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs). While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available. In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs. We use the GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations. Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA we show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses. A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal. Additionally, we revisit instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance on long-form, generative QA tasks. Overall, our findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs.
[35] RATTENTION: Towards the Minimal Sliding Window Size in Local-Global Attention Models
Bailin Wang,Chang Lan,Chong Wang,Ruoming Pang
Main category: cs.CL
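TL;DR: RATTENTION is an improved local-global attention model that uses a linear attention mechanism to handle out-of-window information, achieving a better performance-efficiency tradeoff in short-context scenarios. With a window size of 512 it matches full-attention models while maintaining training efficiency.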
Details
Motivation: Current local-global attention models face a performance-efficiency tradeoff in choosing the window size: smaller windows improve efficiency but can hurt performance, because local attention completely ignores tokens outside the window. This work aims to address that intrinsic limitation. Method: Proposes RATTENTION, which combines local attention with a specially designed linear attention mechanism that captures information from out-of-window tokens. The approach is validated through pretraining experiments, and the recurrent nature of linear attention improves long-context performance. Result: RATTENTION shows a better performance-efficiency tradeoff in pretraining at the 3B and 12B scales; with a window size of 512 it matches full-attention models; it performs well on the RULER benchmark without sacrificing training efficiency. Conclusion: RATTENTION shifts the Pareto tradeoff of local-global models, delivering efficiency gains in short-context scenarios while keeping training speed comparable to existing state-of-the-art approaches. Abstract: Local-global attention models have recently emerged as compelling alternatives to standard Transformers, promising improvements in both training and inference efficiency. However, the crucial choice of window size presents a Pareto tradeoff: larger windows maintain performance akin to full attention but offer minimal efficiency gains in short-context scenarios, while smaller windows can lead to performance degradation. Current models, such as Gemma2 and Mistral, adopt conservative window sizes (e.g., 4096 out of an 8192 pretraining length) to preserve performance. This work investigates strategies to shift this Pareto frontier, enabling local-global models to achieve efficiency gains even in short-context regimes. Our core motivation is to address the intrinsic limitation of local attention: its complete disregard for tokens outside the defined window. We explore RATTENTION, a variant of local attention integrated with a specialized linear attention mechanism designed to capture information from these out-of-window tokens. Pretraining experiments at the 3B and 12B scales demonstrate that RATTENTION achieves a superior Pareto tradeoff between performance and efficiency. As a sweet spot, RATTENTION with a window size of just 512 consistently matches the performance of full-attention models across diverse settings. Furthermore, the recurrent nature inherent in the linear attention component of RATTENTION contributes to enhanced long-context performance, as validated on the RULER benchmark. 
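As a rough illustration of how a recurrent state can summarize tokens that have slid out of a local window, the following Python sketch uses a generic, unnormalized linear-attention recurrence. It is not the paper's specialized mechanism or kernel: the function names, the identity feature map, and the convention that a window of size w covers positions t-w+1 through t are all assumptions made for this example.

```python
def outer_add(state, key, value):
    """In place: state += outer(key, value)."""
    for i, k in enumerate(key):
        for j, v in enumerate(value):
            state[i][j] += k * v


def out_of_window_readout(keys, values, queries, window):
    """For each position t, read out information from tokens before the window.

    The local window is assumed to cover positions t-window+1 .. t, so the
    token at position t-window is folded into a fixed-size state at step t.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for t, q in enumerate(queries):
        leaving = t - window  # position that has just left the local window
        if leaving >= 0:
            outer_add(state, keys[leaving], values[leaving])
        # Readout q^T S: constant cost per step regardless of sequence length.
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs
```

The state has a fixed size, so the cost per token does not grow with the number of out-of-window tokens. In a full model this readout would be combined with ordinary softmax attention over the in-window tokens; how the two are combined is left out here.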
Crucially, these improvements do not compromise training efficiency; thanks to a specialized kernel implementation and the reduced window size, RATTENTION maintains training speeds comparable to existing state-of-the-art approaches.
[36] Approximating Language Model Training Data from Weights
John X. Morris,Junjie Oscar Yin,Woojeong Kim,Vitaly Shmatikov,Alexander M. Rush
Main category: cs.CL
TL;DR: Modern language models usually have open weights but closed training data. The researchers formalized the problem of data approximation from model weights, proposed several baselines and metrics, and developed a gradient-based approach that selects the highest-matching data from a large public text corpus. This method is effective at recovering useful data given only weights of the original and finetuned models.
Details
Motivation: To address the issue of approximating training data of modern language models from their open weights when the actual training data is not available. Method: A gradient-based approach was developed to select the highest-matching data from a large public text corpus. This method uses only the weights of the original and fine-tuned models to recover useful data. Result: The method significantly improved performance on the AG News classification task from 65% (using randomly selected data) to 80%, approaching the expert benchmark of 88%. When applied to a model trained with SFT on MSMARCO web documents, it reduced perplexity from 3.3 to 2.3, compared to an expert LLAMA model's perplexity of 2.0. Conclusion: The gradient-based approach can effectively approximate training data from model weights alone, even without any knowledge of the true training data, achieving close to original model performance. Abstract: Modern language models often have open weights but closed training data. We formalize the problem of data approximation from model weights and propose several baselines and metrics. We develop a gradient-based approach that selects the highest-matching data from a large public text corpus and show its effectiveness at recovering useful data given only weights of the original and finetuned models. Even when none of the true training data is known, our method is able to locate a small subset of public Web documents that can be used to train a model to close to the original model performance, given models trained for both classification and supervised finetuning. On the AG News classification task, our method improves performance from 65% (using randomly selected data) to 80%, approaching the expert benchmark of 88%. 
When applied to a model trained with SFT on MSMARCO web documents, our method reduces perplexity from 3.3 to 2.3, compared to an expert LLAMA model's perplexity of 2.0.
[37] PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction
Shufan Li,Aditya Grover
Main category: cs.CL
TL;DR: Predictive Generation (PredGen) is a novel framework that can reduce the latency of Large Language Models (LLMs) in real-time voice chat applications by generating candidate responses while the user is still speaking, which can effectively reduce the latency by around 2x across a wide range of use cases.
Details
Motivation: To solve the problem of noticeable latency between the end of user input and the start of audio output in real-time voice chat applications using LLMs. Method: Propose Predictive Generation (PredGen), a framework that uses speculative decoding at input time to generate candidate responses while the user is still speaking. Result: Simulated experiments on the Lmsys and MT-Bench datasets show that PredGen can effectively reduce the latency by around 2x across a wide range of use cases, with only minimal additional computation cost. Conclusion: PredGen is a promising solution to mitigate or even eliminate the delay in real-time voice chat applications using LLMs. Abstract: Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses. However, their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences. This latency is particularly evident when LLMs are deployed as single-user voice assistants on consumer-grade hardware with limited computing capacity. We discovered that this latency is primarily dominated by the time it takes for the LLMs to generate the first sentence, which is required as input by the TTS systems that synthesize audio responses on a sentence-by-sentence basis. To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates, or even eliminates, this delay through speculative decoding at input time. PredGen generates candidate responses while the user is still speaking, enabling the system to begin TTS processing with minimal delay. 
Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method can effectively reduce the latency by around 2x across a wide range of use cases, while incurring only minimal additional computation cost at input time, computation that would otherwise go unused.
[38] Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models
Zhengyang Shan,Emily Ruth Diana,Jiawei Zhou
Main category: cs.CL
TL;DR: A new metric called Gender Inclusivity Fairness Index (GIFI) is introduced to evaluate gender fairness in large language models (LLMs), considering both binary and non-binary genders. GIFI assesses various aspects of model behavior related to gender inclusivity.
Details
Motivation: Current evaluations of gender fairness in LLMs mainly focus on binary gender distinctions, lacking a comprehensive approach that considers non-binary genders as well. Method: The authors developed GIFI, which includes a wide range of evaluations from simple probing with gender pronouns to testing different aspects of model generation and cognitive behaviors under various gender assumptions. Result: Significant variations in gender inclusivity were found among 22 evaluated LLMs, indicating the need for improvement in this area. Conclusion: Improving gender inclusivity in LLMs is crucial, and GIFI provides a benchmark for future research and development in enhancing gender fairness in generative models. Abstract: We present a comprehensive evaluation of gender fairness in large language models (LLMs), focusing on their ability to handle both binary and non-binary genders. While previous studies primarily focus on binary gender distinctions, we introduce the Gender Inclusivity Fairness Index (GIFI), a novel and comprehensive metric that quantifies the diverse gender inclusivity of LLMs. GIFI consists of a wide range of evaluations at different levels, from simply probing the model with respect to provided gender pronouns to testing various aspects of model generation and cognitive behaviors under different gender assumptions, revealing biases associated with varying gender identifiers. We conduct extensive evaluations with GIFI on 22 prominent open-source and proprietary LLMs of varying sizes and capabilities, discovering significant variations in LLMs' gender inclusivity. Our study highlights the importance of improving LLMs' inclusivity, providing a critical benchmark for future advancements in gender fairness in generative models.
[39] SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification
Chengye Wang,Yifei Shen,Zexi Kuang,Arman Cohan,Yilun Zhao
Main category: cs.CL
TL;DR: The paper introduces SciVer, a benchmark for evaluating foundation models' ability to verify claims in multimodal scientific contexts. It includes 3,000 annotated examples across 1,113 papers and assesses 21 state-of-the-art models revealing significant performance gaps compared to human experts.
Details
Motivation: To evaluate and improve the performance of foundation models in verifying claims within multimodal scientific contexts, highlighting current limitations and providing directions for future advancements. Method: Developed SciVer, a benchmark with 3,000 expert-annotated examples from 1,113 scientific papers, covering four reasoning types. Evaluated 21 state-of-the-art multimodal models using this benchmark and conducted error analysis through retrieval-augmented generation (RAG) and human evaluations. Result: Significant performance gap exists between state-of-the-art models and human experts on SciVer, revealing critical limitations in open-source models. Conclusion: Current open-source models have notable shortcomings in comprehending and reasoning within multimodal scientific literature; addressing these limitations is essential for advancing model capabilities. Abstract: We introduce SciVer, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SciVer consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiment reveals a substantial performance gap between these models and human experts on SciVer. Through an in-depth analysis of retrieval-augmented generation (RAG) and human-conducted error evaluations, we identify critical limitations in current open-source models, offering key insights to advance models' comprehension and reasoning in multimodal scientific literature tasks.
[40] DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement
Shaoqing Lin,Chong Teng,Fei Li,Donghong Ji,Lizhen Qu,Zhuang Li
Main category: cs.CL
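TL;DR: To address the challenges of multi-sentence visual descriptions, the authors propose a new task, Discourse-level text Scene Graph parsing (DiscoSG), with an accompanying dataset, DiscoSG-DS, and develop the DiscoSG-Refiner model. The model generates scene graphs in two stages, offering faster inference than GPT-4 with solid performance gains, and also improves downstream VLM tasks.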
Details
Motivation: Existing text scene graph parsers are mainly designed for single-sentence captions. On multi-sentence visual descriptions they tend to produce fragmented graphs and lose cross-sentence coreference, which degrades downstream Vision-Language Model (VLM) task performance. Method: Proposes the new task Discourse-level text Scene Graph parsing (DiscoSG) and the dataset DiscoSG-DS, which contains multi-sentence descriptions paired with complex scene graphs. Introduces DiscoSG-Refiner, which uses one small PLM to draft a base graph and a second PLM to iteratively refine the graph structure, reducing full-graph generation overhead. Result: Fine-tuning a large PLM (such as GPT-4) on DiscoSG-DS improves SPICE by about 48%, but inference is costly and licensing is restrictive. DiscoSG-Refiner, using two Flan-T5-Base models, still improves SPICE by about 30% with inference 86 times faster than GPT-4, and also improves downstream VLM tasks. Conclusion: DiscoSG-Refiner offers an efficient way to generate complex scene graphs, significantly improving multi-sentence visual description parsing and downstream task performance; code and data are open source. Abstract: Vision-Language Models (VLMs) now generate discourse-level, multi-sentence visual descriptions, challenging text scene graph parsers originally designed for single-sentence caption-to-graph mapping. Current approaches typically merge sentence-level parsing outputs for discourse input, often missing phenomena like cross-sentence coreference, resulting in fragmented graphs and degraded downstream VLM task performance. To address this, we introduce a new task, Discourse-level text Scene Graph parsing (DiscoSG), supported by our dataset DiscoSG-DS, which comprises 400 expert-annotated and 8,430 synthesised multi-sentence caption-graph pairs for images. Each caption averages 9 sentences, and each graph contains at least 3 times more triples than those in existing datasets. While fine-tuning large PLMs (i.e., GPT-4) on DiscoSG-DS improves SPICE by approximately 48% over the best sentence-merging baseline, high inference cost and restrictive licensing hinder its open-source use, and smaller fine-tuned PLMs struggle with complex graphs. We propose DiscoSG-Refiner, which drafts a base graph using one small PLM, then employs a second PLM to iteratively propose graph edits, reducing full-graph generation overhead. Using two Flan-T5-Base models, DiscoSG-Refiner still improves SPICE by approximately 30% over the best baseline while achieving 86 times faster inference than GPT-4. It also consistently improves downstream VLM tasks like discourse-level caption evaluation and hallucination detection. 
Code and data are available at: https://github.com/ShaoqLin/DiscoSG[41] WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts
Negar Foroutan,Angelika Romanou,Matin Ansaripour,Julian Martin Eisenschlos,Karl Aberer,Rémi Lebret
Main category: cs.CL
TL;DR: This paper introduces WikiMixQA, a benchmark for evaluating cross-modal reasoning over tables and charts in documents. It assesses 12 vision-language models revealing significant challenges in long-context, multi-modal reasoning.
Details
Motivation: To address the gap in evaluating vision-language models' effectiveness in processing long-context vision inputs and complex document layouts. Method: Introduced WikiMixQA, a benchmark with 1,000 multiple-choice questions based on tables and charts from 4,000 Wikipedia pages. Evaluated 12 state-of-the-art vision-language models on this benchmark. Result: Proprietary models achieved ~70% accuracy with direct context but performed significantly worse when retrieval from long documents was required. GPT-4-o was the only model exceeding 50% accuracy in the latter setting, while open-source models had a maximum accuracy of 27%. Conclusion: The findings highlight the difficulties in long-context, multi-modal reasoning and establish WikiMixQA as an important benchmark for advancing document understanding research. Abstract: Documents are fundamental to preserving and disseminating information, often incorporating complex layouts, tables, and charts that pose significant challenges for automatic document understanding (DU). While vision-language large models (VLLMs) have demonstrated improvements across various tasks, their effectiveness in processing long-context vision inputs remains unclear. This paper introduces WikiMixQA, a benchmark comprising 1,000 multiple-choice questions (MCQs) designed to evaluate cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages spanning seven distinct topics. Unlike existing benchmarks, WikiMixQA emphasizes complex reasoning by requiring models to synthesize information from multiple modalities. We evaluate 12 state-of-the-art vision-language models, revealing that while proprietary models achieve ~70% accuracy when provided with direct context, their performance deteriorates significantly when retrieval from long documents is required. Among these, GPT-4-o is the only model exceeding 50% accuracy in this setting, whereas open-source models perform considerably worse, with a maximum accuracy of 27%. 
These findings underscore the challenges of long-context, multi-modal reasoning and establish WikiMixQA as a crucial benchmark for advancing document understanding research.
[42] From Model to Classroom: Evaluating Generated MCQs for Portuguese with Narrative and Difficulty Concerns
Bernardo Leite,Henrique Lopes Cardoso,Pedro Pinto,Abel Ferreira,Luís Abreu,Isabel Rangel,Sandra Monteiro
Main category: cs.CL
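TL;DR: Current generative models can produce multiple-choice questions (MCQs) of quality comparable to human-authored ones, but issues remain in semantic clarity, answerability, and distractor design.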
Details
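Motivation: Although MCQs are valuable for learning and assessment, manually creating them with varying difficulty levels and targeted reading skills remains time-consuming and costly. Recent advances in generative AI offer an opportunity to automate MCQ generation efficiently, but evaluating the quality and reliability of generated MCQs has received limited attention, particularly for cases where generation fails. In addition, most MCQ generation research focuses on English, leaving other languages underexplored. Method: The study investigates the ability of current generative models to produce MCQs for reading comprehension in Portuguese. It focuses on generating MCQs that align with curriculum-relevant narrative elements and span different difficulty levels. The MCQs are evaluated through expert review and through analysis of psychometric properties extracted from student responses, to assess their suitability for elementary school students. Result: Current models can generate MCQs of quality comparable to human-authored ones. However, issues with semantic clarity and answerability were identified, and challenges remain in generating distractors that engage students and meet established criteria for high-quality MCQ option design. Conclusion: Although current generative models can produce high-quality MCQs, semantic clarity, answerability, and distractor design still need improvement before the MCQs are well suited to real classroom settings. Abstract: While MCQs are valuable for learning and evaluation, manually creating them with varying difficulty levels and targeted reading skills remains a time-consuming and costly task. Recent advances in generative AI provide an opportunity to automate MCQ generation efficiently. However, assessing the actual quality and reliability of generated MCQs has received limited attention, particularly regarding cases where generation fails. This aspect becomes particularly important when the generated MCQs are meant to be applied in real-world settings. Additionally, most MCQ generation studies focus on English, leaving other languages underexplored. This paper investigates the capabilities of current generative models in producing MCQs for reading comprehension in Portuguese, a morphologically rich language. Our study focuses on generating MCQs that align with curriculum-relevant narrative elements and span different difficulty levels. We evaluate these MCQs through expert review and by analyzing the psychometric properties extracted from student responses to assess their suitability for elementary school students. Our results show that current models can generate MCQs of comparable quality to human-authored ones. However, we identify issues related to semantic clarity and answerability. Also, challenges remain in generating distractors that engage students and meet established criteria for high-quality MCQ option design.
[43] The Compositional Architecture of Regret in Large Language Models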
Xiangxiang Cui,Shu Yang,Tianjin Huang,Wanyu Lin,Lijie Hu,Di Wang
Main category: cs.CL
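TL;DR: This study examines the regret mechanism in large language models, proposing a workflow for constructing a regret dataset, the S-CDI metric for identifying the optimal regret representation layer, and the RDS and GIC metrics for identifying and analyzing regret neurons. Experiments identify the optimal regret representation layer, reveal that information processing alternates between coupling and decoupling phases, and categorize neurons into three functional groups.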
Details
Motivation: Studying the regret mechanism of large language models is crucial for improving model reliability and for revealing how cognition is encoded in neural networks. However, three challenges remain: there are no specialized datasets capturing regret expressions, no metrics for finding the optimal regret representation layer, and no metrics for identifying and analyzing regret neurons. Method: (1) A workflow for constructing a comprehensive regret dataset through strategically designed prompting scenarios; (2) the S-CDI metric for identifying the optimal regret representation layer; (3) the RDS and GIC metrics for identifying regret neurons and analyzing activation patterns, respectively. Result: The S-CDI metric identified the optimal regret representation layer, which significantly improved performance in probe classification experiments. An M-shaped decoupling pattern was found across model layers, revealing that information processing alternates between coupling and decoupling phases. Using the RDS metric, neurons were categorized into three functional groups: regret neurons, non-regret neurons, and dual neurons. Conclusion: The proposed metrics and methods provide effective tools for understanding the regret mechanism in large language models, helping to further improve model reliability and to reveal how cognition is encoded in neural networks. Abstract: Regret in Large Language Models refers to their explicit regret expression when presented with evidence contradicting their previously generated misinformation. Studying the regret mechanism is crucial for enhancing model reliability and helps in revealing how cognition is coded in neural networks. To understand this mechanism, we need to first identify regret expressions in model outputs, then analyze their internal representation. This analysis requires examining the model's hidden states, where information processing occurs at the neuron level. However, this faces three key challenges: (1) the absence of specialized datasets capturing regret expressions, (2) the lack of metrics to find the optimal regret representation layer, and (3) the lack of metrics for identifying and analyzing regret neurons. Addressing these limitations, we propose: (1) a workflow for constructing a comprehensive regret dataset through strategically designed prompting scenarios, (2) the Supervised Compression-Decoupling Index (S-CDI) metric to identify optimal regret representation layers, and (3) the Regret Dominance Score (RDS) metric to identify regret neurons and the Group Impact Coefficient (GIC) to analyze activation patterns. Our experimental results successfully identified the optimal regret representation layer using the S-CDI metric, which significantly enhanced performance in probe classification experiments. Additionally, we discovered an M-shaped decoupling pattern across model layers, revealing how information processing alternates between coupling and decoupling phases. 
Through the RDS metric, we categorized neurons into three distinct functional groups: regret neurons, non-regret neurons, and dual neurons.
[44] Minding the Politeness Gap in Cross-cultural Communication
Yuka Machino,Matthias Hofer,Max Siegel,Joshua B. Tenenbaum,Robert D. Hawkins
Main category: cs.CL
TL;DR: This paper explores cross-cultural differences in interpretation of intensifiers among British and American English speakers through three experiments and a computational cognitive model. It finds that these differences stem from both literal meanings and weights on utterance cost.
Details
Motivation: To clarify whether cross-cultural communication differences arise from literal word meanings or pragmatic factors like politeness and brevity. Method: Conducted three experiments examining the interpretation of intensifiers such as 'quite' and 'very' by speakers of British and American English. Developed a computational cognitive model where listeners reason about speakers balancing informativity, politeness, and utterance cost. Result: Cross-cultural differences in intensifier interpretation are due to both different literal meanings and different weights on utterance cost. Conclusion: The findings challenge accounts based purely on semantic variation or politeness norms, showing that cross-cultural interpretation differences result from an interplay between literal meaning and pragmatic factors. Abstract: Misunderstandings in cross-cultural communication often arise from subtle differences in interpretation, but it is unclear whether these differences arise from the literal meanings assigned to words or from more general pragmatic factors such as norms around politeness and brevity. In this paper, we report three experiments examining how speakers of British and American English interpret intensifiers like "quite" and "very." To better understand these cross-cultural differences, we developed a computational cognitive model where listeners recursively reason about speakers who balance informativity, politeness, and utterance cost. Our model comparisons suggested that cross-cultural differences in intensifier interpretation stem from a combination of (1) different literal meanings and (2) different weights on utterance cost. These findings challenge accounts based purely on semantic variation or politeness norms, demonstrating that cross-cultural differences in interpretation emerge from an intricate interplay between the two.
[45] Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability
Yusuke Sakai,Hidetaka Kamigaito,Taro Watanabe
Main category: cs.CL
TL;DR: In generative commonsense reasoning tasks, LLMs can include all given concepts in sentences but may struggle to follow a specified concept order. This paper proposes Ordered CommonGen, a benchmark assessing LLMs' ability to generate sentences adhering to a specific order. Analysis of 36 LLMs revealed biases towards certain concept orders, resulting in low-diversity outputs and highlighting the need for improvement in instruction-following and compositional generalization capabilities.
Details
Motivation: The motivation is to improve the evaluation of LLMs' abilities in both following instructions and generating diverse sentences with given concepts in a specified order, addressing shortcomings in current benchmarks that only focus on including all concepts without considering order. Method: Propose Ordered CommonGen, a new benchmark designed to evaluate LLMs' compositional generalization and instruction-following abilities by measuring ordered coverage, which assesses whether concepts are generated in the specified order. Result: Analysis of 36 LLMs showed that while they generally understand instructions, they exhibit biases towards specific concept order patterns, leading to low-diversity outputs or identical results when the concept order changes. The best-performing model achieved only about 75% ordered coverage. Conclusion: There is a need to enhance LLMs' instruction-following and compositional generalization capabilities, as demonstrated by the findings from the Ordered CommonGen benchmark. Abstract: In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts. However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered. 
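As a rough illustration of what an ordered coverage check could look like, the sketch below tests whether every concept appears in a sentence and whether first occurrences follow the requested order. The function names, the prefix-based matching (so that "dog" also matches "dogs"), and the aggregate rate are assumptions made for this example; the benchmark's actual matching procedure may differ, for instance by lemmatizing irregular forms.

```python
import re
from typing import Sequence, Tuple


def ordered_coverage(sentence: str, concepts: Sequence[str]) -> bool:
    """True if all concepts occur and their first occurrences follow the given order."""
    text = sentence.lower()
    positions = []
    for concept in concepts:
        # Word-boundary prefix match: a crude stand-in for lemma matching.
        match = re.search(r"\b" + re.escape(concept.lower()), text)
        if match is None:
            return False  # a missing concept fails coverage outright
        positions.append(match.start())
    return positions == sorted(positions)


def ordered_coverage_rate(samples: Sequence[Tuple[str, Sequence[str]]]) -> float:
    """Fraction of (sentence, concept order) pairs that satisfy ordered coverage."""
    hits = sum(ordered_coverage(sentence, concepts) for sentence, concepts in samples)
    return hits / len(samples)


sentence = "The dog catches a frisbee in the park."
in_order = ordered_coverage(sentence, ["dog", "frisbee", "park"])
out_of_order = ordered_coverage(sentence, ["park", "dog", "frisbee"])
```

Scoring the same sentence against two different concept orders shows why a model that emits identical output regardless of the requested order can cover every concept and still fail the ordered check.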
Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities.
[46] Oldies but Goldies: The Potential of Character N-grams for Romanian Texts
Dana Lupsa,Sanda-Maria Avram
Main category: cs.CL
TL;DR: This study addresses Romanian texts authorship attribution using the ROST corpus, evaluating six machine learning techniques with character n-gram features. ANN achieved the highest performance.
Details
Motivation: To systematically evaluate machine learning techniques for authorship attribution in Romanian texts using the ROST corpus and character n-gram features. Method: Six machine learning techniques were evaluated: SVM, LR, k-NN, DT, RF, and ANN, all employing character n-gram features for classification. Result: The ANN model achieved the highest performance, including perfect classification in four out of fifteen runs when using 5-gram features. Conclusion: Lightweight, interpretable character n-gram approaches can deliver state-of-the-art accuracy for Romanian authorship attribution, showing potential in resource-constrained or under-studied language settings. Abstract: This study addresses the problem of authorship attribution for Romanian texts using the ROST corpus, a standard benchmark in the field. We systematically evaluate six machine learning techniques: Support Vector Machine (SVM), Logistic Regression (LR), k-Nearest Neighbors (k-NN), Decision Trees (DT), Random Forests (RF), and Artificial Neural Networks (ANN), employing character n-gram features for classification. Among these, the ANN model achieved the highest performance, including perfect classification in four out of fifteen runs when using 5-gram features. These results demonstrate that lightweight, interpretable character n-gram approaches can deliver state-of-the-art accuracy for Romanian authorship attribution, rivaling more complex methods. Our findings highlight the potential of simple stylometric features in resource-constrained or under-studied language settings.
[47] CC-LEARN: Cohort-based Consistency Learning
Xiao Ye,Shaswat Shrivastava,Zhaonan Li,Jacob Dineen,Shijie Lu,Avneet Ahuja,Ming Shen,Zhikun Xu,Ben Zhou
Main category: cs.CL
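TL;DR: Large language models excel at many tasks but still struggle with consistent, robust reasoning. This paper proposes Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. To enforce cohort-level consistency, the authors define a composite objective combining cohort accuracy, a retrieval bonus, and a rejection penalty, which supervised fine-tuning cannot directly optimize. Experiments show that CC-Learn improves accuracy and reasoning stability on challenging reasoning benchmarks.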
Details
Motivation: Although large language models excel at many tasks, they still struggle with consistent, robust reasoning. A method is therefore needed to make LLM reasoning more consistent across similar questions. Method: Proposes Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. It defines a composite objective comprising cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups, in order to enforce cohort-level consistency. Result: On challenging reasoning benchmarks such as ARC-Challenge and StrategyQA, CC-Learn significantly improves accuracy and reasoning stability, outperforming pretrained and supervised fine-tuning baselines. Conclusion: Cohort-level reinforcement learning (RL) effectively improves the reasoning consistency of LLMs, and CC-Learn provides a concrete framework for doing so. Abstract: Large language models excel at many tasks but still struggle with consistent, robust reasoning. We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. To enforce cohort-level consistency, we define a composite objective combining cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups that reinforcement learning can directly optimize, unlike supervised fine-tuning. Optimizing this reward guides the model to adopt uniform reasoning patterns across all cohort members. Experiments on challenging reasoning benchmarks (including ARC-Challenge and StrategyQA) show that CC-Learn boosts both accuracy and reasoning stability over pretrained and SFT baselines. These results demonstrate that cohort-level RL effectively enhances reasoning consistency in LLMs.
[48] Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers
Tommaso Green,Martin Gubri,Haritz Puerto,Sangdoo Yun,Seong Joon Oh
Main category: cs.CL
TL;DR: In this paper, we explore how reasoning traces of large models used as personal agents can leak privacy. Despite being considered internal and safe, these traces often contain sensitive data that can be extracted through prompt injections or accidental leaks. We find that test-time compute methods, especially increased reasoning steps, increase such leakage. Although these methods make models more cautious in final outputs, they also lead to more verbose reasoning which increases privacy risks. This highlights a tension between utility improvement through reasoning and the expansion of privacy attack surfaces. We advocate for safety measures to cover not just model outputs but also their internal reasoning processes.
Details
Motivation: The motivation is to address the overlooked privacy risks associated with reasoning traces of large models functioning as personal agents, challenging the assumption that these traces are internal and therefore safe. Method: The method involves probing and agentic evaluations to demonstrate how test-time compute approaches, particularly increased reasoning steps, amplify privacy leakage. It explores the balance between making models more cautious in final answers and the side effect of increasing verbosity in reasoning which leads to more leakage. Result: The result shows that while increasing test-time compute budgets improves caution in final model outputs, it simultaneously amplifies privacy leakage through more verbose reasoning processes. Conclusion: We conclude that there is a fundamental trade-off between enhancing model utility through reasoning and expanding the privacy attack surface. Safety measures should extend beyond just regulating model outputs to include safeguarding the internal reasoning processes. Abstract: We study privacy leakage in the reasoning traces of large reasoning models used as personal agents. Unlike final outputs, reasoning traces are often assumed to be internal and safe. We challenge this assumption by showing that reasoning traces frequently contain sensitive user data, which can be extracted via prompt injections or accidentally leak into outputs. Through probing and agentic evaluations, we demonstrate that test-time compute approaches, particularly increased reasoning steps, amplify such leakage. While increasing the budget of those test-time compute approaches makes models more cautious in their final answers, it also leads them to reason more verbosely and leak more in their own thinking. This reveals a core tension: reasoning improves utility but enlarges the privacy attack surface. 
We argue that safety efforts must extend to the model's internal thinking, not just its outputs.
[49] Gender-Neutral Machine Translation Strategies in Practice
Hillary Dawkins,Isar Nejadgholi,Chi-kiu Lo
Main category: cs.CL
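TL;DR: This paper evaluates how sensitive 21 machine translation systems are to the need for gender neutrality when the source is gender-ambiguous. It finds a general absence of gender-neutral translations, although a few systems produce them using specific strategies that depend on the target language.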
Details
Motivation: The work addresses the challenge machine translation faces when translating gender-ambiguous source text into grammatical gender languages, with the aim of avoiding misgendering and representational harms. Method: Evaluates the gender-neutral responses of 21 MT systems across three translation directions of varying difficulty, categorizes and discusses the specific gender-neutral strategies observed in practice, and examines the effect of binary gender stereotypes on gender-neutral translation. Result: Most MT systems fail to provide gender-neutral translations, but for some target languages a few systems switch to gender-neutral translation using specific strategies. Conclusion: Current MT systems generally lack the ability to translate gender-neutrally in response to gender ambiguity and need further improvement to handle it well. Abstract: Gender-inclusive machine translation (MT) should preserve gender ambiguity in the source to avoid misgendering and representational harms. While gender ambiguity often occurs naturally in notional gender languages such as English, maintaining that gender neutrality in grammatical gender languages is a challenge. Here we assess the sensitivity of 21 MT systems to the need for gender neutrality in response to gender ambiguity in three translation directions of varying difficulty. The specific gender-neutral strategies that are observed in practice are categorized and discussed. Additionally, we examine the effect of binary gender stereotypes on the use of gender-neutral translation. In general, we report a disappointing absence of gender-neutral translations in response to gender ambiguity. However, we observe a small handful of MT systems that switch to gender-neutral translation using specific strategies, depending on the target language.
[50] GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Byung-Kwan Lee,Ryo Hachiuma,Yong Man Ro,Yu-Chiang Frank Wang,Yueh-Hua Wu
Main category: cs.CL
TL;DR: Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios remains challenging due to their substantial computational demands. To address this challenge, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs.
Details
Motivation: Deploying large VLMs in real-world scenarios is challenging due to their substantial computational demands. There is a need for a general-purpose distillation framework that can be applied to different types of VLMs. Method: The paper introduces GenRecal, which incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. Result: Through extensive experiments on multiple challenging benchmarks, GenRecal significantly improves baseline performances, eventually outperforming large-scale open- and closed-source VLMs. Conclusion: GenRecal is a successful framework for distilling knowledge from large VLMs into smaller, more efficient counterparts, overcoming the challenge posed by the diversity of VLM architectures. Abstract: Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge arises here from the diversity of VLM architectures, which are built on different LLMs and employ varying token types-differing in vocabulary size, token splits, and token index ordering. To address this challenge of limitation to a specific VLM type, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. 
Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performances, eventually outperforming large-scale open- and closed-source VLMs.
[51] PhantomHunter: Detecting Unseen Privately-Tuned LLM-Generated Text via Family-Aware Learning
Yuhui Shi,Yehan Yang,Qiang Sheng,Hao Mi,Beizhe Hu,Chaoxi Xu,Juan Cao
Main category: cs.CL
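TL;DR: Proposes PhantomHunter, an LLM-generated text detector specialized for text from unseen, privately-tuned LLMs. Experiments show that it outperforms 7 baselines and 3 industrial services, with F1 scores of over 96%.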
Details
Motivation: With the spread of large language models, undesirable societal problems such as misinformation production and academic misconduct have become more severe, making the detection of LLM-generated text more important than ever. However, existing methods suffer a significant performance drop on text from privately tuned LLMs. Method: Proposes PhantomHunter, a detector specialized for text generated by unseen, privately-tuned LLMs. Its family-aware learning framework captures family-level traits shared by base models and their derivatives, rather than memorizing individual characteristics. Result: Experiments on data from the LLaMA, Gemma, and Mistral families show that the method outperforms 7 baselines and 3 industrial services, with F1 scores of over 96%. Conclusion: PhantomHunter effectively detects text from unseen, privately-tuned LLMs, addressing the performance drop that existing detectors suffer on such text. Abstract: With the popularity of large language models (LLMs), undesirable societal problems like misinformation production and academic misconduct have been more severe, making LLM-generated text detection now of unprecedented importance. Although existing methods have made remarkable progress, a new challenge posed by text from privately tuned LLMs remains underexplored. Users could easily possess private LLMs by fine-tuning an open-source one with private corpora, resulting in a significant performance drop of existing detectors in practice. To address this issue, we propose PhantomHunter, an LLM-generated text detector specialized for detecting text from unseen, privately-tuned LLMs. Its family-aware learning framework captures family-level traits shared across the base models and their derivatives, instead of memorizing individual characteristics. Experiments on data from LLaMA, Gemma, and Mistral families show its superiority over 7 baselines and 3 industrial services, with F1 scores of over 96%.
cs.CV [Back]
[52] SemIRNet: A Semantic Irony Recognition Network for Multimodal Sarcasm Detection
Jingxuan Zhou,Yuehao Wu,Yibo Zhang,Yeyubei Zhang,Yunchong Liu,Bolin Huang,Chunhong Yuan
Main category: cs.CV
TL;DR: This paper proposes a Semantic Irony Recognition Network (SemIRNet) with three innovations to solve the problem of identifying graphical implicit correlations in multimodal irony detection tasks, achieving accuracy and F1 improvements.
Details
Motivation: To address the difficulty in accurately identifying graphical implicit correlations in multimodal irony detection tasks. Method: The model introduces ConceptNet knowledge base for common-sense reasoning, designs two cross-modal semantic similarity detection modules at word and sample levels, and applies contrastive learning loss function to optimize sample feature spatial distribution. Result: Experiments on a benchmark dataset show 1.64% improvement in accuracy and 2.88% in F1 value compared to existing optimal methods. Ablation experiments confirm the importance of knowledge fusion and semantic similarity detection. Conclusion: SemIRNet improves performance in multimodal irony detection through its innovative use of ConceptNet, cross-modal semantic similarity detection, and contrastive learning. Abstract: Aiming at the problem of difficulty in accurately identifying graphical implicit correlations in multimodal irony detection tasks, this paper proposes a Semantic Irony Recognition Network (SemIRNet). The model contains three main innovations: (1) The ConceptNet knowledge base is introduced for the first time to acquire conceptual knowledge, which enhances the model's common-sense reasoning ability; (2) Two cross-modal semantic similarity detection modules at the word level and sample level are designed to model graphic-textual correlations at different granularities; and (3) A contrastive learning loss function is introduced to optimize the spatial distribution of the sample features, which improves the separability of positive and negative samples. Experiments on a publicly available multimodal irony detection benchmark dataset show that the accuracy and F1 value of this model are improved by 1.64% and 2.88% to 88.87% and 86.33%, respectively, compared with the existing optimal methods. 
Further ablation experiments verify the important role of knowledge fusion and semantic similarity detection in improving the model performance.
[53] Argus Inspection: Do Multimodal Large Language Models Possess the Eye of Panoptes?
Yang Yao,Lingyu Li,Jiaxin Song,Chiyu Chen,Zhenqi He,Yixu Wang,Xin Wang,Tianle Gu,Jie Li,Yan Teng,Yingchun Wang
Main category: cs.CV
TL;DR: Multimodal Large Language Models (MLLMs) have improved in cognitive and reasoning skills, but still face challenges in visual fine-grained perception and commonsense causal inference. This paper introduces Argus Inspection, a benchmark for evaluating these capabilities, and Eye of Panoptes, a framework for assessing opinion-based reasoning tasks. Experiments on 26 MLLMs show significant room for improvement.
Details
Motivation: To address the ongoing challenges in visual fine-grained perception and commonsense causal inference in MLLMs, the authors aim to create a comprehensive benchmark and evaluation framework to better assess and enhance these models' capabilities. Method: The authors introduce Argus Inspection, a multimodal benchmark with two levels of difficulty focusing on detailed visual recognition and real-world commonsense understanding. They also present the Eye of Panoptes framework, which incorporates a binary parametric Sigmoid metric and an indicator function for holistic evaluation of MLLMs' responses in opinion-based reasoning tasks. Result: Experiments on 26 mainstream MLLMs indicate that the highest performance in visual fine-grained reasoning is relatively low at 0.46, suggesting substantial potential for improvement. Conclusion: The research provides valuable insights for the continued development and refinement of MLLMs, particularly in enhancing their visual and commonsense reasoning abilities. Abstract: As Multimodal Large Language Models (MLLMs) continue to evolve, their cognitive and reasoning capabilities have seen remarkable progress. However, challenges in visual fine-grained perception and commonsense causal inference persist. This paper introduces Argus Inspection, a multimodal benchmark with two levels of difficulty, emphasizing detailed visual recognition while incorporating real-world commonsense understanding to evaluate causal reasoning abilities. Expanding on it, we present the Eye of Panoptes framework, which integrates a binary parametric Sigmoid metric with an indicator function, enabling a more holistic evaluation of MLLMs' responses in opinion-based reasoning tasks. Experiments conducted on 26 mainstream MLLMs reveal that the highest performance in visual fine-grained reasoning reaches only 0.46, highlighting considerable potential for enhancement. 
Our research offers valuable perspectives for the continued refinement of MLLMs.
[54] A Hybrid ConvNeXt-EfficientNet AI Solution for Precise Falcon Disease Detection
Alavikunhu Panthakkan,Zubair Medammal,S M Anzar,Fatma Taher,Hussain Al-Ahmad
Main category: cs.CV
TL;DR: This paper introduces a hybrid ConvNeXt and EfficientNet AI model for classifying falcon diseases, achieving better performance than traditional methods.
Details
Motivation: To improve the health surveillance of falcons by developing an accurate method to classify falcon diseases. Method: The paper employs a hybrid of ConvNeXt and EfficientNet AI models to identify three conditions in falcons: Normal, Liver Disease, and Aspergillosis. The model is trained and validated using a substantial dataset with focus on accuracy, precision, recall, and F1-score. Result: The concatenated AI model outperforms traditional diagnostic methods and individual model architectures in detecting falcon diseases. Conclusion: The implementation of the hybrid AI model represents a significant advancement in precise falcon disease detection and opens opportunities for future AI-powered avian healthcare solutions. Abstract: Falconry, a revered tradition involving training falcons and hunting with them, requires meticulous health surveillance to ensure the health and safety of these prized birds, particularly in hunting scenarios. This paper presents an innovative method employing a hybrid of ConvNeXt and EfficientNet AI models for the classification of falcon diseases. The study focuses on accurately identifying three conditions: Normal, Liver Disease, and Aspergillosis. A substantial dataset was utilized for training and validating the model, with an emphasis on key performance metrics such as accuracy, precision, recall, and F1-score. Extensive testing and analysis have shown that our concatenated AI model outperforms traditional diagnostic methods and individual model architectures. The successful implementation of this hybrid AI model marks a significant step forward in precise falcon disease detection and paves the way for future developments in AI-powered avian healthcare solutions.
[55] ViLLa: A Neuro-Symbolic approach for Animal Monitoring
Harsha Koduri
Main category: cs.CV
TL;DR: A neuro-symbolic framework named ViLLa is introduced for interpretable animal monitoring, integrating visual detection, language parsing and symbolic reasoning to answer queries about animals in images.
Details
Motivation: Monitoring animal populations requires interpreting visual data and understanding human language queries, prompting the need for a system that can bridge these capabilities. Method: ViLLa integrates three components: a visual detection module for identifying animals and their locations, a language parser for understanding natural language queries, and a symbolic reasoning layer applying logic-based inference to answer those queries. Result: ViLLa successfully grounds visual detections into symbolic facts and uses predefined rules to accurately answer questions related to count, presence, and location of animals in images. Conclusion: ViLLa demonstrates the ability to connect visual content with structured, human-interpretable queries, offering modularity and transparency unlike end-to-end black-box models. Abstract: Monitoring animal populations in natural environments requires systems that can interpret both visual data and human language queries. This work introduces ViLLa (Vision-Language-Logic Approach), a neuro-symbolic framework designed for interpretable animal monitoring. ViLLa integrates three core components: a visual detection module for identifying animals and their spatial locations in images, a language parser for understanding natural language queries, and a symbolic reasoning layer that applies logic-based inference to answer those queries. Given an image and a question such as "How many dogs are in the scene?" or "Where is the buffalo?", the system grounds visual detections into symbolic facts and uses predefined rules to compute accurate answers related to count, presence, and location. Unlike end-to-end black-box models, ViLLa separates perception, understanding, and reasoning, offering modularity and transparency. 
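The perception, parsing, and reasoning split described above can be pictured with a small Python sketch. This is not the authors' implementation: the `Detection` record, the keyword-based `parse_query`, the fact format, and the left/center/right location rule are all hypothetical stand-ins chosen for illustration, and a real system would obtain detections from a trained detector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str   # class name from a hypothetical detector
    box: tuple   # (x_min, y_min, x_max, y_max) in pixels


def to_facts(detections):
    """Ground detections into symbolic facts of the form (predicate, id, value)."""
    facts = []
    for obj_id, det in enumerate(detections):
        facts.append(("is_a", obj_id, det.label))
        facts.append(("center_x", obj_id, (det.box[0] + det.box[2]) / 2))
    return facts


def parse_query(question):
    """Toy keyword parser (an assumption): map a question to (intent, label)."""
    words = question.lower().rstrip("?").split()
    if words[:2] == ["how", "many"] and len(words) > 2:
        noun = words[2]
        return "count", noun[:-1] if noun.endswith("s") else noun  # naive singular
    if words[:1] == ["where"]:
        return "location", words[-1]
    if words[:2] == ["is", "there"]:
        return "presence", words[-1]
    raise ValueError(f"unsupported question: {question!r}")


def answer(facts, intent, label, image_width):
    """Apply predefined rules to the facts for count, presence, or location."""
    ids = [obj_id for pred, obj_id, value in facts if pred == "is_a" and value == label]
    if intent == "count":
        return len(ids)
    if intent == "presence":
        return bool(ids)
    centers = {obj_id: value for pred, obj_id, value in facts if pred == "center_x"}

    def region(x):
        # Illustrative rule: split the image into horizontal thirds.
        if x < image_width / 3:
            return "left"
        if x < 2 * image_width / 3:
            return "center"
        return "right"

    return [region(centers[obj_id]) for obj_id in ids]


def ask(detections, question, image_width):
    intent, label = parse_query(question)
    return answer(to_facts(detections), intent, label, image_width)


scene = [
    Detection("dog", (10, 20, 110, 120)),
    Detection("dog", (400, 50, 500, 150)),
    Detection("buffalo", (250, 60, 390, 200)),
]
```

Because each stage only exchanges plain facts and intents, any one of them (detector, parser, rules) could be swapped without touching the others, which is the modularity the abstract emphasizes.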
The system was evaluated on a range of animal imagery tasks and demonstrates the ability to bridge visual content with structured, human-interpretable queries.
[56] GraphGSOcc: Semantic and Geometric Graph Transformer for 3D Gaussian Splating-based Occupancy Prediction
Ke Song,Yunhe Wu,Chunchit Siu,Huiyuan Xiong
Main category: cs.CV
TL;DR: The paper presents GraphGSOcc model for 3D semantic occupancy prediction in autonomous driving, which solves two issues of existing methods through Dual Gaussians Graph Attention and Multi-scale Graph Attention framework.
Details
Motivation: To improve the accuracy and efficiency of 3D semantic occupancy prediction for autonomous driving by addressing limitations in current 3D Gaussian Splating (3DGS) methods regarding feature aggregation and boundary ambiguities. Method: Proposes the GraphGSOcc model that integrates semantic and geometric graph Transformer. Introduces Dual Gaussians Graph Attention to dynamically construct dual graph structures for better feature aggregation and encoding semantic relationships. Uses Multi-scale Graph Attention framework for optimizing boundary details and modeling object-level topology. Result: On the SurroundOcc dataset, achieves an mIoU of 24.10%, reduces GPU memory usage to 6.1 GB, showing a 1.97% improvement in mIoU and 13.7% reduction in memory compared to GaussianWorld. Conclusion: The GraphGSOcc model effectively addresses the limitations of previous 3DGS methods, providing improved performance and reduced memory consumption in 3D semantic occupancy prediction. Abstract: Addressing the task of 3D semantic occupancy prediction for autonomous driving, we tackle two key issues in existing 3D Gaussian Splating (3DGS) methods: (1) unified feature aggregation neglecting semantic correlations among similar categories and across regions, and (2) boundary ambiguities caused by the lack of geometric constraints in MLP iterative optimization. We propose the GraphGSOcc model, a novel framework that combines semantic and geometric graph Transformer for 3D Gaussian Splating-based Occupancy Prediction. 
We propose the Dual Gaussians Graph Attention, which dynamically constructs dual graph structures: a geometric graph adaptively calculating KNN search radii based on Gaussian poses, enabling large-scale Gaussians to aggregate features from broader neighborhoods while compact Gaussians focus on local geometric consistency; a semantic graph retaining top-M highly correlated nodes via cosine similarity to explicitly encode semantic relationships within and across instances. Coupled with the Multi-scale Graph Attention framework, fine-grained attention at lower layers optimizes boundary details, while coarse-grained attention at higher layers models object-level topology. Experiments on the SurroundOcc dataset achieve an mIoU of 24.10%, reducing GPU memory to 6.1 GB, demonstrating a 1.97% mIoU improvement and 13.7% memory reduction compared to GaussianWorld.
[57] DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning
Yifeng Gao,Yifan Ding,Hongyu Su,Juncheng Li,Yunhan Zhao,Lin Luo,Zixing Chen,Li Wang,Xin Wang,Yixu Wang,Xingjun Ma,Yu-Gang Jiang
Main category: cs.CV
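TL;DR: As AI-generated video becomes widespread on media platforms, distinguishing synthetic content from authentic footage has become essential. This paper introduces a new dataset, DAVID-X, and a model, DAVID-XR1, which provide fine-grained evidence for identifying AI-generated videos and turn detection from a black box into a transparent, verifiable diagnostic process.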
Details
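Motivation: Existing methods mostly treat the identification of AI-generated video as a binary classification task and offer little insight into the basis for the model's decision. The practical need is not only to detect subtle artifacts but also to provide fine-grained evidence that can convince auditors and end users. Method: Introduces the DAVID-X dataset, which pairs AI-generated videos with detailed temporal-spatial annotations and written rationales. Building on this dataset, the authors develop the video-language model DAVID-XR1, which provides an interpretable chain of visual reasoning through defect categorization, temporal-spatial localization, and natural language explanations. A general-purpose backbone is fine-tuned on the compact dataset and combined with chain-of-thought distillation to achieve strong generalization. Result: Experiments show that a general-purpose backbone, fine-tuned on DAVID-X and enhanced with chain-of-thought distillation, generalizes well across a variety of generators and generation modes, demonstrating the potential of explainable detection methods for trustworthy identification of AI-generated video content. Conclusion: The proposed approach transforms AI-generated video detection from an opaque black-box decision into a transparent and verifiable diagnostic process, laying a foundation for trustworthy systems that identify AI-generated content. Abstract: As AI-generated video becomes increasingly pervasive across media platforms, the ability to reliably distinguish synthetic content from authentic footage has become both urgent and essential. Existing approaches have primarily treated this challenge as a binary classification task, offering limited insight into where or why a model identifies a video as AI-generated. However, the core challenge extends beyond simply detecting subtle artifacts; it requires providing fine-grained, persuasive evidence that can convince auditors and end-users alike. To address this critical gap, we introduce DAVID-X, the first dataset to pair AI-generated videos with detailed defect-level, temporal-spatial annotations and written rationales. Leveraging these rich annotations, we present DAVID-XR1, a video-language model designed to deliver an interpretable chain of visual reasoning, including defect categorization, temporal-spatial localization, and natural language explanations. This approach fundamentally transforms AI-generated video detection from an opaque black-box decision into a transparent and verifiable diagnostic process. We demonstrate that a general-purpose backbone, fine-tuned on our compact dataset and enhanced with chain-of-thought distillation, achieves strong generalization across a variety of generators and generation modes. Our results highlight the promise of explainable detection methods for trustworthy identification of AI-generated video content.
[58] Recent Advances in Multi-Agent Human Trajectory Prediction: A Comprehensive Review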
Céline Finet,Stephane Da Silva Martins,Jean-Bernard Hayet,Ioannis Karamouzas,Javad Amirian,Sylvie Le Hégarat-Mascle,Julien Pettré,Emanuel Aldea
Main category: cs.CV
TL;DR: This paper reviews recent advancements in deep learning-based multi-agent trajectory prediction, categorizes methods by design and strategy, highlights studies using the ETH/UCY benchmark, and outlines challenges and future directions.
Details
Motivation: To gain a finer understanding of multi-agent interactions through data-driven methods in human trajectory prediction, impacting areas like autonomous navigation and crowd modeling. Method: Categorizing existing methods based on architectural design, input representations, and prediction strategies, with an emphasis on models evaluated using the ETH/UCY benchmark. Result: Provides a comprehensive review of advancements in multi-agent trajectory prediction from 2020 to 2024, identifying key challenges and suggesting future research directions. Conclusion: Deep learning-based multi-agent trajectory prediction has seen significant progress, yet challenges remain that require further investigation. Abstract: With the emergence of powerful data-driven methods in human trajectory prediction (HTP), gaining a finer understanding of multi-agent interactions lies within hand's reach, with important implications in areas such as autonomous navigation and crowd modeling. This survey reviews some of the most recent advancements in deep learning-based multi-agent trajectory prediction, focusing on studies published between 2020 and 2024. We categorize the existing methods based on their architectural design, their input representations, and their overall prediction strategies, placing a particular emphasis on models evaluated using the ETH/UCY benchmark. Furthermore, we highlight key challenges and future research directions in the field of multi-agent HTP.
[59] ArchShapeNet: An Interpretable 3D-CNN Framework for Evaluating Architectural Shapes
Jun Yin,Jing Zhong,Pengyu Zeng,Peilin Li,Zixuan Dai,Miao Zhang,Shuai Lu
Main category: cs.CV
TL;DR: In contemporary architectural design, generative plugin tools are essential for producing initial concepts and exploring novel 3D forms. To address the challenge of analyzing differences between human-designed and machine-generated 3D forms, this study built ArchForms-4000, proposed ArchShapeNet, and conducted comparative experiments. The study highlights the advantages of human-designed forms and provides insights for enhancing generative design tools.
Details
Motivation: The growing complexity and diversity of design demands in contemporary architectural design have made generative plugin tools essential, but objectively analyzing the differences between human-designed and machine-generated 3D forms remains a challenge. Method: Built ArchForms-4000, a dataset containing 2,000 architect-designed and 2,000 Evomass-generated 3D forms; proposed ArchShapeNet, a 3D convolutional neural network tailored for classifying and analyzing architectural forms; and conducted comparative experiments. Result: The model outperforms human experts in distinguishing form origins, achieving 94.29% accuracy, 96.2% precision, and 98.51% recall. Conclusion: This study not only highlights the distinctive advantages of human-designed forms in spatial organization, proportional harmony, and detail refinement but also provides valuable insights for enhancing generative design tools in the future. Abstract: In contemporary architectural design, the growing complexity and diversity of design demands have made generative plugin tools essential for quickly producing initial concepts and exploring novel 3D forms. However, objectively analyzing the differences between human-designed and machine-generated 3D forms remains a challenge, limiting our understanding of their respective strengths and hindering the advancement of generative tools. To address this, we built ArchForms-4000, a dataset containing 2,000 architect-designed and 2,000 Evomass-generated 3D forms; proposed ArchShapeNet, a 3D convolutional neural network tailored for classifying and analyzing architectural forms, incorporating a saliency module to highlight key spatial features aligned with architectural reasoning; and conducted comparative experiments showing that our model outperforms human experts in distinguishing form origins, achieving 94.29% accuracy, 96.2% precision, and 98.51% recall.
This study not only highlights the distinctive advantages of human-designed forms in spatial organization, proportional harmony, and detail refinement but also provides valuable insights for enhancing generative design tools in the future.
[60] Real-Time, Low-Latency Surveillance Using Entropy-Based Adaptive Buffering and MobileNetV2 on Edge Devices
Poojashree Chandrashekar,Pankaj M Sajjanar
Main category: cs.CV
TL;DR: The paper presents a high-performance, low-latency video surveillance system for resource-constrained environments.
Details
Motivation: To create an efficient video surveillance system that can operate with high throughput and low latency on devices with limited resources, such as Raspberry Pi and NVIDIA Jetson Nano. Method: Proposed a formal entropy-based adaptive frame buffering algorithm and integrated it with MobileNetV2 to achieve the desired performance. Result: The system processes live video streams with less than 50ms end-to-end inference latency while maintaining over 92% detection accuracy on standard datasets. Comparative and ablation experiments confirm the design's effectiveness. Conclusion: The architecture is scalable, cost-effective, and adheres to strict data privacy regulations, making it suitable for smart city or embedded security applications. Abstract: This paper describes a high-performance, low-latency video surveillance system designed for resource-constrained environments. We have proposed a formal entropy-based adaptive frame buffering algorithm and integrated that with MobileNetV2 to achieve high throughput with low latency. The system is capable of processing live streams of video with sub-50ms end-to-end inference latency on resource-constrained devices (embedded platforms) such as Raspberry Pi, Amazon, and NVIDIA Jetson Nano. Our method maintains over 92% detection accuracy on standard datasets focused on video surveillance and exhibits robustness to varying lighting, backgrounds, and speeds. A number of comparative and ablation experiments validate the effectiveness of our design. Finally, our architecture is scalable, inexpensive, and compliant with stricter data privacy regulations than common surveillance systems, so that the system could coexist in a smart city or embedded security architecture.
[61] MonoVQD: Monocular 3D Object Detection with Variational Query Denoising and Self-Distillation
Kiet Dang Vu,Trung Thai Tran,Duc Dung Nguyen
Main category: cs.CV
TL;DR: MonoVQD is a novel framework that improves DETR-based monocular 3D detection through three contributions: Mask Separated Self-Attention for stable matching, Variational Query Denoising to prevent gradient vanishing, and a self-distillation strategy for query quality improvement. It performs well on KITTI benchmark and generalizes effectively.
Details
Motivation: DETR-like architectures have limitations when directly applied to monocular 3D detection from a single image, which hinders their optimal performance. Method: The proposed method MonoVQD includes: 1) Mask Separated Self-Attention to integrate denoising into DETR architecture and stabilize Hungarian matching; 2) Variational Query Denoising to solve the gradient vanishing problem in denoising process by introducing stochastic properties; 3) A self-distillation strategy using insights from later decoder layers to improve earlier layer queries. Result: MonoVQD achieves superior performance on the challenging KITTI monocular benchmark. Its core components also enhance performance in multi-view 3D detection scenarios on the nuScenes dataset. Conclusion: MonoVQD fundamentally advances DETR-based monocular 3D detection with its innovative mechanisms and demonstrates robust generalization capabilities. Abstract: Precisely localizing 3D objects from a single image constitutes a central challenge in monocular 3D detection. While DETR-like architectures offer a powerful paradigm, their direct application in this domain encounters inherent limitations, preventing optimal performance. Our work addresses these challenges by introducing MonoVQD, a novel framework designed to fundamentally advance DETR-based monocular 3D detection. We propose three main contributions. First, we propose the Mask Separated Self-Attention mechanism that enables the integration of the denoising process into a DETR architecture. This improves the stability of Hungarian matching to achieve a consistent optimization objective. Second, we present the Variational Query Denoising technique to address the gradient vanishing problem of conventional denoising methods, which severely restricts the efficiency of the denoising process. This explicitly introduces stochastic properties to mitigate this fundamental limitation and unlock substantial performance gains. 
Finally, we introduce a sophisticated self-distillation strategy, leveraging insights from later decoder layers to synergistically improve query quality in earlier layers, thereby amplifying the iterative refinement process. Rigorous experimentation demonstrates that MonoVQD achieves superior performance on the challenging KITTI monocular benchmark. Highlighting its broad applicability, MonoVQD's core components seamlessly integrate into other architectures, delivering significant performance gains even in multi-view 3D detection scenarios on the nuScenes dataset and underscoring its robust generalization capabilities.
[62] Improved Iterative Refinement for Chart-to-Code Generation via Structured Instruction
Chengzhi Xu,Yuyang Wang,Lai Wei,Lichao Sun,Weiran Huang
Main category: cs.CV
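TL;DR: Proposes ChartIR, an iterative refinement method based on structured instructions, to improve multimodal large language models on chart-to-code generation. By separating visual understanding from code translation and decomposing the pipeline into initial code generation and iterative refinement, the method achieves superior performance on both Qwen2-VL and GPT-4o.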
Details
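Motivation: Although multimodal large language models achieve impressive results on various vision tasks, their performance on chart-to-code generation still leaves room for improvement. The task requires not only precise visual understanding but also accurate translation of visual elements into structured code. Directly prompting a model to perform this complex task often yields unsatisfactory results. Method: The approach first distinguishes two tasks, visual understanding and code translation, and designs two types of structured instructions, description and difference, for the visual understanding part. The description instruction captures the visual elements of the reference chart, while the difference instruction characterizes the discrepancies between the reference chart and the generated chart. These instructions convert visual features into language representations, which facilitates the subsequent code translation. Second, the overall chart generation pipeline is split into two stages, initial code generation and iterative refinement, so that the final output is improved progressively. Result: Experiments show that the proposed method outperforms other methods on both the open-source model Qwen2-VL and the closed-source model GPT-4o. Conclusion: ChartIR effectively improves multimodal large language models on chart-to-code generation; its structured instructions and iterative refinement strategy provide an effective path from visual understanding to code translation. Abstract: Recently, multimodal large language models (MLLMs) have attracted increasing research attention due to their powerful visual understanding capabilities. While they have achieved impressive results on various vision tasks, their performance on chart-to-code generation remains suboptimal. This task requires MLLMs to generate executable code that can reproduce a given chart, demanding not only precise visual understanding but also accurate translation of visual elements into structured code. Directly prompting MLLMs to perform this complex task often yields unsatisfactory results. To address this challenge, we propose ChartIR, an iterative refinement method based on structured instruction. First, we distinguish two tasks: visual understanding and code translation. To accomplish the visual understanding component, we design two types of structured instructions: description and difference. The description instruction captures the visual elements of the reference chart, while the difference instruction characterizes the discrepancies between the reference chart and the generated chart. These instructions effectively transform visual features into language representations, thereby facilitating the subsequent code translation process. Second, we decompose the overall chart generation pipeline into two stages: initial code generation and iterative refinement, enabling progressive enhancement of the final output.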
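The two-stage pipeline described above (initial generation from a description instruction, then refinement driven by a difference instruction) can be sketched as a simple control loop. This is only an outline under stated assumptions: the five callables stand in for model and rendering calls that the abstract does not specify, and the stopping rule (an empty difference or a fixed round budget) is hypothetical.

```python
def refine_chart_code(reference, describe, generate, render, diff, revise, max_rounds=3):
    """Generate chart code once, then refine it using difference instructions.

    All callables are hypothetical stand-ins:
      describe(reference)            -> description instruction (text or structure)
      generate(reference, desc)      -> initial chart code
      render(code)                   -> chart produced by running the code
      diff(reference, rendered)      -> difference instruction, empty when they match
      revise(code, desc, difference) -> updated chart code
    """
    description = describe(reference)
    code = generate(reference, description)          # stage 1: initial code generation
    for _ in range(max_rounds):                      # stage 2: iterative refinement
        difference = diff(reference, render(code))
        if not difference:                           # nothing left to fix
            break
        code = revise(code, description, difference)
    return code


# Toy stand-ins: a "chart" is a dict of properties and the "code" is that dict.
reference_chart = {"kind": "bar", "color": "red", "title": "Sales"}


def toy_describe(ref):
    return sorted(ref)


def toy_generate(ref, description):
    return {"kind": "bar", "color": "blue", "title": ""}


def toy_render(code):
    return dict(code)


def toy_diff(ref, rendered):
    return [key for key in ref if ref[key] != rendered.get(key)]


def toy_revise(code, description, difference):
    fixed = dict(code)
    fixed[difference[0]] = reference_chart[difference[0]]  # fix one issue per round
    return fixed


def run_toy(max_rounds):
    return refine_chart_code(reference_chart, toy_describe, toy_generate,
                             toy_render, toy_diff, toy_revise, max_rounds)
```

In the toy run, each round repairs one mismatched property, so a larger round budget moves the output closer to the reference, which mirrors the progressive enhancement the abstract describes.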
Experimental results show that, compared to other methods, our method achieves superior performance on both the open-source model Qwen2-VL and the closed-source model GPT-4o.
[63] PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers
Lukas Schiesser,Cornelius Wolff,Sophie Haas,Simon Pukrop
Main category: cs.CV
TL;DR: The paper introduces PictSure, an ICL framework that focuses on the role of image embeddings in few-shot image classification. It outperforms existing models on out-of-domain benchmarks while maintaining comparable results on in-domain tasks.
Details
Motivation: Building effective image classification models is challenging in data-scarce domains due to the impracticality of collecting large labeled datasets. In-context learning shows promise for few-shot image classification but lacks focus on the role of image embeddings. Method: Presents PictSure, a framework that systematically examines the effects of different visual encoder types, pretraining objectives, and fine-tuning strategies on downstream few-shot image classification performance. Result: Experiments demonstrate that training success and out-of-domain performance are highly dependent on how embedding models are pretrained. PictSure outperforms existing models on out-of-domain benchmarks while performing comparably on in-domain tasks. Conclusion: PictSure highlights the importance of image embeddings in in-context learning for few-shot image classification and provides insights into optimizing embedding model pretraining. Abstract: Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) has emerged as a promising paradigm for few-shot image classification (FSIC), enabling models to generalize across domains without gradient-based adaptation. However, prior work has largely overlooked a critical component of ICL-based FSIC pipelines: the role of image embeddings. In this work, we present PictSure, an ICL framework that places the embedding model -- its architecture, pretraining, and training dynamics -- at the center of analysis. We systematically examine the effects of different visual encoder types, pretraining objectives, and fine-tuning strategies on downstream FSIC performance. Our experiments show that the training success and the out-of-domain performance are highly dependent on how the embedding models are pretrained. 
Consequently, PictSure manages to outperform existing ICL-based FSIC models on out-of-domain benchmarks that differ significantly from the training distribution, while maintaining comparable results on in-domain tasks. Code can be found at https://github.com/PictSure/pictsure-library.
[64] Finding Optimal Kernel Size and Dimension in Convolutional Neural Networks An Architecture Optimization Approach
Shreyas Rajeev,B Sathish Babu
Main category: cs.CV
TL;DR: This paper proposes BKSEF, a framework for optimal kernel size determination in CNNs. It balances information gain, efficiency, and accuracy, demonstrating significant improvements over traditional models with uniform 3x3 kernels.
Details
Motivation: Kernel size selection in CNNs impacts receptive field, feature extraction, computational cost, and model accuracy, yet it is often overlooked. Current approaches typically use a fixed heuristic (e.g., uniform 3x3 kernels), which may not be optimal for all layers or applications. Method: The Best Kernel Size Estimation Function (BKSEF) integrates principles from information theory, signal processing, and learning theory to determine optimal, layer-wise kernel sizes. It aims to balance information gain, computational efficiency, and accuracy improvements. Result: Experiments on multiple datasets (CIFAR-10, CIFAR-100, ImageNet-lite, ChestX-ray14, GTSRB) show that BKSEF-guided architectures achieve up to 3.1% accuracy improvement and 42.8% reduction in FLOPs compared to traditional models. Real-world case studies in medical image classification and traffic sign recognition further validate the approach. Conclusion: Kernel size can be an active, optimizable parameter rather than a fixed heuristic. BKSEF provides practical heuristics and theoretical support for efficient and application-aware CNN designs, suitable for integration into neural architecture search pipelines and real-time systems. Abstract: Kernel size selection in Convolutional Neural Networks (CNNs) is a critical but often overlooked design decision that affects receptive field, feature extraction, computational cost, and model accuracy. This paper proposes the Best Kernel Size Estimation Function (BKSEF), a mathematically grounded and empirically validated framework for optimal, layer-wise kernel size determination. BKSEF balances information gain, computational efficiency, and accuracy improvements by integrating principles from information theory, signal processing, and learning theory. 
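The abstract does not give the BKSEF formula, so the sketch below only illustrates the computational-cost side of the trade-off it mentions: for a standard 2D convolution, the operation count grows with the square of the kernel size. The function name and the "same padding, no bias, two operations per multiply-accumulate" convention are assumptions made for this example.

```python
def conv2d_flops(height, width, c_in, c_out, kernel, stride=1):
    """Approximate FLOPs of one square-kernel 2D convolution.

    Assumes 'same' padding and no bias, and counts each multiply-accumulate
    as two floating-point operations.
    """
    out_h = -(-height // stride)   # ceiling division
    out_w = -(-width // stride)
    macs = out_h * out_w * c_in * c_out * kernel * kernel
    return 2 * macs


# Cost of the same 64-to-64 channel layer on a 32x32 feature map
# for three candidate kernel sizes.
flops_by_kernel = {k: conv2d_flops(32, 32, 64, 64, k) for k in (1, 3, 5)}
```

Moving from a 3x3 to a 5x5 kernel multiplies this layer's cost by 25/9, while a 1x1 kernel cuts it to one ninth, which is why a layer-wise choice of kernel size can change total FLOPs substantially.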
Extensive experiments on CIFAR-10, CIFAR-100, ImageNet-lite, ChestX-ray14, and GTSRB datasets demonstrate that BKSEF-guided architectures achieve up to 3.1 percent accuracy improvement and 42.8 percent reduction in FLOPs compared to traditional models using uniform 3x3 kernels. Two real-world case studies further validate the approach: one for medical image classification in a cloud-based setup, and another for traffic sign recognition on edge devices. The former achieved enhanced interpretability and accuracy, while the latter reduced latency and model size significantly, with minimal accuracy trade-off. These results show that kernel size can be an active, optimizable parameter rather than a fixed heuristic. BKSEF provides practical heuristics and theoretical support for researchers and developers seeking efficient and application-aware CNN designs. It is suitable for integration into neural architecture search pipelines and real-time systems, offering a new perspective on CNN optimization.
[65] Efficient Retail Video Annotation: A Robust Key Frame Generation Approach for Product and Customer Interaction Analysis
Varun Mannam,Zhenyu Shi
Main category: cs.CV
TL;DR: A deep learning-based approach automates key-frame identification in retail videos, providing automatic annotations of products and customers, saving costs and maintaining accuracy.
Details
Motivation: Accurate video annotation is crucial for modern retail applications but conventional methods are time-consuming and costly due to manual labeling. Method: The method uses deep neural networks to learn discriminative features by embedding video frames and incorporating object detection-based techniques tailored for retail environments to automate key-frame identification and annotation. Result: The approach achieves accuracy comparable to human annotator labeling, leads to an average of 2 times cost savings, and allows human annotators to verify/adjust less than 5% of detected frames without reducing annotation quality. Conclusion: This automation significantly reduces operational costs and saves time and effort in retail video labeling tasks, making it valuable for various retail applications. Abstract: Accurate video annotation plays a vital role in modern retail applications, including customer behavior analysis, product interaction detection, and in-store activity recognition. However, conventional annotation methods heavily rely on time-consuming manual labeling by human annotators, introducing non-robust frame selection and increasing operational costs. To address these challenges in the retail domain, we propose a deep learning-based approach that automates key-frame identification in retail videos and provides automatic annotations of products and customers. Our method leverages deep neural networks to learn discriminative features by embedding video frames and incorporating object detection-based techniques tailored for retail environments. Experimental results showcase the superiority of our approach over traditional methods, achieving accuracy comparable to human annotator labeling while enhancing the overall efficiency of retail video annotation. Remarkably, our approach leads to an average of 2 times cost savings in video annotation. 
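The abstract says key frames are identified from learned frame embeddings but does not state the selection rule. One common heuristic, shown here purely as an illustration and not as the authors' method, is to keep a frame whenever its embedding has drifted far enough from the last kept frame. The cosine-distance criterion and the `threshold` value are assumptions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def select_key_frames(embeddings, threshold=0.2):
    """Return indices of frames whose embedding differs enough from the last key frame."""
    if not embeddings:
        return []
    key_frames = [0]                       # always keep the first frame
    for idx in range(1, len(embeddings)):
        last_key = embeddings[key_frames[-1]]
        if cosine_distance(last_key, embeddings[idx]) > threshold:
            key_frames.append(idx)
    return key_frames


# Toy 2-D embeddings: frames 0-1 look alike, frames 2-3 look alike, frame 4 differs again.
toy_embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99], [1.0, 0.0]]
```

Only the selected indices would then be passed to annotation or human verification, which is where the reported savings in annotator effort would come from under such a scheme.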
By allowing human annotators to verify/adjust less than 5% of detected frames in the video dataset, while automating the annotation process for the remaining frames without reducing annotation quality, retailers can significantly reduce operational costs. The automation of key-frame detection enables substantial time and effort savings in retail video labeling tasks, proving highly valuable for diverse retail applications such as shopper journey analysis, product interaction detection, and in-store security monitoring.
[66] Peering into the Unknown: Active View Selection with Neural Uncertainty Maps for 3D Reconstruction
Zhengquan Zhang,Feng Xu,Mengmi Zhang
Main category: cs.CV
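TL;DR: Proposes a new active view selection (AVS) method that uses neural uncertainty maps to improve the efficiency and accuracy of 3D reconstruction. The method uses UPNet to predict uncertainty maps, selects the most informative viewpoints for training 3D neural rendering models, and substantially reduces computational overhead while preserving reconstruction accuracy.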
Details
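Motivation: Current 3D reconstruction methods typically select the best viewpoint by computing the uncertainty of every candidate viewpoint, which is computationally expensive and inefficient. A more efficient and accurate approach to active view selection (AVS) is therefore needed. Method: Introduces UPNet, a lightweight feedforward deep neural network that takes a single image of a 3D object and outputs a predicted uncertainty map. Using heuristics derived from observing many natural objects, UPNet learns a direct mapping from viewpoint appearance to uncertainty in the underlying volumetric representation. All previously predicted neural uncertainty maps are then aggregated to suppress redundant candidate viewpoints and select the most informative one. Finally, 3D neural rendering models are trained on the selected viewpoints and evaluated. Result: Although it uses only half as many viewpoints as the upper bound, the method achieves reconstruction accuracy comparable to existing methods. It also substantially reduces computational overhead during AVS, with up to a 400 times speedup and over 50% reductions in CPU, RAM, and GPU usage. Importantly, it generalizes to AVS tasks involving novel object categories without additional training. Conclusion: The proposed AVS method achieves efficient view selection by predicting uncertainty maps with UPNet, substantially lowering computational cost while maintaining high 3D reconstruction accuracy, and it generalizes well to novel object categories. Abstract: Some perspectives naturally provide more information than others. How can an AI system determine which viewpoint offers the most valuable insight for accurate and efficient 3D object reconstruction? Active view selection (AVS) for 3D reconstruction remains a fundamental challenge in computer vision. The aim is to identify the minimal set of views that yields the most accurate 3D reconstruction. Instead of learning radiance fields, like NeRF or 3D Gaussian Splatting, from a current observation and computing uncertainty for each candidate viewpoint, we introduce a novel AVS approach guided by neural uncertainty maps predicted by a lightweight feedforward deep neural network, named UPNet. UPNet takes a single input image of a 3D object and outputs a predicted uncertainty map, representing uncertainty values across all possible candidate viewpoints. By leveraging heuristics derived from observing many natural objects and their associated uncertainty patterns, we train UPNet to learn a direct mapping from viewpoint appearance to uncertainty in the underlying volumetric representations. Next, our approach aggregates all previously predicted neural uncertainty maps to suppress redundant candidate viewpoints and effectively select the most informative one. Using these selected viewpoints, we train 3D neural rendering models and evaluate the quality of novel view synthesis against other competitive AVS methods.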
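The aggregation-and-selection step described above can be sketched in a few lines. The abstract does not say how the uncertainty maps are combined, so the element-wise minimum used here is an assumption (a candidate counts as redundant if any already-observed view predicts low uncertainty for it), as is the flat list representation of each map.

```python
def select_next_view(uncertainty_maps, visited):
    """Pick the unvisited candidate viewpoint with the highest aggregated uncertainty.

    uncertainty_maps: one list per already-observed view, each giving a predicted
        uncertainty for every candidate viewpoint (hypothetical UPNet outputs).
    visited: set of candidate indices that have already been selected.
    """
    num_candidates = len(uncertainty_maps[0])
    # Assumed aggregation rule: element-wise minimum across all maps so far.
    aggregated = [min(m[i] for m in uncertainty_maps) for i in range(num_candidates)]
    candidates = [i for i in range(num_candidates) if i not in visited]
    if not candidates:
        raise ValueError("all candidate viewpoints have been visited")
    return max(candidates, key=lambda i: aggregated[i])


# Two maps predicted from views 0 and 2, over four candidate viewpoints.
toy_maps = [
    [0.0, 0.2, 0.9, 0.8],   # predicted from view 0
    [0.7, 0.1, 0.0, 0.6],   # predicted from view 2
]
```

In the toy data, candidate 3 stays uncertain under both maps while candidate 1 is already well covered, so candidate 3 is chosen next. Since each map comes from a single forward pass, this loop avoids refitting a radiance field per candidate.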
Remarkably, despite using only half as many viewpoints as the upper bound, our method achieves comparable reconstruction accuracy. In addition, it significantly reduces computational overhead during AVS, achieving up to a 400 times speedup along with over 50% reductions in CPU, RAM, and GPU usage compared to baseline methods. Notably, our approach generalizes effectively to AVS tasks involving novel object categories, without requiring any additional training.
[67] DETONATE: A Benchmark for Text-to-Image Alignment and Kernelized Direct Preference Optimization
Renjith Prasad,Abhilekh Borah,Hasnat Md Abdullah,Chathurangi Shyalika,Gurpreet Singh,Ritvik Garimella,Rajarshi Roy,Harshul Surana,Nasrin Imanpour,Suranjana Trivedy,Amit Sheth,Amitava Das
Main category: cs.CV
TL;DR: The paper presents DPO-Kernels for T2I models with Hybrid Loss, Kernelized Representations, and Divergence Selection to enhance alignment. It introduces DETONATE, a benchmark assessing social bias, and AQI, a measure of safe/unsafe image separability.
Details
Motivation: To improve the alignment of text-to-image models ensuring generated images capture user intent while maintaining safety and fairness. Method: DPO-Kernels for T2I models include Hybrid Loss integrating embedding-based objectives, Kernelized Representations using various kernels, and Divergence Selection expanding beyond KL regularizer. The paper also introduces DETONATE benchmark and AQI measure. Result: Empirical results show strong generalization bounds via Heavy-Tailed Self-Regularization (HT-SR). Conclusion: DPO-Kernels maintain strong generalization bounds and DETONATE along with complete code is publicly released. Abstract: Alignment is crucial for text-to-image (T2I) models to ensure that generated images faithfully capture user intent while maintaining safety and fairness. Direct Preference Optimization (DPO), prominent in large language models (LLMs), is extending its influence to T2I systems. This paper introduces DPO-Kernels for T2I models, a novel extension enhancing alignment across three dimensions: (i) Hybrid Loss, integrating embedding-based objectives with traditional probability-based loss for improved optimization; (ii) Kernelized Representations, employing Radial Basis Function (RBF), Polynomial, and Wavelet kernels for richer feature transformations and better separation between safe and unsafe inputs; and (iii) Divergence Selection, expanding beyond DPO's default Kullback-Leibler (KL) regularizer by incorporating Wasserstein and Rényi divergences for enhanced stability and robustness. We introduce DETONATE, the first large-scale benchmark of its kind, comprising approximately 100K curated image pairs categorized as chosen and rejected. DETONATE encapsulates three axes of social bias and discrimination: Race, Gender, and Disability. Prompts are sourced from hate speech datasets, with images generated by leading T2I models including Stable Diffusion 3.5 Large, Stable Diffusion XL, and Midjourney.
Additionally, we propose the Alignment Quality Index (AQI), a novel geometric measure quantifying latent-space separability of safe/unsafe image activations, revealing hidden vulnerabilities. Empirically, we demonstrate that DPO-Kernels maintain strong generalization bounds via Heavy-Tailed Self-Regularization (HT-SR). DETONATE and complete code are publicly released.
[68] PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning
Yizhen Zhang,Yang Ding,Shuoshuo Zhang,Xinchen Zhang,Haoling Li,Zhong-zhi Li,Peijie Wang,Jie Wu,Lei Ji,Yelong Shen,Yujiu Yang,Yeyun Gong
Main category: cs.CV
TL;DR: The paper proposes PeRL, a reinforcement learning approach for multimodal tasks, with strategies to enhance exploration-exploitation trade-off. It outperforms baselines in multi-image benchmarks while maintaining single-image task performance.
Details
Motivation: Existing multimodal reinforcement learning approaches are limited to spatial reasoning within single-image contexts and struggle with more complex scenarios involving multi-image positional reasoning. Method: PeRL is proposed along with a multi-stage strategy that enhances the exploration-exploitation trade-off. This includes permutation of image sequences for diversity and a rollout filtering mechanism for resampling. Result: Experiments show that the PeRL trained model surpasses R1-related and interleaved VLM baselines significantly on multi-image benchmarks and maintains comparable performance on single-image tasks. Conclusion: PeRL achieves state-of-the-art performance in multi-image benchmarks and shows promise in handling complex multimodal reasoning tasks. Abstract: Inspired by the impressive reasoning capabilities demonstrated by reinforcement learning approaches like DeepSeek-R1, recent emerging research has begun exploring the use of reinforcement learning (RL) to enhance vision-language models (VLMs) for multimodal reasoning tasks. However, most existing multimodal reinforcement learning approaches remain limited to spatial reasoning within single-image contexts, yet still struggle to generalize to more complex and real-world scenarios involving multi-image positional reasoning, where understanding the relationships across images is crucial. To address this challenge, we propose a general reinforcement learning approach PeRL tailored for interleaved multimodal tasks, and a multi-stage strategy designed to enhance the exploration-exploitation trade-off, thereby improving learning efficiency and task performance. Specifically, we introduce permutation of image sequences to simulate varied positional relationships to explore more spatial and positional diversity. Furthermore, we design a rollout filtering mechanism for resampling to focus on trajectories that contribute most to learning optimal behaviors to exploit learned policies effectively. 
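The two mechanisms described above, permuting image sequences and filtering rollouts, can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the filtering criterion, so keeping only rollout groups whose rewards differ is a hypothetical choice, and `permute_images`, `filter_rollouts` and the `rewards` field are illustrative names.

```python
import random

def permute_images(images, answer_index, rng):
    """Shuffle the image order and remap a position-based answer to match."""
    order = list(range(len(images)))
    rng.shuffle(order)
    permuted = [images[i] for i in order]
    # Position at which the originally correct image ended up after shuffling.
    new_answer = order.index(answer_index)
    return permuted, new_answer

def filter_rollouts(groups):
    """Keep rollout groups with mixed rewards (assumed criterion, not from the paper)."""
    return [g for g in groups if len(set(g["rewards"])) > 1]

rng = random.Random(0)
images = ["img_a", "img_b", "img_c"]
permuted, new_answer = permute_images(images, answer_index=1, rng=rng)

groups = [
    {"id": 0, "rewards": [1.0, 1.0, 1.0]},  # all correct: no contrast to learn from
    {"id": 1, "rewards": [0.0, 1.0, 0.0]},  # mixed: kept
]
kept = filter_rollouts(groups)
```

Remapping the answer index keeps the supervision consistent with the shuffled input, so the permuted sample stays a valid training example.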
We evaluate our model on 5 widely-used multi-image benchmarks and 3 single-image benchmarks. Our experiments confirm that the PeRL-trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin, achieving state-of-the-art performance on multi-image benchmarks, while preserving comparable performance on single-image tasks.
[69] Frequency-Calibrated Membership Inference Attacks on Medical Image Diffusion Models
Xinkai Zhao,Yuta Tokuoka,Junichiro Iwasawa,Keita Oda
Main category: cs.CV
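TL;DR: Proposes a Frequency-Calibrated Reconstruction Error (FCRE) method for membership inference attacks (MIAs) on medical image diffusion models; it improves the accuracy of privacy-risk assessment by focusing on reconstruction errors in the mid-frequency range.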
Details
Motivation: Applying existing MIA methods to medical images is challenging because reconstruction error is influenced by inherent image difficulty and diffusion models struggle to reconstruct high-frequency details. Method: The FCRE method focuses on reconstruction errors within a specific mid-frequency range, excluding the high- and low-frequency regions. It analyzes the reverse diffusion process, obtains the mid-frequency reconstruction error, and computes the structural similarity index score between the reconstructed and original images; membership is determined by comparing this score to a threshold. Result: Experiments on several medical image datasets show that FCRE outperforms existing MIA methods. Conclusion: By reducing the confounding effect of inherent image difficulty, FCRE performs better for MIAs on medical image diffusion models and offers an improved approach to privacy-risk assessment. Abstract: The increasing use of diffusion models for image generation, especially in sensitive areas like medical imaging, has raised significant privacy concerns. Membership Inference Attack (MIA) has emerged as a potential approach to determine if a specific image was used to train a diffusion model, thus quantifying privacy risks. Existing MIA methods often rely on diffusion reconstruction errors, where member images are expected to have lower reconstruction errors than non-member images. However, applying these methods directly to medical images faces challenges. Reconstruction error is influenced by inherent image difficulty, and diffusion models struggle with high-frequency detail reconstruction. To address these issues, we propose a Frequency-Calibrated Reconstruction Error (FCRE) method for MIAs on medical image diffusion models. By focusing on reconstruction errors within a specific mid-frequency range and excluding both high-frequency (difficult to reconstruct) and low-frequency (less informative) regions, our frequency-selective approach mitigates the confounding factor of inherent image difficulty. Specifically, we analyze the reverse diffusion process, obtain the mid-frequency reconstruction error, and compute the structural similarity index score between the reconstructed and original images. Membership is determined by comparing this score to a threshold. Experiments on several medical image datasets demonstrate that our FCRE method outperforms existing MIA methods.
[70] Vision Transformers for End-to-End Quark-Gluon Jet Classification from Calorimeter Images
Md Abrar Jahin,Shahriar Soudeep,Arian Rahman Aditta,M. F. Mridha,Nafiz Fahad,Md. Jakir Hossen
Main category: cs.CV
TL;DR: This paper explores the use of Vision Transformer (ViT) architectures for distinguishing between quark and gluon jets using calorimeter images from simulated CMS Open Data. It finds that ViT-based models, particularly hybrids like ViT+MaxViT and ViT+ConvNeXt, outperform traditional CNNs in various performance metrics.
Details
Motivation: Distinguishing between quark- and gluon-initiated jets is crucial for new physics searches and precision measurements at the Large Hadron Collider. While deep learning methods such as CNNs have been effective, the potential of Vision Transformers (ViTs) for this task has not been fully explored, especially under realistic conditions. Method: The authors construct multi-channel jet-view images from detector-level energy deposits and reconstructed tracks. They systematically evaluate ViT-based models, including hybrid models with CNN components, for quark-gluon jet classification using simulated 2012 CMS Open Data. Result: ViT-based models, notably ViT+MaxViT and ViT+ConvNeXt hybrids, consistently outperform established CNN baselines in F1-score, ROC-AUC, and accuracy, demonstrating the advantage of capturing long-range spatial correlations within jet substructure. Conclusion: This work establishes a systematic framework and robust performance baselines for applying ViT architectures to calorimeter image-based jet classification using public collider data, providing a valuable resource for further deep learning research in high-energy physics. Abstract: Distinguishing between quark- and gluon-initiated jets is a critical and challenging task in high-energy physics, pivotal for improving new physics searches and precision measurements at the Large Hadron Collider. While deep learning, particularly Convolutional Neural Networks (CNNs), has advanced jet tagging using image-based representations, the potential of Vision Transformer (ViT) architectures, renowned for modeling global contextual information, remains largely underexplored for direct calorimeter image analysis, especially under realistic detector and pileup conditions. This paper presents a systematic evaluation of ViTs and ViT-CNN hybrid models for quark-gluon jet classification using simulated 2012 CMS Open Data. 
We construct multi-channel jet-view images from detector-level energy deposits (ECAL, HCAL) and reconstructed tracks, enabling an end-to-end learning approach. Our comprehensive benchmarking demonstrates that ViT-based models, notably ViT+MaxViT and ViT+ConvNeXt hybrids, consistently outperform established CNN baselines in F1-score, ROC-AUC, and accuracy, highlighting the advantage of capturing long-range spatial correlations within jet substructure. This work establishes the first systematic framework and robust performance baselines for applying ViT architectures to calorimeter image-based jet classification using public collider data, alongside a structured dataset suitable for further deep learning research in this domain.
[71] Advances in Compliance Detection: Novel Models Using Vision-Based Tactile Sensors
Ziteng Li,Malte Kuhlmann,Ilana Nisky,Nicolás Navarro-Guerrero
Main category: cs.CV
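TL;DR: This study proposes two models, based on LRCN and Transformer architectures, that use RGB tactile images and other information captured by the GelSight sensor to predict compliance metrics accurately. The new models significantly outperform the baseline, and the study finds a correlation between sensor compliance and object compliance estimation.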
Details
Motivation: Traditional compliance detection methods lack portability and scalability, rely on expensive equipment, and are unsuitable for robotic applications; existing neural network-based approaches using vision-based tactile sensors have insufficient prediction accuracy. Method: Two models are proposed, based on Long-term Recurrent Convolutional Networks (LRCNs) and Transformer architectures, which combine RGB tactile images and other information captured by the GelSight sensor to predict compliance. Result: The proposed models were validated with multiple metrics, estimate compliance accurately, and significantly outperform the baseline. The study also found a correlation between sensor compliance and object compliance estimation: harder objects are more difficult to estimate. Conclusion: The proposed LRCN- and Transformer-based models improve compliance prediction accuracy, suit engineering, agricultural, and biomedical applications, and reveal how sensor compliance affects object compliance estimation. Abstract: Compliance is a critical parameter for describing objects in engineering, agriculture, and biomedical applications. Traditional compliance detection methods are limited by their lack of portability and scalability, rely on specialized, often expensive equipment, and are unsuitable for robotic applications. Moreover, existing neural network-based approaches using vision-based tactile sensors still suffer from insufficient prediction accuracy. In this paper, we propose two models based on Long-term Recurrent Convolutional Networks (LRCNs) and Transformer architectures that leverage RGB tactile images and other information captured by the vision-based sensor GelSight to predict compliance metrics accurately. We validate the performance of these models using multiple metrics and demonstrate their effectiveness in accurately estimating compliance. The proposed models exhibit significant performance improvement over the baseline. Additionally, we investigated the correlation between sensor compliance and object compliance estimation, which revealed that objects that are harder than the sensor are more challenging to estimate.
[72] Hyper-Local Deformable Transformers for Text Spotting on Historical Maps
Yijun Lin,Yao-Yi Chiang
Main category: cs.CV
TL;DR: This paper proposes PALETTE, an end-to-end text spotter for scanned historical maps with a novel hyper-local sampling module and SynthMap+, a method to generate synthetic map images. Experiments show that PALETTE outperforms SOTA text spotters on two new benchmark datasets of historical maps.
Details
Motivation: Text extraction from historical maps is challenging due to the lack of effective methods and training data. Previous approaches are tailored to specific map styles and recent machine learning-based text spotters have challenges in extracting precise image features for predicting every sub-component in a text instance. Method: PALETTE introduces a novel hyper-local sampling module to learn localized image features around the target boundary points and characters of a text instance for detection and recognition. It also enables hyper-local positional embeddings to learn spatial interactions between boundary points and characters within and across text instances. Additionally, SynthMap+ is proposed to automatically generate synthetic map images for training text spotters. Result: PALETTE with SynthMap+ outperforms SOTA text spotters on two new benchmark datasets of historical maps, particularly for long and angled text. It has been deployed to process over 60,000 maps in the David Rumsey Historical Map collection and generated over 100 million text labels. Conclusion: PALETTE is an effective end-to-end text spotter for a wide variety of scanned historical maps, and SynthMap+ is a novel approach to generate synthetic map images for training text spotters. Abstract: Text on historical maps contains valuable information providing georeferenced historical, political, and cultural contexts. However, text extraction from historical maps is challenging due to the lack of (1) effective methods and (2) training data. Previous approaches use ad-hoc steps tailored to only specific map styles. Recent machine learning-based text spotters (e.g., for scene images) have the potential to solve these challenges because of their flexibility in supporting various types of text instances. However, these methods still face challenges in extracting precise image features for predicting every sub-component (boundary points and characters) in a text instance. 
This is critical because map text can be lengthy and highly rotated with complex backgrounds, posing difficulties in detecting relevant image features from a rough text region. This paper proposes PALETTE, an end-to-end text spotter for scanned historical maps of a wide variety. PALETTE introduces a novel hyper-local sampling module to explicitly learn localized image features around the target boundary points and characters of a text instance for detection and recognition. PALETTE also enables hyper-local positional embeddings to learn spatial interactions between boundary points and characters within and across text instances. In addition, this paper presents a novel approach to automatically generate synthetic map images, SynthMap+, for training text spotters for historical maps. The experiment shows that PALETTE with SynthMap+ outperforms SOTA text spotters on two new benchmark datasets of historical maps, particularly for long and angled text. We have deployed PALETTE with SynthMap+ to process over 60,000 maps in the David Rumsey Historical Map collection and generated over 100 million text labels to support map searching. The project is released at https://github.com/kartta-foundation/mapkurator-palette-doc.
[73] Break Stylistic Sophon: Are We Really Meant to Confine the Imagination in Style Transfer?
Gary Song Yan,Yusen Zhang,Jinyu Zhao,Hao Zhang,Zhangping Yang,Guanye Xiong,Yanfei Liu,Tao Zhang,Yujie He,Siyuan Tian,Yao Gou,Min Li
Main category: cs.CV
TL;DR: This study introduces StyleWallfacer, a novel framework for style transfer that unifies training and inference processes. It includes semantic-based style injection, human feedback-driven data augmentation, and a triple diffusion process without additional training.
Details
Motivation: To address issues in traditional style transfer methods and unify the framework for different tasks, aiming to enable artist-level style transfer and text-driven stylization. Method: 1) Semantic-based style injection using BLIP and large language models to create a semantic gap for efficient style knowledge injection; 2) Data augmentation strategy based on human feedback to reduce overfitting; 3) Training-free triple diffusion process manipulating self-attention layers for style injection while preserving text control. Result: Achieved high-quality image-driven style transfer and text-driven stylization with artist-level results, preserved original image content, and enabled image color editing during style transfer for the first time. Conclusion: StyleWallfacer successfully addresses limitations of traditional style transfer methods, offering a unified framework for various tasks with superior performance. Abstract: In this pioneering study, we introduce StyleWallfacer, a groundbreaking unified training and inference framework, which not only addresses various issues encountered in the style transfer process of traditional methods but also unifies the framework for different tasks. This framework is designed to revolutionize the field by enabling artist level style transfer and text driven stylization. First, we propose a semantic-based style injection method that uses BLIP to generate text descriptions strictly aligned with the semantics of the style image in CLIP space. By leveraging a large language model to remove style-related descriptions from these descriptions, we create a semantic gap. This gap is then used to fine-tune the model, enabling efficient and drift-free injection of style knowledge. Second, we propose a data augmentation strategy based on human feedback, incorporating high-quality samples generated early in the fine-tuning process into the training set to facilitate progressive learning and significantly reduce its overfitting. 
Finally, we design a training-free triple diffusion process using the fine-tuned model, which manipulates the features of self-attention layers in a manner similar to the cross-attention mechanism. Specifically, in the generation process, the key and value of the content-related process are replaced with those of the style-related process to inject style while maintaining text control over the model. We also introduce query preservation to mitigate disruptions to the original content. Under such a design, we have achieved high-quality image-driven style transfer and text-driven stylization, delivering artist-level style transfer results while preserving the original image content. Moreover, we achieve image color editing during the style transfer process for the first time.
[74] Enhancing Vector Quantization with Distributional Matching: A Theoretical and Empirical Study
Xianghong Fang,Litao Guo,Hengchao Chen,Yuxuan Zhang,Xiaofan Xia,Dingjie Song,Yexin Liu,Hao Wang,Harry Yang,Yuan Yuan,Qiang Sun
Main category: cs.CV
TL;DR: This paper addresses the issues of training instability and codebook collapse in vector quantization methods used in autoregressive models.
Details
Motivation: The motivation is to improve vector quantization by solving training instability and codebook collapse issues that arise due to mismatch between feature and code vector distributions. Method: The method involves using Wasserstein distance to align the distributions of features and code vectors, enhancing codebook utilization and reducing quantization error. Result: The approach achieves near 100% codebook utilization and significantly reduces quantization error, with both empirical and theoretical support. Conclusion: Employing Wasserstein distance for distribution alignment in vector quantization effectively resolves training instability and codebook collapse problems. Abstract: The success of autoregressive models largely depends on the effectiveness of vector quantization, a technique that discretizes continuous features by mapping them to the nearest code vectors within a learnable codebook. Two critical issues in existing vector quantization methods are training instability and codebook collapse. Training instability arises from the gradient discrepancy introduced by the straight-through estimator, especially in the presence of significant quantization errors, while codebook collapse occurs when only a small subset of code vectors are utilized during training. A closer examination of these issues reveals that they are primarily driven by a mismatch between the distributions of the features and code vectors, leading to unrepresentative code vectors and significant data information loss during compression. To address this, we employ the Wasserstein distance to align these two distributions, achieving near 100% codebook utilization and significantly reducing the quantization error. Both empirical and theoretical analyses validate the effectiveness of the proposed approach.
[75] SynPo: Boosting Training-Free Few-Shot Medical Segmentation via High-Quality Negative Prompts
Yufei Liu,Haoke Xiao,Jiaxing Chai,Yongcun Zhang,Rong Wang,Zijie Meng,Zhiming Luo
Main category: cs.CV
TL;DR: The paper introduces SynPo, a training-free few-shot method based on LVMs like SAM, which improves the quality of negative prompts using a Confidence Map Synergy Module combining DINOv2 and SAM. It selects point prompts with top-k pixels as positive points and uses Gaussian distribution for negative points followed by K-means clustering, achieving performance comparable to state-of-the-art training-based methods.
Details
Motivation: Existing training-free methods based on Large Vision Models (LVMs) cannot effectively utilize negative prompts, resulting in poor performance on low-contrast medical images. Method: Propose SynPo, a method that includes: designing a Confidence Map Synergy Module by combining DINOv2 and SAM to select point prompts; selecting top-k pixels as positive points set; choosing negative points set using a Gaussian distribution; performing independent K-means clustering for both sets; leveraging these selected points as high-quality prompts for SAM to obtain segmentation results. Result: Extensive experiments show that SynPo achieves performance comparable to state-of-the-art training-based few-shot methods. Conclusion: SynPo addresses the issue of ineffective negative prompts in existing training-free methods based on LVMs, significantly improving performance on low-contrast medical image segmentation. Abstract: The advent of Large Vision Models (LVMs) offers new opportunities for few-shot medical image segmentation. However, existing training-free methods based on LVMs fail to effectively utilize negative prompts, leading to poor performance on low-contrast medical images. To address this issue, we propose SynPo, a training-free few-shot method based on LVMs (e.g., SAM), with the core insight: improving the quality of negative prompts. To select point prompts in a more reliable confidence map, we design a novel Confidence Map Synergy Module by combining the strengths of DINOv2 and SAM. Based on the confidence map, we select the top-k pixels as the positive points set and choose the negative points set using a Gaussian distribution, followed by independent K-means clustering for both sets. Then, these selected points are leveraged as high-quality prompts for SAM to get the segmentation results. 
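The point-selection steps described above (top-k pixels from a confidence map, then K-means over the selected points) can be sketched in plain Python. This is only an illustrative sketch: the confidence map here is a toy nested list rather than the output of the Confidence Map Synergy Module, the Gaussian-based negative sampling is omitted because the abstract does not detail it, and `top_k_points`, `kmeans` and the first-k centroid initialisation are assumptions.

```python
def top_k_points(conf_map, k):
    """Return (row, col) coordinates of the k highest-confidence pixels."""
    scored = [(v, r, c) for r, row in enumerate(conf_map) for c, v in enumerate(row)]
    scored.sort(key=lambda t: -t[0])
    return [(r, c) for _, r, c in scored[:k]]

def kmeans(points, n_clusters, n_iter=10):
    """Plain Lloyd's K-means on 2D points, initialised from the first n_clusters points."""
    centroids = [(float(p[0]), float(p[1])) for p in points[:n_clusters]]
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centroids[i]  # keep an empty cluster's centroid unchanged
            for i, cl in enumerate(clusters)
        ]
    return centroids

# Toy confidence map (hypothetical values).
conf_map = [
    [0.1, 0.9, 0.2],
    [0.8, 0.3, 0.7],
]
positive_points = top_k_points(conf_map, k=2)

# Cluster candidate points down to a few representative prompts.
candidates = [(0, 0), (0, 1), (10, 10), (10, 11)]
prompt_centroids = kmeans(candidates, n_clusters=2)
```

Clustering reduces many candidate pixels to a handful of spatially spread prompts, which is one plausible reason to cluster the positive and negative sets independently.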
Extensive experiments demonstrate that SynPo achieves performance comparable to state-of-the-art training-based few-shot methods.
[76] Enhancing point cloud analysis via neighbor aggregation correction based on cross-stage structure correlation
Jiaqi Shi,Jin Xiao,Xiaoguang Hu,Boyang Song,Hao Jiang,Tianyou Chen,Baochang Zhang
Main category: cs.CV
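TL;DR: Proposes PDSA, a new point cloud analysis module that corrects the feature distribution using correlation in high-dimensional space, improving computational efficiency and robustness; it shows significant performance gains and generalization on semantic segmentation and classification tasks.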
Details
Motivation: Existing point cloud analysis methods suffer from irrelevant-point interference and feature hierarchy gaps when aggregating local structures, while enhancement methods based on direct geometric structure encoding incur high computational overhead and are sensitive to noise. Method: The Point Distribution Set Abstraction (PDSA) module uses correlation in high-dimensional space to correct the feature distribution, introduces a lightweight cross-stage structural descriptor to distinguish point correlation, and enhances structural homogeneity by reducing the variance of the neighbor feature matrix and increasing class separability. A key point mechanism is also introduced to reduce computational overhead. Result: Experiments show significant performance improvements on semantic segmentation and classification tasks at lower parameter cost. Ablation studies and visualizations further demonstrate the method's effectiveness and rationality. Conclusion: The PDSA module improves the computational efficiency and robustness of point cloud analysis, and its generalization is verified on different baseline models, providing a more efficient and robust solution for point cloud processing. Abstract: Point cloud analysis is the cornerstone of many downstream tasks, among which aggregating local structures is the basis for understanding point cloud data. While numerous works aggregate neighbors using three-dimensional relative coordinates, they suffer from irrelevant point interference and feature hierarchy gap problems due to the limitation of local coordinates. Although some works address this limitation by refining spatial description through explicit modeling of cross-stage structure, these enhancement methods based on direct geometric structure encoding have problems of high computational overhead and noise sensitivity. To overcome these problems, we propose the Point Distribution Set Abstraction module (PDSA) that utilizes the correlation in the high-dimensional space to correct the feature distribution during aggregation, which improves the computational efficiency and robustness. PDSA distinguishes the point correlation based on a lightweight cross-stage structural descriptor, and enhances structural homogeneity by reducing the variance of the neighbor feature matrix and increasing class separability through long-distance modeling. Additionally, we introduce a key point mechanism to optimize the computational overhead. The experimental results on semantic segmentation and classification tasks based on different baselines verify the generalization of the proposed method and show significant performance improvement with less parameter cost. The corresponding ablation and visualization results demonstrate the effectiveness and rationality of our method. 
The code and trained weights are available at: https://github.com/AGENT9717/PointDistribution
[77] Echo-DND: A dual noise diffusion model for robust and precise left ventricle segmentation in echocardiography
Abdur Rahman,Keerthiveena Balraj,Manojkumar Ramteke,Anurag Singh Rathore
Main category: cs.CV
TL;DR: Recent advancements in diffusion probabilistic models (DPMs) have revolutionized image processing, particularly in medical applications. Accurate segmentation of the left ventricle (LV) in echocardiograms is crucial but complicated due to noise and low contrast. This paper introduces Echo-DND, a novel dual-noise diffusion model that uses Gaussian and Bernoulli noises, multi-scale fusion conditioning, and spatial coherence calibration to improve segmentation precision. Validated on CAMUS and EchoNet-Dynamic datasets, Echo-DND outperforms existing models with high Dice scores, setting a new standard in echocardiogram segmentation.
Details
Motivation: Accurate segmentation of the left ventricle (LV) in echocardiograms is crucial for diagnostic procedures and necessary treatments, but ultrasound images are noisy with low contrast and ambiguous LV boundaries, complicating the segmentation process. Method: The paper proposes Echo-DND, a dual-noise diffusion model that leverages Gaussian and Bernoulli noises, incorporates a multi-scale fusion conditioning module, and utilizes spatial coherence calibration to maintain spatial integrity in segmentation masks. Result: The model was rigorously validated on the CAMUS and EchoNet-Dynamic datasets. It achieves high Dice scores of 0.962 and 0.939 respectively, outperforming existing state-of-the-art models. Conclusion: Echo-DND establishes a new standard in echocardiogram segmentation and its architecture holds promise for broader applicability in other medical imaging tasks, potentially improving diagnostic accuracy across various medical domains. Abstract: Recent advancements in diffusion probabilistic models (DPMs) have revolutionized image processing, demonstrating significant potential in medical applications. Accurate segmentation of the left ventricle (LV) in echocardiograms is crucial for diagnostic procedures and necessary treatments. However, ultrasound images are notoriously noisy with low contrast and ambiguous LV boundaries, thereby complicating the segmentation process. To address these challenges, this paper introduces Echo-DND, a novel dual-noise diffusion model specifically designed for this task. Echo-DND leverages a unique combination of Gaussian and Bernoulli noises. It also incorporates a multi-scale fusion conditioning module to improve segmentation precision. Furthermore, it utilizes spatial coherence calibration to maintain spatial integrity in segmentation masks. The model's performance was rigorously validated on the CAMUS and EchoNet-Dynamic datasets. 
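For readers unfamiliar with the Dice score reported for this model, the sketch below computes the standard Dice coefficient, 2|A∩B| / (|A| + |B|), between two binary masks. It is a generic definition, not code from the paper: masks are assumed to be flattened lists of 0/1 values, and returning 1.0 when both masks are empty is a convention chosen here for illustration.

```python
def dice_score(pred, target):
    """Dice coefficient between two flattened binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (assumed): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Toy masks (hypothetical values).
predicted_mask = [1, 1, 0, 0]
reference_mask = [1, 0, 0, 0]
score = dice_score(predicted_mask, reference_mask)
```

A score of 1.0 means the predicted and reference masks overlap exactly, and 0.0 means they share no foreground pixels.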
Extensive evaluations demonstrate that the proposed framework outperforms existing SOTA models. It achieves high Dice scores of 0.962 and 0.939 on these datasets, respectively. The proposed Echo-DND model establishes a new standard in echocardiogram segmentation, and its architecture holds promise for broader applicability in other medical imaging tasks, potentially improving diagnostic accuracy across various medical domains. Project page: https://abdur75648.github.io/Echo-DND
[78] ReSeDis: A Dataset for Referring-based Object Search across Large-Scale Image Collections
Ziling Huang,Yidan Zhang,Shin'ichi Satoh
Main category: cs.CV
TL;DR: The paper introduces Referring Search and Discovery (ReSeDis), a new task that combines corpus-level retrieval with pixel-level grounding to locate objects described by sentences in large image collections. A benchmark, metric, and zero-shot baseline are provided.
Details
Motivation: Current techniques for visual search engines only solve one part of the dual problem - either visual grounding or text-to-image retrieval. There is a need for a system that can both identify relevant images and precisely localize the object within them. Method: The method involves introducing ReSeDis, which unifies corpus-level retrieval with pixel-level grounding. Given a free-form description, it decides if an object is in an image and provides its location through bounding boxes or segmentation masks. The paper also presents a curated benchmark and a task-specific metric. Result: A straightforward zero-shot baseline using a frozen vision-language model was provided, showing significant potential for future improvements. The results indicate that ReSeDis offers a realistic testbed for developing robust multimodal search systems. Conclusion: ReSeDis represents a step towards building the next generation of robust and scalable multimodal search systems by combining retrieval and localization tasks into one unified framework. Abstract: Large-scale visual search engines are expected to solve a dual problem at once: (i) locate every image that truly contains the object described by a sentence and (ii) identify the object's bounding box or exact pixels within each hit. Existing techniques address only one side of this challenge. Visual grounding yields tight boxes and masks but rests on the unrealistic assumption that the object is present in every test image, producing a flood of false alarms when applied to web-scale collections. Text-to-image retrieval excels at sifting through massive databases to rank relevant images, yet it stops at whole-image matches and offers no fine-grained localization. We introduce Referring Search and Discovery (ReSeDis), the first task that unifies corpus-level retrieval with pixel-level grounding. 
Given a free-form description, a ReSeDis model must decide whether the queried object appears in each image and, if so, where it is, returning bounding boxes or segmentation masks. To enable rigorous study, we curate a benchmark in which every description maps uniquely to object instances scattered across a large, diverse corpus, eliminating unintended matches. We further design a task-specific metric that jointly scores retrieval recall and localization precision. Finally, we provide a straightforward zero-shot baseline using a frozen vision-language model, revealing significant headroom for future study. ReSeDis offers a realistic, end-to-end testbed for building the next generation of robust and scalable multimodal search systems.
[79] Conquering the Retina: Bringing Visual in-Context Learning to OCT
Alessio Negrini,Simon Reiß
Main category: cs.CV
TL;DR: This paper explores training generalist models for retinal OCT using visual in-context learning (VICL), proposes an evaluation protocol, evaluates a state-of-the-art VICL approach on multiple datasets, and releases code to promote further research.
Details
Motivation: Specialized models in medical image analysis are limited to predefined tasks and require significant resources. Generalist models could allow medical practitioners more flexibility without needing task-specific model development. Method: Training generalist models for retinal optical coherence tomography using visual in-context learning (VICL) with a few examples provided at inference time and proposing a broad evaluation protocol tailored to VICL in OCT. Result: Extensive evaluation of a state-of-the-art medical VICL approach on multiple retinal OCT datasets established a first baseline, highlighting both the potential and limitations of in-context learning for OCT. Conclusion: The authors have explored VICL for retinal OCT, created an evaluation protocol, identified strengths and limitations through extensive testing, and contributed to future research by releasing their code. Abstract: Recent advancements in medical image analysis have led to the development of highly specialized models tailored to specific clinical tasks. These models have demonstrated exceptional performance and remain a crucial research direction. Yet, their applicability is limited to predefined tasks, requiring expertise and extensive resources for development and adaptation. In contrast, generalist models offer a different form of utility: allowing medical practitioners to define tasks on the fly without the need for task-specific model development. In this work, we explore how to train generalist models for the domain of retinal optical coherence tomography using visual in-context learning (VICL), i.e., training models to generalize across tasks based on a few examples provided at inference time. To facilitate rigorous assessment, we propose a broad evaluation protocol tailored to VICL in OCT. 
We extensively evaluate a state-of-the-art medical VICL approach on multiple retinal OCT datasets, establishing a first baseline to highlight the potential and current limitations of in-context learning for OCT. To foster further research and practical adoption, we openly release our code.
[80] Privacy-Shielded Image Compression: Defending Against Exploitation from Vision-Language Pretrained Models
Xuelin Shen,Jiayin Xu,Kangsheng Yin,Wenhan Yang
Main category: cs.CV
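TL;DR: The paper proposes Privacy-Shielded Image Compression (PSIC), a flexible coding method that protects user privacy at the image compression stage against exploitation by vision-language pretrained (VLP) models. PSIC produces bitstreams with multiple decoding options: by default, decoding preserves perceptual quality while preventing interpretation by VLP models, the original image compression functionality is retained, and a customizable condition allows reconstruction of an image with full semantic information. A CLTG module, a UAEO optimization function, and a multi-objective optimization strategy improve encryption performance and perceptual quality. Experiments demonstrate its effectiveness.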
Details
Motivation: As the semantic understanding of vision-language pretrained (VLP) models improves, publicly posted images are increasingly easy for search engines and similar tools to exploit, so a method is needed that protects user privacy at the image compression stage. Method: The paper proposes Privacy-Shielded Image Compression (PSIC), which produces bitstreams with multiple decoding options. Default decoding preserves perceptual quality while preventing interpretation by VLP models, the original image compression functionality is retained, and a customizable input condition allows reconstruction of an image with full semantic information. A CLTG module generates bias information to guide the decoding process, a UAEO optimization function leverages soft labels for optimization, and an adaptive multi-objective optimization strategy improves encryption performance and perceptual quality within a unified training process. Result: Extensive experiments across multiple downstream tasks demonstrate the effectiveness of the proposed design. Conclusion: PSIC effectively protects user privacy against exploitation by VLP models while retaining image compression functionality and perceptual quality, and it integrates easily into existing Learned Image Compression (LIC) models. Abstract: The improved semantic understanding of vision-language pretrained (VLP) models has made it increasingly difficult to protect publicly posted images from being exploited by search engines and other similar tools. In this context, this paper seeks to protect users' privacy by implementing defenses at the image compression stage to prevent exploitation. Specifically, we propose a flexible coding method, termed Privacy-Shielded Image Compression (PSIC), that can produce bitstreams with multiple decoding options. By default, the bitstream is decoded to preserve satisfactory perceptual quality while preventing interpretation by VLP models. Our method also retains the original image compression functionality. With a customizable input condition, the proposed scheme can reconstruct the image that preserves its full semantic information. A Conditional Latent Trigger Generation (CLTG) module is proposed to produce bias information based on customizable conditions to guide the decoding process into different reconstructed versions, and an Uncertainty-Aware Encryption-Oriented (UAEO) optimization function is designed to leverage the soft labels inferred from the target VLP model's uncertainty on the training data. This paper further incorporates an adaptive multi-objective optimization strategy to obtain improved encrypting performance and perceptual quality simultaneously within a unified training process. 
The proposed scheme is plug-and-play and can be seamlessly integrated into most existing Learned Image Compression (LIC) models. Extensive experiments across multiple downstream tasks have demonstrated the effectiveness of our design.
[81] DM-FNet: Unified multimodal medical image fusion via diffusion process-trained encoder-decoder
Dan He,Weisheng Li,Guofen Wang,Yuping Huang,Shiqiang Liu
Main category: cs.CV
TL;DR: A two-stage diffusion model-based fusion network (DM-FNet) is proposed to address limitations in existing multimodal medical image fusion methods, enhancing feature capture and cross-modal interaction. It integrates a diffusion process, UNet, fusion modules, and a hybrid loss function, yielding high-quality fused images with balanced brightness, color, contrast, and detail.
Details
Motivation: Existing MMIF methods have limited capacity to capture detailed features and insufficient cross-modal feature interaction, resulting in suboptimal fused image quality. Method: The study proposes DM-FNet, which consists of two stages: Stage I uses a diffusion process to train UNet for image reconstruction, capturing detailed information through progressive denoising. Stage II inputs noisy images into the fusion network, integrating three key fusion modules for adaptive processing of different modalities. A hybrid loss function harmonizes brightness, color, contrast, and detail in the fused image. Result: Experimental results across various medical image types show exceptional performance in objective evaluation metrics. The fused image preserves appropriate brightness, comprehensive distribution of radioactive tracers, rich textures, and clear edges. Conclusion: DM-FNet effectively addresses the limitations of existing MMIF methods by enhancing feature capture and cross-modal interaction, producing high-quality fused images. Abstract: Multimodal medical image fusion (MMIF) extracts the most meaningful information from multiple source images, enabling a more comprehensive and accurate diagnosis. Achieving high-quality fusion results requires a careful balance of brightness, color, contrast, and detail; this ensures that the fused images effectively display relevant anatomical structures and reflect the functional status of the tissues. However, existing MMIF methods have limited capacity to capture detailed features during conventional training and suffer from insufficient cross-modal feature interaction, leading to suboptimal fused image quality. To address these issues, this study proposes a two-stage diffusion model-based fusion network (DM-FNet) to achieve unified MMIF. In Stage I, a diffusion process trains UNet for image reconstruction. 
UNet captures detailed information through progressive denoising and represents multilevel data, providing a rich set of feature representations for the subsequent fusion network. In Stage II, noisy images at various steps are input into the fusion network to enhance the model's feature recognition capability. Three key fusion modules are also integrated to process medical images from different modalities adaptively. Ultimately, the robust network structure and a hybrid loss function are integrated to harmonize the fused image's brightness, color, contrast, and detail, enhancing its quality and information density. The experimental results across various medical image types demonstrate that the proposed method performs exceptionally well regarding objective evaluation metrics. The fused image preserves appropriate brightness, a comprehensive distribution of radioactive tracers, rich textures, and clear edges. The code is available at https://github.com/HeDan-11/DM-FNet.
[82] video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models
Changli Tang,Yixuan Li,Yudong Yang,Jimin Zhuang,Guangzhi Sun,Wei Li,Zejun Ma,Chao Zhang
Main category: cs.CV
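TL;DR: video-SALMONN 2 is an advanced audio-visual large language model that uses low-rank adaptation (LoRA) and directed preference optimisation (DPO) to enhance captioning of videos with paired audio. The proposed new metrics and multi-round DPO approach improve the accuracy and completeness of its descriptions, reducing captioning error rates by 28%. The final model surpasses leading models such as GPT-4o and Gemini-1.5-Pro on video captioning and remains highly competitive among models of similar size.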
Details
Motivation: Videos contain a wealth of information, and generating detailed and accurate natural-language descriptions is a key aspect of video understanding. Improving the completeness and accuracy of video descriptions calls for a new model and optimisation method. Method: The paper proposes video-SALMONN 2, which uses low-rank adaptation (LoRA) and directed preference optimisation (DPO) for enhanced captioning of videos with paired audio. It introduces new metrics for the completeness and accuracy of descriptions and adopts a multi-round DPO (MrDPO) approach, which periodically updates the DPO reference model, merges and re-initialises the LoRA module, and incorporates guidance from ground-truth video captions to stabilise training. Result: Experiments show that MrDPO significantly improves the captioning accuracy of video-SALMONN 2, reducing captioning error rates by 28%. The final 7-billion-parameter model surpasses GPT-4o and Gemini-1.5-Pro on video captioning and performs comparably to state-of-the-art models on widely used video question-answering benchmarks. Conclusion: video-SALMONN 2 performs strongly on video captioning, with MrDPO significantly improving the accuracy and completeness of its descriptions. The model surpasses existing leading models on video captioning and is highly competitive among models of similar size. Abstract: Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimisation (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using DPO. To further improve training, we propose a novel multi-round DPO (MrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initialising the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilise the process. Experimental results show that MrDPO significantly enhances video-SALMONN 2's captioning accuracy, reducing the captioning error rates by 28%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining highly competitive performance to the state-of-the-art on widely used video question-answering benchmarks among models of similar size. 
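The "merging and re-initialising the LoRA module" step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the common LoRA parameterisation W + (alpha / r) * B A, uses plain Python lists rather than a tensor library, and the helper names `merge_lora` and `reinit_lora` and the initialisation constants are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(weight, lora_b, lora_a, alpha):
    """Fold the low-rank update into the base weight: W + (alpha / r) * B @ A."""
    rank = len(lora_a)  # A has shape (r, in_features)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape (out_features, in_features)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

def reinit_lora(out_features, in_features, rank, seed=0):
    """Fresh adapter with B = 0, so the merged weights are unchanged at the start."""
    rng = random.Random(seed)
    lora_a = [[rng.gauss(0.0, 0.01) for _ in range(in_features)]
              for _ in range(rank)]
    lora_b = [[0.0] * rank for _ in range(out_features)]
    return lora_b, lora_a

# One hypothetical "round": merge the trained adapter, then start from a fresh one.
weight = [[1.0, 0.0], [0.0, 1.0]]
trained_b = [[1.0], [2.0]]   # shape (out_features=2, r=1)
trained_a = [[0.5, -0.5]]    # shape (r=1, in_features=2)
merged = merge_lora(weight, trained_b, trained_a, alpha=2.0)
new_b, new_a = reinit_lora(out_features=2, in_features=2, rank=1)
```

Because the fresh adapter starts with B = 0, merging it again leaves `merged` unchanged, which is why the merged weights can serve as the starting point for the next round.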
Code is available at https://github.com/bytedance/video-SALMONN-2.
[83] Convolutional Feature Enhancement and Attention Fusion BiFPN for Ship Detection in SAR Images
Liangjie Meng,Danxia Li,Jinrong He,Lili Ma,Zhixin Li
Main category: cs.CV
TL;DR: The paper proposes C-AFBiFPN, a new feature enhancement and fusion framework for Synthetic Aperture Radar (SAR) ship detection. It incorporates Convolutional Feature Enhancement (CFE) and BiFormer attention in AFBiFPN to improve detection accuracy, especially for small targets.
Details
Motivation: SAR ship detection is crucial but challenging due to significant scale variations among ships, the presence of small offshore vessels mixed with noise, and complex backgrounds for large nearshore ships. Method: C-AFBiFPN constructs a Convolutional Feature Enhancement (CFE) module following the backbone network to enrich feature representation. Also, it integrates BiFormer attention within the fusion strategy of BiFPN, creating the AFBiFPN network to improve global modeling capability of cross-scale feature fusion. Result: Experimental results on SAR Ship Detection Dataset (SSDD) show that the proposed approach significantly enhances detection accuracy for small targets, robustness against occlusions, and adaptability to multi-scale features. Conclusion: C-AFBiFPN effectively addresses the challenges faced in SAR ship detection by improving feature representation and fusion. Abstract: Synthetic Aperture Radar (SAR) enables submeter-resolution imaging and all-weather monitoring via active microwave and advanced signal processing. Currently, SAR has found extensive applications in critical maritime domains such as ship detection. However, SAR ship detection faces several challenges, including significant scale variations among ships, the presence of small offshore vessels mixed with noise, and complex backgrounds for large nearshore ships. To address these issues, this paper proposes a novel feature enhancement and fusion framework named C-AFBiFPN. C-AFBiFPN constructs a Convolutional Feature Enhancement (CFE) module following the backbone network, aiming to enrich feature representation and enhance the ability to capture and represent local details and contextual information. Furthermore, C-AFBiFPN innovatively integrates BiFormer attention within the fusion strategy of BiFPN, creating the AFBiFPN network. AFBiFPN improves the global modeling capability of cross-scale feature fusion and can adaptively focus on critical feature regions. 
The experimental results on SAR Ship Detection Dataset (SSDD) indicate that the proposed approach substantially enhances detection accuracy for small targets, robustness against occlusions, and adaptability to multi-scale features.
[84] RA-NeRF: Robust Neural Radiance Field Reconstruction with Accurate Camera Pose Estimation under Complex Trajectories
Qingsong Yan,Qiang Wang,Kaiyong Zhao,Jie Chen,Bo Li,Xiaowen Chu,Fei Deng
Main category: cs.CV
TL;DR: RA-NeRF is a novel method that can predict highly accurate camera poses even with complex camera trajectories, achieving state-of-the-art results in both camera pose estimation and visual quality.
Details
Motivation: Existing methods for 3D reconstruction and SLAM tasks rely heavily on accurate camera pose priors. Current approaches fall short of achieving satisfactory accuracy, particularly when camera trajectories are complex. Method: The proposed RA-NeRF method uses an incremental pipeline to reconstruct the scene using NeRF with photometric consistency and incorporates flow-driven pose regulation to enhance robustness during initialization and localization. Additionally, it employs an implicit pose filter to capture the camera movement pattern and eliminate noise for pose estimation. Result: Extensive experiments on the Tanks&Temple dataset and the NeRFBuster dataset show that RA-NeRF achieves state-of-the-art results in both camera pose estimation and visual quality. Conclusion: RA-NeRF demonstrates its effectiveness and robustness in scene reconstruction under complex pose trajectories. Abstract: Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have emerged as powerful tools for 3D reconstruction and SLAM tasks. However, their performance depends heavily on accurate camera pose priors. Existing approaches attempt to address this issue by introducing external constraints but fall short of achieving satisfactory accuracy, particularly when camera trajectories are complex. In this paper, we propose a novel method, RA-NeRF, capable of predicting highly accurate camera poses even with complex camera trajectories. Following the incremental pipeline, RA-NeRF reconstructs the scene using NeRF with photometric consistency and incorporates flow-driven pose regulation to enhance robustness during initialization and localization. Additionally, RA-NeRF employs an implicit pose filter to capture the camera movement pattern and eliminate the noise for pose estimation. 
To validate our method, we conduct extensive experiments on the Tanks&Temple dataset for standard evaluation, as well as the NeRFBuster dataset, which presents challenging camera pose trajectories. On both datasets, RA-NeRF achieves state-of-the-art results in both camera pose estimation and visual quality, demonstrating its effectiveness and robustness in scene reconstruction under complex pose trajectories.
[85] Retrospective Memory for Camouflaged Object Detection
Chenxi Zhang,Jiayun Wu,Qing Zhang,Yazhe Zhai,Youwei Pang
Main category: cs.CV
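TL;DR: This paper proposes RetroMem, a recall-augmented camouflaged object detection architecture that dynamically modulates camouflage pattern perception and inference by integrating historical knowledge. It uses a two-stage training paradigm (a learning stage and a recall stage) and significantly improves the model's adaptability and effectiveness in complex camouflage scenes.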
Details
Motivation: Existing camouflaged object detection methods are mainly based on static visual representation modeling and lack explicit mechanisms for acquiring historical context, which limits their adaptability and effectiveness in challenging camouflage scenes. Method: RetroMem uses a two-stage training paradigm: 1) in the learning stage, a dense multi-scale adapter (DMA) improves the pretrained encoder's ability to capture rich multi-scale visual information with very few trainable parameters; 2) in the recall stage, a dynamic memory mechanism (DMM) and an inference pattern reconstruction (IPR) use the latent relationships between learned knowledge and the current sample context to reconstruct the inference of camouflage patterns. Result: Extensive experiments on several widely used datasets show that RetroMem significantly outperforms existing state-of-the-art methods. Conclusion: By integrating historical knowledge and dynamically modulating camouflage pattern perception and inference, RetroMem effectively improves performance in complex camouflage scenes. Abstract: Camouflaged object detection (COD) primarily focuses on learning subtle yet discriminative representations from complex scenes. Existing methods predominantly follow the parametric feedforward architecture based on static visual representation modeling. However, they lack explicit mechanisms for acquiring historical context, limiting their adaptation and effectiveness in handling challenging camouflage scenes. In this paper, we propose a recall-augmented COD architecture, namely RetroMem, which dynamically modulates camouflage pattern perception and inference by integrating relevant historical knowledge into the process. Specifically, RetroMem employs a two-stage training paradigm consisting of a learning stage and a recall stage to construct, update, and utilize memory representations effectively. During the learning stage, we design a dense multi-scale adapter (DMA) to improve the pretrained encoder's capability to capture rich multi-scale visual information with very few trainable parameters, thereby providing foundational inferences. In the recall stage, we propose a dynamic memory mechanism (DMM) and an inference pattern reconstruction (IPR). These components fully leverage the latent relationships between learned knowledge and current sample context to reconstruct the inference of camouflage patterns, thereby significantly improving the model's understanding of camouflage scenes. 
Extensive experiments on several widely used datasets demonstrate that our RetroMem significantly outperforms existing state-of-the-art methods.
[86] Domain Adaptation for Image Classification of Defects in Semiconductor Manufacturing
Adrian Poniatowski,Natalie Gentner,Manuel Barusco,Davide Dalle Pezze,Samuele Salti,Gian Antonio Susto
Main category: cs.CV
TL;DR: In the semiconductor field, with high market competition, Domain Adaptation (DA) techniques have been tested in semi-supervised and unsupervised settings to reduce manual labeling and re-training costs. The DBACS approach, a CycleGAN-inspired model with additional loss terms, was proposed and validated on real-world Electron Microscope images.
Details
Motivation: To address the challenges of time-to-market and quality in the semiconductor sector, leveraging DA techniques can minimize manual re-labeling and re-training of models, thus reducing computational costs and allowing experts to focus on high-value tasks. Method: The DBACS approach, inspired by CycleGAN and enhanced with additional loss terms, was developed and tested in semi-supervised and unsupervised settings using real-world Electron Microscope images. Result: The DBACS approach was successfully validated on real-world Electron Microscope images, demonstrating its effectiveness in advancing DA techniques for the semiconductor field. Conclusion: Domain Adaptation techniques, particularly the proposed DBACS approach, show promise in improving robustness and scalability in the semiconductor industry, reducing costs and enhancing model performance. Abstract: In the semiconductor sector, due to high demand but also strong and increasing competition, time to market and quality are key factors in securing significant market share in various application areas. Thanks to the success of deep learning methods in recent years in the computer vision domain, Industry 4.0 and 5.0 applications, such as defect classification, have achieved remarkable success. In particular, Domain Adaptation (DA) has proven highly effective since it focuses on using the knowledge learned on a (source) domain to adapt and perform effectively on a different but related (target) domain. By improving robustness and scalability, DA minimizes the need for extensive manual re-labeling or re-training of models. This not only reduces computational and resource costs but also allows human experts to focus on high-value tasks. Therefore, we tested the efficacy of DA techniques in semi-supervised and unsupervised settings within the context of the semiconductor field. Moreover, we propose the DBACS approach, a CycleGAN-inspired model enhanced with additional loss terms to improve performance. 
All the approaches are studied and validated on real-world Electron Microscope images considering the unsupervised and semi-supervised settings, proving the usefulness of our method in advancing DA techniques for the semiconductor field.
[87] MSNeRV: Neural Video Representation with Multi-Scale Feature Fusion
Jun Zhu,Xinfeng Zhang,Lv Tang,JunHao Jiang
Main category: cs.CV
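TL;DR: MSNeRV is a multi-scale feature fusion framework for neural video representation. By enhancing temporal consistency, designing a multi-scale spatial decoder, and introducing a multi-scale feature block, MSNeRV shows superior representation capability among INR methods and surpasses VTM-23.7 (Random Access) in compression efficiency in dynamic scenarios.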
Details
Motivation: Existing INR methods struggle to represent detail-intensive and fast-changing video content, mainly because internal network features are underutilized and network designs lack video-specific considerations. Method: The paper proposes MSNeRV, a multi-scale feature fusion framework that: 1) uses temporal windows in the encoding stage to enhance temporal consistency; 2) divides the video into multiple GoPs and uses a GoP-level grid for background representation; 3) designs a multi-scale spatial decoder with a scale-adaptive loss function to integrate multi-resolution and multi-frequency information; 4) introduces a multi-scale feature block to improve feature extraction. Result: Experiments on the HEVC ClassB and UVG datasets show that MSNeRV has superior representation capability and higher compression efficiency, surpassing VTM-23.7 (Random Access) in dynamic scenarios. Conclusion: Through multi-scale feature fusion, MSNeRV significantly improves the video representation capability of INR methods and achieves more efficient video compression in dynamic scenarios. Abstract: Implicit Neural Representations (INRs) have emerged as a promising approach for video compression, and have achieved comparable performance to the state-of-the-art codecs such as H.266/VVC. However, existing INR-based methods struggle to effectively represent detail-intensive and fast-changing video content. This limitation mainly stems from the underutilization of internal network features and the absence of video-specific considerations in network design. To address these challenges, we propose a multi-scale feature fusion framework, MSNeRV, for neural video representation. In the encoding stage, we enhance temporal consistency by employing temporal windows, and divide the video into multiple Groups of Pictures (GoPs), where a GoP-level grid is used for background representation. Additionally, we design a multi-scale spatial decoder with a scale-adaptive loss function to integrate multi-resolution and multi-frequency information. To further improve feature extraction, we introduce a multi-scale feature block that fully leverages hidden features. We evaluate MSNeRV on HEVC ClassB and UVG datasets for video representation and compression. Experimental results demonstrate that our model exhibits superior representation capability among INR-based approaches and surpasses VTM-23.7 (Random Access) in dynamic scenarios in terms of compression efficiency.
[88] BCRNet: Enhancing Landmark Detection in Laparoscopic Liver Surgery via Bezier Curve Refinement
Qian Li,Feng Liu,Shuojue Yang,Daiyun Shen,Yueming Jin
Main category: cs.CV
TL;DR: The paper proposes BCRNet, a novel framework that enhances landmark detection in laparoscopic liver surgery through the Bezier curve refinement strategy. It includes Multi-modal Feature Extraction (MFE), Adaptive Curve Proposal Initialization (ACPI), and Hierarchical Curve Refinement (HCR). Evaluations on L3D and P2ILF datasets show significant improvements over state-of-the-art methods.
Details
Motivation: Laparoscopic liver surgery faces challenges in accurately identifying critical anatomical structures. AR systems offer a solution but require precise detection of curvilinear anatomical landmarks in laparoscopic images. Method: The method involves three key components: an MFE module for capturing semantic features, ACPI for generating pixel-aligned Bezier curves and confidence scores, and an HCR mechanism for iteratively enhancing proposals via a multi-stage process. Result: BCRNet outperforms state-of-the-art methods on the L3D and P2ILF datasets, achieving significant performance improvements. Conclusion: BCRNet significantly enhances landmark detection in laparoscopic liver surgery and demonstrates superior performance compared to existing methods. Abstract: Laparoscopic liver surgery, while minimally invasive, poses significant challenges in accurately identifying critical anatomical structures. Augmented reality (AR) systems, integrating MRI/CT with laparoscopic images based on 2D-3D registration, offer a promising solution for enhancing surgical navigation. A vital aspect of the registration process is the precise detection of curvilinear anatomical landmarks in laparoscopic images. In this paper, we propose BCRNet (Bezier Curve Refinement Net), a novel framework that significantly enhances landmark detection in laparoscopic liver surgery primarily via the Bezier curve refinement strategy. The framework starts with a Multi-modal Feature Extraction (MFE) module designed to robustly capture semantic features. Then we propose Adaptive Curve Proposal Initialization (ACPI) to generate pixel-aligned Bezier curves and confidence scores for reliable initial proposals. Additionally, we design the Hierarchical Curve Refinement (HCR) mechanism to enhance these proposals iteratively through a multi-stage process, capturing fine-grained contextual details from multi-scale pixel-level features for precise Bezier curve adjustment. 
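For readers unfamiliar with the curve representation, the following minimal sketch shows how a Bezier curve is evaluated from its control points with De Casteljau's algorithm. It only illustrates the parameterisation and is not BCRNet's code: the abstract does not state the curve degree, so the cubic example with four control points, the coordinate values, and the function names are assumptions.

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] with De Casteljau's algorithm."""
    points = [tuple(p) for p in control_points]
    # Repeatedly interpolate between neighbouring points until one remains.
    while len(points) > 1:
        points = [tuple((1 - t) * a + t * b for a, b in zip(p, q))
                  for p, q in zip(points, points[1:])]
    return points[0]

def sample_curve(control_points, num_samples):
    """Return num_samples (>= 2) points evenly spaced in t along the curve."""
    return [bezier_point(control_points, i / (num_samples - 1))
            for i in range(num_samples)]

# Hypothetical cubic curve: four 2D control points in pixel coordinates.
control = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
curve = sample_curve(control, 5)
```

A curve described this way is fully determined by a handful of control points, so adjusting a curve amounts to moving those points rather than relabelling every pixel along the landmark.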
Extensive evaluations on the L3D and P2ILF datasets demonstrate that BCRNet outperforms state-of-the-art methods, achieving significant performance improvements. Code will be available.
[89] AI-driven visual monitoring of industrial assembly tasks
Mattia Nardon,Stefano Messelodi,Antonio Granata,Fabio Poiesi,Alberto Danese,Davide Boscaini
Main category: cs.CV
TL;DR: This paper presents ViMAT, an AI-driven system for real-time visual monitoring of assembly tasks without rigid workspace setups or visual markers. It combines a perception module with a reasoning module and is validated on two assembly tasks.
Details
Motivation: Visual monitoring of industrial assembly tasks is critical for preventing equipment damage and ensuring worker safety, but commercial solutions typically require rigid workspace setups or visual markers which simplify the problem but limit flexibility. Method: ViMAT consists of a perception module that extracts visual observations from multi-view video streams and a reasoning module that infers the most likely action being performed based on the observed assembly state and prior task knowledge. Result: ViMAT is validated on two assembly tasks involving LEGO components replacement and hydraulic press molds reconfiguration, demonstrating its effectiveness through quantitative and qualitative analysis in challenging real-world scenarios with partial and uncertain visual observations. Conclusion: ViMAT provides a flexible solution for real-time visual monitoring of assembly tasks without the need for rigid workspace setups or visual markers. Abstract: Visual monitoring of industrial assembly tasks is critical for preventing equipment damage due to procedural errors and ensuring worker safety. Although commercial solutions exist, they typically require rigid workspace setups or the application of visual markers to simplify the problem. We introduce ViMAT, a novel AI-driven system for real-time visual monitoring of assembly tasks that operates without these constraints. ViMAT combines a perception module that extracts visual observations from multi-view video streams with a reasoning module that infers the most likely action being performed based on the observed assembly state and prior task knowledge. We validate ViMAT on two assembly tasks, involving the replacement of LEGO components and the reconfiguration of hydraulic press molds, demonstrating its effectiveness through quantitative and qualitative analysis in challenging real-world scenarios characterized by partial and uncertain visual observations. 
Project page: https://tev-fbk.github.io/ViMAT
[90] MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering
Xinqi Fan,Jingting Li,John See,Moi Hoon Yap,Wen-Huang Cheng,Xiaobai Li,Xiaopeng Hong,Su-Jing Wang,Adrian K. Davision
Main category: cs.CV
TL;DR: The paper introduces two tasks for facial micro-expression (ME) analysis in the MEGC 2025 challenge: ME spot-then-recognize (ME-STR) and ME visual question answering (ME-VQA). These tasks leverage multimodal large language models and large vision-language models.
Details
Motivation: Facial micro-expressions (MEs) are involuntary movements of the face that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression. Conventional approaches that treat spotting and recognition as separate tasks are suboptimal, especially for analyzing long-duration videos in realistic settings. Method: The ME grand challenge (MEGC) 2025 introduces two tasks: (1) ME spot-then-recognize (ME-STR), which integrates ME spotting and subsequent recognition in a unified sequential pipeline; and (2) ME visual question answering (ME-VQA), which explores ME understanding through visual question answering, leveraging multimodal large language models (MLLMs) or large vision-language models (LVLMs). Result: All participating algorithms will run on the test set and submit their results on a leaderboard. Conclusion: The emerging multimodal large language models and large vision-language models offer promising new avenues for enhancing ME analysis. Abstract: Facial micro-expressions (MEs) are involuntary movements of the face that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression, typically found in a high-stakes environment. In recent years, substantial advancements have been made in the areas of ME recognition, spotting, and generation. However, conventional approaches that treat spotting and recognition as separate tasks are suboptimal, particularly for analyzing long-duration videos in realistic settings. Concurrently, the emergence of multimodal large language models (MLLMs) and large vision-language models (LVLMs) offers promising new avenues for enhancing ME analysis through their powerful multimodal reasoning capabilities. 
The ME grand challenge (MEGC) 2025 introduces two tasks that reflect these evolving research directions: (1) ME spot-then-recognize (ME-STR), which integrates ME spotting and subsequent recognition in a unified sequential pipeline; and (2) ME visual question answering (ME-VQA), which explores ME understanding through visual question answering, leveraging MLLMs or LVLMs to address diverse question types related to MEs. All participating algorithms are required to run on this test set and submit their results on a leaderboard. More details are available at https://megc2025.github.io.
[91] MapFM: Foundation Model-Driven HD Mapping with Multi-Task Contextual Learning
Leonid Ivanov,Vasily Yuryev,Dmitry Yudin
Main category: cs.CV
TL;DR: MapFM is an enhanced end-to-end model for online vectorized HD map generation in autonomous driving. It incorporates a powerful foundation model for encoding camera images and integrates auxiliary prediction heads for semantic segmentation in the BEV representation.
Details
Motivation: The paper aims to improve feature representation quality by incorporating a powerful foundation model for encoding camera images, and to improve prediction quality by integrating auxiliary prediction heads for semantic segmentation in the BEV representation. Method: MapFM incorporates a powerful foundation model for encoding camera images and integrates auxiliary prediction heads for semantic segmentation in the BEV representation. This multi-task learning approach provides richer contextual supervision. Result: The paper reports significantly improved feature representation quality, along with higher accuracy and improved quality of the predicted vectorized HD maps. Conclusion: MapFM, an enhanced end-to-end model for online vectorized HD map generation, yields higher accuracy and improved quality of the predicted maps. Abstract: In autonomous driving, high-definition (HD) maps and semantic maps in bird's-eye view (BEV) are essential for accurate localization, planning, and decision-making. This paper introduces an enhanced End-to-End model named MapFM for online vectorized HD map generation. We significantly boost feature representation quality by incorporating a powerful foundation model for encoding camera images. To further enrich the model's understanding of the environment and improve prediction quality, we integrate auxiliary prediction heads for semantic segmentation in the BEV representation. This multi-task learning approach provides richer contextual supervision, leading to a more comprehensive scene representation and ultimately resulting in higher accuracy and improved quality of the predicted vectorized HD maps. The source code is available at https://github.com/LIvanoff/MapFM.
[92] OpenPath: Open-Set Active Learning for Pathology Image Classification via Pre-trained Vision-Language Models
Lanfeng Zhong,Xin Liao,Shichuan Zhang,Shaoting Zhang,Guotai Wang
Main category: cs.CV
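TL;DR: In pathology image classification, active learning (AL) reduces annotation effort by selecting the most informative samples for labeling. Conventional methods, however, are inefficient and waste annotation cost in open-set scenarios. This paper proposes OpenPath, a novel open-set active learning method built on a pre-trained vision-language model (VLM). Using task-specific prompts and Diverse Informative ID Sampling (DIS), it selects in-distribution, informative samples and avoids out-of-distribution ones. Experiments show that OpenPath significantly improves model performance and outperforms existing methods.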
Details
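Motivation: Pathology image classification is crucial for accurate medical diagnosis and treatment planning, but training high-performance models requires large-scale annotated datasets that are expensive and time-consuming to acquire. Most existing active learning methods assume a closed-set scenario, whereas in real clinical settings the unlabeled pool often contains a large amount of out-of-distribution data, which makes annotation inefficient for traditional methods, and random selection in the first query round wastes annotation cost in open-set scenarios. Method: 1. The paper proposes OpenPath, which uses a pre-trained vision-language model (VLM) for open-set active learning. 2. In the first query, task-specific prompts that combine target-class and relevant non-target-class prompts are used to select in-distribution, informative samples from the unlabeled pool. 3. In subsequent queries, Diverse Informative ID Sampling (DIS), which includes Prototype-based ID candidate Selection (PIS) and Entropy-Guided Stochastic Sampling (EGSS), ensures both purity and informativeness in a query and avoids selecting out-of-distribution samples. Result: Experiments on two public pathology image datasets show that OpenPath significantly improves model performance thanks to the high purity of its selected samples and outperforms several state-of-the-art open-set active learning methods. Conclusion: OpenPath is an effective open-set active learning method that addresses the inefficiency of traditional methods in open-set scenarios, significantly improves model performance on pathology image classification, and has potential value for clinical use. Abstract: Pathology image classification plays a crucial role in accurate medical diagnosis and treatment planning. Training high-performance models for this task typically requires large-scale annotated datasets, which are both expensive and time-consuming to acquire. Active Learning (AL) offers a solution by iteratively selecting the most informative samples for annotation, thereby reducing the labeling effort. However, most AL methods are designed under the assumption of a closed-set scenario, where all the unannotated images belong to target classes. In real-world clinical environments, the unlabeled pool often contains a substantial amount of Out-Of-Distribution (OOD) data, leading to low efficiency of annotation in traditional AL methods. Furthermore, most existing AL methods start with random selection in the first query round, leading to a significant waste of labeling costs in open-set scenarios. To address these challenges, we propose OpenPath, a novel open-set active learning approach for pathological image classification leveraging a pre-trained Vision-Language Model (VLM). In the first query, we propose task-specific prompts that combine target and relevant non-target class prompts to effectively select In-Distribution (ID) and informative samples from the unlabeled pool. 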
In subsequent queries, Diverse Informative ID Sampling (DIS) that includes Prototype-based ID candidate Selection (PIS) and Entropy-Guided Stochastic Sampling (EGSS) is proposed to ensure both purity and informativeness in a query, avoiding the selection of OOD samples. Experiments on two public pathology image datasets show that OpenPath significantly enhances the model's performance due to its high purity of selected samples, and outperforms several state-of-the-art open-set AL methods. The code is available at https://github.com/HiLab-git/OpenPath.
[93] Open-World Object Counting in Videos
Niki Amini-Naieni,Andrew Zisserman
Main category: cs.CV
TL;DR: An open-world object counting task in videos is introduced, with a model CountVid leveraging image-based counting and video segmentation for accurate counts, evaluated on the new VideoCount dataset.
Details
Motivation: The motivation of this paper is to address the challenge of enumerating all unique instances of target objects in videos given a text description or an image example, particularly in crowded scenes with occlusions and similar objects where avoiding double counting and identifying reappearances is crucial. Method: The method involves introducing CountVid, a model that leverages an image-based counting model and a promptable video segmentation and tracking model to enable automated, open-world object counting across video frames. Result: Using the VideoCount dataset, it is demonstrated that CountVid provides accurate object counts and significantly outperforms strong baselines. Conclusion: The conclusion is that CountVid successfully performs open-world object counting in videos, as evidenced by its accuracy and performance improvements over baselines. The VideoCount dataset, CountVid model, and code are made publicly available. Abstract: We introduce a new task of open-world object counting in videos: given a text description, or an image example, that specifies the target object, the objective is to enumerate all the unique instances of the target objects in the video. This task is especially challenging in crowded scenes with occlusions and similar objects, where avoiding double counting and identifying reappearances is crucial. To this end, we make the following contributions: we introduce a model, CountVid, for this task. It leverages an image-based counting model, and a promptable video segmentation and tracking model to enable automated, open-world object counting across video frames. To evaluate its performance, we introduce VideoCount, a new dataset for our novel task built from the TAO and MOT20 tracking datasets, as well as from videos of penguins and metal alloy crystallization captured by x-rays. Using this dataset, we demonstrate that CountVid provides accurate object counts, and significantly outperforms strong baselines. 
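The requirement to avoid double counting can be illustrated with a toy sketch. It assumes, hypothetically, that a tracker has already assigned a persistent integer ID to every detected instance in every frame; the function name and input layout are illustrative and are not taken from CountVid.

```python
from typing import Iterable, List


def count_unique_instances(frames: Iterable[List[int]]) -> int:
    """Count distinct objects across a video.

    `frames` holds one list of track IDs per frame (a hypothetical format).
    An object that leaves and later reappears keeps its ID, so it is
    counted once rather than once per appearance.
    """
    seen = set()
    for track_ids in frames:
        seen.update(track_ids)  # IDs already seen are ignored by the set
    return len(seen)


# Object 1 reappears in the last frame and object 2 spans two frames.
example_frames = [[1, 2], [2, 3], [], [1]]
unique_count = count_unique_instances(example_frames)
```

The hard part in practice is producing consistent IDs under occlusion and among similar-looking objects, which is what the tracking component is for; this sketch only shows the counting step once such IDs exist.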
The VideoCount dataset, the CountVid model, and all the code are available at https://github.com/niki-amini-naieni/CountVid/.
[94] Unsupervised Pelage Pattern Unwrapping for Animal Re-identification
Aleksandr Algasov,Ekaterina Nepovinnykh,Fedor Zolotarev,Tuomas Eerola,Heikki Kälviäinen,Pavel Zemčík,Charles V. Stewart
Main category: cs.CV
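TL;DR: The paper proposes a geometry-aware texture mapping method that unwraps animal pelage patterns into a canonical UV space, making individual re-identification more robust. Experiments on seal and leopard datasets show that the method can be trained in a self-supervised manner without ground-truth UV annotations and improves re-identification accuracy by up to 5.4%.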
Details
Motivation: Existing individual re-identification methods struggle with the geometric distortions of animal fur or skin patterns caused by body movement and posture changes. Method: A geometry-aware texture mapping method is proposed that uses surface normal estimation to guide the unwrapping process while preserving geometric consistency between the 3D surface and the 2D texture space. This pattern-preserving UV mapping is combined with existing re-identification techniques. Result: Experiments show that the method improves re-identification accuracy across different poses and viewing angles, by up to 5.4%. Conclusion: The proposed geometry-aware texture mapping method effectively improves individual re-identification accuracy through self-supervised learning, without relying on ground-truth UV annotations. Abstract: Existing individual re-identification methods often struggle with the deformable nature of animal fur or skin patterns which undergo geometric distortions due to body movement and posture changes. In this paper, we propose a geometry-aware texture mapping approach that unwraps pelage patterns, the unique markings found on an animal's skin or fur, into a canonical UV space, enabling more robust feature matching. Our method uses surface normal estimation to guide the unwrapping process while preserving the geometric consistency between the 3D surface and the 2D texture space. We focus on two challenging species: Saimaa ringed seals (Pusa hispida saimensis) and leopards (Panthera pardus). Both species have distinctive yet highly deformable fur patterns. By integrating our pattern-preserving UV mapping with existing re-identification techniques, we demonstrate improved accuracy across diverse poses and viewing angles. Our framework does not require ground truth UV annotations and can be trained in a self-supervised manner. Experiments on seal and leopard datasets show up to a 5.4% improvement in re-identification accuracy.
[95] When Model Knowledge meets Diffusion Model: Diffusion-assisted Data-free Image Synthesis with Alignment of Domain and Class
Yujin Kim,Hyunsoo Kim,Hyunwoo J. Kim,Suhyun Kim
Main category: cs.CV
TL;DR: The paper introduces DDIS, a new method for Data-Free Image Synthesis which uses a text-to-image diffusion model to improve synthetic image quality. It incorporates Domain Alignment Guidance and Class Alignment Token for better alignment with training data distribution, showing superior performance in experiments.
Details
Motivation: Existing Data-Free Image Synthesis (DFIS) methods suffer from generating samples that do not accurately reflect the training data distribution due to lack of prior knowledge about natural images. Method: DDIS leverages a text-to-image diffusion model as an image prior to generate high-quality synthetic images. It introduces Domain Alignment Guidance (DAG) to align synthetic data domain with the training data domain during diffusion sampling and optimizes a single Class Alignment Token (CAT) embedding to capture class-specific attributes. Result: Experiments on PACS and ImageNet datasets show that DDIS outperforms previous DFIS methods by generating samples that more accurately reflect the training data distribution, achieving state-of-the-art (SOTA) performance in data-free applications. Conclusion: DDIS is a novel and effective approach for Data-Free Image Synthesis, producing higher quality synthetic images that better align with the training data distribution. Abstract: Open-source pre-trained models hold great potential for diverse applications, but their utility declines when their training data is unavailable. Data-Free Image Synthesis (DFIS) aims to generate images that approximate the learned data distribution of a pre-trained model without accessing the original data. However, existing DFIS methods produce samples that deviate from the training data distribution due to the lack of prior knowledge about natural images. To overcome this limitation, we propose DDIS, the first Diffusion-assisted Data-free Image Synthesis method that leverages a text-to-image diffusion model as a powerful image prior, improving synthetic image quality. DDIS extracts knowledge about the learned distribution from the given model and uses it to guide the diffusion model, enabling the generation of images that accurately align with the training data distribution. 
To achieve this, we introduce Domain Alignment Guidance (DAG) that aligns the synthetic data domain with the training data domain during the diffusion sampling process. Furthermore, we optimize a single Class Alignment Token (CAT) embedding to effectively capture class-specific attributes in the training dataset. Experiments on PACS and ImageNet demonstrate that DDIS outperforms prior DFIS methods by generating samples that better reflect the training data distribution, achieving SOTA performance in data-free applications.
[96] NERO: Explainable Out-of-Distribution Detection with Neuron-level Relevance
Anju Chhetri,Jari Korhonen,Prashnna Gyawali,Binod Bhattarai
Main category: cs.CV
TL;DR: In the field of deep learning, especially in medical imaging, ensuring reliability is crucial. Identifying out-of-distribution (OOD) samples is a key indicator of model reliability. The paper proposes a new OOD scoring method called NERO, which uses neuron-level relevance at the feature layer to improve OOD detection and achieve better performance than current state-of-the-art methods.
Details
Motivation: To address the limitation of existing OOD detection methods that may not fully capture OOD diversity, thus improving model reliability in medical imaging. Method: The proposed method, NERO, clusters neuron-level relevance for each in-distribution class to form representative centroids. It introduces a relevance distance metric to measure deviation from these centroids and enhances OOD separability. Additionally, it incorporates scaled relevance in the bias term and combines feature norms to refine performance. Result: Validated across multiple deep learning architectures on the Kvasir and GastroVision benchmarks, achieving improvements over state-of-the-art OOD detection methods. Conclusion: NERO provides an effective framework for OOD detection in medical imaging, offering enhanced reliability and explainability. Abstract: Ensuring reliability is paramount in deep learning, particularly within the domain of medical imaging, where diagnostic decisions often hinge on model outputs. The capacity to separate out-of-distribution (OOD) samples has proven to be a valuable indicator of a model's reliability in research. In medical imaging, this is especially critical, as identifying OOD inputs can help flag potential anomalies that might otherwise go undetected. While many OOD detection methods rely on feature or logit space representations, recent works suggest these approaches may not fully capture OOD diversity. To address this, we propose a novel OOD scoring mechanism, called NERO, that leverages neuron-level relevance at the feature layer. Specifically, we cluster neuron-level relevance for each in-distribution (ID) class to form representative centroids and introduce a relevance distance metric to quantify a new sample's deviation from these centroids, enhancing OOD separability. Additionally, we refine performance by incorporating scaled relevance in the bias term and combining feature norms. Our framework also enables explainable OOD detection. 
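The centroid-and-distance idea described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not NERO's implementation: the per-sample relevance vectors are assumed to be already computed, each class is summarized by a single mean centroid instead of a clustering step, Euclidean distance stands in for the paper's relevance distance metric, and the bias-term and feature-norm refinements are omitted. All function names are hypothetical.

```python
import numpy as np


def class_centroids(relevance: np.ndarray, labels: np.ndarray) -> dict:
    """Mean relevance vector per in-distribution class.

    relevance: (n_samples, n_neurons) array of neuron-level relevance values.
    labels:    (n_samples,) array of integer class labels.
    """
    return {int(c): relevance[labels == c].mean(axis=0) for c in np.unique(labels)}


def relevance_distance(sample: np.ndarray, centroids: dict) -> float:
    """OOD score: Euclidean distance to the nearest class centroid.

    A larger value means the sample lies further from every ID class.
    """
    return min(float(np.linalg.norm(sample - mu)) for mu in centroids.values())


# Toy data: two ID classes with two neurons each.
train_relevance = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]])
train_labels = np.array([0, 0, 1, 1])
centroids = class_centroids(train_relevance, train_labels)

id_score = relevance_distance(np.array([0.9, 0.1]), centroids)
ood_score = relevance_distance(np.array([5.0, 5.0]), centroids)
```

In this toy setup the in-distribution sample sits on a class centroid and scores near zero, while the far-away sample receives a much larger score; a threshold on the score would then flag OOD inputs.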
We validate its effectiveness across multiple deep learning architectures on the gastrointestinal imaging benchmarks Kvasir and GastroVision, achieving improvements over state-of-the-art OOD detection methods.
[97] Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material
Team Hunyuan3D,Shuhui Yang,Mingxin Yang,Yifei Feng,Xin Huang,Sheng Zhang,Zebin He,Di Luo,Haolin Liu,Yunfei Zhao,Qingxiang Lin,Zeqiang Lai,Xianghui Yang,Huiwen Shi,Zibo Zhao,Bowen Zhang,Hongyu Yan,Lifu Wang,Sicong Liu,Jihong Zhang,Meng Chen,Liang Dong,Yiwen Jia,Yulin Cai,Jiaao Yu,Yixuan Tang,Dongyuan Guo,Junlin Yu,Hao Zhang,Zheng Ye,Peng He,Runzhou Wu,Shida Wei,Chao Zhang,Yonghao Tan,Yifu Sun,Lin Niu,Shirui Huang,Bojian Zheng,Shu Liu,Shilin Chen,Xiang Yuan,Xiaofeng Yang,Kai Liu,Jianchen Zhu,Peng Chen,Tian Liu,Di Wang,Yuhong Liu,Linus,Jie Jiang,Jingwei Huang,Chunchao Guo
Main category: cs.CV
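TL;DR: This paper presents the Hunyuan3D 2.1 system as a case study in 3D AI-generated content (AIGC), offering a comprehensive guide that covers the full pipeline from data processing and model training to performance evaluation. The goal is to give users the knowledge to fine-tune or develop 3D generative models for gaming, virtual reality, and industrial design.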
Details
Motivation: Although several groundbreaking models have transformed 3D generation, the field remains largely limited to researchers, developers, and designers because collecting, processing, and training on 3D data is complex and time-consuming. An easy-to-use, efficient system is therefore needed to lower this barrier. Method: Using Hunyuan3D 2.1 as a case study, the paper walks through 3D data processing, model training, and performance evaluation. The system has two core components: Hunyuan3D-DiT for shape generation and Hunyuan3D-Paint for texture synthesis. It also covers data preparation, model architecture, training strategies, evaluation metrics, and deployment. Result: After working through the tutorial, users have the knowledge needed to fine-tune or develop robust 3D generative models applicable to gaming, virtual reality, and industrial design. Conclusion: As an advanced tool, Hunyuan3D 2.1 simplifies the generation of 3D models and improves their performance across applications, which should help advance 3D AIGC technology and broaden its use. Abstract: 3D AI-generated content (AIGC) is a passionate field that has significantly accelerated the creation of 3D models in gaming, film, and design. Despite the development of several groundbreaking models that have revolutionized 3D generation, the field remains largely accessible only to researchers, developers, and designers due to the complexities involved in collecting, processing, and training 3D models. To address these challenges, we introduce Hunyuan3D 2.1 as a case study in this tutorial. This tutorial offers a comprehensive, step-by-step guide on processing 3D data, training a 3D generative model, and evaluating its performance using Hunyuan3D 2.1, an advanced system for producing high-resolution, textured 3D assets. The system comprises two core components: the Hunyuan3D-DiT for shape generation and the Hunyuan3D-Paint for texture synthesis. We will explore the entire workflow, including data preparation, model architecture, training strategies, evaluation metrics, and deployment. By the conclusion of this tutorial, you will have the knowledge to finetune or develop a robust 3D generative model suitable for applications in gaming, virtual reality, and industrial design.
[98] Multimodal Large Language Models for Medical Report Generation via Customized Prompt Tuning
Chunlei Li,Jingyang Hou,Yilei Shi,Jingliang Hu,Xiao Xiang Zhu,Lichao Mou
Main category: cs.CV
TL;DR: The paper introduces MRG-LLM, a multimodal large language model combining frozen LLM and learnable visual encoder with dynamic prompt customization for medical report generation. It achieves state-of-the-art performance on IU X-ray and MIMIC-CXR datasets.
Details
Motivation: Medical report generation from imaging data is a challenging task in clinical practice that requires effective integration of large language models with medical imaging data. Method: MRG-LLM uses a novel multimodal large language model architecture that combines a frozen LLM with a learnable visual encoder. It introduces dynamic prompt customization through conditional affine transformations derived from visual features, proposing two implementations: prompt-wise and promptbook-wise customization. Result: Extensive experiments on IU X-ray and MIMIC-CXR datasets show that MRG-LLM achieves state-of-the-art performance in medical report generation. Conclusion: MRG-LLM effectively integrates large language models with medical imaging data using a learnable visual encoder and dynamic prompt customization, achieving superior results in medical report generation. Abstract: Medical report generation from imaging data remains a challenging task in clinical practice. While large language models (LLMs) show great promise in addressing this challenge, their effective integration with medical imaging data still deserves in-depth exploration. In this paper, we present MRG-LLM, a novel multimodal large language model (MLLM) that combines a frozen LLM with a learnable visual encoder and introduces a dynamic prompt customization mechanism. Our key innovation lies in generating instance-specific prompts tailored to individual medical images through conditional affine transformations derived from visual features. We propose two implementations: prompt-wise and promptbook-wise customization, enabling precise and targeted report generation. Extensive experiments on IU X-ray and MIMIC-CXR datasets demonstrate that MRG-LLM achieves state-of-the-art performance in medical report generation. Our code will be made publicly available.
[99] GenHOI: Generalizing Text-driven 4D Human-Object Interaction Synthesis for Unseen Objects
Shujia Li,Haiyu Zhang,Xinyuan Chen,Yaohui Wang,Yutong Ban
Main category: cs.CV
TL;DR: This paper presents GenHOI, a two-stage framework for 4D human-object interaction synthesis that generalizes to unseen objects and generates high-quality sequences. It uses Object-AnchorNet for keyframe reconstruction and Contact-Aware Diffusion Model for interpolation, achieving state-of-the-art results.
Details
Motivation: The motivation of this paper is to address the challenge of synthesizing 4D human-object interactions (HOI) by overcoming the limitation of large-scale 4D HOI datasets, focusing on generalization to unseen objects and high-fidelity sequence generation. Method: The method involves a two-stage framework called GenHOI. In the first stage, an Object-AnchorNet reconstructs sparse 3D HOI keyframes for unseen objects using only 3D HOI datasets. In the second stage, a Contact-Aware Diffusion Model (ContactDM) interpolates these keyframes into dense 4D HOI sequences. Within ContactDM, a novel Contact-Aware Encoder extracts human-object contact patterns and Contact-Aware HOI Attention integrates these signals into the diffusion model. Result: The experimental results show state-of-the-art performance on the OMOMO and 3D-FUTURE datasets, indicating strong generalization abilities to unseen objects and high-fidelity 4D HOI generation. Conclusion: GenHOI successfully achieves generalization to unseen objects and high-fidelity 4D HOI sequence synthesis, demonstrating its effectiveness through state-of-the-art results on benchmark datasets. Abstract: While diffusion models and large-scale motion datasets have advanced text-driven human motion synthesis, extending these advances to 4D human-object interaction (HOI) remains challenging, mainly due to the limited availability of large-scale 4D HOI datasets. In our study, we introduce GenHOI, a novel two-stage framework aimed at achieving two key objectives: 1) generalization to unseen objects and 2) the synthesis of high-fidelity 4D HOI sequences. In the initial stage of our framework, we employ an Object-AnchorNet to reconstruct sparse 3D HOI keyframes for unseen objects, learning solely from 3D HOI datasets, thereby mitigating the dependence on large-scale 4D HOI datasets. 
Subsequently, we introduce a Contact-Aware Diffusion Model (ContactDM) in the second stage to seamlessly interpolate sparse 3D HOI keyframes into densely temporally coherent 4D HOI sequences. To enhance the quality of generated 4D HOI sequences, we propose a novel Contact-Aware Encoder within ContactDM to extract human-object contact patterns and a novel Contact-Aware HOI Attention to effectively integrate the contact signals into diffusion models. Experimental results show that we achieve state-of-the-art results on the publicly available OMOMO and 3D-FUTURE datasets, demonstrating strong generalization abilities to unseen objects, while enabling high-fidelity 4D HOI generation.
[100] NTIRE 2025 Image Shadow Removal Challenge Report
Florin-Alexandru Vasluianu,Tim Seizinger,Zhuyun Zhou,Cailian Chen,Zongwei Wu,Radu Timofte,Mingjia Li,Jin Hu,Hainuo Wang,Hengxing Liu,Jiarui Wang,Qiming Hu,Xiaojie Guo,Xin Lu,Jiarong Yang,Yuanfei Bao,Anya Hu,Zihao Fan,Kunyu Wang,Jie Xiao,Xi Wang,Xueyang Fu,Zheng-Jun Zha,Yu-Fan Lin,Chia-Ming Lee,Chih-Chung Hsu,Xingbo Wang,Dong Li,Yuxu Chen,Bin Chen,Yuanbo Zhou,Yuanbin Chen,Hongwei Wang,Jiannan Lin,Qinquan Gao,Tong Tong,Zhao Zhang,Yanyan Wei,Wei Dong,Han Zhou,Seyed Amirreza Mousavi,Jun Chen,Haobo Liang,Jiajie Jing,Junyu Li,Yan Yang,Seoyeon Lee,Chaewon Kim,Ziyu Feng,Shidi Chen,Bowen Luan,Zewen Chen,Vijayalaxmi Ashok Aralikatti,G Gyaneshwar Rao,Nikhil Akalwadi,Chaitra Desai,Ramesh Ashok Tabib,Uma Mudenagudi,Anas M. Ali,Bilel Benjdira,Wadii Boulila,Alexandru Brateanu,Cosmin Ancuti,Tanmay Chaturvedi,Manish Kumar,Anmol Srivastav,Daksh Trivedi,Shashwat Thakur,Kishor Upla,Zeyu Xiao,Zhuoyuan Li,Boda Zhou,Shashank Shekhar,Kele Xu,Qisheng Xu,Zijian Gao,Tianjiao Wan,Suiyi Zhao,Bo Wang,Yan Luo,Mingshen Wang,Yilin Zhang
Main category: cs.CV
TL;DR: This paper examines the results of the NTIRE 2025 Shadow Removal Challenge, which had two evaluation tracks focusing on reconstruction fidelity and visual perception. 17 teams submitted solutions using images from the WSRD+ dataset.
Details
Motivation: To advance the state-of-the-art in shadow removal techniques by challenging participants to develop innovative solutions that can handle interactions between self- and cast-shadows with diverse objects, textures, and materials. Method: The challenge included two evaluation tracks: one for reconstruction fidelity and the other for visual perception through a user study, both using images from the WSRD+ dataset. Result: A total of 306 participants registered and 17 teams successfully submitted their solutions during the final evaluation phase. Conclusion: The NTIRE 2025 Shadow Removal Challenge provided valuable insights into current shadow removal techniques and highlighted areas for future research. Abstract: This work examines the findings of the NTIRE 2025 Shadow Removal Challenge. A total of 306 participants have registered, with 17 teams successfully submitting their solutions during the final evaluation phase. Following the last two editions, this challenge had two evaluation tracks: one focusing on reconstruction fidelity and the other on visual perception through a user study. Both tracks were evaluated with images from the WSRD+ dataset, simulating interactions between self- and cast-shadows with a large number of diverse objects, textures, and materials.
[101] CLAIM: Clinically-Guided LGE Augmentation for Realistic and Diverse Myocardial Scar Synthesis and Segmentation
Farheen Ramzan,Yusuf Kiberu,Nikesh Jathanna,Shahnaz Jamil-Copley,Richard H. Clayton,Chen Chen
Main category: cs.CV
TL;DR: CLAIM is a deep learning-based framework for myocardial scar synthesis and segmentation from LGE cardiac MRI. It introduces the SMILE module for clinically guided scar mask generation and a joint training strategy for the generator and the segmentation network.
Details
Motivation: The limited availability and variability of LGE images with high-quality scar labels restrict the development of robust segmentation models for myocardial scar segmentation from LGE cardiac MRI. Method: The paper introduces the CLAIM framework, which contains the SMILE module (Scar Mask generation guided by cLinical knowledgE) to synthesize images with anatomically consistent and spatially diverse scar patterns. It also employs a joint training strategy in which the scar segmentation network is optimized alongside the generator. Result: Experimental results show that CLAIM produces anatomically coherent scar patterns and achieves higher Dice similarity with real scar distributions compared to baseline models. Conclusion: CLAIM enables controllable and realistic myocardial scar synthesis and has demonstrated utility for downstream medical imaging tasks. Abstract: Deep learning-based myocardial scar segmentation from late gadolinium enhancement (LGE) cardiac MRI has shown great potential for accurate and timely diagnosis and treatment planning for structural cardiac diseases. However, the limited availability and variability of LGE images with high-quality scar labels restrict the development of robust segmentation models. To address this, we introduce CLAIM: Clinically-Guided LGE Augmentation for Realistic and Diverse Myocardial Scar Synthesis and Segmentation, a framework for anatomically grounded scar generation and segmentation. At its core is the SMILE module (Scar Mask generation guided by cLinical knowledgE), which conditions a diffusion-based generator on the clinically adopted AHA 17-segment model to synthesize images with anatomically consistent and spatially diverse scar patterns. In addition, CLAIM employs a joint training strategy in which the scar segmentation network is optimized alongside the generator, aiming to enhance both the realism of synthesized scars and the accuracy of the scar segmentation performance. 
Experimental results show that CLAIM produces anatomically coherent scar patterns and achieves higher Dice similarity with real scar distributions compared to baseline models. Our approach enables controllable and realistic myocardial scar synthesis and has demonstrated utility for downstream medical imaging tasks.
[102] RaCalNet: Radar Calibration Network for Sparse-Supervised Metric Depth Estimation
Xingrui Qin,Wentao Zhao,Chuan Cao,Yihe Niu,Houcheng Jiang,Jingchuan Wang
Main category: cs.CV
TL;DR: RaCalNet is a new framework for dense metric depth estimation using millimeter-wave radar which eliminates the need for dense LiDAR supervision by leveraging sparse LiDAR and refined radar measurements, achieving superior performance with only about 1% supervision density.
Details
Motivation: The motivation is to eliminate the requirement of dense LiDAR supervision in depth estimation from millimeter-wave radar, making the process less costly and data-intensive. Method: RaCalNet recalibrates and refines sparse radar points to construct accurate depth priors, which serve as anchors to guide monocular depth prediction without needing dense supervision. This method uses sparse LiDAR to supervise the learning of refined radar measurements. Result: RaCalNet surpasses state-of-the-art dense-supervised methods, producing depth maps with clear object contours and fine-grained textures. It reduces RMSE by 35.30% on the ZJU-4DRadarCam dataset and by 34.89% in real-world deployment scenarios. Conclusion: RaCalNet successfully achieves dense metric depth estimation with only about 1% supervision density compared to dense-supervised methods, demonstrating its effectiveness through extensive experiments. Abstract: Dense metric depth estimation using millimeter-wave radar typically requires dense LiDAR supervision, generated via multi-frame projection and interpolation, to guide the learning of accurate depth from sparse radar measurements and RGB images. However, this paradigm is both costly and data-intensive. To address this, we propose RaCalNet, a novel framework that eliminates the need for dense supervision by using sparse LiDAR to supervise the learning of refined radar measurements, resulting in a supervision density of merely around 1% compared to dense-supervised methods. Unlike previous approaches that associate radar points with broad image regions and rely heavily on dense labels, RaCalNet first recalibrates and refines sparse radar points to construct accurate depth priors. These priors then serve as reliable anchors to guide monocular depth prediction, enabling metric-scale estimation without resorting to dense supervision. This design improves structural consistency and preserves fine details. 
Despite relying solely on sparse supervision, RaCalNet surpasses state-of-the-art dense-supervised methods, producing depth maps with clear object contours and fine-grained textures. Extensive experiments on the ZJU-4DRadarCam dataset and real-world deployment scenarios demonstrate its effectiveness, reducing RMSE by 35.30% and 34.89%, respectively.
[103] Control and Realism: Best of Both Worlds in Layout-to-Image without Training
Bonan Li,Yinhan Hu,Songhua Liu,Xinchao Wang
Main category: cs.CV
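TL;DR: WinWinLay is a novel training-free method that uses a Non-local Attention Energy Function and an Adaptive Update strategy to address the imprecise localization and unrealistic artifacts of existing text-to-image diffusion models in layout-to-image generation, improving both control precision and realism.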
Details
Motivation: Existing text-to-image diffusion models can achieve precise control over complex scenes without training on specific data, but they suffer from imprecise localization and unrealistic artifacts. Method: A training-free method called WinWinLay is proposed, built on two core strategies: a Non-local Attention Energy Function and Adaptive Update. The former redistributes attention scores to reduce spatial distribution bias so that objects conform better to layout instructions; the latter introduces an adaptive update scheme based on Langevin dynamics that promotes in-domain updates while respecting layout constraints. Result: Extensive experiments show that WinWinLay excels at controlling element placement and achieving photorealistic visual fidelity, outperforming current state-of-the-art methods. Conclusion: Through these improved strategies, WinWinLay raises precision and realism in layout-to-image generation and surpasses existing techniques without additional training. Abstract: Layout-to-Image generation aims to create complex scenes with precise control over the placement and arrangement of subjects. Existing works have demonstrated that pre-trained Text-to-Image diffusion models can achieve this goal without training on any specific data; however, they often face challenges with imprecise localization and unrealistic artifacts. Focusing on these drawbacks, we propose a novel training-free method, WinWinLay. At its core, WinWinLay presents two key strategies, Non-local Attention Energy Function and Adaptive Update, that collaboratively enhance control precision and realism. On one hand, we theoretically demonstrate that the commonly used attention energy function introduces inherent spatial distribution biases, hindering objects from being uniformly aligned with layout instructions. To overcome this issue, non-local attention prior is explored to redistribute attention scores, facilitating objects to better conform to the specified spatial conditions. On the other hand, we identify that the vanilla backpropagation update rule can cause deviations from the pre-trained domain, leading to out-of-distribution artifacts. We accordingly introduce a Langevin dynamics-based adaptive update scheme as a remedy that promotes in-domain updating while respecting layout constraints. Extensive experiments demonstrate that WinWinLay excels in controlling element placement and achieving photorealistic visual fidelity, outperforming the current state-of-the-art methods.
[104] Show-o2: Improved Native Unified Multimodal Models
Jinheng Xie,Zhenheng Yang,Mike Zheng Shou
Main category: cs.CV
TL;DR: This paper presents Show-o2, an improved multimodal model using autoregressive modeling and flow matching for effective understanding and generation of text, images, and videos. Code and models are released on GitHub.
Details
Motivation: To advance the capabilities of unified multimodal models in handling diverse modalities such as text, images, and videos effectively. Method: The method involves a 3D causal variational autoencoder space with dual-path spatial(-temporal) fusion for constructing unified visual representations. Autoregressive modeling is applied to the language head and flow matching to the flow head, facilitated by a two-stage training approach. Result: Show-o2 models exhibit versatility in multimodal understanding and generation tasks across different modalities, demonstrating scalability and effectiveness. Conclusion: Show-o2 successfully leverages autoregressive modeling and flow matching to enhance multimodal tasks, with open-source code and models available. Abstract: This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
[105] Baltimore Atlas: FreqWeaver Adapter for Semi-supervised Ultra-high Spatial Resolution Land Cover Classification
Junhao Wu,Aboagye-Ntow Stephen,Chuyuan Wang,Gang Chen,Xin Huang
Main category: cs.CV
TL;DR: The paper proposes a parameter-efficient semi-supervised segmentation framework for ultra-high spatial resolution land cover classification imagery at 0.3 m resolution, incorporating SAM2 knowledge and FreqWeaver Adapter to enhance detail modeling while keeping lightweight design, achieving better results with minimal parameter overhead.
Details
Motivation: Ultra-high spatial resolution land cover classification is crucial but challenging due to high annotation costs, scale variation, and limited adaptability of large-scale vision models. Current methods focus on lower resolution imagery and rely heavily on annotated data, which limits their applicability for higher-resolution imagery under weak supervision. Method: A parameter-efficient semi-supervised segmentation framework for 0.3 m spatial resolution imagery was developed. This framework leverages the knowledge of SAM2 and introduces a remote sensing-specific FreqWeaver Adapter, enhancing fine-grained detail modeling with only 5.96% of total model parameters. Result: The method effectively uses unlabeled data and maintains minimal parameter overhead, delivering robust segmentation results with superior structural consistency. It achieves a 1.78% improvement over existing parameter-efficient tuning strategies and a 3.44% gain compared to state-of-the-art high-resolution remote sensing segmentation approaches. Conclusion: The proposed framework successfully addresses challenges in ultra-high spatial resolution land cover classification by providing a lightweight solution that enhances detail modeling and delivers improved performance. Abstract: Ultra-high Spatial Resolution Land Cover Classification is essential for fine-grained land cover analysis, yet it remains challenging due to the high cost of pixel-level annotations, significant scale variation, and the limited adaptability of large-scale vision models. Existing methods typically focus on 1-meter spatial resolution imagery and rely heavily on annotated data, whereas practical applications often require processing higher-resolution imagery under weak supervision. 
To address this, we propose a parameter-efficient semi-supervised segmentation framework for 0.3 m spatial resolution imagery, which leverages the knowledge of SAM2 and introduces a remote sensing-specific FreqWeaver Adapter to enhance fine-grained detail modeling while maintaining a lightweight design at only 5.96% of the total model parameters. By effectively leveraging unlabeled data and maintaining minimal parameter overhead, the proposed method delivers robust segmentation results with superior structural consistency, achieving a 1.78% improvement over existing parameter-efficient tuning strategies and a 3.44% gain compared to state-of-the-art high-resolution remote sensing segmentation approaches.
[106] A Unified Graph-based Framework for Scalable 3D Tree Reconstruction and Non-Destructive Biomass Estimation from Point Clouds
Di Wang,Shi Li
Main category: cs.CV
TL;DR: This paper presents a novel unified framework for end-to-end processing of large-scale point clouds using an innovative graph-based pipeline to estimate forest above-ground biomass (AGB) non-destructively. It significantly reduces dependency on specialized pre-processing tools and establishes ULS as a viable alternative to TLS.
Details
Motivation: Estimating forest above-ground biomass (AGB) is crucial for assessing carbon storage and supporting sustainable forest management. Current Quantitative Structural Model (QSM) methods face limitations in scalability and practical deployment due to reliance on high-quality point cloud data and multiple pre-processing steps. Method: The study proposes a unified framework enabling end-to-end processing of large-scale point clouds through a graph-based pipeline. This approach integrates tree segmentation, leaf-wood separation, and 3D skeletal reconstruction via dedicated graph operations such as pathing and abstracting for tree topology reasoning. Result: The method showed strong performance under challenging conditions, including leaf-on scenarios (~20% relative error) and low-density ULS datasets with partial coverage (~30% relative error). It provides a robust and scalable solution for large-scale, non-destructive AGB estimation. Conclusion: This advancement improves the feasibility of QSM-based AGB estimation by reducing dependency on specialized pre-processing tools and establishing ULS as a viable alternative to TLS, paving the way for broader applications in forest inventory and climate change research. Abstract: Estimating forest above-ground biomass (AGB) is crucial for assessing carbon storage and supporting sustainable forest management. Quantitative Structural Model (QSM) offers a non-destructive approach to AGB estimation through 3D tree structural reconstruction. However, current QSM methods face significant limitations, as they are primarily designed for individual trees, depend on high-quality point cloud data from terrestrial laser scanning (TLS), and also require multiple pre-processing steps that hinder scalability and practical deployment. This study presents a novel unified framework that enables end-to-end processing of large-scale point clouds using an innovative graph-based pipeline.
The proposed approach seamlessly integrates tree segmentation, leaf-wood separation and 3D skeletal reconstruction through dedicated graph operations including pathing and abstracting for tree topology reasoning. Comprehensive validation was conducted on datasets with varying leaf conditions (leaf-on and leaf-off), spatial scales (tree- and plot-level), and data sources (TLS and UAV-based laser scanning, ULS). Experimental results demonstrate strong performance under challenging conditions, particularly in leaf-on scenarios (~20% relative error) and low-density ULS datasets with partial coverage (~30% relative error). These findings indicate that the proposed framework provides a robust and scalable solution for large-scale, non-destructive AGB estimation. It significantly reduces dependency on specialized pre-processing tools and establishes ULS as a viable alternative to TLS. To our knowledge, this is the first method capable of enabling seamless, end-to-end 3D tree reconstruction at operational scales. This advancement substantially improves the feasibility of QSM-based AGB estimation, paving the way for broader applications in forest inventory and climate change research.
[107] One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution
Yujing Sun,Lingchen Sun,Shuaizheng Liu,Rongyuan Wu,Zhengqiang Zhang,Lei Zhang
Main category: cs.CV
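TL;DR: DLoRAL is a new method that combines two-stage training (C-LoRA and D-LoRA) with a Stable Diffusion model to balance rich detail and temporal coherence in video super-resolution. Experiments show strong performance in both accuracy and speed.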
Details
Motivation: Current Stable Diffusion-based real-world video super-resolution methods often sacrifice spatial detail to preserve temporal consistency, resulting in suboptimal visual quality. A method is therefore needed that effectively extracts degradation-robust temporal consistency priors from the low-quality input video while enhancing video details. Method: A Dual LoRA Learning (DLoRAL) paradigm is proposed, comprising two modules: a Cross-Frame Retrieval (CFR) module that aggregates complementary information across frames, and a Consistency-LoRA (C-LoRA) that learns robust temporal representations from degraded inputs. The CFR and C-LoRA modules are then frozen, and a Detail-LoRA (D-LoRA) is trained to enhance spatial details while keeping temporal consistency. The two phases alternate iteratively, and the two LoRA branches are finally merged into the Stable Diffusion model for efficient, high-quality single-step video restoration. Result: Experiments show that DLoRAL performs strongly in both accuracy and speed, delivering consistent and detail-rich outputs. Conclusion: Through an effective two-stage training strategy, DLoRAL achieves a good balance between the temporal and spatial dimensions in video super-resolution, improving visual quality and processing efficiency. Abstract: It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step.
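The abstract's remark that the two LoRA branches are merged into the SD model at inference can be illustrated with the standard LoRA weight-folding identity, W' = W + sum(scale * B A). The sketch below is a generic illustration in plain Python, not the DLoRAL code: it assumes both branches attach to the same linear layer and combine additively, and the names `merge_lora`, `c_branch`, `d_branch` and the scale values are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (flat list)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def merge_lora(weight, branches):
    """Return W + sum(scale * up @ down) as a new dense matrix.

    Each branch is (down, up, scale) with down: (rank, in) and up: (out, rank).
    Additive merging of several branches is an assumption of this sketch.
    """
    merged = [row[:] for row in weight]
    for down, up, scale in branches:
        delta = matmul(up, down)  # (out, rank) @ (rank, in) -> (out, in)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += scale * value
    return merged


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
out_f, in_f, rank = 4, 3, 2
W = rand_matrix(out_f, in_f)
# Hypothetical stand-ins for a "consistency" and a "detail" branch.
c_branch = (rand_matrix(rank, in_f), rand_matrix(out_f, rank), 1.0)
d_branch = (rand_matrix(rank, in_f), rand_matrix(out_f, rank), 0.5)

W_merged = merge_lora(W, [c_branch, d_branch])

# The merged layer reproduces the base output plus both low-rank updates.
x = [1.0, -2.0, 0.5]
y_merged = matvec(W_merged, x)
y_unmerged = matvec(W, x)
for down, up, scale in (c_branch, d_branch):
    update = matvec(up, matvec(down, x))
    y_unmerged = [y + scale * u for y, u in zip(y_unmerged, update)]
```

Once folded in, the adapters add no extra matrix multiplications at inference, which is the usual reason for merging LoRA weights.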
Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.
[108] Mono-Modalizing Extremely Heterogeneous Multi-Modal Medical Image Registration
Kyobin Choo,Hyunkyung Han,Jinyeong Kim,Chanyong Yoon,Seong Jae Hwang
Main category: cs.CV
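TL;DR: Proposes M2M-Reg, a new framework for multi-modal deformable image registration (DIR) that trains using only mono-modal similarity, addressing registration between highly heterogeneous modalities. Experiments show that M2M-Reg achieves up to 2x higher DSC than existing methods on PET-MRI and FA-MRI registration.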
Details
Motivation: Because of the extreme heterogeneity between functional imaging modalities (such as PET and FA) and structural references (such as MRI and CT), conventional unsupervised DIR methods struggle to learn reliable mappings and often distort images. The similarity metrics guiding these models fail to capture alignment between highly disparate modalities. Method: M2M-Reg (Multi-to-Mono Registration) is a new framework that trains multi-modal DIR models using only mono-modal similarity while preserving the established architectural paradigm. A regularizer, GradCyCon, leverages M2M-Reg's cyclic training scheme to promote diffeomorphism. The framework also extends naturally to a semi-supervised setting, integrating pre-aligned and unaligned pairs without requiring ground-truth transformations or segmentation masks. Result: Experiments on the ADNI dataset show that M2M-Reg achieves up to 2x higher DSC than prior methods for PET-MRI and FA-MRI registration, demonstrating its effectiveness on highly heterogeneous multi-modal DIR. Conclusion: M2M-Reg provides an effective solution for multi-modal deformable image registration between highly heterogeneous modalities, with improved accuracy and reliability. Abstract: In clinical practice, imaging modalities with functional characteristics, such as positron emission tomography (PET) and fractional anisotropy (FA), are often aligned with a structural reference (e.g., MRI, CT) for accurate interpretation or group analysis, necessitating multi-modal deformable image registration (DIR). However, due to the extreme heterogeneity of these modalities compared to standard structural scans, conventional unsupervised DIR methods struggle to learn reliable spatial mappings and often distort images. We find that the similarity metrics guiding these models fail to capture alignment between highly disparate modalities. To address this, we propose M2M-Reg (Multi-to-Mono Registration), a novel framework that trains multi-modal DIR models using only mono-modal similarity while preserving the established architectural paradigm for seamless integration into existing models. We also introduce GradCyCon, a regularizer that leverages M2M-Reg's cyclic training scheme to promote diffeomorphism. Furthermore, our framework naturally extends to a semi-supervised setting, integrating pre-aligned and unaligned pairs only, without requiring ground-truth transformations or segmentation masks. Experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset demonstrate that M2M-Reg achieves up to 2x higher DSC than prior methods for PET-MRI and FA-MRI registration, highlighting its effectiveness in handling highly heterogeneous multi-modal DIR.
Our code is available at https://github.com/MICV-yonsei/M2M-Reg.
[109] BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion
Yuqing Lan,Chenyang Zhu,Zhirui Gao,Jiazhao Zhang,Yihan Cao,Renjiao Yi,Yijie Wang,Kai Xu
Main category: cs.CV
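TL;DR: Proposes a reconstruction-free online framework for efficient 3D object detection that combines Cubify Anything with CLIP and processes multi-view data through association and optimization modules, achieving superior results on the ScanNetV2 and CA-1M datasets.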
Details
Motivation: Existing 3D object detection methods rely on dense point cloud reconstruction, which incurs substantial computational overhead and memory constraints and hinders real-time deployment. Method: Given streaming RGB-D video input, the method uses Cubify Anything as a pre-trained visual foundation model for single-view 3D object detection, coupled with CLIP to capture open-vocabulary semantics. Multi-view detections are fused by an association module (3D NMS plus a matching module) and an optimization module (an IoU-guided random optimization technique based on particle filtering). Result: Extensive experiments on ScanNetV2 and CA-1M show that the method achieves state-of-the-art performance among online methods, generalizes well, and enables real-time perception even in environments exceeding 1000 square meters. Conclusion: The proposed reconstruction-free framework achieves efficient, real-time 3D object detection with low computational complexity and strong generalization. Abstract: Open-vocabulary 3D object detection has gained significant interest due to its critical applications in autonomous driving and embodied AI. Existing detection methods, whether offline or online, typically rely on dense point cloud reconstruction, which imposes substantial computational overhead and memory constraints, hindering real-time deployment in downstream tasks. To address this, we propose a novel reconstruction-free online framework tailored for memory-efficient and real-time 3D detection. Specifically, given streaming posed RGB-D video input, we leverage Cubify Anything as a pre-trained visual foundation model (VFM) for single-view 3D object detection by bounding boxes, coupled with CLIP to capture open-vocabulary semantics of detected objects. To fuse all detected bounding boxes across different views into a unified one, we employ an association module for correspondences of multi-views and an optimization module to fuse the 3D bounding boxes of the same instance predicted in multi-views. The association module utilizes 3D Non-Maximum Suppression (NMS) and a box correspondence matching module, while the optimization module uses an IoU-guided efficient random optimization technique based on particle filtering to enforce multi-view consistency of the 3D bounding boxes while minimizing computational complexity. Extensive experiments on ScanNetV2 and CA-1M datasets demonstrate that our method achieves state-of-the-art performance among online methods.
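As a rough illustration of the 3D Non-Maximum Suppression step named in the abstract, the following sketch applies greedy NMS to axis-aligned 3D boxes. It is a simplified, generic version rather than the BoxFusion implementation: real detectors usually predict oriented boxes, and the box layout `(x_min, y_min, z_min, x_max, y_max, z_max)`, the function names, and the 0.5 threshold are all assumptions made for this example.

```python
def iou_3d(a, b):
    """Intersection-over-union of two axis-aligned boxes
    given as (x_min, y_min, z_min, x_max, y_max, z_max)."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def nms_3d(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    its IoU with every already-kept box is at most the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou_3d(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [
    (0.0, 0.0, 0.0, 1.0, 1.0, 1.0),
    (0.1, 0.0, 0.0, 1.1, 1.0, 1.0),  # near-duplicate of the first box
    (2.0, 2.0, 2.0, 3.0, 3.0, 3.0),  # separate object
]
scores = [0.9, 0.8, 0.7]
kept = nms_3d(boxes, scores)  # indices of the surviving boxes
```

Here the second box overlaps the first with an IoU of about 0.82, so it is suppressed and indices 0 and 2 survive.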
Benefiting from this novel reconstruction-free paradigm for 3D object detection, our method exhibits great generalization abilities in various scenarios, enabling real-time perception even in environments exceeding 1000 square meters.
[110] HOIDiNi: Human-Object Interaction through Diffusion Noise Optimization
Roey Ron,Guy Tevet,Haim Sawdayee,Amit H. Bermano
Main category: cs.CV
TL;DR: HOIDiNi is a text-driven diffusion framework for synthesizing realistic and plausible human-object interaction (HOI). It uses Diffusion Noise Optimization (DNO) to optimize in the noise space of a pretrained diffusion model, separating the process into an object-centric phase and a human-centric phase. This approach ensures precise hand-object contact while maintaining motion naturalness. Evaluations on the GRAB dataset show HOIDiNi outperforms prior works in contact accuracy, physical validity, and overall quality.
Details
Motivation: Current methods for generating human-object interactions trade off between realism and physical correctness. The authors aim to create a system that achieves both simultaneously. Method: The method, HOIDiNi, utilizes Diffusion Noise Optimization (DNO) to directly optimize in the noise space of a pretrained diffusion model. The process is separated into two phases: an object-centric phase for making discrete choices of hand-object contact locations, and a human-centric phase that refines full-body motion based on these choices. Result: HOIDiNi demonstrates superior performance in contact accuracy, physical validity, and overall quality compared to prior works and baselines, as evaluated quantitatively, qualitatively, and subjectively on the GRAB dataset. It can generate complex, controllable interactions such as grasping, placing, and full-body coordination using only textual prompts. Conclusion: HOIDiNi successfully synthesizes realistic and physically correct human-object interactions by optimizing in the noise space of a diffusion model and employing a structured two-phase approach. Abstract: We present HOIDiNi, a text-driven diffusion framework for synthesizing realistic and plausible human-object interaction (HOI). HOI generation is extremely challenging since it induces strict contact accuracies alongside a diverse motion manifold. While current literature trades off between realism and physical correctness, HOIDiNi optimizes directly in the noise space of a pretrained diffusion model using Diffusion Noise Optimization (DNO), achieving both. This is made feasible thanks to our observation that the problem can be separated into two phases: an object-centric phase, primarily making discrete choices of hand-object contact locations, and a human-centric phase that refines the full-body motion to realize this blueprint. This structured approach allows for precise hand-object contact without compromising motion naturalness. 
Quantitative, qualitative, and subjective evaluations on the GRAB dataset alone clearly indicate HOIDiNi outperforms prior works and baselines in contact accuracy, physical validity, and overall quality. Our results demonstrate the ability to generate complex, controllable interactions, including grasping, placing, and full-body coordination, driven solely by textual prompts. https://hoidini.github.io.
[111] FindingDory: A Benchmark to Evaluate Memory in Embodied Agents
Karmesh Yadav,Yusuf Ali,Gunshi Gupta,Yarin Gal,Zsolt Kira
Main category: cs.CV
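TL;DR: To evaluate the use of long-term memory in embodied tasks, this paper introduces a new benchmark and runs a series of experiments that combine state-of-the-art vision-language models with low-level navigation policies.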
Details
Motivation: Current large vision-language models (VLMs) struggle with long-horizon tasks, especially in embodied settings that require integrating experience collected over multiple days. Existing long-video QA benchmarks do not adequately cover embodied challenges such as object manipulation and navigation. A new benchmark is therefore needed that specifically evaluates performance on tasks involving long-term memory. Method: The authors introduce a new benchmark for evaluating long-horizon embodied tasks in the Habitat simulator. It comprises 60 tasks that can be procedurally extended to longer and more challenging versions. They also provide baselines that combine state-of-the-art VLMs with low-level navigation policies to assess these models on memory-intensive tasks. Result: Using the proposed benchmark and baselines, the study reveals the limitations of existing VLMs on long-term memory tasks and identifies directions for improvement. Conclusion: The new benchmark effectively evaluates VLMs on long-horizon embodied tasks and underscores the importance of integrating memory and reasoning. Future work should aim to improve models' ability to process large image collections and to act on historical information. Abstract: Large vision-language models have recently demonstrated impressive performance in planning and control tasks, driving interest in their application to real-world robotics. However, deploying these models for reasoning in embodied contexts is limited by their ability to incorporate long-term experience collected across multiple days and represented by vast collections of images. Current VLMs typically struggle to process more than a few hundred images concurrently, highlighting the need for more efficient mechanisms to handle long-term memory in embodied settings. To effectively evaluate these models for long-horizon control, a benchmark must specifically target scenarios where memory is crucial for success. Existing long-video QA benchmarks overlook embodied challenges like object manipulation and navigation, which demand low-level skills and fine-grained reasoning over past interactions. Moreover, effective memory integration in embodied agents involves both recalling relevant historical information and executing actions based on that information, making it essential to study these aspects together rather than in isolation. In this work, we introduce a new benchmark for long-range embodied tasks in the Habitat simulator. This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness in an environment. The tasks can also be procedurally extended to longer and more challenging versions, enabling scalable evaluation of memory and reasoning.
We also present baselines that integrate state-of-the-art VLMs with low-level navigation policies, assessing their performance on these memory-intensive tasks and highlighting areas for improvement.
[112] Demystifying the Visual Quality Paradox in Multimodal Large Language Models
Shuo Xing,Lanqing Guo,Hongyuan Hua,Seoyoung Lee,Peiran Li,Yufei Wang,Zhangyang Wang,Zhengzhong Tu
Main category: cs.CV
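TL;DR: The study finds that the visual quality of input images affects multimodal large language models (MLLMs) in complex ways, and that lower-quality images can sometimes improve model performance. It proposes VQ-TTT, a method that adjusts input images to match the model's task-specific preferences, raising the average accuracy of MLLMs across datasets.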
Details
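Motivation: Little is known about how input image quality affects MLLM performance; in particular, it is unclear whether images of higher perceptual quality always lead to better model understanding. The relationship between visual quality and MLLM performance therefore needs systematic study, along with ways to adapt input images to model preferences. Method: Controlled degradations and stylistic shifts are applied across leading MLLMs and a range of vision-language benchmarks, revealing a visual-quality paradox. To address it, the authors propose Visual-Quality Test-Time Tuning (VQ-TTT), which (1) inserts a learnable low-rank kernel before the frozen vision encoder to modulate frequency content and (2) fine-tunes only the shallow vision-encoder layers via LoRA. The method dynamically adjusts each input image in a single forward pass, aligning it with task-specific model preferences. Result: VQ-TTT significantly improves average accuracy across all evaluated MLLMs and datasets, without external models, cached features, or extra training data. This suggests that adaptive, rather than universally "clean", imagery better serves AI as the main data customer. Conclusion: High-quality visual input does not necessarily mean better MLLM understanding; performance can improve when images deviate from human-perceived fidelity. Future work should focus on adaptive image processing that better meets the needs of AI systems. Abstract: Recent Multimodal Large Language Models (MLLMs) excel on benchmark vision-language tasks, yet little is known about how input visual quality shapes their responses. Does higher perceptual quality of images already translate to better MLLM understanding? We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks, applying controlled degradations and stylistic shifts to each image. Surprisingly, we uncover a visual-quality paradox: model, task, and even individual-instance performance can improve when images deviate from human-perceived fidelity. Off-the-shelf restoration pipelines fail to reconcile these idiosyncratic preferences. To close the gap, we introduce Visual-Quality Test-Time Tuning (VQ-TTT), a lightweight adaptation module that: (1) inserts a learnable, low-rank kernel before the frozen vision encoder to modulate frequency content; and (2) fine-tunes only shallow vision-encoder layers via LoRA. VQ-TTT dynamically adjusts each input image in a single forward pass, aligning it with task-specific model preferences. Across the evaluated MLLMs and all datasets, VQ-TTT significantly lifts average accuracy, with no external models, cached features, or extra training data.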
These findings redefine "better" visual inputs for MLLMs and highlight the need for adaptive, rather than universally "clean", imagery, in the new era of AI being the main data customer.
[113] Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning
Ankan Deria,Adinath Madhavrao Dukre,Feilong Tang,Sara Atito,Sudipta Roy,Muhammad Awais,Muhammad Haris Khan,Imran Razzak
Main category: cs.CV
TL;DR: The paper presents ViMaR, a two-stage inference framework that enhances efficiency and output fidelity in vision-language models by combining a temporal-difference value model with a margin-aware reward adjustment. It achieves significant improvements in caption quality and speedup compared to existing methods, demonstrates robust cross-model guidance, and shows potential for self-improving VLM pipelines.
Details
Motivation: Existing approaches for inference-time search in vision-language models are computationally expensive and prone to low-confidence generations leading to persistent hallucinations. Method: ViMaR uses a two-stage inference framework: (1) identifying the highest-value caption among diverse candidates in a single pass; (2) selectively refining overlooked or weakly grounded segments. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. Result: ViMaR generates captions significantly more reliable, factually accurate, detailed, and explanatory, achieving over 4x speedup compared to existing value-guided methods. It generalizes effectively to guide decoding in stronger unseen models and leads to consistent improvements in caption quality across different models. Conclusion: ViMaR is a scalable and transferable inference-time decoding strategy with flexibility and modularity. It has potential for self-improving vision-language model pipelines through self-training. Abstract: Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations which often lead to persistent hallucinations. We introduce Value-guided Inference with Margin-based Reward (ViMaR), a two-stage inference framework that improves both efficiency and output fidelity by combining a temporal-difference value model with a margin-aware reward adjustment. In the first stage, we perform a single pass to identify the highest-value caption among diverse candidates. In the second stage, we selectively refine only those segments that were overlooked or exhibit weak visual grounding, thereby eliminating frequently rewarded evaluations. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness.
Extensive experiments across multiple VLM architectures demonstrate that ViMaR generates captions that are significantly more reliable, factually accurate, detailed, and explanatory, while achieving over 4x speedup compared to existing value-guided methods. Specifically, we show that ViMaR trained solely on LLaVA Mistral-7B generalizes effectively to guide decoding in a stronger unseen model. To further validate this, we adapt ViMaR to steer generation in LLaVA-OneVision-Qwen2-7B, leading to consistent improvements in caption quality and demonstrating robust cross-model guidance. This cross-model generalization highlights ViMaR's flexibility and modularity, positioning it as a scalable and transferable inference-time decoding strategy. Furthermore, when ViMaR-generated captions are used for self-training, the underlying models achieve substantial gains across a broad suite of visual comprehension benchmarks, underscoring the potential of fast, accurate, and self-improving VLM pipelines.
[114] UniRelight: Learning Joint Decomposition and Synthesis for Video Relighting
Kai He,Ruofan Liang,Jacob Munkberg,Jon Hasselgren,Nandita Vijaykumar,Alexander Keller,Sanja Fidler,Igor Gilitschenski,Zan Gojcic,Zian Wang
Main category: cs.CV
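TL;DR: Proposes a single-pass method built on video diffusion models that jointly estimates albedo and synthesizes relit outputs, overcoming the data scarcity and error accumulation of existing methods and producing high-quality relighting.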
Details
Motivation: Existing end-to-end relighting models are limited by the scarcity of paired multi-illumination data, while two-stage pipelines reduce data requirements but are prone to error accumulation and unrealistic results under complex lighting or materials. Method: A general-purpose approach jointly estimates albedo and synthesizes relit outputs in a single pass, harnessing the generative capabilities of video diffusion models to enhance implicit scene understanding and produce realistic lighting effects and intricate material interactions. Result: The model generalizes strongly across diverse domains and surpasses previous methods in visual fidelity and temporal consistency. Conclusion: The proposed joint estimation and synthesis approach markedly improves the quality and applicability of relighting and offers a new direction for future research. Abstract: We address the challenge of relighting a single image or video, a task that demands precise scene intrinsic understanding and high-quality light transport synthesis. Existing end-to-end relighting models are often limited by the scarcity of paired multi-illumination data, restricting their ability to generalize across diverse scenes. Conversely, two-stage pipelines that combine inverse and forward rendering can mitigate data requirements but are susceptible to error accumulation and often fail to produce realistic outputs under complex lighting conditions or with sophisticated materials. In this work, we introduce a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in a single pass, harnessing the generative capabilities of video diffusion models. This joint formulation enhances implicit scene comprehension and facilitates the creation of realistic lighting effects and intricate material interactions, such as shadows, reflections, and transparency. Trained on synthetic multi-illumination data and extensive automatically labeled real-world videos, our model demonstrates strong generalization across diverse domains and surpasses previous methods in both visual fidelity and temporal consistency.
[115] Sekai: A Video Dataset towards World Exploration
Zhen Li,Chuanhao Li,Xiaofeng Mao,Shaoheng Lin,Ming Li,Shitian Zhao,Zhaopan Xu,Xinyue Li,Yukang Feng,Jianwen Sun,Zizhen Li,Fanrui Zhang,Jiaxin Ai,Zhixiang Wang,Yuwei Wu,Tong He,Jiangmiao Pang,Yu Qiao,Yunde Jia,Kaipeng Zhang
Main category: cs.CV
TL;DR: The paper presents Sekai, a large-scale first-person video dataset with rich annotations for world exploration, and YUME, a model trained on a subset of Sekai for interactive video world exploration.
Details
Motivation: Existing video generation datasets are not well-suited for world exploration training due to their limitations such as limited locations, short duration, static scenes, and lack of annotations about exploration and the world. Method: Introduced Sekai, a high-quality first-person view worldwide video dataset with rich annotations including location, scene, weather, crowd density, captions, and camera trajectories. Developed an efficient toolbox to collect, pre-process and annotate videos. Used a subset of Sekai to train YUME, an interactive video world exploration model. Result: Experiments demonstrate the quality of the Sekai dataset and the effectiveness of the YUME model for interactive video world exploration. Conclusion: Sekai will benefit the area of video generation and world exploration, and motivate valuable applications. Abstract: Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well-suited for world exploration training as they suffer from some limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset. We also use a subset to train an interactive video world exploration model, named YUME (meaning "dream" in Japanese).
We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications.
[116] Evolutionary Caching to Accelerate Your Off-the-Shelf Diffusion Model
Anirud Aggarwal,Abhinav Shrivastava,Matthew Gwilliam
Main category: cs.CV
TL;DR: ECAD uses a genetic algorithm to learn caching schedules for diffusion models, achieving significant inference speedups and quality-latency trade-offs without modifying network parameters. Evaluated on multiple benchmarks, it outperforms previous methods.