cs.CL

[1] Adverse Event Extraction from Discharge Summaries: A New Dataset, Annotation Scheme, and Initial Findings

Imane Guellil,Salomé Andres,Atul Anand,Bruce Guthrie,Huayu Zhang,Abul Hasan,Honghan Wu,Beatrice Alex

Main category: cs.CL

TL;DR: The paper presents a manually annotated corpus for Adverse Event (AE) extraction from discharge summaries of elderly patients, evaluates multiple models using FlairNLP across three annotation granularities, and concludes that significant challenges remain in detecting underrepresented AEs and capturing nuanced clinical language.

Details Motivation: There is a lack of clinical NLP resources for elderly patients, a population often underrepresented. Also, previous work rarely addresses the challenges of discontinuous and overlapping entities in AE extraction. Method: The authors developed a manually annotated corpus including 14 clinically significant AEs and contextual attributes. They evaluated multiple models using FlairNLP across three annotation granularities: fine-grained, coarse-grained, and coarse-grained with negation. Result: Transformer-based models achieved strong performance on document-level coarse-grained extraction (F1 = 0.943), but performance dropped notably for fine-grained entity-level tasks (e.g., F1 = 0.675), especially for rare events and complex attributes. Conclusion: Despite high-level scores, significant challenges remain in detecting underrepresented AEs and capturing nuanced clinical language. The dataset serves as a robust benchmark for evaluating AE extraction methods. Abstract: In this work, we present a manually annotated corpus for Adverse Event (AE) extraction from discharge summaries of elderly patients, a population often underrepresented in clinical NLP resources. The dataset includes 14 clinically significant AEs, such as falls, delirium, and intracranial haemorrhage, along with contextual attributes like negation, diagnosis type, and in-hospital occurrence. Uniquely, the annotation schema supports both discontinuous and overlapping entities, addressing challenges rarely tackled in prior work. We evaluate multiple models using FlairNLP across three annotation granularities: fine-grained, coarse-grained, and coarse-grained with negation. While transformer-based models (e.g., BERT-cased) achieve strong performance on document-level coarse-grained extraction (F1 = 0.943), performance drops notably for fine-grained entity-level tasks (e.g., F1 = 0.675), particularly for rare events and complex attributes.
These results demonstrate that despite high-level scores, significant challenges remain in detecting underrepresented AEs and capturing nuanced clinical language. Developed within a Trusted Research Environment (TRE), the dataset is available upon request via DataLoch and serves as a robust benchmark for evaluating AE extraction methods and supporting future cross-dataset generalisation.

[2] Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction

Marija Šakota,Robert West

Main category: cs.CL

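TL;DR: BoostCD combines constrained and unconstrained decoding to improve output quality in structured NLP tasks, and it performs strongly on closed information extraction.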

Details Motivation: Many recent approaches do not make the model explicitly aware of the constraints during training, which can lead to low-quality output under constrained decoding at test time. Method: BoostCD works in two phases: phase 1 decodes from the base model $M$ in both constrained and unconstrained mode to obtain two preliminary predictions; phase 2 uses a learned autoregressive boosted model to combine these two predictions into a final prediction. Result: Applied to closed information extraction, the resulting model, BoostIE, outperforms prior approaches both in and out of distribution and addresses several errors common in those approaches. Conclusion: BoostCD effectively improves the quality of constrained decoding, which matters for structured NLP tasks. Abstract: Many recent approaches to structured NLP tasks use an autoregressive language model $M$ to map unstructured input text $x$ to output text $y$ representing structured objects (such as tuples, lists, trees, code, etc.), where the desired output structure is enforced via constrained decoding. During training, these approaches do not require the model to be aware of the constraints, which are merely implicit in the training outputs $y$. This is advantageous as it allows for dynamic constraints without requiring retraining, but can lead to low-quality output during constrained decoding at test time. We overcome this problem with Boosted Constrained Decoding (BoostCD), which combines constrained and unconstrained decoding in two phases: Phase 1 decodes from the base model $M$ twice, in constrained and unconstrained mode, obtaining two weak predictions. In phase 2, a learned autoregressive boosted model combines the two weak predictions into one final prediction. The mistakes made by the base model with vs. without constraints tend to be complementary, which the boosted model learns to exploit for improved performance. We demonstrate the power of BoostCD by applying it to closed information extraction. Our model, BoostIE, outperforms prior approaches both in and out of distribution, addressing several common errors identified in those approaches.

[3] CrEst: Credibility Estimation for Contexts in LLMs via Weak Supervision

Dyah Adila,Shuai Zhang,Boran Han,Bonan Min,Yuyang Wang

Main category: cs.CL

TL;DR: The paper presents CrEst, a weakly supervised framework to assess context document credibility for LLMs without manual annotations. It offers two integration strategies and shows significant performance improvements in experiments.

Details Motivation: Existing methods integrating contextual information into LLMs often overlook the challenge of varying credibility of context documents, which can lead to unreliable information propagation. Method: CrEst is a weakly supervised framework that estimates document credibility based on semantic coherence with other credible documents. It proposes two integration strategies: a black-box approach for models without internal access and a white-box method modifying attention mechanisms. Result: Experiments across three model architectures and five datasets show CrEst outperforms strong baselines with up to 26.86% improvement in accuracy and 3.49% increase in F1 score. It also maintains robust performance under high-noise conditions. Conclusion: CrEst effectively addresses the issue of document credibility in LLM inference, significantly enhancing performance and reliability. Abstract: The integration of contextual information has significantly enhanced the performance of large language models (LLMs) on knowledge-intensive tasks. However, existing methods often overlook a critical challenge: the credibility of context documents can vary widely, potentially leading to the propagation of unreliable information. In this paper, we introduce CrEst, a novel weakly supervised framework for assessing the credibility of context documents during LLM inference--without requiring manual annotations. Our approach is grounded in the insight that credible documents tend to exhibit higher semantic coherence with other credible documents, enabling automated credibility estimation through inter-document agreement. To incorporate credibility into LLM inference, we propose two integration strategies: a black-box approach for models without access to internal weights or activations, and a white-box method that directly modifies attention mechanisms. 
Extensive experiments across three model architectures and five datasets demonstrate that CrEst consistently outperforms strong baselines, achieving up to a 26.86% improvement in accuracy and a 3.49% increase in F1 score. Further analysis shows that CrEst maintains robust performance even under high-noise conditions.
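
The insight behind CrEst, that credible documents tend to agree with other credible documents, can be illustrated with a small sketch. This is not the authors' implementation: it assumes a bag-of-words cosine similarity as a stand-in for a semantic similarity measure, and the function names (`bow_vector`, `cosine`, `agreement_scores`) and the example documents are hypothetical.

```python
import math
from collections import Counter


def bow_vector(text):
    """Bag-of-words counts; an assumed stand-in for a semantic embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def agreement_scores(docs):
    """Score each document by its mean similarity to all other documents."""
    vecs = [bow_vector(d) for d in docs]
    scores = []
    for i, v in enumerate(vecs):
        others = [cosine(v, w) for j, w in enumerate(vecs) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores


# Hypothetical context documents: two agree, one contradicts them.
docs = [
    "the eiffel tower is in paris france",
    "paris france is home to the eiffel tower",
    "the eiffel tower is located in berlin germany",
]
scores = agreement_scores(docs)
least_credible = scores.index(min(scores))  # index of the outlier document
```

Under these assumptions the contradicting third document receives the lowest agreement score. Such scores could then, for example, be used to reorder or filter context in a black-box setting; how CrEst actually integrates them is described only at a high level in the abstract.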

[4] MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance

Joseph J. Peper,Wenzhao Qiu,Ali Payani,Lu Wang

Main category: cs.CL

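TL;DR: MDBench is a new dataset for evaluating large language models on multi-document reasoning. It is created through a novel synthetic generation process that can controllably and efficiently generate challenging document sets and corresponding question-answer examples. The study finds that MDBench poses significant challenges for all methods, and that the knowledge-guided generation technique supports targeted analysis and can adapt to new challenges and future model improvements.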

Details Motivation: Few benchmarks rigorously examine the behavior of large language models in multi-document reasoning, even though this setting is highly relevant given LLMs' ability to handle longer-context inputs. In addition, the multi-document setting has long been difficult for benchmark creation because annotating long inputs is expensive. Method: Starting from condensed structured seed knowledge, the approach modifies it through LLM-assisted edits to induce multi-document-specific reasoning challenges, then converts the structured knowledge into natural text, producing a document set and a corresponding question-answer example. Result: MDBench poses significant challenges for all methods, even with relatively short document sets. The knowledge-guided generation technique makes it easy to perform targeted analysis of multi-document-specific reasoning capabilities and can be adapted quickly to new challenges and future model improvements. Conclusion: MDBench fills a gap in evaluation benchmarks for multi-document reasoning; its synthetic generation process makes it possible to controllably and efficiently generate challenging document sets and question-answer examples. This helps clarify the limitations of current LLMs in multi-document reasoning and can guide future model improvements. Abstract: Natural language processing evaluation has made significant progress, largely driven by the proliferation of powerful large language models (LLMs). New evaluation benchmarks are of increasing priority as the reasoning capabilities of LLMs are expanding at a rapid pace. In particular, while multi-document (MD) reasoning is an area of extreme relevance given LLM capabilities in handling longer-context inputs, few benchmarks exist to rigorously examine model behavior in this setting. Moreover, the multi-document setting is historically challenging for benchmark creation due to the expensive cost of annotating long inputs. In this work, we introduce MDBench, a new dataset for evaluating LLMs on the task of multi-document reasoning. Notably, MDBench is created through a novel synthetic generation process, allowing us to controllably and efficiently generate challenging document sets and the corresponding question-answer (QA) examples. Our novel technique operates on condensed structured seed knowledge, modifying it through LLM-assisted edits to induce MD-specific reasoning challenges. We then convert this structured knowledge into a natural text surface form, generating a document set and corresponding QA example. We analyze the behavior of popular LLMs and prompting techniques, finding that MDBench poses significant challenges for all methods, even with relatively short document sets.
We also see our knowledge-guided generation technique (1) allows us to readily perform targeted analysis of MD-specific reasoning capabilities and (2) can be adapted quickly to account for new challenges and future modeling improvements.

[5] From Chat to Checkup: Can Large Language Models Assist in Diabetes Prediction?

Shadman Sakib,Oishy Fatema Akhand,Ajwad Abrar

Main category: cs.CL

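TL;DR: The study examines large language models (LLMs) for diabetes prediction, finding that proprietary LLMs outperform open-source ones and that LLMs surpass traditional machine learning models on some metrics.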

Details Motivation: Although machine learning and deep learning models are widely used for diabetes prediction, the use of large language models on structured numerical data remains underexplored. Method: An empirical analysis on the Pima Indian Diabetes Database (PIDD) tests six LLMs (four open-source, two proprietary) and three traditional machine learning models, using zero-shot, one-shot, and three-shot prompting, with accuracy, precision, recall, and F1-score as evaluation metrics. Result: Proprietary LLMs outperform open-source LLMs; GPT-4o and Gemma-2-27B achieve the highest accuracy in few-shot settings, and Gemma-2-27B also surpasses the traditional machine learning models in F1-score. Remaining issues include performance variation across prompting strategies and the need for domain-specific fine-tuning. Conclusion: LLMs show potential for medical prediction tasks; future research can focus on prompt engineering and hybrid approaches to improve healthcare predictions. Abstract: While Machine Learning (ML) and Deep Learning (DL) models have been widely used for diabetes prediction, the use of Large Language Models (LLMs) for structured numerical data is still not well explored. In this study, we test the effectiveness of LLMs in predicting diabetes using zero-shot, one-shot, and three-shot prompting methods. We conduct an empirical analysis using the Pima Indian Diabetes Database (PIDD). We evaluate six LLMs, including four open-source models: Gemma-2-27B, Mistral-7B, Llama-3.1-8B, and Llama-3.2-2B. We also test two proprietary models: GPT-4o and Gemini Flash 2.0. In addition, we compare their performance with three traditional machine learning models: Random Forest, Logistic Regression, and Support Vector Machine (SVM). We use accuracy, precision, recall, and F1-score as evaluation metrics. Our results show that proprietary LLMs perform better than open-source ones, with GPT-4o and Gemma-2-27B achieving the highest accuracy in few-shot settings. Notably, Gemma-2-27B also outperforms the traditional ML models in terms of F1-score. However, there are still issues such as performance variation across prompting strategies and the need for domain-specific fine-tuning. This study shows that LLMs can be useful for medical prediction tasks and encourages future work on prompt engineering and hybrid approaches to improve healthcare predictions.

[6] Memory Tokens: Large Language Models Can Generate Reversible Sentence Embeddings

Ignacio Sastre,Aiala Rosá

Main category: cs.CL

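TL;DR: The authors find that optimizing the embedding of a special memory token yields reversible sentence embeddings that let an LLM reconstruct the original text exactly, without modifying the model's weights. The method is evaluated on English and Spanish datasets, sequences of up to approximately 240 tokens, and model scales from 100M to 8B parameters. Llama 3.1 8B successfully reconstructs all tested sequences, revealing an interesting capability of LLMs and suggesting potential applications in memory-based retrieval, compression, and controlled text generation.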

Details Motivation: To explore whether LLMs can reconstruct the original text exactly from reversible sentence embeddings without changing their weights, thereby revealing a model capability and broadening its applications. Method: A special memory token is introduced, and its embedding is optimized by training on a fixed sequence. When prompted with this embedding, the model reconstructs the fixed sequence exactly. The method is evaluated across languages (English and Spanish), sequence lengths (up to approximately 240 tokens), and model scales (from 100M to 8B parameters). Result: The method achieves exact text reconstruction; in particular, Llama 3.1 8B successfully reconstructs all tested sequences. Conclusion: LLMs can reconstruct text exactly from reversible sentence embeddings, which opens possibilities for memory-based retrieval, compression, and controlled text generation. Abstract: In this work, we observe an interesting phenomenon: it is possible to generate reversible sentence embeddings that allow an LLM to reconstruct the original text exactly, without modifying the model's weights. This is achieved by introducing a special memory token, whose embedding is optimized through training on a fixed sequence. When prompted with this embedding, the model reconstructs the fixed sequence exactly. We evaluate this phenomenon across English and Spanish datasets, sequences of up to approximately 240 tokens, and model scales ranging from 100M to 8B parameters. Notably, Llama 3.1 8B successfully reconstructs all tested sequences. Our findings highlight an interesting capability of LLMs and suggest potential applications in memory-based retrieval, compression, and controlled text generation.

[7] Identifying social isolation themes in NVDRS text narratives using topic modeling and text-classification methods

Drew Walker,Swati Rajwal,Sudeshna Das,Snigdha Peddireddy,Abeed Sarker

Main category: cs.CL

TL;DR: The paper uses NLP techniques to identify social isolation and loneliness in the narratives of US violent death reports, developing classifiers and identifying significant predictors of chronic social isolation among suicide decedents, and suggests these methods can improve surveillance and prevention.

Details Motivation: Social isolation and loneliness have been on the rise and significantly contribute to suicide rates. However, these factors are not currently recorded within the NVDRS structured variables, indicating a need for new methods to extract this information. Method: The study employed topic modeling for lexicon development and supervised learning classifiers to identify constructs of social isolation and loneliness in law enforcement and coroner/medical examiner narratives from the NVDRS database. Result: High-quality classifiers were developed with an average F1 score of .86 and accuracy of .82. Among more than 300,000 suicides analyzed, 1,198 mentioned chronic social isolation. Significant predictors included being male, gay, or divorced. Conclusion: The methods presented can enhance the surveillance and prevention strategies for addressing social isolation and loneliness in the United States. Abstract: Social isolation and loneliness, which have been increasing in recent years, strongly contribute toward suicide rates. Although social isolation and loneliness are not currently recorded within the US National Violent Death Reporting System's (NVDRS) structured variables, natural language processing (NLP) techniques can be used to identify these constructs in law enforcement and coroner/medical examiner narratives. Using topic modeling for lexicon development together with supervised learning, we developed high-quality classifiers (average F1: .86, accuracy: .82). Evaluating over 300,000 suicides from 2002 to 2020, we identified 1,198 mentioning chronic social isolation. Decedents had higher odds of chronic social isolation classification if they were men (OR = 1.44; CI: 1.24, 1.69, p<.0001), gay (OR = 3.68; CI: 1.97, 6.33, p<.0001), or were divorced (OR = 3.34; CI: 2.68, 4.19, p<.0001). We found significant predictors for other social isolation topics of recent or impending divorce, child custody loss, eviction or recent move, and break-up.
Our methods can improve surveillance and prevention of social isolation and loneliness in the United States.

[8] Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation

Zongxia Li,Yapei Chang,Yuhang Zhou,Xiyang Wu,Zichao Liang,Yoo Yeon Sung,Jordan Lee Boyd-Graber

Main category: cs.CL

TL;DR: The paper introduces PrefBERT, a scoring model for evaluating open-ended long-form generation which provides better feedback than traditional metrics and aligns well with human preferences.

Details Motivation: Evaluating open-ended long-form generation is difficult due to unclear separation of good and bad outputs and limitations in existing methods missing key aspects or being biased by pretraining data. Method: PrefBERT is trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality. It offers semantic reward feedback for guiding the training of policy models in GRPO. Result: PrefBERT shows reliability across varied long passages and aligns well with the rewards needed for GRPO through comprehensive evaluations including LLM-as-a-judge, human ratings, and qualitative analysis. Conclusion: Using PrefBERT as the reward signal yields responses better aligned with human preferences compared to traditional metrics. Abstract: Evaluating open-ended long-form generation is challenging because it is hard to define what clearly separates good from bad outputs. Existing methods often miss key aspects like coherence, style, or relevance, or are biased by pretraining data, making open-ended long-form evaluation an underexplored problem. To address this gap, we propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO and guiding its training with distinct rewards for good and bad outputs. Trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality, PrefBERT effectively supports GRPO by offering better semantic reward feedback than traditional metrics ROUGE-L and BERTScore do. Through comprehensive evaluations, including LLM-as-a-judge, human ratings, and qualitative analysis, we show that PrefBERT, trained on multi-sentence and paragraph-length responses, remains reliable across varied long passages and aligns well with the verifiable rewards GRPO needs. Human evaluations confirm that using PrefBERT as the reward signal to train policy models yields responses better aligned with human preferences than those trained with traditional metrics. 
Our code is available at https://github.com/zli12321/long_form_rl.

[9] Learning-Time Encoding Shapes Unlearning in LLMs

Ruihan Wu,Konstantin Garov,Kamalika Chaudhuri

Main category: cs.CL

TL;DR: The paper explores how learning-time choices in knowledge encoding affect the unlearning of factual knowledge in large language models, revealing that paraphrased descriptions improve unlearning and removing individual pieces of knowledge is challenging.

Details Motivation: To understand how different knowledge encoding methods during training can impact the ability to unlearn specific pieces of knowledge in large language models. Method: Empirical investigation through experiments on unlearning factual knowledge using various learning-time knowledge encoding techniques. Result: Paraphrased descriptions improve unlearning performance. Unlearning an individual piece of knowledge from a chunk of text is difficult. Conclusion: Learning-time knowledge encoding may play a central role in enabling reliable post-hoc unlearning. Abstract: As large language models (LLMs) are increasingly deployed in the real world, the ability to "unlearn", or remove specific pieces of knowledge post hoc, has become essential for a variety of reasons ranging from privacy regulations to correcting outdated or harmful content. Prior work has proposed unlearning benchmarks and algorithms, and has typically assumed that the training process and the target model are fixed. In this work, we empirically investigate how learning-time choices in knowledge encoding impact the effectiveness of unlearning factual knowledge. Our experiments reveal two key findings: (1) learning with paraphrased descriptions improves unlearning performance and (2) unlearning an individual piece of knowledge from a chunk of text is challenging. Our results suggest that learning-time knowledge encoding may play a central role in enabling reliable post-hoc unlearning.

[10] Improving Dialogue Discourse Parsing through Discourse-aware Utterance Clarification

Yaxin Fan,Peifeng Li,Qiaoming Zhu

Main category: cs.CL

TL;DR: The paper proposes a Discourse-aware Clarification Module (DCM) and Contribution-aware Preference Optimization (CPO) to enhance dialogue discourse parsing, resolving ambiguities caused by linguistic features and reducing errors. Experiments show significant improvement over SOTA baselines.

Details Motivation: Linguistic features in dialogues like omission and idiom introduce ambiguities that obscure discourse relations, challenging parsers. Method: Proposes DCM with two reasoning processes (clarification type and discourse goal) and CPO to optimize clarifications and align with parser requirements. Result: Experiments on STAC and Molweni datasets show effective ambiguity resolution and significant outperformance of SOTA baselines. Conclusion: The approach using DCM and CPO successfully enhances dialogue discourse parsing performance. Abstract: Dialogue discourse parsing aims to identify and analyze discourse relations between the utterances within dialogues. However, linguistic features in dialogues, such as omission and idiom, frequently introduce ambiguities that obscure the intended discourse relations, posing significant challenges for parsers. To address this issue, we propose a Discourse-aware Clarification Module (DCM) to enhance the performance of the dialogue discourse parser. DCM employs two distinct reasoning processes: clarification type reasoning and discourse goal reasoning. The former analyzes linguistic features, while the latter distinguishes the intended relation from the ambiguous one. Furthermore, we introduce Contribution-aware Preference Optimization (CPO) to mitigate the risk of erroneous clarifications, thereby reducing cascading errors. CPO enables the parser to assess the contributions of the clarifications from DCM and provide feedback to optimize the DCM, enhancing its adaptability and alignment with the parser's requirements. Extensive experiments on the STAC and Molweni datasets demonstrate that our approach effectively resolves ambiguities and significantly outperforms the state-of-the-art (SOTA) baselines.

[11] CKD-EHR: Clinical Knowledge Distillation for Electronic Health Records

Junke Wang,Hongshun Ling,Li Zhang,Longqian Zhang,Fang Wang,Yuan Gao,Zhi Li

Main category: cs.CL

TL;DR: This paper proposes CKD-EHR framework which uses knowledge distillation techniques to improve disease risk prediction from EHR data, significantly outperforming baseline models in accuracy and efficiency.

Details Motivation: Existing large language models used for disease prediction from EHR data have insufficient medical knowledge representation and low efficiency in clinical deployment. Method: The CKD-EHR framework fine-tunes the Qwen2.5-7B model on medical knowledge-enhanced data as a teacher model, generates interpretable soft labels via multi-granularity attention distillation mechanism, and transfers this knowledge to a lightweight BERT student model. Result: On the MIMIC-III dataset, CKD-EHR increases diagnostic accuracy by 9%, improves F1-score by 27%, and achieves a 22.2 times inference speedup compared to the baseline model. Conclusion: CKD-EHR not only improves resource utilization efficiency but also enhances diagnostic accuracy and timeliness, providing a practical technical solution for optimizing resources in clinical settings. Abstract: Electronic Health Records (EHR)-based disease prediction models have demonstrated significant clinical value in promoting precision medicine and enabling early intervention. However, existing large language models face two major challenges: insufficient representation of medical knowledge and low efficiency in clinical deployment. To address these challenges, this study proposes the CKD-EHR (Clinical Knowledge Distillation for EHR) framework, which achieves efficient and accurate disease risk prediction through knowledge distillation techniques. Specifically, the large language model Qwen2.5-7B is first fine-tuned on medical knowledge-enhanced data to serve as the teacher model. It then generates interpretable soft labels through a multi-granularity attention distillation mechanism. Finally, the distilled knowledge is transferred to a lightweight BERT student model. Experimental results show that on the MIMIC-III dataset, CKD-EHR significantly outperforms the baseline model: diagnostic accuracy is increased by 9%, F1-score is improved by 27%, and a 22.2 times inference speedup is achieved.
This innovative solution not only greatly improves resource utilization efficiency but also significantly enhances the accuracy and timeliness of diagnosis, providing a practical technical approach for resource optimization in clinical settings. The code and data for this research are available at https://github.com/209506702/CKD_EHR.
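
To make the idea of transferring knowledge through soft labels concrete, here is a minimal sketch of a standard temperature-scaled distillation loss. It is a generic textbook formulation, not the multi-granularity attention distillation mechanism the paper describes; the function names, the temperature value, and the example logits are all assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives softer labels."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL between softened teacher and student outputs, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher soft labels
    q = softmax(student_logits, temperature)  # student predictions
    return (temperature ** 2) * kl_divergence(p, q)


# Hypothetical logits for a two-class risk prediction.
loss_match = distillation_loss([2.0, 0.5], [2.0, 0.5])
loss_mismatch = distillation_loss([2.0, 0.5], [0.5, 2.0])
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In practice such a term is usually combined with an ordinary supervised loss on the ground-truth labels.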

[12] Modeling the One-to-Many Property in Open-Domain Dialogue with LLMs

Jing Yang Lee,Kong-Aik Lee,Woon-Seng Gan

Main category: cs.CL

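TL;DR: By decomposing open-domain dialogue generation into two tasks, multi-response generation and preference-based selection, and using the purpose-built dataset o2mDial, the proposed method improves the response diversity and quality of smaller LLMs in open-domain dialogue.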

Details Motivation: Although prior research shows that modeling the one-to-many property of open-domain dialogue improves response diversity, most modern LLM-based dialogue agents do not explicitly do so. Method: Open-domain dialogue generation is split into two tasks, Multi-Response Generation (MRG) and Preference-based Selection (PS). The authors create o2mDial, a dialogue corpus that captures the one-to-many property, and propose new in-context learning and instruction-tuning strategies, new evaluation metrics for MRG, and a model-based approach for PS. Result: Empirical results show that applying the proposed two-stage framework to smaller LLMs for open-domain dialogue generation enhances response diversity while maintaining contextual coherence, improving response quality by up to 90% and bringing their performance closer to that of larger models. Conclusion: The proposed method effectively improves the response diversity and quality of smaller LLMs in open-domain dialogue, narrowing the gap with larger models. Abstract: Open-domain Dialogue (OD) exhibits a one-to-many (o2m) property, whereby multiple appropriate responses exist for a single dialogue context. Despite prior research showing that modeling this property boosts response diversity, most modern LLM-based dialogue agents do not explicitly do so. In this work, we model the o2m property of OD in LLMs by decomposing OD generation into two key tasks: Multi-Response Generation (MRG) and Preference-based Selection (PS), which entail generating a set of n semantically and lexically diverse high-quality responses for a given dialogue context, followed by selecting a single response based on human preference, respectively. To facilitate MRG and PS, we introduce o2mDial, a dialogue corpus explicitly designed to capture the o2m property by featuring multiple plausible responses for each context. Leveraging o2mDial, we propose new in-context learning and instruction-tuning strategies, as well as novel evaluation metrics for MRG, alongside a model-based approach for PS. Empirical results demonstrate that applying the proposed two-stage framework to smaller LLMs for OD generation enhances overall response diversity while maintaining contextual coherence, improving response quality by up to 90%, bringing them closer to the performance of larger models.

[13] Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models

Gyeongje Cho,Yeonkyoun So,Chanwoo Park,Sangmin Lee,Sungmok Jung,Jaejin Lee

Main category: cs.CL

TL;DR: This paper introduces Thunder-Tok, a Korean tokenizer that reduces token fertility by 10% without compromising model performance across various downstream tasks.

Details Motivation: To design an efficient tokenizer for the Korean language that can reduce token fertility without affecting model performance. Method: Using a rule-based pre-tokenization method aligned with Korean linguistic structure and creating a seed vocabulary containing tokens resembling linguistic units, combined with a branching entropy-based selection algorithm to increase average token length. Result: Thunder-Tok reduces token fertility by approximately 10%, which translates to a 10% reduction in the number of tokens and a 10% improvement in inference speed compared to BPE, all without compromising performance on various downstream tasks. Conclusion: The linguistically informed approach used in designing Thunder-Tok is effective and practical for creating efficient tokenizers for language models. Abstract: This paper introduces Thunder-Tok, a new Korean tokenizer designed to reduce token fertility without compromising model performance. Our approach uses a rule-based pre-tokenization method that aligns with the linguistic structure of the Korean language. We also create a seed vocabulary containing tokens that resemble linguistic units and employ a branching entropy-based selection algorithm. These techniques increase the average token length, thus lowering fertility while preserving linguistic information. Experimental results indicate that Thunder-Tok reduces fertility by approximately 10% (i.e., reduces the number of tokens by 10%, improving the inference speed by 10%) compared to BPE without compromising performance across various downstream tasks. These findings demonstrate that our linguistically informed approach is effective and practical for designing efficient tokenizers for language models.
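
Token fertility here means the average number of tokens per word, so for a fixed text a 10% drop in fertility is the same as a 10% drop in token count. The sketch below works through that arithmetic. The segmentations are hypothetical placeholders rather than real tokenizer output, and the function name `fertility` is an assumption.

```python
def fertility(tokenized_words):
    """Average number of tokens per word; each item is one word's token list."""
    if not tokenized_words:
        raise ValueError("need at least one word")
    total_tokens = sum(len(tokens) for tokens in tokenized_words)
    return total_tokens / len(tokenized_words)


# Hypothetical segmentations of the same five words (placeholder tokens).
baseline = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"], ["d1", "d2"], ["e1", "e2"]]
improved = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"], ["d1", "d2"], ["e"]]

base_f = fertility(baseline)  # 10 tokens over 5 words
new_f = fertility(improved)   # 9 tokens over 5 words
relative_reduction = (base_f - new_f) / base_f  # fraction of tokens saved
```

In this toy case one word out of five is kept whole instead of being split in two, which lowers fertility from 2.0 to 1.8, a 10% relative reduction.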

[14] Emergence of Primacy and Recency Effect in Mamba: A Mechanistic Point of View

Muhammad Cendekia Airlangga,Hilal AlQuabeh,Munachiso S Nwadike,Kentaro Inui

Main category: cs.CL

TL;DR: This paper explores memory in state-space language models like Mamba, using primacy and recency effects. It identifies mechanisms of long-term and short-term memory, as well as dynamic memory allocation influenced by semantic regularity.

Details Motivation: To understand how information is retained and forgotten over time in state-space language models. Method: Applying structured recall tasks to the Mamba architecture and analyzing the U-shaped accuracy profile through three identified mechanisms: long-term memory via sparse channels, short-term memory governed by delta-modulated recurrence, and memory allocation modulated by semantic regularity. Result: Observed consistent U-shaped accuracy profile in Mamba models with 1.4B and 7B parameters, validated through targeted ablations and input perturbations. Conclusion: State-space language models exhibit distinct memory mechanisms that can be analyzed through primacy and recency effects. Abstract: We study memory in state-space language models using primacy and recency effects as behavioral tools to uncover how information is retained and forgotten over time. Applying structured recall tasks to the Mamba architecture, we observe a consistent U-shaped accuracy profile, indicating strong performance at the beginning and end of input sequences. We identify three mechanisms that give rise to this pattern. First, long-term memory is supported by a sparse subset of channels within the model's selective state space block, which persistently encode early input tokens and are causally linked to primacy effects. Second, short-term memory is governed by delta-modulated recurrence: recent inputs receive more weight due to exponential decay, but this recency advantage collapses when distractor items are introduced, revealing a clear limit to memory depth. Third, we find that memory allocation is dynamically modulated by semantic regularity: repeated relations in the input sequence shift the delta gating behavior, increasing the tendency to forget intermediate items. We validate these findings via targeted ablations and input perturbations on two large-scale Mamba-based language models: one with 1.4B and another with 7B parameters.

[15] A Comparative Study of Task Adaptation Techniques of Large Language Models for Identifying Sustainable Development Goals

Andrea Cadeddu,Alessandro Chessa,Vincenzo De Leo,Gianni Fenu,Enrico Motta,Francesco Osborne,Diego Reforgiato Recupero,Angelo Salatino,Luca Secchi

Main category: cs.CL

TL;DR: This paper explores the application of different LLMs in a single-label, multi-class text classification task related to SDGs and evaluates various learning techniques.

Details Motivation: To address the challenge of tracking progress towards the UN's 17 Sustainable Development Goals due to the extensive scale and complexity of the data involved. Method: Analyze various proprietary and open-source LLMs for text classification on SDGs and evaluate the effectiveness of Zero-Shot, Few-Shot Learning, and Fine-Tuning techniques. Result: Smaller models optimized through prompt engineering can perform as well as larger models like OpenAI's GPT. Conclusion: LLMs and appropriate adaptation techniques can significantly contribute to automating and improving the analysis of large amounts of text related to SDGs. Abstract: In 2012, the United Nations introduced 17 Sustainable Development Goals (SDGs) aimed at creating a more sustainable and improved future by 2030. However, tracking progress toward these goals is difficult because of the extensive scale and complexity of the data involved. Text classification models have become vital tools in this area, automating the analysis of vast amounts of text from a variety of sources. Additionally, large language models (LLMs) have recently proven indispensable for many natural language processing tasks, including text classification, thanks to their ability to recognize complex linguistic patterns and semantics. This study analyzes various proprietary and open-source LLMs for a single-label, multi-class text classification task focused on the SDGs. Then, it also evaluates the effectiveness of task adaptation techniques (i.e., in-context learning approaches), namely Zero-Shot and Few-Shot Learning, as well as Fine-Tuning within this domain. The results reveal that smaller models, when optimized through prompt engineering, can perform on par with larger models like OpenAI's GPT (Generative Pre-trained Transformer).

[16] ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs

Feng He,Zijun Chen,Xinnian Liang,Tingting Ma,Yunqi Qiu,Shuangzhi Wu,Junchi Yan

Main category: cs.CL

TL;DR: ProtoReasoning is a framework that improves reasoning ability of LLMs using prototypical representations, resulting in significant performance improvements across various tasks.

Details Motivation: The motivation is to understand and enhance the cross-domain generalization capabilities of Large Reasoning Models by leveraging shared abstract reasoning prototypes. Method: ProtoReasoning features an automated prototype construction pipeline, a comprehensive verification system, and scalability within prototype space. It uses Prolog for logical reasoning and PDDL for planning. Result: ProtoReasoning achieves 4.7% improvement on logical reasoning, 6.3% on planning tasks, 4.0% on general reasoning, and 1.0% on mathematics. Ablation studies confirm enhanced generalization. Conclusion: Reasoning prototypes serve as the foundation for generalizable reasoning in large language models, validated through improved performance and generalization. Abstract: Recent advances in Large Reasoning Models (LRMs) trained with Long Chain-of-Thought (Long CoT) reasoning have demonstrated remarkable cross-domain generalization capabilities. However, the underlying mechanisms supporting such transfer remain poorly understood. We hypothesize that cross-domain generalization arises from shared abstract reasoning prototypes -- fundamental reasoning patterns that capture the essence of problems across domains. These prototypes minimize the nuances of the representation, revealing that seemingly diverse tasks are grounded in shared reasoning structures. Based on this hypothesis, we propose ProtoReasoning, a framework that enhances the reasoning ability of LLMs by leveraging scalable and verifiable prototypical representations (Prolog for logical reasoning, PDDL for planning). ProtoReasoning features: (1) an automated prototype construction pipeline that transforms problems into corresponding prototype representations; (2) a comprehensive verification system providing reliable feedback through Prolog/PDDL interpreters; (3) the scalability to synthesize problems arbitrarily within prototype space while ensuring correctness. 
Extensive experiments show that ProtoReasoning achieves 4.7% improvement over baseline models on logical reasoning (Enigmata-Eval), 6.3% improvement on planning tasks, 4.0% improvement on general reasoning (MMLU) and 1.0% on mathematics (AIME24). Significantly, our ablation studies confirm that learning in prototype space also demonstrates enhanced generalization to structurally similar problems compared to training solely on natural language representations, validating our hypothesis that reasoning prototypes serve as the foundation for generalizable reasoning in large language models.

[17] MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs

Yongqi Fan,Yating Wang,Guandong Wang,Jie Zhai,Jingping Liu,Qi Ye,Tong Ruan

Main category: cs.CL

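TL;DR: MinosEval is a new evaluation method for open-ended question answering that aligns better with human annotations and provides more interpretable results.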

Details Motivation: In open-ended QA, traditional metrics struggle to capture semantic similarity, and existing LLM-based evaluation methods lack intuitive interpretability and fail to distinguish factoid from non-factoid questions. Method: Proposes MinosEval, which first distinguishes the type of open-ended question and then ranks candidate answers with a type-specific strategy: adaptive key-point scoring for factoid questions and instance-aware listwise ranking for non-factoid questions. Result: Experiments on multiple open-ended QA datasets, including self-built ones, show that MinosEval aligns better with human annotations and yields more interpretable results. Conclusion: MinosEval provides a finer-grained and more interpretable automatic evaluation method for open-ended QA. Abstract: Open-ended question answering (QA) is a key task for evaluating the capabilities of large language models (LLMs). Compared to closed-ended QA, it demands longer answer statements, more nuanced reasoning processes, and diverse expressions, making refined and interpretable automatic evaluation both crucial and challenging. Traditional metrics like ROUGE and BERTScore struggle to capture semantic similarities due to different patterns between model responses and reference answers. Current LLM-based evaluation approaches, such as pairwise or listwise comparisons of candidate answers, lack intuitive interpretability. While pointwise scoring of each response provides some descriptions, it fails to adapt across different question contents. Most notably, existing methods overlook the distinction between factoid and non-factoid questions. To address these challenges, we propose MinosEval, a novel evaluation method that first distinguishes open-ended questions and then ranks candidate answers using different evaluation strategies. For factoid questions, it applies an adaptive key-point scoring strategy, while for non-factoid questions, it uses an instance-aware listwise ranking strategy. Experiments on multiple open-ended QA datasets, including self-built ones with more candidate responses to complement community resources, show that MinosEval better aligns with human annotations and offers more interpretable results.

[18] Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants

Jaione Bengoetxea,Itziar Gonzalez-Dios,Rodrigo Agerri

Main category: cs.CL

TL;DR: The paper evaluates current language technologies' capacity to understand Basque and Spanish language varieties using NLI as a pivot task, introducing a novel parallel dataset. Empirical analysis shows performance drops when handling linguistic variation, especially in Basque.

Details Motivation: To assess the ability of current language technologies in understanding Basque and Spanish language varieties. Method: Using Natural Language Inference (NLI) as a pivot task, creating a novel parallel dataset in Basque and Spanish with their variants, and conducting empirical analysis of crosslingual and in-context learning experiments with encoder-only and decoder-based Large Language Models. Result: Performance drops were observed when handling linguistic variation, particularly in Basque. Encoder-only models struggle more with Western Basque. Conclusion: Linguistic variation poses challenges for language models, with peripheral dialects being more difficult to handle. Abstract: In this paper, we evaluate the capacity of current language technologies to understand Basque and Spanish language varieties. We use Natural Language Inference (NLI) as a pivot task and introduce a novel, manually-curated parallel dataset in Basque and Spanish, along with their respective variants. Our empirical analysis of crosslingual and in-context learning experiments using encoder-only and decoder-based Large Language Models (LLMs) shows a performance drop when handling linguistic variation, especially in Basque. Error analysis suggests that this decline is not due to lexical overlap, but rather to the linguistic variation itself. Further ablation experiments indicate that encoder-only models particularly struggle with Western Basque, which aligns with linguistic theory that identifies peripheral dialects (e.g., Western) as more distant from the standard. All data and code are publicly available.

[19] Research on Graph-Retrieval Augmented Generation Based on Historical Text Knowledge Graphs

Yang Fan,Zhang Qi,Xing Wenqian,Liu Chang,Liu Liu

Main category: cs.CL

TL;DR: This paper proposes the Graph RAG framework for historical text analysis, which reduces manual labor and enhances historical knowledge extraction. Experiments show that Xunzi-Qwen1.5-14B performs optimally in relation extraction, while DeepSeek with GraphRAG improves F1 scores and interpretability.

Details Motivation: To address domain knowledge gaps in large language models for historical text analysis and provide a low-resource solution for classical text knowledge extraction. Method: The Graph RAG framework combines chain-of-thought prompting, self-instruction generation, and process supervision to create a character relationship dataset with minimal manual annotation. It also introduces a collaborative mechanism between knowledge graphs and retrieval-augmented generation. Result: Xunzi-Qwen1.5-14B achieves the best performance in relation extraction with an F1 score of 0.68. DeepSeek integrated with GraphRAG improves F1 scores by 11% on the C-CLUE dataset, surpassing Xunzi-Qwen1.5-14B's F1 value and improving interpretability. Conclusion: The Graph RAG framework offers a promising low-resource solution for classical text knowledge extraction, advancing historical knowledge services and humanities research. Abstract: This article addresses domain knowledge gaps in general large language models for historical text analysis in the context of computational humanities and AIGC technology. We propose the Graph RAG framework, combining chain-of-thought prompting, self-instruction generation, and process supervision to create a character relationship dataset for The First Four Histories with minimal manual annotation. This dataset supports automated historical knowledge extraction, reducing labor costs. In the graph-augmented generation phase, we introduce a collaborative mechanism between knowledge graphs and retrieval-augmented generation, improving the alignment of general models with historical knowledge. Experiments show that the domain-specific model Xunzi-Qwen1.5-14B, with Simplified Chinese input and chain-of-thought prompting, achieves the best performance in relation extraction (F1 = 0.68). 
The DeepSeek model integrated with GraphRAG improves F1 by 11% (0.08-0.19) on the open-domain C-CLUE relation extraction dataset, surpassing the F1 value of Xunzi-Qwen1.5-14B (0.12), effectively alleviating hallucinations, and improving interpretability. This framework offers a low-resource solution for classical text knowledge extraction, advancing historical knowledge services and humanities research.

[20] TopClustRAG at SIGIR 2025 LiveRAG Challenge

Juli Bakagianni,John Pavlopoulos,Aristidis Likas

Main category: cs.CL

TL;DR: The paper introduces TopClustRAG, a RAG system using hybrid retrieval and clustering for question answering. It ranked 2nd in faithfulness and 7th in correctness on the leaderboard, showing the effectiveness of clustering-based methods.

Details Motivation: To improve answer diversity, relevance, and faithfulness in large-scale question answering systems over web corpora by employing advanced context filtering and prompt aggregation techniques. Method: TopClustRAG uses a hybrid retrieval strategy combining sparse and dense indices, followed by K-Means clustering to group semantically similar passages. Representative passages are used to create cluster-specific prompts for an LLM, generating intermediate answers that are filtered, reranked, and synthesized into a final response. Result: TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the FineWeb Sample-10BT dataset leaderboard, demonstrating the effectiveness of clustering-based context filtering and prompt aggregation. Conclusion: Clustering-based context filtering and prompt aggregation enhance the performance of large-scale RAG systems. Abstract: We present TopClustRAG, a retrieval-augmented generation (RAG) system developed for the LiveRAG Challenge, which evaluates end-to-end question answering over large-scale web corpora. Our system employs a hybrid retrieval strategy combining sparse and dense indices, followed by K-Means clustering to group semantically similar passages. Representative passages from each cluster are used to construct cluster-specific prompts for a large language model (LLM), generating intermediate answers that are filtered, reranked, and finally synthesized into a single, comprehensive response. This multi-stage pipeline enhances answer diversity, relevance, and faithfulness to retrieved evidence. Evaluated on the FineWeb Sample-10BT dataset, TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the official leaderboard, demonstrating the effectiveness of clustering-based context filtering and prompt aggregation in large-scale RAG systems.
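
The clustering stage described above (group retrieved passages with K-Means, then take a representative passage per cluster to build cluster-specific prompts) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the 2-D vectors stand in for real passage embeddings, the deterministic "first k vectors" initialization and the "closest to centroid" representative rule are choices made here for reproducibility, and the names `kmeans` and `representatives` are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def nearest(v, centroids):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda c: dist(v, centroids[c]))

def kmeans(vectors, k, iters=20):
    # Assumption: deterministic init from the first k vectors (not k-means++).
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        assign = [nearest(v, centroids) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = mean(members)
    return [nearest(v, centroids) for v in vectors], centroids

def representatives(passages, vectors, k):
    """Pick, for each cluster, the passage whose vector is closest to the centroid."""
    assign, centroids = kmeans(vectors, k)
    reps = []
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        if idx:
            reps.append(passages[min(idx, key=lambda i: dist(vectors[i], centroids[c]))])
    return reps

# Toy data: two topical groups of retrieved passages with made-up embeddings.
passages = ["solar A", "wind A", "solar B", "solar C", "wind B", "wind C"]
vectors = [[0.0, 0.0], [5.0, 5.0], [0.2, 0.0], [0.1, 0.1], [5.2, 5.0], [5.1, 5.1]]
reps = representatives(passages, vectors, k=2)
print(reps)
```

Each representative would then seed one cluster-specific prompt; the filtering, reranking, and synthesis steps from the abstract are outside this sketch.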

[21] Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments

Sungen Hahm,Heejin Kim,Gyuseong Lee,Hyunji Park,Jaejin Lee

Main category: cs.CL

TL;DR: To address the challenges of de-identifying court judgments in South Korea, this paper proposes Thunder-DeID, a framework that includes constructing a Korean legal dataset, categorizing PII, and developing a DNN-based de-identification pipeline.

Details Motivation: The current de-identification process for court judgments in South Korea is insufficient for handling large volumes while meeting strict legal requirements. Legal definitions of personal identifiers are unclear, making it hard to implement technical solutions. Method: Propose Thunder-DeID framework which involves: constructing and releasing a Korean legal dataset with annotated judgments, introducing systematic categorization of PII, and developing an end-to-end DNN-based de-identification pipeline. Result: Experimental results show that the proposed model achieves state-of-the-art performance in de-identifying court judgments. Conclusion: Thunder-DeID provides an effective solution for de-identifying court judgments in South Korea, balancing open access to justice with personal data protection. Abstract: To ensure a balance between open access to justice and personal data protection, the South Korean judiciary mandates the de-identification of court judgments before they can be publicly disclosed. However, the current de-identification process is inadequate for handling court judgments at scale while adhering to strict legal requirements. Additionally, the legal definitions and categorizations of personal identifiers are vague and not well-suited for technical solutions. To tackle these challenges, we propose a de-identification framework called Thunder-DeID, which aligns with relevant laws and practices. Specifically, we (i) construct and release the first Korean legal dataset containing annotated judgments along with corresponding lists of entity mentions, (ii) introduce a systematic categorization of Personally Identifiable Information (PII), and (iii) develop an end-to-end deep neural network (DNN)-based de-identification pipeline. Our experimental results demonstrate that our model achieves state-of-the-art performance in the de-identification of court judgments.

[22] Cohort Discovery: A Survey on LLM-Assisted Clinical Trial Recruitment

Shrestha Ghosh,Moritz Schneider,Carina Reinicke,Carsten Eickhoff

Main category: cs.CL

TL;DR: This paper surveys the application of LLMs in trial-patient matching within clinical trial recruitment, analyzing existing benchmarks, approaches, and evaluation frameworks, while identifying challenges and future directions.

Details Motivation: To address the limitation of current methods in matching trials and patients in clinical trial recruitment by leveraging the knowledge aggregation and reasoning abilities of LLMs. Method: Surveying and critically examining existing benchmarks, approaches, and evaluation frameworks for LLM-based methods in clinical trial recruitment. Result: Identified challenges in adopting LLM technologies in clinical research and proposed exciting future directions. Conclusion: LLMs hold potential to build a more general solution for trial-patient matching but face challenges that need to be addressed. Abstract: Recent advances in LLMs have greatly improved general-domain NLP tasks. Yet, their adoption in critical domains, such as clinical trial recruitment, remains limited. As trials are designed in natural language and patient data is represented as both structured and unstructured text, the task of matching trials and patients benefits from knowledge aggregation and reasoning abilities of LLMs. Classical approaches are trial-specific and LLMs with their ability to consolidate distributed knowledge hold the potential to build a more general solution. Yet recent applications of LLM-assisted methods rely on proprietary models and weak evaluation benchmarks. In this survey, we are the first to analyze the task of trial-patient matching and contextualize emerging LLM-based approaches in clinical trial recruitment. We critically examine existing benchmarks, approaches and evaluation frameworks, the challenges to adopting LLM technologies in clinical research and exciting future directions.

[23] ConLID: Supervised Contrastive Learning for Low-Resource Language Identification

Negar Foroutan,Jakhongir Saydaliev,Ye Eun Kim,Antoine Bosselut

Main category: cs.CL

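TL;DR: Proposes a new supervised contrastive learning approach that improves language identification for low-resource languages on out-of-domain data.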

Details Motivation: Although many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages, often limited to single-domain data such as the Bible, still perform poorly and suffer from class imbalance and bias. Method: Proposes a new supervised contrastive learning (SCL) approach for learning domain-invariant representations of low-resource languages. Result: Experiments show that the approach improves LID performance for low-resource languages on out-of-domain data by 3.2%. Conclusion: The approach effectively enhances the performance of LID models. Abstract: Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages -- often limited to single-domain data, such as the Bible -- continue to perform poorly. To resolve these class imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. Through an extensive analysis, we show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2%, demonstrating its effectiveness in enhancing LID models.
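
For readers unfamiliar with supervised contrastive learning, the sketch below shows the widely used supervised contrastive (SupCon) objective in plain Python: each anchor is pulled toward other samples with the same label and pushed away from the rest. Whether ConLID uses exactly this form is an assumption; the abstract only states that a supervised contrastive approach is used. The labels `lang_a` and `lang_b`, the toy 2-D embeddings, and the temperature of 0.1 are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def supcon_loss(embeddings, labels, temperature=0.1):
    """Mean supervised contrastive loss over anchors that have at least one positive."""
    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without same-label samples contributes nothing
        # Temperature-scaled similarity of the anchor to every other sample.
        sims = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                for a in range(n) if a != i}
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Toy embeddings: samples 0-1 point one way, samples 2-3 another.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supcon_loss(embeddings, ["lang_a", "lang_a", "lang_b", "lang_b"])
mixed = supcon_loss(embeddings, ["lang_a", "lang_b", "lang_a", "lang_b"])
print(aligned, mixed)
```

The loss is lower when same-language samples already sit close together (`aligned`) than when the labels cut across the geometric clusters (`mixed`), which is the pressure that would encourage domain-invariant, language-specific representations.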

[24] DeVisE: Behavioral Testing of Medical Large Language Models

Camila Zurdo Tagliabue,Heloisa Oss Boll,Aykut Erdem,Erkut Erdem,Iacer Calixto

Main category: cs.CL

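TL;DR: The researchers developed DeVisE, a framework for evaluating the reasoning of large language models in clinical settings. By constructing real and synthetic datasets of ICU discharge notes and testing models with counterfactual variables, they find that zero-shot models show more coherent counterfactual reasoning patterns, whereas fine-tuned models are more stable but less responsive to clinically meaningful changes. Demographic factors also have a subtle but persistent influence on model outputs. The work underscores the value of behavioral testing for exposing the reasoning strategies of clinical LLMs.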

Details Motivation: Although large language models are increasingly used for clinical decision support, existing evaluation methods often cannot distinguish genuine medical reasoning from superficial patterns. An evaluation framework that probes clinical understanding in depth is therefore needed. Method: Introduces the DeVisE framework, builds a dataset of ICU discharge notes in real and synthetic versions, and tests demographic and vital sign attributes with controlled single-variable counterfactuals. Five LLMs are evaluated under zero-shot and fine-tuned settings, and model behavior is analyzed through input-level sensitivity and downstream reasoning. Result: Zero-shot models show more coherent counterfactual reasoning patterns, whereas fine-tuned models are more stable but less responsive to clinically meaningful changes. Demographic factors have a subtle but persistent influence on model outputs. Conclusion: Behavioral testing helps expose the reasoning strategies of clinical LLMs and informs the design of safer, more transparent medical AI systems. Abstract: Large language models (LLMs) are increasingly used in clinical decision support, yet current evaluation methods often fail to distinguish genuine medical reasoning from superficial patterns. We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework for probing fine-grained clinical understanding. We construct a dataset of ICU discharge notes from MIMIC-IV, generating both raw (real-world) and template-based (synthetic) versions with controlled single-variable counterfactuals targeting demographic (age, gender, ethnicity) and vital sign attributes. We evaluate five LLMs spanning general-purpose and medically fine-tuned variants, under both zero-shot and fine-tuned settings. We assess model behavior via (1) input-level sensitivity - how counterfactuals alter the likelihood of a note; and (2) downstream reasoning - how they affect predicted hospital length-of-stay. Our results show that zero-shot models exhibit more coherent counterfactual reasoning patterns, while fine-tuned models tend to be more stable yet less responsive to clinically meaningful changes. Notably, demographic factors subtly but consistently influence outputs, emphasizing the importance of fairness-aware evaluation. This work highlights the utility of behavioral testing in exposing the reasoning strategies of clinical LLMs and informing the design of safer, more transparent medical AI systems.

[25] SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models' Knowledge of Indian Culture

Arijit Maji,Raghvendra Kumar,Akash Ghosh,Anushka,Sriparna Saha

Main category: cs.CL

TL;DR: SANSKRITI is a new benchmark with 21,853 question-answer pairs evaluating LMs' understanding of Indian cultural diversity across sixteen attributes. It highlights disparities in models' cultural comprehension and sets a standard for improvement.

Details Motivation: To evaluate language models' comprehension of India's rich cultural diversity, which is crucial for their global effectiveness. Method: Introduced SANSKRITI, a benchmark with 21,853 curated question-answer pairs covering 28 states, 8 union territories, and sixteen key attributes of Indian culture. Evaluated it on leading LLMs, ILMs, and SLMs. Result: Significant disparities found in models' ability to handle culturally nuanced queries, especially in region-specific contexts. Conclusion: SANSKRITI sets a new standard for assessing and enhancing the cultural understanding of language models. Abstract: Language Models (LMs) are indispensable tools shaping modern workflows, but their global effectiveness depends on understanding local socio-cultural contexts. To address this, we introduce SANSKRITI, a benchmark designed to evaluate language models' comprehension of India's rich cultural diversity. Comprising 21,853 meticulously curated question-answer pairs spanning 28 states and 8 union territories, SANSKRITI is the largest dataset for testing Indian cultural knowledge. It covers sixteen key attributes of Indian culture: rituals and ceremonies, history, tourism, cuisine, dance and music, costume, language, art, festivals, religion, medicine, transport, sports, nightlife, and personalities, providing a comprehensive representation of India's cultural tapestry. We evaluate SANSKRITI on leading Large Language Models (LLMs), Indic Language Models (ILMs), and Small Language Models (SLMs), revealing significant disparities in their ability to handle culturally nuanced queries, with many models struggling in region-specific contexts. By offering an extensive, culturally rich, and diverse dataset, SANSKRITI sets a new standard for assessing and improving the cultural understanding of LMs.

[26] COSMMIC: Comment-Sensitive Multimodal Multilingual Indian Corpus for Summarization and Headline Generation

Raghvendra Kumar,S. A. Mohammed Salman,Aryan Sahu,Tridib Nandi,Pragathi Y. P.,Sriparna Saha,Jose G. Moreno

Main category: cs.CL

TL;DR: The paper presents COSMMIC, a new dataset for comment-aware multimodal and multilingual summarization in nine Indian languages, which includes article-image pairs, reader comments, and ground-truth summaries. It explores various summarization configurations and evaluates the effectiveness using advanced models like LLama3 and GPT-4.

Details Motivation: To address the lack of research and resources in comment-aware multimodal and multilingual summarization for Indian languages. Method: Developed the COSMMIC dataset with 4,959 article-image pairs and 24,484 reader comments in nine Indian languages. Explored four summarization configurations incorporating text, comments, and images. Used state-of-the-art models (LLama3, GPT-4) and tools (IndicBERT, CLIP-based classifier) for evaluation. Result: COSMMIC uniquely integrates text, images, and user feedback, offering effective configurations for NLG tasks. The study identifies supportive comments, filters noise, and extracts insights from images. Conclusion: COSMMIC bridges gaps in Indian language resources, advances NLP research, and promotes inclusivity by integrating text, images, and user feedback for summarization. Abstract: Despite progress in comment-aware multimodal and multilingual summarization for English and Chinese, research in Indian languages remains limited. This study addresses this gap by introducing COSMMIC, a pioneering comment-sensitive multimodal, multilingual dataset featuring nine major Indian languages. COSMMIC comprises 4,959 article-image pairs and 24,484 reader comments, with ground-truth summaries available in all included languages. Our approach enhances summaries by integrating reader insights and feedback. We explore summarization and headline generation across four configurations: (1) using article text alone, (2) incorporating user comments, (3) utilizing images, and (4) combining text, comments, and images. To assess the dataset's effectiveness, we employ state-of-the-art language models such as LLama3 and GPT-4. We conduct a comprehensive study to evaluate different component combinations, including identifying supportive comments, filtering out noise using a dedicated comment classifier using IndicBERT, and extracting valuable insights from images with a multilingual CLIP-based classifier. 
This helps determine the most effective configurations for natural language generation (NLG) tasks. Unlike many existing datasets that are either text-only or lack user comments in multimodal settings, COSMMIC uniquely integrates text, images, and user feedback. This holistic approach bridges gaps in Indian language resources, advancing NLP research and fostering inclusivity.

[27] Targeted Lexical Injection: Unlocking Latent Cross-Lingual Alignment in Lugha-Llama via Early-Layer LoRA Fine-Tuning

Stanley Ngugi

Main category: cs.CL

TL;DR: This paper introduces Targeted Lexical Injection (TLI), a novel fine-tuning approach for improving lexical alignment in Swahili-English word pairs using a Swahili-centric LLM. TLI significantly enhances output-level lexical alignment and generalizes well to unseen word pairs.

Details Motivation: Large Language Models often underperform in low-resource languages like Swahili due to data scarcity and underrepresentation in pre-training. The challenge lies in achieving robust cross-lingual lexical alignment, which is crucial for tasks such as translation and cross-lingual information retrieval. Method: The method involves leveraging the strong lexical alignment found in early internal layers of Lugha-Llama-8B-wura, a Swahili-centric LLM. Using Low-Rank Adaptation (LoRA) and a contrastive learning objective, the model is fine-tuned specifically targeting embeddings from this optimal early layer. Result: Experiments show that TLI significantly improves the output-level lexical alignment for both trained and unseen Swahili-English word pairs, increasing average cosine similarity by approximately 28% in both cases. Conclusion: TLI offers a parameter-efficient and effective strategy for enhancing lexical alignment in LRL-focused LLMs by preserving and propagating inherent early-layer cross-lingual knowledge. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their performance in low-resource languages (LRLs), such as Swahili, often lags due to data scarcity and underrepresentation in pre-training. A key challenge is achieving robust cross-lingual lexical alignment, crucial for tasks like translation and cross-lingual information retrieval. This paper introduces Targeted Lexical Injection (TLI), a novel and efficient fine-tuning approach. We first demonstrate that Lugha-Llama-8B-wura, a Swahili-centric LLM, exhibits strong, near-perfect lexical alignment for Swahili-English word pairs in its early internal layers (specifically Layer 2, with ~0.99998 average cosine similarity based on a pilot study), a capability not fully reflected in its final output representations (baseline ~0.32 similarity on our evaluation set). 
TLI leverages this insight by using Low-Rank Adaptation (LoRA) and a contrastive learning objective to fine-tune the model, specifically targeting embeddings from this empirically identified optimal early layer. Our experiments show that TLI significantly improves the output-level lexical alignment for 623 trained Swahili-English word pairs, increasing average cosine similarity from 0.3211 to 0.4113 (+28.08%, p < 1.33 x 10^-240). More importantly, these improvements generalize remarkably well to 63 unseen control word pairs, with similarity increasing from 0.3143 to 0.4033 (+28.32%, p < 7.17 x 10^-27). These findings suggest TLI enhances the model's ability to preserve and propagate its inherent early-layer cross-lingual knowledge, offering a parameter-efficient and effective strategy for improving lexical alignment in LRL-focused LLMs.
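
The reported gains are relative changes in average cosine similarity, which can be checked from the rounded figures in the abstract. The snippet below is a small arithmetic sketch; the helper names `cosine_similarity` and `relative_gain_pct` are introduced here for illustration and are not from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def relative_gain_pct(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0

# Average similarities as reported in the abstract (rounded to four decimals).
trained_gain = relative_gain_pct(0.3211, 0.4113)  # 623 trained word pairs
unseen_gain = relative_gain_pct(0.3143, 0.4033)   # 63 unseen control pairs

print(f"trained: +{trained_gain:.2f}%  unseen: +{unseen_gain:.2f}%")
```

Recomputing from the rounded averages gives roughly 28.1% and 28.3%, in line with the reported +28.08% and +28.32%; the small difference on the trained set is presumably due to rounding of the published averages.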

[28] Understanding GUI Agent Localization Biases through Logit Sharpness

Xingjian Tao,Yiwei Wang,Yujun Cai,Zhicheng Yang,Jing Tang

Main category: cs.CL

TL;DR: Multimodal large language models (MLLMs) have enabled GUI agents to interact with operating systems, but they frequently exhibit hallucinations. The authors propose a fine-grained evaluation framework that categorizes model predictions into four distinct types and introduce the Peak Sharpness Score (PSS), as well as Context-Aware Cropping, a training-free technique.

Details Motivation: To address the issue of hallucinations in MLLMs which are systematic localization errors that compromise reliability. Method: Propose a fine-grained evaluation framework categorizing predictions into four types, introduce Peak Sharpness Score (PSS) for quantifying model uncertainty, and Context-Aware Cropping for improving model performance without additional training. Result: Extensive experiments show that the proposed framework and methods provide actionable insights and enhance interpretability and robustness of GUI agent behavior. Conclusion: The evaluation framework and techniques introduced offer ways to better understand and improve the reliability of MLLMs in GUI interactions. Abstract: Multimodal large language models (MLLMs) have enabled GUI agents to interact with operating systems by grounding language into spatial actions. Despite their promising performance, these models frequently exhibit hallucinations-systematic localization errors that compromise reliability. We propose a fine-grained evaluation framework that categorizes model predictions into four distinct types, revealing nuanced failure modes beyond traditional accuracy metrics. To better quantify model uncertainty, we introduce the Peak Sharpness Score (PSS), a metric that evaluates the alignment between semantic continuity and logits distribution in coordinate prediction. Building on this insight, we further propose Context-Aware Cropping, a training-free technique that improves model performance by adaptively refining input context. Extensive experiments demonstrate that our framework and methods provide actionable insights and enhance the interpretability and robustness of GUI agent behavior.

[29] AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent System Need

Zhouhong Gu,Xiaoxuan Zhu,Yin Cai,Hao Shen,Xingzhou Chen,Qingyi Wang,Jialin Li,Xiaoran Shi,Haoran Guo,Wenxuan Huang,Hongwei Feng,Yanghua Xiao,Zheyu Ye,Yao Hu,Shaosheng Cao

Main category: cs.CL

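TL;DR: AgentGroupChat-V2 is a framework that tackles challenges in system architecture design, cross-domain generalizability, and performance guarantees through a divide-and-conquer fully parallel architecture, an adaptive collaboration engine, and agent organization optimization strategies. It shows superior performance across multiple experiments.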

Details Motivation: Existing LLM-based multi-agent systems show great potential in social simulation and complex task resolution, but face major challenges in system architecture design, cross-domain generalizability, and performance guarantees. Method: Proposes AgentGroupChat-V2, a new framework with three core innovations: (1) a divide-and-conquer fully parallel architecture that decomposes user queries into hierarchical task forest structures; (2) an adaptive collaboration engine that dynamically selects heterogeneous LLM combinations and interaction modes based on task characteristics; (3) agent organization optimization strategies that combine divide-and-conquer approaches for efficient problem decomposition. Result: Extensive experiments demonstrate AgentGroupChat-V2's superior performance across multiple domains, including notable gains on GSM8K, AIME, and HumanEval, with especially strong results on high-difficulty tasks. Conclusion: AgentGroupChat-V2 provides a comprehensive solution for building efficient, general-purpose LLM multi-agent systems, with significant advantages in complex reasoning scenarios. Abstract: Large language model based multi-agent systems have demonstrated significant potential in social simulation and complex task resolution domains. However, current frameworks face critical challenges in system architecture design, cross-domain generalizability, and performance guarantees, particularly as task complexity and the number of agents increase. We introduce AgentGroupChat-V2, a novel framework addressing these challenges through three core innovations: (1) a divide-and-conquer fully parallel architecture that decomposes user queries into hierarchical task forest structures enabling dependency management and distributed concurrent processing; (2) an adaptive collaboration engine that dynamically selects heterogeneous LLM combinations and interaction modes based on task characteristics; and (3) agent organization optimization strategies combining divide-and-conquer approaches for efficient problem decomposition. Extensive experiments demonstrate AgentGroupChat-V2's superior performance across diverse domains, achieving 91.50% accuracy on GSM8K (exceeding the best baseline by 5.6 percentage points), 30.4% accuracy on competition-level AIME (nearly doubling other methods), and 79.20% pass@1 on HumanEval. Performance advantages become increasingly pronounced with higher task difficulty, particularly on Level 5 MATH problems where improvements exceed 11 percentage points compared to state-of-the-art baselines. 
These results confirm that AgentGroupChat-V2 provides a comprehensive solution for building efficient, general-purpose LLM multi-agent systems with significant advantages in complex reasoning scenarios. Code is available at https://github.com/MikeGu721/AgentGroupChat-V2.

[30] RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation

Xinnuo Xu,Rachel Lawrence,Kshitij Dubey,Atharva Pandey,Risa Ueno,Fabian Falck,Aditya V. Nori,Rahul Sharma,Amit Sharma,Javier Gonzalez

Main category: cs.CL

TL;DR: Recent Large Language Models (LLMs) have high accuracy on reasoning benchmarks, but it is unclear whether this is due to true reasoning or statistical recall. This paper introduces RE-IMAGINE, a framework that characterizes the hierarchy of reasoning ability in LLMs and generates problem variations at different levels of this hierarchy. The assessments indicate reliance on statistical recall.

Details Motivation: To determine whether the high accuracy of LLMs on reasoning benchmarks is due to true reasoning or just statistical recall of the training set. Method: Introduction of RE-IMAGINE, a framework based on the ladder of causation that characterizes a hierarchy of reasoning ability in LLMs and generates problem variations at different levels of this hierarchy through an intermediate symbolic representation. Result: Performance reductions were observed when models were queried with problem variations, indicating reliance on statistical recall. Conclusion: The results open the door to further research targeting skills across the reasoning hierarchy. Abstract: Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true reasoning or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.

[31] Context-Informed Grounding Supervision

Hyunji Lee,Seunghyun Yoon,Yunjae Won,Hanseok Oh,Geewook Kim,Trung Bui,Franck Dernoncourt,Elias Stengel-Eskin,Mohit Bansal,Minjoon Seo

Main category: cs.CL

TL;DR: CINGS is a post-training supervision method that improves model grounding in both textual and visual domains without degrading general performance.

Details Motivation: Large language models often require external knowledge to provide accurate information and reduce hallucination, but simply appending context at inference time does not ensure grounded generation. Method: Context-INformed Grounding Supervision (CINGS) involves training the model with relevant context prepended to the response while computing the loss only over the response tokens and masking out the context. Result: Models trained with CINGS exhibit stronger grounding in both textual and visual domains compared to standard instruction-tuned models. In the text domain, it outperforms other methods across 11 datasets and complements inference-time grounding techniques. In the vision-language domain, it reduces hallucinations across four benchmarks and maintains factual consistency. Conclusion: CINGS improves model grounding without degrading general downstream performance by inducing a shift in the model's prior knowledge and behavior, encouraging greater reliance on external context. Abstract: Large language models (LLMs) are often supplemented with external knowledge to provide information not encoded in their parameters or to reduce hallucination. In such cases, we expect the model to generate responses by grounding its response in the provided external context. However, prior work has shown that simply appending context at inference time does not ensure grounded generation. To address this, we propose Context-INformed Grounding Supervision (CINGS), a post-training supervision in which the model is trained with relevant context prepended to the response, while computing the loss only over the response tokens and masking out the context. Our experiments demonstrate that models trained with CINGS exhibit stronger grounding in both textual and visual domains compared to standard instruction-tuned models. 
In the text domain, CINGS outperforms other training methods across 11 information-seeking datasets and is complementary to inference-time grounding techniques. In the vision-language domain, replacing a vision-language model's LLM backbone with a CINGS-trained model reduces hallucinations across four benchmarks and maintains factual consistency throughout the generated response. This improved grounding comes without degradation in general downstream performance. Finally, we analyze the mechanism underlying the enhanced grounding in CINGS and find that it induces a shift in the model's prior knowledge and behavior, implicitly encouraging greater reliance on the external context.
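
The loss masking described in the abstract (context prepended, loss computed only over response tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `IGNORE` sentinel value, and the simplification that each position already carries the log-probability of its target token (no next-token shift) are all assumptions made for illustration.

```python
import math

# Sentinel marking positions excluded from the loss. The value -100 follows a
# common convention in training code; it is an assumption here, not from the paper.
IGNORE = -100


def build_labels(context_ids, response_ids):
    """Prepend the context to the response and mask the context out of the loss."""
    input_ids = list(context_ids) + list(response_ids)
    labels = [IGNORE] * len(context_ids) + list(response_ids)
    return input_ids, labels


def masked_nll(token_log_probs, labels):
    """Mean negative log-likelihood over unmasked (response) positions only.

    token_log_probs[i] is the log-probability the model assigned to the target
    token at position i (a simplification that skips the usual one-token shift).
    """
    kept = [-lp for lp, y in zip(token_log_probs, labels) if y != IGNORE]
    if not kept:
        raise ValueError("no response tokens to supervise")
    return sum(kept) / len(kept)


# Hypothetical token ids: three context tokens followed by two response tokens.
input_ids, labels = build_labels([5, 6, 7], [8, 9])
# The model is poor on the context (p = 0.1) but that does not affect the loss.
log_probs = [math.log(0.1)] * 3 + [math.log(0.5)] * 2
loss = masked_nll(log_probs, labels)
```

The model still attends to the context tokens as input; only their contribution to the training signal is removed, which matches the abstract's description of masking out the context while supervising the response.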

[32] SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling

Md Imbesat Hassan Rizvi,Xiaodan Zhu,Iryna Gurevych

Main category: cs.CL

TL;DR: SPARE is a new framework for single-pass, step-wise annotation that enhances process supervision in LLMs and improves reasoning performance.

Details Motivation: Process or step-wise supervision significantly advances the multi-step reasoning capabilities of LLMs. However, there's a need for more efficient, high-quality automated process annotation methods. Method: SPARE aligns each solution step to one or multiple steps in a reference solution with explicit reasoning for evaluation. It uses reference-guided step-level evaluation to facilitate process supervision. Result: SPARE improves reasoning performance when used for fine-tuning models in an offline RL setup and training reward models for ranking/aggregating LLM-generated outputs. It achieves competitive performance on mathematical datasets while being 2.6 times more efficient than tree search-based automatic annotation. Conclusion: SPARE provides an effective method for process supervision, improving reasoning performance and efficiency in LLMs. The codebase and a trained model are publicly released. Abstract: Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution, accompanied by explicit reasoning for evaluation. We show that reference-guided step-level evaluation effectively facilitates process supervision on four datasets spanning three domains: mathematical reasoning, multi-hop compositional question answering, and spatial reasoning. We demonstrate that SPARE, when compared to baselines, improves reasoning performance when used for: (1) fine-tuning models in an offline RL setup for inference-time greedy-decoding, and (2) training reward models for ranking/aggregating multiple LLM-generated outputs. 
Additionally, SPARE achieves competitive performance on challenging mathematical datasets while offering 2.6 times greater efficiency, requiring only 38% of the runtime, compared to tree search-based automatic annotation. The codebase, along with a trained SPARE-PRM model, is publicly released to facilitate further research and reproducibility.

[33] Enhancing Hyperbole and Metaphor Detection with Their Bidirectional Dynamic Interaction and Emotion Knowledge

Li Zheng,Sihang Wang,Hao Fei,Zuquan Peng,Fei Li,Jianming Fu,Chong Teng,Donghong Ji

Main category: cs.CL

TL;DR: The paper proposes EmoBi, a framework for hyperbole and metaphor detection using emotion-guided bidirectional dynamic interaction. It surpasses baseline methods on multiple datasets.

Details Motivation: Hyperbole and metaphor are challenging to identify due to their semantic obscurity and expressive diversity. Current methods focus on superficial text features, ignoring associations between these rhetorical devices and the effect of implicit emotion. Method: The EmoBi framework includes an emotion analysis module, an emotion-based domain mapping module, and a bidirectional dynamic interaction module with a verification mechanism. Result: Experiments show that EmoBi outperforms all baseline methods on four datasets. Significant F1 score improvements are noted: 28.1% for hyperbole detection on TroFi and 23.1% for metaphor detection on HYPO-L. Conclusion: EmoBi demonstrates effectiveness and potential in advancing hyperbole and metaphor detection. Abstract: Text-based hyperbole and metaphor detection are of great significance for natural language processing (NLP) tasks. However, due to their semantic obscurity and expressive diversity, it is rather challenging to identify them. Existing methods mostly focus on superficial text features, ignoring the associations of hyperbole and metaphor as well as the effect of implicit emotion on perceiving these rhetorical devices. To implement these hypotheses, we propose an emotion-guided hyperbole and metaphor detection framework based on bidirectional dynamic interaction (EmoBi). Firstly, the emotion analysis module deeply mines the emotion connotations behind hyperbole and metaphor. Next, the emotion-based domain mapping module identifies the target and source domains to gain a deeper understanding of the implicit meanings of hyperbole and metaphor. Finally, the bidirectional dynamic interaction module enables the mutual promotion between hyperbole and metaphor. Meanwhile, a verification mechanism is designed to ensure detection accuracy and reliability. Experiments show that EmoBi outperforms all baseline methods on four datasets. 
Specifically, compared to the current SoTA, the F1 score increased by 28.1% for hyperbole detection on the TroFi dataset and 23.1% for metaphor detection on the HYPO-L dataset. These results, underpinned by in-depth analyses, underscore the effectiveness and potential of our approach for advancing hyperbole and metaphor detection.

[34] Lessons from Training Grounded LLMs with Verifiable Rewards

Shang Hong Sim,Tej Deep Pala,Vernon Toh,Hai Leong Chieu,Amir Zadeh,Chuan Li,Navonil Majumder,Soujanya Poria

Main category: cs.CL

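TL;DR: The paper strengthens grounded generation in large language models through reinforcement learning and internal reasoning, training models with GRPO and verifiable outcome-based rewards, without gold reasoning traces or expensive annotations. Experiments show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses. Stage-wise optimization and outcome-driven RL prove valuable for building more verifiable and reliable LLMs.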

Details Motivation: Large language models struggle to generate grounded and trustworthy responses and often fail even in simple scenarios, for example by missing explicitly stated answers, citing incorrectly, or refusing to answer when evidence is available. Method: The authors explore how reinforcement learning and internal reasoning enhance grounding in LLMs. They train with GRPO using verifiable outcome-based rewards that target answer correctness, citation sufficiency, and refusal quality, and validate the approach with comprehensive experiments. A two-stage training setup first optimizes answer and citation behavior and then refusal behavior, which stabilizes the learning signal. They also revisit instruction tuning via GPT-4 distillation and combine it with GRPO to improve performance on long-form generative QA tasks. Result: Reasoning-augmented models significantly outperform instruction-only variants on ASQA, QAMPARI, ELI5, and ExpertQA, especially in handling unanswerable queries and generating well-cited responses. Stage-wise training further improves grounding, and combining instruction tuning with GRPO improves long-form generative QA. Conclusion: The findings highlight the value of reasoning, stage-wise optimization, and outcome-driven reinforcement learning for building more verifiable and reliable LLMs. Abstract: Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs). While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available. In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs. We use the GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations. Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA we show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses. A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal. Additionally, we revisit instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance on long-form, generative QA tasks. Overall, our findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs.
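
As a rough illustration of how outcome-based rewards feed GRPO, the sketch below scores several sampled responses to one prompt and normalizes each reward against its group. The three reward signals come from the abstract; the function names, the equal weights, the use of the population standard deviation, and the `eps` term are assumptions, and the policy-gradient update itself is omitted.

```python
from statistics import mean, pstdev


def outcome_reward(answer_correct, citations_sufficient, refusal_appropriate,
                   weights=(1.0, 1.0, 1.0)):
    """Combine three verifiable checks into one scalar reward.

    The signals mirror the abstract (answer correctness, citation sufficiency,
    refusal quality); the equal weights are hypothetical.
    """
    signals = (answer_correct, citations_sufficient, refusal_appropriate)
    return sum(w * float(s) for w, s in zip(weights, signals))


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the group sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the choice of estimator is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical sampled responses to the same question.
group = [
    outcome_reward(True, True, True),
    outcome_reward(True, False, True),
    outcome_reward(False, False, True),
    outcome_reward(True, True, True),
]
advantages = group_relative_advantages(group)
```

Responses that beat the group average receive positive advantages and the rest negative ones, so no learned value model or gold reasoning trace is needed, only checks that can be verified on the final output.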

[35] RATTENTION: Towards the Minimal Sliding Window Size in Local-Global Attention Models

Bailin Wang,Chang Lan,Chong Wang,Ruoming Pang

Main category: cs.CL

TL;DR: Recent local-global attention models like Gemma2 and Mistral face a Pareto tradeoff with window size impacting both performance and efficiency. This paper introduces RATTENTION, integrating local attention with a specialized linear attention mechanism to address the limitation of ignoring out-of-window tokens. Experiments show RATTENTION achieves superior performance-efficiency tradeoff, even with smaller window sizes (e.g., 512), matching full-attention models across settings. It also enhances long-context performance without compromising training efficiency.

Details Motivation: The motivation is to overcome the inherent limitation of local attention mechanisms which completely disregard tokens outside the defined window, aiming to improve the efficiency of local-global attention models in short-context scenarios while maintaining performance. Method: RATTENTION is a variant of local attention combined with a specialized linear attention mechanism designed to capture information from tokens outside the defined window. This approach allows for efficient processing with smaller window sizes without sacrificing performance. Result: RATTENTION demonstrates superior performance-efficiency tradeoff in pretraining experiments at the 3B and 12B scales. With a window size of just 512, it consistently matches the performance of full-attention models across diverse settings and enhances long-context performance. Training efficiency is maintained due to a specialized kernel implementation. Conclusion: RATTENTION successfully shifts the Pareto frontier for local-global attention models, enabling significant efficiency gains in short-context regimes without compromising performance or training speed. Abstract: Local-global attention models have recently emerged as compelling alternatives to standard Transformers, promising improvements in both training and inference efficiency. However, the crucial choice of window size presents a Pareto tradeoff: larger windows maintain performance akin to full attention but offer minimal efficiency gains in short-context scenarios, while smaller windows can lead to performance degradation. Current models, such as Gemma2 and Mistral, adopt conservative window sizes (e.g., 4096 out of an 8192 pretraining length) to preserve performance. This work investigates strategies to shift this Pareto frontier, enabling local-global models to achieve efficiency gains even in short-context regimes. 
Our core motivation is to address the intrinsic limitation of local attention -- its complete disregard for tokens outside the defined window. We explore RATTENTION, a variant of local attention integrated with a specialized linear attention mechanism designed to capture information from these out-of-window tokens. Pretraining experiments at the 3B and 12B scales demonstrate that RATTENTION achieves a superior Pareto tradeoff between performance and efficiency. As a sweet spot, RATTENTION with a window size of just 512 consistently matches the performance of full-attention models across diverse settings. Furthermore, the recurrent nature inherent in the linear attention component of RATTENTION contributes to enhanced long-context performance, as validated on the RULER benchmark. Crucially, these improvements do not compromise training efficiency; thanks to a specialized kernel implementation and the reduced window size, RATTENTION maintains training speeds comparable to existing state-of-the-art approaches.
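
To make the in-window versus out-of-window distinction concrete, the sketch below splits the causal history of a query position into the tokens a sliding window can see and the tokens it would otherwise ignore, then summarizes the ignored tokens with a generic linear-attention state (a sum of key-value outer products). This is a toy illustration under stated assumptions, not the RATTENTION formulation: the window convention (the window includes the current token), the identity feature map, and all function names are hypothetical, and how the paper combines the two components is not specified in the abstract.

```python
def window_partition(t, window):
    """Split positions 0..t into (in-window, out-of-window) for query position t.

    Assumes the window counts the current token, so it covers t-window+1 .. t.
    """
    start = max(0, t - window + 1)
    return list(range(start, t + 1)), list(range(0, start))


def linear_state(keys, values, positions):
    """Accumulate S = sum_i k_i v_i^T over the given positions (identity feature map)."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    for i in positions:
        for a in range(d_k):
            for b in range(d_v):
                state[a][b] += keys[i][a] * values[i][b]
    return state


def read_state(query, state):
    """Read from the recurrent state: q^T S, a vector of size d_v."""
    d_v = len(state[0])
    return [sum(query[a] * state[a][b] for a in range(len(query))) for b in range(d_v)]


# Toy sequence of 6 tokens with 2-dimensional keys and values (hypothetical numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 3.0], [4.0, 5.0], [1.0, 1.0], [0.0, 0.0], [7.0, 7.0], [9.0, 9.0]]

in_window, out_of_window = window_partition(t=5, window=3)
state = linear_state(keys, values, out_of_window)
summary = read_state([1.0, 1.0], state)
```

Because the state has a fixed size regardless of how many tokens have left the window, the cost of consulting it does not grow with sequence length, which is the usual reason a linear-attention component is paired with a small local window.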

[36] Approximating Language Model Training Data from Weights

John X. Morris,Junjie Oscar Yin,Woojeong Kim,Vitaly Shmatikov,Alexander M. Rush

Main category: cs.CL

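TL;DR: Modern language models often have open weights but closed training data. This paper formalizes the problem of approximating data from model weights and proposes several baselines and metrics. A gradient-based method selects the best-matching data from a large public text corpus and is shown to recover useful data given only the weights of the original and finetuned models. Even when none of the true training data is known, the method can locate a small subset of public web documents that can be used to train a model close to the original model's performance.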

Details Motivation: Modern language models often have open weights, but their training data is usually closed. To address this, the authors seek a way to approximately recover a model's training data starting from its weights. Method: A gradient-based approach selects, from a large public text corpus, the data that best matches the model weights, using only the weights of the original and finetuned models. Several baselines and evaluation metrics are also proposed to measure the quality of the data approximation. Result: On the AG News classification task, performance is 65% with randomly selected data and rises to 80% with the proposed method, approaching the expert benchmark of 88%. For a model trained with SFT on MSMARCO web documents, the method reduces perplexity from 3.3 to 2.3, close to the expert LLAMA model's perplexity of 2.0. Conclusion: Using only model weights, the proposed method can select a suitable subset of a public dataset and train a new model that approaches the original model's performance, offering a feasible answer to the problem of closed training data. Abstract: Modern language models often have open weights but closed training data. We formalize the problem of data approximation from model weights and propose several baselines and metrics. We develop a gradient-based approach that selects the highest-matching data from a large public text corpus and show its effectiveness at recovering useful data given only weights of the original and finetuned models. Even when none of the true training data is known, our method is able to locate a small subset of public Web documents that can be used to train a model to close to the original model performance, given models trained for both classification and supervised finetuning. On the AG News classification task, our method improves performance from 65% (using randomly selected data) to 80%, approaching the expert benchmark of 88%. When applied to a model trained with SFT on MSMARCO web documents, our method reduces perplexity from 3.3 to 2.3, compared to an expert LLAMA model's perplexity of 2.0.

[37] PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction

Shufan Li,Aditya Grover

Main category: cs.CL

TL;DR: Large Language Models (LLMs) cause latency in voice chat applications due to the time taken to generate the first sentence for Text-to-Speech (TTS) systems. To solve this, Predictive Generation (PredGen) is proposed, which generates candidate responses while the user is still speaking, reducing latency by about 2 times with minimal extra computation cost.

Details Motivation: To reduce the latency between the end of user input and the start of audio output in real-time voice chat applications that use LLMs in combination with TTS systems, especially when deployed on consumer-grade hardware with limited computing capacity. Method: Propose Predictive Generation (PredGen), a framework that performs speculative decoding at input time, generating candidate responses while the user is still speaking, thus allowing the system to begin TTS processing with minimal delay. Result: Simulated experiments on the Lmsys and MT-Bench datasets demonstrate that PredGen can effectively reduce latency by around 2 times across various use cases, with only minimal additional computation cost. Conclusion: Predictive Generation (PredGen) successfully mitigates or eliminates the delay caused by LLMs in generating the first sentence, enhancing the user experience in real-time voice chat applications. Abstract: Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses. However, their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences. This latency is particularly evident when LLMs are deployed as single-user voice assistants on consumer-grade hardware with limited computing capacity. We discovered that this latency is primarily dominated by the time it takes for the LLMs to generate the first sentence, which is required as input by the TTS systems that synthesize audio responses on a sentence-by-sentence basis. To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates, or even eliminates, this delay through speculative decoding at input time. PredGen generates candidate responses while the user is still speaking, enabling the system to begin TTS processing with minimal delay. 
Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method can effectively reduce the latency by around 2x across a wide range of use cases, while incurring only minimal additional computation cost at input time, computation that would otherwise go unused.
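
The idea of speculating at input time can be sketched as follows: draft a response from the partial transcript, then, once the full input is known, keep the longest prefix of the draft that greedy decoding on the full input would also have produced. This is a generic greedy speculative-verification sketch, not PredGen's actual procedure; `toy_greedy_next`, the canned replies, and the sequential verification loop (a real system would typically verify all draft tokens in one parallel forward pass) are assumptions.

```python
def draft(greedy_next, prompt, max_tokens):
    """Generate a candidate response from a (possibly partial) prompt."""
    out = []
    while len(out) < max_tokens:
        tok = greedy_next(prompt, out)
        if tok == "<eos>":
            break
        out.append(tok)
    return out


def accept_prefix(candidate, greedy_next, final_prompt):
    """Keep the longest prefix of the candidate that matches greedy decoding
    conditioned on the final prompt; decoding resumes after that prefix."""
    accepted = []
    for tok in candidate:
        if greedy_next(final_prompt, accepted) != tok:
            break
        accepted.append(tok)
    return accepted


def toy_greedy_next(prompt, prefix):
    """Hypothetical stand-in for an LLM: picks a canned reply based on the prompt."""
    if "weather" in prompt:
        reply = ["Sure", ",", "it", "is", "sunny", "."]
    else:
        reply = ["Sorry", ",", "say", "again", "?"]
    return reply[len(prefix)] if len(prefix) < len(reply) else "<eos>"


final_prompt = "what is the weather today"

# Draft made late in the utterance: the speculation turns out to be fully reusable.
late_candidate = draft(toy_greedy_next, "what is the weather", max_tokens=8)
late_accepted = accept_prefix(late_candidate, toy_greedy_next, final_prompt)

# Draft made too early: the first token already disagrees, so nothing is reused.
early_candidate = draft(toy_greedy_next, "what is the", max_tokens=8)
early_accepted = accept_prefix(early_candidate, toy_greedy_next, final_prompt)
```

When the accepted prefix already contains a full first sentence, TTS can start as soon as the user stops speaking. A rejected draft costs only computation spent while the system would otherwise have been idle.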

[38] Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models

Zhengyang Shan,Emily Ruth Diana,Jiawei Zhou

Main category: cs.CL

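TL;DR: The authors propose a new metric, the Gender Inclusivity Fairness Index (GIFI), to comprehensively evaluate how large language models handle binary and non-binary genders. Extensive evaluation of 22 prominent open-source and proprietary LLMs reveals significant variation in gender inclusivity, underscoring the importance of improving LLM inclusivity and providing a key benchmark for future research on gender fairness in generative models.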

Details Motivation: Existing research on gender fairness in large language models focuses mainly on binary gender distinctions and overlooks non-binary genders. A new metric that quantifies diverse gender inclusivity is needed to evaluate models more comprehensively. Method: Introduces the Gender Inclusivity Fairness Index (GIFI), a new metric covering evaluations at multiple levels, from simple probing with gender pronouns to tests of model generation and cognitive behaviors, revealing biases associated with different gender identifiers. Result: Extensive evaluation of 22 large language models of varying sizes and capabilities reveals significant variation in their gender inclusivity. Conclusion: The study highlights the importance of improving the gender inclusivity of large language models and provides a key benchmark to drive future progress in gender fairness for generative models. Abstract: We present a comprehensive evaluation of gender fairness in large language models (LLMs), focusing on their ability to handle both binary and non-binary genders. While previous studies primarily focus on binary gender distinctions, we introduce the Gender Inclusivity Fairness Index (GIFI), a novel and comprehensive metric that quantifies the diverse gender inclusivity of LLMs. GIFI consists of a wide range of evaluations at different levels, from simply probing the model with respect to provided gender pronouns to testing various aspects of model generation and cognitive behaviors under different gender assumptions, revealing biases associated with varying gender identifiers. We conduct extensive evaluations with GIFI on 22 prominent open-source and proprietary LLMs of varying sizes and capabilities, discovering significant variations in LLMs' gender inclusivity. Our study highlights the importance of improving LLMs' inclusivity, providing a critical benchmark for future advancements in gender fairness in generative models.

[39] SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification

Chengye Wang,Yifei Shen,Zexi Kuang,Arman Cohan,Yilun Zhao

Main category: cs.CL

TL;DR: The paper introduces SciVer, a benchmark for evaluating foundation models' ability to verify claims in multimodal scientific contexts. It assesses 21 models and finds a significant performance gap compared to human experts, identifying limitations in current models through RAG analysis and error evaluations.

Details Motivation: To evaluate the ability of foundation models to verify claims within a multimodal scientific context, highlighting the need for a specific benchmark due to the unique challenges posed by scientific literature. Method: Created SciVer, a benchmark with 3,000 expert-annotated examples across 1,113 scientific papers, covering four reasoning types. Assessed 21 state-of-the-art multimodal foundation models using this benchmark, incorporating retrieval-augmented generation (RAG) and human-conducted error evaluations. Result: Found a substantial performance gap between the assessed models and human experts on SciVer, revealing critical limitations in current open-source models. Conclusion: SciVer provides key insights into advancing models' comprehension and reasoning in multimodal scientific literature tasks, indicating areas for improvement in current models. Abstract: We introduce SciVer, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SciVer consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiment reveals a substantial performance gap between these models and human experts on SciVer. Through an in-depth analysis of retrieval-augmented generation (RAG), and human-conducted error evaluations, we identify critical limitations in current open-source models, offering key insights to advance models' comprehension and reasoning in multimodal scientific literature tasks.

[40] DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement

Shaoqing Lin,Chong Teng,Fei Li,Donghong Ji,Lizhen Qu,Zhuang Li

Main category: cs.CL

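TL;DR: The authors introduce a new task, DiscoSG, and a corresponding dataset, DiscoSG-DS, to address scene graph parsing for multi-sentence visual descriptions. They also propose DiscoSG-Refiner, a method that generates scene graphs in two steps, substantially improving efficiency and performance.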

Details Motivation: Existing text scene graph parsers are designed mainly for single-sentence captions. On multi-sentence visual descriptions they tend to miss phenomena such as cross-sentence coreference, resulting in fragmented graphs and degraded downstream task performance, so a new approach is needed to parse multi-sentence visual descriptions effectively. Method: 1. Introduces a new task, Discourse-level text Scene Graph parsing (DiscoSG), and builds the accompanying dataset DiscoSG-DS. 2. Develops DiscoSG-Refiner with two small PLMs, generating the scene graph in two steps: one small PLM drafts a base graph, and a second PLM iteratively edits it. Result: 1. Fine-tuning a large PLM (such as GPT-4) on DiscoSG-DS improves SPICE by approximately 48%, but inference cost is high and licensing restricts open-source use. 2. DiscoSG-Refiner, using two Flan-T5-Base models, improves SPICE by approximately 30% over the best baseline while running inference 86 times faster than GPT-4. 3. It consistently improves downstream VLM tasks such as discourse-level caption evaluation and hallucination detection. Conclusion: The DiscoSG task and the DiscoSG-Refiner method offer an efficient solution for scene graph parsing of multi-sentence visual descriptions, substantially improving performance and inference speed while benefiting downstream VLM tasks. Abstract: Vision-Language Models (VLMs) now generate discourse-level, multi-sentence visual descriptions, challenging text scene graph parsers originally designed for single-sentence caption-to-graph mapping. Current approaches typically merge sentence-level parsing outputs for discourse input, often missing phenomena like cross-sentence coreference, resulting in fragmented graphs and degraded downstream VLM task performance. To address this, we introduce a new task, Discourse-level text Scene Graph parsing (DiscoSG), supported by our dataset DiscoSG-DS, which comprises 400 expert-annotated and 8,430 synthesised multi-sentence caption-graph pairs for images. Each caption averages 9 sentences, and each graph contains at least 3 times more triples than those in existing datasets. While fine-tuning large PLMs (i.e., GPT-4) on DiscoSG-DS improves SPICE by approximately 48% over the best sentence-merging baseline, high inference cost and restrictive licensing hinder its open-source use, and smaller fine-tuned PLMs struggle with complex graphs. We propose DiscoSG-Refiner, which drafts a base graph using one small PLM, then employs a second PLM to iteratively propose graph edits, reducing full-graph generation overhead. Using two Flan-T5-Base models, DiscoSG-Refiner still improves SPICE by approximately 30% over the best baseline while achieving 86 times faster inference than GPT-4. 
It also consistently improves downstream VLM tasks like discourse-level caption evaluation and hallucination detection. Code and data are available at: https://github.com/ShaoqLin/DiscoSG

[41] WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts

Negar Foroutan,Angelika Romanou,Matin Ansaripour,Julian Martin Eisenschlos,Karl Aberer,Rémi Lebret

Main category: cs.CL

TL;DR: The paper presents WikiMixQA, a new benchmark for evaluating cross-modal reasoning in documents with complex layouts, tables, and charts. It assesses 12 vision-language models, showing significant challenges in long-context, multi-modal reasoning.

Details Motivation: To address the gap in evaluating vision-language models' ability to process long-context vision inputs and perform complex reasoning across multiple modalities in document understanding. Method: Introduced WikiMixQA, a benchmark with 1,000 MCQs based on tables and charts from 4,000 Wikipedia pages across seven topics. Evaluated 12 state-of-the-art vision-language models on this benchmark. Result: Proprietary models achieved ~70% accuracy with direct context but performed poorly with retrieval from long documents. GPT-4-o was the only model exceeding 50% accuracy in this setting, while open-source models reached a maximum of 27% accuracy. Conclusion: WikiMixQA highlights the difficulties in long-context, multi-modal reasoning and serves as an important benchmark for advancing document understanding research. Abstract: Documents are fundamental to preserving and disseminating information, often incorporating complex layouts, tables, and charts that pose significant challenges for automatic document understanding (DU). While vision-language large models (VLLMs) have demonstrated improvements across various tasks, their effectiveness in processing long-context vision inputs remains unclear. This paper introduces WikiMixQA, a benchmark comprising 1,000 multiple-choice questions (MCQs) designed to evaluate cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages spanning seven distinct topics. Unlike existing benchmarks, WikiMixQA emphasizes complex reasoning by requiring models to synthesize information from multiple modalities. We evaluate 12 state-of-the-art vision-language models, revealing that while proprietary models achieve ~70% accuracy when provided with direct context, their performance deteriorates significantly when retrieval from long documents is required. 
Among these, GPT-4-o is the only model exceeding 50% accuracy in this setting, whereas open-source models perform considerably worse, with a maximum accuracy of 27%. These findings underscore the challenges of long-context, multi-modal reasoning and establish WikiMixQA as a crucial benchmark for advancing document understanding research.

[42] From Model to Classroom: Evaluating Generated MCQs for Portuguese with Narrative and Difficulty Concerns

Bernardo Leite,Henrique Lopes Cardoso,Pedro Pinto,Abel Ferreira,Luís Abreu,Isabel Rangel,Sandra Monteiro

Main category: cs.CL

TL;DR: This paper investigates the capabilities of current generative models in producing Portuguese MCQs for reading comprehension, evaluating their quality and reliability through expert review and psychometric analysis.

Details Motivation: Manually creating MCQs with varying difficulty levels and targeted reading skills is time-consuming and costly. Recent advances in generative AI provide an opportunity to automate this process, but assessing the quality and reliability of generated MCQs has received limited attention. Method: The study focuses on generating MCQs that align with curriculum-relevant narrative elements and span different difficulty levels for Portuguese reading comprehension. These MCQs are evaluated through expert review and by analyzing the psychometric properties extracted from student responses. Result: Current models can generate MCQs of comparable quality to human-authored ones for elementary school students. However, issues related to semantic clarity and answerability were identified. Challenges remain in generating engaging and high-quality distractors. Conclusion: Generative models have the potential to produce high-quality MCQs for Portuguese reading comprehension, but further improvements are needed in semantic clarity, answerability, and distractor generation. Abstract: While MCQs are valuable for learning and evaluation, manually creating them with varying difficulty levels and targeted reading skills remains a time-consuming and costly task. Recent advances in generative AI provide an opportunity to automate MCQ generation efficiently. However, assessing the actual quality and reliability of generated MCQs has received limited attention -- particularly regarding cases where generation fails. This aspect becomes particularly important when the generated MCQs are meant to be applied in real-world settings. Additionally, most MCQ generation studies focus on English, leaving other languages underexplored. This paper investigates the capabilities of current generative models in producing MCQs for reading comprehension in Portuguese, a morphologically rich language. 
Our study focuses on generating MCQs that align with curriculum-relevant narrative elements and span different difficulty levels. We evaluate these MCQs through expert review and by analyzing the psychometric properties extracted from student responses to assess their suitability for elementary school students. Our results show that current models can generate MCQs of comparable quality to human-authored ones. However, we identify issues related to semantic clarity and answerability. Also, challenges remain in generating distractors that engage students and meet established criteria for high-quality MCQ option design.
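The abstract does not say which psychometric properties were extracted from student responses. As a minimal sketch, assuming the two most common classical test theory statistics (item difficulty as the proportion of correct answers, and item discrimination as the point-biserial correlation with the total score), the analysis could look like the following. The function names and the toy response matrix are illustrative, not taken from the paper.

```python
from statistics import mean, pstdev

def item_difficulty(item_scores):
    """Proportion of students who answered the item correctly (0..1)."""
    return mean(item_scores)

def item_discrimination(item_scores, total_scores):
    """Point-biserial correlation between item correctness and total score."""
    sd_item = pstdev(item_scores)
    sd_total = pstdev(total_scores)
    if sd_item == 0 or sd_total == 0:
        return 0.0  # undefined when everyone scores the same; report 0
    mean_item, mean_total = mean(item_scores), mean(total_scores)
    covariance = mean(
        [(i - mean_item) * (t - mean_total) for i, t in zip(item_scores, total_scores)]
    )
    return covariance / (sd_item * sd_total)

# Toy data (assumption): rows are students, columns are MCQs, 1 = correct.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
totals = [sum(row) for row in responses]
first_item = [row[0] for row in responses]

difficulty = item_difficulty(first_item)
discrimination = item_discrimination(first_item, totals)
print(f"difficulty={difficulty:.2f} discrimination={discrimination:.2f}")
```

An item that nearly everyone answers correctly has a difficulty close to 1, and an item with low or negative discrimination is a candidate for the semantic clarity and answerability problems the abstract mentions.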

[43] The Compositional Architecture of Regret in Large Language Models

Xiangxiang Cui,Shu Yang,Tianjin Huang,Wanyu Lin,Lijie Hu,Di Wang

Main category: cs.CL

TL;DR: The paper explores the regret mechanism in Large Language Models, proposing solutions to three key challenges: constructing a regret dataset, identifying optimal regret representation layers with S-CDI, and categorizing neurons using RDS. Experiments revealed an M-shaped decoupling pattern and successfully enhanced probe classification performance.

Details Motivation: To enhance model reliability and understand how cognition is coded in neural networks by studying the regret mechanism in Large Language Models. Method: Proposed a workflow for creating a regret dataset, introduced S-CDI metric for identifying optimal regret representation layers, and developed RDS and GIC metrics for neuron categorization and activation pattern analysis. Result: Successfully identified the optimal regret representation layer with S-CDI, enhancing probe classification performance. Discovered an M-shaped decoupling pattern across layers and categorized neurons into three functional groups. Conclusion: The proposed methods effectively address challenges in studying regret mechanisms, providing insights into information processing and contributing to more reliable language models. Abstract: Regret in Large Language Models refers to their explicit regret expression when presented with evidence contradicting their previously generated misinformation. Studying the regret mechanism is crucial for enhancing model reliability and helps in revealing how cognition is coded in neural networks. To understand this mechanism, we need to first identify regret expressions in model outputs, then analyze their internal representation. This analysis requires examining the model's hidden states, where information processing occurs at the neuron level. However, this faces three key challenges: (1) the absence of specialized datasets capturing regret expressions, (2) the lack of metrics to find the optimal regret representation layer, and (3) the lack of metrics for identifying and analyzing regret neurons. 
Addressing these limitations, we propose: (1) a workflow for constructing a comprehensive regret dataset through strategically designed prompting scenarios, (2) the Supervised Compression-Decoupling Index (S-CDI) metric to identify optimal regret representation layers, and (3) the Regret Dominance Score (RDS) metric to identify regret neurons and the Group Impact Coefficient (GIC) to analyze activation patterns. Our experimental results successfully identified the optimal regret representation layer using the S-CDI metric, which significantly enhanced performance in probe classification experiments. Additionally, we discovered an M-shaped decoupling pattern across model layers, revealing how information processing alternates between coupling and decoupling phases. Through the RDS metric, we categorized neurons into three distinct functional groups: regret neurons, non-regret neurons, and dual neurons.

[44] Minding the Politeness Gap in Cross-cultural Communication

Yuka Machino,Matthias Hofer,Max Siegel,Joshua B. Tenenbaum,Robert D. Hawkins

Main category: cs.CL

TL;DR: Experiments on British and American English speakers' interpretation of intensifiers reveal cross-cultural differences stem from both literal meanings and utterance cost, challenging accounts based purely on semantic variation or politeness norms.

Details Motivation: To clarify whether differences in cross-cultural communication arise from literal word meanings or pragmatic factors like politeness and brevity. Method: Three experiments were conducted on speakers of British and American English regarding their interpretation of intensifiers. A computational cognitive model was developed where listeners reason about speakers balancing informativity, politeness, and utterance cost. Result: Cross-cultural differences in intensifier interpretation come from different literal meanings and different weights on utterance cost. Conclusion: The findings challenge pure semantic variation or politeness norm accounts, showing cross-cultural interpretation differences emerge from the interplay between these factors. Abstract: Misunderstandings in cross-cultural communication often arise from subtle differences in interpretation, but it is unclear whether these differences arise from the literal meanings assigned to words or from more general pragmatic factors such as norms around politeness and brevity. In this paper, we report three experiments examining how speakers of British and American English interpret intensifiers like "quite" and "very." To better understand these cross-cultural differences, we developed a computational cognitive model where listeners recursively reason about speakers who balance informativity, politeness, and utterance cost. Our model comparisons suggested that cross-cultural differences in intensifier interpretation stem from a combination of (1) different literal meanings and (2) different weights on utterance cost. These findings challenge accounts based purely on semantic variation or politeness norms, demonstrating that cross-cultural differences in interpretation emerge from an intricate interplay between the two.
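The kind of model described above, where a listener reasons about a speaker who trades off informativity, politeness, and utterance cost, can be sketched in the style of a Rational Speech Acts model. This is a generic illustration, not the authors' model: the literal semantics table, the weights, the rationality parameter `alpha`, and the word-count cost are all assumptions made up for the example.

```python
from math import exp, log

STATES = [1, 2, 3, 4, 5]  # hypothetical quality ratings being described

# Hypothetical soft literal semantics: how well each utterance fits each state.
LITERAL = {
    "good":       [0.05, 0.20, 0.60, 0.90, 0.95],
    "quite good": [0.05, 0.10, 0.50, 0.80, 0.70],
    "very good":  [0.01, 0.05, 0.20, 0.70, 0.99],
}
COST = {u: len(u.split()) for u in LITERAL}  # assumption: cost = number of words

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def literal_listener(utterance):
    """P(state | utterance) under the literal semantics and a uniform prior."""
    return normalize(LITERAL[utterance])

def speaker(state_index, w_inf=1.0, w_polite=0.5, w_cost=0.3, alpha=3.0):
    """P(utterance | state): softmax over a weighted utility."""
    utilities = []
    for u in LITERAL:
        l0 = literal_listener(u)
        informativity = log(l0[state_index])
        politeness = sum(p * s for p, s in zip(l0, STATES))  # expected rating conveyed
        utilities.append(alpha * (w_inf * informativity + w_polite * politeness - w_cost * COST[u]))
    return dict(zip(LITERAL, normalize([exp(x) for x in utilities])))

def pragmatic_listener(utterance, **weights):
    """P(state | utterance) obtained by inverting the speaker model."""
    return normalize([speaker(i, **weights)[utterance] for i in range(len(STATES))])

posterior_cheap = pragmatic_listener("quite good", w_cost=0.0)
posterior_costly = pragmatic_listener("quite good", w_cost=1.0)
print([round(p, 3) for p in posterior_cheap])
print([round(p, 3) for p in posterior_costly])
```

In this sketch, changing either the `LITERAL` table or `w_cost` shifts how the listener interprets "quite good", which mirrors the two sources of variation the abstract identifies.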

[45] Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability

Yusuke Sakai,Hidetaka Kamigaito,Taro Watanabe

Main category: cs.CL

TL;DR: In generative commonsense reasoning tasks, LLMs generally understand instructions but struggle with concept order diversity. The proposed Ordered CommonGen benchmark evaluates LLMs' instruction-following and compositional generalization abilities by measuring ordered coverage.

Details Motivation: To better evaluate LLMs' ability to follow instructions and their compositional generalization, especially in maintaining concept order in sentence generation. Method: Proposed the Ordered CommonGen benchmark which measures whether concepts are generated in a specified order, allowing simultaneous evaluation of instruction-following and compositional generalization abilities. Result: Analysis of 36 LLMs showed that while they generally understand instructions, there is a tendency towards low-diversity outputs due to biases in concept order patterns. Even top-performing models achieved only about 75% ordered coverage. Conclusion: LLMs need improvement in both instruction-following and compositional generalization capabilities to enhance output diversity and accuracy in concept order. Abstract: In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts. However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered. Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities.
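The ordered coverage idea can be illustrated with a short sketch. It assumes a simplified definition: an output counts as covered only if every concept appears as an exact lowercase token and the concepts occur in the specified order. The benchmark's actual matching (for example, handling of inflected forms) may differ, and the function names and example sentences are hypothetical.

```python
import re

def concepts_in_order(sentence, concepts):
    """True if all concepts occur as tokens in the given order (subsequence match)."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    position = 0
    for concept in concepts:
        try:
            # Search only after the previous concept's position.
            position = tokens.index(concept.lower(), position) + 1
        except ValueError:
            return False
    return True

def ordered_coverage(examples):
    """Fraction of (sentence, concept order) pairs that respect the order."""
    hits = sum(concepts_in_order(sentence, concepts) for sentence, concepts in examples)
    return hits / len(examples)

examples = [
    ("The dog chased the ball across the park.", ["dog", "ball", "park"]),
    ("In the park, a dog found a ball.", ["dog", "ball", "park"]),
]
score = ordered_coverage(examples)
print(score)
```

The second sentence contains all three concepts but in a different order, so it would pass an unordered coverage check while failing the ordered one.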

[46] Oldies but Goldies: The Potential of Character N-grams for Romanian Texts

Dana Lupsa,Sanda-Maria Avram

Main category: cs.CL

TL;DR: This study explores authorship attribution for Romanian texts using six machine learning techniques and character n-gram features. Among these, the ANN model performed best with perfect classification in four out of fifteen runs when using 5-gram features.

Details Motivation: The motivation is to address the problem of authorship attribution for Romanian texts by evaluating various machine learning techniques on the ROST corpus, a standard benchmark in the field. Method: Six machine learning techniques were systematically evaluated: Support Vector Machine (SVM), Logistic Regression (LR), k-Nearest Neighbors (k-NN), Decision Trees (DT), Random Forests (RF), and Artificial Neural Networks (ANN). Character n-gram features were used for classification. Result: The ANN model achieved the highest performance, including perfect classification in four out of fifteen runs when using 5-gram features. Conclusion: Lightweight, interpretable character n-gram approaches can deliver state-of-the-art accuracy for Romanian authorship attribution, showing potential in resource-constrained or under-studied language settings. Abstract: This study addresses the problem of authorship attribution for Romanian texts using the ROST corpus, a standard benchmark in the field. We systematically evaluate six machine learning techniques: Support Vector Machine (SVM), Logistic Regression (LR), k-Nearest Neighbors (k-NN), Decision Trees (DT), Random Forests (RF), and Artificial Neural Networks (ANN), employing character n-gram features for classification. Among these, the ANN model achieved the highest performance, including perfect classification in four out of fifteen runs when using 5-gram features. These results demonstrate that lightweight, interpretable character n-gram approaches can deliver state-of-the-art accuracy for Romanian authorship attribution, rivaling more complex methods. Our findings highlight the potential of simple stylometric features in resource-constrained or under-studied language settings.
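To make the feature side concrete, here is a minimal sketch of character 5-gram extraction. For brevity it attributes a text to the author with the most similar n-gram profile by cosine similarity, which is a deliberately simpler stand-in and not one of the six classifiers evaluated in the paper. The toy English strings and author labels are placeholders, not ROST data.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=5):
    """Count overlapping character n-grams, spaces included."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def attribute(text, profiles, n=5):
    """Return the author whose profile is most similar to the text."""
    query = char_ngrams(text, n)
    return max(profiles, key=lambda author: cosine(query, profiles[author]))

# Toy training texts (assumption), one per hypothetical author.
training = {
    "author_a": "the river runs slowly through the quiet valley",
    "author_b": "markets opened sharply higher on strong earnings",
}
profiles = {author: char_ngrams(text) for author, text in training.items()}
predicted = attribute("a quiet valley by the river", profiles)
print(predicted)
```

The same `char_ngrams` counts could be fed as feature vectors to any of the classifiers listed in the abstract.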

[47] CC-LEARN: Cohort-based Consistency Learning

Xiao Ye,Shaswat Shrivastava,Zhaonan Li,Jacob Dineen,Shijie Lu,Avneet Ahuja,Ming Shen,Zhikun Xu,Ben Zhou

Main category: cs.CL

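TL;DR: Large language models excel at many tasks but still struggle with consistent, robust reasoning. This paper proposes Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning. It trains on cohorts of similar questions derived from shared programmatic abstractions and defines a composite objective combining cohort accuracy, a retrieval bonus, and a rejection penalty to enforce cohort-level consistency. Experiments show that CC-Learn improves accuracy and reasoning stability on challenging reasoning benchmarks.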

Details Motivation: Although large language models excel at many tasks, they still fall short in consistent and robust reasoning, so a new approach is needed to improve the reliability and consistency of LLM reasoning. Method: The paper proposes Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework. It trains on cohorts of similar questions derived from shared programmatic abstractions and defines a composite objective comprising cohort accuracy, a retrieval bonus, and a rejection penalty, thereby enforcing consistency at the cohort level. Unlike supervised fine-tuning, reinforcement learning can optimize this reward directly, guiding the model to adopt uniform reasoning patterns across all cohort members. Result: On several challenging reasoning benchmarks (such as ARC-Challenge and StrategyQA), CC-Learn significantly improved accuracy and reasoning stability, outperforming pretrained and supervised fine-tuning baselines. Conclusion: The results show that cohort-level reinforcement learning can effectively enhance the reasoning consistency of large language models. Abstract: Large language models excel at many tasks but still struggle with consistent, robust reasoning. We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. To enforce cohort-level consistency, we define a composite objective combining cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups that reinforcement learning can directly optimize, unlike supervised fine-tuning. Optimizing this reward guides the model to adopt uniform reasoning patterns across all cohort members. Experiments on challenging reasoning benchmarks (including ARC-Challenge and StrategyQA) show that CC-Learn boosts both accuracy and reasoning stability over pretrained and SFT baselines. These results demonstrate that cohort-level RL effectively enhances reasoning consistency in LLMs.
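The composite objective named in the abstract can be sketched as a single cohort-level scalar. The abstract does not give the functional form, so the additive combination, the weights `bonus_weight` and `penalty_weight`, and the `Rollout` fields below are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    """One model attempt at one question in the cohort (hypothetical fields)."""
    correct: bool
    useful_retrievals: int   # lookups that contributed to the decomposition
    rejected_lookups: int    # trivial or invalid lookups

def cohort_reward(rollouts, bonus_weight=0.1, penalty_weight=0.2):
    """Cohort accuracy plus a retrieval bonus minus a rejection penalty."""
    n = len(rollouts)
    accuracy = sum(r.correct for r in rollouts) / n
    bonus = bonus_weight * sum(r.useful_retrievals for r in rollouts) / n
    penalty = penalty_weight * sum(r.rejected_lookups for r in rollouts) / n
    return accuracy + bonus - penalty

cohort = [
    Rollout(correct=True, useful_retrievals=2, rejected_lookups=0),
    Rollout(correct=True, useful_retrievals=1, rejected_lookups=1),
    Rollout(correct=False, useful_retrievals=0, rejected_lookups=2),
    Rollout(correct=True, useful_retrievals=1, rejected_lookups=0),
]
reward = cohort_reward(cohort)
print(round(reward, 2))
```

Because the reward is computed over the whole cohort rather than per question, a policy is pushed toward a reasoning pattern that works for every variant of the underlying problem, not just one instance.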

[48] Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers

Tommaso Green,Martin Gubri,Haritz Puerto,Sangdoo Yun,Seong Joon Oh

Main category: cs.CL

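TL;DR: The study finds that the reasoning traces of large reasoning models leak private information, even though these traces are usually assumed to be internal and safe. Test-time compute approaches (such as more reasoning steps) make the model's final answers more cautious but also increase the risk of privacy leakage. Reasoning improves utility yet enlarges the privacy attack surface, so safety efforts need to extend to the model's internal thinking, not just its outputs.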

Details Motivation: To challenge the assumption that reasoning traces are safe: they may contain sensitive user data, which can be extracted via prompt injections or accidentally leak into outputs. Method: Through probing and agentic evaluations, the study shows how test-time compute approaches (particularly increased reasoning steps) amplify privacy leakage. Result: Increasing the reasoning budget makes models more cautious in their final answers, but also makes their reasoning more verbose and causes them to leak more. Conclusion: Safety efforts should address not only the model's outputs but also its internal thinking. Abstract: We study privacy leakage in the reasoning traces of large reasoning models used as personal agents. Unlike final outputs, reasoning traces are often assumed to be internal and safe. We challenge this assumption by showing that reasoning traces frequently contain sensitive user data, which can be extracted via prompt injections or accidentally leak into outputs. Through probing and agentic evaluations, we demonstrate that test-time compute approaches, particularly increased reasoning steps, amplify such leakage. While increasing the budget of those test-time compute approaches makes models more cautious in their final answers, it also leads them to reason more verbosely and leak more in their own thinking. This reveals a core tension: reasoning improves utility but enlarges the privacy attack surface. We argue that safety efforts must extend to the model's internal thinking, not just its outputs.

[49] Gender-Neutral Machine Translation Strategies in Practice

Hillary Dawkins,Isar Nejadgholi,Chi-kiu Lo

Main category: cs.CL

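TL;DR: This paper studies how well 21 machine translation systems maintain gender neutrality when faced with gender ambiguity. Overall performance is disappointing, but a few systems produce gender-neutral translations using specific strategies that depend on the target language.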

Details Motivation: To avoid misgendering and representational harms when machine translation renders gender-ambiguous source text into grammatical gender languages. Method: The study assesses the sensitivity of 21 MT systems to the need for gender neutrality in three translation directions of varying difficulty, categorizes and discusses the specific gender-neutral strategies observed in practice, and examines the effect of binary gender stereotypes on gender-neutral translation. Result: Most MT systems failed to provide gender-neutral translations in response to gender ambiguity, but a small number achieved them using specific strategies that depend on the target language. Conclusion: MT systems fall significantly short in maintaining gender neutrality under gender ambiguity and need further improvement to reduce gender bias and representational harms. Abstract: Gender-inclusive machine translation (MT) should preserve gender ambiguity in the source to avoid misgendering and representational harms. While gender ambiguity often occurs naturally in notional gender languages such as English, maintaining that gender neutrality in grammatical gender languages is a challenge. Here we assess the sensitivity of 21 MT systems to the need for gender neutrality in response to gender ambiguity in three translation directions of varying difficulty. The specific gender-neutral strategies that are observed in practice are categorized and discussed. Additionally, we examine the effect of binary gender stereotypes on the use of gender-neutral translation. In general, we report a disappointing absence of gender-neutral translations in response to gender ambiguity. However, we observe a small handful of MT systems that switch to gender neutral translation using specific strategies, depending on the target language.

[50] GenRecal: Generation after Recalibration from Large to Small Vision-Language Models

Byung-Kwan Lee,Ryo Hachiuma,Yong Man Ro,Yu-Chiang Frank Wang,Yueh-Hua Wu

Main category: cs.CL

TL;DR: Recent advancements in vision-language models (VLMs) have been significant, but deploying them in real-world scenarios is challenging due to their computational demands. To address this, the paper presents Generation after Recalibration (GenRecal), a novel distillation framework for VLMs that incorporates a Recalibrator to align and adapt feature representations between heterogeneous VLMs.

Details Motivation: To overcome the challenges of deploying large VLMs in real-world scenarios, especially on resource-constrained devices, by distilling knowledge from large VLMs into smaller, more efficient counterparts. Method: The method involves using a novel distillation framework called GenRecal which includes a Recalibrator to align and adapt feature representations between different types of VLMs. Result: Through extensive experiments on multiple challenging benchmarks, GenRecal significantly improves baseline performances and outperforms both open- and closed-source large-scale VLMs. Conclusion: GenRecal is an effective solution for overcoming the challenge posed by the diversity of VLM architectures, enabling successful knowledge transfer across different types of VLMs. Abstract: Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge arises here from the diversity of VLM architectures, which are built on different LLMs and employ varying token types-differing in vocabulary size, token splits, and token index ordering. To address this challenge of limitation to a specific VLM type, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. 
Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performances, eventually outperforming large-scale open- and closed-source VLMs.

[51] PhantomHunter: Detecting Unseen Privately-Tuned LLM-Generated Text via Family-Aware Learning

Yuhui Shi,Yehan Yang,Qiang Sheng,Hao Mi,Beizhe Hu,Chaoxi Xu,Juan Cao

Main category: cs.CL

TL;DR: With the rise of large language models (LLMs), issues such as misinformation and academic misconduct have become more severe, making the detection of LLM-generated text crucial. Existing methods struggle with detecting text from privately tuned LLMs, a challenge that remains underexplored. The paper introduces PhantomHunter, an LLM-generated text detector designed to detect text from unseen, privately-tuned LLMs. It uses a family-aware learning framework to capture shared traits across base models and their derivatives, achieving F1 scores over 96% on data from LLaMA, Gemma, and Mistral families.

Details Motivation: The motivation behind this research is the increasing severity of societal problems caused by LLMs, such as misinformation and academic misconduct, which highlights the importance of detecting LLM-generated text. However, existing detectors perform poorly when faced with text from privately tuned LLMs, necessitating a new approach. Method: The method proposed in this paper is PhantomHunter, an LLM-generated text detector specialized for detecting text from unseen, privately-tuned LLMs. It employs a family-aware learning framework that captures family-level traits shared across base models and their derivatives, rather than focusing on individual characteristics. Result: Experiments conducted on data from LLaMA, Gemma, and Mistral families demonstrate the superiority of PhantomHunter over 7 baselines and 3 industrial services, achieving F1 scores of over 96%. Conclusion: PhantomHunter effectively addresses the challenge posed by text from privately tuned LLMs, showcasing its potential as a robust solution for detecting LLM-generated text. Abstract: With the popularity of large language models (LLMs), undesirable societal problems like misinformation production and academic misconduct have been more severe, making LLM-generated text detection now of unprecedented importance. Although existing methods have made remarkable progress, a new challenge posed by text from privately tuned LLMs remains underexplored. Users could easily possess private LLMs by fine-tuning an open-source one with private corpora, resulting in a significant performance drop of existing detectors in practice. To address this issue, we propose PhantomHunter, an LLM-generated text detector specialized for detecting text from unseen, privately-tuned LLMs. Its family-aware learning framework captures family-level traits shared across the base models and their derivatives, instead of memorizing individual characteristics. 
Experiments on data from LLaMA, Gemma, and Mistral families show its superiority over 7 baselines and 3 industrial services, with F1 scores of over 96%.

cs.CV [Back]

[52] SemIRNet: A Semantic Irony Recognition Network for Multimodal Sarcasm Detection

Jingxuan Zhou,Yuehao Wu,Yibo Zhang,Yeyubei Zhang,Yunchong Liu,Bolin Huang,Chunhong Yuan

Main category: cs.CV

TL;DR: This paper proposes a Semantic Irony Recognition Network (SemIRNet) for multimodal irony detection tasks, which utilizes ConceptNet knowledge base, cross-modal semantic similarity detection modules and contrastive learning loss function. It improves accuracy and F1 value on a benchmark dataset and ablation experiments confirm the importance of knowledge fusion and semantic similarity detection.

Details Motivation: There is difficulty in accurately identifying graphical implicit correlations in multimodal irony detection tasks. Method: The proposed method SemIRNet includes three innovations: introducing ConceptNet knowledge base to enhance common-sense reasoning ability, designing two cross-modal semantic similarity detection modules at word level and sample level to model graphic-textual correlations, and introducing a contrastive learning loss function to optimize spatial distribution of sample features. Result: Experiments on a publicly available multimodal irony detection benchmark dataset show that the accuracy and F1 value are improved by 1.64% and 2.88% respectively compared with existing optimal methods. Ablation experiments verify the important role of knowledge fusion and semantic similarity detection in improving the model performance. Conclusion: SemIRNet effectively enhances the model's common-sense reasoning ability and improves the separability of positive and negative samples through its innovative components. Abstract: Aiming at the problem of difficulty in accurately identifying graphical implicit correlations in multimodal irony detection tasks, this paper proposes a Semantic Irony Recognition Network (SemIRNet). The model contains three main innovations: (1) The ConceptNet knowledge base is introduced for the first time to acquire conceptual knowledge, which enhances the model's common-sense reasoning ability; (2) Two cross-modal semantic similarity detection modules at the word level and sample level are designed to model graphic-textual correlations at different granularities; and (3) A contrastive learning loss function is introduced to optimize the spatial distribution of the sample features, which improves the separability of positive and negative samples. 
Experiments on a publicly available multimodal irony detection benchmark dataset show that the accuracy and F1 value of this model are improved by 1.64% and 2.88% to 88.87% and 86.33%, respectively, compared with the existing optimal methods. Further ablation experiments verify the important role of knowledge fusion and semantic similarity detection in improving the model performance.

[53] Argus Inspection: Do Multimodal Large Language Models Possess the Eye of Panoptes?

Yang Yao,Lingyu Li,Jiaxin Song,Chiyu Chen,Zhenqi He,Yixu Wang,Xin Wang,Tianle Gu,Jie Li,Yan Teng,Yingchun Wang

Main category: cs.CV

TL;DR: Multimodal Large Language Models (MLLMs) have advanced in cognitive and reasoning abilities, but still face challenges in visual fine-grained perception and commonsense causal inference. This paper presents Argus Inspection, a benchmark for evaluating these capabilities, and Eye of Panoptes, a framework for assessing opinion-based reasoning tasks. Experiments on 26 MLLMs show significant room for improvement.

Details Motivation: To address the limitations of current MLLMs in visual fine-grained perception and commonsense causal inference. Method: Introduced Argus Inspection, a multimodal benchmark with two difficulty levels for evaluating detailed visual recognition and real-world commonsense understanding. Developed the Eye of Panoptes framework, which uses a binary parametric Sigmoid metric and an indicator function to evaluate MLLMs' responses in opinion-based reasoning tasks. Result: Experiments on 26 mainstream MLLMs showed the highest performance in visual fine-grained reasoning was only 0.46, indicating substantial room for improvement. Conclusion: The research provides valuable insights for the ongoing development and refinement of MLLMs. Abstract: As Multimodal Large Language Models (MLLMs) continue to evolve, their cognitive and reasoning capabilities have seen remarkable progress. However, challenges in visual fine-grained perception and commonsense causal inference persist. This paper introduces Argus Inspection, a multimodal benchmark with two levels of difficulty, emphasizing detailed visual recognition while incorporating real-world commonsense understanding to evaluate causal reasoning abilities. Expanding on it, we present the Eye of Panoptes framework, which integrates a binary parametric Sigmoid metric with an indicator function, enabling a more holistic evaluation of MLLMs' responses in opinion-based reasoning tasks. Experiments conducted on 26 mainstream MLLMs reveal that the highest performance in visual fine-grained reasoning reaches only 0.46, highlighting considerable potential for enhancement. Our research offers valuable perspectives for the continued refinement of MLLMs.

[54] A Hybrid ConvNeXt-EfficientNet AI Solution for Precise Falcon Disease Detection

Alavikunhu Panthakkan,Zubair Medammal,S M Anzar,Fatma Taher,Hussain Al-Ahmad

Main category: cs.CV

TL;DR: This paper introduces a hybrid ConvNeXt and EfficientNet AI model for classifying falcon diseases, surpassing traditional methods in accuracy.

Details Motivation: To enhance the health surveillance of falcons by developing an accurate method for disease classification. Method: The study uses a hybrid AI model combining ConvNeXt and EfficientNet for identifying three conditions: Normal, Liver Disease, and Aspergillosis. It relies on performance metrics such as accuracy, precision, recall, and F1-score. Result: The concatenated AI model performs better than traditional diagnostic methods and individual model architectures. Conclusion: This hybrid AI model represents a significant advancement in precise falcon disease detection and opens avenues for future AI-driven avian healthcare solutions. Abstract: Falconry, a revered tradition involving training falcons and hunting with them, requires meticulous health surveillance to ensure the health and safety of these prized birds, particularly in hunting scenarios. This paper presents an innovative method employing a hybrid of ConvNeXt and EfficientNet AI models for the classification of falcon diseases. The study focuses on accurately identifying three conditions: Normal, Liver Disease, and Aspergillosis. A substantial dataset was utilized for training and validating the model, with an emphasis on key performance metrics such as accuracy, precision, recall, and F1-score. Extensive testing and analysis have shown that our concatenated AI model outperforms traditional diagnostic methods and individual model architectures. The successful implementation of this hybrid AI model marks a significant step forward in precise falcon disease detection and paves the way for future developments in AI-powered avian healthcare solutions.

[55] ViLLa: A Neuro-Symbolic approach for Animal Monitoring

Harsha Koduri

Main category: cs.CV

TL;DR: ViLLa is a neuro-symbolic framework integrating visual detection, language parsing and symbolic reasoning for interpretable animal monitoring in natural environments.

Details Motivation: Current systems lack the ability to effectively interpret both visual data and human language queries for animal population monitoring. Method: ViLLa integrates three core components: a visual detection module for identifying animals and their spatial locations in images, a language parser for understanding natural language queries, and a symbolic reasoning layer that applies logic-based inference to answer those queries. Result: The system was evaluated on a range of animal imagery tasks and demonstrates the ability to bridge visual content with structured, human-interpretable queries. Conclusion: ViLLa offers modularity and transparency by separating perception, understanding, and reasoning, making it suitable for interpretable animal monitoring. Abstract: Monitoring animal populations in natural environments requires systems that can interpret both visual data and human language queries. This work introduces ViLLa (Vision-Language-Logic Approach), a neuro-symbolic framework designed for interpretable animal monitoring. ViLLa integrates three core components: a visual detection module for identifying animals and their spatial locations in images, a language parser for understanding natural language queries, and a symbolic reasoning layer that applies logic-based inference to answer those queries. Given an image and a question such as "How many dogs are in the scene?" or "Where is the buffalo?", the system grounds visual detections into symbolic facts and uses predefined rules to compute accurate answers related to count, presence, and location. Unlike end-to-end black-box models, ViLLa separates perception, understanding, and reasoning, offering modularity and transparency. The system was evaluated on a range of animal imagery tasks and demonstrates the ability to bridge visual content with structured, human-interpretable queries.
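The grounding and rule step described in the abstract can be sketched as follows: detections become symbolic facts, a small parser maps the question to a query type, and fixed rules compute count, presence, or location. Everything here is hypothetical, including the `Detection` structure, the regex-based parser, the three-way left/center/right location rule, and the hard-coded detections standing in for a real visual detector.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Detection:
    """A symbolic fact grounded from one visual detection."""
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels

def region(box, image_width):
    """Rule: map a box to a coarse horizontal region by its center."""
    center_x = (box[0] + box[2]) / 2
    if center_x < image_width / 3:
        return "left"
    if center_x < 2 * image_width / 3:
        return "center"
    return "right"

def singular(noun):
    """Naive plural stripping, enough for this example only."""
    return noun[:-1] if noun.endswith("s") else noun

def answer(question, facts, image_width):
    """Parse the question into a query type and apply the matching rule."""
    q = question.lower().strip()
    match = re.match(r"how many (\w+)", q)
    if match:
        label = singular(match.group(1))
        return sum(1 for f in facts if f.label == label)
    match = re.match(r"where is the (\w+)", q)
    if match:
        found = [f for f in facts if f.label == match.group(1)]
        return region(found[0].box, image_width) if found else "not present"
    match = re.match(r"is there an? (\w+)", q)
    if match:
        return any(f.label == match.group(1) for f in facts)
    raise ValueError("unsupported question")

# Stand-in for detector output on a 640-pixel-wide image (assumption).
facts = [
    Detection("dog", (10, 40, 110, 160)),
    Detection("dog", (300, 50, 380, 150)),
    Detection("buffalo", (500, 20, 630, 200)),
]
count = answer("How many dogs are in the scene?", facts, image_width=640)
location = answer("Where is the buffalo?", facts, image_width=640)
print(count, location)
```

Because each stage is a separate function, a wrong answer can be traced to either a missed detection, a misparsed question, or a faulty rule, which is the transparency argument the abstract makes against end-to-end models.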

[56] GraphGSOcc: Semantic and Geometric Graph Transformer for 3D Gaussian Splating-based Occupancy Prediction

Ke Song,Yunhe Wu,Chunchit Siu,Huiyuan Xiong

Main category: cs.CV

TL;DR: The paper presents GraphGSOcc, a model that uses a semantic and geometric graph Transformer for 3D Gaussian Splatting-based occupancy prediction in autonomous driving. It addresses issues in existing methods through Dual Gaussians Graph Attention and a Multi-scale Graph Attention framework, improving mIoU and reducing GPU memory.

Details Motivation: To address limitations in current 3D Gaussian Splatting (3DGS) methods such as unified feature aggregation ignoring semantic correlations and boundary ambiguities due to lack of geometric constraints. Method: Proposed the GraphGSOcc model which includes Dual Gaussians Graph Attention that constructs dual graph structures (geometric and semantic graphs) and is coupled with a Multi-scale Graph Attention framework. Result: Experiments on the SurroundOcc dataset showed an mIoU of 24.10%, with reduced GPU memory to 6.1 GB, indicating a 1.97% mIoU improvement and 13.7% memory reduction compared to GaussianWorld. Conclusion: GraphGSOcc effectively improves the accuracy of 3D semantic occupancy prediction while significantly reducing the required GPU memory. Abstract: Addressing the task of 3D semantic occupancy prediction for autonomous driving, we tackle two key issues in existing 3D Gaussian Splatting (3DGS) methods: (1) unified feature aggregation neglecting semantic correlations among similar categories and across regions, and (2) boundary ambiguities caused by the lack of geometric constraints in MLP iterative optimization. We propose the GraphGSOcc model, a novel framework that combines semantic and geometric graph Transformer for 3D Gaussian Splatting-based Occupancy Prediction. We propose the Dual Gaussians Graph Attention, which dynamically constructs dual graph structures: a geometric graph adaptively calculating KNN search radii based on Gaussian poses, enabling large-scale Gaussians to aggregate features from broader neighborhoods while compact Gaussians focus on local geometric consistency; a semantic graph retaining top-M highly correlated nodes via cosine similarity to explicitly encode semantic relationships within and across instances. Coupled with the Multi-scale Graph Attention framework, fine-grained attention at lower layers optimizes boundary details, while coarse-grained attention at higher layers models object-level topology.
Experiments on the SurroundOcc dataset achieve an mIoU of 24.10%, reducing GPU memory to 6.1 GB, demonstrating a 1.97% mIoU improvement and 13.7% memory reduction compared to GaussianWorld.
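The semantic graph in the GraphGSOcc abstract keeps, for each node, the top-M most correlated nodes by cosine similarity. A minimal sketch of that selection step, assuming each Gaussian is reduced to a plain feature vector; the function names and toy features are hypothetical and this is not the authors' implementation:

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_m_neighbors(features, m):
    """For every node, keep the indices of its m most similar other nodes."""
    graph = {}
    for i, feat_i in enumerate(features):
        scored = [(cosine(feat_i, feat_j), j)
                  for j, feat_j in enumerate(features) if j != i]
        # Highest similarity first; ties broken by lower index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        graph[i] = [j for _, j in scored[:m]]
    return graph


# Toy features: nodes 0 and 1 point one way, nodes 2 and 3 another.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
graph = top_m_neighbors(features, m=1)
print(graph)
```

This brute-force version is quadratic in the number of nodes; a real system would presumably use batched matrix operations on the GPU.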

[57] DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning

Yifeng Gao,Yifan Ding,Hongyu Su,Juncheng Li,Yunhan Zhao,Lin Luo,Zixing Chen,Li Wang,Xin Wang,Yixu Wang,Xingjun Ma,Yu-Gang Jiang

Main category: cs.CV

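TL;DR: As AI-generated video spreads across media platforms, distinguishing synthetic content from authentic footage is becoming increasingly important. Most existing methods treat this challenge as a binary classification task, but the core challenge also lies in providing fine-grained, persuasive evidence. We introduce the DAVID-X dataset, which pairs AI-generated videos with detailed temporal-spatial annotations and written rationales. Building on these annotations, we present the DAVID-XR1 model, designed to deliver an interpretable chain of visual reasoning that includes defect categorization, temporal-spatial localization, and natural language explanations. Experiments show that a general-purpose backbone, fine-tuned and enhanced on our compact dataset, generalizes strongly across a variety of generators and generation modes. The results highlight the promise of explainable detection methods for trustworthy identification of AI-generated video content.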

Details Motivation: The widespread use of AI-generated video on media platforms makes it essential to distinguish synthetic content from authentic footage. Existing methods focus mainly on binary classification and offer little analysis of why or where a video is flagged, so a more transparent and verifiable approach is needed. Method: The authors introduce the DAVID-X dataset, which contains AI-generated videos together with detailed defect-level temporal-spatial annotations and written rationales. Based on this dataset they develop DAVID-XR1, a video-language model that provides an interpretable chain of visual reasoning through defect categorization, temporal-spatial localization, and natural language explanations. A general-purpose backbone is used and enhanced with chain-of-thought distillation to improve generalization. Result: Experiments show that the fine-tuned model generalizes strongly across multiple generators and generation modes, demonstrating the potential of explainable detection methods for reliably identifying AI-generated video. Conclusion: By introducing the DAVID-X dataset and the DAVID-XR1 model, this work turns AI-generated video detection from a black-box decision into a transparent and verifiable process, offering a new route to trustworthy identification of AI-generated video. Abstract: As AI-generated video becomes increasingly pervasive across media platforms, the ability to reliably distinguish synthetic content from authentic footage has become both urgent and essential. Existing approaches have primarily treated this challenge as a binary classification task, offering limited insight into where or why a model identifies a video as AI-generated. However, the core challenge extends beyond simply detecting subtle artifacts; it requires providing fine-grained, persuasive evidence that can convince auditors and end-users alike. To address this critical gap, we introduce DAVID-X, the first dataset to pair AI-generated videos with detailed defect-level, temporal-spatial annotations and written rationales. Leveraging these rich annotations, we present DAVID-XR1, a video-language model designed to deliver an interpretable chain of visual reasoning, including defect categorization, temporal-spatial localization, and natural language explanations. This approach fundamentally transforms AI-generated video detection from an opaque black-box decision into a transparent and verifiable diagnostic process. We demonstrate that a general-purpose backbone, fine-tuned on our compact dataset and enhanced with chain-of-thought distillation, achieves strong generalization across a variety of generators and generation modes. Our results highlight the promise of explainable detection methods for trustworthy identification of AI-generated video content.

[58] Recent Advances in Multi-Agent Human Trajectory Prediction: A Comprehensive Review

Céline Finet,Stephane Da Silva Martins,Jean-Bernard Hayet,Ioannis Karamouzas,Javad Amirian,Sylvie Le Hégarat-Mascle,Julien Pettré,Emanuel Aldea

Main category: cs.CV

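TL;DR: This paper surveys recent advances in deep learning-based multi-agent trajectory prediction published between 2020 and 2024. It examines model architectures, input representations, and prediction strategies, with an emphasis on models evaluated on the ETH/UCY benchmark, and identifies key challenges and future research directions in the field.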

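Details Motivation: With the emergence of data-driven methods in human trajectory prediction (HTP), a finer understanding of multi-agent interactions is within reach, with important implications for areas such as autonomous navigation and crowd modeling. Method: Existing deep learning methods are categorized by architectural design, input representation, and overall prediction strategy, with particular attention to models evaluated on the ETH/UCY benchmark. Result: The survey summarizes the current state of the art in multi-agent trajectory prediction and identifies key challenges and future research directions. Conclusion: Despite significant progress, many challenges remain in multi-agent human trajectory prediction, and further research is needed to improve prediction accuracy and model generalization. Abstract: With the emergence of powerful data-driven methods in human trajectory prediction (HTP), gaining a finer understanding of multi-agent interactions lies within hand's reach, with important implications in areas such as autonomous navigation and crowd modeling. This survey reviews some of the most recent advancements in deep learning-based multi-agent trajectory prediction, focusing on studies published between 2020 and 2024. We categorize the existing methods based on their architectural design, their input representations, and their overall prediction strategies, placing a particular emphasis on models evaluated using the ETH/UCY benchmark. Furthermore, we highlight key challenges and future research directions in the field of multi-agent HTP.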

[59] ArchShapeNet: An Interpretable 3D-CNN Framework for Evaluating Architectural Shapes

Jun Yin,Jing Zhong,Pengyu Zeng,Peilin Li,Zixuan Dai,Miao Zhang,Shuai Lu

Main category: cs.CV

TL;DR: In contemporary architectural design, generative plugin tools are essential for producing initial concepts and exploring novel 3D forms. To analyze the differences between human-designed and machine-generated 3D forms, we built the ArchForms-4000 dataset, proposed the ArchShapeNet model, and conducted comparative experiments. This study highlights the advantages of human-designed forms and provides insights for enhancing generative design tools in the future.

Details Motivation: The growing complexity and diversity of design demands have made generative plugin tools essential in contemporary architectural design. However, analyzing the differences between human-designed and machine-generated 3D forms remains a challenge. Method: Built the ArchForms-4000 dataset containing 2,000 architect-designed and 2,000 Evomass-generated 3D forms; proposed ArchShapeNet, a 3D convolutional neural network tailored for classifying and analyzing architectural forms; conducted comparative experiments. Result: Our model outperforms human experts in distinguishing form origins, achieving 94.29% accuracy, 96.2% precision, and 98.51% recall. Conclusion: This study highlights the distinctive advantages of human-designed forms in spatial organization, proportional harmony, and detail refinement and provides valuable insights for enhancing generative design tools in the future. Abstract: In contemporary architectural design, the growing complexity and diversity of design demands have made generative plugin tools essential for quickly producing initial concepts and exploring novel 3D forms. However, objectively analyzing the differences between human-designed and machine-generated 3D forms remains a challenge, limiting our understanding of their respective strengths and hindering the advancement of generative tools. To address this, we built ArchForms-4000, a dataset containing 2,000 architect-designed and 2,000 Evomass-generated 3D forms; proposed ArchShapeNet, a 3D convolutional neural network tailored for classifying and analyzing architectural forms, incorporating a saliency module to highlight key spatial features aligned with architectural reasoning; and conducted comparative experiments showing that our model outperforms human experts in distinguishing form origins, achieving 94.29% accuracy, 96.2% precision, and 98.51% recall.
This study not only highlights the distinctive advantages of human-designed forms in spatial organization, proportional harmony, and detail refinement but also provides valuable insights for enhancing generative design tools in the future.

[60] Real-Time, Low-Latency Surveillance Using Entropy-Based Adaptive Buffering and MobileNetV2 on Edge Devices

Poojashree Chandrashekar,Pankaj M Sajjanar

Main category: cs.CV

TL;DR: This paper presents a high-performance, low-latency video surveillance system for resource-constrained environments. It combines an entropy-based adaptive frame buffering algorithm with MobileNetV2 to achieve high throughput and accuracy.

Details Motivation: To develop a video surveillance system that operates efficiently in resource-constrained environments without compromising on performance or accuracy. Method: The system integrates an entropy-based adaptive frame buffering algorithm with MobileNetV2. This combination allows the system to process live video streams with sub-50ms end-to-end inference latency on devices like Raspberry Pi, Amazon, and NVIDIA Jetson Nano. Result: The system maintains over 92% detection accuracy on standard datasets related to video surveillance while showing robustness to varying conditions. Comparative and ablation experiments validate the design's effectiveness. Conclusion: The proposed architecture is scalable, cost-effective, and adheres to stricter data privacy regulations, making it suitable for smart city or embedded security architectures. Abstract: This paper describes a high-performance, low-latency video surveillance system designed for resource-constrained environments. We propose a formal entropy-based adaptive frame buffering algorithm and integrate it with MobileNetV2 to achieve high throughput with low latency. The system is capable of processing live streams of video with sub-50ms end-to-end inference latency on resource-constrained devices (embedded platforms) such as Raspberry Pi, Amazon, and NVIDIA Jetson Nano. Our method maintains over 92% detection accuracy on standard datasets focused on video surveillance and exhibits robustness to varying lighting, backgrounds, and speeds. A number of comparative and ablation experiments validate the effectiveness of our design. Finally, our architecture is scalable, inexpensive, and compliant with stricter data privacy regulations than common surveillance systems, so that the system could coexist in a smart city or embedded security architecture.
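The abstract above does not spell out the buffering algorithm, so the following is only one plausible reading of "entropy-based adaptive frame buffering": compute the Shannon entropy of each frame's intensity histogram and forward a frame to the detector only when its entropy has moved enough since the last forwarded frame. The function names, the `min_delta` threshold, and the toy frames are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Entropy in bits of the intensity histogram of a non-empty flat pixel list."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_frames(frames, min_delta):
    """Return indices of frames whose entropy differs from the last kept
    frame by at least min_delta; the first frame is always kept."""
    kept = []
    last_entropy = None
    for index, frame in enumerate(frames):
        entropy = shannon_entropy(frame)
        if last_entropy is None or abs(entropy - last_entropy) >= min_delta:
            kept.append(index)
            last_entropy = entropy
    return kept


# Toy 8-pixel "frames": static, static, two-tone, two-tone, fully varied.
frames = [
    [0] * 8,
    [0] * 8,
    [0, 0, 0, 0, 255, 255, 255, 255],
    [0, 0, 0, 0, 255, 255, 255, 255],
    [0, 1, 2, 3, 4, 5, 6, 7],
]
print(select_frames(frames, min_delta=0.5))
```

Skipping near-duplicate frames this way trades a small risk of missed events for fewer MobileNetV2 inferences, which is the kind of trade-off a latency-bound edge device has to make.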

[61] MonoVQD: Monocular 3D Object Detection with Variational Query Denoising and Self-Distillation

Kiet Dang Vu,Trung Thai Tran,Duc Dung Nguyen

Main category: cs.CV

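TL;DR: MonoVQD is an improved DETR architecture for monocular 3D detection. It addresses the limitations of existing methods by introducing a Mask Separated Self-Attention mechanism, a Variational Query Denoising technique, and a self-distillation strategy, and it significantly improves performance on the KITTI benchmark and the nuScenes dataset.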

Details Motivation: Although DETR-like architectures have strong potential, applying them directly to monocular 3D detection runs into inherent limitations that hurt performance. MonoVQD is proposed as a new framework to overcome these limitations and improve detection accuracy. Method: 1. A Mask Separated Self-Attention mechanism integrates the denoising process into the DETR architecture and improves the stability of Hungarian matching. 2. A Variational Query Denoising technique addresses the gradient vanishing problem of conventional denoising methods and improves denoising efficiency by introducing stochastic properties. 3. A sophisticated self-distillation strategy uses knowledge from later decoder layers to improve query quality in earlier layers, strengthening the iterative refinement process. Result: MonoVQD performs strongly on the KITTI monocular benchmark, and its core components integrate seamlessly into other architectures, delivering significant gains in multi-view 3D detection on the nuScenes dataset and demonstrating broad applicability and strong generalization. Conclusion: Through its three proposed techniques, MonoVQD effectively addresses the limitations of DETR architectures in monocular 3D detection, improving detection performance and generalization across a range of 3D detection scenarios. Abstract: Precisely localizing 3D objects from a single image constitutes a central challenge in monocular 3D detection. While DETR-like architectures offer a powerful paradigm, their direct application in this domain encounters inherent limitations, preventing optimal performance. Our work addresses these challenges by introducing MonoVQD, a novel framework designed to fundamentally advance DETR-based monocular 3D detection. We propose three main contributions. First, we propose the Mask Separated Self-Attention mechanism that enables the integration of the denoising process into a DETR architecture. This improves the stability of Hungarian matching to achieve a consistent optimization objective. Second, we present the Variational Query Denoising technique to address the gradient vanishing problem of conventional denoising methods, which severely restricts the efficiency of the denoising process. This explicitly introduces stochastic properties to mitigate this fundamental limitation and unlock substantial performance gains. Finally, we introduce a sophisticated self-distillation strategy, leveraging insights from later decoder layers to synergistically improve query quality in earlier layers, thereby amplifying the iterative refinement process. Rigorous experimentation demonstrates that MonoVQD achieves superior performance on the challenging KITTI monocular benchmark.
Highlighting its broad applicability, MonoVQD's core components seamlessly integrate into other architectures, delivering significant performance gains even in multi-view 3D detection scenarios on the nuScenes dataset and underscoring its robust generalization capabilities.

[62] Improved Iterative Refinement for Chart-to-Code Generation via Structured Instruction

Chengzhi Xu,Yuyang Wang,Lai Wei,Lichao Sun,Weiran Huang

Main category: cs.CV

TL;DR: The paper proposes ChartIR, an iterative refinement method for chart-to-code generation that uses structured instructions to improve MLLM performance.

Details Motivation: Multimodal large language models (MLLMs) have suboptimal performance in chart-to-code generation which requires both precise visual understanding and accurate translation of visual elements into structured code. Method: ChartIR distinguishes two tasks: visual understanding and code translation. It designs two types of structured instructions - description and difference - for the visual understanding component. The overall chart generation pipeline is decomposed into initial code generation and iterative refinement stages. Result: Experimental results indicate superior performance of ChartIR on both open-source model Qwen2-VL and closed-source model GPT-4o compared to other methods. Conclusion: ChartIR, with its structured instructions and iterative refinement approach, effectively enhances MLLMs' capabilities in chart-to-code generation. Abstract: Recently, multimodal large language models (MLLMs) have attracted increasing research attention due to their powerful visual understanding capabilities. While they have achieved impressive results on various vision tasks, their performance on chart-to-code generation remains suboptimal. This task requires MLLMs to generate executable code that can reproduce a given chart, demanding not only precise visual understanding but also accurate translation of visual elements into structured code. Directly prompting MLLMs to perform this complex task often yields unsatisfactory results. To address this challenge, we propose ChartIR, an iterative refinement method based on structured instruction. First, we distinguish two tasks: visual understanding and code translation. To accomplish the visual understanding component, we design two types of structured instructions: description and difference. The description instruction captures the visual elements of the reference chart, while the difference instruction characterizes the discrepancies between the reference chart and the generated chart.
These instructions effectively transform visual features into language representations, thereby facilitating the subsequent code translation process. Second, we decompose the overall chart generation pipeline into two stages: initial code generation and iterative refinement, enabling progressive enhancement of the final output. Experimental results show that, compared to other methods, our method achieves superior performance on both the open-source model Qwen2-VL and the closed-source model GPT-4o.

[63] PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers

Lukas Schiesser,Cornelius Wolff,Sophie Haas,Simon Pukrop

Main category: cs.CV

TL;DR: PictSure is an ICL framework that focuses on the embedding model to improve few-shot image classification performance.

Details Motivation: Building effective image classification models in data-scarce domains is challenging due to the impracticality of collecting large labeled datasets. In-context learning shows promise for few-shot image classification, but the role of image embeddings has been largely overlooked. Method: PictSure systematically investigates the impact of visual encoder types, pretraining objectives, and fine-tuning strategies on downstream few-shot image classification performance by placing the embedding model at the center of analysis. Result: Experiments demonstrate that training success and out-of-domain performance are highly dependent on the way embedding models are pretrained. PictSure outperforms existing models on out-of-domain benchmarks while maintaining competitive results on in-domain tasks. Conclusion: PictSure highlights the importance of image embeddings in in-context learning for few-shot image classification and provides insights into optimizing embedding models for improved performance. Abstract: Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) has emerged as a promising paradigm for few-shot image classification (FSIC), enabling models to generalize across domains without gradient-based adaptation. However, prior work has largely overlooked a critical component of ICL-based FSIC pipelines: the role of image embeddings. In this work, we present PictSure, an ICL framework that places the embedding model -- its architecture, pretraining, and training dynamics -- at the center of analysis. We systematically examine the effects of different visual encoder types, pretraining objectives, and fine-tuning strategies on downstream FSIC performance. Our experiments show that the training success and the out-of-domain performance are highly dependent on how the embedding models are pretrained. 
Consequently, PictSure manages to outperform existing ICL-based FSIC models on out-of-domain benchmarks that differ significantly from the training distribution, while maintaining comparable results on in-domain tasks. Code can be found at https://github.com/PictSure/pictsure-library.

[64] Finding Optimal Kernel Size and Dimension in Convolutional Neural Networks An Architecture Optimization Approach

Shreyas Rajeev,B Sathish Babu

Main category: cs.CV

TL;DR: This paper proposes BKSEF, a framework for optimal kernel size selection in CNNs that balances information gain, computational efficiency, and accuracy. It shows significant improvements in accuracy and reductions in computational cost across various datasets and real-world applications.

Details Motivation: Kernel size selection in CNNs impacts multiple critical factors such as receptive field, feature extraction, computational cost, and model accuracy, yet it is often treated as a fixed heuristic rather than an optimizable parameter. Method: The Best Kernel Size Estimation Function (BKSEF) integrates principles from information theory, signal processing, and learning theory to determine the optimal kernel size layer-wise in CNNs. Result: Experiments show that BKSEF-guided architectures achieve up to 3.1% accuracy improvement and 42.8% reduction in FLOPs compared to traditional models using uniform 3x3 kernels. Real-world case studies further validate these improvements in medical image classification and traffic sign recognition. Conclusion: BKSEF demonstrates that kernel size can be an active, optimizable parameter rather than a fixed heuristic, providing practical heuristics and theoretical support for efficient and application-aware CNN designs. Abstract: Kernel size selection in Convolutional Neural Networks (CNNs) is a critical but often overlooked design decision that affects receptive field, feature extraction, computational cost, and model accuracy. This paper proposes the Best Kernel Size Estimation Function (BKSEF), a mathematically grounded and empirically validated framework for optimal, layer-wise kernel size determination. BKSEF balances information gain, computational efficiency, and accuracy improvements by integrating principles from information theory, signal processing, and learning theory. Extensive experiments on CIFAR-10, CIFAR-100, ImageNet-lite, ChestX-ray14, and GTSRB datasets demonstrate that BKSEF-guided architectures achieve up to 3.1 percent accuracy improvement and 42.8 percent reduction in FLOPs compared to traditional models using uniform 3x3 kernels. 
Two real-world case studies further validate the approach: one for medical image classification in a cloud-based setup, and another for traffic sign recognition on edge devices. The former achieved enhanced interpretability and accuracy, while the latter reduced latency and model size significantly, with minimal accuracy trade-off. These results show that kernel size can be an active, optimizable parameter rather than a fixed heuristic. BKSEF provides practical heuristics and theoretical support for researchers and developers seeking efficient and application-aware CNN designs. It is suitable for integration into neural architecture search pipelines and real-time systems, offering a new perspective on CNN optimization.

[65] Efficient Retail Video Annotation: A Robust Key Frame Generation Approach for Product and Customer Interaction Analysis

Varun Mannam,Zhenyu Shi

Main category: cs.CV

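TL;DR: Accurate video annotation plays a vital role in modern retail applications. To address the time and cost of conventional annotation, we propose a deep learning-based approach that automatically identifies key frames in retail videos and annotates products and customers. The method learns discriminative features by embedding video frames and incorporating object detection techniques tailored for retail environments. Experiments show that it achieves accuracy comparable to human annotation while significantly improving efficiency and delivering roughly 2 times cost savings.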

Details Motivation: Conventional video annotation relies on time-consuming manual labeling, which leads to non-robust frame selection and higher operational costs. A more efficient, automated method is therefore needed for annotating retail videos. Method: The approach uses deep neural networks to learn discriminative features by embedding video frames and incorporating object detection techniques tailored for retail environments, enabling automatic key-frame identification in retail videos and automatic annotation of products and customers. Result: Experiments show that the method outperforms traditional approaches, with accuracy comparable to human annotation and an average of 2 times cost savings in video annotation. Human annotators verify or adjust less than 5% of detected frames, while the remaining frames are annotated fully automatically without reducing annotation quality. Conclusion: The proposed deep learning approach performs well on retail video annotation, substantially reduces the time and effort required, and is of practical value for retail applications such as shopper journey analysis, product interaction detection, and in-store security monitoring. Abstract: Accurate video annotation plays a vital role in modern retail applications, including customer behavior analysis, product interaction detection, and in-store activity recognition. However, conventional annotation methods heavily rely on time-consuming manual labeling by human annotators, introducing non-robust frame selection and increasing operational costs. To address these challenges in the retail domain, we propose a deep learning-based approach that automates key-frame identification in retail videos and provides automatic annotations of products and customers. Our method leverages deep neural networks to learn discriminative features by embedding video frames and incorporating object detection-based techniques tailored for retail environments. Experimental results showcase the superiority of our approach over traditional methods, achieving accuracy comparable to human annotator labeling while enhancing the overall efficiency of retail video annotation. Remarkably, our approach leads to an average of 2 times cost savings in video annotation. By allowing human annotators to verify/adjust less than 5% of detected frames in the video dataset, while automating the annotation process for the remaining frames without reducing annotation quality, retailers can significantly reduce operational costs. The automation of key-frame detection enables substantial time and effort savings in retail video labeling tasks, proving highly valuable for diverse retail applications such as shopper journey analysis, product interaction detection, and in-store security monitoring.

[66] Peering into the Unknown: Active View Selection with Neural Uncertainty Maps for 3D Reconstruction

Zhengquan Zhang,Feng Xu,Mengmi Zhang

Main category: cs.CV

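TL;DR: The paper proposes a new active view selection (AVS) method that uses a lightweight neural network, UPNet, to predict uncertainty maps for guiding 3D reconstruction. The method reduces computational overhead and improves efficiency while achieving reconstruction accuracy comparable to baseline methods with fewer viewpoints.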

Details Motivation: Existing 3D reconstruction methods such as NeRF or 3D Gaussian Splatting must learn a radiance field from the current observations and compute uncertainty for every candidate viewpoint, which is computationally expensive and slow. The authors therefore seek a more efficient way to select the most valuable viewpoints for 3D object reconstruction. Method: The approach introduces UPNet, a lightweight feedforward deep neural network that takes a single image of a 3D object and outputs a predicted uncertainty map representing uncertainty values across all possible candidate viewpoints. Using heuristics derived from observing natural objects and their uncertainty patterns, UPNet is trained to learn a direct mapping from viewpoint appearance to uncertainty in the underlying volumetric representation. All previously predicted neural uncertainty maps are then aggregated to suppress redundant candidate viewpoints and select the most informative one. Result: Experiments show that, despite using only half as many viewpoints as the upper bound, the method achieves reconstruction accuracy comparable to the best existing methods. Compared with baseline methods it also greatly reduces computational overhead, with up to a 400 times speedup and over 50% reductions in CPU, RAM, and GPU usage. Notably, the method generalizes to AVS tasks involving novel object categories without additional training. Conclusion: By predicting uncertainty maps with UPNet, the proposed AVS method significantly lowers the computational burden and improves the efficiency of 3D reconstruction. It maintains high reconstruction accuracy with fewer viewpoints and generalizes well to novel object categories. Abstract: Some perspectives naturally provide more information than others. How can an AI system determine which viewpoint offers the most valuable insight for accurate and efficient 3D object reconstruction? Active view selection (AVS) for 3D reconstruction remains a fundamental challenge in computer vision. The aim is to identify the minimal set of views that yields the most accurate 3D reconstruction. Instead of learning radiance fields, like NeRF or 3D Gaussian Splatting, from a current observation and computing uncertainty for each candidate viewpoint, we introduce a novel AVS approach guided by neural uncertainty maps predicted by a lightweight feedforward deep neural network, named UPNet. UPNet takes a single input image of a 3D object and outputs a predicted uncertainty map, representing uncertainty values across all possible candidate viewpoints. By leveraging heuristics derived from observing many natural objects and their associated uncertainty patterns, we train UPNet to learn a direct mapping from viewpoint appearance to uncertainty in the underlying volumetric representations. Next, our approach aggregates all previously predicted neural uncertainty maps to suppress redundant candidate viewpoints and effectively select the most informative one. Using these selected viewpoints, we train 3D neural rendering models and evaluate the quality of novel view synthesis against other competitive AVS methods.
Remarkably, despite using only half as many viewpoints as the upper bound, our method achieves comparable reconstruction accuracy. In addition, it significantly reduces computational overhead during AVS, achieving up to a 400 times speedup along with over 50% reductions in CPU, RAM, and GPU usage compared to baseline methods. Notably, our approach generalizes effectively to AVS tasks involving novel object categories, without requiring any additional training.

[67] DETONATE: A Benchmark for Text-to-Image Alignment and Kernelized Direct Preference Optimization

Renjith Prasad,Abhilekh Borah,Hasnat Md Abdullah,Chathurangi Shyalika,Gurpreet Singh,Ritvik Garimella,Rajarshi Roy,Harshul Surana,Nasrin Imanpour,Suranjana Trivedy,Amit Sheth,Amitava Das

Main category: cs.CV

TL;DR: DPO-Kernels for T2I models is introduced, enhancing alignment in three dimensions: Hybrid Loss, Kernelized Representations, and Divergence Selection. DETONATE, a large-scale benchmark, and AQI are proposed to evaluate and reveal vulnerabilities in T2I models.

Details Motivation: To ensure that generated images faithfully capture user intent while maintaining safety and fairness. Method: 1) Hybrid Loss integrating embedding-based objectives with traditional probability-based loss; 2) Kernelized Representations using RBF, Polynomial, and Wavelet kernels; 3) Divergence Selection incorporating Wasserstein and Rényi divergences beyond the default KL regularizer. Result: Empirical results show that DPO-Kernels maintain strong generalization bounds via Heavy-Tailed Self-Regularization (HT-SR). Conclusion: DPO-Kernels for T2I models enhance alignment across multiple dimensions, and DETONATE serves as a valuable benchmark revealing hidden vulnerabilities. Abstract: Alignment is crucial for text-to-image (T2I) models to ensure that generated images faithfully capture user intent while maintaining safety and fairness. Direct Preference Optimization (DPO), prominent in large language models (LLMs), is extending its influence to T2I systems. This paper introduces DPO-Kernels for T2I models, a novel extension enhancing alignment across three dimensions: (i) Hybrid Loss, integrating embedding-based objectives with traditional probability-based loss for improved optimization; (ii) Kernelized Representations, employing Radial Basis Function (RBF), Polynomial, and Wavelet kernels for richer feature transformations and better separation between safe and unsafe inputs; and (iii) Divergence Selection, expanding beyond DPO's default Kullback-Leibler (KL) regularizer by incorporating Wasserstein and Rényi divergences for enhanced stability and robustness. We introduce DETONATE, the first large-scale benchmark of its kind, comprising approximately 100K curated image pairs categorized as chosen and rejected. DETONATE encapsulates three axes of social bias and discrimination: Race, Gender, and Disability.
Prompts are sourced from hate speech datasets, with images generated by leading T2I models including Stable Diffusion 3.5 Large, Stable Diffusion XL, and Midjourney. Additionally, we propose the Alignment Quality Index (AQI), a novel geometric measure quantifying latent-space separability of safe/unsafe image activations, revealing hidden vulnerabilities. Empirically, we demonstrate that DPO-Kernels maintain strong generalization bounds via Heavy-Tailed Self-Regularization (HT-SR). DETONATE and complete code are publicly released.
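For readers unfamiliar with the kernels and divergences named in the DETONATE abstract, their textbook definitions are given below. The symbols σ, c, d, and α are the usual generic parameters and are not taken from the paper, whose exact parameterization the abstract does not state; the wavelet kernel and the Wasserstein distance are omitted because their forms depend on choices the abstract leaves open.

```latex
% Standard kernels on feature vectors x, y
\[
k_{\mathrm{RBF}}(x, y) = \exp\!\left(-\frac{\lVert x - y \rVert^{2}}{2\sigma^{2}}\right),
\qquad
k_{\mathrm{poly}}(x, y) = \left(x^{\top} y + c\right)^{d}
\]

% Divergences between discrete distributions P = (p_i) and Q = (q_i)
\[
D_{\mathrm{KL}}(P \,\Vert\, Q) = \sum_{i} p_i \log \frac{p_i}{q_i},
\qquad
D_{\alpha}(P \,\Vert\, Q) = \frac{1}{\alpha - 1} \log \sum_{i} p_i^{\alpha}\, q_i^{\,1-\alpha},
\quad \alpha > 0,\ \alpha \neq 1
\]
```

The Rényi divergence recovers the KL divergence in the limit α → 1, so it can be read as a one-parameter family that contains DPO's default regularizer as a special case.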

[68] PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning

Yizhen Zhang,Yang Ding,Shuoshuo Zhang,Xinchen Zhang,Haoling Li,Zhong-zhi Li,Peijie Wang,Jie Wu,Lei Ji,Yelong Shen,Yujiu Yang,Yeyun Gong

Main category: cs.CV

TL;DR: The paper proposes PeRL, a reinforcement learning approach designed for multimodal reasoning tasks, which includes permutation of image sequences and rollout filtering mechanism. It outperforms existing models on multi-image benchmarks while maintaining performance on single-image tasks.

Details Motivation: Existing multimodal reinforcement learning approaches are limited to spatial reasoning within single-image contexts and struggle with more complex scenarios involving multi-image positional reasoning. Method: The method involves proposing PeRL, a general reinforcement learning approach tailored for interleaved multimodal tasks, and a multi-stage strategy enhancing the exploration-exploitation trade-off. This includes permutation of image sequences for diversity and a rollout filtering mechanism for effective policy exploitation. Result: PeRL trained model surpasses R1-related and interleaved VLM baselines by a large margin, achieving state-of-the-art performance on multi-image benchmarks while preserving comparable performance on single-image tasks. Conclusion: PeRL is an effective approach for improving multimodal reasoning tasks, especially in multi-image scenarios, demonstrating superior performance compared to existing models. Abstract: Inspired by the impressive reasoning capabilities demonstrated by reinforcement learning approaches like DeepSeek-R1, recent emerging research has begun exploring the use of reinforcement learning (RL) to enhance vision-language models (VLMs) for multimodal reasoning tasks. However, most existing multimodal reinforcement learning approaches remain limited to spatial reasoning within single-image contexts, yet still struggle to generalize to more complex and real-world scenarios involving multi-image positional reasoning, where understanding the relationships across images is crucial. To address this challenge, we propose a general reinforcement learning approach PeRL tailored for interleaved multimodal tasks, and a multi-stage strategy designed to enhance the exploration-exploitation trade-off, thereby improving learning efficiency and task performance. Specifically, we introduce permutation of image sequences to simulate varied positional relationships to explore more spatial and positional diversity. 
Furthermore, we design a rollout filtering mechanism for resampling to focus on trajectories that contribute most to learning optimal behaviors to exploit learned policies effectively. We evaluate our model on 5 widely-used multi-image benchmarks and 3 single-image benchmarks. Our experiments confirm that PeRL trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin, achieving state-of-the-art performance on multi-image benchmarks, while preserving comparable performance on single-image tasks.
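
The two ingredients named in the abstract, image-sequence permutation and rollout filtering, can be illustrated with a minimal sketch. The function names, the integer `answer_index` convention, and the "keep only prompts with mixed rewards" filter criterion are assumptions made for illustration; the abstract does not specify how PeRL implements either step.

```python
import random

def permute_images(images, answer_index, rng):
    """Shuffle the image order and remap a positional answer to the new order.

    `answer_index` is a hypothetical label: the position of the image that
    answers the question before shuffling.
    """
    order = list(range(len(images)))
    rng.shuffle(order)
    permuted = [images[i] for i in order]
    # Find where the originally correct image ended up after the shuffle.
    new_answer = order.index(answer_index)
    return permuted, new_answer

def filter_rollouts(reward_groups):
    """Keep prompt ids whose sampled rollouts disagree in reward.

    Assumed criterion: a group whose rollouts all receive the same reward
    carries no relative learning signal, so it is dropped before resampling.
    """
    return [pid for pid, rewards in reward_groups.items() if len(set(rewards)) > 1]
```

For example, `permute_images(["a", "b", "c"], 1, random.Random(0))` returns the shuffled list together with the new position of `"b"`, so a positional answer stays consistent with the permuted input.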

[69] Frequency-Calibrated Membership Inference Attacks on Medical Image Diffusion Models

Xinkai Zhao,Yuta Tokuoka,Junichiro Iwasawa,Keita Oda

Main category: cs.CV

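TL;DR: The paper proposes FCRE, a new method for membership inference attacks (MIA) on medical image diffusion models that focuses on mid-frequency reconstruction error to improve the accuracy of privacy risk assessment.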

Details Motivation: As diffusion models are widely adopted for image generation, particularly in sensitive domains such as medical imaging, privacy concerns have become increasingly prominent. Existing MIA methods have limitations, especially on medical images, where reconstruction error is affected by inherent image difficulty and by the difficulty of reconstructing high-frequency detail. Method: The authors propose Frequency-Calibrated Reconstruction Error (FCRE), which focuses on reconstruction error within the mid-frequency range and excludes high-frequency regions that are hard to reconstruct and low-frequency regions that carry little information. The procedure analyzes the reverse diffusion process, obtains the mid-frequency reconstruction error, and computes the structural similarity index score between the reconstructed and original images. Membership is then determined by comparing this score to a threshold. Result: Experiments show that FCRE significantly outperforms existing MIA methods on several medical image datasets. Conclusion: FCRE effectively mitigates the influence of inherent image difficulty on membership inference attacks and improves privacy risk assessment for medical image diffusion models. Abstract: The increasing use of diffusion models for image generation, especially in sensitive areas like medical imaging, has raised significant privacy concerns. Membership Inference Attack (MIA) has emerged as a potential approach to determine if a specific image was used to train a diffusion model, thus quantifying privacy risks. Existing MIA methods often rely on diffusion reconstruction errors, where member images are expected to have lower reconstruction errors than non-member images. However, applying these methods directly to medical images faces challenges. Reconstruction error is influenced by inherent image difficulty, and diffusion models struggle with high-frequency detail reconstruction. To address these issues, we propose a Frequency-Calibrated Reconstruction Error (FCRE) method for MIAs on medical image diffusion models. By focusing on reconstruction errors within a specific mid-frequency range and excluding both high-frequency (difficult to reconstruct) and low-frequency (less informative) regions, our frequency-selective approach mitigates the confounding factor of inherent image difficulty. Specifically, we analyze the reverse diffusion process, obtain the mid-frequency reconstruction error, and compute the structural similarity index score between the reconstructed and original images. Membership is determined by comparing this score to a threshold. Experiments on several medical image datasets demonstrate that our FCRE method outperforms existing MIA methods.
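
The frequency-selective idea can be sketched as a band-pass filter applied to the reconstruction residual. This is an illustration under stated assumptions, not the authors' pipeline: the band limits `low` and `high`, the function names, and the use of mean squared error in place of the structural similarity score mentioned in the abstract are all choices made here to keep the example self-contained.

```python
import numpy as np

def midband_mask(shape, low, high):
    """Boolean mask keeping radial frequencies in [low, high), in cycles per pixel."""
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return (radius >= low) & (radius < high)

def midfreq_error(original, reconstructed, low=0.05, high=0.25):
    """Mean squared residual restricted to the mid-frequency band.

    `low` and `high` are hypothetical band limits, not values from the paper.
    """
    mask = midband_mask(original.shape, low, high)
    residual_spectrum = np.fft.fft2(original - reconstructed)
    # Zero out low and high frequencies, then return to the image domain.
    filtered = np.fft.ifft2(residual_spectrum * mask).real
    return float(np.mean(filtered ** 2))

def predict_member(error, threshold):
    """Assumed decision rule: a small mid-band error suggests a training member."""
    return error < threshold
```

With this mask, a uniform brightness offset between the original and its reconstruction contributes nothing to the error because it lies entirely in the excluded zero-frequency component, which is one way low-frequency content can be kept from dominating the score.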

[70] Vision Transformers for End-to-End Quark-Gluon Jet Classification from Calorimeter Images

Md Abrar Jahin,Shahriar Soudeep,Arian Rahman Aditta,M. F. Mridha,Nafiz Fahad,Md. Jakir Hossen

Main category: cs.CV

TL;DR: This paper explores the use of Vision Transformer (ViT) architectures for distinguishing between quark and gluon-initiated jets using calorimeter images from simulated CMS Open Data, showing improved performance over traditional CNN methods.

Details Motivation: To improve the accuracy of quark-gluon jet classification, which is essential for new physics searches and precision measurements in high-energy physics, by leveraging the global contextual information modeling capabilities of ViTs. Method: Systematic evaluation of ViT and ViT-CNN hybrid models on multi-channel jet-view images constructed from detector-level energy deposits and reconstructed tracks. Comparison against established CNN baselines using metrics like F1-score, ROC-AUC, and accuracy. Result: ViT-based models, particularly hybrids like ViT+MaxViT and ViT+ConvNeXt, outperform traditional CNNs in terms of F1-score, ROC-AUC, and accuracy, demonstrating the advantage of capturing long-range spatial correlations within jet substructure. Conclusion: The study establishes a robust framework for applying ViT architectures to calorimeter image-based jet classification, providing strong performance baselines and a structured dataset for further deep learning research in this field. Abstract: Distinguishing between quark- and gluon-initiated jets is a critical and challenging task in high-energy physics, pivotal for improving new physics searches and precision measurements at the Large Hadron Collider. While deep learning, particularly Convolutional Neural Networks (CNNs), has advanced jet tagging using image-based representations, the potential of Vision Transformer (ViT) architectures, renowned for modeling global contextual information, remains largely underexplored for direct calorimeter image analysis, especially under realistic detector and pileup conditions. This paper presents a systematic evaluation of ViTs and ViT-CNN hybrid models for quark-gluon jet classification using simulated 2012 CMS Open Data. We construct multi-channel jet-view images from detector-level energy deposits (ECAL, HCAL) and reconstructed tracks, enabling an end-to-end learning approach. 
Our comprehensive benchmarking demonstrates that ViT-based models, notably ViT+MaxViT and ViT+ConvNeXt hybrids, consistently outperform established CNN baselines in F1-score, ROC-AUC, and accuracy, highlighting the advantage of capturing long-range spatial correlations within jet substructure. This work establishes the first systematic framework and robust performance baselines for applying ViT architectures to calorimeter image-based jet classification using public collider data, alongside a structured dataset suitable for further deep learning research in this domain.

[71] Advances in Compliance Detection: Novel Models Using Vision-Based Tactile Sensors

Ziteng Li,Malte Kuhlmann,Ilana Nisky,Nicolás Navarro-Guerrero

Main category: cs.CV

TL;DR: The paper proposes two models based on LRCNs and Transformer architectures for predicting compliance metrics using RGB tactile images from GelSight sensors, showing significant improvement over the baseline.

Details Motivation: Traditional compliance detection methods have limitations such as lack of portability and scalability, reliance on expensive equipment, and unsuitability for robotic applications. Neural network-based approaches using vision-based tactile sensors face challenges in achieving sufficient prediction accuracy. Method: Two models based on Long-term Recurrent Convolutional Networks (LRCNs) and Transformer architectures are developed to leverage RGB tactile images and other information captured by the GelSight sensor for accurate prediction of compliance metrics. Result: The proposed models show significant performance improvement over the baseline when validated with multiple metrics. The investigation also revealed that estimating compliance for objects harder than the sensor is more challenging. Conclusion: The models effectively estimate compliance metrics, demonstrating substantial improvements over existing methods. The study highlights the difficulty in estimating compliance for objects harder than the sensor. Abstract: Compliance is a critical parameter for describing objects in engineering, agriculture, and biomedical applications. Traditional compliance detection methods are limited by their lack of portability and scalability, rely on specialized, often expensive equipment, and are unsuitable for robotic applications. Moreover, existing neural network-based approaches using vision-based tactile sensors still suffer from insufficient prediction accuracy. In this paper, we propose two models based on Long-term Recurrent Convolutional Networks (LRCNs) and Transformer architectures that leverage RGB tactile images and other information captured by the vision-based sensor GelSight to predict compliance metrics accurately. We validate the performance of these models using multiple metrics and demonstrate their effectiveness in accurately estimating compliance. The proposed models exhibit significant performance improvement over the baseline. 
Additionally, we investigated the correlation between sensor compliance and object compliance estimation, which revealed that objects that are harder than the sensor are more challenging to estimate.

[72] Hyper-Local Deformable Transformers for Text Spotting on Historical Maps

Yijun Lin,Yao-Yi Chiang

Main category: cs.CV

TL;DR: This paper proposes PALETTE, an end-to-end text spotter for scanned historical maps with a novel hyper-local sampling module and SynthMap+, a method to generate synthetic map images for training. Experiments show that PALETTE outperforms SOTA text spotters on historical map datasets.

Details Motivation: Text extraction from historical maps is challenging due to the lack of effective methods and training data. Previous approaches are tailored to specific map styles and recent machine learning-based text spotters face challenges in extracting precise image features for predicting every sub-component in a text instance. Method: PALETTE introduces a novel hyper-local sampling module to learn localized image features around the target boundary points and characters of a text instance for detection and recognition. It also enables hyper-local positional embeddings to learn spatial interactions between boundary points and characters within and across text instances. Additionally, SynthMap+ is presented as a novel approach to automatically generate synthetic map images for training text spotters for historical maps. Result: Experiments show that PALETTE with SynthMap+ outperforms SOTA text spotters on two new benchmark datasets of historical maps, particularly for long and angled text. The system has been deployed to process over 60,000 maps in the David Rumsey Historical Map collection and generated over 100 million text labels. Conclusion: PALETTE, with its innovative modules and SynthMap+, provides an effective solution for text extraction from historical maps, outperforming existing methods and supporting large-scale map searching. Abstract: Text on historical maps contains valuable information providing georeferenced historical, political, and cultural contexts. However, text extraction from historical maps is challenging due to the lack of (1) effective methods and (2) training data. Previous approaches use ad-hoc steps tailored to only specific map styles. Recent machine learning-based text spotters (e.g., for scene images) have the potential to solve these challenges because of their flexibility in supporting various types of text instances. 
However, these methods still face challenges in extracting precise image features for predicting every sub-component (boundary points and characters) in a text instance. This is critical because map text can be lengthy and highly rotated with complex backgrounds, posing difficulties in detecting relevant image features from a rough text region. This paper proposes PALETTE, an end-to-end text spotter for a wide variety of scanned historical maps. PALETTE introduces a novel hyper-local sampling module to explicitly learn localized image features around the target boundary points and characters of a text instance for detection and recognition. PALETTE also enables hyper-local positional embeddings to learn spatial interactions between boundary points and characters within and across text instances. In addition, this paper presents a novel approach to automatically generate synthetic map images, SynthMap+, for training text spotters for historical maps. Experiments show that PALETTE with SynthMap+ outperforms SOTA text spotters on two new benchmark datasets of historical maps, particularly for long and angled text. We have deployed PALETTE with SynthMap+ to process over 60,000 maps in the David Rumsey Historical Map collection and generated over 100 million text labels to support map searching. The project is released at https://github.com/kartta-foundation/mapkurator-palette-doc.

[73] Break Stylistic Sophon: Are We Really Meant to Confine the Imagination in Style Transfer?

Gary Song Yan,Yusen Zhang,Jinyu Zhao,Hao Zhang,Zhangping Yang,Guanye Xiong,Yanfei Liu,Tao Zhang,Yujie He,Siyuan Tian,Yao Gou,Min Li

Main category: cs.CV

TL;DR: This paper presents StyleWallfacer, a novel framework for style transfer that introduces semantic-based style injection, human feedback driven data augmentation, and a training-free triple diffusion process. It achieves high-quality artist-level style transfer while preserving original content and enables image color editing during the process.

Details Motivation: To address issues in traditional style transfer methods and unify the framework for different tasks, enabling efficient and high-quality artist-level style transfer and text-driven stylization. Method: 1) Semantic-based style injection using BLIP and CLIP space to generate text descriptions aligned with style image semantics; 2) Data augmentation strategy based on human feedback to incorporate high-quality samples into the training set; 3) Training-free triple diffusion process manipulating self-attention layers features for style injection while maintaining text control. Result: Achieved high-quality image-driven style transfer and text-driven stylization with artist-level results while preserving original image content. Also enabled image color editing during the style transfer process for the first time. Conclusion: StyleWallfacer successfully addresses issues in traditional style transfer methods, unifies different tasks' frameworks, and provides an efficient solution for artist-level style transfer with preserved content and new capabilities like image color editing. Abstract: In this pioneering study, we introduce StyleWallfacer, a groundbreaking unified training and inference framework, which not only addresses various issues encountered in the style transfer process of traditional methods but also unifies the framework for different tasks. This framework is designed to revolutionize the field by enabling artist level style transfer and text driven stylization. First, we propose a semantic-based style injection method that uses BLIP to generate text descriptions strictly aligned with the semantics of the style image in CLIP space. By leveraging a large language model to remove style-related descriptions from these descriptions, we create a semantic gap. This gap is then used to fine-tune the model, enabling efficient and drift-free injection of style knowledge. 
Second, we propose a data augmentation strategy based on human feedback, incorporating high-quality samples generated early in the fine-tuning process into the training set to facilitate progressive learning and significantly reduce its overfitting. Finally, we design a training-free triple diffusion process using the fine-tuned model, which manipulates the features of self-attention layers in a manner similar to the cross-attention mechanism. Specifically, in the generation process, the key and value of the content-related process are replaced with those of the style-related process to inject style while maintaining text control over the model. We also introduce query preservation to mitigate disruptions to the original content. Under such a design, we have achieved high-quality image-driven style transfer and text-driven stylization, delivering artist-level style transfer results while preserving the original image content. Moreover, we achieve image color editing during the style transfer process for the first time.
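
The key/value replacement described in the abstract can be sketched with plain scaled dot-product attention. This shows only the substitution itself; the array shapes, function names, and single-head formulation are assumptions, and the triple diffusion process and query preservation are not modeled.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def style_injected_attention(q_content, k_style, v_style):
    """Content queries attend to keys and values taken from the style branch.

    Because the queries still come from the content branch, the spatial layout
    of the output follows the content, while the attended features are drawn
    from the style branch.
    """
    return attention(q_content, k_style, v_style)
```

Each output row is a convex combination of the rows of `v_style`, which is why the result carries style features while keeping one output token per content query.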

[74] Enhancing Vector Quantization with Distributional Matching: A Theoretical and Empirical Study

Xianghong Fang,Litao Guo,Hengchao Chen,Yuxuan Zhang,Xiaofan Xia,Dingjie Song,Yexin Liu,Hao Wang,Harry Yang,Yuan Yuan,Qiang Sun

Main category: cs.CV

TL;DR: This paper addresses training instability and codebook collapse in vector quantization by using Wasserstein distance to align feature and code vector distributions, achieving high codebook utilization and reduced quantization error.

Details Motivation: Training instability and codebook collapse are significant issues in vector quantization methods due to mismatched distributions of features and code vectors. Method: The authors use the Wasserstein distance to align the distributions of features and code vectors. Result: The method achieves near 100% codebook utilization and significantly reduces quantization error, with both empirical and theoretical support. Conclusion: Using Wasserstein distance effectively solves the problems of training instability and codebook collapse. Abstract: The success of autoregressive models largely depends on the effectiveness of vector quantization, a technique that discretizes continuous features by mapping them to the nearest code vectors within a learnable codebook. Two critical issues in existing vector quantization methods are training instability and codebook collapse. Training instability arises from the gradient discrepancy introduced by the straight-through estimator, especially in the presence of significant quantization errors, while codebook collapse occurs when only a small subset of code vectors are utilized during training. A closer examination of these issues reveals that they are primarily driven by a mismatch between the distributions of the features and code vectors, leading to unrepresentative code vectors and significant data information loss during compression. To address this, we employ the Wasserstein distance to align these two distributions, achieving near 100% codebook utilization and significantly reducing the quantization error. Both empirical and theoretical analyses validate the effectiveness of the proposed approach.
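
To make the quantities in this abstract concrete, the sketch below shows nearest-code quantization, codebook utilization, and one possible distribution-matching term. The abstract does not say which form of the Wasserstein distance is used; the closed form for diagonal Gaussians here is an assumption chosen because it is easy to verify, and all function names are illustrative.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature to its nearest code vector (squared Euclidean distance)."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

def utilization(idx, codebook_size):
    """Fraction of code vectors selected at least once."""
    return np.unique(idx).size / codebook_size

def gaussian_w2_sq(x, y):
    """Squared 2-Wasserstein distance between diagonal-Gaussian fits of x and y.

    For Gaussians with diagonal covariance this equals
    ||mean_x - mean_y||^2 + ||std_x - std_y||^2.
    """
    mx, my = x.mean(axis=0), y.mean(axis=0)
    sx, sy = x.std(axis=0), y.std(axis=0)
    return float(((mx - my) ** 2).sum() + ((sx - sy) ** 2).sum())
```

A low `utilization` value is the symptom the abstract calls codebook collapse, and a term such as `gaussian_w2_sq(features, codebook)` is one way to measure how far the two distributions are from each other.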

[75] SynPo: Boosting Training-Free Few-Shot Medical Segmentation via High-Quality Negative Prompts

Yufei Liu,Haoke Xiao,Jiaxing Chai,Yongcun Zhang,Rong Wang,Zijie Meng,Zhiming Luo

Main category: cs.CV

TL;DR: The paper introduces SynPo, a training-free few-shot method for medical image segmentation using Large Vision Models (LVMs). It enhances negative prompts via a Confidence Map Synergy Module, achieving results comparable to state-of-the-art trained methods.

Details Motivation: Existing training-free methods based on LVMs have difficulty effectively utilizing negative prompts, particularly affecting performance on low-contrast medical images. Method: Proposes SynPo which improves the quality of negative prompts through a Confidence Map Synergy Module that combines DINOv2 and SAM. Positive points are selected based on a confidence map while negative points use a Gaussian distribution, both followed by K-means clustering. These high-quality prompts are then used in SAM for segmentation. Result: SynPo achieves performance comparable to state-of-the-art training-based few-shot methods. Conclusion: SynPo is an effective training-free approach for few-shot medical image segmentation, especially enhancing performance on low-contrast images. Abstract: The advent of Large Vision Models (LVMs) offers new opportunities for few-shot medical image segmentation. However, existing training-free methods based on LVMs fail to effectively utilize negative prompts, leading to poor performance on low-contrast medical images. To address this issue, we propose SynPo, a training-free few-shot method based on LVMs (e.g., SAM), with the core insight: improving the quality of negative prompts. To select point prompts in a more reliable confidence map, we design a novel Confidence Map Synergy Module by combining the strengths of DINOv2 and SAM. Based on the confidence map, we select the top-k pixels as the positive points set and choose the negative points set using a Gaussian distribution, followed by independent K-means clustering for both sets. Then, these selected points are leveraged as high-quality prompts for SAM to get the segmentation results. Extensive experiments demonstrate that SynPo achieves performance comparable to state-of-the-art training-based few-shot methods.
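
The positive-point selection step (top-k pixels followed by K-means) can be sketched as follows. The confidence map is taken as given, the K-means routine is a plain Lloyd's iteration with a deterministic seed choice made for this example, and the Gaussian-based negative-point sampling is omitted because the abstract does not describe it in enough detail.

```python
import numpy as np

def top_k_points(conf_map, k):
    """Return (row, col) coordinates of the k highest-confidence pixels."""
    flat = np.argsort(conf_map, axis=None)[-k:]
    return np.column_stack(np.unravel_index(flat, conf_map.shape))

def kmeans(points, n_clusters, n_iter=20):
    """Plain Lloyd's algorithm; the first n_clusters points seed the centres."""
    points = points.astype(float)
    centres = points[:n_clusters].copy()
    for _ in range(n_iter):
        # Assign every point to its nearest centre.
        d2 = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Move each non-empty centre to the mean of its assigned points.
        for c in range(n_clusters):
            if np.any(labels == c):
                centres[c] = points[labels == c].mean(axis=0)
    return centres
```

The cluster centres returned by `kmeans` would then serve as a compact set of point prompts, so that many neighbouring high-confidence pixels collapse into a few representative locations.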

[76] Enhancing point cloud analysis via neighbor aggregation correction based on cross-stage structure correlation

Jiaqi Shi,Jin Xiao,Xiaoguang Hu,Boyang Song,Hao Jiang,Tianyou Chen,Baochang Zhang

Main category: cs.CV

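TL;DR: The paper proposes PDSA, a new point cloud analysis module that corrects the feature distribution using correlation in high-dimensional space, improving computational efficiency and robustness; it delivers significant performance gains on semantic segmentation and classification at a lower parameter cost.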

Details Motivation: Current point cloud analysis methods suffer from irrelevant-point interference and a feature hierarchy gap when aggregating local structures, while enhancements based on direct geometric structure encoding incur high computational overhead and are sensitive to noise. Method: The authors propose the Point Distribution Set Abstraction (PDSA) module, which uses correlation in high-dimensional space to correct the feature distribution during aggregation. A lightweight cross-stage structural descriptor distinguishes point correlation, the variance of the neighbor feature matrix is reduced to enhance structural homogeneity, and a key point mechanism is introduced to reduce computational overhead. Result: Experiments on semantic segmentation and classification show that the method generalizes well across different baseline models, significantly improving performance while reducing parameter cost; ablation and visualization results further support its effectiveness and rationality. Conclusion: By correcting the feature distribution, the PDSA module improves computational efficiency and robustness, suits semantic segmentation and classification tasks, and has a low parameter cost. Abstract: Point cloud analysis is the cornerstone of many downstream tasks, among which aggregating local structures is the basis for understanding point cloud data. While numerous works aggregate neighbors using three-dimensional relative coordinates, they suffer from irrelevant-point interference and a feature hierarchy gap due to the limitations of local coordinates. Although some works address this limitation by refining the spatial description through explicit modeling of cross-stage structure, these enhancement methods based on direct geometric structure encoding suffer from high computational overhead and noise sensitivity. To overcome these problems, we propose the Point Distribution Set Abstraction module (PDSA), which utilizes correlation in the high-dimensional space to correct the feature distribution during aggregation, improving computational efficiency and robustness. PDSA distinguishes point correlation based on a lightweight cross-stage structural descriptor, and enhances structural homogeneity by reducing the variance of the neighbor feature matrix and increasing class separability through long-distance modeling. Additionally, we introduce a key point mechanism to optimize the computational overhead. Experimental results on semantic segmentation and classification tasks with different baselines verify the generalization of the proposed method and show significant performance improvement with lower parameter cost. The corresponding ablation and visualization results demonstrate the effectiveness and rationality of our method. 
The code and trained weights are available at: https://github.com/AGENT9717/PointDistribution

[77] Echo-DND: A dual noise diffusion model for robust and precise left ventricle segmentation in echocardiography

Abdur Rahman,Keerthiveena Balraj,Manojkumar Ramteke,Anurag Singh Rathore

Main category: cs.CV

TL;DR: Recent advancements in diffusion probabilistic models (DPMs) have been leveraged to develop Echo-DND, a novel dual-noise diffusion model for accurate left ventricle segmentation in echocardiograms. This model uses Gaussian and Bernoulli noises along with multi-scale fusion conditioning and spatial coherence calibration to enhance segmentation precision. Evaluated on CAMUS and EchoNet-Dynamic datasets, it achieves Dice scores of 0.962 and 0.939 respectively, outperforming existing SOTA models.

Details Motivation: Accurate segmentation of the left ventricle in echocardiograms is crucial for medical diagnostics but is complicated by the noisy, low contrast, and ambiguous nature of ultrasound images. Method: Echo-DND is a dual-noise diffusion model that combines Gaussian and Bernoulli noises. It includes a multi-scale fusion conditioning module and employs spatial coherence calibration to maintain spatial integrity in segmentation masks. Result: The model was rigorously validated on the CAMUS and EchoNet-Dynamic datasets, achieving high Dice scores of 0.962 and 0.939 respectively, surpassing the performance of existing state-of-the-art models. Conclusion: Echo-DND establishes a new standard in echocardiogram segmentation and holds promise for broader applicability in other medical imaging tasks, potentially improving diagnostic accuracy across various medical domains. Abstract: Recent advancements in diffusion probabilistic models (DPMs) have revolutionized image processing, demonstrating significant potential in medical applications. Accurate segmentation of the left ventricle (LV) in echocardiograms is crucial for diagnostic procedures and necessary treatments. However, ultrasound images are notoriously noisy with low contrast and ambiguous LV boundaries, thereby complicating the segmentation process. To address these challenges, this paper introduces Echo-DND, a novel dual-noise diffusion model specifically designed for this task. Echo-DND leverages a unique combination of Gaussian and Bernoulli noises. It also incorporates a multi-scale fusion conditioning module to improve segmentation precision. Furthermore, it utilizes spatial coherence calibration to maintain spatial integrity in segmentation masks. The model's performance was rigorously validated on the CAMUS and EchoNet-Dynamic datasets. Extensive evaluations demonstrate that the proposed framework outperforms existing SOTA models. 
It achieves high Dice scores of 0.962 and 0.939 on these datasets, respectively. The proposed Echo-DND model establishes a new standard in echocardiogram segmentation, and its architecture holds promise for broader applicability in other medical imaging tasks, potentially improving diagnostic accuracy across various medical domains. Project page: https://abdur75648.github.io/Echo-DND
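
For readers unfamiliar with the metric behind the reported 0.962 and 0.939, the Dice score of two binary masks is twice their overlap divided by the sum of their sizes. The sketch below is a generic implementation, not the authors' evaluation code; the `eps` smoothing term is an assumption added so that two empty masks score 1 rather than dividing by zero.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice coefficient between two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))
```

A prediction that covers two pixels, one of which lies inside a one-pixel ground-truth mask, scores 2·1 / (2 + 1) ≈ 0.667.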

[78] ReSeDis: A Dataset for Referring-based Object Search across Large-Scale Image Collections

Ziling Huang,Yidan Zhang,Shin'ichi Satoh

Main category: cs.CV

TL;DR: ReSeDis is a new task that combines corpus-level retrieval with pixel-level grounding, providing a realistic testbed for multimodal search systems.

Details Motivation: Current techniques either focus on visual grounding which assumes object presence in every image or text-to-image retrieval which lacks fine-grained localization. There's a need for a unified approach addressing both challenges simultaneously. Method: Introduced ReSeDis task that requires models to decide if an object appears in images and locate it using bounding boxes or masks based on free-form descriptions. A benchmark with unique mappings of descriptions to object instances and a task-specific metric scoring retrieval recall and localization precision were developed. Result: A zero-shot baseline using a frozen vision-language model was provided, showing significant room for improvement in this area. Conclusion: ReSeDis provides an end-to-end testbed for developing robust and scalable multimodal search systems. Abstract: Large-scale visual search engines are expected to solve a dual problem at once: (i) locate every image that truly contains the object described by a sentence and (ii) identify the object's bounding box or exact pixels within each hit. Existing techniques address only one side of this challenge. Visual grounding yields tight boxes and masks but rests on the unrealistic assumption that the object is present in every test image, producing a flood of false alarms when applied to web-scale collections. Text-to-image retrieval excels at sifting through massive databases to rank relevant images, yet it stops at whole-image matches and offers no fine-grained localization. We introduce Referring Search and Discovery (ReSeDis), the first task that unifies corpus-level retrieval with pixel-level grounding. Given a free-form description, a ReSeDis model must decide whether the queried object appears in each image and, if so, where it is, returning bounding boxes or segmentation masks. 
To enable rigorous study, we curate a benchmark in which every description maps uniquely to object instances scattered across a large, diverse corpus, eliminating unintended matches. We further design a task-specific metric that jointly scores retrieval recall and localization precision. Finally, we provide a straightforward zero-shot baseline using a frozen vision-language model, revealing significant headroom for future study. ReSeDis offers a realistic, end-to-end testbed for building the next generation of robust and scalable multimodal search systems.

[79] Conquering the Retina: Bringing Visual in-Context Learning to OCT

Alessio Negrini,Simon Reiß

Main category: cs.CV

TL;DR: The paper explores training generalist models for retinal OCT using visual in-context learning (VICL), proposes an evaluation protocol, evaluates a state-of-the-art VICL approach on multiple datasets, and releases code to promote further research.

Details Motivation: Recent advancements in medical image analysis have led to highly specialized models that perform exceptionally well but are limited to predefined tasks. Generalist models could offer more flexibility by allowing medical practitioners to define tasks without needing task-specific model development. Method: The authors use visual in-context learning (VICL) to train generalist models for retinal optical coherence tomography (OCT). They also propose a broad evaluation protocol tailored to VICL in OCT. Result: They extensively evaluate a state-of-the-art medical VICL approach on multiple retinal OCT datasets, establishing a first baseline that highlights both the potential and limitations of in-context learning for OCT. Conclusion: This work demonstrates the potential of VICL for training generalist models in retinal OCT, provides a baseline for future research, and fosters practical adoption by releasing their code. Abstract: Recent advancements in medical image analysis have led to the development of highly specialized models tailored to specific clinical tasks. These models have demonstrated exceptional performance and remain a crucial research direction. Yet, their applicability is limited to predefined tasks, requiring expertise and extensive resources for development and adaptation. In contrast, generalist models offer a different form of utility: allowing medical practitioners to define tasks on the fly without the need for task-specific model development. In this work, we explore how to train generalist models for the domain of retinal optical coherence tomography using visual in-context learning (VICL), i.e., training models to generalize across tasks based on a few examples provided at inference time. To facilitate rigorous assessment, we propose a broad evaluation protocol tailored to VICL in OCT. 
We extensively evaluate a state-of-the-art medical VICL approach on multiple retinal OCT datasets, establishing a first baseline to highlight the potential and current limitations of in-context learning for OCT. To foster further research and practical adoption, we openly release our code.

[80] Privacy-Shielded Image Compression: Defending Against Exploitation from Vision-Language Pretrained Models

Xuelin Shen,Jiayin Xu,Kangsheng Yin,Wenhan Yang

Main category: cs.CV

TL;DR: The paper introduces Privacy-Shielded Image Compression (PSIC), a flexible coding method that protects user privacy by preventing VLP models from interpreting images while maintaining perceptual quality and original compression functionality. It uses a CLTG module for customizable reconstruction and UAEO optimization function, along with an adaptive multi-objective optimization strategy.

Details Motivation: To protect users' privacy in publicly posted images from being exploited by vision-language pretrained (VLP) models. Method: Propose PSIC which produces bitstreams with multiple decoding options. Includes a CLTG module to produce bias information for different reconstructed versions and a UAEO optimization function to leverage soft labels inferred from the target VLP model's uncertainty on the training data. Also incorporates an adaptive multi-objective optimization strategy. Result: Demonstrated effectiveness through extensive experiments across multiple downstream tasks. Conclusion: PSIC is a plug-and-play solution that can be integrated into most existing Learned Image Compression (LIC) models. Abstract: The improved semantic understanding of vision-language pretrained (VLP) models has made it increasingly difficult to protect publicly posted images from being exploited by search engines and other similar tools. In this context, this paper seeks to protect users' privacy by implementing defenses at the image compression stage to prevent exploitation. Specifically, we propose a flexible coding method, termed Privacy-Shielded Image Compression (PSIC), that can produce bitstreams with multiple decoding options. By default, the bitstream is decoded to preserve satisfactory perceptual quality while preventing interpretation by VLP models. Our method also retains the original image compression functionality. With a customizable input condition, the proposed scheme can reconstruct the image that preserves its full semantic information. A Conditional Latent Trigger Generation (CLTG) module is proposed to produce bias information based on customizable conditions to guide the decoding process into different reconstructed versions, and an Uncertainty-Aware Encryption-Oriented (UAEO) optimization function is designed to leverage the soft labels inferred from the target VLP model's uncertainty on the training data. 
This paper further incorporates an adaptive multi-objective optimization strategy to obtain improved encrypting performance and perceptual quality simultaneously within a unified training process. The proposed scheme is plug-and-play and can be seamlessly integrated into most existing Learned Image Compression (LIC) models. Extensive experiments across multiple downstream tasks have demonstrated the effectiveness of our design.

[81] DM-FNet: Unified multimodal medical image fusion via diffusion process-trained encoder-decoder

Dan He,Weisheng Li,Guofen Wang,Yuping Huang,Shiqiang Liu

Main category: cs.CV

TL;DR: A two-stage diffusion model-based fusion network (DM-FNet) is proposed for multimodal medical image fusion, which integrates a diffusion process and UNet in Stage I for detailed feature capture, and enhances cross-modal feature interaction in Stage II with three fusion modules. Experimental results show that this method improves fused image quality significantly.

Details Motivation: Existing MMIF methods have limited capacity to capture detailed features during training and insufficient cross-modal feature interaction, resulting in suboptimal fused image quality. Method: The study proposes DM-FNet, a two-stage diffusion model-based fusion network. Stage I uses a diffusion process to train UNet for image reconstruction, capturing detailed information through progressive denoising. Stage II inputs noisy images into the fusion network and integrates three key fusion modules to enhance feature recognition capability and adaptively process different modalities. Result: The experimental results across various medical image types indicate exceptional performance regarding objective evaluation metrics. The fused image preserves appropriate brightness, comprehensive distribution of radioactive tracers, rich textures, and clear edges. Conclusion: DM-FNet effectively addresses the limitations of existing MMIF methods by enhancing detailed feature capture and cross-modal feature interaction, leading to improved fused image quality. Abstract: Multimodal medical image fusion (MMIF) extracts the most meaningful information from multiple source images, enabling a more comprehensive and accurate diagnosis. Achieving high-quality fusion results requires a careful balance of brightness, color, contrast, and detail; this ensures that the fused images effectively display relevant anatomical structures and reflect the functional status of the tissues. However, existing MMIF methods have limited capacity to capture detailed features during conventional training and suffer from insufficient cross-modal feature interaction, leading to suboptimal fused image quality. To address these issues, this study proposes a two-stage diffusion model-based fusion network (DM-FNet) to achieve unified MMIF. In Stage I, a diffusion process trains UNet for image reconstruction. 
UNet captures detailed information through progressive denoising and represents multilevel data, providing a rich set of feature representations for the subsequent fusion network. In Stage II, noisy images at various steps are input into the fusion network to enhance the model's feature recognition capability. Three key fusion modules are also integrated to process medical images from different modalities adaptively. Ultimately, the robust network structure and a hybrid loss function are integrated to harmonize the fused image's brightness, color, contrast, and detail, enhancing its quality and information density. The experimental results across various medical image types demonstrate that the proposed method performs exceptionally well regarding objective evaluation metrics. The fused image preserves appropriate brightness, a comprehensive distribution of radioactive tracers, rich textures, and clear edges. The code is available at https://github.com/HeDan-11/DM-FNet.
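As an illustration of the "noisy images at various steps" fed to the Stage II fusion network, here is a minimal sketch of the standard DDPM forward (noising) process. The linear schedule, the step count, and the helper names `linear_betas`, `alpha_bar`, and `add_noise` are assumptions for illustration; the abstract does not state which schedule DM-FNet uses.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common DDPM choice, assumed here)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bar(betas, t):
    """Cumulative product of (1 - beta_s) for s = 0..t."""
    prod = 1.0
    for beta in betas[: t + 1]:
        prod *= 1.0 - beta
    return prod

def add_noise(pixels, t, betas, rng):
    """Sample x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps for a flat pixel list."""
    a_bar = alpha_bar(betas, t)
    signal, noise = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [signal * p + noise * rng.gauss(0.0, 1.0) for p in pixels]

# Hypothetical usage: the same toy "image" noised at an early, middle, and late step.
betas = linear_betas(1000)
rng = random.Random(0)
image = [0.2, 0.5, 0.9]
noisy_versions = {t: add_noise(image, t, betas, rng) for t in (10, 500, 999)}
```

The later the step, the smaller the retained signal weight `sqrt(alpha_bar)`, so a network that sees several steps is exposed to the same content at several noise levels.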

[82] video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

Changli Tang,Yixuan Li,Yudong Yang,Jimin Zhuang,Guangzhi Sun,Wei Li,Zejun Ma,Chao Zhang

Main category: cs.CV

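TL;DR: video-SALMONN 2 is an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) that enhances video (with paired audio) captioning through directed preference optimisation (DPO). Experiments show that the multi-round DPO (MrDPO) approach significantly improves video-SALMONN 2's captioning accuracy, allowing the model to surpass leading models such as GPT-4o and Gemini-1.5-Pro on video captioning tasks.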

Details Motivation: Videos contain a wealth of information, and generating detailed, accurate natural-language descriptions is a key aspect of video understanding. A more precise and efficient method for generating video descriptions is needed to improve understanding of video content. Method: The paper proposes video-SALMONN 2, a new audio-visual large language model that combines low-rank adaptation (LoRA) with directed preference optimisation (DPO). It also introduces a multi-round DPO (MrDPO) approach that stabilises training by periodically updating the DPO reference model, merging and re-initialising the LoRA module, and incorporating guidance from ground-truth video captions. Result: Experiments show that MrDPO reduces video-SALMONN 2's captioning error rate by 28%. The final model, with only 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro on video captioning tasks and performs competitively with models of similar size on widely used video question-answering benchmarks. Conclusion: By adopting LoRA and MrDPO, video-SALMONN 2 delivers strong video captioning performance, improving caption accuracy while remaining resource-efficient, and points to a new direction for video understanding research. Abstract: Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimisation (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using DPO. To further improve training, we propose a novel multi-round DPO (MrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initialising the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilise the process. Experimental results show that MrDPO significantly enhances video-SALMONN 2's captioning accuracy, reducing the captioning error rates by 28%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining highly competitive performance to the state-of-the-art on widely used video question-answering benchmarks among models of similar size. Codes are available at https://github.com/bytedance/video-SALMONN-2.
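For readers unfamiliar with preference optimisation, the following is a minimal sketch of the widely used DPO loss for a single preference pair, assuming that objective is what the abstract's DPO refers to. The function name `dpo_loss`, the value of `beta`, and the log-probabilities are hypothetical; the paper's own metrics, reference-model updates, and LoRA merging are not reproduced here.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) caption pair.

    Each argument is the summed token log-probability of a caption under the
    trainable policy or the frozen reference model. beta=0.1 is an assumed value.
    """
    # How much more the policy prefers the chosen caption than the reference does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # -log(sigmoid(x)) computed in a numerically stable way.
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

# Hypothetical log-probabilities: the policy already favours the chosen caption.
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In a multi-round setup such as the one the abstract describes, the reference log-probabilities would come from a reference model that is refreshed after each round rather than kept fixed throughout training.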

[83] Convolutional Feature Enhancement and Attention Fusion BiFPN for Ship Detection in SAR Images

Liangjie Meng,Danxia Li,Jinrong He,Lili Ma,Zhixin Li

Main category: cs.CV

TL;DR: This paper proposes C-AFBiFPN for SAR ship detection, enhancing feature representation and fusion to improve detection accuracy.

Details Motivation: SAR ship detection faces challenges such as significant scale variations among ships, small offshore vessels mixed with noise, and complex backgrounds. Method: The paper introduces C-AFBiFPN which includes a Convolutional Feature Enhancement (CFE) module and integrates BiFormer attention within the fusion strategy of BiFPN. Result: Experimental results on SAR Ship Detection Dataset (SSDD) show improved detection accuracy for small targets, robustness against occlusions, and adaptability to multi-scale features. Conclusion: C-AFBiFPN substantially enhances SAR ship detection performance. Abstract: Synthetic Aperture Radar (SAR) enables submeter-resolution imaging and all-weather monitoring via active microwave and advanced signal processing. Currently, SAR has found extensive applications in critical maritime domains such as ship detection. However, SAR ship detection faces several challenges, including significant scale variations among ships, the presence of small offshore vessels mixed with noise, and complex backgrounds for large nearshore ships. To address these issues, this paper proposes a novel feature enhancement and fusion framework named C-AFBiFPN. C-AFBiFPN constructs a Convolutional Feature Enhancement (CFE) module following the backbone network, aiming to enrich feature representation and enhance the ability to capture and represent local details and contextual information. Furthermore, C-AFBiFPN innovatively integrates BiFormer attention within the fusion strategy of BiFPN, creating the AFBiFPN network. AFBiFPN improves the global modeling capability of cross-scale feature fusion and can adaptively focus on critical feature regions. The experimental results on SAR Ship Detection Dataset (SSDD) indicate that the proposed approach substantially enhances detection accuracy for small targets, robustness against occlusions, and adaptability to multi-scale features.

[84] RA-NeRF: Robust Neural Radiance Field Reconstruction with Accurate Camera Pose Estimation under Complex Trajectories

Qingsong Yan,Qiang Wang,Kaiyong Zhao,Jie Chen,Bo Li,Xiaowen Chu,Fei Deng

Main category: cs.CV

TL;DR: The paper presents RA-NeRF, a method for predicting accurate camera poses with complex trajectories using NeRF and flow-driven pose regulation. It achieves state-of-the-art results on Tanks&Temple and NeRFBuster datasets.

Details Motivation: To address the challenge of achieving accurate camera pose priors in Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), especially when dealing with complex camera trajectories. Method: RA-NeRF uses an incremental pipeline to reconstruct scenes with NeRF based on photometric consistency, incorporates flow-driven pose regulation for robust initialization and localization, and employs an implicit pose filter to capture camera movement patterns and reduce noise in pose estimation. Result: Extensive experiments on Tanks&Temple and NeRFBuster datasets show that RA-NeRF achieves state-of-the-art results in both camera pose estimation and visual quality. Conclusion: RA-NeRF effectively and robustly reconstructs scenes under complex pose trajectories, offering improvements in both camera pose estimation and visual quality. Abstract: Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have emerged as powerful tools for 3D reconstruction and SLAM tasks. However, their performance depends heavily on accurate camera pose priors. Existing approaches attempt to address this issue by introducing external constraints but fall short of achieving satisfactory accuracy, particularly when camera trajectories are complex. In this paper, we propose a novel method, RA-NeRF, capable of predicting highly accurate camera poses even with complex camera trajectories. Following the incremental pipeline, RA-NeRF reconstructs the scene using NeRF with photometric consistency and incorporates flow-driven pose regulation to enhance robustness during initialization and localization. Additionally, RA-NeRF employs an implicit pose filter to capture the camera movement pattern and eliminate the noise for pose estimation. To validate our method, we conduct extensive experiments on the Tanks&Temple dataset for standard evaluation, as well as the NeRFBuster dataset, which presents challenging camera pose trajectories. 
On both datasets, RA-NeRF achieves state-of-the-art results in both camera pose estimation and visual quality, demonstrating its effectiveness and robustness in scene reconstruction under complex pose trajectories.

[85] Retrospective Memory for Camouflaged Object Detection

Chenxi Zhang,Jiayun Wu,Qing Zhang,Yazhe Zhai,Youwei Pang

Main category: cs.CV

TL;DR: The paper proposes RetroMem, a recall-augmented architecture for camouflaged object detection (COD), which integrates historical knowledge into the process of perception and inference. It employs a two-stage training paradigm with dense multi-scale adapter (DMA) in the learning stage and dynamic memory mechanism (DMM) with inference pattern reconstruction (IPR) in the recall stage. Experiments show that RetroMem outperforms existing methods.

Details Motivation: Existing COD methods lack explicit mechanisms to acquire historical context, limiting their effectiveness in challenging camouflage scenes. Method: RetroMem uses a two-stage training paradigm: 1) Learning stage - Dense Multi-scale Adapter (DMA) improves pretrained encoder's capability to capture rich multi-scale visual information; 2) Recall stage - Dynamic Memory Mechanism (DMM) and Inference Pattern Reconstruction (IPR) leverage latent relationships between learned knowledge and current sample context to reconstruct camouflage pattern inference. Result: Extensive experiments on widely used datasets demonstrate significant outperformance over existing state-of-the-art methods. Conclusion: RetroMem effectively modulates camouflage pattern perception and inference by integrating historical knowledge, leading to improved adaptation and effectiveness in handling challenging camouflage scenes. Abstract: Camouflaged object detection (COD) primarily focuses on learning subtle yet discriminative representations from complex scenes. Existing methods predominantly follow the parametric feedforward architecture based on static visual representation modeling. However, they lack explicit mechanisms for acquiring historical context, limiting their adaptation and effectiveness in handling challenging camouflage scenes. In this paper, we propose a recall-augmented COD architecture, namely RetroMem, which dynamically modulates camouflage pattern perception and inference by integrating relevant historical knowledge into the process. Specifically, RetroMem employs a two-stage training paradigm consisting of a learning stage and a recall stage to construct, update, and utilize memory representations effectively. During the learning stage, we design a dense multi-scale adapter (DMA) to improve the pretrained encoder's capability to capture rich multi-scale visual information with very few trainable parameters, thereby providing foundational inferences. 
In the recall stage, we propose a dynamic memory mechanism (DMM) and an inference pattern reconstruction (IPR). These components fully leverage the latent relationships between learned knowledge and current sample context to reconstruct the inference of camouflage patterns, thereby significantly improving the model's understanding of camouflage scenes. Extensive experiments on several widely used datasets demonstrate that our RetroMem significantly outperforms existing state-of-the-art methods.

[86] Domain Adaptation for Image Classification of Defects in Semiconductor Manufacturing

Adrian Poniatowski,Natalie Gentner,Manuel Barusco,Davide Dalle Pezze,Samuele Salti,Gian Antonio Susto

Main category: cs.CV

TL;DR: In the competitive semiconductor sector, Domain Adaptation (DA) techniques have shown promise in defect classification tasks. The proposed DBACS approach, inspired by CycleGAN with added loss terms, was tested on real-world Electron Microscope images under unsupervised and semi-supervised settings, demonstrating its potential to enhance DA methods.

Details Motivation: The motivation of this paper stems from the need to reduce time-to-market and improve quality in the semiconductor industry amidst high competition. Leveraging the success of deep learning in computer vision, particularly Domain Adaptation (DA), the authors aim to minimize manual re-labeling and re-training costs while improving robustness and scalability. Method: The authors propose the DBACS approach, a model inspired by CycleGAN that incorporates additional loss terms for better performance. This method is evaluated in both semi-supervised and unsupervised settings using real-world Electron Microscope images. Result: The DBACS approach demonstrates effectiveness in advancing DA techniques within the semiconductor field, as validated through experiments on Electron Microscope images under unsupervised and semi-supervised conditions. Conclusion: Domain Adaptation techniques hold significant potential for improving defect classification in the semiconductor industry. The proposed DBACS method successfully enhances these techniques, reducing computational costs and allowing experts to focus on higher-value tasks. Abstract: In the semiconductor sector, due to high demand but also strong and increasing competition, time to market and quality are key factors in securing significant market share in various application areas. Thanks to the success of deep learning methods in recent years in the computer vision domain, Industry 4.0 and 5.0 applications, such as defect classification, have achieved remarkable success. In particular, Domain Adaptation (DA) has proven highly effective since it focuses on using the knowledge learned on a (source) domain to adapt and perform effectively on a different but related (target) domain. By improving robustness and scalability, DA minimizes the need for extensive manual re-labeling or re-training of models. 
This not only reduces computational and resource costs but also allows human experts to focus on high-value tasks. Therefore, we tested the efficacy of DA techniques in semi-supervised and unsupervised settings within the context of the semiconductor field. Moreover, we propose the DBACS approach, a CycleGAN-inspired model enhanced with additional loss terms to improve performance. All the approaches are studied and validated on real-world Electron Microscope images considering the unsupervised and semi-supervised settings, proving the usefulness of our method in advancing DA techniques for the semiconductor field.

[87] MSNeRV: Neural Video Representation with Multi-Scale Feature Fusion

Jun Zhu,Xinfeng Zhang,Lv Tang,JunHao Jiang

Main category: cs.CV

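TL;DR: MSNeRV is a multi-scale feature fusion framework for neural video representation. It addresses the shortcomings of existing INR methods in representing detail-intensive and fast-changing video content by enhancing temporal consistency, integrating multi-resolution and multi-frequency information, and improving feature extraction. Experiments show that the model surpasses VTM-23.7 (Random Access) in compression efficiency in dynamic scenarios.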

Details Motivation: Existing INR methods struggle to represent detail-intensive and fast-changing video content, mainly because internal network features are underutilized and network designs lack video-specific considerations. Method: The paper proposes MSNeRV, a multi-scale feature fusion framework. In the encoding stage, it enhances temporal consistency with temporal windows and divides the video into multiple Groups of Pictures (GoPs), using a GoP-level grid for background representation. It also designs a multi-scale spatial decoder with a scale-adaptive loss function to integrate multi-resolution and multi-frequency information, and introduces a multi-scale feature block to make full use of hidden features. Result: Experiments show that MSNeRV achieves superior representation capability on the HEVC ClassB and UVG datasets and surpasses VTM-23.7 (Random Access) in compression efficiency in dynamic scenarios. Conclusion: MSNeRV effectively addresses the shortcomings of existing INR methods on detail-intensive and fast-changing video content, demonstrating superior representation capability and compression efficiency. Abstract: Implicit neural representations (INRs) have emerged as a promising approach for video compression, and have achieved comparable performance to the state-of-the-art codecs such as H.266/VVC. However, existing INR-based methods struggle to effectively represent detail-intensive and fast-changing video content. This limitation mainly stems from the underutilization of internal network features and the absence of video-specific considerations in network design. To address these challenges, we propose a multi-scale feature fusion framework, MSNeRV, for neural video representation. In the encoding stage, we enhance temporal consistency by employing temporal windows, and divide the video into multiple Groups of Pictures (GoPs), where a GoP-level grid is used for background representation. Additionally, we design a multi-scale spatial decoder with a scale-adaptive loss function to integrate multi-resolution and multi-frequency information. To further improve feature extraction, we introduce a multi-scale feature block that fully leverages hidden features. We evaluate MSNeRV on HEVC ClassB and UVG datasets for video representation and compression. Experimental results demonstrate that our model exhibits superior representation capability among INR-based approaches and surpasses VTM-23.7 (Random Access) in dynamic scenarios in terms of compression efficiency.

[88] BCRNet: Enhancing Landmark Detection in Laparoscopic Liver Surgery via Bezier Curve Refinement

Qian Li,Feng Liu,Shuojue Yang,Daiyun Shen,Yueming Jin

Main category: cs.CV

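TL;DR: The paper proposes BCRNet (Bezier Curve Refinement Net), a new framework for enhancing anatomical landmark detection in laparoscopic liver surgery. The framework improves landmark detection accuracy through multi-modal feature extraction, adaptive curve proposal initialization, and a hierarchical curve refinement mechanism. Experiments show that BCRNet outperforms existing methods on the L3D and P2ILF datasets.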

Details Motivation: Accurately identifying critical anatomical structures in laparoscopic liver surgery is challenging. Existing AR systems based on 2D-3D registration offer a potential solution, but they need more precise detection of curvilinear anatomical landmarks. Method: (1) A Multi-modal Feature Extraction (MFE) module captures semantic features. (2) Adaptive Curve Proposal Initialization (ACPI) generates pixel-aligned Bezier curves with confidence scores. (3) A Hierarchical Curve Refinement (HCR) mechanism progressively refines the curve proposals through a multi-stage process, capturing fine-grained contextual details from multi-scale pixel-level features. Result: Extensive evaluation on the L3D and P2ILF datasets shows that BCRNet significantly outperforms existing state-of-the-art methods. Conclusion: By introducing a Bezier curve refinement strategy, BCRNet significantly improves anatomical landmark detection accuracy in laparoscopic liver surgery, providing a more reliable basis for surgical navigation. Abstract: Laparoscopic liver surgery, while minimally invasive, poses significant challenges in accurately identifying critical anatomical structures. Augmented reality (AR) systems, integrating MRI/CT with laparoscopic images based on 2D-3D registration, offer a promising solution for enhancing surgical navigation. A vital aspect of the registration process is the precise detection of curvilinear anatomical landmarks in laparoscopic images. In this paper, we propose BCRNet (Bezier Curve Refinement Net), a novel framework that significantly enhances landmark detection in laparoscopic liver surgery primarily via the Bezier curve refinement strategy. The framework starts with a Multi-modal Feature Extraction (MFE) module designed to robustly capture semantic features. Then we propose Adaptive Curve Proposal Initialization (ACPI) to generate pixel-aligned Bezier curves and confidence scores for reliable initial proposals. Additionally, we design the Hierarchical Curve Refinement (HCR) mechanism to enhance these proposals iteratively through a multi-stage process, capturing fine-grained contextual details from multi-scale pixel-level features for precise Bezier curve adjustment. Extensive evaluations on the L3D and P2ILF datasets demonstrate that BCRNet outperforms state-of-the-art methods, achieving significant performance improvements. Code will be available.
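To make the Bezier curve representation concrete, the sketch below evaluates a curve from its control points using the Bernstein polynomial form, which is how a predicted set of control points can be turned into pixel coordinates. The control points and the helper names `bezier_point` and `sample_curve` are hypothetical; the curve degree and refinement procedure used by BCRNet are not specified in the abstract.

```python
from math import comb

def bezier_point(control_points, t):
    """Evaluate a 2D Bezier curve of degree n = len(control_points) - 1 at t in [0, 1]."""
    n = len(control_points) - 1
    x = y = 0.0
    for i, (px, py) in enumerate(control_points):
        # Bernstein basis polynomial B_{i,n}(t)
        basis = comb(n, i) * (t ** i) * ((1 - t) ** (n - i))
        x += basis * px
        y += basis * py
    return (x, y)

def sample_curve(control_points, num=5):
    """Sample `num` evenly spaced parameter values along the curve (num >= 2)."""
    return [bezier_point(control_points, k / (num - 1)) for k in range(num)]

# Hypothetical cubic curve standing in for a curvilinear landmark.
controls = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
points = sample_curve(controls, num=5)
```

Because the whole curve is determined by a handful of control points, refining a landmark amounts to adjusting those few coordinates rather than every pixel on the curve.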

[89] AI-driven visual monitoring of industrial assembly tasks

Mattia Nardon,Stefano Messelodi,Antonio Granata,Fabio Poiesi,Alberto Danese,Davide Boscaini

Main category: cs.CV

TL;DR: Visual monitoring of industrial assembly tasks is critical for preventing equipment damage and ensuring worker safety. The paper introduces ViMAT, an AI-driven system for real-time visual monitoring that operates without rigid workspace setups or visual markers.

Details Motivation: Visual monitoring of industrial assembly tasks is important to prevent equipment damage and ensure worker safety. Existing commercial solutions typically require rigid workspace setups or visual markers which simplify the problem but may not be practical in all scenarios. Method: ViMAT combines a perception module that extracts visual observations from multi-view video streams with a reasoning module that infers the most likely action being performed based on the observed assembly state and prior task knowledge. Result: ViMAT was validated on two assembly tasks - replacement of LEGO components and reconfiguration of hydraulic press molds. It demonstrated its effectiveness through quantitative and qualitative analysis in challenging real-world scenarios characterized by partial and uncertain visual observations. Conclusion: ViMAT is a novel AI-driven system for real-time visual monitoring of assembly tasks that operates without the constraints of rigid workspace setups or visual markers. It has been shown to be effective in real-world scenarios. Abstract: Visual monitoring of industrial assembly tasks is critical for preventing equipment damage due to procedural errors and ensuring worker safety. Although commercial solutions exist, they typically require rigid workspace setups or the application of visual markers to simplify the problem. We introduce ViMAT, a novel AI-driven system for real-time visual monitoring of assembly tasks that operates without these constraints. ViMAT combines a perception module that extracts visual observations from multi-view video streams with a reasoning module that infers the most likely action being performed based on the observed assembly state and prior task knowledge. 
We validate ViMAT on two assembly tasks, involving the replacement of LEGO components and the reconfiguration of hydraulic press molds, demonstrating its effectiveness through quantitative and qualitative analysis in challenging real-world scenarios characterized by partial and uncertain visual observations. Project page: https://tev-fbk.github.io/ViMAT

[90] MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering

Xinqi Fan,Jingting Li,John See,Moi Hoon Yap,Wen-Huang Cheng,Xiaobai Li,Xiaopeng Hong,Su-Jing Wang,Adrian K. Davision

Main category: cs.CV

TL;DR: Facial micro-expression (ME) analysis has seen advancements in recognition, spotting and generation. However, conventional separate treatment of spotting and recognition is suboptimal for long-duration videos. The ME grand challenge 2025 presents two tasks: ME spot-then-recognize (ME-STR), integrating spotting and recognition; and ME visual question answering (ME-VQA), exploring ME understanding through questions answered by multimodal models.

Details Motivation: Conventional approaches to facial micro-expression analysis treat spotting and recognition as separate tasks, which is suboptimal for analyzing long-duration videos in realistic settings. There's a need for an integrated approach that can better handle these complex scenarios. Method: The ME grand challenge 2025 introduces two tasks: (1) ME spot-then-recognize (ME-STR), which integrates the processes of spotting and recognizing micro-expressions into a unified sequential pipeline; and (2) ME visual question answering (ME-VQA), which uses multimodal large language models or large vision-language models to address various question types related to micro-expressions. Result: Participants are required to run their algorithms on the provided test set and submit results for evaluation on a leaderboard. Conclusion: The ME grand challenge 2025 aims to advance the field of micro-expression analysis by promoting integrated methods for spotting and recognition, as well as leveraging multimodal models for deeper understanding. Abstract: Facial micro-expressions (MEs) are involuntary movements of the face that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression, typically found in a high-stakes environment. In recent years, substantial advancements have been made in the areas of ME recognition, spotting, and generation. However, conventional approaches that treat spotting and recognition as separate tasks are suboptimal, particularly for analyzing long-duration videos in realistic settings. Concurrently, the emergence of multimodal large language models (MLLMs) and large vision-language models (LVLMs) offers promising new avenues for enhancing ME analysis through their powerful multimodal reasoning capabilities. 
The ME grand challenge (MEGC) 2025 introduces two tasks that reflect these evolving research directions: (1) ME spot-then-recognize (ME-STR), which integrates ME spotting and subsequent recognition in a unified sequential pipeline; and (2) ME visual question answering (ME-VQA), which explores ME understanding through visual question answering, leveraging MLLMs or LVLMs to address diverse question types related to MEs. All participating algorithms are required to run on this test set and submit their results on a leaderboard. More details are available at https://megc2025.github.io.

[91] MapFM: Foundation Model-Driven HD Mapping with Multi-Task Contextual Learning

Leonid Ivanov,Vasily Yuryev,Dmitry Yudin

Main category: cs.CV

TL;DR: An enhanced End-to-End model named MapFM is introduced for online vectorized HD map generation in autonomous driving. It incorporates a powerful foundation model for encoding camera images and integrates auxiliary prediction heads for semantic segmentation in the BEV representation, leading to higher accuracy and improved quality of the predicted maps.

Details Motivation: In autonomous driving, high-definition (HD) maps and semantic maps in bird's-eye view (BEV) are essential for accurate localization, planning, and decision-making. Method: The method involves an enhanced End-to-End model named MapFM for online vectorized HD map generation. This model boosts feature representation quality by incorporating a powerful foundation model for encoding camera images and integrates auxiliary prediction heads for semantic segmentation in the BEV representation. Result: This multi-task learning approach provides richer contextual supervision, leading to a more comprehensive scene representation and ultimately resulting in higher accuracy and improved quality of the predicted vectorized HD maps. Conclusion: MapFM significantly enhances the accuracy and quality of vectorized HD maps for autonomous driving applications. Abstract: In autonomous driving, high-definition (HD) maps and semantic maps in bird's-eye view (BEV) are essential for accurate localization, planning, and decision-making. This paper introduces an enhanced End-to-End model named MapFM for online vectorized HD map generation. We significantly boost feature representation quality by incorporating a powerful foundation model for encoding camera images. To further enrich the model's understanding of the environment and improve prediction quality, we integrate auxiliary prediction heads for semantic segmentation in the BEV representation. This multi-task learning approach provides richer contextual supervision, leading to a more comprehensive scene representation and ultimately resulting in higher accuracy and improved quality of the predicted vectorized HD maps. The source code is available at https://github.com/LIvanoff/MapFM.

[92] OpenPath: Open-Set Active Learning for Pathology Image Classification via Pre-trained Vision-Language Models

Lanfeng Zhong,Xin Liao,Shichuan Zhang,Shaoting Zhang,Guotai Wang

Main category: cs.CV

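TL;DR: The paper proposes OpenPath, a new open-set active learning method for pathology image classification. Using a pre-trained vision-language model (VLM) with task-specific prompts, it selects in-distribution and informative samples in the first query, and in subsequent queries applies a Diverse Informative ID Sampling strategy to ensure query purity and informativeness. Experiments show that OpenPath significantly improves model performance and outperforms existing methods.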

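Details Motivation: Current active learning (AL) methods mostly rely on a closed-set assumption, whereas in real clinical settings the unlabeled pool may contain a large amount of out-of-distribution (OOD) data, which makes annotation with traditional AL methods inefficient. In addition, most existing AL methods select samples at random in the first query round, wasting annotation budget in open-set scenarios. Method: (1) OpenPath uses a pre-trained vision-language model (VLM) for open-set active learning. (2) In the first query round, task-specific prompts that combine target-class and relevant non-target-class prompts are used to select in-distribution and informative samples. (3) In subsequent queries, Diverse Informative ID Sampling (DIS), comprising Prototype-based ID candidate Selection (PIS) and Entropy-Guided Stochastic Sampling (EGSS), ensures query purity and informativeness and avoids selecting OOD samples. Result: Experiments on two public pathology image datasets show that, owing to the high purity of the selected samples, OpenPath significantly improves model performance and outperforms several state-of-the-art open-set AL methods. Conclusion: OpenPath is an effective open-set active learning method for pathology image classification. By reducing OOD interference and improving the quality of sample selection, it substantially improves model performance and offers a new approach for practical clinical use. Abstract: Pathology image classification plays a crucial role in accurate medical diagnosis and treatment planning. Training high-performance models for this task typically requires large-scale annotated datasets, which are both expensive and time-consuming to acquire. Active Learning (AL) offers a solution by iteratively selecting the most informative samples for annotation, thereby reducing the labeling effort. However, most AL methods are designed under the assumption of a closed-set scenario, where all the unannotated images belong to target classes. In real-world clinical environments, the unlabeled pool often contains a substantial amount of Out-Of-Distribution (OOD) data, leading to low efficiency of annotation in traditional AL methods. Furthermore, most existing AL methods start with random selection in the first query round, leading to a significant waste of labeling costs in open-set scenarios. To address these challenges, we propose OpenPath, a novel open-set active learning approach for pathological image classification leveraging a pre-trained Vision-Language Model (VLM). In the first query, we propose task-specific prompts that combine target and relevant non-target class prompts to effectively select In-Distribution (ID) and informative samples from the unlabeled pool. 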
In subsequent queries, Diverse Informative ID Sampling (DIS) that includes Prototype-based ID candidate Selection (PIS) and Entropy-Guided Stochastic Sampling (EGSS) is proposed to ensure both purity and informativeness in a query, avoiding the selection of OOD samples. Experiments on two public pathology image datasets show that OpenPath significantly enhances the model's performance due to its high purity of selected samples, and outperforms several state-of-the-art open-set AL methods. The code is available at https://github.com/HiLab-git/OpenPath.
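One plausible reading of entropy-guided stochastic sampling is to draw candidates without replacement with probability proportional to their predictive entropy, so that uncertain samples are favoured without the selection becoming fully deterministic. The sketch below illustrates that idea only; the functions `entropy` and `entropy_guided_sample`, the weighting rule, and the toy probabilities are assumptions and may differ from the EGSS procedure in the OpenPath code.

```python
import math
import random

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_guided_sample(pred_probs, k, seed=0):
    """Pick k distinct sample indices, weighted by predictive entropy."""
    rng = random.Random(seed)
    # Small constant keeps every weight positive, even for fully confident predictions.
    weights = [entropy(p) + 1e-12 for p in pred_probs]
    remaining = list(range(len(pred_probs)))
    chosen = []
    for _ in range(min(k, len(remaining))):
        pick = rng.choices(remaining, weights=[weights[i] for i in remaining], k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen

# Hypothetical softmax outputs for four in-distribution candidates.
candidates = [[0.5, 0.5], [0.9, 0.1], [0.99, 0.01], [0.6, 0.4]]
query = entropy_guided_sample(candidates, k=2)
```

Compared with always taking the top-entropy samples, the stochastic draw keeps some diversity in each query round, which matches the stated goal of balancing purity and informativeness.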

[93] Open-World Object Counting in Videos

Niki Amini-Naieni,Andrew Zisserman

Main category: cs.CV

TL;DR: An open-world object counting task in videos is introduced, along with the CountVid model which combines an image-based counting model and a promptable video segmentation and tracking model. A new dataset called VideoCount is also presented to evaluate its performance.

Details Motivation: The motivation is to create a system that can enumerate all unique instances of target objects in videos given a text description or an image example, especially in crowded scenes with occlusions and similar objects where avoiding double counting and identifying reappearances is crucial. Method: The proposed CountVid model leverages an image-based counting model and a promptable video segmentation and tracking model to enable automated, open-world object counting across video frames. Result: CountVid provides accurate object counts and significantly outperforms strong baselines when evaluated on the VideoCount dataset. Conclusion: The CountVid model successfully accomplishes the open-world object counting task in videos, demonstrating superior performance compared to baselines. Abstract: We introduce a new task of open-world object counting in videos: given a text description, or an image example, that specifies the target object, the objective is to enumerate all the unique instances of the target objects in the video. This task is especially challenging in crowded scenes with occlusions and similar objects, where avoiding double counting and identifying reappearances is crucial. To this end, we make the following contributions: we introduce a model, CountVid, for this task. It leverages an image-based counting model, and a promptable video segmentation and tracking model to enable automated, open-world object counting across video frames. To evaluate its performance, we introduce VideoCount, a new dataset for our novel task built from the TAO and MOT20 tracking datasets, as well as from videos of penguins and metal alloy crystallization captured by x-rays. Using this dataset, we demonstrate that CountVid provides accurate object counts, and significantly outperforms strong baselines. The VideoCount dataset, the CountVid model, and all the code are available at https://github.com/niki-amini-naieni/CountVid/.

[94] Unsupervised Pelage Pattern Unwrapping for Animal Re-identification

Aleksandr Algasov,Ekaterina Nepovinnykh,Fedor Zolotarev,Tuomas Eerola,Heikki Kälviäinen,Pavel Zemčík,Charles V. Stewart

Main category: cs.CV

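TL;DR: The paper proposes a geometry-aware texture mapping method that unwraps the distinctive markings of animal pelage into a canonical UV space to improve individual re-identification accuracy. Surface normal estimation guides the unwrapping, and the method is trained in a self-supervised manner without ground-truth UV annotations. Experiments on seal and leopard datasets show up to a 5.4% improvement in re-identification accuracy.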

Details Motivation: Existing individual re-identification methods struggle with the geometric distortion of animal fur or skin patterns caused by body movement and posture changes. Method: The paper proposes a geometry-aware texture mapping method that unwraps the distinctive markings of animal pelage into a canonical UV space, using surface normal estimation to guide the unwrapping and to preserve geometric consistency between the 3D surface and the 2D texture space. The method is combined with existing re-identification techniques. Result: Experiments on seal and leopard datasets show up to a 5.4% improvement in re-identification accuracy. Conclusion: The proposed geometry-aware texture mapping effectively addresses the deformation of pelage patterns, improves individual re-identification accuracy, and can be trained with self-supervised learning. Abstract: Existing individual re-identification methods often struggle with the deformable nature of animal fur or skin patterns, which undergo geometric distortions due to body movement and posture changes. In this paper, we propose a geometry-aware texture mapping approach that unwraps pelage patterns, the unique markings found on an animal's skin or fur, into a canonical UV space, enabling more robust feature matching. Our method uses surface normal estimation to guide the unwrapping process while preserving the geometric consistency between the 3D surface and the 2D texture space. We focus on two challenging species: Saimaa ringed seals (Pusa hispida saimensis) and leopards (Panthera pardus). Both species have distinctive yet highly deformable fur patterns. By integrating our pattern-preserving UV mapping with existing re-identification techniques, we demonstrate improved accuracy across diverse poses and viewing angles. Our framework does not require ground truth UV annotations and can be trained in a self-supervised manner. Experiments on seal and leopard datasets show up to a 5.4% improvement in re-identification accuracy.

[95] When Model Knowledge meets Diffusion Model: Diffusion-assisted Data-free Image Synthesis with Alignment of Domain and Class

Yujin Kim,Hyunsoo Kim,Hyunwoo J. Kim,Suhyun Kim

Main category: cs.CV

TL;DR: DDIS is a new method for data-free image synthesis which uses a text-to-image diffusion model to improve synthetic image quality. It introduces Domain Alignment Guidance and Class Alignment Token to align the synthetic data domain with the training data domain.

Details Motivation: Existing DFIS methods produce samples that deviate from the training data distribution due to lack of prior knowledge about natural images. Method: Propose DDIS, a Diffusion-assisted Data-free Image Synthesis method that leverages a text-to-image diffusion model as a powerful image prior. Introduce Domain Alignment Guidance (DAG) and optimize a single Class Alignment Token (CAT) embedding. Result: Experiments on PACS and ImageNet demonstrate that DDIS outperforms prior DFIS methods by generating samples that better reflect the training data distribution. Conclusion: DDIS achieves SOTA performance in data-free applications. Abstract: Open-source pre-trained models hold great potential for diverse applications, but their utility declines when their training data is unavailable. Data-Free Image Synthesis (DFIS) aims to generate images that approximate the learned data distribution of a pre-trained model without accessing the original data. However, existing DFIS methods produce samples that deviate from the training data distribution due to the lack of prior knowledge about natural images. To overcome this limitation, we propose DDIS, the first Diffusion-assisted Data-free Image Synthesis method that leverages a text-to-image diffusion model as a powerful image prior, improving synthetic image quality. DDIS extracts knowledge about the learned distribution from the given model and uses it to guide the diffusion model, enabling the generation of images that accurately align with the training data distribution. To achieve this, we introduce Domain Alignment Guidance (DAG) that aligns the synthetic data domain with the training data domain during the diffusion sampling process. Furthermore, we optimize a single Class Alignment Token (CAT) embedding to effectively capture class-specific attributes in the training dataset. 
Experiments on PACS and ImageNet demonstrate that DDIS outperforms prior DFIS methods by generating samples that better reflect the training data distribution, achieving SOTA performance in data-free applications.

[96] NERO: Explainable Out-of-Distribution Detection with Neuron-level Relevance

Anju Chhetri,Jari Korhonen,Prashnna Gyawali,Binod Bhattarai

Main category: cs.CV

TL;DR: The paper proposes NERO, a novel OOD scoring mechanism that uses neuron-level relevance to enhance OOD separability in medical imaging. It achieves improvements over state-of-the-art methods on benchmarks.

Details Motivation: To improve the reliability of deep learning models in medical imaging by better detecting out-of-distribution samples which can help identify potential anomalies. Method: NERO leverages neuron-level relevance at the feature layer, clusters this relevance for each in-distribution class to form centroids, and introduces a relevance distance metric. Scaled relevance in the bias term and feature norms are also incorporated. Result: NERO shows improvements over state-of-the-art OOD detection methods when validated across multiple deep learning architectures on the Kvasir and GastroVision benchmarks. Conclusion: NERO enhances OOD separability and provides explainable OOD detection, contributing to more reliable deep learning models in medical imaging. Abstract: Ensuring reliability is paramount in deep learning, particularly within the domain of medical imaging, where diagnostic decisions often hinge on model outputs. The capacity to separate out-of-distribution (OOD) samples has proven to be a valuable indicator of a model's reliability in research. In medical imaging, this is especially critical, as identifying OOD inputs can help flag potential anomalies that might otherwise go undetected. While many OOD detection methods rely on feature or logit space representations, recent works suggest these approaches may not fully capture OOD diversity. To address this, we propose a novel OOD scoring mechanism, called NERO, that leverages neuron-level relevance at the feature layer. Specifically, we cluster neuron-level relevance for each in-distribution (ID) class to form representative centroids and introduce a relevance distance metric to quantify a new sample's deviation from these centroids, enhancing OOD separability. Additionally, we refine performance by incorporating scaled relevance in the bias term and combining feature norms. Our framework also enables explainable OOD detection. 
We validate its effectiveness across multiple deep learning architectures on the gastrointestinal imaging benchmarks Kvasir and GastroVision, achieving improvements over state-of-the-art OOD detection methods.
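To make the centroid-based scoring described above more concrete, here is a heavily simplified sketch. It assumes a single centroid per in-distribution class (the mean relevance vector) and plain Euclidean distance; the paper clusters relevance and additionally uses bias-term relevance and feature norms, none of which are reproduced here. The function names and class labels are hypothetical.

```python
import math


def class_centroids(relevance_by_class):
    """Average the relevance vectors of each in-distribution class."""
    centroids = {}
    for label, vectors in relevance_by_class.items():
        dim = len(vectors[0])
        centroids[label] = [
            sum(v[i] for v in vectors) / len(vectors) for i in range(dim)
        ]
    return centroids


def relevance_distance(relevance, centroids):
    """Distance from a sample's relevance vector to the nearest class centroid.

    Larger values suggest the sample is more likely out-of-distribution.
    """
    return min(math.dist(relevance, c) for c in centroids.values())
```

Under these assumptions, a sample would be flagged as OOD when its distance exceeds a threshold chosen on held-out in-distribution data.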

[97] Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material

Team Hunyuan3D,Shuhui Yang,Mingxin Yang,Yifei Feng,Xin Huang,Sheng Zhang,Zebin He,Di Luo,Haolin Liu,Yunfei Zhao,Qingxiang Lin,Zeqiang Lai,Xianghui Yang,Huiwen Shi,Zibo Zhao,Bowen Zhang,Hongyu Yan,Lifu Wang,Sicong Liu,Jihong Zhang,Meng Chen,Liang Dong,Yiwen Jia,Yulin Cai,Jiaao Yu,Yixuan Tang,Dongyuan Guo,Junlin Yu,Hao Zhang,Zheng Ye,Peng He,Runzhou Wu,Shida Wei,Chao Zhang,Yonghao Tan,Yifu Sun,Lin Niu,Shirui Huang,Bojian Zheng,Shu Liu,Shilin Chen,Xiang Yuan,Xiaofeng Yang,Kai Liu,Jianchen Zhu,Peng Chen,Tian Liu,Di Wang,Yuhong Liu,Linus,Jie Jiang,Jingwei Huang,Chunchao Guo

Main category: cs.CV

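TL;DR: This paper introduces Hunyuan3D 2.1, a system that aims to reduce the complexity of 3D model generation and produce high-resolution, textured 3D assets. The system has two core components: Hunyuan3D-DiT for shape generation and Hunyuan3D-Paint for texture synthesis. The paper provides end-to-end guidance covering data preparation, model training, evaluation, and deployment.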

Details Motivation: Despite several groundbreaking 3D generative models, the field of 3D AI-generated content (AIGC) remains hard to access widely because of the complexity of collecting, processing, and training on 3D data. Method: The approach is realized through the Hunyuan3D 2.1 system, which comprises the shape generation module Hunyuan3D-DiT and the texture synthesis module Hunyuan3D-Paint, and provides a comprehensive guide covering data preparation, model architecture, training strategies, evaluation metrics, and deployment. Result: After the tutorial, readers should be able to fine-tune or develop a robust 3D generative model suitable for gaming, virtual reality, and industrial design. Conclusion: Hunyuan3D 2.1 offers a comprehensive solution for building 3D generative models, lowering the barrier for non-specialists to enter the 3D AIGC field and advancing the generation of high-resolution, textured 3D assets. Abstract: 3D AI-generated content (AIGC) is a passionate field that has significantly accelerated the creation of 3D models in gaming, film, and design. Despite the development of several groundbreaking models that have revolutionized 3D generation, the field remains largely accessible only to researchers, developers, and designers due to the complexities involved in collecting, processing, and training 3D models. To address these challenges, we introduce Hunyuan3D 2.1 as a case study in this tutorial. This tutorial offers a comprehensive, step-by-step guide on processing 3D data, training a 3D generative model, and evaluating its performance using Hunyuan3D 2.1, an advanced system for producing high-resolution, textured 3D assets. The system comprises two core components: the Hunyuan3D-DiT for shape generation and the Hunyuan3D-Paint for texture synthesis. We will explore the entire workflow, including data preparation, model architecture, training strategies, evaluation metrics, and deployment. By the conclusion of this tutorial, you will have the knowledge to finetune or develop a robust 3D generative model suitable for applications in gaming, virtual reality, and industrial design.

[98] Multimodal Large Language Models for Medical Report Generation via Customized Prompt Tuning

Chunlei Li,Jingyang Hou,Yilei Shi,Jingliang Hu,Xiao Xiang Zhu,Lichao Mou

Main category: cs.CV

TL;DR: The paper introduces MRG-LLM, a novel multimodal large language model that combines a frozen LLM with a learnable visual encoder and introduces a dynamic prompt customization mechanism to generate medical reports from imaging data. It achieves state-of-the-art performance on IU X-ray and MIMIC-CXR datasets.

Details Motivation: Medical report generation from imaging data remains a challenging task in clinical practice, and although LLMs show great promise, their effective integration with medical imaging data still deserves in-depth exploration. Method: MRG-LLM is a multimodal large language model that combines a frozen LLM with a learnable visual encoder. It introduces a dynamic prompt customization mechanism through conditional affine transformations derived from visual features, proposing two implementations: prompt-wise and promptbook-wise customization. Result: Extensive experiments on IU X-ray and MIMIC-CXR datasets demonstrate that MRG-LLM achieves state-of-the-art performance in medical report generation. Conclusion: MRG-LLM effectively integrates LLMs with medical imaging data through its innovative dynamic prompt customization mechanism, achieving superior results in medical report generation. Abstract: Medical report generation from imaging data remains a challenging task in clinical practice. While large language models (LLMs) show great promise in addressing this challenge, their effective integration with medical imaging data still deserves in-depth exploration. In this paper, we present MRG-LLM, a novel multimodal large language model (MLLM) that combines a frozen LLM with a learnable visual encoder and introduces a dynamic prompt customization mechanism. Our key innovation lies in generating instance-specific prompts tailored to individual medical images through conditional affine transformations derived from visual features. We propose two implementations: prompt-wise and promptbook-wise customization, enabling precise and targeted report generation. Extensive experiments on IU X-ray and MIMIC-CXR datasets demonstrate that MRG-LLM achieves state-of-the-art performance in medical report generation. Our code will be made publicly available.
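The abstract describes instance-specific prompts produced by conditional affine transformations derived from visual features. A minimal sketch of that idea follows, assuming the scale and shift are each a single linear map of the visual feature vector and are applied elementwise to a base prompt embedding. The layer shapes, the helper `linear`, and the function `customize_prompt` are illustrative assumptions, not the MRG-LLM implementation.

```python
def linear(weights, bias, x):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def customize_prompt(prompt, visual, w_gamma, b_gamma, w_beta, b_beta):
    """Return gamma(visual) * prompt + beta(visual), applied elementwise.

    prompt: base prompt embedding (list of floats).
    visual: visual feature vector for one image (list of floats).
    """
    gamma = linear(w_gamma, b_gamma, visual)  # per-dimension scale
    beta = linear(w_beta, b_beta, visual)     # per-dimension shift
    return [g * p + b for g, p, b in zip(gamma, prompt, beta)]
```

Because only the small scale and shift maps depend on the image, the same base prompt can be adapted per instance while the LLM itself stays frozen, which matches the setup the abstract describes.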

[99] GenHOI: Generalizing Text-driven 4D Human-Object Interaction Synthesis for Unseen Objects

Shujia Li,Haiyu Zhang,Xinyuan Chen,Yaohui Wang,Yutong Ban

Main category: cs.CV

TL;DR: This paper introduces GenHOI, a two-stage framework for 4D human-object interaction synthesis that generalizes to unseen objects and generates high-fidelity sequences. It uses Object-AnchorNet for sparse 3D keyframe reconstruction and Contact-Aware Diffusion Model for interpolation into dense 4D sequences.

Details Motivation: Existing diffusion models and large-scale motion datasets have advanced text-driven human motion synthesis but face challenges in extending these advances to 4D human-object interactions due to the limited availability of large-scale 4D HOI datasets. Method: The framework consists of two stages: 1) Object-AnchorNet reconstructs sparse 3D HOI keyframes for unseen objects using only 3D HOI datasets; 2) Contact-Aware Diffusion Model interpolates these keyframes into dense 4D HOI sequences with a novel Contact-Aware Encoder and Contact-Aware HOI Attention. Result: GenHOI achieves state-of-the-art results on OMOMO and 3D-FUTURE datasets, demonstrating strong generalization abilities to unseen objects and enabling high-fidelity 4D HOI generation. Conclusion: GenHOI successfully addresses the challenge of synthesizing 4D human-object interactions by generalizing to unseen objects and producing high-quality sequences, reducing reliance on large-scale 4D HOI datasets. Abstract: While diffusion models and large-scale motion datasets have advanced text-driven human motion synthesis, extending these advances to 4D human-object interaction (HOI) remains challenging, mainly due to the limited availability of large-scale 4D HOI datasets. In our study, we introduce GenHOI, a novel two-stage framework aimed at achieving two key objectives: 1) generalization to unseen objects and 2) the synthesis of high-fidelity 4D HOI sequences. In the initial stage of our framework, we employ an Object-AnchorNet to reconstruct sparse 3D HOI keyframes for unseen objects, learning solely from 3D HOI datasets, thereby mitigating the dependence on large-scale 4D HOI datasets. Subsequently, we introduce a Contact-Aware Diffusion Model (ContactDM) in the second stage to seamlessly interpolate sparse 3D HOI keyframes into densely temporally coherent 4D HOI sequences. 
To enhance the quality of generated 4D HOI sequences, we propose a novel Contact-Aware Encoder within ContactDM to extract human-object contact patterns and a novel Contact-Aware HOI Attention to effectively integrate the contact signals into diffusion models. Experimental results show that we achieve state-of-the-art results on the publicly available OMOMO and 3D-FUTURE datasets, demonstrating strong generalization abilities to unseen objects, while enabling high-fidelity 4D HOI generation.

[100] NTIRE 2025 Image Shadow Removal Challenge Report

Florin-Alexandru Vasluianu,Tim Seizinger,Zhuyun Zhou,Cailian Chen,Zongwei Wu,Radu Timofte,Mingjia Li,Jin Hu,Hainuo Wang,Hengxing Liu,Jiarui Wang,Qiming Hu,Xiaojie Guo,Xin Lu,Jiarong Yang,Yuanfei Bao,Anya Hu,Zihao Fan,Kunyu Wang,Jie Xiao,Xi Wang,Xueyang Fu,Zheng-Jun Zha,Yu-Fan Lin,Chia-Ming Lee,Chih-Chung Hsu,Xingbo Wang,Dong Li,Yuxu Chen,Bin Chen,Yuanbo Zhou,Yuanbin Chen,Hongwei Wang,Jiannan Lin,Qinquan Gao,Tong Tong,Zhao Zhang,Yanyan Wei,Wei Dong,Han Zhou,Seyed Amirreza Mousavi,Jun Chen,Haobo Liang,Jiajie Jing,Junyu Li,Yan Yang,Seoyeon Lee,Chaewon Kim,Ziyu Feng,Shidi Chen,Bowen Luan,Zewen Chen,Vijayalaxmi Ashok Aralikatti,G Gyaneshwar Rao,Nikhil Akalwadi,Chaitra Desai,Ramesh Ashok Tabib,Uma Mudenagudi,Anas M. Ali,Bilel Benjdira,Wadii Boulila,Alexandru Brateanu,Cosmin Ancuti,Tanmay Chaturvedi,Manish Kumar,Anmol Srivastav,Daksh Trivedi,Shashwat Thakur,Kishor Upla,Zeyu Xiao,Zhuoyuan Li,Boda Zhou,Shashank Shekhar,Kele Xu,Qisheng Xu,Zijian Gao,Tianjiao Wan,Suiyi Zhao,Bo Wang,Yan Luo,Mingshen Wang,Yilin Zhang

Main category: cs.CV

TL;DR: This paper explores the results of the NTIRE 2025 Shadow Removal Challenge, which had two evaluation tracks focusing on reconstruction fidelity and visual perception. 17 teams submitted solutions using the WSRD+ dataset.

Details Motivation: To advance shadow removal techniques by evaluating different approaches through a challenge with diverse objects, textures, and materials. Method: The challenge involved two evaluation tracks: one focusing on reconstruction fidelity and the other on visual perception through a user study, both using images from the WSRD+ dataset. Result: A total of 306 participants registered and 17 teams successfully submitted their solutions during the final evaluation phase. Conclusion: The NTIRE 2025 Shadow Removal Challenge successfully evaluated various methods for shadow removal, considering both reconstruction fidelity and visual perception. Abstract: This work examines the findings of the NTIRE 2025 Shadow Removal Challenge. A total of 306 participants have registered, with 17 teams successfully submitting their solutions during the final evaluation phase. Following the last two editions, this challenge had two evaluation tracks: one focusing on reconstruction fidelity and the other on visual perception through a user study. Both tracks were evaluated with images from the WSRD+ dataset, simulating interactions between self- and cast-shadows with a large number of diverse objects, textures, and materials.

[101] CLAIM: Clinically-Guided LGE Augmentation for Realistic and Diverse Myocardial Scar Synthesis and Segmentation

Farheen Ramzan,Yusuf Kiberu,Nikesh Jathanna,Shahnaz Jamil-Copley,Richard H. Clayton,Chen Chen

Main category: cs.CV

TL;DR: The paper introduces CLAIM, a framework using the SMILE module for realistic myocardial scar synthesis and segmentation from LGE cardiac MRI, which enhances both synthesized scar realism and segmentation accuracy.

Details Motivation: The limited availability and variability of LGE images with high-quality scar labels restricts the training of robust deep learning models for myocardial scar segmentation. Method: The method includes the CLAIM framework with its core SMILE module that uses a diffusion-based generator conditioned on the AHA 17-segment model for anatomically consistent scar synthesis. It also employs joint training for optimizing the scar segmentation network alongside the generator. Result: Experimental results demonstrate that CLAIM produces anatomically coherent scar patterns with higher Dice similarity to real scar distributions compared to baseline models. Conclusion: CLAIM enables controllable and realistic myocardial scar synthesis and shows utility for downstream medical imaging tasks. Abstract: Deep learning-based myocardial scar segmentation from late gadolinium enhancement (LGE) cardiac MRI has shown great potential for accurate and timely diagnosis and treatment planning for structural cardiac diseases. However, the limited availability and variability of LGE images with high-quality scar labels restrict the development of robust segmentation models. To address this, we introduce CLAIM: Clinically-Guided LGE Augmentation for Realistic and Diverse Myocardial Scar Synthesis and Segmentation, a framework for anatomically grounded scar generation and segmentation. At its core is the SMILE module (Scar Mask generation guided by cLinical knowledgE), which conditions a diffusion-based generator on the clinically adopted AHA 17-segment model to synthesize images with anatomically consistent and spatially diverse scar patterns. In addition, CLAIM employs a joint training strategy in which the scar segmentation network is optimized alongside the generator, aiming to enhance both the realism of synthesized scars and the accuracy of the scar segmentation performance. 
Experimental results show that CLAIM produces anatomically coherent scar patterns and achieves higher Dice similarity with real scar distributions compared to baseline models. Our approach enables controllable and realistic myocardial scar synthesis and has demonstrated utility for downstream medical imaging tasks.
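For readers unfamiliar with the Dice similarity mentioned above, the following sketch computes the standard Dice coefficient between two binary masks given as flat lists. How the paper applies the metric to "scar distributions" is not specified in the abstract, so this only illustrates the metric itself; the function name `dice` is a placeholder.

```python
def dice(mask_a, mask_b):
    """Dice coefficient 2 * |A and B| / (|A| + |B|) for flat binary masks."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    size = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Two empty masks agree perfectly by convention.
    return 1.0 if size == 0 else 2.0 * intersection / size
```

For example, two four-pixel masks that each mark two pixels and share one of them score 2 * 1 / (2 + 2) = 0.5.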

[102] RaCalNet: Radar Calibration Network for Sparse-Supervised Metric Depth Estimation

Xingrui Qin,Wentao Zhao,Chuan Cao,Yihe Niu,Houcheng Jiang,Jingchuan Wang

Main category: cs.CV

TL;DR: RaCalNet is a novel framework that uses sparse LiDAR to supervise the learning of refined radar measurements, achieving dense metric depth estimation without needing dense LiDAR supervision. It recalibrates sparse radar points to construct accurate depth priors and guides monocular depth prediction, improving structural consistency and preserving fine details. Experiments show its effectiveness in reducing RMSE.

Details Motivation: The motivation of this paper is to address the cost and data-intensity issues associated with using dense LiDAR supervision for depth estimation from millimeter-wave radar and RGB images. By eliminating the need for dense supervision, the authors aim to create a more efficient and less resource-intensive method for depth estimation. Method: The proposed RaCalNet framework uses sparse LiDAR to supervise the learning of refined radar measurements, resulting in a supervision density of around 1%. It recalibrates and refines sparse radar points to construct accurate depth priors which serve as reliable anchors for guiding monocular depth prediction. Result: RaCalNet surpasses state-of-the-art dense-supervised methods by producing depth maps with clear object contours and fine-grained textures. Extensive experiments on the ZJU-4DRadarCam dataset and real-world deployment scenarios demonstrate its effectiveness, with reductions in RMSE by 35.30% and 34.89%, respectively. Conclusion: RaCalNet successfully eliminates the need for dense LiDAR supervision by using sparse LiDAR to guide the learning process, leading to improved structural consistency and preservation of fine details in depth maps. Abstract: Dense metric depth estimation using millimeter-wave radar typically requires dense LiDAR supervision, generated via multi-frame projection and interpolation, to guide the learning of accurate depth from sparse radar measurements and RGB images. However, this paradigm is both costly and data-intensive. To address this, we propose RaCalNet, a novel framework that eliminates the need for dense supervision by using sparse LiDAR to supervise the learning of refined radar measurements, resulting in a supervision density of merely around 1% compared to dense-supervised methods. 
Unlike previous approaches that associate radar points with broad image regions and rely heavily on dense labels, RaCalNet first recalibrates and refines sparse radar points to construct accurate depth priors. These priors then serve as reliable anchors to guide monocular depth prediction, enabling metric-scale estimation without resorting to dense supervision. This design improves structural consistency and preserves fine details. Despite relying solely on sparse supervision, RaCalNet surpasses state-of-the-art dense-supervised methods, producing depth maps with clear object contours and fine-grained textures. Extensive experiments on the ZJU-4DRadarCam dataset and real-world deployment scenarios demonstrate its effectiveness, reducing RMSE by 35.30% and 34.89%, respectively.

[103] Control and Realism: Best of Both Worlds in Layout-to-Image without Training

Bonan Li,Yinhan Hu,Songhua Liu,Xinchao Wang

Main category: cs.CV

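TL;DR: WinWinLay is a new training-free method that uses a non-local attention energy function and an adaptive update strategy to address the imprecise localization and unrealistic artifacts of existing text-to-image diffusion models in layout-to-image generation, improving both control precision and realism.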

Details Motivation: Existing text-to-image diffusion models can generate complex scenes without training on specific data, but they suffer from imprecise localization and unrealistic artifacts. A new method is needed to improve the precision and realism of generated images. Method: WinWinLay proposes two key strategies: a Non-local Attention Energy Function and Adaptive Update. The former redistributes attention scores to overcome spatial distribution bias so that objects better conform to layout instructions; the latter introduces a Langevin dynamics-based adaptive update scheme that promotes in-domain updating while respecting layout constraints. Result: Extensive experiments show that WinWinLay excels at controlling element placement and achieving photorealistic visual fidelity, outperforming current state-of-the-art methods. Conclusion: Through these strategies, WinWinLay markedly improves the precision of object placement and the realism of images in layout-to-image generation, making it a leading method in this area. Abstract: Layout-to-Image generation aims to create complex scenes with precise control over the placement and arrangement of subjects. Existing works have demonstrated that pre-trained Text-to-Image diffusion models can achieve this goal without training on any specific data; however, they often face challenges with imprecise localization and unrealistic artifacts. Focusing on these drawbacks, we propose a novel training-free method, WinWinLay. At its core, WinWinLay presents two key strategies, Non-local Attention Energy Function and Adaptive Update, that collaboratively enhance control precision and realism. On one hand, we theoretically demonstrate that the commonly used attention energy function introduces inherent spatial distribution biases, hindering objects from being uniformly aligned with layout instructions. To overcome this issue, non-local attention prior is explored to redistribute attention scores, facilitating objects to better conform to the specified spatial conditions. On the other hand, we identify that the vanilla backpropagation update rule can cause deviations from the pre-trained domain, leading to out-of-distribution artifacts. We accordingly introduce a Langevin dynamics-based adaptive update scheme as a remedy that promotes in-domain updating while respecting layout constraints. Extensive experiments demonstrate that WinWinLay excels in controlling element placement and achieving photorealistic visual fidelity, outperforming the current state-of-the-art methods.
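The abstract refers to a Langevin dynamics-based update without giving its form. For orientation, the sketch below shows a generic unadjusted Langevin step on a toy quadratic energy: a gradient step on the energy plus Gaussian noise scaled by the square root of twice the step size. The toy energy, the fixed step size, and the function names are assumptions; WinWinLay's adaptive scheme and its attention energy are not reproduced.

```python
import math
import random


def langevin_step(x, grad_energy, step_size, rng):
    """One unadjusted Langevin step: x - step * grad E(x) + sqrt(2 * step) * noise."""
    grad = grad_energy(x)
    scale = math.sqrt(2.0 * step_size)
    return [
        xi - step_size * gi + scale * rng.gauss(0.0, 1.0)
        for xi, gi in zip(x, grad)
    ]


def toy_grad(x):
    """Gradient of the toy energy E(x) = 0.5 * ||x - target||^2 with target (1, -1)."""
    target = [1.0, -1.0]
    return [xi - ti for xi, ti in zip(x, target)]
```

Compared with a plain gradient step, the injected noise keeps the iterate sampling around low-energy regions rather than collapsing onto a single minimizer, which is the general property that makes Langevin-style updates attractive when staying in-distribution matters.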

[104] Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie,Zhenheng Yang,Mike Zheng Shou

Main category: cs.CV

TL;DR: This paper presents Show-o2, an improved multimodal model that uses autoregressive modeling and flow matching for effective understanding and generation of text, images, and videos.

Details Motivation: To enhance the capabilities of unified multimodal models in understanding and generating diverse forms of data including text, images, and videos. Method: Show-o2 leverages a 3D causal variational autoencoder space with dual-path spatial(-temporal) fusion for constructing unified visual representations. It applies autoregressive modeling to the language head for text token prediction and flow matching to the flow head for image/video generation. A two-stage training approach is used to effectively learn and scale the model. Result: The Show-o2 models show versatility in handling a wide range of multimodal tasks across different modalities, indicating successful integration and scalability. Conclusion: Show-o2 demonstrates strong performance in multimodal understanding and generation tasks, and the associated code and models have been made publicly available. Abstract: This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.
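The abstract states that flow matching is applied to the flow head but does not give the objective. One common formulation, assumed here purely for illustration, interpolates linearly between a noise sample and a data sample and regresses the constant velocity between them. The function names are hypothetical, and this should not be read as Show-o2's exact loss.

```python
def flow_matching_pair(noise, data, t):
    """Return the interpolated point x_t and the target velocity for t in [0, 1]."""
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    diffs = [(p - v) ** 2 for p, v in zip(predicted_velocity, target_velocity)]
    return sum(diffs) / len(diffs)
```

In training under this formulation, a network would receive `x_t` and `t`, predict a velocity, and be optimized with `flow_matching_loss` against the target returned by `flow_matching_pair`.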

[105] Baltimore Atlas: FreqWeaver Adapter for Semi-supervised Ultra-high Spatial Resolution Land Cover Classification

Junhao Wu,Aboagye-Ntow Stephen,Chuyuan Wang,Gang Chen,Xin Huang

Main category: cs.CV

TL;DR: An ultra-high spatial resolution land cover classification method is proposed, which uses a semi-supervised segmentation framework for 0.3 m spatial resolution imagery, introducing FreqWeaver Adapter to improve detail modeling with low parameter overhead. It leverages unlabeled data and achieves better results than existing methods.

Details Motivation: Existing methods focus on 1-meter spatial resolution imagery and rely heavily on annotated data, which is impractical for higher-resolution imagery under weak supervision. Method: A parameter-efficient semi-supervised segmentation framework for 0.3 m spatial resolution imagery is proposed, leveraging the knowledge of SAM2 and introducing a remote sensing-specific FreqWeaver Adapter to enhance fine-grained detail modeling while maintaining a lightweight design. Result: The proposed method delivers robust segmentation results with superior structural consistency, achieving a 1.78% improvement over existing parameter-efficient tuning strategies and a 3.44% gain compared to state-of-the-art high-resolution remote sensing segmentation approaches. Conclusion: The parameter-efficient semi-supervised segmentation framework with FreqWeaver Adapter effectively leverages unlabeled data and maintains minimal parameter overhead, providing robust segmentation results for ultra-high spatial resolution land cover classification. Abstract: Ultra-high Spatial Resolution Land Cover Classification is essential for fine-grained land cover analysis, yet it remains challenging due to the high cost of pixel-level annotations, significant scale variation, and the limited adaptability of large-scale vision models. Existing methods typically focus on 1-meter spatial resolution imagery and rely heavily on annotated data, whereas practical applications often require processing higher-resolution imagery under weak supervision. To address this, we propose a parameter-efficient semi-supervised segmentation framework for 0.3 m spatial resolution imagery, which leverages the knowledge of SAM2 and introduces a remote sensing-specific FreqWeaver Adapter to enhance fine-grained detail modeling while maintaining a lightweight design at only 5.96% of the total model parameters. 
By effectively leveraging unlabeled data and maintaining minimal parameter overhead, the proposed method delivers robust segmentation results with superior structural consistency, achieving a 1.78% improvement over existing parameter-efficient tuning strategies and a 3.44% gain compared to state-of-the-art high-resolution remote sensing segmentation approaches.

[106] A Unified Graph-based Framework for Scalable 3D Tree Reconstruction and Non-Destructive Biomass Estimation from Point Clouds

Di Wang,Shi Li

Main category: cs.CV

TL;DR: The paper presents a novel framework for large-scale, non-destructive estimation of forest above-ground biomass (AGB) using an innovative graph-based pipeline. This approach integrates tree segmentation, leaf-wood separation and 3D skeletal reconstruction through dedicated graph operations. Experimental results demonstrate strong performance under challenging conditions and significantly reduce dependency on specialized pre-processing tools.

Details Motivation: Current Quantitative Structural Model (QSM) methods for estimating AGB are limited as they are designed for individual trees, depend on high-quality TLS data, and require multiple pre-processing steps which hinder scalability and practical deployment. Method: A unified framework that enables end-to-end processing of large-scale point clouds using an innovative graph-based pipeline. The framework integrates tree segmentation, leaf-wood separation and 3D skeletal reconstruction through dedicated graph operations including pathing and abstracting for tree topology reasoning. Result: Strong performance under challenging conditions, particularly in leaf-on scenarios (~20% relative error) and low-density ULS datasets with partial coverage (~30% relative error). Conclusion: The proposed framework provides a robust and scalable solution for large-scale, non-destructive AGB estimation. It significantly reduces dependency on specialized pre-processing tools and establishes ULS as a viable alternative to TLS. Abstract: Estimating forest above-ground biomass (AGB) is crucial for assessing carbon storage and supporting sustainable forest management. Quantitative Structural Model (QSM) offers a non-destructive approach to AGB estimation through 3D tree structural reconstruction. However, current QSM methods face significant limitations, as they are primarily designed for individual trees, depend on high-quality point cloud data from terrestrial laser scanning (TLS), and also require multiple pre-processing steps that hinder scalability and practical deployment. This study presents a novel unified framework that enables end-to-end processing of large-scale point clouds using an innovative graph-based pipeline. The proposed approach seamlessly integrates tree segmentation, leaf-wood separation and 3D skeletal reconstruction through dedicated graph operations including pathing and abstracting for tree topology reasoning. 
Comprehensive validation was conducted on datasets with varying leaf conditions (leaf-on and leaf-off), spatial scales (tree- and plot-level), and data sources (TLS and UAV-based laser scanning, ULS). Experimental results demonstrate strong performance under challenging conditions, particularly in leaf-on scenarios (~20% relative error) and low-density ULS datasets with partial coverage (~30% relative error). These findings indicate that the proposed framework provides a robust and scalable solution for large-scale, non-destructive AGB estimation. It significantly reduces dependency on specialized pre-processing tools and establishes ULS as a viable alternative to TLS. To our knowledge, this is the first method capable of enabling seamless, end-to-end 3D tree reconstruction at operational scales. This advancement substantially improves the feasibility of QSM-based AGB estimation, paving the way for broader applications in forest inventory and climate change research.

[107] One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution

Yujing Sun,Lingchen Sun,Shuaizheng Liu,Rongyuan Wu,Zhengqiang Zhang,Lei Zhang

Main category: cs.CV

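TL;DR: DLoRAL is a Dual LoRA Learning paradigm for real-world video super-resolution that alternates between optimizing temporal consistency and enhancing detail to achieve efficient, high-quality video restoration.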

Details Motivation: Existing stable diffusion (SD)-based real-world video super-resolution methods often sacrifice spatial detail to maintain temporal coherence, resulting in suboptimal visual quality. A method is needed that effectively extracts degradation-robust temporal consistency priors from the low-quality input video and enhances video details on that basis. Method: A Dual LoRA Learning (DLoRAL) paradigm is proposed to train an effective one-step diffusion model. It comprises: 1) a Cross-Frame Retrieval (CFR) module that aggregates complementary information across frames; 2) a Consistency-LoRA (C-LoRA) trained to learn robust temporal representations from degraded inputs; and 3) after consistency learning, a Detail-LoRA (D-LoRA) trained with the CFR and C-LoRA modules fixed, to enhance spatial details while preserving temporal coherence. The two phases alternate iteratively, and the two LoRA branches are finally merged into the SD model for efficient inference. Result: Experiments show that DLoRAL performs strongly in both accuracy and speed, enabling efficient, high-quality video restoration. Conclusion: By combining temporal and spatial optimization, the DLoRAL paradigm addresses the trade-off between detail and consistency in real-world video super-resolution and delivers strong performance. Abstract: It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic detail synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. 
Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.
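
The merging of two LoRA branches into the base model at inference time can be illustrated with a small sketch. It assumes the standard LoRA update W' = W + scale * (B @ A) and is not taken from the DLoRAL code; the names `c_lora` and `d_lora`, the scale values, and the tiny 2x2 matrices are hypothetical.

```python
# Minimal sketch: folding two LoRA branches into one base weight matrix.
# Assumes the standard LoRA update W' = W + scale * (B @ A); the branch
# names and scale values are hypothetical, not taken from DLoRAL.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def merge_lora(base, branches):
    """Return base + sum(scale * B @ A) over all (B, A, scale) branches."""
    merged = [row[:] for row in base]  # copy so the base weights stay untouched
    for b_mat, a_mat, scale in branches:
        delta = matmul(b_mat, a_mat)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += scale * value
    return merged

# 2x2 base weight with two rank-1 branches (B is 2x1, A is 1x2).
base = [[1.0, 0.0], [0.0, 1.0]]
c_lora = ([[1.0], [0.0]], [[0.0, 2.0]], 0.5)   # adds 0.5 * [[0, 2], [0, 0]]
d_lora = ([[0.0], [1.0]], [[4.0, 0.0]], 0.25)  # adds 0.25 * [[0, 0], [4, 0]]
merged = merge_lora(base, [c_lora, d_lora])
print(merged)
```

Once merged, a layer needs a single matrix multiply per forward pass, so the low-rank branches add no inference cost under this formulation.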

[108] Mono-Modalizing Extremely Heterogeneous Multi-Modal Medical Image Registration

Kyobin Choo,Hyunkyung Han,Jinyeong Kim,Chanyong Yoon,Seong Jae Hwang

Main category: cs.CV

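TL;DR: The paper proposes M2M-Reg, a framework that trains multi-modal DIR models using only mono-modal similarity to register highly heterogeneous modalities, and introduces the GradCyCon regularizer to promote diffeomorphism, achieving up to 2x higher DSC than prior methods on the ADNI dataset.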

Details Motivation: Because functional imaging modalities (e.g., PET and FA) are extremely heterogeneous relative to structural references (e.g., MRI, CT), conventional unsupervised DIR methods struggle to learn reliable mappings and often distort images; their similarity metrics fail to capture alignment between highly disparate modalities. Method: The M2M-Reg (Multi-to-Mono Registration) framework trains multi-modal DIR models using only mono-modal similarity while preserving the established architectural paradigm for seamless integration into existing models. The GradCyCon regularizer leverages M2M-Reg's cyclic training scheme to promote diffeomorphism. The framework extends naturally to a semi-supervised setting without requiring ground-truth transformations or segmentation masks. Result: Experiments on the ADNI dataset show that M2M-Reg achieves up to 2x higher DSC than prior methods for PET-MRI and FA-MRI registration, demonstrating its effectiveness for highly heterogeneous multi-modal DIR. Conclusion: By using only mono-modal similarity, M2M-Reg addresses registration between highly heterogeneous modalities, substantially improves registration accuracy, and extends naturally to semi-supervised settings, giving it clinical application potential. Abstract: In clinical practice, imaging modalities with functional characteristics, such as positron emission tomography (PET) and fractional anisotropy (FA), are often aligned with a structural reference (e.g., MRI, CT) for accurate interpretation or group analysis, necessitating multi-modal deformable image registration (DIR). However, due to the extreme heterogeneity of these modalities compared to standard structural scans, conventional unsupervised DIR methods struggle to learn reliable spatial mappings and often distort images. We find that the similarity metrics guiding these models fail to capture alignment between highly disparate modalities. To address this, we propose M2M-Reg (Multi-to-Mono Registration), a novel framework that trains multi-modal DIR models using only mono-modal similarity while preserving the established architectural paradigm for seamless integration into existing models. We also introduce GradCyCon, a regularizer that leverages M2M-Reg's cyclic training scheme to promote diffeomorphism. Furthermore, our framework naturally extends to a semi-supervised setting, integrating pre-aligned and unaligned pairs only, without requiring ground-truth transformations or segmentation masks. Experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset demonstrate that M2M-Reg achieves up to 2x higher DSC than prior methods for PET-MRI and FA-MRI registration, highlighting its effectiveness in handling highly heterogeneous multi-modal DIR. Our code is available at https://github.com/MICV-yonsei/M2M-Reg.
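
For readers unfamiliar with the DSC metric reported above, the following is a minimal sketch of the Dice similarity coefficient, DSC = 2|A ∩ B| / (|A| + |B|), on flattened binary masks. The mask values are made up for illustration, and returning 1.0 when both masks are empty is an assumed convention, not something the paper specifies.

```python
# Minimal sketch of the Dice similarity coefficient (DSC) on binary masks.
# Masks are flattened to 1D lists of 0/1; the example values are made up.

def dice_coefficient(mask_a, mask_b):
    """DSC = 2|A ∩ B| / (|A| + |B|) for two equal-length binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

warped_mask = [1, 1, 0, 0, 1, 0]  # segmentation carried along by the registration
fixed_mask = [1, 0, 0, 0, 1, 1]   # segmentation on the reference image
dsc = dice_coefficient(warped_mask, fixed_mask)
print(round(dsc, 4))
```

Here the masks share two voxels and each contains three, so the score is 4/6; a value of 1.0 would indicate perfect overlap after registration.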

[109] BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion

Yuqing Lan,Chenyang Zhu,Zhirui Gao,Jiazhao Zhang,Yihan Cao,Renjiao Yi,Yijie Wang,Kai Xu

Main category: cs.CV

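TL;DR: The paper proposes a reconstruction-free online framework for efficient 3D object detection that combines Cubify Anything with CLIP and fuses multi-view information through association and optimization modules, performing strongly on the ScanNetV2 and CA-1M datasets.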

Details Motivation: Open-vocabulary 3D object detection has important applications in autonomous driving and embodied AI, but existing methods rely on dense point cloud reconstruction, which imposes heavy computational overhead and memory constraints and hinders real-time deployment. Method: Given streaming RGB-D video input, the method uses the pre-trained Cubify Anything as the foundation model for single-view 3D object detection, coupled with CLIP to capture open-vocabulary semantics. Bounding boxes detected across views are fused by an association module (using 3D NMS and box correspondence matching) and an optimization module (an IoU-guided efficient random optimization technique based on particle filtering). Result: Extensive experiments on ScanNetV2 and CA-1M show that the method achieves state-of-the-art performance among online methods, generalizes well, and enables real-time perception even in environments exceeding 1000 square meters. Conclusion: The proposed reconstruction-free 3D object detection framework improves efficiency, achieves real-time operation, and applies to a variety of scenarios. Abstract: Open-vocabulary 3D object detection has gained significant interest due to its critical applications in autonomous driving and embodied AI. Existing detection methods, whether offline or online, typically rely on dense point cloud reconstruction, which imposes substantial computational overhead and memory constraints, hindering real-time deployment in downstream tasks. To address this, we propose a novel reconstruction-free online framework tailored for memory-efficient and real-time 3D detection. Specifically, given streaming posed RGB-D video input, we leverage Cubify Anything as a pre-trained visual foundation model (VFM) for single-view 3D object detection by bounding boxes, coupled with CLIP to capture open-vocabulary semantics of detected objects. To fuse all detected bounding boxes across different views into a unified one, we employ an association module for correspondences of multi-views and an optimization module to fuse the 3D bounding boxes of the same instance predicted in multi-views. The association module utilizes 3D Non-Maximum Suppression (NMS) and a box correspondence matching module, while the optimization module uses an IoU-guided efficient random optimization technique based on particle filtering to enforce multi-view consistency of the 3D bounding boxes while minimizing computational complexity. Extensive experiments on ScanNetV2 and CA-1M datasets demonstrate that our method achieves state-of-the-art performance among online methods. 
Benefiting from this novel reconstruction-free paradigm for 3D object detection, our method exhibits great generalization abilities in various scenarios, enabling real-time perception even in environments exceeding 1000 square meters.
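
The 3D NMS step used by the association module can be sketched as greedy suppression over 3D IoU. This is a simplified illustration rather than the authors' implementation: it assumes axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax), whereas the paper's boxes may be oriented, and the 0.5 threshold and example boxes are hypothetical.

```python
# Minimal sketch of greedy 3D non-maximum suppression (NMS).
# Assumption: axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax);
# the threshold and example boxes are illustrative only.

def iou_3d(a, b):
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for k in range(3):
        overlap = min(a[k + 3], b[k + 3]) - max(a[k], b[k])
        if overlap <= 0:
            return 0.0  # boxes are disjoint along this axis
        inter *= overlap
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)

def nms_3d(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any that overlap a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou_3d(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept

boxes = [
    (0.0, 0.0, 0.0, 2.0, 2.0, 2.0),  # detection from view 1
    (0.5, 0.0, 0.0, 2.5, 2.0, 2.0),  # same object seen from view 2 (IoU 0.6)
    (5.0, 5.0, 5.0, 6.0, 6.0, 6.0),  # a different object
]
scores = [0.9, 0.8, 0.7]
kept = nms_3d(boxes, scores)
print(kept)
```

In this example the second box overlaps the first with an IoU of 0.6 and is suppressed, leaving indices 0 and 2.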

[110] HOIDiNi: Human-Object Interaction through Diffusion Noise Optimization

Roey Ron,Guy Tevet,Haim Sawdayee,Amit H. Bermano

Main category: cs.CV

TL;DR: The paper introduces HOIDiNi, a text-driven diffusion framework for generating realistic human-object interactions (HOIs). It uses Diffusion Noise Optimization (DNO) to achieve both realism and physical correctness by separating the process into object-centric and human-centric phases. Evaluations on the GRAB dataset show superior performance in contact accuracy, physical validity, and overall quality.

Details Motivation: Generating realistic and plausible human-object interactions is extremely challenging due to the need for strict contact accuracies alongside diverse motion manifold. Current methods trade off between realism and physical correctness, prompting the need for a solution that achieves both. Method: HOIDiNi optimizes directly in the noise space of a pretrained diffusion model using Diffusion Noise Optimization (DNO). The problem is separated into two phases: an object-centric phase for discrete choices of hand-object contact locations, and a human-centric phase that refines full-body motion based on these choices. Result: Quantitative, qualitative, and subjective evaluations on the GRAB dataset indicate that HOIDiNi outperforms prior works and baselines in contact accuracy, physical validity, and overall quality. The framework can generate complex, controllable interactions such as grasping, placing, and full-body coordination driven by textual prompts. Conclusion: HOIDiNi presents a novel approach to HOI generation that achieves both realism and physical correctness through its structured two-phase method, demonstrating superior performance on the GRAB dataset. Abstract: We present HOIDiNi, a text-driven diffusion framework for synthesizing realistic and plausible human-object interaction (HOI). HOI generation is extremely challenging since it induces strict contact accuracies alongside a diverse motion manifold. While current literature trades off between realism and physical correctness, HOIDiNi optimizes directly in the noise space of a pretrained diffusion model using Diffusion Noise Optimization (DNO), achieving both. This is made feasible thanks to our observation that the problem can be separated into two phases: an object-centric phase, primarily making discrete choices of hand-object contact locations, and a human-centric phase that refines the full-body motion to realize this blueprint. 
This structured approach allows for precise hand-object contact without compromising motion naturalness. Quantitative, qualitative, and subjective evaluations on the GRAB dataset alone clearly indicate HOIDiNi outperforms prior works and baselines in contact accuracy, physical validity, and overall quality. Our results demonstrate the ability to generate complex, controllable interactions, including grasping, placing, and full-body coordination, driven solely by textual prompts. Project page: https://hoidini.github.io.

[111] FindingDory: A Benchmark to Evaluate Memory in Embodied Agents

Karmesh Yadav,Yusuf Ali,Gunshi Gupta,Yarin Gal,Zsolt Kira

Main category: cs.CV

TL;DR: To evaluate long-term memory in embodied tasks, the paper introduces a new benchmark and accompanying baseline models.

Details Motivation: Current vision-language models are limited in handling long-term memory tasks, especially in embodied settings that involve large volumes of image data. Method: A new benchmark in the Habitat simulator is introduced, comprising 60 tasks that require sustained engagement and contextual awareness, with procedurally extensible versions. Baselines that combine state-of-the-art VLMs with low-level navigation policies are also presented to assess performance. Result: The benchmark and baselines expose the shortcomings of existing models on memory-intensive tasks and point to areas for improvement. Conclusion: The new benchmark provides an important platform for studying memory and reasoning in embodied agents; future work should focus on improving models' ability to handle long-term memory. Abstract: Large vision-language models have recently demonstrated impressive performance in planning and control tasks, driving interest in their application to real-world robotics. However, deploying these models for reasoning in embodied contexts is limited by their ability to incorporate long-term experience collected across multiple days and represented by vast collections of images. Current VLMs typically struggle to process more than a few hundred images concurrently, highlighting the need for more efficient mechanisms to handle long-term memory in embodied settings. To effectively evaluate these models for long-horizon control, a benchmark must specifically target scenarios where memory is crucial for success. Existing long-video QA benchmarks overlook embodied challenges like object manipulation and navigation, which demand low-level skills and fine-grained reasoning over past interactions. Moreover, effective memory integration in embodied agents involves both recalling relevant historical information and executing actions based on that information, making it essential to study these aspects together rather than in isolation. In this work, we introduce a new benchmark for long-range embodied tasks in the Habitat simulator. This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness in an environment. The tasks can also be procedurally extended to longer and more challenging versions, enabling scalable evaluation of memory and reasoning. 
We also present baselines that integrate state-of-the-art VLMs with low-level navigation policies, assessing their performance on these memory-intensive tasks and highlighting areas for improvement.

[112] Demystifying the Visual Quality Paradox in Multimodal Large Language Models

Shuo Xing,Lanqing Guo,Hongyuan Hua,Seoyoung Lee,Peiran Li,Yufei Wang,Zhangyang Wang,Zhengzhong Tu

Main category: cs.CV

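TL;DR: The study finds that the visual quality of input images affects multimodal large language models (MLLMs) in complex ways. Images that humans perceive as higher quality do not necessarily improve model understanding, but the proposed VQ-TTT method dynamically adjusts input images to match model preferences and thereby improves task performance.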

Details Motivation: Little is known about how input image quality affects MLLM responses, so the effect of visual quality on these models needs systematic study. Method: Leading MLLMs are evaluated on multiple vision-language benchmarks after applying controlled degradations and stylistic shifts to the images. The proposed VQ-TTT method adapts inputs to task-specific model preferences by inserting a learnable low-rank kernel and fine-tuning shallow vision-encoder layers via LoRA. Result: VQ-TTT significantly improves average accuracy without external models, cached features, or extra training data. Conclusion: The findings redefine what counts as a "better" visual input for MLLMs and emphasize adaptive imagery over universally clean imagery. Abstract: Recent Multimodal Large Language Models (MLLMs) excel on benchmark vision-language tasks, yet little is known about how input visual quality shapes their responses. Does higher perceptual quality of images already translate to better MLLM understanding? We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks, applying controlled degradations and stylistic shifts to each image. Surprisingly, we uncover a visual-quality paradox: model, task, and even individual-instance performance can improve when images deviate from human-perceived fidelity. Off-the-shelf restoration pipelines fail to reconcile these idiosyncratic preferences. To close the gap, we introduce Visual-Quality Test-Time Tuning (VQ-TTT), a lightweight adaptation module that: (1) inserts a learnable, low-rank kernel before the frozen vision encoder to modulate frequency content; and (2) fine-tunes only shallow vision-encoder layers via LoRA. VQ-TTT dynamically adjusts each input image in a single forward pass, aligning it with task-specific model preferences. Across the evaluated MLLMs and all datasets, VQ-TTT significantly lifts average accuracy, with no external models, cached features, or extra training data. These findings redefine "better" visual inputs for MLLMs and highlight the need for adaptive, rather than universally "clean", imagery, in the new era of AI being the main data customer.

[113] Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning

Ankan Deria,Adinath Madhavrao Dukre,Feilong Tang,Sara Atito,Sudipta Roy,Muhammad Awais,Muhammad Haris Khan,Imran Razzak

Main category: cs.CV

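TL;DR: The paper proposes ViMaR, a two-stage inference framework that combines a temporal-difference value model with a margin-aware reward adjustment to significantly improve the reliability, accuracy, and efficiency of captions generated by vision-language models. Experiments show strong performance across multiple VLM architectures and good cross-model generalization. In addition, self-training on ViMaR-generated captions further improves model performance.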

Details Motivation: Despite significant advances in inference-time search for vision-language models, existing approaches remain computationally expensive and prone to low-confidence generations and persistent hallucinations. Method: ViMaR is a two-stage inference framework: the first stage identifies the highest-value caption among diverse candidates in a single pass; the second stage selectively refines segments that were overlooked or weakly grounded visually. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. Result: Experiments show that ViMaR generates captions that are more reliable, accurate, detailed, and explanatory across multiple VLM architectures, with over 4x speedup compared to existing value-guided methods. It also generalizes well to a stronger unseen model. In addition, self-training on ViMaR-generated captions substantially improves performance on a range of visual comprehension benchmarks. Conclusion: As a scalable and transferable inference-time decoding strategy, ViMaR demonstrates flexibility and modularity, showing potential for fast, accurate, and self-improving vision-language model pipelines. Abstract: Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations which often lead to persistent hallucinations. We introduce Value-guided Inference with Margin-based Reward (ViMaR), a two-stage inference framework that improves both efficiency and output fidelity by combining a temporal-difference value model with a margin-aware reward adjustment. In the first stage, we perform a single pass to identify the highest-value caption among diverse candidates. In the second stage, we selectively refine only those segments that were overlooked or exhibit weak visual grounding, thereby eliminating frequently rewarded evaluations. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. Extensive experiments across multiple VLM architectures demonstrate that ViMaR generates captions that are significantly more reliable, factually accurate, detailed, and explanatory, while achieving over 4x speedup compared to existing value-guided methods. Specifically, we show that ViMaR trained solely on LLaVA Mistral-7B generalizes effectively to guide decoding in a stronger unseen model. To further validate this, we adapt ViMaR to steer generation in LLaVA-OneVision-Qwen2-7B, leading to consistent improvements in caption quality and demonstrating robust cross-model guidance. 
This cross-model generalization highlights ViMaR's flexibility and modularity, positioning it as a scalable and transferable inference-time decoding strategy. Furthermore, when ViMaR-generated captions are used for self-training, the underlying models achieve substantial gains across a broad suite of visual comprehension benchmarks, underscoring the potential of fast, accurate, and self-improving VLM pipelines.

[114] UniRelight: Learning Joint Decomposition and Synthesis for Video Relighting

Kai He,Ruofan Liang,Jacob Munkberg,Jon Hasselgren,Nandita Vijaykumar,Alexander Keller,Sanja Fidler,Igor Gilitschenski,Zan Gojcic,Zian Wang

Main category: cs.CV

TL;DR: This paper presents a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in one pass using video diffusion models, trained on synthetic and real-world data to achieve strong generalization and surpass previous methods.

Details Motivation: Existing end-to-end relighting models face limitations due to scarce paired multi-illumination data, while two-stage pipelines suffer from error accumulation and inability to produce realistic outputs under complex conditions. Method: The method involves a single-pass joint estimation of albedo and synthesis of relit outputs using the generative capabilities of video diffusion models, trained on both synthetic multi-illumination data and extensive automatically labeled real-world videos. Result: The model shows strong generalization across diverse domains and outperforms previous methods in visual fidelity and temporal consistency. Conclusion: This general-purpose approach enhances implicit scene comprehension and creates realistic lighting effects and material interactions. Abstract: We address the challenge of relighting a single image or video, a task that demands precise scene intrinsic understanding and high-quality light transport synthesis. Existing end-to-end relighting models are often limited by the scarcity of paired multi-illumination data, restricting their ability to generalize across diverse scenes. Conversely, two-stage pipelines that combine inverse and forward rendering can mitigate data requirements but are susceptible to error accumulation and often fail to produce realistic outputs under complex lighting conditions or with sophisticated materials. In this work, we introduce a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in a single pass, harnessing the generative capabilities of video diffusion models. This joint formulation enhances implicit scene comprehension and facilitates the creation of realistic lighting effects and intricate material interactions, such as shadows, reflections, and transparency. 
Trained on synthetic multi-illumination data and extensive automatically labeled real-world videos, our model demonstrates strong generalization across diverse domains and surpasses previous methods in both visual fidelity and temporal consistency.

[115] Sekai: A Video Dataset towards World Exploration

Zhen Li,Chuanhao Li,Xiaofeng Mao,Shaoheng Lin,Ming Li,Shitian Zhao,Zhaopan Xu,Xinyue Li,Yukang Feng,Jianwen Sun,Zizhen Li,Fanrui Zhang,Jiaxin Ai,Zhixiang Wang,Yuwei Wu,Tong He,Jiangmiao Pang,Yu Qiao,Yunde Jia,Kaipeng Zhang

Main category: cs.CV

TL;DR: Sekai is a new video dataset comprising over 5,000 hours of first-person view videos from over 100 countries and regions, richly annotated for world exploration. A subset of the dataset is used to train YUME, an interactive video world exploration model.

Details Motivation: Existing video generation datasets are not well-suited for world exploration training due to their limitations such as limited locations, short duration, static scenes, and lack of annotations about exploration and the world. Method: Introduced Sekai, a high-quality first-person view worldwide video dataset with rich annotations for world exploration. Developed an efficient and effective toolbox to collect, pre-process and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Result: Experiments demonstrate the quality of the dataset. An interactive video world exploration model named YUME is successfully trained using a subset of this dataset. Conclusion: Sekai will benefit the area of video generation and world exploration, and motivate valuable applications. Abstract: Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well-suited for world exploration training as they suffer from some limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset. We also use a subset to train an interactive video world exploration model, named YUME (meaning "dream" in Japanese). We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications.

[116] Evolutionary Caching to Accelerate Your Off-the-Shelf Diffusion Model

Anirud Aggarwal,Abhinav Shrivastava,Matthew Gwilliam

Main category: cs.CV

TL;DR: Evolutionary Caching to Accelerate Diffusion models (ECAD) is a genetic algorithm that learns efficient caching schedules for diffusion models, providing significant inference speedups and fine-grained control over quality-latency trade-offs without modifying network parameters or reference images.

Details Motivation: Diffusion-based image generation models produce high-quality synthetic content but suffer from slow and computationally expensive inference. Existing methods to accelerate these models using caching often rely on rigid heuristics that result in limited acceleration or poor generalization across architectures. Method: The method proposed is ECAD, which uses a genetic algorithm to learn efficient, per-model caching schedules forming a Pareto frontier. This is achieved using only a small set of calibration prompts and requires no modifications to network parameters or reference images. It offers significant inference speedups, enables fine-grained control over the quality-latency trade-off, and adapts seamlessly to different diffusion models. Result: ECAD was evaluated on PixArt-alpha, PixArt-Sigma, and FLUX-1.dev using multiple metrics across diverse benchmarks. On PixArt-alpha, ECAD outperformed the previous state-of-the-art method by 4.47 COCO FID while increasing inference speedup from 2.35x to 2.58x. Conclusion: ECAD establishes itself as a scalable and generalizable approach for accelerating diffusion inference, offering significant improvements over previous approaches. Abstract: Diffusion-based image generation models excel at producing high-quality synthetic content, but suffer from slow and computationally expensive inference. Prior work has attempted to mitigate this by caching and reusing features within diffusion transformers across inference steps. These methods, however, often rely on rigid heuristics that result in limited acceleration or poor generalization across architectures. We propose Evolutionary Caching to Accelerate Diffusion models (ECAD), a genetic algorithm that learns efficient, per-model, caching schedules forming a Pareto frontier, using only a small set of calibration prompts. ECAD requires no modifications to network parameters or reference images. 
It offers significant inference speedups, enables fine-grained control over the quality-latency trade-off, and adapts seamlessly to different diffusion models. Notably, ECAD's learned schedules can generalize effectively to resolutions and model variants not seen during calibration. We evaluate ECAD on PixArt-alpha, PixArt-Sigma, and FLUX-1.dev using multiple metrics (FID, CLIP, Image Reward) across diverse benchmarks (COCO, MJHQ-30k, PartiPrompts), demonstrating consistent improvements over previous approaches. On PixArt-alpha, ECAD identifies a schedule that outperforms the previous state-of-the-art method by 4.47 COCO FID while increasing inference speedup from 2.35x to 2.58x. Our results establish ECAD as a scalable and generalizable approach for accelerating diffusion inference. Our project website is available at https://aniaggarwal.github.io/ecad and our code is available at https://github.com/aniaggarwal/ecad.
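
The notion of caching schedules "forming a Pareto frontier" can be illustrated with a small filter over (latency, FID) pairs, where lower is better on both axes. This sketch covers only the non-dominated selection step, not the genetic search itself, and the candidate numbers are invented for illustration; they are not results from the paper.

```python
# Minimal sketch: selecting the Pareto frontier among candidate schedules.
# Each candidate is (latency, fid), lower is better for both; the numbers
# below are invented for illustration and are not results from ECAD.

def pareto_front(candidates):
    """Return candidates not dominated by any other, sorted by latency."""
    front = []
    for i, (lat_i, fid_i) in enumerate(candidates):
        dominated = any(
            lat_j <= lat_i and fid_j <= fid_i and (lat_j < lat_i or fid_j < fid_i)
            for j, (lat_j, fid_j) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append((lat_i, fid_i))
    return sorted(front)

schedules = [
    (1.0, 20.0),  # slowest candidate, best quality
    (0.5, 25.0),
    (0.6, 24.0),
    (0.7, 26.0),  # dominated: (0.5, 25.0) is both faster and better
    (0.4, 30.0),  # fastest candidate, worst quality
]
front = pareto_front(schedules)
print(front)
```

Each point left on the frontier represents a different quality-latency trade-off, which is what allows a user to pick a schedule that matches a target speedup.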