cs.CL

[1] Adverse Event Extraction from Discharge Summaries: A New Dataset, Annotation Scheme, and Initial Findings

Imane Guellil,Salomé Andres,Atul Anand,Bruce Guthrie,Huayu Zhang,Abul Hasan,Honghan Wu,Beatrice Alex

Main category: cs.CL

TL;DR: The paper presents a manually annotated corpus for Adverse Event (AE) extraction from discharge summaries of elderly patients, a population underrepresented in clinical NLP resources. The dataset covers 14 clinically significant AEs, and multiple models are evaluated with FlairNLP across three annotation granularities.

Details

Motivation: To address the lack of clinical NLP resources focused on elderly patients and to provide a robust benchmark for evaluating AE extraction methods. Method: The authors developed a manually annotated corpus including 14 clinically significant AEs with contextual attributes. They evaluated multiple models using FlairNLP across three annotation granularities: fine-grained, coarse-grained, and coarse-grained with negation. Result: Transformer-based models performed well on document-level coarse-grained extraction (F1 = 0.943), but performance dropped significantly for fine-grained entity-level tasks (e.g., F1 = 0.675). Challenges remain in detecting rare events and complex attributes. Conclusion: Despite high-level scores, there are still significant challenges in detecting underrepresented AEs and capturing nuanced clinical language. The dataset serves as a robust benchmark for future research. Abstract: In this work, we present a manually annotated corpus for Adverse Event (AE) extraction from discharge summaries of elderly patients, a population often underrepresented in clinical NLP resources. The dataset includes 14 clinically significant AEs, such as falls, delirium, and intracranial haemorrhage, along with contextual attributes like negation, diagnosis type, and in-hospital occurrence. Uniquely, the annotation schema supports both discontinuous and overlapping entities, addressing challenges rarely tackled in prior work. We evaluate multiple models using FlairNLP across three annotation granularities: fine-grained, coarse-grained, and coarse-grained with negation. While transformer-based models (e.g., BERT-cased) achieve strong performance on document-level coarse-grained extraction (F1 = 0.943), performance drops notably for fine-grained entity-level tasks (e.g., F1 = 0.675), particularly for rare events and complex attributes.
These results demonstrate that despite high-level scores, significant challenges remain in detecting underrepresented AEs and capturing nuanced clinical language. Developed within a Trusted Research Environment (TRE), the dataset is available upon request via DataLoch and serves as a robust benchmark for evaluating AE extraction methods and supporting future cross-dataset generalisation.
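The gap between document-level and entity-level scores reported above can be illustrated with a small sketch. It assumes exact-match scoring of spans and a hypothetical `(doc_id, start, end, label)` span layout; the paper's actual evaluation protocol (for example, how discontinuous or overlapping entities are matched) is not described here and may differ.

```python
from typing import Set, Tuple

# Hypothetical span layout: (doc_id, start_offset, end_offset, label).
Span = Tuple[str, int, int, str]


def f1(gold: set, pred: set) -> float:
    """Micro-averaged F1 over two sets of items, using exact set membership."""
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def to_document_level(spans: Set[Span]) -> set:
    """Collapse spans to (doc_id, label) pairs: does the document mention the AE?"""
    return {(doc_id, label) for doc_id, _, _, label in spans}


# Invented toy annotations: the second prediction is off by one character.
gold = {("d1", 10, 14, "fall"), ("d1", 40, 48, "delirium"), ("d2", 5, 9, "fall")}
pred = {("d1", 10, 14, "fall"), ("d1", 41, 48, "delirium"), ("d2", 5, 9, "fall")}

entity_f1 = f1(gold, pred)
document_f1 = f1(to_document_level(gold), to_document_level(pred))
print(f"entity-level F1:   {entity_f1:.3f}")
print(f"document-level F1: {document_f1:.3f}")
```

A single boundary error lowers the entity-level score while leaving the document-level score untouched, which is one plausible reason fine-grained scores trail coarse-grained ones.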

[2] Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction

Marija Šakota,Robert West

Main category: cs.CL

TL;DR: Boosted Constrained Decoding (BoostCD) combines constrained and unconstrained decoding to improve the quality of structured NLP tasks.

Details

Motivation: Many recent approaches use an autoregressive language model to map unstructured input text to output text representing structured objects, but can lead to low-quality output during constrained decoding at test time. Method: BoostCD decodes from the base model twice, in constrained and unconstrained mode, obtaining two weak predictions. Then a learned autoregressive boosted model combines the two weak predictions into one final prediction. Result: BoostCD outperforms prior approaches both in and out of distribution, addressing several common errors identified in those approaches. Conclusion: BoostCD is a powerful method for structured NLP tasks that improves the quality of output during constrained decoding. Abstract: Many recent approaches to structured NLP tasks use an autoregressive language model $M$ to map unstructured input text $x$ to output text $y$ representing structured objects (such as tuples, lists, trees, code, etc.), where the desired output structure is enforced via constrained decoding. During training, these approaches do not require the model to be aware of the constraints, which are merely implicit in the training outputs $y$. This is advantageous as it allows for dynamic constraints without requiring retraining, but can lead to low-quality output during constrained decoding at test time. We overcome this problem with Boosted Constrained Decoding (BoostCD), which combines constrained and unconstrained decoding in two phases: Phase 1 decodes from the base model $M$ twice, in constrained and unconstrained mode, obtaining two weak predictions. In phase 2, a learned autoregressive boosted model combines the two weak predictions into one final prediction. The mistakes made by the base model with vs. without constraints tend to be complementary, which the boosted model learns to exploit for improved performance. We demonstrate the power of BoostCD by applying it to closed information extraction. 
Our model, BoostIE, outperforms prior approaches both in and out of distribution, addressing several common errors identified in those approaches.

[3] CrEst: Credibility Estimation for Contexts in LLMs via Weak Supervision

Dyah Adila,Shuai Zhang,Boran Han,Bonan Min,Yuyang Wang

Main category: cs.CL

TL;DR: The paper introduces CrEst, a weakly supervised framework to assess context document credibility in LLMs without manual annotations. It proposes two integration strategies and shows significant performance improvements in experiments.

Details

Motivation: To address the challenge of varying credibility in context documents used by LLMs, which can lead to unreliable information propagation. Method: CrEst is grounded in the idea that credible documents have higher semantic coherence with other credible documents. It uses inter-document agreement for automated credibility estimation and proposes black-box and white-box integration strategies for LLM inference. Result: CrEst outperforms strong baselines across three model architectures and five datasets, achieving up to 26.86% improvement in accuracy and 3.49% increase in F1 score. It also maintains robust performance under high-noise conditions. Conclusion: CrEst provides an effective solution to incorporate document credibility into LLM inference, enhancing performance on knowledge-intensive tasks. Abstract: The integration of contextual information has significantly enhanced the performance of large language models (LLMs) on knowledge-intensive tasks. However, existing methods often overlook a critical challenge: the credibility of context documents can vary widely, potentially leading to the propagation of unreliable information. In this paper, we introduce CrEst, a novel weakly supervised framework for assessing the credibility of context documents during LLM inference--without requiring manual annotations. Our approach is grounded in the insight that credible documents tend to exhibit higher semantic coherence with other credible documents, enabling automated credibility estimation through inter-document agreement. To incorporate credibility into LLM inference, we propose two integration strategies: a black-box approach for models without access to internal weights or activations, and a white-box method that directly modifies attention mechanisms. 
Extensive experiments across three model architectures and five datasets demonstrate that CrEst consistently outperforms strong baselines, achieving up to a 26.86% improvement in accuracy and a 3.49% increase in F1 score. Further analysis shows that CrEst maintains robust performance even under high-noise conditions.
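The intuition that credible documents agree with one another can be sketched as follows. This is a toy illustration, not CrEst itself: it uses bag-of-words cosine similarity as a crude stand-in for semantic coherence, and the function names and example documents are invented for this sketch.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Bag-of-words vector; a real system would likely use sentence embeddings."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agreement_scores(docs):
    """Score each document by its mean similarity to all other documents."""
    vecs = [bow(d) for d in docs]
    scores = []
    for i, v in enumerate(vecs):
        others = [cosine(v, w) for j, w in enumerate(vecs) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return scores


docs = [
    "the capital of australia is canberra",
    "canberra is the capital city of australia",
    "some forums claim sydney became capital recently",
]
scores = agreement_scores(docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

The outlier document receives the lowest agreement score. Such scores could then be used to reweight or filter context before inference, which is roughly the role the abstract assigns to the black-box and white-box integration strategies.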

[4] MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance

Joseph J. Peper,Wenzhao Qiu,Ali Payani,Lu Wang

Main category: cs.CL

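TL;DR: The paper introduces MDBench, a new dataset for evaluating large language models on multi-document reasoning. A novel synthetic generation process produces challenging document sets and corresponding question-answer examples controllably and efficiently. Experiments show that MDBench poses significant challenges for all evaluated methods, and that the generation process supports targeted analysis and quick adaptation to new challenges and future model improvements.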

Details

Motivation: Although large language models (LLMs) handle long-context inputs well, few benchmarks examine multi-document reasoning. Creating benchmarks for the multi-document setting has also been difficult because annotating long inputs is expensive. A new benchmark is therefore needed to rigorously examine model behavior in this setting. Method: The authors introduce the MDBench dataset, created through a novel synthetic generation process. The technique starts from condensed structured seed knowledge and modifies it through LLM-assisted edits to induce multi-document-specific reasoning challenges. The structured knowledge is then converted into natural text, yielding a document set and a corresponding question-answer example. Result: Popular LLMs and prompting techniques face significant challenges on MDBench, even with relatively short document sets. The knowledge-guided generation technique makes it easy to run targeted analyses of multi-document-specific reasoning capabilities and can be adapted quickly to new challenges and future model improvements. Conclusion: MDBench provides a strong benchmark for evaluating LLMs on multi-document reasoning. Its synthetic generation process enables controllable, efficient generation of challenging document sets and QA examples, supporting in-depth analysis of model capabilities and future research. Abstract: Natural language processing evaluation has made significant progress, largely driven by the proliferation of powerful large language models (LLMs). New evaluation benchmarks are of increasing priority as the reasoning capabilities of LLMs are expanding at a rapid pace. In particular, while multi-document (MD) reasoning is an area of extreme relevance given LLM capabilities in handling longer-context inputs, few benchmarks exist to rigorously examine model behavior in this setting. Moreover, the multi-document setting is historically challenging for benchmark creation due to the expensive cost of annotating long inputs. In this work, we introduce MDBench, a new dataset for evaluating LLMs on the task of multi-document reasoning. Notably, MDBench is created through a novel synthetic generation process, allowing us to controllably and efficiently generate challenging document sets and the corresponding question-answer (QA) examples. Our novel technique operates on condensed structured seed knowledge, modifying it through LLM-assisted edits to induce MD-specific reasoning challenges. We then convert this structured knowledge into a natural text surface form, generating a document set and corresponding QA example. We analyze the behavior of popular LLMs and prompting techniques, finding that MDBench poses significant challenges for all methods, even with relatively short document sets.
We also see our knowledge-guided generation technique (1) allows us to readily perform targeted analysis of MD-specific reasoning capabilities and (2) can be adapted quickly to account for new challenges and future modeling improvements.

[5] From Chat to Checkup: Can Large Language Models Assist in Diabetes Prediction?

Shadman Sakib,Oishy Fatema Akhand,Ajwad Abrar

Main category: cs.CL

TL;DR: This paper explores the use of Large Language Models (LLMs) for predicting diabetes using structured numerical data, comparing open-source and proprietary LLMs with traditional machine learning models.

Details

Motivation: To investigate the effectiveness of LLMs in medical prediction tasks, specifically diabetes prediction, using structured numerical data from Pima Indian Diabetes Database. Method: Tested six LLMs (four open-source and two proprietary) using zero-shot, one-shot, and three-shot prompting methods. Compared their performance with three traditional ML models (Random Forest, Logistic Regression, SVM) using accuracy, precision, recall, and F1-score as metrics. Result: Proprietary LLMs outperformed open-source ones, with GPT-4o and Gemma-2-27B achieving highest accuracy in few-shot settings. Gemma-2-27B also surpassed traditional ML models in terms of F1-score. Conclusion: LLMs can be useful for medical prediction tasks but require further research on prompt engineering and domain-specific fine-tuning. Abstract: While Machine Learning (ML) and Deep Learning (DL) models have been widely used for diabetes prediction, the use of Large Language Models (LLMs) for structured numerical data is still not well explored. In this study, we test the effectiveness of LLMs in predicting diabetes using zero-shot, one-shot, and three-shot prompting methods. We conduct an empirical analysis using the Pima Indian Diabetes Database (PIDD). We evaluate six LLMs, including four open-source models: Gemma-2-27B, Mistral-7B, Llama-3.1-8B, and Llama-3.2-2B. We also test two proprietary models: GPT-4o and Gemini Flash 2.0. In addition, we compare their performance with three traditional machine learning models: Random Forest, Logistic Regression, and Support Vector Machine (SVM). We use accuracy, precision, recall, and F1-score as evaluation metrics. Our results show that proprietary LLMs perform better than open-source ones, with GPT-4o and Gemma-2-27B achieving the highest accuracy in few-shot settings. Notably, Gemma-2-27B also outperforms the traditional ML models in terms of F1-score. 
However, there are still issues such as performance variation across prompting strategies and the need for domain-specific fine-tuning. This study shows that LLMs can be useful for medical prediction tasks and encourages future work on prompt engineering and hybrid approaches to improve healthcare predictions.
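The zero-shot, one-shot, and three-shot settings described above differ only in how many labeled records precede the query. A minimal sketch of such a prompt builder follows. The feature names are the standard columns of the Pima Indian Diabetes Database; the prompt wording, the helper names, and the example values are assumptions made for illustration and are not taken from the paper.

```python
# Standard PIDD feature columns (the label column is "Outcome").
FEATURES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]


def format_record(record: dict) -> str:
    """Serialize one tabular row as 'name: value' pairs."""
    return ", ".join(f"{name}: {record[name]}" for name in FEATURES)


def build_prompt(query: dict, examples: list) -> str:
    """Build a k-shot prompt; an empty `examples` list gives the zero-shot case."""
    lines = ["Predict whether the patient has diabetes. Answer Yes or No.", ""]
    for record, outcome in examples:
        lines.append(f"Patient: {format_record(record)}")
        lines.append(f"Answer: {'Yes' if outcome == 1 else 'No'}")
        lines.append("")
    lines.append(f"Patient: {format_record(query)}")
    lines.append("Answer:")
    return "\n".join(lines)


# Invented values, for illustration only.
shot = ({"Pregnancies": 2, "Glucose": 150, "BloodPressure": 70, "SkinThickness": 30,
         "Insulin": 100, "BMI": 33.5, "DiabetesPedigreeFunction": 0.6, "Age": 45}, 1)
query = {"Pregnancies": 1, "Glucose": 90, "BloodPressure": 65, "SkinThickness": 25,
         "Insulin": 80, "BMI": 24.0, "DiabetesPedigreeFunction": 0.3, "Age": 28}

zero_shot = build_prompt(query, [])
one_shot = build_prompt(query, [shot])
print(one_shot)
```

The model's completion after the final "Answer:" would then be parsed into a binary prediction and scored with the same metrics used for the traditional ML baselines.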

[6] Memory Tokens: Large Language Models Can Generate Reversible Sentence Embeddings

Ignacio Sastre,Aiala Rosá

Main category: cs.CL

TL;DR: The paper shows that reversible sentence embeddings can be generated by optimizing the embedding of a special memory token, allowing an LLM to reconstruct the original text exactly without modifying the model's weights.

Details

Motivation: To explore the possibility of generating reversible sentence embeddings that allow an LLM to reconstruct the original text exactly without altering the model's weights. Method: Introducing a special memory token whose embedding is optimized through training on a fixed sequence. When prompted with this embedding, the model reconstructs the fixed sequence exactly. Result: Evaluated across English and Spanish datasets, sequences up to approximately 240 tokens, and model scales from 100M to 8B parameters. Notably, Llama 3.1 8B successfully reconstructed all tested sequences. Conclusion: This phenomenon highlights an interesting capability of LLMs and suggests potential applications in memory-based retrieval, compression, and controlled text generation. Abstract: In this work, we observe an interesting phenomenon: it is possible to generate reversible sentence embeddings that allow an LLM to reconstruct the original text exactly, without modifying the model's weights. This is achieved by introducing a special memory token, whose embedding is optimized through training on a fixed sequence. When prompted with this embedding, the model reconstructs the fixed sequence exactly. We evaluate this phenomenon across English and Spanish datasets, sequences of up to approximately 240 tokens, and model scales ranging from 100M to 8B parameters. Notably, Llama 3.1 8B successfully reconstructs all tested sequences. Our findings highlight an interesting capability of LLMs and suggest potential applications in memory-based retrieval, compression, and controlled text generation.

Drew Walker,Swati Rajwal,Sudeshna Das,Snigdha Peddireddy,Abeed Sarker

Main category: cs.CL

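TL;DR: Using natural language processing, the authors analyze narratives from the US National Violent Death Reporting System, develop high-quality classifiers to identify social isolation and loneliness, and find that decedents who were men, gay, or divorced had higher odds of being classified as chronically socially isolated.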

Details

Motivation: Social isolation and loneliness have been rising in recent years and contribute substantially to suicide rates. However, the US National Violent Death Reporting System does not record these variables, so new methods are needed to identify them. Method: The authors use topic modeling for lexicon development together with supervised learning classifiers, applying NLP techniques to law enforcement and coroner/medical examiner narratives to identify social isolation and loneliness. Result: The classifiers are of high quality (average F1: 0.86, accuracy: 0.82). Across more than 300,000 suicides, they identify 1,198 cases mentioning chronic social isolation and find several significant predictors. Conclusion: The method can improve surveillance and prevention of social isolation and loneliness in the United States. Abstract: Social isolation and loneliness, which have been increasing in recent years, strongly contribute toward suicide rates. Although social isolation and loneliness are not currently recorded within the US National Violent Death Reporting System's (NVDRS) structured variables, natural language processing (NLP) techniques can be used to identify these constructs in law enforcement and coroner/medical examiner narratives. Using topic modeling for lexicon development and supervised learning classifiers, we developed high-quality classifiers (average F1: .86, accuracy: .82). Evaluating over 300,000 suicides from 2002 to 2020, we identified 1,198 mentioning chronic social isolation. Decedents had higher odds of chronic social isolation classification if they were men (OR = 1.44; CI: 1.24, 1.69, p<.0001), gay (OR = 3.68; 1.97, 6.33, p<.0001), or were divorced (OR = 3.34; 2.68, 4.19, p<.0001). We found significant predictors for other social isolation topics of recent or impending divorce, child custody loss, eviction or recent move, and break-up. Our methods can improve surveillance and prevention of social isolation and loneliness in the United States.

[8] Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation

Zongxia Li,Yapei Chang,Yuhang Zhou,Xiyang Wu,Zichao Liang,Yoo Yeon Sung,Jordan Lee Boyd-Graber

Main category: cs.CL

TL;DR: The paper introduces PrefBERT, a scoring model for evaluating open-ended long-form generation. It provides better semantic reward feedback than traditional metrics and aligns well with the rewards needed for GRPO. Human evaluations confirm that using PrefBERT leads to responses more aligned with human preferences.

Details

Motivation: Evaluating open-ended long-form generation is challenging due to difficulty in defining clear distinctions between good and bad outputs, missing key aspects in existing methods, and bias from pretraining data. Method: Propose PrefBERT, a scoring model trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality. It evaluates open-ended long-form generation in GRPO and guides its training with distinct rewards for good and bad outputs. Result: PrefBERT offers better semantic reward feedback than ROUGE-L and BERTScore. It remains reliable across varied long passages and aligns well with the verifiable rewards GRPO needs. Using PrefBERT as the reward signal yields responses better aligned with human preferences. Conclusion: PrefBERT is effective in evaluating open-ended long-form generation and guiding training with better alignment to human preferences compared to traditional metrics. Abstract: Evaluating open-ended long-form generation is challenging because it is hard to define what clearly separates good from bad outputs. Existing methods often miss key aspects like coherence, style, or relevance, or are biased by pretraining data, making open-ended long-form evaluation an underexplored problem. To address this gap, we propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO and guiding its training with distinct rewards for good and bad outputs. Trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality, PrefBERT effectively supports GRPO by offering better semantic reward feedback than traditional metrics ROUGE-L and BERTScore do. Through comprehensive evaluations, including LLM-as-a-judge, human ratings, and qualitative analysis, we show that PrefBERT, trained on multi-sentence and paragraph-length responses, remains reliable across varied long passages and aligns well with the verifiable rewards GRPO needs. 
Human evaluations confirm that using PrefBERT as the reward signal to train policy models yields responses better aligned with human preferences than those trained with traditional metrics. Our code is available at https://github.com/zli12321/long_form_rl.

[9] Learning-Time Encoding Shapes Unlearning in LLMs

Ruihan Wu,Konstantin Garov,Kamalika Chaudhuri

Main category: cs.CL

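TL;DR: Large language models (LLMs) need the ability to remove specific knowledge after the fact, i.e., to "unlearn". This paper studies how learning-time choices affect the effectiveness of unlearning factual knowledge. It finds that learning with paraphrased descriptions improves unlearning, whereas unlearning an individual piece of knowledge from a chunk of text is challenging, suggesting that learning-time knowledge encoding may play a key role in reliable post-hoc unlearning.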

Details

Motivation: As LLMs are increasingly deployed in the real world, the ability to remove specific knowledge post hoc has become essential, for reasons such as privacy regulations and correcting outdated or harmful content. Method: Experiments investigate how choices in knowledge encoding at learning time affect the effectiveness of post-hoc unlearning of factual knowledge. Result: (1) Learning with paraphrased descriptions improves unlearning performance; (2) unlearning an individual piece of knowledge from a chunk of text is challenging. Conclusion: Learning-time knowledge encoding may play a central role in enabling reliable post-hoc unlearning. Abstract: As large language models (LLMs) are increasingly deployed in the real world, the ability to "unlearn", or remove specific pieces of knowledge post hoc, has become essential for a variety of reasons ranging from privacy regulations to correcting outdated or harmful content. Prior work has proposed unlearning benchmarks and algorithms, and has typically assumed that the training process and the target model are fixed. In this work, we empirically investigate how learning-time choices in knowledge encoding impact the effectiveness of unlearning factual knowledge. Our experiments reveal two key findings: (1) learning with paraphrased descriptions improves unlearning performance and (2) unlearning an individual piece of knowledge from a chunk of text is challenging. Our results suggest that learning-time knowledge encoding may play a central role in enabling reliable post-hoc unlearning.

[10] Improving Dialogue Discourse Parsing through Discourse-aware Utterance Clarification

Yaxin Fan,Peifeng Li,Qiaoming Zhu

Main category: cs.CL

TL;DR: This paper proposes a Discourse-aware Clarification Module (DCM) with two reasoning processes and Contribution-aware Preference Optimization (CPO) to enhance dialogue discourse parsing, achieving better performance on STAC and Molweni datasets.

Details

Motivation: Dialogue discourse parsing faces challenges due to ambiguities introduced by linguistic features like omission and idiom. Method: Propose DCM with clarification type reasoning and discourse goal reasoning, and CPO to optimize DCM's performance. Result: Effectively resolves ambiguities and significantly outperforms SOTA baselines on STAC and Molweni datasets. Conclusion: The approach improves dialogue discourse parsing by addressing ambiguities through DCM and CPO. Abstract: Dialogue discourse parsing aims to identify and analyze discourse relations between the utterances within dialogues. However, linguistic features in dialogues, such as omission and idiom, frequently introduce ambiguities that obscure the intended discourse relations, posing significant challenges for parsers. To address this issue, we propose a Discourse-aware Clarification Module (DCM) to enhance the performance of the dialogue discourse parser. DCM employs two distinct reasoning processes: clarification type reasoning and discourse goal reasoning. The former analyzes linguistic features, while the latter distinguishes the intended relation from the ambiguous one. Furthermore, we introduce Contribution-aware Preference Optimization (CPO) to mitigate the risk of erroneous clarifications, thereby reducing cascading errors. CPO enables the parser to assess the contributions of the clarifications from DCM and provide feedback to optimize the DCM, enhancing its adaptability and alignment with the parser's requirements. Extensive experiments on the STAC and Molweni datasets demonstrate that our approach effectively resolves ambiguities and significantly outperforms the state-of-the-art (SOTA) baselines.

[11] CKD-EHR:Clinical Knowledge Distillation for Electronic Health Records

Junke Wang,Hongshun Ling,Li Zhang,Longqian Zhang,Fang Wang,Yuan Gao,Zhi Li

Main category: cs.CL

TL;DR: The paper proposes CKD-EHR framework which uses knowledge distillation techniques to improve disease risk prediction based on EHR. It fine-tunes Qwen2.5-7B as the teacher model and transfers the knowledge to a lightweight BERT student model, achieving significant performance improvements on MIMIC-III dataset.

Details

Motivation: Existing large language models have insufficient medical knowledge representation and low efficiency in clinical deployment. Method: CKD-EHR framework applies knowledge distillation techniques. The large language model Qwen2.5-7B is fine-tuned on medical knowledge-enhanced data to serve as the teacher model, generating interpretable soft labels through a multi-granularity attention distillation mechanism, then transferring the distilled knowledge to a lightweight BERT student model. Result: On the MIMIC-III dataset, CKD-EHR significantly outperforms the baseline model with 9% increase in diagnostic accuracy, 27% improvement in F1-score, and 22.2 times inference speedup. Conclusion: CKD-EHR not only greatly improves resource utilization efficiency but also significantly enhances the accuracy and timeliness of diagnosis, providing a practical technical approach for resource optimization in clinical settings. Abstract: Electronic Health Records (EHR)-based disease prediction models have demonstrated significant clinical value in promoting precision medicine and enabling early intervention. However, existing large language models face two major challenges: insufficient representation of medical knowledge and low efficiency in clinical deployment. To address these challenges, this study proposes the CKD-EHR (Clinical Knowledge Distillation for EHR) framework, which achieves efficient and accurate disease risk prediction through knowledge distillation techniques. Specifically, the large language model Qwen2.5-7B is first fine-tuned on medical knowledge-enhanced data to serve as the teacher model. It then generates interpretable soft labels through a multi-granularity attention distillation mechanism. Finally, the distilled knowledge is transferred to a lightweight BERT student model.
Experimental results show that on the MIMIC-III dataset, CKD-EHR significantly outperforms the baseline model: diagnostic accuracy is increased by 9%, F1-score is improved by 27%, and a 22.2 times inference speedup is achieved. This innovative solution not only greatly improves resource utilization efficiency but also significantly enhances the accuracy and timeliness of diagnosis, providing a practical technical approach for resource optimization in clinical settings. The code and data for this research are available at https://github.com/209506702/CKD_EHR.
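The teacher-to-student transfer via soft labels can be sketched with the standard temperature-scaled distillation loss. This is the generic formulation of knowledge distillation, shown under the assumption of a simple classification head; it is not the multi-granularity attention distillation mechanism of CKD-EHR, and the `temperature` and `alpha` values are arbitrary illustrative choices.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-label term (match the teacher) and a hard-label term."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft_term = temperature ** 2 * kl_divergence(soft_teacher, soft_student)
    hard_term = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft_term + (1 - alpha) * hard_term


teacher = [2.0, 0.0]
good_student = [2.0, 0.0]
bad_student = [0.0, 2.0]
loss_good = distillation_loss(teacher, good_student, hard_label=0)
loss_bad = distillation_loss(teacher, bad_student, hard_label=0)
print(f"student matching teacher: {loss_good:.4f}")
print(f"student opposing teacher: {loss_bad:.4f}")
```

A student that reproduces the teacher's distribution incurs only the hard-label term, while one that disagrees is penalized by both terms.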

[12] Modeling the One-to-Many Property in Open-Domain Dialogue with LLMs

Jing Yang Lee,Kong-Aik Lee,Woon-Seng Gan

Main category: cs.CL

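TL;DR: The paper decomposes open-domain dialogue generation into two tasks, multi-response generation and preference-based selection, introduces the o2mDial corpus, and proposes new learning strategies and evaluation metrics, substantially improving the response diversity and quality of smaller LLMs without sacrificing coherence.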

Details

Motivation: Although prior research shows that modeling the one-to-many property of open-domain dialogue improves response diversity, most modern LLM-based dialogue agents do not explicitly exploit it. Method: Open-domain dialogue generation is split into two tasks, Multi-Response Generation (MRG) and Preference-based Selection (PS). The authors build o2mDial, a dialogue corpus designed to capture the one-to-many property, and propose new in-context learning and instruction-tuning strategies, new evaluation metrics for MRG, and a model-based approach for PS. Result: Applying the proposed two-stage framework to smaller LLMs enhances response diversity while maintaining contextual coherence and improves response quality by up to 90%, bringing them closer to the performance of larger models. Conclusion: The study provides an effective two-stage framework for improving smaller LLMs in open-domain dialogue, raising response quality and diversity by exploiting the one-to-many property. Abstract: Open-domain Dialogue (OD) exhibits a one-to-many (o2m) property, whereby multiple appropriate responses exist for a single dialogue context. Despite prior research showing that modeling this property boosts response diversity, most modern LLM-based dialogue agents do not explicitly do so. In this work, we model the o2m property of OD in LLMs by decomposing OD generation into two key tasks: Multi-Response Generation (MRG) and Preference-based Selection (PS), which entail generating a set of n semantically and lexically diverse high-quality responses for a given dialogue context, followed by selecting a single response based on human preference, respectively. To facilitate MRG and PS, we introduce o2mDial, a dialogue corpus explicitly designed to capture the o2m property by featuring multiple plausible responses for each context. Leveraging o2mDial, we propose new in-context learning and instruction-tuning strategies, as well as novel evaluation metrics for MRG, alongside a model-based approach for PS. Empirical results demonstrate that applying the proposed two-stage framework to smaller LLMs for OD generation enhances overall response diversity while maintaining contextual coherence, improving response quality by up to 90%, bringing them closer to the performance of larger models.

[13] Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models

Gyeongje Cho,Yeonkyoun So,Chanwoo Park,Sangmin Lee,Sungmok Jung,Jaejin Lee

Main category: cs.CL

TL;DR: This paper presents Thunder-Tok, a new Korean tokenizer which reduces token fertility by 10% without affecting model performance across various tasks.

Details

Motivation: To design a more efficient tokenizer for the Korean language that reduces token fertility without compromising model performance. Method: The approach uses a rule-based pre-tokenization method aligned with Korean linguistic structure, creates a seed vocabulary of linguistically relevant tokens, and employs a branching entropy-based selection algorithm to increase average token length. Result: Thunder-Tok reduces token fertility by approximately 10%, leading to a 10% reduction in the number of tokens and a 10% improvement in inference speed compared to BPE, all while maintaining performance on various downstream tasks. Conclusion: The linguistically informed approach used in Thunder-Tok is effective and practical for designing efficient tokenizers for language models. Abstract: This paper introduces Thunder-Tok, a new Korean tokenizer designed to reduce token fertility without compromising model performance. Our approach uses a rule-based pre-tokenization method that aligns with the linguistic structure of the Korean language. We also create a seed vocabulary containing tokens that resemble linguistic units and employ a branching entropy-based selection algorithm. These techniques increase the average token length, thus lowering fertility while preserving linguistic information. Experimental results indicate that Thunder-Tok reduces fertility by approximately 10% (i.e., reduces the number of tokens by 10%, improving the inference speed by 10%) compared to BPE without compromising performance across various downstream tasks. These findings demonstrate that our linguistically informed approach is effective and practical for designing efficient tokenizers for language models.
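Token fertility, the quantity Thunder-Tok reduces, is the average number of tokens produced per word. The sketch below computes it for two toy tokenizers that simply cut each word into fixed-size character chunks. These chunking tokenizers, the whitespace word definition, and the English sample text are assumptions for illustration; they stand in for BPE and Thunder-Tok only loosely.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words


def make_chunk_tokenizer(size):
    """Toy tokenizer: split every word into character chunks of at most `size`."""
    def tokenize(text):
        tokens = []
        for word in text.split():
            tokens.extend(word[i:i + size] for i in range(0, len(word), size))
        return tokens
    return tokenize


texts = ["tokenizers split words"]
short_tokens = make_chunk_tokenizer(2)  # shorter tokens, higher fertility
long_tokens = make_chunk_tokenizer(3)   # longer tokens, lower fertility

f_short = fertility(texts, short_tokens)
f_long = fertility(texts, long_tokens)
reduction = 1 - f_long / f_short
print(f"fertility with 2-char tokens: {f_short:.3f}")
print(f"fertility with 3-char tokens: {f_long:.3f}")
print(f"relative reduction:           {reduction:.1%}")
```

Because the word count is fixed for a given text, a relative reduction in fertility equals the relative reduction in token count, which is why the abstract equates the two.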

[14] Emergence of Primacy and Recency Effect in Mamba: A Mechanistic Point of View

Muhammad Cendekia Airlangga,Hilal AlQuabeh,Munachiso S Nwadike,Kentaro Inui

Main category: cs.CL

TL;DR: The paper explores memory in state-space language models using primacy and recency effects, finding that long-term memory is supported by sparse channels, short-term memory governed by delta-modulated recurrence, and memory allocation modulated by semantic regularity.

Details

Motivation: To understand how information is retained and forgotten over time in state-space language models, using behavioral tools such as primacy and recency effects. Method: Applying structured recall tasks to the Mamba architecture and observing the U-shaped accuracy profile. Identifying mechanisms related to long-term memory, short-term memory, and memory allocation modulation. Result: Discovered three key mechanisms: persistent encoding of early input tokens for long-term memory, exponential decay-weighted recent inputs for short-term memory, and dynamic modulation of memory allocation based on semantic regularity. Conclusion: These findings were validated via targeted ablations and input perturbations on large-scale Mamba-based language models. Abstract: We study memory in state-space language models using primacy and recency effects as behavioral tools to uncover how information is retained and forgotten over time. Applying structured recall tasks to the Mamba architecture, we observe a consistent U-shaped accuracy profile, indicating strong performance at the beginning and end of input sequences. We identify three mechanisms that give rise to this pattern. First, long-term memory is supported by a sparse subset of channels within the model's selective state space block, which persistently encode early input tokens and are causally linked to primacy effects. Second, short-term memory is governed by delta-modulated recurrence: recent inputs receive more weight due to exponential decay, but this recency advantage collapses when distractor items are introduced, revealing a clear limit to memory depth. Third, we find that memory allocation is dynamically modulated by semantic regularity: repeated relations in the input sequence shift the delta gating behavior, increasing the tendency to forget intermediate items. 
We validate these findings via targeted ablations and input perturbations on two large-scale Mamba-based language models: one with 1.4B and another with 7B parameters.
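The recency mechanism, in which recent inputs receive more weight because of exponential decay, can be seen in a scalar linear recurrence. The sketch below is a deliberate simplification: it assumes a single channel with a fixed decay factor, whereas Mamba's decay is input-dependent (delta-modulated) and applied per channel. It therefore reproduces only the recency side of the U-shaped profile, not the primacy effect attributed to a sparse subset of channels.

```python
def run_recurrence(inputs, decay):
    """Scalar state update h_t = decay * h_{t-1} + x_t, starting from h_0 = 0."""
    h = 0.0
    for x in inputs:
        h = decay * h + x
    return h


def influence_weights(length, decay):
    """Contribution of each input position to the final state, via unit impulses."""
    weights = []
    for pos in range(length):
        impulse = [1.0 if i == pos else 0.0 for i in range(length)]
        weights.append(run_recurrence(impulse, decay))
    return weights


weights = influence_weights(length=5, decay=0.5)
for pos, w in enumerate(weights):
    print(f"position {pos}: weight {w:.4f}")
```

An input that is k steps old contributes with weight decay^k, so the most recent token dominates and older tokens fade geometrically.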

[15] A Comparative Study of Task Adaptation Techniques of Large Language Models for Identifying Sustainable Development Goals

Andrea Cadeddu,Alessandro Chessa,Vincenzo De Leo,Gianni Fenu,Enrico Motta,Francesco Osborne,Diego Reforgiato Recupero,Angelo Salatino,Luca Secchi

Main category: cs.CL

TL;DR: This paper explores the use of various proprietary and open-source large language models (LLMs) for a single-label, multi-class text classification task related to the Sustainable Development Goals (SDGs). It also evaluates different learning techniques including Zero-Shot, Few-Shot Learning, and Fine-Tuning. The study finds that smaller models optimized via prompt engineering can perform as well as larger models such as OpenAI's GPT.

Details

Motivation: The motivation is the challenge of tracking progress towards the United Nations' Sustainable Development Goals (SDGs) due to the extensive scale and complexity of the data involved. Text classification models are needed to automate the analysis of vast amounts of text from various sources. Method: The method involves analyzing various proprietary and open-source LLMs for a single-label, multi-class text classification task focused on SDGs. It also evaluates task adaptation techniques including Zero-Shot and Few-Shot Learning, and Fine-Tuning. Result: The results show that smaller models, when optimized through prompt engineering, can perform comparably to larger models like OpenAI's GPT. Conclusion: The conclusion is that with proper optimization techniques, smaller models can be effective in text classification tasks related to SDGs. Abstract: In 2012, the United Nations introduced 17 Sustainable Development Goals (SDGs) aimed at creating a more sustainable and improved future by 2030. However, tracking progress toward these goals is difficult because of the extensive scale and complexity of the data involved. Text classification models have become vital tools in this area, automating the analysis of vast amounts of text from a variety of sources. Additionally, large language models (LLMs) have recently proven indispensable for many natural language processing tasks, including text classification, thanks to their ability to recognize complex linguistic patterns and semantics. This study analyzes various proprietary and open-source LLMs for a single-label, multi-class text classification task focused on the SDGs. Then, it also evaluates the effectiveness of task adaptation techniques (i.e., in-context learning approaches), namely Zero-Shot and Few-Shot Learning, as well as Fine-Tuning within this domain. 
The results reveal that smaller models, when optimized through prompt engineering, can perform on par with larger models like OpenAI's GPT (Generative Pre-trained Transformer).

[16] ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs

Feng He,Zijun Chen,Xinnian Liang,Tingting Ma,Yunqi Qiu,Shuangzhi Wu,Junchi Yan

Main category: cs.CL

TL;DR: Recent advances in Large Reasoning Models (LRMs) with Long Chain-of-Thought (Long CoT) reasoning have shown cross-domain generalization capabilities. The paper hypothesizes that this arises from shared abstract reasoning prototypes, proposing ProtoReasoning framework which enhances LLMs' reasoning ability through scalable and verifiable prototypical representations.

Details

Motivation: To understand the underlying mechanisms of cross-domain generalization capabilities in Large Reasoning Models trained with Long Chain-of-Thought reasoning, hypothesizing that it arises from shared abstract reasoning prototypes. Method: ProtoReasoning framework features an automated prototype construction pipeline, a comprehensive verification system, and scalability within prototype space using Prolog for logical reasoning and PDDL for planning. Result: ProtoReasoning shows improvements over baseline models: 4.7% on logical reasoning, 6.3% on planning tasks, 4.0% on general reasoning, and 1.0% on mathematics. Ablation studies confirm enhanced generalization to structurally similar problems compared to training solely on natural language representations. Conclusion: The hypothesis that reasoning prototypes serve as the foundation for generalizable reasoning in large language models is validated. Abstract: Recent advances in Large Reasoning Models (LRMs) trained with Long Chain-of-Thought (Long CoT) reasoning have demonstrated remarkable cross-domain generalization capabilities. However, the underlying mechanisms supporting such transfer remain poorly understood. We hypothesize that cross-domain generalization arises from shared abstract reasoning prototypes -- fundamental reasoning patterns that capture the essence of problems across domains. 
These prototypes minimize the nuances of the representation, revealing that seemingly diverse tasks are grounded in shared reasoning structures. Based on this hypothesis, we propose ProtoReasoning, a framework that enhances the reasoning ability of LLMs by leveraging scalable and verifiable prototypical representations (Prolog for logical reasoning, PDDL for planning). ProtoReasoning features: (1) an automated prototype construction pipeline that transforms problems into corresponding prototype representations; (2) a comprehensive verification system providing reliable feedback through Prolog/PDDL interpreters; (3) the scalability to synthesize problems arbitrarily within prototype space while ensuring correctness. Extensive experiments show that ProtoReasoning achieves 4.7% improvement over baseline models on logical reasoning (Enigmata-Eval), 6.3% improvement on planning tasks, 4.0% improvement on general reasoning (MMLU) and 1.0% on mathematics (AIME24). Significantly, our ablation studies confirm that learning in prototype space also demonstrates enhanced generalization to structurally similar problems compared to training solely on natural language representations, validating our hypothesis that reasoning prototypes serve as the foundation for generalizable reasoning in large language models.

[17] MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs

Yongqi Fan,Yating Wang,Guandong Wang,Jie Zhai,Jingping Liu,Qi Ye,Tong Ruan

Main category: cs.CL

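TL;DR: MinosEval is a new evaluation method for open-ended question answering that distinguishes factoid from non-factoid questions and applies a different evaluation strategy to each, improving agreement with human annotations and the interpretability of results.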

Details

Motivation: Existing LLM evaluation methods fall short on open-ended questions: traditional metrics struggle to capture semantic similarity, current LLM-based approaches lack intuitive interpretability, and none of them distinguish factoid from non-factoid questions. Method: MinosEval first classifies each open-ended question as factoid or non-factoid, then applies a different evaluation strategy to each: an adaptive key-point scoring strategy for factoid questions and an instance-aware listwise ranking strategy for non-factoid questions. Result: Experiments on multiple open-ended QA datasets, including self-built ones, show that MinosEval aligns better with human annotations and provides more interpretable results. Conclusion: MinosEval offers a more refined and interpretable automatic evaluation method for open-ended question answering, addressing the shortcomings of existing methods. Abstract: Open-ended question answering (QA) is a key task for evaluating the capabilities of large language models (LLMs). Compared to closed-ended QA, it demands longer answer statements, more nuanced reasoning processes, and diverse expressions, making refined and interpretable automatic evaluation both crucial and challenging. Traditional metrics like ROUGE and BERTScore struggle to capture semantic similarities due to different patterns between model responses and reference answers. Current LLM-based evaluation approaches, such as pairwise or listwise comparisons of candidate answers, lack intuitive interpretability. While pointwise scoring of each response provides some descriptions, it fails to adapt across different question contents. Most notably, existing methods overlook the distinction between factoid and non-factoid questions. To address these challenges, we propose MinosEval, a novel evaluation method that first distinguishes open-ended questions and then ranks candidate answers using different evaluation strategies. For factoid questions, it applies an adaptive key-point scoring strategy, while for non-factoid questions, it uses an instance-aware listwise ranking strategy. Experiments on multiple open-ended QA datasets, including self-built ones with more candidate responses to complement community resources, show that MinosEval better aligns with human annotations and offers more interpretable results.

[18] Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants

Jaione Bengoetxea,Itziar Gonzalez-Dios,Rodrigo Agerri

Main category: cs.CL

TL;DR: The paper evaluates the capacity of current language technologies to understand Basque and Spanish language varieties using Natural Language Inference (NLI) as a pivot task.

Details

Motivation: To assess how well current language technologies can handle linguistic variations in Basque and Spanish languages. Method: Using encoder-only and decoder-based Large Language Models (LLMs) for crosslingual and in-context learning experiments, introducing a novel manually-curated parallel dataset in Basque and Spanish along with their respective variants. Result: There is a performance drop when handling linguistic variation, particularly in Basque. Encoder-only models struggle more with Western Basque, aligning with linguistic theory that peripheral dialects are more distant from the standard. Conclusion: Error analysis suggests that the decline in performance is due to linguistic variation itself rather than lexical overlap. Abstract: In this paper, we evaluate the capacity of current language technologies to understand Basque and Spanish language varieties. We use Natural Language Inference (NLI) as a pivot task and introduce a novel, manually-curated parallel dataset in Basque and Spanish, along with their respective variants. Our empirical analysis of crosslingual and in-context learning experiments using encoder-only and decoder-based Large Language Models (LLMs) shows a performance drop when handling linguistic variation, especially in Basque. Error analysis suggests that this decline is not due to lexical overlap, but rather to the linguistic variation itself. Further ablation experiments indicate that encoder-only models particularly struggle with Western Basque, which aligns with linguistic theory that identifies peripheral dialects (e.g., Western) as more distant from the standard. All data and code are publicly available.

[19] Research on Graph-Retrieval Augmented Generation Based on Historical Text Knowledge Graphs

Yang Fan,Zhang Qi,Xing Wenqian,Liu Chang,Liu Liu

Main category: cs.CL

TL;DR: This paper proposes the Graph RAG framework for historical text analysis, which creates a character relationship dataset with minimal manual annotation and introduces a collaborative mechanism between knowledge graphs and retrieval-augmented generation. Experiments show that the domain-specific model Xunzi-Qwen1.5-14B performs optimally in relation extraction, while the DeepSeek model integrated with GraphRAG improves F1 scores.

Details

Motivation: To address domain knowledge gaps in general large language models for historical text analysis. Method: The Graph RAG framework combines chain-of-thought prompting, self-instruction generation, and process supervision to create a character relationship dataset. It also introduces a collaborative mechanism between knowledge graphs and retrieval-augmented generation. Result: Xunzi-Qwen1.5-14B achieves optimal performance in relation extraction (F1 = 0.68). DeepSeek model with GraphRAG improves F1 by 11% on the C-CLUE relation extraction dataset, surpassing Xunzi-Qwen1.5-14B's F1 value. Conclusion: The Graph RAG framework offers a low-resource solution for classical text knowledge extraction, advancing historical knowledge services and humanities research. Abstract: This article addresses domain knowledge gaps in general large language models for historical text analysis in the context of computational humanities and AIGC technology. We propose the Graph RAG framework, combining chain-of-thought prompting, self-instruction generation, and process supervision to create a character relationship dataset for The First Four Histories with minimal manual annotation. This dataset supports automated historical knowledge extraction, reducing labor costs. In the graph-augmented generation phase, we introduce a collaborative mechanism between knowledge graphs and retrieval-augmented generation, improving the alignment of general models with historical knowledge. Experiments show that the domain-specific model Xunzi-Qwen1.5-14B, with Simplified Chinese input and chain-of-thought prompting, achieves optimal performance in relation extraction (F1 = 0.68). The DeepSeek model integrated with GraphRAG improves F1 by 11% (0.08-0.19) on the open-domain C-CLUE relation extraction dataset, surpassing the F1 value of Xunzi-Qwen1.5-14B (0.12), effectively alleviating hallucinations and improving interpretability. 
This framework offers a low-resource solution for classical text knowledge extraction, advancing historical knowledge services and humanities research.

[20] TopClustRAG at SIGIR 2025 LiveRAG Challenge

Juli Bakagianni,John Pavlopoulos,Aristidis Likas

Main category: cs.CL

TL;DR: The paper introduces TopClustRAG, a RAG system for question answering that uses hybrid retrieval and clustering to improve answer diversity, relevance, and faithfulness. It ranked 2nd in faithfulness and 7th in correctness on the FineWeb Sample-10BT dataset.

Details

Motivation: To enhance the performance of end-to-end question answering systems over large-scale web corpora by improving answer diversity, relevance, and faithfulness. Method: TopClustRAG employs a hybrid retrieval strategy combining sparse and dense indices, K-Means clustering to group semantically similar passages, and cluster-specific prompts for a large language model (LLM) to generate intermediate answers that are then filtered, reranked, and synthesized into a final response. Result: On the FineWeb Sample-10BT dataset, TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the official leaderboard. Conclusion: The effectiveness of clustering-based context filtering and prompt aggregation in large-scale RAG systems is demonstrated. Abstract: We present TopClustRAG, a retrieval-augmented generation (RAG) system developed for the LiveRAG Challenge, which evaluates end-to-end question answering over large-scale web corpora. Our system employs a hybrid retrieval strategy combining sparse and dense indices, followed by K-Means clustering to group semantically similar passages. Representative passages from each cluster are used to construct cluster-specific prompts for a large language model (LLM), generating intermediate answers that are filtered, reranked, and finally synthesized into a single, comprehensive response. This multi-stage pipeline enhances answer diversity, relevance, and faithfulness to retrieved evidence. Evaluated on the FineWeb Sample-10BT dataset, TopClustRAG ranked 2nd in faithfulness and 7th in correctness on the official leaderboard, demonstrating the effectiveness of clustering-based context filtering and prompt aggregation in large-scale RAG systems.
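
The clustering step described above can be sketched as follows. This is a minimal illustration, assuming toy 2-D embeddings, a deterministic initialisation from the first k points, and "closest to centroid" as the rule for picking a representative passage; none of these details come from the abstract, and the function names are hypothetical.

```python
import math


def kmeans(points, k, iters=10):
    """Plain Lloyd's algorithm; centroids start at the first k points (assumed)."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        assign = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: mean of each non-empty cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return assign, centroids


def representatives(points, assign, centroids):
    """Index of the passage closest to each centroid (an assumed selection rule)."""
    reps = {}
    for c, centre in enumerate(centroids):
        idx = [i for i, a in enumerate(assign) if a == c]
        if idx:
            reps[c] = min(idx, key=lambda i: math.dist(points[i], centre))
    return reps


# Toy embeddings standing in for retrieved passages p0..p5.
embeddings = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.0), (5.2, 5.0), (0.1, 0.1), (5.1, 5.1)]
assign, centroids = kmeans(embeddings, k=2)
reps = representatives(embeddings, assign, centroids)
print(assign, reps)
```

In a pipeline of this shape, each representative passage would seed one cluster-specific prompt, and the intermediate answers would then be filtered, reranked, and merged.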

[21] Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments

Sungen Hahm,Heejin Kim,Gyuseong Lee,Hyunji Park,Jaejin Lee

Main category: cs.CL

TL;DR: A new de-identification framework, Thunder-DeID, for court judgments in South Korea is proposed, featuring a Korean legal dataset, systematic PII categorization, and a DNN-based pipeline.

Details

Motivation: The current de-identification process of court judgments in South Korea is insufficient for handling large-scale data while meeting strict legal requirements. Legal definitions of personal identifiers are vague, making it hard to apply technical solutions effectively. Method: Construct the first Korean legal dataset with annotated judgments and entity mentions, introduce a systematic PII categorization, and develop an end-to-end DNN-based de-identification pipeline called Thunder-DeID. Result: Experimental results indicate that the proposed model achieves state-of-the-art performance in de-identifying court judgments. Conclusion: Thunder-DeID provides an effective solution to balance open access to justice and personal data protection by improving the de-identification process of court judgments. Abstract: To ensure a balance between open access to justice and personal data protection, the South Korean judiciary mandates the de-identification of court judgments before they can be publicly disclosed. However, the current de-identification process is inadequate for handling court judgments at scale while adhering to strict legal requirements. Additionally, the legal definitions and categorizations of personal identifiers are vague and not well-suited for technical solutions. To tackle these challenges, we propose a de-identification framework called Thunder-DeID, which aligns with relevant laws and practices. Specifically, we (i) construct and release the first Korean legal dataset containing annotated judgments along with corresponding lists of entity mentions, (ii) introduce a systematic categorization of Personally Identifiable Information (PII), and (iii) develop an end-to-end deep neural network (DNN)-based de-identification pipeline. Our experimental results demonstrate that our model achieves state-of-the-art performance in the de-identification of court judgments.

[22] Cohort Discovery: A Survey on LLM-Assisted Clinical Trial Recruitment

Shrestha Ghosh,Moritz Schneider,Carina Reinicke,Carsten Eickhoff

Main category: cs.CL

TL;DR: This paper surveys the application of LLMs in clinical trial recruitment, focusing on the task of trial-patient matching.

Details

Motivation: To address the limitation of current methods for matching patients with clinical trials and to explore how LLMs can be better utilized in this critical domain. Method: The authors conduct a survey analyzing existing benchmarks, approaches, and evaluation frameworks for LLM-based solutions in clinical trial recruitment. Result: Identifies challenges in adopting LLM technologies in clinical research and highlights promising future directions. Conclusion: LLMs have great potential to improve trial-patient matching but require more robust evaluation benchmarks and open models to fully realize their capabilities. Abstract: Recent advances in LLMs have greatly improved general-domain NLP tasks. Yet, their adoption in critical domains, such as clinical trial recruitment, remains limited. As trials are designed in natural language and patient data is represented as both structured and unstructured text, the task of matching trials and patients benefits from knowledge aggregation and reasoning abilities of LLMs. Classical approaches are trial-specific and LLMs with their ability to consolidate distributed knowledge hold the potential to build a more general solution. Yet recent applications of LLM-assisted methods rely on proprietary models and weak evaluation benchmarks. In this survey, we are the first to analyze the task of trial-patient matching and contextualize emerging LLM-based approaches in clinical trial recruitment. We critically examine existing benchmarks, approaches and evaluation frameworks, the challenges to adopting LLM technologies in clinical research and exciting future directions.

[23] ConLID: Supervised Contrastive Learning for Low-Resource Language Identification

Negar Foroutan,Jakhongir Saydaliev,Ye Eun Kim,Antoine Bosselut

Main category: cs.CL

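TL;DR: This paper proposes a new supervised contrastive learning approach that learns domain-invariant representations for low-resource languages, improving the performance of language identification models on out-of-domain data.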

Details

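Motivation: Although much work on language identification (LID) models focuses on collecting diverse training data to improve performance, low-resource languages still perform poorly, especially when they are limited to single-domain data such as the Bible. Addressing these class imbalance and bias issues calls for a new method that improves out-of-domain identification for low-resource languages. Method: A novel supervised contrastive learning (SCL) approach that addresses class imbalance and bias by learning domain-invariant representations for low-resource languages. Result: Experiments show that the approach improves LID performance on out-of-domain data for low-resource languages by 3.2%, demonstrating its effectiveness in enhancing LID models. Conclusion: The proposed supervised contrastive learning approach effectively improves LID performance for low-resource languages on out-of-domain data, offering a new direction for improving the language identification step in curating multilingual LLM pretraining corpora. Abstract: Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages -- often limited to single-domain data, such as the Bible -- continue to perform poorly. To resolve these class imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. Through an extensive analysis, we show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2%, demonstrating its effectiveness in enhancing LID models.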
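
For readers unfamiliar with supervised contrastive learning, the sketch below implements the general supervised contrastive loss in plain Python: for each anchor, same-label examples are pulled together and all other examples pushed apart in a softmax over temperature-scaled cosine similarities. The abstract does not describe the paper's exact variant, so the temperature value, the toy embeddings, and the language labels here are assumptions for illustration only.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss, averaged over anchors that
    have at least one same-label partner."""
    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchors without a same-label partner are skipped
        # Temperature-scaled cosine similarity to every other example.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


labels = ["swh", "swh", "eng", "eng"]  # hypothetical language labels
clustered = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
scattered = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.1, 0.9)]
loss_clustered = supcon_loss(clustered, labels)
loss_scattered = supcon_loss(scattered, labels)
print(loss_clustered < loss_scattered)  # True
```

The loss is small when examples of the same language sit close together regardless of where they came from, which is the property a domain-invariant representation is meant to have.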

[24] DeVisE: Behavioral Testing of Medical Large Language Models

Camila Zurdo Tagliabue,Heloisa Oss Boll,Aykut Erdem,Erkut Erdem,Iacer Calixto

Main category: cs.CL

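TL;DR: DeVisE is a behavioral testing framework for evaluating large language models in clinical decision support. Using a dataset of ICU discharge notes with both real and synthetic versions, the study examines models' sensitivity to changes in demographic and vital sign attributes and their reasoning ability. The results show that zero-shot models exhibit more coherent counterfactual reasoning patterns, while fine-tuned models are more stable but less responsive to clinically meaningful changes. Demographic factors also subtly but consistently influence outputs, underscoring the importance of fairness-aware evaluation.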

Details

Motivation: Current methods for evaluating large language models in clinical decision support often fail to distinguish genuine medical reasoning from superficial patterns, so a finer-grained evaluation framework is needed to probe models' clinical understanding. Method: The DeVisE framework builds ICU discharge notes from the MIMIC-IV dataset, generates real and synthetic versions of the data, and controls demographic and vital sign attributes through single-variable counterfactuals. Five LLMs of different types are then evaluated under zero-shot and fine-tuned settings, analyzing both input-level sensitivity and downstream reasoning. Result: Zero-shot models exhibit more coherent counterfactual reasoning patterns, while fine-tuned models are more stable but less responsive to clinically meaningful changes. In addition, demographic factors have a subtle but consistent influence on model outputs. Conclusion: Behavioral testing helps expose the reasoning strategies of clinical LLMs and informs the design of safer, more transparent medical AI systems. Abstract: Large language models (LLMs) are increasingly used in clinical decision support, yet current evaluation methods often fail to distinguish genuine medical reasoning from superficial patterns. We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework for probing fine-grained clinical understanding. We construct a dataset of ICU discharge notes from MIMIC-IV, generating both raw (real-world) and template-based (synthetic) versions with controlled single-variable counterfactuals targeting demographic (age, gender, ethnicity) and vital sign attributes. We evaluate five LLMs spanning general-purpose and medically fine-tuned variants, under both zero-shot and fine-tuned settings. We assess model behavior via (1) input-level sensitivity - how counterfactuals alter the likelihood of a note; and (2) downstream reasoning - how they affect predicted hospital length-of-stay. Our results show that zero-shot models exhibit more coherent counterfactual reasoning patterns, while fine-tuned models tend to be more stable yet less responsive to clinically meaningful changes. Notably, demographic factors subtly but consistently influence outputs, emphasizing the importance of fairness-aware evaluation. This work highlights the utility of behavioral testing in exposing the reasoning strategies of clinical LLMs and informing the design of safer, more transparent medical AI systems.
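
A single-variable counterfactual changes exactly one attribute of a note and holds the rest fixed. The sketch below illustrates the idea on a template-based note; the template wording, attribute names, and values are invented for illustration and are not taken from MIMIC-IV or from the paper.

```python
def single_variable_counterfactuals(base: dict, alternatives: dict) -> list[dict]:
    """Return copies of `base` that each differ from it in exactly one attribute."""
    variants = []
    for attr, values in alternatives.items():
        for value in values:
            if value != base[attr]:
                variants.append({**base, attr: value})
    return variants


# Hypothetical template and attribute values.
TEMPLATE = "{age}-year-old {gender} patient; heart rate {heart_rate} bpm at discharge."

base = {"age": 70, "gender": "female", "heart_rate": 80}
alternatives = {
    "age": [40, 70],
    "gender": ["female", "male"],
    "heart_rate": [80, 130],
}
variants = single_variable_counterfactuals(base, alternatives)
notes = [TEMPLATE.format(**v) for v in variants]
for note in notes:
    print(note)
```

Comparing a model's output on the base note with its output on each variant isolates the effect of that one attribute, which is what makes demographic sensitivity measurable.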

[25] SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models' Knowledge of Indian Culture

Arijit Maji,Raghvendra Kumar,Akash Ghosh,Anushka,Sriparna Saha

Main category: cs.CL

TL;DR: SANSKRITI is a benchmark dataset with 21,853 question-answer pairs evaluating language models' comprehension of Indian cultural diversity across sixteen attributes. It highlights disparities in model performance and sets a new standard for assessing cultural understanding.

Details

Motivation: To improve language models' effectiveness by addressing their understanding of local socio-cultural contexts, specifically focusing on India's rich cultural diversity. Method: Introduced SANSKRITI, a benchmark dataset with 21,853 question-answer pairs covering 28 states and 8 union territories, and sixteen key attributes of Indian culture. Evaluated the dataset on leading Large Language Models (LLMs), Indic Language Models (ILMs), and Small Language Models (SLMs). Result: Significant disparities were found in the ability of different models to handle culturally nuanced queries, particularly in region-specific contexts. Conclusion: SANSKRITI sets a new standard for assessing and enhancing the cultural understanding capabilities of language models. Abstract: Language Models (LMs) are indispensable tools shaping modern workflows, but their global effectiveness depends on understanding local socio-cultural contexts. To address this, we introduce SANSKRITI, a benchmark designed to evaluate language models' comprehension of India's rich cultural diversity. Comprising 21,853 meticulously curated question-answer pairs spanning 28 states and 8 union territories, SANSKRITI is the largest dataset for testing Indian cultural knowledge. It covers sixteen key attributes of Indian culture: rituals and ceremonies, history, tourism, cuisine, dance and music, costume, language, art, festivals, religion, medicine, transport, sports, nightlife, and personalities, providing a comprehensive representation of India's cultural tapestry. We evaluate SANSKRITI on leading Large Language Models (LLMs), Indic Language Models (ILMs), and Small Language Models (SLMs), revealing significant disparities in their ability to handle culturally nuanced queries, with many models struggling in region-specific contexts. By offering an extensive, culturally rich, and diverse dataset, SANSKRITI sets a new standard for assessing and improving the cultural understanding of LMs.

[26] COSMMIC: Comment-Sensitive Multimodal Multilingual Indian Corpus for Summarization and Headline Generation

Raghvendra Kumar,S. A. Mohammed Salman,Aryan Sahu,Tridib Nandi,Pragathi Y. P.,Sriparna Saha,Jose G. Moreno

Main category: cs.CL

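TL;DR: COSMMIC is a new multimodal, multilingual dataset covering nine major Indian languages, with 4,959 article-image pairs and 24,484 comments. It enhances summarization by integrating reader insights and explores summarization and headline generation under four configurations. Evaluated with state-of-the-art language models, the study helps identify the most effective configurations for natural language generation and advances inclusive resources for Indian languages.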

Details

Motivation: Despite progress in comment-aware multimodal and multilingual summarization for English and Chinese, research on Indian languages remains limited. This study introduces the COSMMIC dataset to fill that gap. Method: The COSMMIC dataset covers nine major Indian languages, with 4,959 article-image pairs and 24,484 comments. Summarization and headline generation are explored under four configurations (article text alone, text with user comments, text with images, and text, comments, and images combined). Evaluation uses state-of-the-art language models such as LLama3 and GPT-4, with an IndicBERT classifier to filter noise and a CLIP-based classifier to extract information from images. Result: A comprehensive study of different component combinations identifies supportive comments, filters out noise, and determines the most effective configurations for natural language generation tasks. COSMMIC is distinctive in integrating text, images, and user feedback, offering a more holistic approach. Conclusion: The COSMMIC dataset fills a gap in Indian language resources, advancing NLP research and fostering inclusivity. Abstract: Despite progress in comment-aware multimodal and multilingual summarization for English and Chinese, research in Indian languages remains limited. This study addresses this gap by introducing COSMMIC, a pioneering comment-sensitive multimodal, multilingual dataset featuring nine major Indian languages. COSMMIC comprises 4,959 article-image pairs and 24,484 reader comments, with ground-truth summaries available in all included languages. Our approach enhances summaries by integrating reader insights and feedback. We explore summarization and headline generation across four configurations: (1) using article text alone, (2) incorporating user comments, (3) utilizing images, and (4) combining text, comments, and images. To assess the dataset's effectiveness, we employ state-of-the-art language models such as LLama3 and GPT-4. We conduct a comprehensive study to evaluate different component combinations, including identifying supportive comments, filtering out noise with a dedicated IndicBERT-based comment classifier, and extracting valuable insights from images with a multilingual CLIP-based classifier. This helps determine the most effective configurations for natural language generation (NLG) tasks. Unlike many existing datasets that are either text-only or lack user comments in multimodal settings, COSMMIC uniquely integrates text, images, and user feedback. This holistic approach bridges gaps in Indian language resources, advancing NLP research and fostering inclusivity.

[27] Targeted Lexical Injection: Unlocking Latent Cross-Lingual Alignment in Lugha-Llama via Early-Layer LoRA Fine-Tuning

Stanley Ngugi

Main category: cs.CL

TL;DR: Large Language Models (LLMs) often underperform in low-resource languages (LRLs) like Swahili due to data scarcity and underrepresentation. This paper introduces Targeted Lexical Injection (TLI), a fine-tuning approach using Low-Rank Adaptation (LoRA) and contrastive learning, to enhance lexical alignment for LRLs. Experiments show TLI significantly improves lexical alignment in both trained and unseen Swahili-English word pairs.

Details

Motivation: To address the issue of poor performance of LLMs in low-resource languages, particularly Swahili, due to insufficient data and lack of cross-lingual lexical alignment. Method: The method involves identifying an early layer in the model with strong lexical alignment for Swahili-English word pairs, then using Targeted Lexical Injection (TLI) with Low-Rank Adaptation (LoRA) and contrastive learning to fine-tune the model's embeddings from this layer. Result: TLI significantly improved lexical alignment for both trained (increased average cosine similarity from 0.3211 to 0.4113) and unseen word pairs (from 0.3143 to 0.4033), with statistically significant improvements. Conclusion: Targeted Lexical Injection enhances the ability of LLMs to preserve and propagate their inherent cross-lingual knowledge, offering a parameter-efficient strategy for improving lexical alignment in LRL-focused models. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their performance in low-resource languages (LRLs), such as Swahili, often lags due to data scarcity and underrepresentation in pre-training. A key challenge is achieving robust cross-lingual lexical alignment, crucial for tasks like translation and cross-lingual information retrieval. This paper introduces Targeted Lexical Injection (TLI), a novel and efficient fine-tuning approach. We first demonstrate that Lugha-Llama-8B-wura, a Swahili-centric LLM, exhibits strong, near-perfect lexical alignment for Swahili-English word pairs in its early internal layers (specifically Layer 2, with ~0.99998 average cosine similarity based on a pilot study), a capability not fully reflected in its final output representations (baseline ~0.32 similarity on our evaluation set). TLI leverages this insight by using Low-Rank Adaptation (LoRA) and a contrastive learning objective to fine-tune the model, specifically targeting embeddings from this empirically identified optimal early layer. 
Our experiments show that TLI significantly improves the output-level lexical alignment for 623 trained Swahili-English word pairs, increasing average cosine similarity from 0.3211 to 0.4113 (+28.08%, p < 1.33 x 10^-240). More importantly, these improvements generalize remarkably well to 63 unseen control word pairs, with similarity increasing from 0.3143 to 0.4033 (+28.32%, p < 7.17 x 10^-27). These findings suggest TLI enhances the model's ability to preserve and propagate its inherent early-layer cross-lingual knowledge, offering a parameter-efficient and effective strategy for improving lexical alignment in LRL-focused LLMs.
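
The alignment metric and the reported relative gains can be checked with a few lines of arithmetic. The sketch below defines cosine similarity and recomputes the percentage improvements from the four-decimal averages quoted above; the function names are hypothetical, and small differences from the reported percentages would be expected if the authors computed them from unrounded averages.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relative_gain_percent(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


trained_gain = relative_gain_percent(0.3211, 0.4113)  # 623 trained pairs
unseen_gain = relative_gain_percent(0.3143, 0.4033)   # 63 unseen pairs
print(f"{trained_gain:.1f}% {unseen_gain:.1f}%")  # 28.1% 28.3%
```

Both gains are relative to a baseline near 0.32, so the absolute increase in average cosine similarity is about 0.09 in each case.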

[28] Understanding GUI Agent Localization Biases through Logit Sharpness

Xingjian Tao,Yiwei Wang,Yujun Cai,Zhicheng Yang,Jing Tang

Main category: cs.CL

TL;DR: Multimodal large language models (MLLMs) have enabled GUI agents to interact with operating systems, but they frequently exhibit hallucinations. The paper proposes a fine-grained evaluation framework that categorizes model predictions into four distinct types, and introduces the Peak Sharpness Score (PSS) to quantify uncertainty and Context-Aware Cropping to improve model performance.

Details

Motivation: To address the issue of hallucinations in MLLMs that interact with GUIs, which are systematic localization errors that compromise reliability. Method: Propose a fine-grained evaluation framework categorizing predictions into four types, introduce Peak Sharpness Score (PSS) for uncertainty quantification, and Context-Aware Cropping for improving model performance without additional training. Result: The framework and methods provide actionable insights and enhance the interpretability and robustness of GUI agent behavior as demonstrated by extensive experiments. Conclusion: The proposed evaluation framework and techniques offer ways to better understand and improve the reliability of MLLMs in GUI interactions. Abstract: Multimodal large language models (MLLMs) have enabled GUI agents to interact with operating systems by grounding language into spatial actions. Despite their promising performance, these models frequently exhibit hallucinations, that is, systematic localization errors that compromise reliability. We propose a fine-grained evaluation framework that categorizes model predictions into four distinct types, revealing nuanced failure modes beyond traditional accuracy metrics. To better quantify model uncertainty, we introduce the Peak Sharpness Score (PSS), a metric that evaluates the alignment between semantic continuity and logits distribution in coordinate prediction. Building on this insight, we further propose Context-Aware Cropping, a training-free technique that improves model performance by adaptively refining input context. Extensive experiments demonstrate that our framework and methods provide actionable insights and enhance the interpretability and robustness of GUI agent behavior.

[29] AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent System Need

Zhouhong Gu,Xiaoxuan Zhu,Yin Cai,Hao Shen,Xingzhou Chen,Qingyi Wang,Jialin Li,Xiaoran Shi,Haoran Guo,Wenxuan Huang,Hongwei Feng,Yanghua Xiao,Zheyu Ye,Yao Hu,Shaosheng Cao

Main category: cs.CL

TL;DR: AgentGroupChat-V2 is a novel framework for multi-agent systems based on large language models. It addresses challenges in system architecture, cross-domain generalizability, and performance through three core innovations: a parallel architecture, an adaptive collaboration engine, and optimization strategies.

Details

Motivation: To overcome the challenges faced by current frameworks in system architecture design, cross-domain generalizability, and performance guarantees as task complexity and the number of agents increase. Method: The framework includes: (1) a divide-and-conquer fully parallel architecture for dependency management and distributed concurrent processing; (2) an adaptive collaboration engine for dynamic selection of LLM combinations and interaction modes; (3) agent organization optimization strategies for efficient problem decomposition. Result: Extensive experiments show superior performance across diverse domains, with notable improvements in accuracy and pass rates compared to state-of-the-art baselines, especially in complex reasoning scenarios. Conclusion: AgentGroupChat-V2 provides a comprehensive solution for building efficient, general-purpose LLM multi-agent systems, demonstrating significant advantages in complex reasoning scenarios. Abstract: Large language model based multi-agent systems have demonstrated significant potential in social simulation and complex task resolution domains. However, current frameworks face critical challenges in system architecture design, cross-domain generalizability, and performance guarantees, particularly as task complexity and the number of agents increase. We introduce AgentGroupChat-V2, a novel framework addressing these challenges through three core innovations: (1) a divide-and-conquer fully parallel architecture that decomposes user queries into hierarchical task forest structures, enabling dependency management and distributed concurrent processing; (2) an adaptive collaboration engine that dynamically selects heterogeneous LLM combinations and interaction modes based on task characteristics; (3) agent organization optimization strategies combining divide-and-conquer approaches for efficient problem decomposition.
Extensive experiments demonstrate AgentGroupChat-V2's superior performance across diverse domains, achieving 91.50% accuracy on GSM8K (exceeding the best baseline by 5.6 percentage points), 30.4% accuracy on competition-level AIME (nearly doubling other methods), and 79.20% pass@1 on HumanEval. Performance advantages become increasingly pronounced with higher task difficulty, particularly on Level 5 MATH problems where improvements exceed 11 percentage points compared to state-of-the-art baselines. These results confirm that AgentGroupChat-V2 provides a comprehensive solution for building efficient, general-purpose LLM multi-agent systems with significant advantages in complex reasoning scenarios. Code is available at https://github.com/MikeGu721/AgentGroupChat-V2.
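
The dependency-managed, concurrent execution described in innovation (1) can be illustrated with a minimal sketch. This is not the AgentGroupChat-V2 implementation: the function `run_task_forest`, the `deps` mapping, and the `solve` callback are hypothetical names chosen for illustration, and a thread pool stands in for whatever distributed backend the framework actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task_forest(tasks, deps, solve, max_workers=4):
    """Run subtasks wave by wave; all subtasks whose dependencies are
    already finished are submitted to the pool concurrently.

    tasks: list of task names (hypothetical subtasks of a user query)
    deps:  dict mapping a task to the tasks it depends on
    solve: callable(task, inputs) -> result; stands in for an agent call
    """
    done = {}
    remaining = set(tasks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining:
            ready = sorted(
                t for t in remaining
                if all(d in done for d in deps.get(t, ()))
            )
            if not ready:
                raise ValueError("cyclic or unsatisfiable dependencies")
            futures = {
                t: pool.submit(solve, t, {d: done[d] for d in deps.get(t, ())})
                for t in ready
            }
            for t, future in futures.items():
                done[t] = future.result()
            remaining -= set(ready)
    return done


# Toy usage: "c" needs the results of "a" and "b", which can run in parallel.
results = run_task_forest(
    ["a", "b", "c"],
    {"c": ["a", "b"]},
    lambda task, inputs: task + "".join(sorted(inputs.values())),
)
print(results)
```

Running independent subtasks in waves keeps the sketch short; a production scheduler would more likely start each subtask as soon as its own dependencies finish.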

[30] RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation

Xinnuo Xu,Rachel Lawrence,Kshitij Dubey,Atharva Pandey,Risa Ueno,Fabian Falck,Aditya V. Nori,Rahul Sharma,Amit Sharma,Javier Gonzalez

Main category: cs.CL

TL;DR: Recent Large Language Models (LLMs) show high accuracy on reasoning benchmarks, but it's unclear if this is due to true reasoning or statistical recall. This paper introduces RE-IMAGINE, a framework that characterizes a hierarchy of reasoning ability in LLMs and generates problem variations at different levels of the hierarchy. The framework demonstrates reductions in model performance when queried with problem variations, indicating reliance on statistical recall.

Details

Motivation: To determine whether the high accuracy of LLMs on reasoning benchmarks is due to true reasoning or statistical recall of the training set. Method: RE-IMAGINE, a framework that uses the ladder of causation to characterize a hierarchy of reasoning ability in LLMs and an automated pipeline to generate problem variations at different levels of the hierarchy. Result: Observes reductions in performance when LLMs are queried with problem variations, indicating reliance on statistical recall. Conclusion: The assessments indicate that LLMs rely on statistical recall for past performance and open the door to further research targeting skills across the reasoning hierarchy. Abstract: Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true reasoning or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE, a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.
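
To make the idea of altering a problem in a symbolic representation concrete, here is a minimal sketch of the simplest kind of variation: resampling the numeric values of a templated word problem and recomputing the answer from its symbolic form. The template, the function `make_variation`, and the value range are illustrative assumptions; the abstract does not specify RE-IMAGINE's pipeline, and its intervention and counterfactual levels are not covered here.

```python
import random


def make_variation(template, variable_names, compute_answer, rng, low=2, high=99):
    """Sample fresh values for the problem's variables and recompute the
    ground-truth answer from the symbolic form, so a memorized answer to
    the original problem no longer helps."""
    values = {name: rng.randint(low, high) for name in variable_names}
    return template.format(**values), compute_answer(**values)


# Hypothetical problem: text template plus its symbolic answer function.
template = "Ann has {a} apples and buys {b} more. How many apples does she have now?"
rng = random.Random(0)
question, answer = make_variation(template, ["a", "b"], lambda a, b: a + b, rng)
print(question, "->", answer)
```

Because the answer is derived from the symbolic form rather than stored, arbitrarily many variations can be generated with known ground truth.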

[31] Context-Informed Grounding Supervision

Hyunji Lee,Seunghyun Yoon,Yunjae Won,Hanseok Oh,Geewook Kim,Trung Bui,Franck Dernoncourt,Elias Stengel-Eskin,Mohit Bansal,Minjoon Seo

Main category: cs.CL

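TL;DR: Large language models (LLMs) are often supplemented with external knowledge at inference time to provide information not encoded in their parameters or to reduce hallucination. This paper proposes CINGS, a post-training supervision method that prepends relevant context to the response during training and computes the loss only over the response tokens while masking out the context, so that the model generates responses better grounded in the external context. Experiments show that CINGS yields stronger context-grounded generation in both textual and visual domains without degrading general downstream performance.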

Details

Motivation: Simply appending context at inference time does not ensure that the generated content is grounded in that context, so a method is needed to make LLMs use external knowledge more reliably when generating responses. Method: Proposes Context-INformed Grounding Supervision (CINGS), a post-training supervision method in which the model is trained on responses with relevant context prepended, while the loss is computed only over the response tokens and the context is masked out. Result: CINGS performs well in the text domain across 11 information-seeking datasets and in the vision-language domain across four benchmarks, reducing hallucinations and maintaining factual consistency without degrading general downstream performance. Conclusion: CINGS is an effective post-training supervision method that strengthens context-grounded generation in LLMs in both textual and visual domains, inducing the model to rely more on the external context. Abstract: Large language models (LLMs) are often supplemented with external knowledge to provide information not encoded in their parameters or to reduce hallucination. In such cases, we expect the model to generate responses by grounding its response in the provided external context. However, prior work has shown that simply appending context at inference time does not ensure grounded generation. To address this, we propose Context-INformed Grounding Supervision (CINGS), a post-training supervision in which the model is trained with relevant context prepended to the response, while computing the loss only over the response tokens and masking out the context. Our experiments demonstrate that models trained with CINGS exhibit stronger grounding in both textual and visual domains compared to standard instruction-tuned models. In the text domain, CINGS outperforms other training methods across 11 information-seeking datasets and is complementary to inference-time grounding techniques. In the vision-language domain, replacing a vision-language model's LLM backbone with a CINGS-trained model reduces hallucinations across four benchmarks and maintains factual consistency throughout the generated response. This improved grounding comes without degradation in general downstream performance. Finally, we analyze the mechanism underlying the enhanced grounding in CINGS and find that it induces a shift in the model's prior knowledge and behavior, implicitly encouraging greater reliance on the external context.
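
The loss masking at the core of this recipe can be sketched in a few lines. The sketch below assumes the per-token log-probabilities of the target tokens are already available and that a boolean mask marks which positions belong to the response; the function name `masked_nll` and the toy numbers are illustrative, not taken from the paper.

```python
import math


def masked_nll(token_log_probs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs: log p(token_t | tokens_<t) for every position in the
                     sequence (prepended context followed by response)
    is_response:     True for response positions, False for context positions
    """
    picked = [-lp for lp, keep in zip(token_log_probs, is_response) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return sum(picked) / len(picked)


# Two context tokens followed by two response tokens (toy probabilities).
log_probs = [math.log(0.9), math.log(0.1), math.log(0.5), math.log(0.5)]
mask = [False, False, True, True]
print(masked_nll(log_probs, mask))
```

The context tokens still condition the prediction of every response token; they simply contribute no loss terms of their own. In PyTorch, the same effect is commonly obtained by setting the labels of context positions to the `ignore_index` of the cross-entropy loss (default `-100`).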

[32] SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling

Md Imbesat Hassan Rizvi,Xiaodan Zhu,Iryna Gurevych

Main category: cs.CL

TL;DR: SPARE is a new framework for single-pass process annotation that enhances multi-step reasoning in LLMs, showing effectiveness across domains and improved efficiency compared to tree search-based methods.

Details

Motivation: To overcome the challenge of efficient, high-quality automated process annotation for advancing complex multi-step reasoning capabilities of LLMs. Method: Introduced SPARE, which enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution with explicit reasoning for evaluation. Result: SPARE improves reasoning performance in fine-tuning models and training reward models, achieves competitive performance on mathematical datasets, and offers 2.6 times greater efficiency than tree search-based automatic annotation. Conclusion: SPARE effectively facilitates process supervision across various domains and provides a more efficient alternative to existing methods, with public release of codebase and trained model to support further research. Abstract: Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables single-pass, per-step annotation by aligning each solution step to one or multiple steps in a reference solution, accompanied by explicit reasoning for evaluation. We show that reference-guided step-level evaluation effectively facilitates process supervision on four datasets spanning three domains: mathematical reasoning, multi-hop compositional question answering, and spatial reasoning. We demonstrate that SPARE, when compared to baselines, improves reasoning performance when used for: (1) fine-tuning models in an offline RL setup for inference-time greedy-decoding, and (2) training reward models for ranking/aggregating multiple LLM-generated outputs. 
Additionally, SPARE achieves competitive performance on challenging mathematical datasets while offering 2.6 times greater efficiency, requiring only 38% of the runtime, compared to tree search-based automatic annotation. The codebase, along with a trained SPARE-PRM model, is publicly released to facilitate further research and reproducibility.

[33] Enhancing Hyperbole and Metaphor Detection with Their Bidirectional Dynamic Interaction and Emotion Knowledge

Li Zheng,Sihang Wang,Hao Fei,Zuquan Peng,Fei Li,Jianming Fu,Chong Teng,Donghong Ji

Main category: cs.CL

TL;DR: The paper proposes EmoBi, a framework for hyperbole and metaphor detection guided by emotion analysis. It outperforms baselines with significant F1 score improvements.

Details

Motivation: Hyperbole and metaphor detection is crucial but challenging in NLP due to semantic obscurity and expressive diversity. Existing methods focus on superficial text features, ignoring associations and implicit emotion effects. Method: Proposes EmoBi framework with three modules: emotion analysis module mines connotations, domain mapping module identifies domains, and bidirectional dynamic interaction module promotes mutual understanding. A verification mechanism ensures accuracy and reliability. Result: Experiments show EmoBi surpasses all baseline methods on four datasets, achieving 28.1% higher F1 score for hyperbole detection on TroFi and 23.1% for metaphor detection on HYPO-L compared to current SoTA. Conclusion: EmoBi effectively advances hyperbole and metaphor detection, showcasing its potential in this field. Abstract: Text-based hyperbole and metaphor detection are of great significance for natural language processing (NLP) tasks. However, due to their semantic obscurity and expressive diversity, it is rather challenging to identify them. Existing methods mostly focus on superficial text features, ignoring the associations of hyperbole and metaphor as well as the effect of implicit emotion on perceiving these rhetorical devices. To implement these hypotheses, we propose an emotion-guided hyperbole and metaphor detection framework based on bidirectional dynamic interaction (EmoBi). Firstly, the emotion analysis module deeply mines the emotion connotations behind hyperbole and metaphor. Next, the emotion-based domain mapping module identifies the target and source domains to gain a deeper understanding of the implicit meanings of hyperbole and metaphor. Finally, the bidirectional dynamic interaction module enables the mutual promotion between hyperbole and metaphor. Meanwhile, a verification mechanism is designed to ensure detection accuracy and reliability. Experiments show that EmoBi outperforms all baseline methods on four datasets. 
Specifically, compared to the current SoTA, the F1 score increased by 28.1% for hyperbole detection on the TroFi dataset and 23.1% for metaphor detection on the HYPO-L dataset. These results, underpinned by in-depth analyses, underscore the effectiveness and potential of our approach for advancing hyperbole and metaphor detection.

[34] Lessons from Training Grounded LLMs with Verifiable Rewards

Shang Hong Sim,Tej Deep Pala,Vernon Toh,Hai Leong Chieu,Amir Zadeh,Chuan Li,Navonil Majumder,Soujanya Poria

Main category: cs.CL

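TL;DR: Generating grounded and trustworthy responses remains a key challenge for large language models. This paper explores how reinforcement learning and internal reasoning can enhance grounding in LLMs, training models with GRPO in a multi-stage setup to improve performance. The results show that combining instruction tuning with GRPO improves performance on long-form question answering tasks.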

Details

Motivation: Although retrieval-augmented generation (RAG) with citation-based grounding holds promise, existing instruction-tuned models still fail even in straightforward scenarios, for example by missing explicitly stated answers, citing incorrectly, or refusing to answer when supporting evidence is available. New approaches are therefore needed to improve grounding in LLMs. Method: Models are trained with GRPO, using verifiable outcome-based rewards that target answer correctness, citation sufficiency, and refusal quality. Training follows a two-stage strategy: first optimizing answer and citation behavior, then refusal behavior. In addition, instruction tuning is revisited via GPT-4 distillation and combined with GRPO to improve performance on long-form QA tasks. Result: Experiments show that reasoning-augmented models significantly outperform instruction-only variants in handling unanswerable queries and generating well-cited responses. Two-stage training further stabilizes the learning signal and improves grounding. Conclusion: The findings highlight the value of reasoning, stage-wise optimization, and outcome-driven reinforcement learning for building more verifiable and reliable LLMs. Abstract: Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs). While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available. In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs. We use the GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations. Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA we show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses. A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal. Additionally, we revisit instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance on long-form, generative QA tasks. Overall, our findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs.
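
As a rough illustration of the training signal, the sketch below shows the group-relative advantage that gives GRPO its name: each sampled response is scored against the mean and standard deviation of the rewards in its own group. The reward function `grounded_reward`, its three components, and its equal weights are assumptions made for illustration; the abstract names the reward targets but not how they are combined. The population standard deviation is used here, and implementations may differ on that detail.

```python
from statistics import mean, pstdev


def grounded_reward(answer_score, citation_score, refusal_score,
                    weights=(1.0, 1.0, 1.0)):
    """Hypothetical scalar reward: weighted sum of the three verifiable
    targets named in the abstract (weights are illustrative)."""
    w_ans, w_cit, w_ref = weights
    return w_ans * answer_score + w_cit * citation_score + w_ref * refusal_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its
    group (all responses sampled for the same prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored by the toy reward.
rewards = [
    grounded_reward(1.0, 1.0, 0.0),
    grounded_reward(1.0, 0.0, 0.0),
    grounded_reward(0.0, 0.0, 0.0),
    grounded_reward(1.0, 0.5, 0.0),
]
print(group_relative_advantages(rewards))
```

Because the baseline comes from the group itself, no separate value network is needed, and a group in which every response earns the same reward contributes zero advantage.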

[35] RATTENTION: Towards the Minimal Sliding Window Size in Local-Global Attention Models

Bailin Wang,Chang Lan,Chong Wang,Ruoming Pang

Main category: cs.CL

TL;DR: RATTENTION is a new local-global attention model that improves efficiency in short-context scenarios without sacrificing performance.

Details

Motivation: To address the limitation of local attention which completely disregards tokens outside the defined window, aiming to shift the Pareto frontier for better tradeoff between performance and efficiency. Method: Introduce RATTENTION, a variant of local attention combined with a specialized linear attention mechanism to capture information from out-of-window tokens. Result: Pretraining experiments show RATTENTION achieves superior Pareto tradeoff at different scales. A window size of 512 matches full-attention performance and enhances long-context performance. Conclusion: RATTENTION maintains training speeds comparable to state-of-the-art approaches while offering efficiency gains in short-context regimes. Abstract: Local-global attention models have recently emerged as compelling alternatives to standard Transformers, promising improvements in both training and inference efficiency. However, the crucial choice of window size presents a Pareto tradeoff: larger windows maintain performance akin to full attention but offer minimal efficiency gains in short-context scenarios, while smaller windows can lead to performance degradation. Current models, such as Gemma2 and Mistral, adopt conservative window sizes (e.g., 4096 out of an 8192 pretraining length) to preserve performance. This work investigates strategies to shift this Pareto frontier, enabling local-global models to achieve efficiency gains even in short-context regimes. Our core motivation is to address the intrinsic limitation of local attention -- its complete disregard for tokens outside the defined window. We explore RATTENTION, a variant of local attention integrated with a specialized linear attention mechanism designed to capture information from these out-of-window tokens. Pretraining experiments at the 3B and 12B scales demonstrate that RATTENTION achieves a superior Pareto tradeoff between performance and efficiency. 
As a sweet spot, RATTENTION with a window size of just 512 consistently matches the performance of full-attention models across diverse settings. Furthermore, the recurrent nature inherent in the linear attention component of RATTENTION contributes to enhanced long-context performance, as validated on the RULER benchmark. Crucially, these improvements do not compromise training efficiency; thanks to a specialized kernel implementation and the reduced window size, RATTENTION maintains training speeds comparable to existing state-of-the-art approaches.
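
The general idea of pairing a sliding window with a recurrent linear-attention summary of out-of-window tokens can be sketched as follows. This is a toy, single-head, pure-Python illustration under several assumptions that do not come from the abstract: the `elu(x) + 1` feature map, the normalization of the linear branch, and the simple sum of the two branches. It is not the RATTENTION kernel.

```python
import math


def phi(x):
    # Positive feature map elu(x) + 1, a common linear-attention choice
    # (an assumption here, not taken from the paper).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def local_plus_linear_attention(qs, ks, vs, window):
    """Causal attention: softmax over the last `window` tokens, plus a
    linear-attention readout of a running state that summarizes every
    token that has already slid out of the window."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # sum of phi(k) v^T
    norm = [0.0] * d_k                         # sum of phi(k)
    outputs = []
    for t, q in enumerate(qs):
        start = max(0, t - window + 1)
        if start > 0:
            # Exactly one token (index start - 1) leaves the window per step.
            j = start - 1
            fk = phi(ks[j])
            for a in range(d_k):
                norm[a] += fk[a]
                for b in range(d_v):
                    state[a][b] += fk[a] * vs[j][b]
        # Local branch: softmax attention over in-window tokens.
        scores = [dot(q, ks[j]) / math.sqrt(d_k) for j in range(start, t + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        local = [
            sum(w * vs[start + i][b] for i, w in enumerate(weights)) / total
            for b in range(d_v)
        ]
        # Global branch: constant-size recurrent state, no window limit.
        fq = phi(q)
        denom = dot(fq, norm)
        if denom > 0:
            glob = [sum(fq[a] * state[a][b] for a in range(d_k)) / denom
                    for b in range(d_v)]
        else:
            glob = [0.0] * d_v
        outputs.append([l + g for l, g in zip(local, glob)])
    return outputs


out = local_plus_linear_attention(
    qs=[[0.0], [0.0], [0.0]], ks=[[0.0], [0.0], [0.0]],
    vs=[[1.0], [2.0], [3.0]], window=1,
)
print(out)
```

The recurrent state has a fixed size regardless of sequence length, which is what lets the window shrink without the model losing all access to earlier tokens.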

[36] Approximating Language Model Training Data from Weights

John X. Morris,Junjie Oscar Yin,Woojeong Kim,Vitaly Shmatikov,Alexander M. Rush

Main category: cs.CL

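TL;DR: Modern language models often have open weights but closed training data. This paper formalizes the problem of approximating data from model weights and proposes several baselines and metrics. It develops a gradient-based approach that selects the best-matching data from a large public text corpus and effectively recovers useful data given only the weights of the original and finetuned models. Even when none of the true training data is known, the method can locate a small subset of public Web documents that can be used to train a model close to the original model's performance. On the AG News classification task, the method improves performance from 65% (using randomly selected data) to 80%, approaching the expert benchmark of 88%. Applied to a model trained with SFT on MSMARCO web documents, it reduces perplexity from 3.3 to 2.3, compared to an expert LLAMA model's perplexity of 2.0.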

Details

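Motivation: Modern language models release their weights, but their training data remains closed, which limits researchers' ability to understand and improve the models. Overcoming this requires a way to approximate the training data from the model weights. Method: The authors propose a gradient-based approach that approximates the original training data by selecting, from a large public text corpus, the data that best matches the model weights. Specifically, the method uses gradient information to find the subset of data that best recovers the model's performance. The authors also define several metrics for evaluating the quality of the approximation. The approach applies to models trained for both classification and supervised finetuning. Result: On the AG News classification task, training on randomly selected data reaches only 65% accuracy, whereas training on data selected by the proposed method raises accuracy to 80%, approaching the expert benchmark of 88%. For a model trained with SFT on MSMARCO web documents, the method reduces perplexity from 3.3 to 2.3, compared to an expert LLAMA model's perplexity of 2.0. Conclusion: The paper presents a gradient-based data approximation method that, without access to any true training data, finds a small subset of a public text corpus that can be used to train a new model close to the original model's performance. The method performs well on both classification and supervised finetuning tasks, offering a useful tool for studying language models with closed training data. Abstract: Modern language models often have open weights but closed training data. We formalize the problem of data approximation from model weights and propose several baselines and metrics. We develop a gradient-based approach that selects the highest-matching data from a large public text corpus and show its effectiveness at recovering useful data given only weights of the original and finetuned models. Even when none of the true training data is known, our method is able to locate a small subset of public Web documents that can be used to train a model close to the original model performance, given models trained for both classification and supervised-finetuning. On the AG News classification task, our method improves performance from 65% (using randomly selected data) to 80%, approaching the expert benchmark of 88%. When applied to a model trained with SFT on MSMARCO web documents, our method reduces perplexity from 3.3 to 2.3, compared to an expert LLAMA model's perplexity of 2.0.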
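
One plausible reading of "gradient-based selection given only the original and finetuned weights" is sketched below: score each candidate document by how well a gradient-descent step on it, taken at the original weights, aligns with the observed weight change, then keep the top-scoring documents. The scoring rule (cosine similarity), the function names, and the `grad_fn` callback are all assumptions made for illustration; the abstract does not describe the paper's actual selection procedure.

```python
import math


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def select_matching(candidates, grad_fn, w_base, w_tuned, k):
    """Keep the k candidates whose descent direction at the base weights
    best matches the observed update w_tuned - w_base.

    grad_fn(weights, example) -> loss gradient as a flat list
    (hypothetical callback; in practice this would be a backward pass).
    """
    delta = [t - b for b, t in zip(w_base, w_tuned)]
    scored = []
    for example in candidates:
        step = [-g for g in grad_fn(w_base, example)]  # descent direction
        scored.append((cosine(step, delta), example))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [example for _, example in scored[:k]]


# Toy usage with precomputed gradients standing in for real backward passes.
docs = [
    {"id": "doc-a", "grad": [-1.0, 0.0]},
    {"id": "doc-b", "grad": [0.0, -1.0]},
    {"id": "doc-c", "grad": [1.0, 0.0]},
]
chosen = select_matching(docs, lambda w, ex: ex["grad"], [0.0, 0.0], [1.0, 0.0], k=1)
print([d["id"] for d in chosen])
```

For real models the gradients are far too large to store per document, so any practical version would need some form of gradient compression or projection; that is outside the scope of this sketch.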

[37] PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction

Shufan Li,Aditya Grover

Main category: cs.CL

TL;DR: Predictive Generation (PredGen) reduces latency in voice chat apps by generating candidate responses while users are still speaking, cutting response time by about 2x with minimal extra computation.

Details

Motivation: Large Language Models (LLMs) often cause noticeable latency between user input and audio output in real-time voice chat applications, especially when deployed on consumer-grade hardware. Method: The proposed method, Predictive Generation (PredGen), uses speculative decoding at input time to generate candidate responses while the user is still speaking. This allows the system to start Text-to-Speech (TTS) processing with minimal delay. Result: Simulated experiments on Lmsys and MT-Bench datasets demonstrate that PredGen effectively reduces latency by around 2x across various use cases, with only minimal additional computation cost. Conclusion: PredGen mitigates or eliminates the delay caused by LLMs in generating the first sentence for TTS systems, enhancing the user experience in real-time voice chat applications. Abstract: Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses. However, their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences. This latency is particularly evident when LLMs are deployed as single-user voice assistants on consumer-grade hardware with limited computing capacity. We discovered that this latency is primarily dominated by the time it takes for the LLMs to generate the first sentence, which is required as input by the TTS systems that synthesize audio responses on a sentence-by-sentence basis. To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates, or even eliminates, this delay through speculative decoding at input time. PredGen generates candidate responses while the user is still speaking, enabling the system to begin TTS processing with minimal delay. 
Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method can effectively reduce the latency by around 2x across a wide range of use cases, while incurring only minimal additional computation cost at input time (computation that would otherwise go unused).

[38] Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models

Zhengyang Shan,Emily Ruth Diana,Jiawei Zhou

Main category: cs.CL

TL;DR: A new metric called Gender Inclusivity Fairness Index (GIFI) is introduced to evaluate gender fairness in large language models, highlighting the need for improving inclusivity.

Details

Motivation: Existing studies mainly focus on binary gender distinctions, lacking a comprehensive evaluation of gender inclusivity in LLMs. Method: Developed GIFI, which includes evaluations at different levels from probing with gender pronouns to testing model generation and cognitive behaviors under various gender assumptions. Result: Significant variations in gender inclusivity were found among 22 evaluated LLMs. Conclusion: Improving LLMs' inclusivity is crucial, and GIFI provides a benchmark for future advancements in gender fairness. Abstract: We present a comprehensive evaluation of gender fairness in large language models (LLMs), focusing on their ability to handle both binary and non-binary genders. While previous studies primarily focus on binary gender distinctions, we introduce the Gender Inclusivity Fairness Index (GIFI), a novel and comprehensive metric that quantifies the diverse gender inclusivity of LLMs. GIFI consists of a wide range of evaluations at different levels, from simply probing the model with respect to provided gender pronouns to testing various aspects of model generation and cognitive behaviors under different gender assumptions, revealing biases associated with varying gender identifiers. We conduct extensive evaluations with GIFI on 22 prominent open-source and proprietary LLMs of varying sizes and capabilities, discovering significant variations in LLMs' gender inclusivity. Our study highlights the importance of improving LLMs' inclusivity, providing a critical benchmark for future advancements in gender fairness in generative models.

[39] SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification

Chengye Wang,Yifei Shen,Zexi Kuang,Arman Cohan,Yilun Zhao

Main category: cs.CL

TL;DR: The paper introduces SciVer, a benchmark for evaluating foundation models' ability to verify claims in multimodal scientific contexts. It consists of 3,000 annotated examples across 1,113 papers. The study assesses 21 state-of-the-art models, revealing a significant performance gap compared to human experts. Through RAG analysis and error evaluations, critical limitations in open-source models are identified.

Details

Motivation: To evaluate the capability of foundation models in verifying claims within a multimodal scientific context, addressing the lack of specific benchmarks for this task. Method: Created SciVer, a benchmark with 3,000 expert-annotated examples from 1,113 scientific papers. Evaluated 21 state-of-the-art multimodal foundation models using this benchmark. Result: There is a considerable performance gap between the assessed models and human experts. Critical limitations in current open-source models were identified through RAG analysis and human-conducted error evaluations. Conclusion: Current foundation models have significant limitations in comprehending and reasoning within multimodal scientific literature tasks. Insights from this study can help improve model performance in these tasks. Abstract: We introduce SciVer, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SciVer consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiment reveals a substantial performance gap between these models and human experts on SciVer. Through an in-depth analysis of retrieval-augmented generation (RAG), and human-conducted error evaluations, we identify critical limitations in current open-source models, offering key insights to advance models' comprehension and reasoning in multimodal scientific literature tasks.

[40] DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement

Shaoqing Lin,Chong Teng,Fei Li,Donghong Ji,Lizhen Qu,Zhuang Li

Main category: cs.CL

TL;DR: The paper introduces Discourse-level text Scene Graph parsing (DiscoSG), a new task and dataset for creating comprehensive scene graphs from multi-sentence image captions. They propose DiscoSG-Refiner, an efficient method that improves performance over existing baselines while significantly reducing inference time.

Details

Motivation: Current text scene graph parsers are designed for single-sentence captions and struggle with discourse-level inputs, leading to fragmented graphs and degraded performance in downstream Vision-Language Model tasks. Method: Introduced DiscoSG, a new task supported by the DiscoSG-DS dataset containing multi-sentence caption-graph pairs. Proposed DiscoSG-Refiner which uses two small PLMs: one for drafting a base graph and another for iteratively proposing edits to improve efficiency and performance. Result: DiscoSG-Refiner improves SPICE scores by ~30% over the best baseline and is 86 times faster than GPT-4. It also enhances downstream VLM tasks like caption evaluation and hallucination detection. Conclusion: DiscoSG and DiscoSG-Refiner provide a novel approach to discourse-level scene graph parsing, offering better performance and efficiency compared to fine-tuning large PLMs. Abstract: Vision-Language Models (VLMs) now generate discourse-level, multi-sentence visual descriptions, challenging text scene graph parsers originally designed for single-sentence caption-to-graph mapping. Current approaches typically merge sentence-level parsing outputs for discourse input, often missing phenomena like cross-sentence coreference, resulting in fragmented graphs and degraded downstream VLM task performance. To address this, we introduce a new task, Discourse-level text Scene Graph parsing (DiscoSG), supported by our dataset DiscoSG-DS, which comprises 400 expert-annotated and 8,430 synthesised multi-sentence caption-graph pairs for images. Each caption averages 9 sentences, and each graph contains at least 3 times more triples than those in existing datasets. While fine-tuning large PLMs (i.e., GPT-4) on DiscoSG-DS improves SPICE by approximately 48% over the best sentence-merging baseline, high inference cost and restrictive licensing hinder its open-source use, and smaller fine-tuned PLMs struggle with complex graphs. 
We propose DiscoSG-Refiner, which drafts a base graph using one small PLM, then employs a second PLM to iteratively propose graph edits, reducing full-graph generation overhead. Using two Flan-T5-Base models, DiscoSG-Refiner still improves SPICE by approximately 30% over the best baseline while achieving 86 times faster inference than GPT-4. It also consistently improves downstream VLM tasks like discourse-level caption evaluation and hallucination detection. Code and data are available at: https://github.com/ShaoqLin/DiscoSG
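
The draft-then-edit loop can be sketched as below, with the scene graph represented as a set of (subject, relation, object) triples. The two callables `draft_fn` and `edit_fn` are hypothetical stand-ins for the two small PLMs, and the stopping rule (stop when a round changes nothing, or after `max_rounds`) is an assumption; the abstract does not specify either.

```python
def refine_graph(caption, draft_fn, edit_fn, max_rounds=3):
    """Draft a base graph once, then repeatedly apply proposed edits
    instead of regenerating the whole graph.

    draft_fn(caption)        -> iterable of (subject, relation, object)
    edit_fn(caption, graph)  -> (triples_to_add, triples_to_remove)
    """
    graph = set(draft_fn(caption))
    for _ in range(max_rounds):
        to_add, to_remove = edit_fn(caption, graph)
        new_graph = (graph - set(to_remove)) | set(to_add)
        if new_graph == graph:
            break  # no effective edit proposed; stop early
        graph = new_graph
    return graph


# Toy stand-ins for the two models.
def toy_draft(caption):
    return [("dog", "on", "sofa"), ("it", "is", "brown")]


def toy_edit(caption, graph):
    # Resolve the cross-sentence pronoun "it" to "dog" if still present.
    if ("it", "is", "brown") in graph:
        return [("dog", "is", "brown")], [("it", "is", "brown")]
    return [], []


caption = "A dog lies on the sofa. It is brown."
print(sorted(refine_graph(caption, toy_draft, toy_edit)))
```

Each round only has to emit a handful of insertions and deletions, which is where the saving over full-graph generation comes from.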

[41] WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts

Negar Foroutan,Angelika Romanou,Matin Ansaripour,Julian Martin Eisenschlos,Karl Aberer,Rémi Lebret

Main category: cs.CL

TL;DR: This paper introduces WikiMixQA, a benchmark with 1,000 MCQs for evaluating cross-modal reasoning over tables and charts from Wikipedia pages. It evaluates 12 VLLMs revealing challenges in long-context, multi-modal reasoning.

Details

Motivation: To address the gap in evaluating vision-language models' effectiveness in processing long-context vision inputs and complex reasoning across multiple modalities. Method: Introduced WikiMixQA, a benchmark of 1,000 MCQs covering tables and charts from 4,000 Wikipedia pages across seven topics. Evaluated 12 state-of-the-art vision-language models on this benchmark. Result: Proprietary models achieve ~70% accuracy with direct context but performance drops significantly with retrieval from long documents. GPT-4-o exceeds 50% accuracy in this setting while open-source models reach a maximum accuracy of 27%. Conclusion: WikiMixQA highlights the challenges of long-context, multi-modal reasoning and serves as an important benchmark for advancing document understanding research. Abstract: Documents are fundamental to preserving and disseminating information, often incorporating complex layouts, tables, and charts that pose significant challenges for automatic document understanding (DU). While vision-language large models (VLLMs) have demonstrated improvements across various tasks, their effectiveness in processing long-context vision inputs remains unclear. This paper introduces WikiMixQA, a benchmark comprising 1,000 multiple-choice questions (MCQs) designed to evaluate cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages spanning seven distinct topics. Unlike existing benchmarks, WikiMixQA emphasizes complex reasoning by requiring models to synthesize information from multiple modalities. We evaluate 12 state-of-the-art vision-language models, revealing that while proprietary models achieve ~70% accuracy when provided with direct context, their performance deteriorates significantly when retrieval from long documents is required. Among these, GPT-4-o is the only model exceeding 50% accuracy in this setting, whereas open-source models perform considerably worse, with a maximum accuracy of 27%. 
These findings underscore the challenges of long-context, multi-modal reasoning and establish WikiMixQA as a crucial benchmark for advancing document understanding research.

[42] From Model to Classroom: Evaluating Generated MCQs for Portuguese with Narrative and Difficulty Concerns

Bernardo Leite,Henrique Lopes Cardoso,Pedro Pinto,Abel Ferreira,Luís Abreu,Isabel Rangel,Sandra Monteiro

Main category: cs.CL

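TL;DR: Although generative AI offers an opportunity to automate multiple-choice question (MCQ) generation, the quality and reliability of generated MCQs have received little attention. This paper studies how well current generative models produce MCQs for Portuguese reading comprehension and finds their quality close to that of human-authored questions, though issues remain in semantic clarity, answerability, and distractor design.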

Details

Motivation: Manually creating MCQs with varying difficulty levels and targeted reading skills is time-consuming and costly, and generative AI offers a way to automate it. However, the quality and reliability of generated MCQs have not been adequately studied, particularly the failure cases that matter in real-world use. In addition, most studies focus on English, leaving languages such as Portuguese underexplored. Method: The study uses generative models to produce MCQs for Portuguese reading comprehension that align with curriculum-relevant narrative elements and span different difficulty levels, and evaluates them through expert review and analysis of the psychometric properties of student responses. Result: Current generative models can produce MCQs of comparable quality to human-authored ones, but problems remain in semantic clarity, answerability, and the design of high-quality distractors. Conclusion: Generative models show promise for Portuguese MCQ generation, but further improvement is needed in semantic clarity and distractor design to ensure the questions suit elementary school settings. Abstract: While MCQs are valuable for learning and evaluation, manually creating them with varying difficulty levels and targeted reading skills remains a time-consuming and costly task. Recent advances in generative AI provide an opportunity to automate MCQ generation efficiently. However, assessing the actual quality and reliability of generated MCQs has received limited attention, particularly regarding cases where generation fails. This aspect becomes particularly important when the generated MCQs are meant to be applied in real-world settings. Additionally, most MCQ generation studies focus on English, leaving other languages underexplored. This paper investigates the capabilities of current generative models in producing MCQs for reading comprehension in Portuguese, a morphologically rich language. Our study focuses on generating MCQs that align with curriculum-relevant narrative elements and span different difficulty levels. We evaluate these MCQs through expert review and by analyzing the psychometric properties extracted from student responses to assess their suitability for elementary school students. Our results show that current models can generate MCQs of comparable quality to human-authored ones. However, we identify issues related to semantic clarity and answerability. Also, challenges remain in generating distractors that engage students and meet established criteria for high-quality MCQ option design.

[43] The Compositional Architecture of Regret in Large Language Models

Xiangxiang Cui,Shu Yang,Tianjin Huang,Wanyu Lin,Lijie Hu,Di Wang

Main category: cs.CL

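TL;DR: This paper studies the regret mechanism in large language models. It proposes a workflow for constructing a regret dataset, the S-CDI metric for identifying the optimal regret representation layer, the RDS metric for identifying regret neurons, and the GIC metric for analyzing activation patterns. Experiments identify the optimal regret representation layer and reveal an M-shaped decoupling pattern and three functional groups of neurons.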

Details

Motivation: Studying the regret mechanism in large language models is crucial for improving model reliability and helps reveal how cognition is encoded in neural networks. However, dedicated datasets and metrics for identifying and analyzing regret expressions and their internal representations are lacking. Method: (1) A workflow for constructing a comprehensive regret dataset through strategically designed prompting scenarios; (2) the S-CDI metric for finding the optimal regret representation layer; (3) the RDS metric for identifying regret neurons and the GIC metric for analyzing activation patterns. Result: Experiments identified the optimal regret representation layer, which significantly improved performance in probe classification experiments; they revealed an M-shaped decoupling pattern across model layers; and the RDS metric grouped neurons into three functional categories. Conclusion: The proposed metrics and methods address key challenges in existing research and provide new tools and perspectives for understanding the regret mechanism in large language models. Abstract: Regret in Large Language Models refers to their explicit regret expression when presented with evidence contradicting their previously generated misinformation. Studying the regret mechanism is crucial for enhancing model reliability and helps in revealing how cognition is coded in neural networks. To understand this mechanism, we need to first identify regret expressions in model outputs, then analyze their internal representation. This analysis requires examining the model's hidden states, where information processing occurs at the neuron level. However, this faces three key challenges: (1) the absence of specialized datasets capturing regret expressions, (2) the lack of metrics to find the optimal regret representation layer, and (3) the lack of metrics for identifying and analyzing regret neurons. Addressing these limitations, we propose: (1) a workflow for constructing a comprehensive regret dataset through strategically designed prompting scenarios, (2) the Supervised Compression-Decoupling Index (S-CDI) metric to identify optimal regret representation layers, and (3) the Regret Dominance Score (RDS) metric to identify regret neurons and the Group Impact Coefficient (GIC) to analyze activation patterns. Our experimental results successfully identified the optimal regret representation layer using the S-CDI metric, which significantly enhanced performance in probe classification experiments. Additionally, we discovered an M-shaped decoupling pattern across model layers, revealing how information processing alternates between coupling and decoupling phases. Through the RDS metric, we categorized neurons into three distinct functional groups: regret neurons, non-regret neurons, and dual neurons.

[44] Minding the Politeness Gap in Cross-cultural Communication

Yuka Machino,Matthias Hofer,Max Siegel,Joshua B. Tenenbaum,Robert D. Hawkins

Main category: cs.CL

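TL;DR: Through three experiments and a computational cognitive model, this paper studies how speakers of British and American English interpret intensifiers differently, finding that the cross-cultural differences stem from different literal meanings and different weights on utterance cost.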

Details

Motivation: To investigate whether subtle differences in interpretation in cross-cultural communication arise from the literal meanings of words or from general pragmatic factors such as politeness and brevity. Method: Three experiments examine how speakers of British and American English interpret intensifiers (such as "quite" and "very"); a computational cognitive model simulates listeners reasoning recursively about speakers who balance informativity, politeness, and utterance cost. Result: Cross-cultural differences in intensifier interpretation stem from different literal meanings and different weights on utterance cost. Conclusion: Cross-cultural differences in interpretation are not determined by semantic variation or politeness norms alone, but emerge from a complex interplay between the two. Abstract: Misunderstandings in cross-cultural communication often arise from subtle differences in interpretation, but it is unclear whether these differences arise from the literal meanings assigned to words or from more general pragmatic factors such as norms around politeness and brevity. In this paper, we report three experiments examining how speakers of British and American English interpret intensifiers like "quite" and "very." To better understand these cross-cultural differences, we developed a computational cognitive model where listeners recursively reason about speakers who balance informativity, politeness, and utterance cost. Our model comparisons suggested that cross-cultural differences in intensifier interpretation stem from a combination of (1) different literal meanings and (2) different weights on utterance cost. These findings challenge accounts based purely on semantic variation or politeness norms, demonstrating that cross-cultural differences in interpretation emerge from an intricate interplay between the two.

[45] Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability

Yusuke Sakai,Hidetaka Kamigaito,Taro Watanabe

Main category: cs.CL

TL;DR: In generative commonsense reasoning tasks, LLMs can generate sentences with given concepts but may struggle when a specific concept order is required. This study introduces Ordered CommonGen, a benchmark to evaluate both compositional generalization and instruction-following abilities of LLMs by measuring ordered coverage.

Details

Motivation: To address the limitation of LLMs in following instructions regarding concept order while generating sentences, which is crucial for generative commonsense reasoning tasks. Method: Propose a new benchmark called Ordered CommonGen to evaluate LLMs' performance in generating sentences that adhere to a specified concept order. Measure ordered coverage as an evaluation metric. Result: Comprehensive analysis of 36 LLMs showed that even though they generally understand instructions, their outputs often lack diversity due to biases towards certain concept order patterns. The best-performing model achieved only about 75% ordered coverage. Conclusion: There is a need for further improvement in LLMs' instruction-following and compositional generalization capabilities. Abstract: In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts. However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered. Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities.
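
As a rough illustration of the ordered-coverage idea described above, the sketch below checks whether the given concepts appear in a generated sentence in the specified order. The function names and the exact-token matching rule are assumptions made for this example; the benchmark's actual metric may treat inflected forms and scoring differently.

```python
import re


def ordered_coverage(sentence, concepts):
    """Return True if every concept occurs in the sentence in the given order.

    Assumption: concepts are matched as exact lowercase tokens, so inflected
    forms (e.g. "dogs" for "dog") are not recognized in this sketch.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    position = 0
    for concept in concepts:
        try:
            # Search only after the previous match to enforce the order.
            position = tokens.index(concept.lower(), position) + 1
        except ValueError:
            return False
    return True


def ordered_coverage_rate(sentences, concept_lists):
    """Fraction of sentences that cover their concepts in the requested order."""
    hits = sum(ordered_coverage(s, c) for s, c in zip(sentences, concept_lists))
    return hits / len(sentences)
```

Matching greedily at the earliest possible position is enough here, because it never rules out a later valid match for the remaining concepts.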

[46] Oldies but Goldies: The Potential of Character N-grams for Romanian Texts

Dana Lupsa,Sanda-Maria Avram

Main category: cs.CL

TL;DR: This study explores authorship attribution for Romanian texts using the ROST corpus and six machine learning techniques, finding that an ANN model with character n-gram features delivers high accuracy.

Details

Motivation: To address the problem of authorship attribution for Romanian texts using a standard benchmark (ROST corpus) and explore the effectiveness of various machine learning techniques with character n-gram features. Method: Systematically evaluate six machine learning techniques (SVM, LR, k-NN, DT, RF, ANN) using character n-gram features for classification. Result: The ANN model achieved the highest performance, including perfect classification in four out of fifteen runs when using 5-gram features. Conclusion: Lightweight, interpretable character n-gram approaches can deliver state-of-the-art accuracy for Romanian authorship attribution, showing potential in resource-constrained or under-studied language settings. Abstract: This study addresses the problem of authorship attribution for Romanian texts using the ROST corpus, a standard benchmark in the field. We systematically evaluate six machine learning techniques: Support Vector Machine (SVM), Logistic Regression (LR), k-Nearest Neighbors (k-NN), Decision Trees (DT), Random Forests (RF), and Artificial Neural Networks (ANN), employing character n-gram features for classification. Among these, the ANN model achieved the highest performance, including perfect classification in four out of fifteen runs when using 5-gram features. These results demonstrate that lightweight, interpretable character n-gram approaches can deliver state-of-the-art accuracy for Romanian authorship attribution, rivaling more complex methods. Our findings highlight the potential of simple stylometric features in resource-constrained or under-studied language settings.
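
To make the feature type concrete, here is a minimal sketch of character n-gram extraction, the input representation the abstract describes. It covers only the feature step; the function names and the relative-frequency normalization are assumptions for illustration, and the classifiers evaluated in the paper are not reproduced.

```python
from collections import Counter


def char_ngrams(text, n=5):
    """Count overlapping character n-grams (spaces and punctuation included)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def relative_frequencies(counts):
    """Turn raw n-gram counts into relative frequencies.

    Assumption: simple normalization by the total count; other weightings
    (e.g. TF-IDF) are equally plausible and not specified in the abstract.
    """
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}
```

A text of length k yields k - n + 1 overlapping n-grams, so even short passages produce many features, which may be part of why such simple representations work for stylometry.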

[47] CC-LEARN: Cohort-based Consistency Learning

Xiao Ye,Shaswat Shrivastava,Zhaonan Li,Jacob Dineen,Shijie Lu,Avneet Ahuja,Ming Shen,Zhikun Xu,Ben Zhou

Main category: cs.CL

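TL;DR: Large language models excel at many tasks but still struggle with consistent, robust reasoning. This paper proposes Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning. It trains on cohorts of similar questions derived from shared programmatic abstractions and defines a composite objective combining cohort accuracy, a retrieval bonus, and a rejection penalty to optimize reasoning consistency. Experiments show that CC-Learn improves reasoning accuracy and stability.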

Details

Motivation: Although large language models perform well on many tasks, they still struggle with consistent, robust reasoning, and a new approach is needed to make their reasoning more reliable. Method: The paper proposes Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. To enforce cohort-level consistency, it defines a composite objective consisting of cohort accuracy, a retrieval bonus for effective retrieval, and a rejection penalty for trivial or invalid lookups. Reinforcement learning can optimize this objective directly, which supervised fine-tuning cannot. Result: On challenging reasoning benchmarks (such as ARC-Challenge and StrategyQA), CC-Learn improves accuracy and reasoning stability over pretrained and SFT baselines. Conclusion: Cohort-level reinforcement learning effectively improves the consistency of LLM reasoning. Abstract: Large language models excel at many tasks but still struggle with consistent, robust reasoning. We introduce Cohort-based Consistency Learning (CC-Learn), a reinforcement learning framework that improves the reliability of LLM reasoning by training on cohorts of similar questions derived from shared programmatic abstractions. To enforce cohort-level consistency, we define a composite objective combining cohort accuracy, a retrieval bonus for effective problem decomposition, and a rejection penalty for trivial or invalid lookups that reinforcement learning can directly optimize, unlike supervised fine-tuning. Optimizing this reward guides the model to adopt uniform reasoning patterns across all cohort members. Experiments on challenging reasoning benchmarks (including ARC-Challenge and StrategyQA) show that CC-Learn boosts both accuracy and reasoning stability over pretrained and SFT baselines. These results demonstrate that cohort-level RL effectively enhances reasoning consistency in LLMs.
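
The composite objective named in the abstract (cohort accuracy, retrieval bonus, rejection penalty) can be pictured with the toy reward below. This is only a sketch under stated assumptions: the linear combination, the weights, and all names are hypothetical, since the abstract does not give the actual formula.

```python
def cohort_reward(correct, valid_lookups, rejected_lookups,
                  retrieval_weight=0.1, rejection_weight=0.1):
    """Toy composite reward for one cohort of similar questions.

    correct          -- list of booleans, one per cohort member
    valid_lookups    -- number of lookups judged useful (hypothetical signal)
    rejected_lookups -- number of trivial or invalid lookups (hypothetical)

    Assumption: the three terms are combined linearly with fixed weights.
    """
    cohort_accuracy = sum(correct) / len(correct)
    return (cohort_accuracy
            + retrieval_weight * valid_lookups
            - rejection_weight * rejected_lookups)
```

Because accuracy is averaged over the whole cohort rather than scored per question, a policy that answers only some variants of a shared abstraction correctly earns less than one that handles all of them uniformly.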

[48] Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers

Tommaso Green,Martin Gubri,Haritz Puerto,Sangdoo Yun,Seong Joon Oh

Main category: cs.CL

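TL;DR: The study finds privacy leakage in the reasoning traces of large reasoning models, especially under test-time compute approaches such as increased reasoning steps. A larger budget makes final answers more cautious but also leads to more verbose reasoning and more leakage. Improved reasoning therefore raises utility while enlarging the privacy attack surface. Safety efforts should cover the model's internal thinking, not only its outputs.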

Details

Motivation: To challenge the assumption that reasoning traces are safe by showing that they can contain sensitive user data, which can be extracted via prompt injections or leak accidentally into outputs. Method: Probing and agentic evaluations, with particular attention to how test-time compute approaches (such as increased reasoning steps) affect privacy leakage. Result: Test-time compute approaches, particularly increased reasoning steps, amplify privacy leakage; a larger budget makes models more cautious in their final answers but also leads to more verbose reasoning and more leakage. Conclusion: Safety efforts must extend to the model's internal thinking, not just its outputs. Abstract: We study privacy leakage in the reasoning traces of large reasoning models used as personal agents. Unlike final outputs, reasoning traces are often assumed to be internal and safe. We challenge this assumption by showing that reasoning traces frequently contain sensitive user data, which can be extracted via prompt injections or accidentally leak into outputs. Through probing and agentic evaluations, we demonstrate that test-time compute approaches, particularly increased reasoning steps, amplify such leakage. While increasing the budget of those test-time compute approaches makes models more cautious in their final answers, it also leads them to reason more verbosely and leak more in their own thinking. This reveals a core tension: reasoning improves utility but enlarges the privacy attack surface. We argue that safety efforts must extend to the model's internal thinking, not just its outputs.

[49] Gender-Neutral Machine Translation Strategies in Practice

Hillary Dawkins,Isar Nejadgholi,Chi-kiu Lo

Main category: cs.CL

TL;DR: The study evaluates 21 MT systems' handling of gender neutrality across three translation directions, finding mostly disappointing results with a few exceptions.

Details

Motivation: To address the challenge of preserving gender ambiguity in translations from notional gender languages to grammatical gender languages, avoiding misgendering and representational harms. Method: Assess the sensitivity of 21 MT systems to gender neutrality in response to gender ambiguity across three translation directions, categorizing and discussing observed gender-neutral strategies while examining the effect of binary gender stereotypes. Result: Most MT systems fail to provide gender-neutral translations when faced with gender ambiguity. A small number of systems employ specific strategies for gender-neutral translation based on the target language. Conclusion: There is a significant lack of gender-neutral translations in current MT systems, highlighting the need for improvement in this area. Abstract: Gender-inclusive machine translation (MT) should preserve gender ambiguity in the source to avoid misgendering and representational harms. While gender ambiguity often occurs naturally in notional gender languages such as English, maintaining that gender neutrality in grammatical gender languages is a challenge. Here we assess the sensitivity of 21 MT systems to the need for gender neutrality in response to gender ambiguity in three translation directions of varying difficulty. The specific gender-neutral strategies that are observed in practice are categorized and discussed. Additionally, we examine the effect of binary gender stereotypes on the use of gender-neutral translation. In general, we report a disappointing absence of gender-neutral translations in response to gender ambiguity. However, we observe a small handful of MT systems that switch to gender neutral translation using specific strategies, depending on the target language.

[50] GenRecal: Generation after Recalibration from Large to Small Vision-Language Models

Byung-Kwan Lee,Ryo Hachiuma,Yong Man Ro,Yu-Chiang Frank Wang,Yueh-Hua Wu

Main category: cs.CL

TL;DR: Recent VLMs have achieved performance on par with closed-source systems like GPT-4V, but deploying them in real-world scenarios is still challenging due to their substantial computational demands. To address this, the paper presents GenRecal, a novel distillation framework for VLMs. It incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs.

Details

Motivation: Despite the strong performance of recent VLMs, deploying them in real-world scenarios remains challenging due to their substantial computational demands. Method: The proposed method, Generation after Recalibration (GenRecal), is a novel, general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs. Result: Through extensive experiments on multiple challenging benchmarks, GenRecal significantly improves baseline performances, eventually outperforming large-scale open- and closed-source VLMs. Conclusion: GenRecal is an effective solution for the deployment of VLMs in real-world scenarios. Abstract: Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge arises here from the diversity of VLM architectures, which are built on different LLMs and employ varying token types, differing in vocabulary size, token splits, and token index ordering. To address the limitation of distillation to a specific VLM type, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performances, eventually outperforming large-scale open- and closed-source VLMs.

[51] PhantomHunter: Detecting Unseen Privately-Tuned LLM-Generated Text via Family-Aware Learning

Yuhui Shi,Yehan Yang,Qiang Sheng,Hao Mi,Beizhe Hu,Chaoxi Xu,Juan Cao

Main category: cs.CL

TL;DR: This paper proposes PhantomHunter, an LLM-generated text detector specialized for detecting text from unseen, privately-tuned LLMs, which shows superior performance with F1 scores of over 96%.

Details

Motivation: With the popularity of large language models (LLMs), undesirable societal problems like misinformation production and academic misconduct have been more severe, making LLM-generated text detection now of unprecedented importance. Method: PhantomHunter is proposed, an LLM-generated text detector specialized for detecting text from unseen, privately-tuned LLMs. Its family-aware learning framework captures family-level traits shared across the base models and their derivatives, instead of memorizing individual characteristics. Result: Experiments on data from LLaMA, Gemma, and Mistral families show its superiority over 7 baselines and 3 industrial services, with F1 scores of over 96%. Conclusion: PhantomHunter effectively addresses the new challenge posed by text from privately tuned LLMs, significantly improving the detection performance. Abstract: With the popularity of large language models (LLMs), undesirable societal problems like misinformation production and academic misconduct have been more severe, making LLM-generated text detection now of unprecedented importance. Although existing methods have made remarkable progress, a new challenge posed by text from privately tuned LLMs remains underexplored. Users could easily possess private LLMs by fine-tuning an open-source one with private corpora, resulting in a significant performance drop of existing detectors in practice. To address this issue, we propose PhantomHunter, an LLM-generated text detector specialized for detecting text from unseen, privately-tuned LLMs. Its family-aware learning framework captures family-level traits shared across the base models and their derivatives, instead of memorizing individual characteristics. Experiments on data from LLaMA, Gemma, and Mistral families show its superiority over 7 baselines and 3 industrial services, with F1 scores of over 96%.

cs.CV [Back]

[52] SemIRNet: A Semantic Irony Recognition Network for Multimodal Sarcasm Detection

Jingxuan Zhou,Yuehao Wu,Yibo Zhang,Yeyubei Zhang,Yunchong Liu,Bolin Huang,Chunhong Yuan

Main category: cs.CV

TL;DR: This paper proposes a Semantic Irony Recognition Network (SemIRNet) for multimodal irony detection tasks, improving accuracy and F1 value with innovations in conceptual knowledge acquisition, cross-modal semantic similarity detection, and contrastive learning optimization.

Details

Motivation: To address the difficulty of accurately identifying graphical implicit correlations in multimodal irony detection tasks. Method: Propose SemIRNet model with three main innovations: introducing ConceptNet knowledge base for common-sense reasoning, designing two cross-modal semantic similarity detection modules at word and sample levels, and introducing contrastive learning loss function to optimize sample feature distribution. Result: Experiments show an improvement of 1.64% in accuracy and 2.88% in F1 value compared to existing optimal methods, reaching 88.87% and 86.33% respectively. Ablation experiments verify the importance of knowledge fusion and semantic similarity detection. Conclusion: The proposed SemIRNet model effectively enhances performance in multimodal irony detection tasks through its innovative methods. Abstract: Aiming at the problem of difficulty in accurately identifying graphical implicit correlations in multimodal irony detection tasks, this paper proposes a Semantic Irony Recognition Network (SemIRNet). The model contains three main innovations: (1) The ConceptNet knowledge base is introduced for the first time to acquire conceptual knowledge, which enhances the model's common-sense reasoning ability; (2) Two cross-modal semantic similarity detection modules at the word level and sample level are designed to model graphic-textual correlations at different granularities; and (3) A contrastive learning loss function is introduced to optimize the spatial distribution of the sample features, which improves the separability of positive and negative samples. Experiments on a publicly available multimodal irony detection benchmark dataset show that the accuracy and F1 value of this model are improved by 1.64% and 2.88% to 88.87% and 86.33%, respectively, compared with the existing optimal methods. Further ablation experiments verify the important role of knowledge fusion and semantic similarity detection in improving the model performance.

[53] Argus Inspection: Do Multimodal Large Language Models Possess the Eye of Panoptes?

Yang Yao,Lingyu Li,Jiaxin Song,Chiyu Chen,Zhenqi He,Yixu Wang,Xin Wang,Tianle Gu,Jie Li,Yan Teng,Yingchun Wang

Main category: cs.CV

TL;DR: The paper introduces Argus Inspection, a multimodal benchmark for evaluating MLLMs' detailed visual recognition and commonsense causal reasoning abilities. It also presents the Eye of Panoptes framework for assessing opinion-based reasoning tasks. Experiments on 26 MLLMs show significant room for improvement in visual fine-grained reasoning.

Details

Motivation: To address challenges in visual fine-grained perception and commonsense causal inference in MLLMs. Method: Introduced Argus Inspection benchmark with two difficulty levels and the Eye of Panoptes framework integrating a binary parametric Sigmoid metric with an indicator function. Result: Experiments on 26 MLLMs showed the highest performance in visual fine-grained reasoning was 0.46. Conclusion: The research provides valuable insights for further refinement of MLLMs. Abstract: As Multimodal Large Language Models (MLLMs) continue to evolve, their cognitive and reasoning capabilities have seen remarkable progress. However, challenges in visual fine-grained perception and commonsense causal inference persist. This paper introduces Argus Inspection, a multimodal benchmark with two levels of difficulty, emphasizing detailed visual recognition while incorporating real-world commonsense understanding to evaluate causal reasoning abilities. Expanding on it, we present the Eye of Panoptes framework, which integrates a binary parametric Sigmoid metric with an indicator function, enabling a more holistic evaluation of MLLMs' responses in opinion-based reasoning tasks. Experiments conducted on 26 mainstream MLLMs reveal that the highest performance in visual fine-grained reasoning reaches only 0.46, highlighting considerable potential for enhancement. Our research offers valuable perspectives for the continued refinement of MLLMs.

[54] A Hybrid ConvNeXt-EfficientNet AI Solution for Precise Falcon Disease Detection

Alavikunhu Panthakkan,Zubair Medammal,S M Anzar,Fatma Taher,Hussain Al-Ahmad

Main category: cs.CV

TL;DR: This paper presents a hybrid ConvNeXt and EfficientNet AI model for classifying falcon diseases, achieving better performance than traditional methods.

Details

Motivation: Falconry requires careful health monitoring of falcons. To improve the accuracy of disease detection in falcons, an advanced AI solution is needed. Method: The researchers developed a concatenated AI model combining ConvNeXt and EfficientNet to classify three falcon conditions: Normal, Liver Disease, and Aspergillosis. The model was trained and validated using a large dataset focusing on metrics like accuracy, precision, recall, and F1-score. Result: The hybrid AI model showed superior performance compared to traditional diagnostic methods and individual model architectures in detecting falcon diseases. Conclusion: The successful implementation of this hybrid AI model represents a major advancement in precise falcon disease classification and opens new possibilities for AI-driven avian healthcare. Abstract: Falconry, a revered tradition involving the training and hunting with falcons, requires meticulous health surveillance to ensure the health and safety of these prized birds, particularly in hunting scenarios. This paper presents an innovative method employing a hybrid of ConvNeXt and EfficientNet AI models for the classification of falcon diseases. The study focuses on accurately identifying three conditions: Normal, Liver Disease and 'Aspergillosis'. A substantial dataset was utilized for training and validating the model, with an emphasis on key performance metrics such as accuracy, precision, recall, and F1-score. Extensive testing and analysis have shown that our concatenated AI model outperforms traditional diagnostic methods and individual model architectures. The successful implementation of this hybrid AI model marks a significant step forward in precise falcon disease detection and paves the way for future developments in AI-powered avian healthcare solutions.

[55] ViLLa: A Neuro-Symbolic approach for Animal Monitoring

Harsha Koduri

Main category: cs.CV

TL;DR: A neuro-symbolic framework named ViLLa is introduced, which integrates visual detection, language parsing, and symbolic reasoning to interpret animal monitoring data and answer human queries about visual content in a transparent way.

Details

Motivation: Current systems for monitoring animal populations often lack the ability to effectively interpret both visual data and human language queries. There is a need for a system that can bridge these two types of information while providing interpretable answers. Method: ViLLa combines three key components: a visual detection module that identifies animals and their locations in images, a language parser that understands natural language questions, and a symbolic reasoning layer that uses logic-based inference to provide answers related to count, presence, and location of animals in the image. Result: ViLLa was evaluated on various animal imagery tasks and showed the capability to connect visual content with structured, human-interpretable queries effectively. Conclusion: ViLLa provides a modular and transparent approach to interpreting animal monitoring data by separating perception, understanding, and reasoning processes. Abstract: Monitoring animal populations in natural environments requires systems that can interpret both visual data and human language queries. This work introduces ViLLa (Vision-Language-Logic Approach), a neuro-symbolic framework designed for interpretable animal monitoring. ViLLa integrates three core components: a visual detection module for identifying animals and their spatial locations in images, a language parser for understanding natural language queries, and a symbolic reasoning layer that applies logic-based inference to answer those queries. Given an image and a question such as "How many dogs are in the scene?" or "Where is the buffalo?", the system grounds visual detections into symbolic facts and uses predefined rules to compute accurate answers related to count, presence, and location. Unlike end-to-end black-box models, ViLLa separates perception, understanding, and reasoning, offering modularity and transparency. 
The system was evaluated on a range of animal imagery tasks and demonstrates the ability to bridge visual content with structured, human-interpretable queries.
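
The grounding step described above, turning detections into symbolic facts and answering count, presence, and location queries with fixed rules, might look like the following sketch. The `Detection` fact format and the three rule functions are assumptions made for illustration, not ViLLa's actual interface; the detector and language parser are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One symbolic fact grounded from a (hypothetical) visual detector."""
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


def count(facts, label):
    """Rule for 'How many <label> are in the scene?'"""
    return sum(1 for f in facts if f.label == label)


def present(facts, label):
    """Rule for 'Is there a <label>?'"""
    return count(facts, label) > 0


def locate(facts, label):
    """Rule for 'Where is the <label>?': box centers of matching facts."""
    return [((f.box[0] + f.box[2]) / 2, (f.box[1] + f.box[3]) / 2)
            for f in facts if f.label == label]


# Example facts, as a detector might produce them for one image.
facts = [
    Detection("dog", (10, 20, 50, 80)),
    Detection("dog", (100, 40, 160, 120)),
    Detection("buffalo", (200, 50, 400, 250)),
]
```

Keeping the rules as plain functions over explicit facts is what makes each answer traceable: every count or location can be followed back to the detections that produced it.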

[56] GraphGSOcc: Semantic and Geometric Graph Transformer for 3D Gaussian Splating-based Occupancy Prediction

Ke Song,Yunhe Wu,Chunchit Siu,Huiyuan Xiong

Main category: cs.CV

TL;DR: The paper introduces GraphGSOcc, a new model for 3D semantic occupancy prediction that combines semantic and geometric graph Transformers to address limitations in existing 3D Gaussian Splating methods.

Details

Motivation: Existing 3D Gaussian Splatting methods have two main issues: unified feature aggregation neglects semantic correlations among similar categories and across regions, and boundary ambiguities are caused by the lack of geometric constraints in MLP iterative optimization. Method: The authors propose the GraphGSOcc model, which uses Dual Gaussians Graph Attention to dynamically construct dual graph structures: a geometric graph that adapts KNN search radii based on Gaussian poses, and a semantic graph that retains the top-M highly correlated nodes via cosine similarity. This is coupled with a Multi-scale Graph Attention framework that applies fine-grained attention at lower layers and coarse-grained attention at higher layers. Result: Experiments on the SurroundOcc dataset show an mIoU of 24.10%, with GPU memory usage reduced to 6.1 GB. This represents a 1.97% mIoU improvement and a 13.7% memory reduction compared to GaussianWorld. Conclusion: GraphGSOcc effectively addresses the limitations of current 3D Gaussian Splatting methods by integrating semantic and geometric information, leading to improved performance and reduced memory consumption. Abstract: Addressing the task of 3D semantic occupancy prediction for autonomous driving, we tackle two key issues in existing 3D Gaussian Splatting (3DGS) methods: (1) unified feature aggregation neglecting semantic correlations among similar categories and across regions, and (2) boundary ambiguities caused by the lack of geometric constraints in MLP iterative optimization. We propose the GraphGSOcc model, a novel framework that combines semantic and geometric graph Transformer for 3D Gaussian Splatting-based Occupancy Prediction.
We propose the Dual Gaussians Graph Attention, which dynamically constructs dual graph structures: a geometric graph adaptively calculating KNN search radii based on Gaussian poses, enabling large-scale Gaussians to aggregate features from broader neighborhoods while compact Gaussians focus on local geometric consistency; and a semantic graph retaining top-M highly correlated nodes via cosine similarity to explicitly encode semantic relationships within and across instances. Coupled with the Multi-scale Graph Attention framework, fine-grained attention at lower layers optimizes boundary details, while coarse-grained attention at higher layers models object-level topology. Experiments on the SurroundOcc dataset achieve an mIoU of 24.10%, reducing GPU memory to 6.1 GB, demonstrating a 1.97% mIoU improvement and 13.7% memory reduction compared to GaussianWorld.
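The semantic graph described in this entry keeps, for each node, the top-M most correlated nodes by cosine similarity. The following is a minimal pure-Python sketch of that selection step only. The function names, the toy 2D feature vectors, and the tie-breaking rule (lower index first) are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as uncorrelated
    return dot / (norm_a * norm_b)


def top_m_semantic_neighbors(features, m):
    """For each node, return the indices of its m most similar other nodes."""
    neighbors = []
    for i, fi in enumerate(features):
        scored = [
            (cosine_similarity(fi, fj), j)
            for j, fj in enumerate(features)
            if j != i
        ]
        # Highest similarity first; ties broken by lower index (assumption).
        scored.sort(key=lambda s: (-s[0], s[1]))
        neighbors.append([j for _, j in scored[:m]])
    return neighbors


# Toy features: nodes 0 and 1 point one way, nodes 2 and 3 another.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
graph = top_m_semantic_neighbors(features, m=1)
```

With these toy features each node links to the other node pointing in roughly the same direction. A practical version would operate on batched tensors rather than Python lists.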

[57] DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning

Yifeng Gao,Yifan Ding,Hongyu Su,Juncheng Li,Yunhan Zhao,Lin Luo,Zixing Chen,Li Wang,Xin Wang,Yixu Wang,Xingjun Ma,Yu-Gang Jiang

Main category: cs.CV

TL;DR: As AI-generated video becomes pervasive, distinguishing synthetic content from authentic footage has become urgent. Existing approaches offer limited insight into where or why a model identifies a video as AI-generated. To address this gap, we introduce DAVID-X, the first dataset to pair AI-generated videos with detailed defect-level, temporal-spatial annotations and written rationales. Leveraging these rich annotations, we present DAVID-XR1, a video-language model designed to deliver an interpretable chain of visual reasoning, including defect categorization, temporal-spatial localization, and natural language explanations.

Details

Motivation: The motivation for this paper is the urgency and necessity to reliably distinguish AI-generated video content from authentic footage across media platforms. The authors aim to go beyond binary classification by providing fine-grained, persuasive evidence that can convince auditors and end-users alike. Method: The method involves introducing DAVID-X, a dataset pairing AI-generated videos with detailed defect-level, temporal-spatial annotations and written rationales. They also present DAVID-XR1, a video-language model that provides an interpretable chain of visual reasoning including defect categorization, temporal-spatial localization, and natural language explanations. Result: A general-purpose backbone, fine-tuned on their compact dataset and enhanced with chain-of-thought distillation, achieves strong generalization across a variety of generators and generation modes. Conclusion: The results highlight the promise of explainable detection methods for trustworthy identification of AI-generated video content. Abstract: As AI-generated video becomes increasingly pervasive across media platforms, the ability to reliably distinguish synthetic content from authentic footage has become both urgent and essential. Existing approaches have primarily treated this challenge as a binary classification task, offering limited insight into where or why a model identifies a video as AI-generated. However, the core challenge extends beyond simply detecting subtle artifacts; it requires providing fine-grained, persuasive evidence that can convince auditors and end-users alike. To address this critical gap, we introduce DAVID-X, the first dataset to pair AI-generated videos with detailed defect-level, temporal-spatial annotations and written rationales. 
Leveraging these rich annotations, we present DAVID-XR1, a video-language model designed to deliver an interpretable chain of visual reasoning, including defect categorization, temporal-spatial localization, and natural language explanations. This approach fundamentally transforms AI-generated video detection from an opaque black-box decision into a transparent and verifiable diagnostic process. We demonstrate that a general-purpose backbone, fine-tuned on our compact dataset and enhanced with chain-of-thought distillation, achieves strong generalization across a variety of generators and generation modes. Our results highlight the promise of explainable detection methods for trustworthy identification of AI-generated video content.

[58] Recent Advances in Multi-Agent Human Trajectory Prediction: A Comprehensive Review

Céline Finet,Stephane Da Silva Martins,Jean-Bernard Hayet,Ioannis Karamouzas,Javad Amirian,Sylvie Le Hégarat-Mascle,Julien Pettré,Emanuel Aldea

Main category: cs.CV

TL;DR: This paper reviews deep learning-based multi-agent trajectory prediction methods from 2020 to 2024, categorizes them by architectural design, input representations and prediction strategies, with a focus on models evaluated using the ETH/UCY benchmark.

Details

Motivation: To gain a finer understanding of multi-agent interactions through reviewing recent advancements in deep learning-based human trajectory prediction. Method: Categorizing existing methods based on their architectural design, input representations and overall prediction strategies, particularly emphasizing models evaluated using the ETH/UCY benchmark. Result: Highlighted key challenges and future research directions in the field of multi-agent human trajectory prediction. Conclusion: Recent advancements in deep learning have brought important implications for autonomous navigation and crowd modeling. Abstract: With the emergence of powerful data-driven methods in human trajectory prediction (HTP), gaining a finer understanding of multi-agent interactions lies within hand's reach, with important implications in areas such as autonomous navigation and crowd modeling. This survey reviews some of the most recent advancements in deep learning-based multi-agent trajectory prediction, focusing on studies published between 2020 and 2024. We categorize the existing methods based on their architectural design, their input representations, and their overall prediction strategies, placing a particular emphasis on models evaluated using the ETH/UCY benchmark. Furthermore, we highlight key challenges and future research directions in the field of multi-agent HTP.

[59] ArchShapeNet: An Interpretable 3D-CNN Framework for Evaluating Architectural Shapes

Jun Yin,Jing Zhong,Pengyu Zeng,Peilin Li,Zixuan Dai,Miao Zhang,Shuai Lu

Main category: cs.CV

TL;DR: In contemporary architectural design, the growing complexity and diversity of design demands have made generative plugin tools essential for quickly producing initial concepts and exploring novel 3D forms. To address the challenge of objectively analyzing the differences between human-designed and machine-generated 3D forms, we built the ArchForms-4000 dataset and proposed the ArchShapeNet model. This study highlights the advantages of human-designed forms and provides insights for enhancing generative design tools in the future.

Details

Motivation: The motivation is to address the challenge of objectively analyzing the differences between human-designed and machine-generated 3D forms in contemporary architectural design. Method: The authors built ArchForms-4000, a dataset containing 2,000 architect-designed and 2,000 Evomass-generated 3D forms, and proposed ArchShapeNet, a 3D convolutional neural network tailored for classifying and analyzing architectural forms. Result: Comparative experiments show that the model outperforms human experts in distinguishing form origins, achieving 94.29% accuracy, 96.2% precision, and 98.51% recall. Conclusion: This study not only highlights the distinctive advantages of human-designed forms in spatial organization, proportional harmony, and detail refinement but also provides valuable insights for enhancing generative design tools in the future. Abstract: In contemporary architectural design, the growing complexity and diversity of design demands have made generative plugin tools essential for quickly producing initial concepts and exploring novel 3D forms. However, objectively analyzing the differences between human-designed and machine-generated 3D forms remains a challenge, limiting our understanding of their respective strengths and hindering the advancement of generative tools. To address this, we built ArchForms-4000, a dataset containing 2,000 architect-designed and 2,000 Evomass-generated 3D forms; proposed ArchShapeNet, a 3D convolutional neural network tailored for classifying and analyzing architectural forms, incorporating a saliency module to highlight key spatial features aligned with architectural reasoning; and conducted comparative experiments showing our model outperforms human experts in distinguishing form origins, achieving 94.29% accuracy, 96.2% precision, and 98.51% recall.
This study not only highlights the distinctive advantages of human-designed forms in spatial organization, proportional harmony, and detail refinement but also provides valuable insights for enhancing generative design tools in the future.

[60] Real-Time, Low-Latency Surveillance Using Entropy-Based Adaptive Buffering and MobileNetV2 on Edge Devices

Poojashree Chandrashekar,Pankaj M Sajjanar

Main category: cs.CV

TL;DR: This paper presents a high-performance, low-latency video surveillance system designed for resource-constrained environments. It integrates an entropy-based adaptive frame buffering algorithm with MobileNetV2 to achieve high throughput and low latency.

Details

Motivation: To design a video surveillance system that is capable of processing live streams with low latency and high accuracy on resource-constrained devices such as Raspberry Pi, Amazon, and NVIDIA Jetson Nano. Method: Propose an entropy-based adaptive frame buffering algorithm and integrate it with MobileNetV2 to process live video streams with sub-50ms end-to-end inference latency while maintaining over 92% detection accuracy. Result: The system achieves sub-50ms end-to-end inference latency on resource-constrained devices with over 92% detection accuracy on standard datasets focused on video surveillance. It also exhibits robustness to varying lighting, backgrounds, and speeds. Conclusion: The architecture is scalable, inexpensive, and compliant with stricter data privacy regulations than common surveillance systems, making it suitable for smart city or embedded security architectures. Abstract: This paper describes a high-performance, low-latency video surveillance system designed for resource-constrained environments. We propose a formal entropy-based adaptive frame buffering algorithm and integrate it with MobileNetV2 to achieve high throughput with low latency. The system is capable of processing live video streams with sub-50ms end-to-end inference latency on resource-constrained devices (embedded platforms) such as Raspberry Pi, Amazon, and NVIDIA Jetson Nano. Our method maintains over 92% detection accuracy on standard datasets focused on video surveillance and exhibits robustness to varying lighting, backgrounds, and speeds. A number of comparative and ablation experiments validate the effectiveness of our design. Finally, our architecture is scalable, inexpensive, and compliant with stricter data privacy regulations than common surveillance systems, so that the system could coexist in a smart city or embedded security architecture.
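The abstract does not spell out the buffering rule, so the sketch below is only one plausible reading of "entropy-based" frame gating. It computes the Shannon entropy of each frame's intensity values and forwards a frame to the detector only when its entropy differs from the last forwarded frame by more than a threshold. The function names, the flat-list frame representation, and the 0.5-bit threshold are all hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy in bits of a flat, non-empty sequence of intensities."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_frames(frames, threshold=0.5):
    """Return indices of frames to forward to the detector.

    Hypothetical rule: keep a frame when its entropy differs from the last
    kept frame's entropy by more than `threshold` bits.
    """
    kept = []
    last_entropy = None
    for index, frame in enumerate(frames):
        entropy = shannon_entropy(frame)
        if last_entropy is None or abs(entropy - last_entropy) > threshold:
            kept.append(index)
            last_entropy = entropy
    return kept


flat = [0] * 16              # uniform frame: entropy 0 bits
half = [0] * 8 + [255] * 8   # two equally likely values: entropy 1 bit
frames = [flat, flat, half, half, flat]
kept = select_frames(frames)
```

Here only the first frame and the two frames where the scene content changes are forwarded. A real system would compute entropy on image histograms and would likely adapt the threshold or buffer length over time.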

[61] MonoVQD: Monocular 3D Object Detection with Variational Query Denoising and Self-Distillation

Kiet Dang Vu,Trung Thai Tran,Duc Dung Nguyen

Main category: cs.CV

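TL;DR: MonoVQD is an improved DETR architecture for monocular 3D detection. By introducing a Mask Separated Self-Attention mechanism, Variational Query Denoising, and a self-distillation strategy, it addresses the optimization instability, vanishing gradients, and limited iterative refinement of conventional methods, and it shows superior performance and broad applicability on the KITTI and nuScenes datasets.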

Details

Motivation: Although DETR-like architectures show promise for monocular 3D detection, applying them directly suffers from unstable optimization, vanishing gradients, and insufficient iterative refinement, which limits performance. Method: The proposed MonoVQD framework comprises: 1) a Mask Separated Self-Attention mechanism that stabilizes Hungarian matching; 2) Variational Query Denoising, which addresses the vanishing-gradient problem; and 3) a self-distillation strategy that uses information across decoder layers to improve query quality. Result: MonoVQD performs strongly on the KITTI monocular benchmark, and its core components integrate seamlessly into other architectures, also significantly improving multi-view 3D detection on nuScenes, which demonstrates broad applicability and strong generalization. Conclusion: Through its novel design, MonoVQD improves DETR-based monocular 3D detection, resolving key issues and improving performance, with broad applicability and good generalization. Abstract: Precisely localizing 3D objects from a single image constitutes a central challenge in monocular 3D detection. While DETR-like architectures offer a powerful paradigm, their direct application in this domain encounters inherent limitations, preventing optimal performance. Our work addresses these challenges by introducing MonoVQD, a novel framework designed to fundamentally advance DETR-based monocular 3D detection. We propose three main contributions. First, we propose the Mask Separated Self-Attention mechanism that enables the integration of the denoising process into a DETR architecture. This improves the stability of Hungarian matching to achieve a consistent optimization objective. Second, we present the Variational Query Denoising technique to address the gradient vanishing problem of conventional denoising methods, which severely restricts the efficiency of the denoising process. This explicitly introduces stochastic properties to mitigate this fundamental limitation and unlock substantial performance gains. Finally, we introduce a sophisticated self-distillation strategy, leveraging insights from later decoder layers to synergistically improve query quality in earlier layers, thereby amplifying the iterative refinement process. Rigorous experimentation demonstrates that MonoVQD achieves superior performance on the challenging KITTI monocular benchmark.
Highlighting its broad applicability, MonoVQD's core components seamlessly integrate into other architectures, delivering significant performance gains even in multi-view 3D detection scenarios on the nuScenes dataset and underscoring its robust generalization capabilities.

Chengzhi Xu,Yuyang Wang,Lai Wei,Lichao Sun,Weiran Huang

Main category: cs.CV

TL;DR: The paper proposes ChartIR, an iterative refinement method based on structured instructions to improve MLLMs' performance in chart-to-code generation. It distinguishes visual understanding and code translation tasks, designs specific instructions for each, and decomposes the pipeline into initial code generation and iterative refinement stages.

Details

Motivation: Multimodal large language models (MLLMs) have achieved impressive results on various vision tasks, but their performance on chart-to-code generation remains suboptimal. This motivates the need for a more effective approach to this complex task. Method: The proposed method, ChartIR, distinguishes two tasks: visual understanding and code translation. For visual understanding, two types of structured instructions are designed: description instructions and difference instructions. The overall chart generation pipeline is then decomposed into two stages: initial code generation and iterative refinement. Result: Experimental results demonstrate that the proposed method outperforms other methods on both the open-source model Qwen2-VL and the closed-source model GPT-4o. Conclusion: ChartIR, with its structured instructions and iterative refinement approach, effectively improves the performance of MLLMs in chart-to-code generation. Abstract: Recently, multimodal large language models (MLLMs) have attracted increasing research attention due to their powerful visual understanding capabilities. While they have achieved impressive results on various vision tasks, their performance on chart-to-code generation remains suboptimal. This task requires MLLMs to generate executable code that can reproduce a given chart, demanding not only precise visual understanding but also accurate translation of visual elements into structured code. Directly prompting MLLMs to perform this complex task often yields unsatisfactory results. To address this challenge, we propose ChartIR, an iterative refinement method based on structured instruction. First, we distinguish two tasks: visual understanding and code translation. To accomplish the visual understanding component, we design two types of structured instructions: description and difference.
The description instruction captures the visual elements of the reference chart, while the difference instruction characterizes the discrepancies between the reference chart and the generated chart. These instructions effectively transform visual features into language representations, thereby facilitating the subsequent code translation process. Second, we decompose the overall chart generation pipeline into two stages: initial code generation and iterative refinement, enabling progressive enhancement of the final output. Experimental results show that, compared to other methods, our method achieves superior performance on both the open-source model Qwen2-VL and the closed-source model GPT-4o.

[63] PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers

Lukas Schiesser,Cornelius Wolff,Sophie Haas,Simon Pukrop

Main category: cs.CV

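TL;DR: Building image classification models in data-scarce domains remains difficult because collecting large labeled datasets is impractical. This paper presents PictSure, a framework that focuses on how the embedding model (its architecture, pretraining, and training dynamics) affects few-shot image classification. The study shows that the way the embedding model is pretrained significantly affects training success and out-of-domain performance. PictSure outperforms existing ICL-based few-shot image classification models on out-of-domain benchmarks while maintaining comparable results on in-domain tasks.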

Details

Motivation: Building image classification models in data-scarce domains is challenging, and existing work has overlooked the role of image embeddings in in-context learning (ICL)-based few-shot image classification. Method: The authors propose the PictSure framework and systematically study how different visual encoder types, pretraining objectives, and fine-tuning strategies affect downstream few-shot image classification performance. Result: Experiments show that the way the embedding model is pretrained has a major effect on training success and out-of-domain performance, and that PictSure performs better on out-of-domain benchmarks. Conclusion: By optimizing embedding-model pretraining, PictSure performs strongly on out-of-domain few-shot image classification while remaining competitive on in-domain tasks. Abstract: Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) has emerged as a promising paradigm for few-shot image classification (FSIC), enabling models to generalize across domains without gradient-based adaptation. However, prior work has largely overlooked a critical component of ICL-based FSIC pipelines: the role of image embeddings. In this work, we present PictSure, an ICL framework that places the embedding model -- its architecture, pretraining, and training dynamics -- at the center of analysis. We systematically examine the effects of different visual encoder types, pretraining objectives, and fine-tuning strategies on downstream FSIC performance. Our experiments show that the training success and the out-of-domain performance are highly dependent on how the embedding models are pretrained. Consequently, PictSure manages to outperform existing ICL-based FSIC models on out-of-domain benchmarks that differ significantly from the training distribution, while maintaining comparable results on in-domain tasks. Code can be found at https://github.com/PictSure/pictsure-library.

[64] Finding Optimal Kernel Size and Dimension in Convolutional Neural Networks: An Architecture Optimization Approach

Shreyas Rajeev,B Sathish Babu

Main category: cs.CV

TL;DR: Kernel size selection in CNNs significantly impacts model performance. This paper introduces BKSEF, a framework for optimal kernel size determination per layer, balancing information gain, efficiency, and accuracy. Experiments show improvements in accuracy and reductions in computational cost across various datasets and applications.

Details

Motivation: Current CNN designs often use uniform 3x3 kernels without considering the potential benefits of varying kernel sizes. There is a need for a systematic approach to select kernel sizes that balance computational cost, feature extraction, and model accuracy. Method: The paper proposes BKSEF, which integrates principles from information theory, signal processing, and learning theory to determine optimal kernel sizes layer-wise. This involves evaluating information gain, computational efficiency, and accuracy improvements. Result: Experiments on multiple datasets demonstrate up to 3.1% accuracy improvement and 42.8% reduction in FLOPs compared to traditional models. Real-world case studies in medical image classification and traffic sign recognition further validate the effectiveness of BKSEF. Conclusion: BKSEF provides a mathematically grounded and empirically validated approach to optimize kernel sizes in CNNs, offering practical heuristics for efficient and application-aware designs. Abstract: Kernel size selection in Convolutional Neural Networks (CNNs) is a critical but often overlooked design decision that affects receptive field, feature extraction, computational cost, and model accuracy. This paper proposes the Best Kernel Size Estimation Function (BKSEF), a mathematically grounded and empirically validated framework for optimal, layer-wise kernel size determination. BKSEF balances information gain, computational efficiency, and accuracy improvements by integrating principles from information theory, signal processing, and learning theory. Extensive experiments on CIFAR-10, CIFAR-100, ImageNet-lite, ChestX-ray14, and GTSRB datasets demonstrate that BKSEF-guided architectures achieve up to 3.1 percent accuracy improvement and 42.8 percent reduction in FLOPs compared to traditional models using uniform 3x3 kernels. 
Two real-world case studies further validate the approach: one for medical image classification in a cloud-based setup, and another for traffic sign recognition on edge devices. The former achieved enhanced interpretability and accuracy, while the latter reduced latency and model size significantly, with minimal accuracy trade-off. These results show that kernel size can be an active, optimizable parameter rather than a fixed heuristic. BKSEF provides practical heuristics and theoretical support for researchers and developers seeking efficient and application-aware CNN designs. It is suitable for integration into neural architecture search pipelines and real-time systems, offering a new perspective on CNN optimization.
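To make the cost side of the kernel-size trade-off concrete, the sketch below counts multiply-accumulate operations (MACs) for a single standard 2D convolution as a function of kernel size. This is the generic textbook count for a stride-1, same-padded layer, not the BKSEF formula, which the abstract does not give. The function name and the layer dimensions are illustrative assumptions.

```python
def conv2d_macs(height, width, in_channels, out_channels, kernel_size):
    """MAC count for a stride-1, 'same'-padded standard 2D convolution.

    Each of the height * width * out_channels outputs needs
    in_channels * kernel_size**2 multiply-accumulates (bias ignored).
    """
    return height * width * in_channels * out_channels * kernel_size ** 2


# Hypothetical layer: 32x32 feature map, 64 input and 64 output channels.
macs_3 = conv2d_macs(32, 32, 64, 64, kernel_size=3)
macs_5 = conv2d_macs(32, 32, 64, 64, kernel_size=5)
ratio = macs_5 / macs_3  # cost grows with the square of the kernel size
```

Moving from a 3x3 to a 5x5 kernel multiplies the layer's cost by 25/9, roughly 2.8x, which is why a per-layer choice of kernel size can shift total FLOPs substantially.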

[65] Efficient Retail Video Annotation: A Robust Key Frame Generation Approach for Product and Customer Interaction Analysis

Varun Mannam,Zhenyu Shi

Main category: cs.CV

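TL;DR: Accurate video annotation is essential in modern retail applications, including customer behavior analysis, product interaction detection, and in-store activity recognition. However, conventional methods rely on time-consuming manual labeling, which leads to non-robust key-frame selection and higher operational costs. This paper proposes a deep learning-based method that automatically identifies key frames in retail videos and provides automatic annotations of products and customers. Experiments show that the method outperforms conventional approaches, achieves accuracy comparable to human annotation, and improves the overall efficiency of retail video annotation. Notably, it yields an average 2x saving in video annotation cost.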

Details

Motivation: Conventional video annotation relies too heavily on manual labor and suffers from low efficiency, high cost, and instability, particularly in retail. An automated approach is needed to improve efficiency and reduce cost. Method: The authors propose a deep learning-based method that uses deep neural networks to learn discriminative features, embedding video frames and incorporating object detection techniques tailored to retail environments, in order to automatically identify and annotate key frames in retail videos. Result: Experimental results show that the method outperforms conventional approaches, achieves accuracy comparable to human annotation, substantially improves annotation efficiency, and yields an average 2x cost saving. Conclusion: The proposed deep learning method effectively automates key-frame detection and annotation for retail videos, significantly reducing time and effort, and is valuable for retail applications such as shopper journey analysis, product interaction detection, and in-store security monitoring. Abstract: Accurate video annotation plays a vital role in modern retail applications, including customer behavior analysis, product interaction detection, and in-store activity recognition. However, conventional annotation methods heavily rely on time-consuming manual labeling by human annotators, introducing non-robust frame selection and increasing operational costs. To address these challenges in the retail domain, we propose a deep learning-based approach that automates key-frame identification in retail videos and provides automatic annotations of products and customers. Our method leverages deep neural networks to learn discriminative features by embedding video frames and incorporating object detection-based techniques tailored for retail environments. Experimental results showcase the superiority of our approach over traditional methods, achieving accuracy comparable to human annotator labeling while enhancing the overall efficiency of retail video annotation. Remarkably, our approach leads to an average of 2 times cost savings in video annotation. By allowing human annotators to verify/adjust less than 5% of detected frames in the video dataset, while automating the annotation process for the remaining frames without reducing annotation quality, retailers can significantly reduce operational costs. The automation of key-frame detection enables substantial time and effort savings in retail video labeling tasks, proving highly valuable for diverse retail applications such as shopper journey analysis, product interaction detection, and in-store security monitoring.

[66] Peering into the Unknown: Active View Selection with Neural Uncertainty Maps for 3D Reconstruction

Zhengquan Zhang,Feng Xu,Mengmi Zhang

Main category: cs.CV

TL;DR: The paper presents a novel active view selection (AVS) approach for 3D reconstruction using neural uncertainty maps predicted by UPNet, achieving comparable accuracy with fewer viewpoints and significantly reduced computational resources.

Details

Motivation: To address the challenge of identifying the minimal set of views that yields the most accurate 3D reconstruction, improving efficiency and reducing computational overhead in AVS methods. Method: Introduced UPNet, a lightweight feedforward deep neural network that predicts uncertainty maps from single input images. The approach aggregates these maps to suppress redundant viewpoints and selects the most informative ones for training 3D neural rendering models. Result: Achieved comparable reconstruction accuracy using half the viewpoints of the upper bound, with up to 400 times speedup and over 50% reductions in CPU, RAM, and GPU usage compared to baseline methods. Also generalized effectively to novel object categories without additional training. Conclusion: The proposed AVS method using UPNet successfully reduces computational requirements while maintaining high-quality 3D reconstructions, demonstrating strong generalization capabilities. Abstract: Some perspectives naturally provide more information than others. How can an AI system determine which viewpoint offers the most valuable insight for accurate and efficient 3D object reconstruction? Active view selection (AVS) for 3D reconstruction remains a fundamental challenge in computer vision. The aim is to identify the minimal set of views that yields the most accurate 3D reconstruction. Instead of learning radiance fields, like NeRF or 3D Gaussian Splatting, from a current observation and computing uncertainty for each candidate viewpoint, we introduce a novel AVS approach guided by neural uncertainty maps predicted by a lightweight feedforward deep neural network, named UPNet. UPNet takes a single input image of a 3D object and outputs a predicted uncertainty map, representing uncertainty values across all possible candidate viewpoints. 
By leveraging heuristics derived from observing many natural objects and their associated uncertainty patterns, we train UPNet to learn a direct mapping from viewpoint appearance to uncertainty in the underlying volumetric representations. Next, our approach aggregates all previously predicted neural uncertainty maps to suppress redundant candidate viewpoints and effectively select the most informative one. Using these selected viewpoints, we train 3D neural rendering models and evaluate the quality of novel view synthesis against other competitive AVS methods. Remarkably, despite using half as many viewpoints as the upper bound, our method achieves comparable reconstruction accuracy. In addition, it significantly reduces computational overhead during AVS, achieving up to a 400 times speedup along with over 50% reductions in CPU, RAM, and GPU usage compared to baseline methods. Notably, our approach generalizes effectively to AVS tasks involving novel object categories, without requiring any additional training.

[67] DETONATE: A Benchmark for Text-to-Image Alignment and Kernelized Direct Preference Optimization

Renjith Prasad,Abhilekh Borah,Hasnat Md Abdullah,Chathurangi Shyalika,Gurpreet Singh,Ritvik Garimella,Rajarshi Roy,Harshul Surana,Nasrin Imanpour,Suranjana Trivedy,Amit Sheth,Amitava Das

Main category: cs.CV

TL;DR: This paper introduces DPO-Kernels for T2I models, a novel extension enhancing alignment through Hybrid Loss, Kernelized Representations, and Divergence Selection. It also presents DETONATE, a large-scale benchmark assessing social bias in T2I systems, and AQI, a geometric measure quantifying alignment quality.

Details

Motivation: To improve the alignment of text-to-image models ensuring generated images capture user intent while maintaining safety and fairness. Method: DPO-Kernels enhance alignment across three dimensions: Hybrid Loss integrating embedding-based objectives with probability-based loss; Kernelized Representations using RBF, Polynomial, and Wavelet kernels for feature transformations; and Divergence Selection incorporating Wasserstein and Rényi divergences. Additionally, DETONATE is introduced as a benchmark with 100K image pairs and AQI as a geometric measure. Result: Empirical results show that DPO-Kernels maintain strong generalization bounds via Heavy-Tailed Self-Regularization (HT-SR). DETONATE and complete code are publicly released. Conclusion: DPO-Kernels significantly improve alignment in T2I models, providing enhanced optimization, richer feature transformations, and improved stability. The introduction of DETONATE and AQI offers valuable tools for assessing and improving alignment in T2I systems. Abstract: Alignment is crucial for text-to-image (T2I) models to ensure that generated images faithfully capture user intent while maintaining safety and fairness. Direct Preference Optimization (DPO), prominent in large language models (LLMs), is extending its influence to T2I systems. This paper introduces DPO-Kernels for T2I models, a novel extension enhancing alignment across three dimensions: (i) Hybrid Loss, integrating embedding-based objectives with traditional probability-based loss for improved optimization; (ii) Kernelized Representations, employing Radial Basis Function (RBF), Polynomial, and Wavelet kernels for richer feature transformations and better separation between safe and unsafe inputs; and (iii) Divergence Selection, expanding beyond DPO's default Kullback-Leibler (KL) regularizer by incorporating Wasserstein and Rényi divergences for enhanced stability and robustness.
We introduce DETONATE, the first large-scale benchmark of its kind, comprising approximately 100K curated image pairs categorized as chosen and rejected. DETONATE encapsulates three axes of social bias and discrimination: Race, Gender, and Disability. Prompts are sourced from hate speech datasets, with images generated by leading T2I models including Stable Diffusion 3.5 Large, Stable Diffusion XL, and Midjourney. Additionally, we propose the Alignment Quality Index (AQI), a novel geometric measure quantifying latent-space separability of safe/unsafe image activations, revealing hidden vulnerabilities. Empirically, we demonstrate that DPO-Kernels maintain strong generalization bounds via Heavy-Tailed Self-Regularization (HT-SR). DETONATE and complete code are publicly released.

[68] PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning

Yizhen Zhang,Yang Ding,Shuoshuo Zhang,Xinchen Zhang,Haoling Li,Zhong-zhi Li,Peijie Wang,Jie Wu,Lei Ji,Yelong Shen,Yujiu Yang,Yeyun Gong

Main category: cs.CV

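TL;DR: Inspired by reinforcement learning approaches such as DeepSeek-R1, this paper proposes PeRL, a general reinforcement learning approach with a multi-stage strategy, to address multi-image positional reasoning in multimodal reasoning tasks. Using image-sequence permutation and a rollout filtering mechanism, PeRL performs strongly on multi-image benchmarks while maintaining comparable performance on single-image tasks.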

Details

Motivation: Existing multimodal reinforcement learning methods perform well on spatial reasoning within a single image but still struggle in complex scenarios that require multi-image positional reasoning, which motivates more effective reinforcement learning methods for this problem. Method: The authors propose PeRL, a reinforcement learning approach with a multi-stage strategy: 1) image sequences are permuted to simulate varied positional relationships, increasing spatial and positional diversity; 2) a rollout filtering mechanism focuses on the trajectories that contribute most to learning optimal behaviors, so that learned policies are exploited effectively. Result: Experiments on 5 multi-image benchmarks and 3 single-image benchmarks show that PeRL substantially surpasses R1-related and interleaved VLM baselines, achieving state-of-the-art performance on multi-image benchmarks while maintaining comparable performance on single-image tasks. Conclusion: PeRL is an effective reinforcement learning approach that significantly improves multi-image positional reasoning in multimodal tasks without harming single-image performance, and it offers a new direction for future multimodal reinforcement learning research. Abstract: Inspired by the impressive reasoning capabilities demonstrated by reinforcement learning approaches like DeepSeek-R1, recent emerging research has begun exploring the use of reinforcement learning (RL) to enhance vision-language models (VLMs) for multimodal reasoning tasks. However, most existing multimodal reinforcement learning approaches remain limited to spatial reasoning within single-image contexts, yet still struggle to generalize to more complex and real-world scenarios involving multi-image positional reasoning, where understanding the relationships across images is crucial. To address this challenge, we propose a general reinforcement learning approach PeRL tailored for interleaved multimodal tasks, and a multi-stage strategy designed to enhance the exploration-exploitation trade-off, thereby improving learning efficiency and task performance. Specifically, we introduce permutation of image sequences to simulate varied positional relationships to explore more spatial and positional diversity. Furthermore, we design a rollout filtering mechanism for resampling to focus on trajectories that contribute most to learning optimal behaviors to exploit learned policies effectively. We evaluate our model on 5 widely-used multi-image benchmarks and 3 single-image benchmarks. Our experiments confirm that the PeRL-trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin, achieving state-of-the-art performance on multi-image benchmarks, while preserving comparable performance on single-image tasks.
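The permutation step in this entry can be illustrated with a small sketch that shuffles the order of a sample's images and tracks where a referenced image ends up. The abstract does not say how labels are kept consistent after permutation, so the index remapping here, along with the function name and the toy sample, is an assumption for illustration only.

```python
import random


def permute_images(images, answer_index, rng):
    """Shuffle image order and return (permuted_images, new_answer_index).

    `answer_index` is the position of the image the question's answer
    refers to; remapping it after the shuffle is a hypothetical detail.
    """
    order = list(range(len(images)))
    rng.shuffle(order)
    permuted = [images[i] for i in order]
    # The answer image moved to the slot whose source index equals answer_index.
    new_answer_index = order.index(answer_index)
    return permuted, new_answer_index


rng = random.Random(0)  # seeded so the augmentation is reproducible
images = ["img_a", "img_b", "img_c"]
permuted, new_answer = permute_images(images, answer_index=1, rng=rng)
```

Whatever order the shuffle produces, `permuted[new_answer]` is still `"img_b"`, so the question and its answer stay aligned while the positional layout varies.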

[69] Frequency-Calibrated Membership Inference Attacks on Medical Image Diffusion Models

Xinkai Zhao,Yuta Tokuoka,Junichiro Iwasawa,Keita Oda

Main category: cs.CV

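TL;DR: The paper proposes FCRE, a new method for membership inference attacks (MIAs) on medical image diffusion models that improves the accuracy of privacy risk assessment by focusing on reconstruction error in the mid-frequency range.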

Details

Motivation: Existing MIA methods rely on diffusion reconstruction error, which is problematic on medical images because the error is affected by inherent image difficulty and by the difficulty of reconstructing high-frequency detail. Method: The paper proposes the Frequency-Calibrated Reconstruction Error (FCRE) method, which focuses on reconstruction error within a specific mid-frequency range and excludes the high- and low-frequency regions. It analyzes the reverse diffusion process, obtains the mid-frequency reconstruction error, and computes the structural similarity index score between the reconstructed and original images to determine membership. Result: Experiments show that FCRE outperforms existing MIA methods on several medical image datasets. Conclusion: FCRE effectively mitigates the confounding effect of inherent image difficulty and achieves superior performance in membership inference attacks on medical image diffusion models. Abstract: The increasing use of diffusion models for image generation, especially in sensitive areas like medical imaging, has raised significant privacy concerns. Membership Inference Attack (MIA) has emerged as a potential approach to determine if a specific image was used to train a diffusion model, thus quantifying privacy risks. Existing MIA methods often rely on diffusion reconstruction errors, where member images are expected to have lower reconstruction errors than non-member images. However, applying these methods directly to medical images faces challenges. Reconstruction error is influenced by inherent image difficulty, and diffusion models struggle with high-frequency detail reconstruction. To address these issues, we propose a Frequency-Calibrated Reconstruction Error (FCRE) method for MIAs on medical image diffusion models. By focusing on reconstruction errors within a specific mid-frequency range and excluding both high-frequency (difficult to reconstruct) and low-frequency (less informative) regions, our frequency-selective approach mitigates the confounding factor of inherent image difficulty. Specifically, we analyze the reverse diffusion process, obtain the mid-frequency reconstruction error, and compute the structural similarity index score between the reconstructed and original images. Membership is determined by comparing this score to a threshold. Experiments on several medical image datasets demonstrate that our FCRE method outperforms existing MIA methods.
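
The frequency-selective idea can be sketched as a band-limited reconstruction error. The following is a minimal illustration, not the FCRE method itself: it uses a naive 2D DFT on a tiny square image, the band limits `low` and `high` are hypothetical values, and the structural similarity step and the membership threshold from the abstract are omitted.

```python
import cmath
import math


def dft2(img):
    """Naive 2D discrete Fourier transform of a small square image (list of rows)."""
    n = len(img)
    out = [[0j] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0j
            for y in range(n):
                for x in range(n):
                    angle = -2.0 * math.pi * (u * y + v * x) / n
                    total += img[y][x] * cmath.exp(1j * angle)
            out[u][v] = total
    return out


def radial_frequency(u, v, n):
    """Distance of bin (u, v) from the zero frequency, with wrapped indices."""
    return math.hypot(min(u, n - u), min(v, n - v))


def mid_frequency_error(original, reconstructed, low, high):
    """Mean squared spectral error over bins with low <= radius <= high."""
    n = len(original)
    diff = [[original[y][x] - reconstructed[y][x] for x in range(n)] for y in range(n)]
    spectrum = dft2(diff)
    total, count = 0.0, 0
    for u in range(n):
        for v in range(n):
            if low <= radial_frequency(u, v, n) <= high:
                total += abs(spectrum[u][v]) ** 2
                count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    n = 8
    zeros = [[0.0] * n for _ in range(n)]
    stripes = [[math.cos(2.0 * math.pi * 2 * x / n) for x in range(n)] for _ in range(n)]
    # The stripe pattern has radial frequency 2, which lies inside the band.
    print(mid_frequency_error(stripes, zeros, low=1.5, high=3.0))
```

A constant brightness offset (lowest frequency) or a pixel-level checkerboard (highest frequency) contributes nothing to this score, which is the behavior the abstract describes as excluding less informative and hard-to-reconstruct regions.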

[70] Vision Transformers for End-to-End Quark-Gluon Jet Classification from Calorimeter Images

Md Abrar Jahin,Shahriar Soudeep,Arian Rahman Aditta,M. F. Mridha,Nafiz Fahad,Md. Jakir Hossen

Main category: cs.CV

TL;DR: The paper explores the use of Vision Transformer (ViT) architectures for distinguishing between quark- and gluon-initiated jets using calorimeter images, showing that ViT-based models outperform traditional CNNs.

Details

Motivation: Distinguishing between quark- and gluon-initiated jets is crucial in high-energy physics but remains challenging. While deep learning methods like CNNs have been used, the potential of Vision Transformers (ViTs) has not been fully explored for this task, especially under realistic conditions. Method: The authors use simulated 2012 CMS Open Data to construct multi-channel jet-view images from detector-level energy deposits and reconstructed tracks. They then evaluate ViT and ViT-CNN hybrid models for quark-gluon jet classification, comparing their performance against established CNN baselines. Result: ViT-based models, particularly ViT+MaxViT and ViT+ConvNeXt hybrids, consistently outperform CNN baselines in F1-score, ROC-AUC, and accuracy, demonstrating the advantage of capturing long-range spatial correlations within jet substructure. Conclusion: This study establishes the first systematic framework for applying ViT architectures to calorimeter image-based jet classification using public collider data, providing robust performance baselines and a structured dataset for further research. Abstract: Distinguishing between quark- and gluon-initiated jets is a critical and challenging task in high-energy physics, pivotal for improving new physics searches and precision measurements at the Large Hadron Collider. While deep learning, particularly Convolutional Neural Networks (CNNs), has advanced jet tagging using image-based representations, the potential of Vision Transformer (ViT) architectures, renowned for modeling global contextual information, remains largely underexplored for direct calorimeter image analysis, especially under realistic detector and pileup conditions. This paper presents a systematic evaluation of ViTs and ViT-CNN hybrid models for quark-gluon jet classification using simulated 2012 CMS Open Data. 
We construct multi-channel jet-view images from detector-level energy deposits (ECAL, HCAL) and reconstructed tracks, enabling an end-to-end learning approach. Our comprehensive benchmarking demonstrates that ViT-based models, notably ViT+MaxViT and ViT+ConvNeXt hybrids, consistently outperform established CNN baselines in F1-score, ROC-AUC, and accuracy, highlighting the advantage of capturing long-range spatial correlations within jet substructure. This work establishes the first systematic framework and robust performance baselines for applying ViT architectures to calorimeter image-based jet classification using public collider data, alongside a structured dataset suitable for further deep learning research in this domain.

[71] Advances in Compliance Detection: Novel Models Using Vision-Based Tactile Sensors

Ziteng Li,Malte Kuhlmann,Ilana Nisky,Nicolás Navarro-Guerrero

Main category: cs.CV

TL;DR: The paper proposes two models based on LRCN and Transformer architectures for predicting compliance metrics from RGB tactile images and other information captured by the GelSight sensor. These models show significant performance improvement over the baseline.

Details

Motivation: Traditional compliance detection methods have limitations in terms of portability, scalability, and accuracy. Neural network-based approaches still suffer from insufficient prediction accuracy. Method: Two models are proposed based on Long-term Recurrent Convolutional Networks (LRCNs) and Transformer architectures. They leverage RGB tactile images and other information captured by the vision-based sensor GelSight to predict compliance metrics accurately. Result: The models exhibit significant performance improvement over the baseline when validated using multiple metrics. The correlation between sensor compliance and object compliance estimation shows that harder objects than the sensor are more challenging to estimate. Conclusion: The proposed models effectively estimate compliance metrics with improved accuracy, offering advancements over traditional methods and previous neural network-based approaches. Abstract: Compliance is a critical parameter for describing objects in engineering, agriculture, and biomedical applications. Traditional compliance detection methods are limited by their lack of portability and scalability, rely on specialized, often expensive equipment, and are unsuitable for robotic applications. Moreover, existing neural network-based approaches using vision-based tactile sensors still suffer from insufficient prediction accuracy. In this paper, we propose two models based on Long-term Recurrent Convolutional Networks (LRCNs) and Transformer architectures that leverage RGB tactile images and other information captured by the vision-based sensor GelSight to predict compliance metrics accurately. We validate the performance of these models using multiple metrics and demonstrate their effectiveness in accurately estimating compliance. The proposed models exhibit significant performance improvement over the baseline. 
Additionally, we investigated the correlation between sensor compliance and object compliance estimation, which revealed that objects that are harder than the sensor are more challenging to estimate.

[72] Hyper-Local Deformable Transformers for Text Spotting on Historical Maps

Yijun Lin,Yao-Yi Chiang

Main category: cs.CV

TL;DR: This paper proposes PALETTE, an end-to-end text spotter for scanned historical maps with a novel hyper-local sampling module and SynthMap+, a method to generate synthetic map images. Experiments show that PALETTE outperforms current methods on historical map datasets.

Details

Motivation: Text extraction from historical maps is challenging due to the lack of effective methods and training data. Existing approaches are tailored to specific map styles and machine learning-based methods face difficulties in extracting precise image features for predicting sub-components in a text instance. Method: PALETTE introduces a hyper-local sampling module to learn localized image features around target boundary points and characters for detection and recognition, as well as hyper-local positional embeddings to learn spatial interactions between boundary points and characters. Additionally, SynthMap+ is presented as a novel approach to automatically generate synthetic map images for training text spotters. Result: PALETTE with SynthMap+ outperforms state-of-the-art text spotters on two new benchmark datasets of historical maps, particularly for long and angled text. It has been deployed to process over 60,000 maps and generated over 100 million text labels. Conclusion: PALETTE is an effective solution for text spotting on historical maps, overcoming challenges posed by lengthy, rotated text and complex backgrounds. Abstract: Text on historical maps contains valuable information providing georeferenced historical, political, and cultural contexts. However, text extraction from historical maps is challenging due to the lack of (1) effective methods and (2) training data. Previous approaches use ad-hoc steps tailored to only specific map styles. Recent machine learning-based text spotters (e.g., for scene images) have the potential to solve these challenges because of their flexibility in supporting various types of text instances. However, these methods still face challenges in extracting precise image features for predicting every sub-component (boundary points and characters) in a text instance.
This is critical because map text can be lengthy and highly rotated with complex backgrounds, posing difficulties in detecting relevant image features from a rough text region. This paper proposes PALETTE, an end-to-end text spotter for scanned historical maps of a wide variety. PALETTE introduces a novel hyper-local sampling module to explicitly learn localized image features around the target boundary points and characters of a text instance for detection and recognition. PALETTE also enables hyper-local positional embeddings to learn spatial interactions between boundary points and characters within and across text instances. In addition, this paper presents a novel approach to automatically generate synthetic map images, SynthMap+, for training text spotters for historical maps. The experiment shows that PALETTE with SynthMap+ outperforms SOTA text spotters on two new benchmark datasets of historical maps, particularly for long and angled text. We have deployed PALETTE with SynthMap+ to process over 60,000 maps in the David Rumsey Historical Map collection and generated over 100 million text labels to support map searching. The project is released at https://github.com/kartta-foundation/mapkurator-palette-doc.

[73] Break Stylistic Sophon: Are We Really Meant to Confine the Imagination in Style Transfer?

Gary Song Yan,Yusen Zhang,Jinyu Zhao,Hao Zhang,Zhangping Yang,Guanye Xiong,Yanfei Liu,Tao Zhang,Yujie He,Siyuan Tian,Yao Gou,Min Li

Main category: cs.CV

TL;DR: This paper presents StyleWallfacer, a novel framework for style transfer and text-driven stylization that uses semantic-based style injection, human feedback data augmentation, and a triple diffusion process to achieve high-quality results.

Details

Motivation: To address issues in traditional style transfer methods and unify the framework for different tasks, enabling artist level style transfer and text driven stylization. Method: 1) Semantic-based style injection method using BLIP and CLIP space; 2) Data augmentation strategy based on human feedback; 3) Training-free triple diffusion process manipulating self-attention layers. Result: Achieved high-quality image-driven style transfer and text-driven stylization with artist-level results while preserving original image content. Also achieved image color editing during style transfer for the first time. Conclusion: StyleWallfacer revolutionizes the field of style transfer by providing a unified framework that addresses various issues and delivers superior results. Abstract: In this pioneering study, we introduce StyleWallfacer, a groundbreaking unified training and inference framework, which not only addresses various issues encountered in the style transfer process of traditional methods but also unifies the framework for different tasks. This framework is designed to revolutionize the field by enabling artist level style transfer and text driven stylization. First, we propose a semantic-based style injection method that uses BLIP to generate text descriptions strictly aligned with the semantics of the style image in CLIP space. By leveraging a large language model to remove style-related descriptions from these descriptions, we create a semantic gap. This gap is then used to fine-tune the model, enabling efficient and drift-free injection of style knowledge. Second, we propose a data augmentation strategy based on human feedback, incorporating high-quality samples generated early in the fine-tuning process into the training set to facilitate progressive learning and significantly reduce its overfitting. 
Finally, we design a training-free triple diffusion process using the fine-tuned model, which manipulates the features of self-attention layers in a manner similar to the cross-attention mechanism. Specifically, in the generation process, the key and value of the content-related process are replaced with those of the style-related process to inject style while maintaining text control over the model. We also introduce query preservation to mitigate disruptions to the original content. Under such a design, we have achieved high-quality image-driven style transfer and text-driven stylization, delivering artist-level style transfer results while preserving the original image content. Moreover, we achieve image color editing during the style transfer process for the first time.
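
The key/value replacement described above can be illustrated with plain scaled dot-product attention. This is a simplified sketch under stated assumptions, not the StyleWallfacer implementation: the function names are hypothetical, features are small Python lists, and the three diffusion processes are reduced to one attention call in which queries come from the content branch and keys and values come from the style branch.

```python
import math


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        outputs.append([
            sum(w / norm * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def style_injected_attention(content_queries, style_keys, style_values):
    """Keep the content queries, but attend over the style keys and values."""
    return attention(content_queries, style_keys, style_values)


if __name__ == "__main__":
    content_q = [[1.0, 0.0], [0.0, 1.0]]
    style_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    style_v = [[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]]
    print(style_injected_attention(content_q, style_k, style_v))
```

Each output row is a convex combination of the style values, weighted by how well a content query matches each style key, which is why the layout follows the content while the appearance is drawn from the style features.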

[74] Enhancing Vector Quantization with Distributional Matching: A Theoretical and Empirical Study

Xianghong Fang,Litao Guo,Hengchao Chen,Yuxuan Zhang,Xiaofan Xia,Dingjie Song,Yexin Liu,Hao Wang,Harry Yang,Yuan Yuan,Qiang Sun

Main category: cs.CV

TL;DR: The paper addresses training instability and codebook collapse in vector quantization by using Wasserstein distance to align feature and code vector distributions, achieving high codebook utilization and reduced quantization error.

Details

Motivation: Existing vector quantization methods suffer from training instability and codebook collapse due to the mismatch between the distributions of features and code vectors, leading to unrepresentative code vectors and significant data information loss. Method: The authors employ the Wasserstein distance to align the distributions of features and code vectors in order to resolve the issues of training instability and codebook collapse. Result: The approach achieves near 100% codebook utilization and significantly reduces the quantization error. Conclusion: Empirical and theoretical analyses confirm the effectiveness of the proposed method in addressing the limitations of current vector quantization techniques. Abstract: The success of autoregressive models largely depends on the effectiveness of vector quantization, a technique that discretizes continuous features by mapping them to the nearest code vectors within a learnable codebook. Two critical issues in existing vector quantization methods are training instability and codebook collapse. Training instability arises from the gradient discrepancy introduced by the straight-through estimator, especially in the presence of significant quantization errors, while codebook collapse occurs when only a small subset of code vectors are utilized during training. A closer examination of these issues reveals that they are primarily driven by a mismatch between the distributions of the features and code vectors, leading to unrepresentative code vectors and significant data information loss during compression. To address this, we employ the Wasserstein distance to align these two distributions, achieving near 100% codebook utilization and significantly reducing the quantization error. Both empirical and theoretical analyses validate the effectiveness of the proposed approach.
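
The quantities discussed above (nearest-code assignment, codebook utilization, quantization error) can be made concrete with a small sketch. The function names and toy data are illustrative assumptions. The `wasserstein_1d` helper shows only the standard one-dimensional empirical Wasserstein-1 distance between equal-size samples; how the paper actually computes and optimizes its Wasserstein term is not stated in the abstract.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest code vector (squared L2)."""
    indices = []
    for f in features:
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def codebook_utilization(indices, codebook_size):
    """Fraction of code vectors selected at least once."""
    return len(set(indices)) / codebook_size


def quantization_error(features, codebook, indices):
    """Mean squared distance between each feature and its assigned code vector."""
    total = 0.0
    for f, i in zip(features, indices):
        total += sum((a - b) ** 2 for a, b in zip(f, codebook[i]))
    return total / len(features)


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples (sorted matching)."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


if __name__ == "__main__":
    features = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
    # Mismatched codebook: only one code vector lies near the features.
    collapsed = [[0.5, 0.5], [10.0, 10.0], [20.0, 20.0], [30.0, 30.0]]
    # Codebook whose distribution matches the features.
    matched = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
    for name, book in (("collapsed", collapsed), ("matched", matched)):
        idx = quantize(features, book)
        print(name, codebook_utilization(idx, len(book)), quantization_error(features, book, idx))
```

In the toy data, the mismatched codebook uses a quarter of its entries and incurs a large error, while the matched codebook uses every entry with zero error, mirroring the collapse-versus-alignment contrast in the abstract.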

[75] SynPo: Boosting Training-Free Few-Shot Medical Segmentation via High-Quality Negative Prompts

Yufei Liu,Haoke Xiao,Jiaxing Chai,Yongcun Zhang,Rong Wang,Zijie Meng,Zhiming Luo

Main category: cs.CV

TL;DR: The paper introduces SynPo, a training-free few-shot method based on LVMs (e.g., SAM) that improves the quality of negative prompts for better performance on low-contrast medical images.

Details

Motivation: Existing training-free methods based on Large Vision Models (LVMs) fail to effectively utilize negative prompts, resulting in poor performance on low-contrast medical images. Method: Propose SynPo which includes designing a Confidence Map Synergy Module combining DINOv2 and SAM strengths, selecting top-k pixels as positive points set, choosing negative points set using Gaussian distribution, and performing independent K-means clustering for both sets. These selected points are used as high-quality prompts for SAM to obtain segmentation results. Result: Extensive experiments show that SynPo achieves performance comparable to state-of-the-art training-based few-shot methods. Conclusion: SynPo is a promising training-free few-shot method based on LVMs that significantly enhances the utilization of negative prompts, leading to improved performance on low-contrast medical image segmentation. Abstract: The advent of Large Vision Models (LVMs) offers new opportunities for few-shot medical image segmentation. However, existing training-free methods based on LVMs fail to effectively utilize negative prompts, leading to poor performance on low-contrast medical images. To address this issue, we propose SynPo, a training-free few-shot method based on LVMs (e.g., SAM), with the core insight: improving the quality of negative prompts. To select point prompts in a more reliable confidence map, we design a novel Confidence Map Synergy Module by combining the strengths of DINOv2 and SAM. Based on the confidence map, we select the top-k pixels as the positive points set and choose the negative points set using a Gaussian distribution, followed by independent K-means clustering for both sets. Then, these selected points are leveraged as high-quality prompts for SAM to get the segmentation results. Extensive experiments demonstrate that SynPo achieves performance comparable to state-of-the-art training-based few-shot methods.
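
The point-selection steps (top-k pixels, then K-means clustering) can be sketched as follows. This is an illustration under assumptions, not SynPo's code: the confidence map is a plain nested list, K-means uses a deterministic initialization from the first points, and the Gaussian-based choice of negative points is omitted because the abstract does not give its details.

```python
def top_k_pixels(conf_map, k):
    """Return (row, col) of the k highest-confidence pixels."""
    scored = [
        (conf_map[y][x], y, x)
        for y in range(len(conf_map))
        for x in range(len(conf_map[0]))
    ]
    # Stable sort: ties keep their row-major order.
    scored.sort(key=lambda item: -item[0])
    return [(y, x) for _, y, x in scored[:k]]


def kmeans(points, n_clusters, iters=10):
    """Plain K-means on 2-D points, initialized from the first n_clusters points."""
    centers = [(float(p[0]), float(p[1])) for p in points[:n_clusters]]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            groups[dists.index(min(dists))].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


if __name__ == "__main__":
    conf = [
        [0.9, 0.8, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.7, 0.6],
    ]
    positives = top_k_pixels(conf, 4)
    # The cluster centers would serve as point prompts in this sketch.
    print(kmeans(positives, 2))
```

Clustering the candidates reduces many neighboring high-confidence pixels to a few spatially spread point prompts.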

[76] Enhancing point cloud analysis via neighbor aggregation correction based on cross-stage structure correlation

Jiaqi Shi,Jin Xiao,Xiaoguang Hu,Boyang Song,Hao Jiang,Tianyou Chen,Baochang Zhang

Main category: cs.CV

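TL;DR: The paper proposes PDSA, a new point cloud analysis module that corrects the feature distribution using correlation in high-dimensional space, improving computational efficiency and robustness and delivering significant performance gains and generalization on semantic segmentation and classification tasks.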

Details

Motivation: Existing point cloud analysis methods suffer from irrelevant-point interference and a feature hierarchy gap when aggregating local structures, and methods that address these limitations typically incur high computational overhead and are sensitive to noise. Method: The paper proposes the Point Distribution Set Abstraction (PDSA) module, which uses correlation in high-dimensional space to correct the feature distribution. A lightweight cross-stage structural descriptor distinguishes point correlation, reducing the variance of the neighbor feature matrix and increasing class separability. A key point mechanism is introduced to reduce computational overhead. Result: Experiments on semantic segmentation and classification with different baselines verify the method's generalization and show significant performance gains at lower parameter cost. Ablation studies and visualizations further demonstrate its effectiveness and rationality. Conclusion: The PDSA module improves the computational efficiency and robustness of point cloud analysis, with good generalization and performance gains. Abstract: Point cloud analysis is the cornerstone of many downstream tasks, among which aggregating local structures is the basis for understanding point cloud data. While numerous works aggregate neighbors using three-dimensional relative coordinates, they suffer from irrelevant-point interference and a feature hierarchy gap due to the limitations of local coordinates. Although some works address this limitation by refining spatial description through explicit modeling of cross-stage structure, these enhancement methods based on direct geometric structure encoding suffer from high computational overhead and noise sensitivity. To overcome these problems, we propose the Point Distribution Set Abstraction module (PDSA), which utilizes the correlation in the high-dimensional space to correct the feature distribution during aggregation, improving computational efficiency and robustness. PDSA distinguishes the point correlation based on a lightweight cross-stage structural descriptor, and enhances structural homogeneity by reducing the variance of the neighbor feature matrix and increasing class separability through long-distance modeling. Additionally, we introduce a key point mechanism to optimize the computational overhead. Experimental results on semantic segmentation and classification tasks based on different baselines verify the generalization of the proposed method and show significant performance improvement with less parameter cost. The corresponding ablation and visualization results demonstrate the effectiveness and rationality of our method.
The code and trained weights are available at: https://github.com/AGENT9717/PointDistribution

[77] Echo-DND: A dual noise diffusion model for robust and precise left ventricle segmentation in echocardiography

Abdur Rahman,Keerthiveena Balraj,Manojkumar Ramteke,Anurag Singh Rathore

Main category: cs.CV

TL;DR: Recent advancements in diffusion probabilistic models (DPMs) have significantly impacted image processing, especially in medical applications. This paper introduces Echo-DND, a dual-noise diffusion model designed for accurate segmentation of the left ventricle (LV) in echocardiograms. The model combines Gaussian and Bernoulli noises, uses a multi-scale fusion conditioning module, and applies spatial coherence calibration to enhance segmentation precision. Evaluated on CAMUS and EchoNet-Dynamic datasets, Echo-DND outperforms existing models with high Dice scores, setting a new standard in echocardiogram segmentation.

Details

Motivation: Accurate segmentation of the left ventricle in echocardiograms is critical for diagnosis and treatment. However, ultrasound images are often noisy, have low contrast, and unclear LV boundaries, making the segmentation process challenging. Thus, there is a need for a more precise and effective model for this task. Method: The proposed method, Echo-DND, is a dual-noise diffusion model that incorporates Gaussian and Bernoulli noises. It also features a multi-scale fusion conditioning module to improve segmentation accuracy and uses spatial coherence calibration to ensure spatial integrity in segmentation masks. Result: Echo-DND was rigorously tested on the CAMUS and EchoNet-Dynamic datasets. The results showed that it outperforms current state-of-the-art models, achieving Dice scores of 0.962 and 0.939 respectively. Conclusion: Echo-DND establishes a new benchmark in echocardiogram segmentation, demonstrating superior performance compared to existing models. Its architecture has potential for broader use in other medical imaging tasks, enhancing diagnostic accuracy across various medical fields. Abstract: Recent advancements in diffusion probabilistic models (DPMs) have revolutionized image processing, demonstrating significant potential in medical applications. Accurate segmentation of the left ventricle (LV) in echocardiograms is crucial for diagnostic procedures and necessary treatments. However, ultrasound images are notoriously noisy with low contrast and ambiguous LV boundaries, thereby complicating the segmentation process. To address these challenges, this paper introduces Echo-DND, a novel dual-noise diffusion model specifically designed for this task. Echo-DND leverages a unique combination of Gaussian and Bernoulli noises. It also incorporates a multi-scale fusion conditioning module to improve segmentation precision. Furthermore, it utilizes spatial coherence calibration to maintain spatial integrity in segmentation masks. 
The model's performance was rigorously validated on the CAMUS and EchoNet-Dynamic datasets. Extensive evaluations demonstrate that the proposed framework outperforms existing SOTA models. It achieves high Dice scores of 0.962 and 0.939 on these datasets, respectively. The proposed Echo-DND model establishes a new standard in echocardiogram segmentation, and its architecture holds promise for broader applicability in other medical imaging tasks, potentially improving diagnostic accuracy across various medical domains. Project page: https://abdur75648.github.io/Echo-DND
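
For reference, the Dice score reported above measures the overlap between a predicted and a ground-truth binary mask as 2|A ∩ B| / (|A| + |B|). The sketch below is a generic implementation of that formula on nested lists of 0/1 values; the function name and the convention of returning 1.0 for two empty masks are assumptions, and this is not the evaluation code used in the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    intersection = sum(
        p * t
        for row_p, row_t in zip(pred, target)
        for p, t in zip(row_p, row_t)
    )
    total = sum(sum(row) for row in pred) + sum(sum(row) for row in target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    target = [[1, 0], [0, 0]]
    print(dice_score(pred, target))
```

A score of 1.0 means the masks coincide exactly, and 0.0 means they do not overlap at all.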

[78] ReSeDis: A Dataset for Referring-based Object Search across Large-Scale Image Collections

Ziling Huang,Yidan Zhang,Shin'ichi Satoh

Main category: cs.CV

TL;DR: ReSeDis is introduced as the first task that unifies corpus-level retrieval with pixel-level grounding, offering a realistic testbed for building robust multimodal search systems.

Details

Motivation: Current techniques only address one side of the challenge faced by large-scale visual search engines. Visual grounding yields tight boxes and masks but produces false alarms when applied to web-scale collections, while text-to-image retrieval excels at ranking relevant images but offers no fine-grained localization. Method: The paper introduces Referring Search and Discovery (ReSeDis), which must decide whether the queried object appears in each image and where it is, returning bounding boxes or segmentation masks. A benchmark is curated with unique mappings of descriptions to object instances, and a task-specific metric is designed to score retrieval recall and localization precision. Result: A zero-shot baseline using a frozen vision-language model reveals significant headroom for future study. Conclusion: ReSeDis provides a realistic, end-to-end testbed for building the next generation of robust and scalable multimodal search systems. Abstract: Large-scale visual search engines are expected to solve a dual problem at once: (i) locate every image that truly contains the object described by a sentence and (ii) identify the object's bounding box or exact pixels within each hit. Existing techniques address only one side of this challenge. Visual grounding yields tight boxes and masks but rests on the unrealistic assumption that the object is present in every test image, producing a flood of false alarms when applied to web-scale collections. Text-to-image retrieval excels at sifting through massive databases to rank relevant images, yet it stops at whole-image matches and offers no fine-grained localization. We introduce Referring Search and Discovery (ReSeDis), the first task that unifies corpus-level retrieval with pixel-level grounding. Given a free-form description, a ReSeDis model must decide whether the queried object appears in each image and, if so, where it is, returning bounding boxes or segmentation masks. 
To enable rigorous study, we curate a benchmark in which every description maps uniquely to object instances scattered across a large, diverse corpus, eliminating unintended matches. We further design a task-specific metric that jointly scores retrieval recall and localization precision. Finally, we provide a straightforward zero-shot baseline using a frozen vision-language model, revealing significant headroom for future study. ReSeDis offers a realistic, end-to-end testbed for building the next generation of robust and scalable multimodal search systems.

[79] Conquering the Retina: Bringing Visual in-Context Learning to OCT

Alessio Negrini,Simon Reiß

Main category: cs.CV

TL;DR: This paper explores training generalist models for retinal OCT using visual in-context learning (VICL), proposes an evaluation protocol, evaluates a state-of-the-art approach on multiple datasets, and releases code openly.

Details

Motivation: The motivation is to overcome the limitation of specialized models that are only applicable to predefined tasks by developing generalist models which can adapt to various tasks without task-specific development. Method: Training generalist models for retinal optical coherence tomography using visual in-context learning (VICL) with few examples provided at inference time, and proposing a broad evaluation protocol tailored to VICL in OCT. Result: Established a first baseline for in-context learning in OCT, highlighting its potential and current limitations. Conclusion: Generalist models trained with VICL show promise for retinal OCT analysis, but have limitations that need further research. The open release of code aims to promote further research and practical adoption. Abstract: Recent advancements in medical image analysis have led to the development of highly specialized models tailored to specific clinical tasks. These models have demonstrated exceptional performance and remain a crucial research direction. Yet, their applicability is limited to predefined tasks, requiring expertise and extensive resources for development and adaptation. In contrast, generalist models offer a different form of utility: allowing medical practitioners to define tasks on the fly without the need for task-specific model development. In this work, we explore how to train generalist models for the domain of retinal optical coherence tomography using visual in-context learning (VICL), i.e., training models to generalize across tasks based on a few examples provided at inference time. To facilitate rigorous assessment, we propose a broad evaluation protocol tailored to VICL in OCT. We extensively evaluate a state-of-the-art medical VICL approach on multiple retinal OCT datasets, establishing a first baseline to highlight the potential and current limitations of in-context learning for OCT. To foster further research and practical adoption, we openly release our code.

[80] Privacy-Shielded Image Compression: Defending Against Exploitation from Vision-Language Pretrained Models

Xuelin Shen,Jiayin Xu,Kangsheng Yin,Wenhan Yang

Main category: cs.CV

TL;DR: This paper proposes Privacy-Shielded Image Compression (PSIC) with a Conditional Latent Trigger Generation (CLTG) module and an Uncertainty-Aware Encryption-Oriented (UAEO) optimization function to protect images from VLP models while maintaining compression functionality and perceptual quality.

Details

Motivation: The motivation is to protect users' privacy by preventing publicly posted images from being exploited by vision-language pretrained (VLP) models through implementing defenses at the image compression stage. Method: The method involves proposing PSIC, which produces bitstreams with multiple decoding options. It includes CLTG for generating bias information and UAEO for optimizing encryption based on VLP model uncertainty. An adaptive multi-objective optimization strategy is also incorporated. Result: Extensive experiments across multiple downstream tasks have shown the effectiveness of the proposed scheme in protecting images while maintaining compression functionality and perceptual quality. Conclusion: The proposed PSIC scheme is effective, plug-and-play, and can be integrated into existing LIC models. Abstract: The improved semantic understanding of vision-language pretrained (VLP) models has made it increasingly difficult to protect publicly posted images from being exploited by search engines and other similar tools. In this context, this paper seeks to protect users' privacy by implementing defenses at the image compression stage to prevent exploitation. Specifically, we propose a flexible coding method, termed Privacy-Shielded Image Compression (PSIC), that can produce bitstreams with multiple decoding options. By default, the bitstream is decoded to preserve satisfactory perceptual quality while preventing interpretation by VLP models. Our method also retains the original image compression functionality. With a customizable input condition, the proposed scheme can reconstruct the image that preserves its full semantic information. 
A Conditional Latent Trigger Generation (CLTG) module is proposed to produce bias information based on customizable conditions to guide the decoding process into different reconstructed versions, and an Uncertainty-Aware Encryption-Oriented (UAEO) optimization function is designed to leverage the soft labels inferred from the target VLP model's uncertainty on the training data. This paper further incorporates an adaptive multi-objective optimization strategy to obtain improved encrypting performance and perceptual quality simultaneously within a unified training process. The proposed scheme is plug-and-play and can be seamlessly integrated into most existing Learned Image Compression (LIC) models. Extensive experiments across multiple downstream tasks have demonstrated the effectiveness of our design.

[81] DM-FNet: Unified multimodal medical image fusion via diffusion process-trained encoder-decoder

Dan He,Weisheng Li,Guofen Wang,Yuping Huang,Shiqiang Liu

Main category: cs.CV

TL;DR: A two-stage diffusion model-based fusion network (DM-FNet) is proposed for multimodal medical image fusion, which enhances feature recognition and interaction, leading to high-quality fused images.

Details

Motivation: Existing MMIF methods have limited capacity to capture detailed features during conventional training and suffer from insufficient cross-modal feature interaction, resulting in suboptimal fused image quality. Method: The study proposes DM-FNet with two stages: Stage I uses a diffusion process to train UNet for image reconstruction, capturing detailed information. Stage II inputs noisy images into the fusion network and integrates three key fusion modules to adaptively process different modalities. A hybrid loss function is also used. Result: Experimental results across various medical image types show exceptional performance in objective evaluation metrics. The fused image preserves brightness, radioactive tracer distribution, textures, and edges. Conclusion: DM-FNet addresses limitations of existing MMIF methods by enhancing feature recognition and interaction, leading to high-quality fused images suitable for comprehensive and accurate diagnosis. Abstract: Multimodal medical image fusion (MMIF) extracts the most meaningful information from multiple source images, enabling a more comprehensive and accurate diagnosis. Achieving high-quality fusion results requires a careful balance of brightness, color, contrast, and detail; this ensures that the fused images effectively display relevant anatomical structures and reflect the functional status of the tissues. However, existing MMIF methods have limited capacity to capture detailed features during conventional training and suffer from insufficient cross-modal feature interaction, leading to suboptimal fused image quality. To address these issues, this study proposes a two-stage diffusion model-based fusion network (DM-FNet) to achieve unified MMIF. In Stage I, a diffusion process trains UNet for image reconstruction. UNet captures detailed information through progressive denoising and represents multilevel data, providing a rich set of feature representations for the subsequent fusion network. 
In Stage II, noisy images at various steps are input into the fusion network to enhance the model's feature recognition capability. Three key fusion modules are also integrated to process medical images from different modalities adaptively. Ultimately, the robust network structure and a hybrid loss function are integrated to harmonize the fused image's brightness, color, contrast, and detail, enhancing its quality and information density. The experimental results across various medical image types demonstrate that the proposed method performs exceptionally well regarding objective evaluation metrics. The fused image preserves appropriate brightness, a comprehensive distribution of radioactive tracers, rich textures, and clear edges. The code is available at https://github.com/HeDan-11/DM-FNet.
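
As a rough illustration of what "noisy images at various steps" could look like, the sketch below applies the standard DDPM-style forward noising rule, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. The linear beta schedule, the step count, and the function names `linear_alpha_bar` and `add_noise` are assumptions made for this example; the abstract does not state which schedule or implementation DM-FNet uses.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(pixels, t, alpha_bars, noise=None, rng=None):
    """Return x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps for a flattened image."""
    rng = rng or random.Random(0)
    if noise is None:
        noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [signal * x + sigma * e for x, e in zip(pixels, noise)]


alpha_bars = linear_alpha_bar(1000)
image = [0.2, 0.5, 0.9]  # toy flattened image, hypothetical values
# Noisy copies of the same image at several diffusion steps.
noisy_versions = {t: add_noise(image, t, alpha_bars) for t in (10, 100, 500)}
```

Larger step indices keep less of the original signal, so a network that sees several steps is exposed to the same content at different noise levels.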

[82] video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models

Changli Tang,Yixuan Li,Yudong Yang,Jimin Zhuang,Guangzhi Sun,Wei Li,Zejun Ma,Chao Zhang

Main category: cs.CV

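TL;DR: video-SALMONN 2 is an advanced audio-visual large language model that substantially improves the accuracy and completeness of video descriptions through low-rank adaptation and multi-round preference optimization.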

Details

Motivation: A key challenge in video understanding is generating detailed and accurate natural-language descriptions, especially when audio information is also involved. Method: The paper presents video-SALMONN 2, which uses low-rank adaptation (LoRA) and directed preference optimisation (DPO), and introduces a multi-round DPO (MrDPO) approach to improve training. MrDPO periodically updates the DPO reference model, merges and re-initialises the LoRA module, and uses ground-truth video captions for guidance. Result: Experiments show that MrDPO reduces video-SALMONN 2's captioning error rates by 28%. With only 7 billion parameters, the model surpasses leading models such as GPT-4o and Gemini-1.5-Pro on video captioning and remains highly competitive on video question-answering benchmarks among models of similar size. Conclusion: Through its training strategy and model design, video-SALMONN 2 reaches a new level of performance in video captioning, and the code is open-sourced. Abstract: Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimisation (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using DPO. To further improve training, we propose a novel multi-round DPO (MrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initialising the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilise the process. Experimental results show that MrDPO significantly enhances video-SALMONN 2's captioning accuracy, reducing the captioning error rates by 28%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining highly competitive performance to the state-of-the-art on widely used video question-answering benchmarks among models of similar size. Codes are available at https://github.com/bytedance/video-SALMONN-2.
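
To make the preference-optimisation step more concrete, here is a minimal sketch of the commonly used DPO loss for a single preference pair. It assumes the four sequence log-probabilities are already computed; the function name `dpo_loss`, the default `beta`, and the example numbers are illustrative and not taken from the paper. In a multi-round setup such as the one described above, the reference log-probabilities would presumably be recomputed each time the reference model is refreshed.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more the policy prefers the chosen caption than the reference does.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss equals log(2).
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy shifted toward the chosen caption: the loss drops below log(2).
improved = dpo_loss(-10.0, -16.0, -12.0, -15.0)
```

The loss only depends on log-probability differences relative to the reference model, which is why changing the reference between rounds changes the training signal even for the same preference pairs.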

[83] Convolutional Feature Enhancement and Attention Fusion BiFPN for Ship Detection in SAR Images

Liangjie Meng,Danxia Li,Jinrong He,Lili Ma,Zhixin Li

Main category: cs.CV

TL;DR: The paper proposes C-AFBiFPN, a novel framework for SAR ship detection which enhances feature representation and fusion to improve detection accuracy.

Details

Motivation: SAR ship detection encounters challenges such as scale variations among ships, small offshore vessels mixed with noise, and complex backgrounds for large nearshore ships. Method: C-AFBiFPN constructs a Convolutional Feature Enhancement (CFE) module following the backbone network and integrates BiFormer attention within the fusion strategy of BiFPN to create the AFBiFPN network. Result: Experimental results on SAR Ship Detection Dataset (SSDD) show that the proposed approach significantly improves detection accuracy for small targets, robustness against occlusions, and adaptability to multi-scale features. Conclusion: C-AFBiFPN substantially enhances the performance of SAR ship detection in terms of accuracy, robustness, and adaptability. Abstract: Synthetic Aperture Radar (SAR) enables submeter-resolution imaging and all-weather monitoring via active microwave and advanced signal processing. Currently, SAR has found extensive applications in critical maritime domains such as ship detection. However, SAR ship detection faces several challenges, including significant scale variations among ships, the presence of small offshore vessels mixed with noise, and complex backgrounds for large nearshore ships. To address these issues, this paper proposes a novel feature enhancement and fusion framework named C-AFBiFPN. C-AFBiFPN constructs a Convolutional Feature Enhancement (CFE) module following the backbone network, aiming to enrich feature representation and enhance the ability to capture and represent local details and contextual information. Furthermore, C-AFBiFPN innovatively integrates BiFormer attention within the fusion strategy of BiFPN, creating the AFBiFPN network. AFBiFPN improves the global modeling capability of cross-scale feature fusion and can adaptively focus on critical feature regions. 
The experimental results on SAR Ship Detection Dataset (SSDD) indicate that the proposed approach substantially enhances detection accuracy for small targets, robustness against occlusions, and adaptability to multi-scale features.

[84] RA-NeRF: Robust Neural Radiance Field Reconstruction with Accurate Camera Pose Estimation under Complex Trajectories

Qingsong Yan,Qiang Wang,Kaiyong Zhao,Jie Chen,Bo Li,Xiaowen Chu,Fei Deng

Main category: cs.CV

TL;DR: RA-NeRF is a new method that can predict highly accurate camera poses even with complex trajectories by using photometric consistency, flow-driven pose regulation, and an implicit pose filter.

Details

Motivation: Existing 3D reconstruction and SLAM tools like NeRF and 3DGS require accurate camera pose priors, but current approaches to achieve this are not satisfactory, especially for complex camera trajectories. Method: RA-NeRF uses an incremental pipeline where it reconstructs the scene using NeRF with photometric consistency, incorporates flow-driven pose regulation to enhance robustness during initialization and localization, and employs an implicit pose filter to capture the camera movement pattern and eliminate noise for pose estimation. Result: Extensive experiments on Tanks&Temple and NeRFBuster datasets show RA-NeRF achieves state-of-the-art results in both camera pose estimation and visual quality. Conclusion: RA-NeRF effectively and robustly reconstructs scenes under complex pose trajectories. Abstract: Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have emerged as powerful tools for 3D reconstruction and SLAM tasks. However, their performance depends heavily on accurate camera pose priors. Existing approaches attempt to address this issue by introducing external constraints but fall short of achieving satisfactory accuracy, particularly when camera trajectories are complex. In this paper, we propose a novel method, RA-NeRF, capable of predicting highly accurate camera poses even with complex camera trajectories. Following the incremental pipeline, RA-NeRF reconstructs the scene using NeRF with photometric consistency and incorporates flow-driven pose regulation to enhance robustness during initialization and localization. Additionally, RA-NeRF employs an implicit pose filter to capture the camera movement pattern and eliminate the noise for pose estimation. To validate our method, we conduct extensive experiments on the Tanks&Temple dataset for standard evaluation, as well as the NeRFBuster dataset, which presents challenging camera pose trajectories. 
On both datasets, RA-NeRF achieves state-of-the-art results in both camera pose estimation and visual quality, demonstrating its effectiveness and robustness in scene reconstruction under complex pose trajectories.

[85] Retrospective Memory for Camouflaged Object Detection

Chenxi Zhang,Jiayun Wu,Qing Zhang,Yazhe Zhai,Youwei Pang

Main category: cs.CV

TL;DR: The paper introduces RetroMem, a recall-augmented architecture for camouflaged object detection (COD) that integrates historical knowledge to enhance perception and inference. It uses a two-stage training paradigm with Dense Multi-scale Adapter (DMA) for learning and Dynamic Memory Mechanism (DMM) along with Inference Pattern Reconstruction (IPR) for recall, significantly improving performance on COD.

Details

Motivation: Existing COD methods rely on static visual representation modeling without explicit mechanisms to acquire historical context, which limits their effectiveness in challenging scenes. Method: RetroMem employs a two-stage training process: 1) Learning stage with DMA to capture multi-scale visual information from pretrained encoder; 2) Recall stage with DMM and IPR to reconstruct camouflage pattern inference using learned knowledge. Result: RetroMem significantly outperforms existing state-of-the-art methods on several widely used datasets for camouflaged object detection. Conclusion: RetroMem enhances COD by dynamically modulating pattern perception and inference through integration of historical knowledge, achieving superior results. Abstract: Camouflaged object detection (COD) primarily focuses on learning subtle yet discriminative representations from complex scenes. Existing methods predominantly follow the parametric feedforward architecture based on static visual representation modeling. However, they lack explicit mechanisms for acquiring historical context, limiting their adaptation and effectiveness in handling challenging camouflage scenes. In this paper, we propose a recall-augmented COD architecture, namely RetroMem, which dynamically modulates camouflage pattern perception and inference by integrating relevant historical knowledge into the process. Specifically, RetroMem employs a two-stage training paradigm consisting of a learning stage and a recall stage to construct, update, and utilize memory representations effectively. During the learning stage, we design a dense multi-scale adapter (DMA) to improve the pretrained encoder's capability to capture rich multi-scale visual information with very few trainable parameters, thereby providing foundational inferences. In the recall stage, we propose a dynamic memory mechanism (DMM) and an inference pattern reconstruction (IPR). 
These components fully leverage the latent relationships between learned knowledge and current sample context to reconstruct the inference of camouflage patterns, thereby significantly improving the model's understanding of camouflage scenes. Extensive experiments on several widely used datasets demonstrate that our RetroMem significantly outperforms existing state-of-the-art methods.

[86] Domain Adaptation for Image Classification of Defects in Semiconductor Manufacturing

Adrian Poniatowski,Natalie Gentner,Manuel Barusco,Davide Dalle Pezze,Samuele Salti,Gian Antonio Susto

Main category: cs.CV

TL;DR: In the semiconductor sector, deep learning methods like Domain Adaptation (DA) are crucial for reducing time to market and enhancing quality. The paper proposes DBACS, a CycleGAN-inspired model with additional loss terms, tested on Electron Microscope images in unsupervised and semi-supervised settings.

Details

Motivation: To address the challenges of high demand, competition, and need for efficiency in the semiconductor industry by leveraging the success of deep learning methods, particularly Domain Adaptation (DA), which reduces manual re-labeling and re-training. Method: The DBACS approach, inspired by CycleGAN and enhanced with additional loss terms, is proposed and tested in both unsupervised and semi-supervised settings using real-world Electron Microscope images. Result: The approaches were validated on real-world Electron Microscope images, demonstrating the effectiveness of DBACS in advancing DA techniques for the semiconductor field. Conclusion: DBACS proves useful in improving robustness and scalability in domain adaptation for semiconductor applications, reducing computational costs and allowing experts to focus on high-value tasks. Abstract: In the semiconductor sector, due to high demand but also strong and increasing competition, time to market and quality are key factors in securing significant market share in various application areas. Thanks to the success of deep learning methods in recent years in the computer vision domain, Industry 4.0 and 5.0 applications, such as defect classification, have achieved remarkable success. In particular, Domain Adaptation (DA) has proven highly effective since it focuses on using the knowledge learned on a (source) domain to adapt and perform effectively on a different but related (target) domain. By improving robustness and scalability, DA minimizes the need for extensive manual re-labeling or re-training of models. This not only reduces computational and resource costs but also allows human experts to focus on high-value tasks. Therefore, we tested the efficacy of DA techniques in semi-supervised and unsupervised settings within the context of the semiconductor field. Moreover, we propose the DBACS approach, a CycleGAN-inspired model enhanced with additional loss terms to improve performance. 
All the approaches are studied and validated on real-world Electron Microscope images considering the unsupervised and semi-supervised settings, proving the usefulness of our method in advancing DA techniques for the semiconductor field.

[87] MSNeRV: Neural Video Representation with Multi-Scale Feature Fusion

Jun Zhu,Xinfeng Zhang,Lv Tang,JunHao Jiang

Main category: cs.CV

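TL;DR: MSNeRV is a multi-scale feature fusion framework for neural video representation. By enhancing temporal consistency, integrating multi-resolution and multi-frequency information, and improving feature extraction, it achieves better compression efficiency than VTM-23.7 in dynamic scenarios.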

Details

Motivation: Existing INR-based methods struggle with detail-intensive and fast-changing video content, mainly because internal network features are underutilized and network designs do not account for video-specific characteristics. Method: The paper proposes MSNeRV, a multi-scale feature fusion framework. In the encoding stage, temporal windows enhance temporal consistency, and the video is divided into multiple Groups of Pictures (GoPs), with a GoP-level grid used for background representation. A multi-scale spatial decoder with a scale-adaptive loss function integrates multi-resolution and multi-frequency information, and a multi-scale feature block makes full use of hidden features. Result: Experiments show that the model has superior representation capability among INR-based methods and surpasses VTM-23.7 (Random Access) in compression efficiency in dynamic scenarios. Conclusion: MSNeRV shows strong video representation and compression capability, in particular outperforming VTM-23.7 (Random Access) in compression efficiency in dynamic scenarios. Abstract: Implicit neural representations (INRs) have emerged as a promising approach for video compression, and have achieved comparable performance to the state-of-the-art codecs such as H.266/VVC. However, existing INR-based methods struggle to effectively represent detail-intensive and fast-changing video content. This limitation mainly stems from the underutilization of internal network features and the absence of video-specific considerations in network design. To address these challenges, we propose a multi-scale feature fusion framework, MSNeRV, for neural video representation. In the encoding stage, we enhance temporal consistency by employing temporal windows, and divide the video into multiple Groups of Pictures (GoPs), where a GoP-level grid is used for background representation. Additionally, we design a multi-scale spatial decoder with a scale-adaptive loss function to integrate multi-resolution and multi-frequency information. To further improve feature extraction, we introduce a multi-scale feature block that fully leverages hidden features. We evaluate MSNeRV on HEVC ClassB and UVG datasets for video representation and compression. Experimental results demonstrate that our model exhibits superior representation capability among INR-based approaches and surpasses VTM-23.7 (Random Access) in dynamic scenarios in terms of compression efficiency.

Qian Li,Feng Liu,Shuojue Yang,Daiyun Shen,Yueming Jin

Main category: cs.CV

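TL;DR: The paper proposes BCRNet, a new framework for improving anatomical landmark detection in laparoscopic liver surgery. Through multi-modal feature extraction, adaptive curve proposal initialization, and a hierarchical curve refinement mechanism, it significantly improves landmark detection accuracy. Experiments show that BCRNet outperforms existing methods on the L3D and P2ILF datasets.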

Details

Motivation: Accurately identifying critical anatomical structures during laparoscopic liver surgery is challenging, and augmented reality systems based on 2D-3D registration can improve surgical navigation. Precise detection of curvilinear anatomical landmarks in laparoscopic images is a vital part of the registration process. Method: (1) A Multi-modal Feature Extraction (MFE) module captures semantic features. (2) Adaptive Curve Proposal Initialization (ACPI) generates pixel-aligned Bezier curves and confidence scores. (3) A Hierarchical Curve Refinement (HCR) mechanism iteratively refines the curve proposals through a multi-stage process, using multi-scale pixel-level features for precise adjustment. Result: Extensive evaluations on the L3D and P2ILF datasets show that BCRNet significantly outperforms current state-of-the-art methods. Conclusion: Through its Bezier curve refinement strategy, BCRNet improves the detection accuracy of anatomical landmarks in laparoscopic liver surgery, supporting augmented-reality-assisted surgery. Abstract: Laparoscopic liver surgery, while minimally invasive, poses significant challenges in accurately identifying critical anatomical structures. Augmented reality (AR) systems, integrating MRI/CT with laparoscopic images based on 2D-3D registration, offer a promising solution for enhancing surgical navigation. A vital aspect of the registration process is the precise detection of curvilinear anatomical landmarks in laparoscopic images. In this paper, we propose BCRNet (Bezier Curve Refinement Net), a novel framework that significantly enhances landmark detection in laparoscopic liver surgery primarily via the Bezier curve refinement strategy. The framework starts with a Multi-modal Feature Extraction (MFE) module designed to robustly capture semantic features. Then we propose Adaptive Curve Proposal Initialization (ACPI) to generate pixel-aligned Bezier curves and confidence scores for reliable initial proposals. Additionally, we design the Hierarchical Curve Refinement (HCR) mechanism to enhance these proposals iteratively through a multi-stage process, capturing fine-grained contextual details from multi-scale pixel-level features for precise Bezier curve adjustment. Extensive evaluations on the L3D and P2ILF datasets demonstrate that BCRNet outperforms state-of-the-art methods, achieving significant performance improvements. Code will be available.
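
For readers unfamiliar with the curve representation, the following sketch shows how a Bezier curve defined by a handful of control points can be turned into image-space points using de Casteljau's algorithm. The control points and the helper names `bezier_point` and `sample_curve` are hypothetical; how BCRNet parameterizes, predicts, or refines its control points is not specified in the abstract.

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] with de Casteljau's algorithm."""
    points = [tuple(p) for p in control_points]
    # Repeatedly interpolate between neighbours until one point remains.
    while len(points) > 1:
        points = [
            tuple((1.0 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(points, points[1:])
        ]
    return points[0]


def sample_curve(control_points, num_samples=5):
    """Sample evenly spaced parameter values along the curve (num_samples >= 2)."""
    return [
        bezier_point(control_points, i / (num_samples - 1))
        for i in range(num_samples)
    ]


# A cubic curve given by four (x, y) control points in pixel coordinates.
control = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
polyline = sample_curve(control, num_samples=5)
```

Because the whole curve is determined by a few control points, refining a landmark amounts to adjusting those points rather than relabeling every pixel along the curve.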

[89] AI-driven visual monitoring of industrial assembly tasks

Mattia Nardon,Stefano Messelodi,Antonio Granata,Fabio Poiesi,Alberto Danese,Davide Boscaini

Main category: cs.CV

TL;DR: ViMAT is an AI-driven system for real-time visual monitoring of assembly tasks without requiring rigid workspace setups or visual markers. It combines a perception module with a reasoning module and has been validated on two assembly tasks.

Details

Motivation: Visual monitoring of industrial assembly tasks is critical but existing commercial solutions typically require rigid workspace setups or visual markers which limit their flexibility and applicability. Method: ViMAT uses a perception module to extract visual observations from multi-view video streams and a reasoning module to infer the most likely action being performed based on the observed assembly state and prior task knowledge. Result: ViMAT was validated on two assembly tasks involving LEGO components and hydraulic press molds, demonstrating its effectiveness through quantitative and qualitative analysis in challenging real-world scenarios. Conclusion: ViMAT provides a flexible solution for real-time visual monitoring of assembly tasks without the need for rigid workspace setups or visual markers. Abstract: Visual monitoring of industrial assembly tasks is critical for preventing equipment damage due to procedural errors and ensuring worker safety. Although commercial solutions exist, they typically require rigid workspace setups or the application of visual markers to simplify the problem. We introduce ViMAT, a novel AI-driven system for real-time visual monitoring of assembly tasks that operates without these constraints. ViMAT combines a perception module that extracts visual observations from multi-view video streams with a reasoning module that infers the most likely action being performed based on the observed assembly state and prior task knowledge. We validate ViMAT on two assembly tasks, involving the replacement of LEGO components and the reconfiguration of hydraulic press molds, demonstrating its effectiveness through quantitative and qualitative analysis in challenging real-world scenarios characterized by partial and uncertain visual observations. Project page: https://tev-fbk.github.io/ViMAT

[90] MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering

Xinqi Fan,Jingting Li,John See,Moi Hoon Yap,Wen-Huang Cheng,Xiaobai Li,Xiaopeng Hong,Su-Jing Wang,Adrian K. Davision

Main category: cs.CV

TL;DR: Facial micro-expression (ME) analysis is advanced by integrating spotting and recognition in a unified pipeline (ME-STR) and exploring ME understanding through visual question answering (ME-VQA), utilizing multimodal large models.

Details

Motivation: Conventional approaches for ME spotting and recognition are suboptimal, especially for analyzing long-duration videos in realistic settings. The emergence of multimodal large language models (MLLMs) and large vision-language models (LVLMs) offers new opportunities to enhance ME analysis. Method: The ME grand challenge (MEGC) 2025 introduces two tasks: ME spot-then-recognize (ME-STR), which integrates ME spotting and recognition in a sequential pipeline; and ME visual question answering (ME-VQA), which uses MLLMs or LVLMs to address diverse question types related to MEs. Result: Participating algorithms will be tested on a specific dataset and their results will be ranked on a leaderboard. Conclusion: The integration of ME spotting and recognition in a unified pipeline and the use of MLLMs/LVLMs for ME understanding represent evolving research directions in facial micro-expression analysis. Abstract: Facial micro-expressions (MEs) are involuntary movements of the face that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression, typically found in a high-stakes environment. In recent years, substantial advancements have been made in the areas of ME recognition, spotting, and generation. However, conventional approaches that treat spotting and recognition as separate tasks are suboptimal, particularly for analyzing long-duration videos in realistic settings. Concurrently, the emergence of multimodal large language models (MLLMs) and large vision-language models (LVLMs) offers promising new avenues for enhancing ME analysis through their powerful multimodal reasoning capabilities. 
The ME grand challenge (MEGC) 2025 introduces two tasks that reflect these evolving research directions: (1) ME spot-then-recognize (ME-STR), which integrates ME spotting and subsequent recognition in a unified sequential pipeline; and (2) ME visual question answering (ME-VQA), which explores ME understanding through visual question answering, leveraging MLLMs or LVLMs to address diverse question types related to MEs. All participating algorithms are required to run on this test set and submit their results on a leaderboard. More details are available at https://megc2025.github.io.

[91] MapFM: Foundation Model-Driven HD Mapping with Multi-Task Contextual Learning

Leonid Ivanov,Vasily Yuryev,Dmitry Yudin

Main category: cs.CV

TL;DR: An enhanced End-to-End model named MapFM is introduced for online vectorized HD map generation in autonomous driving, boosting feature representation quality and integrating auxiliary prediction heads for semantic segmentation.

Details

Motivation: High-definition (HD) maps and semantic maps in bird's-eye view (BEV) are crucial for accurate localization, planning, and decision-making in autonomous driving. Method: The paper proposes MapFM, an enhanced End-to-End model that incorporates a powerful foundation model for encoding camera images to boost feature representation quality. Auxiliary prediction heads for semantic segmentation in the BEV representation are integrated to enrich the model's understanding of the environment and improve prediction quality through multi-task learning. Result: This approach provides richer contextual supervision, leading to a more comprehensive scene representation and ultimately resulting in higher accuracy and improved quality of the predicted vectorized HD maps. Conclusion: MapFM significantly enhances the quality of vectorized HD maps for autonomous driving applications. Abstract: In autonomous driving, high-definition (HD) maps and semantic maps in bird's-eye view (BEV) are essential for accurate localization, planning, and decision-making. This paper introduces an enhanced End-to-End model named MapFM for online vectorized HD map generation. We significantly boost feature representation quality by incorporating a powerful foundation model for encoding camera images. To further enrich the model's understanding of the environment and improve prediction quality, we integrate auxiliary prediction heads for semantic segmentation in the BEV representation. This multi-task learning approach provides richer contextual supervision, leading to a more comprehensive scene representation and ultimately resulting in higher accuracy and improved quality of the predicted vectorized HD maps. The source code is available at https://github.com/LIvanoff/MapFM.

[92] OpenPath: Open-Set Active Learning for Pathology Image Classification via Pre-trained Vision-Language Models

Lanfeng Zhong,Xin Liao,Shichuan Zhang,Shaoting Zhang,Guotai Wang

Main category: cs.CV

TL;DR: This paper proposes OpenPath, an open-set active learning approach for pathology image classification using a pre-trained Vision-Language Model (VLM) to enhance selection of In-Distribution and informative samples.

Details

Motivation: Existing Active Learning methods are mostly designed for closed-set scenarios, leading to inefficiency when dealing with Out-Of-Distribution data in real-world clinical unlabeled pools. Also, random selection in the first query round causes labeling cost waste in open-set scenarios. Method: OpenPath leverages a pre-trained Vision-Language Model (VLM) to select In-Distribution (ID) and informative samples from the unlabeled pool. In the first query round, task-specific prompts combining target and relevant non-target class prompts are used. In subsequent rounds, Diverse Informative ID Sampling (DIS) including Prototype-based ID candidate Selection (PIS) and Entropy-Guided Stochastic Sampling (EGSS) is employed to ensure purity and informativeness while avoiding OOD samples. Result: Experiments on two public pathology image datasets demonstrate that OpenPath significantly improves model performance due to high purity of selected samples and outperforms several state-of-the-art open-set AL methods. Conclusion: OpenPath addresses the challenges of existing AL methods by effectively selecting ID and informative samples, enhancing model performance in pathology image classification. Abstract: Pathology image classification plays a crucial role in accurate medical diagnosis and treatment planning. Training high-performance models for this task typically requires large-scale annotated datasets, which are both expensive and time-consuming to acquire. Active Learning (AL) offers a solution by iteratively selecting the most informative samples for annotation, thereby reducing the labeling effort. However, most AL methods are designed under the assumption of a closed-set scenario, where all the unannotated images belong to target classes. In real-world clinical environments, the unlabeled pool often contains a substantial amount of Out-Of-Distribution (OOD) data, leading to low efficiency of annotation in traditional AL methods. 
Furthermore, most existing AL methods start with random selection in the first query round, leading to a significant waste of labeling costs in open-set scenarios. To address these challenges, we propose OpenPath, a novel open-set active learning approach for pathological image classification leveraging a pre-trained Vision-Language Model (VLM). In the first query, we propose task-specific prompts that combine target and relevant non-target class prompts to effectively select In-Distribution (ID) and informative samples from the unlabeled pool. In subsequent queries, Diverse Informative ID Sampling (DIS) that includes Prototype-based ID candidate Selection (PIS) and Entropy-Guided Stochastic Sampling (EGSS) is proposed to ensure both purity and informativeness in a query, avoiding the selection of OOD samples. Experiments on two public pathology image datasets show that OpenPath significantly enhances the model's performance due to its high purity of selected samples, and outperforms several state-of-the-art open-set AL methods. The code is available at https://github.com/HiLab-git/OpenPath.
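
The abstract does not spell out how the entropy-guided sampling step works, so the sketch below is only a generic illustration of the underlying idea: score each candidate by the Shannon entropy of its predicted class probabilities and draw samples with probability proportional to that score. The function names, the epsilon floor, and the toy probabilities are assumptions for this example and should not be read as the OpenPath implementation.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weighted_sample(prob_rows, k, seed=0):
    """Draw up to k distinct indices, favouring high-entropy predictions."""
    rng = random.Random(seed)
    remaining = list(range(len(prob_rows)))
    # Small floor keeps the weights valid even if every prediction is certain.
    weights = [entropy(prob_rows[i]) + 1e-12 for i in remaining]
    chosen = []
    for _ in range(min(k, len(remaining))):
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
        weights.pop(pick)
    return chosen


# Hypothetical softmax outputs for four unlabeled candidates.
predictions = [
    [1.0, 0.0, 0.0],     # fully confident
    [0.5, 0.5, 0.0],     # torn between two classes
    [0.9, 0.1, 0.0],     # mostly confident
    [0.34, 0.33, 0.33],  # close to uniform
]
query = entropy_weighted_sample(predictions, k=2)
```

Sampling stochastically rather than always taking the top-entropy items is one simple way to keep a query batch from collapsing onto near-duplicate uncertain samples.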

[93] Open-World Object Counting in Videos

Niki Amini-Naieni,Andrew Zisserman

Main category: cs.CV

TL;DR: The paper introduces a new task of open-world object counting in videos, defines CountVid model for this task and VideoCount dataset for evaluation.

Details

Motivation: Object counting in videos is particularly challenging in crowded scenes with occlusions and similar objects. The need to avoid double counting and identify reappearances motivates the development of a specialized model. Method: Introduced CountVid model leverages an image-based counting model and a promptable video segmentation and tracking model for automated, open-world object counting across video frames. Result: CountVid provides accurate object counts and significantly outperforms strong baselines when evaluated using the VideoCount dataset. Conclusion: CountVid model successfully addresses the challenges of open-world object counting in videos. The VideoCount dataset, CountVid model, and all the code are publicly available. Abstract: We introduce a new task of open-world object counting in videos: given a text description, or an image example, that specifies the target object, the objective is to enumerate all the unique instances of the target objects in the video. This task is especially challenging in crowded scenes with occlusions and similar objects, where avoiding double counting and identifying reappearances is crucial. To this end, we make the following contributions: we introduce a model, CountVid, for this task. It leverages an image-based counting model, and a promptable video segmentation and tracking model to enable automated, open-world object counting across video frames. To evaluate its performance, we introduce VideoCount, a new dataset for our novel task built from the TAO and MOT20 tracking datasets, as well as from videos of penguins and metal alloy crystallization captured by x-rays. Using this dataset, we demonstrate that CountVid provides accurate object counts, and significantly outperforms strong baselines. The VideoCount dataset, the CountVid model, and all the code are available at https://github.com/niki-amini-naieni/CountVid/.
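
The double-counting problem can be illustrated with a toy example. Assuming a tracker has already assigned a persistent ID to each detected object (the per-frame ID lists below are made up, and `count_unique_instances` is a hypothetical helper, not part of CountVid), the video-level count is the number of distinct IDs rather than the sum of per-frame detections.

```python
def count_unique_instances(frames):
    """Count distinct track IDs across all frames of a video."""
    seen = set()
    for detections in frames:
        for track_id in detections:
            seen.add(track_id)
    return len(seen)


# Track IDs visible in each frame; object 1 leaves the view and reappears later.
frames = [[1, 2], [2, 3], [], [1, 3, 4]]
total = count_unique_instances(frames)          # distinct objects in the video
naive = sum(len(frame) for frame in frames)     # per-frame sum double counts
```

The gap between `total` and `naive` is exactly what makes reappearance handling important: an object that is re-detected under a new ID would inflate the count.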

[94] Unsupervised Pelage Pattern Unwrapping for Animal Re-identification

Aleksandr Algasov,Ekaterina Nepovinnykh,Fedor Zolotarev,Tuomas Eerola,Heikki Kälviäinen,Pavel Zemčík,Charles V. Stewart

Main category: cs.CV

TL;DR: The paper proposes a geometry-aware texture mapping approach to improve animal re-identification accuracy by unwrapping fur patterns into a canonical UV space, demonstrating improved results on seals and leopards.

Details

Motivation: To address the challenge of geometric distortions in animal fur or skin patterns caused by body movements and posture changes, which affects the accuracy of individual re-identification methods. Method: Propose a geometry-aware texture mapping approach that uses surface normal estimation to unwarp pelage patterns into a canonical UV space, preserving geometric consistency between 3D surface and 2D texture space. The method is integrated with existing re-identification techniques and can be trained self-supervised without ground truth UV annotations. Result: Experiments on seal and leopard datasets show an improvement of up to 5.4% in re-identification accuracy across diverse poses and viewing angles. Conclusion: The geometry-aware texture mapping approach enables more robust feature matching for animal re-identification, particularly for species with highly deformable fur patterns. Abstract: Existing individual re-identification methods often struggle with the deformable nature of animal fur or skin patterns which undergo geometric distortions due to body movement and posture changes. In this paper, we propose a geometry-aware texture mapping approach that unwarps pelage patterns, the unique markings found on an animal's skin or fur, into a canonical UV space, enabling more robust feature matching. Our method uses surface normal estimation to guide the unwrapping process while preserving the geometric consistency between the 3D surface and the 2D texture space. We focus on two challenging species: Saimaa ringed seals (Pusa hispida saimensis) and leopards (Panthera pardus). Both species have distinctive yet highly deformable fur patterns. By integrating our pattern-preserving UV mapping with existing re-identification techniques, we demonstrate improved accuracy across diverse poses and viewing angles. Our framework does not require ground truth UV annotations and can be trained in a self-supervised manner. 
Experiments on seal and leopard datasets show up to a 5.4% improvement in re-identification accuracy.

[95] When Model Knowledge meets Diffusion Model: Diffusion-assisted Data-free Image Synthesis with Alignment of Domain and Class

Yujin Kim,Hyunsoo Kim,Hyunwoo J. Kim,Suhyun Kim

Main category: cs.CV

TL;DR: The paper introduces DDIS, a Diffusion-assisted Data-free Image Synthesis method that improves synthetic image quality by leveraging a text-to-image diffusion model as an image prior. It includes Domain Alignment Guidance and Class Alignment Token for better alignment with the training data distribution, showing SOTA performance in data-free applications.

Details

Motivation: Existing DFIS methods produce samples that deviate from the training data distribution due to lack of prior knowledge about natural images when pre-trained model's training data is unavailable. Method: Propose DDIS which uses a text-to-image diffusion model as a powerful image prior, introducing Domain Alignment Guidance (DAG) and optimizing a single Class Alignment Token (CAT) embedding. Result: Experiments on PACS and ImageNet demonstrate that DDIS outperforms prior DFIS methods by generating samples that better reflect the training data distribution. Conclusion: DDIS achieves SOTA performance in data-free applications. Abstract: Open-source pre-trained models hold great potential for diverse applications, but their utility declines when their training data is unavailable. Data-Free Image Synthesis (DFIS) aims to generate images that approximate the learned data distribution of a pre-trained model without accessing the original data. However, existing DFIS methods produce samples that deviate from the training data distribution due to the lack of prior knowledge about natural images. To overcome this limitation, we propose DDIS, the first Diffusion-assisted Data-free Image Synthesis method that leverages a text-to-image diffusion model as a powerful image prior, improving synthetic image quality. DDIS extracts knowledge about the learned distribution from the given model and uses it to guide the diffusion model, enabling the generation of images that accurately align with the training data distribution. To achieve this, we introduce Domain Alignment Guidance (DAG) that aligns the synthetic data domain with the training data domain during the diffusion sampling process. Furthermore, we optimize a single Class Alignment Token (CAT) embedding to effectively capture class-specific attributes in the training dataset. 
Experiments on PACS and ImageNet demonstrate that DDIS outperforms prior DFIS methods by generating samples that better reflect the training data distribution, achieving SOTA performance in data-free applications.

[96] NERO: Explainable Out-of-Distribution Detection with Neuron-level Relevance

Anju Chhetri,Jari Korhonen,Prashnna Gyawali,Binod Bhattarai

Main category: cs.CV

TL;DR: In medical imaging, distinguishing out-of-distribution (OOD) samples is crucial for model reliability. This paper proposes NERO, a new OOD scoring method that uses neuron-level relevance to improve detection accuracy and explainability.

Details

Motivation: To enhance the reliability of deep learning models in medical imaging by improving the detection of out-of-distribution (OOD) samples. Method: NERO leverages neuron-level relevance at the feature layer, clusters this relevance for each in-distribution class to form centroids, and uses a relevance distance metric to measure deviation from these centroids. Scaled relevance in the bias term and feature norms are also incorporated. Result: Validated across multiple deep learning architectures on gastrointestinal imaging benchmarks (Kvasir and GastroVision), NERO shows improvements over state-of-the-art OOD detection methods. Conclusion: NERO provides a novel approach for OOD detection in medical imaging that enhances separability and offers explainability. Abstract: Ensuring reliability is paramount in deep learning, particularly within the domain of medical imaging, where diagnostic decisions often hinge on model outputs. The capacity to separate out-of-distribution (OOD) samples has proven to be a valuable indicator of a model's reliability in research. In medical imaging, this is especially critical, as identifying OOD inputs can help flag potential anomalies that might otherwise go undetected. While many OOD detection methods rely on feature or logit space representations, recent works suggest these approaches may not fully capture OOD diversity. To address this, we propose a novel OOD scoring mechanism, called NERO, that leverages neuron-level relevance at the feature layer. Specifically, we cluster neuron-level relevance for each in-distribution (ID) class to form representative centroids and introduce a relevance distance metric to quantify a new sample's deviation from these centroids, enhancing OOD separability. Additionally, we refine performance by incorporating scaled relevance in the bias term and combining feature norms. Our framework also enables explainable OOD detection. 
We validate its effectiveness across multiple deep learning architectures on the gastrointestinal imaging benchmarks Kvasir and GastroVision, achieving improvements over state-of-the-art OOD detection methods.
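
The centroid-and-distance idea in the NERO entry above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes per-sample relevance vectors are already available as plain arrays, uses a per-class mean as a simplified stand-in for the clustering step, and uses Euclidean distance in place of the paper's relevance distance metric. The function names `fit_class_centroids` and `ood_score` are hypothetical.

```python
import numpy as np


def fit_class_centroids(relevance, labels):
    """Return one centroid per in-distribution class.

    Assumption: the per-class mean stands in for the clustering
    described in the abstract.
    """
    relevance = np.asarray(relevance, dtype=float)
    labels = np.asarray(labels)
    return {c: relevance[labels == c].mean(axis=0) for c in np.unique(labels)}


def ood_score(sample_relevance, centroids):
    """Distance to the nearest class centroid; larger suggests more OOD.

    Assumption: Euclidean distance replaces the paper's relevance distance.
    """
    sample = np.asarray(sample_relevance, dtype=float)
    return min(float(np.linalg.norm(sample - mu)) for mu in centroids.values())


# Toy relevance vectors for two in-distribution classes.
train_relevance = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
train_labels = [0, 0, 1, 1]
centroids = fit_class_centroids(train_relevance, train_labels)

near_score = ood_score([0.0, 1.0], centroids)  # sits on the class-0 centroid
far_score = ood_score([5.0, 6.0], centroids)   # between both centroids
```

A sample far from every class centroid receives a larger score; a threshold on that score would then separate ID from OOD inputs.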

[97] Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material

Team Hunyuan3D,Shuhui Yang,Mingxin Yang,Yifei Feng,Xin Huang,Sheng Zhang,Zebin He,Di Luo,Haolin Liu,Yunfei Zhao,Qingxiang Lin,Zeqiang Lai,Xianghui Yang,Huiwen Shi,Zibo Zhao,Bowen Zhang,Hongyu Yan,Lifu Wang,Sicong Liu,Jihong Zhang,Meng Chen,Liang Dong,Yiwen Jia,Yulin Cai,Jiaao Yu,Yixuan Tang,Dongyuan Guo,Junlin Yu,Hao Zhang,Zheng Ye,Peng He,Runzhou Wu,Shida Wei,Chao Zhang,Yonghao Tan,Yifu Sun,Lin Niu,Shirui Huang,Bojian Zheng,Shu Liu,Shilin Chen,Xiang Yuan,Xiaofeng Yang,Kai Liu,Jianchen Zhu,Peng Chen,Tian Liu,Di Wang,Yuhong Liu,Linus,Jie Jiang,Jingwei Huang,Chunchao Guo

Main category: cs.CV

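TL;DR: Using the Hunyuan3D 2.1 system as a case study, this paper walks through the full pipeline for generating high-resolution 3D models, covering data processing, model training, and performance evaluation, with applications in gaming, virtual reality, and industrial design.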

Details

Motivation: Although breakthrough models have transformed 3D generation, the field remains largely limited to researchers, developers, and designers because of the complexity of collecting, processing, and training 3D models. Method: Using Hunyuan3D 2.1 as a case study, the paper provides a comprehensive guide covering 3D data processing, model training, and performance evaluation. The system consists of two core components: Hunyuan3D-DiT for shape generation and Hunyuan3D-Paint for texture synthesis. Result: Readers learn the entire workflow, including data preparation, model architecture, training strategies, evaluation metrics, and deployment. Conclusion: After this tutorial, readers will be able to fine-tune or develop a robust 3D generative model suited to gaming, virtual reality, and industrial design. Abstract: 3D AI-generated content (AIGC) is a passionate field that has significantly accelerated the creation of 3D models in gaming, film, and design. Despite the development of several groundbreaking models that have revolutionized 3D generation, the field remains largely accessible only to researchers, developers, and designers due to the complexities involved in collecting, processing, and training 3D models. To address these challenges, we introduce Hunyuan3D 2.1 as a case study in this tutorial. This tutorial offers a comprehensive, step-by-step guide on processing 3D data, training a 3D generative model, and evaluating its performance using Hunyuan3D 2.1, an advanced system for producing high-resolution, textured 3D assets. The system comprises two core components: the Hunyuan3D-DiT for shape generation and the Hunyuan3D-Paint for texture synthesis. We will explore the entire workflow, including data preparation, model architecture, training strategies, evaluation metrics, and deployment. By the conclusion of this tutorial, you will have the knowledge to finetune or develop a robust 3D generative model suitable for applications in gaming, virtual reality, and industrial design.

[98] Multimodal Large Language Models for Medical Report Generation via Customized Prompt Tuning

Chunlei Li,Jingyang Hou,Yilei Shi,Jingliang Hu,Xiao Xiang Zhu,Lichao Mou

Main category: cs.CV

TL;DR: The paper introduces MRG-LLM, a multimodal large language model that combines a frozen LLM with a learnable visual encoder and uses dynamic prompt customization for medical report generation from imaging data. It achieves state-of-the-art performance in experiments on IU X-ray and MIMIC-CXR datasets.

Details

Motivation: Medical report generation from imaging data is a challenging task in clinical practice, and while large language models show promise, their integration with medical imaging data needs further exploration. Method: The method involves creating MRG-LLM which integrates a frozen LLM with a learnable visual encoder. A dynamic prompt customization mechanism is introduced using conditional affine transformations derived from visual features, with two implementations: prompt-wise and promptbook-wise customization. Result: Extensive experiments on the IU X-ray and MIMIC-CXR datasets demonstrate that MRG-LLM achieves state-of-the-art performance in medical report generation. Conclusion: MRG-LLM successfully addresses the challenge of medical report generation from imaging data, achieving top performance in relevant datasets, and the code will be publicly available. Abstract: Medical report generation from imaging data remains a challenging task in clinical practice. While large language models (LLMs) show great promise in addressing this challenge, their effective integration with medical imaging data still deserves in-depth exploration. In this paper, we present MRG-LLM, a novel multimodal large language model (MLLM) that combines a frozen LLM with a learnable visual encoder and introduces a dynamic prompt customization mechanism. Our key innovation lies in generating instance-specific prompts tailored to individual medical images through conditional affine transformations derived from visual features. We propose two implementations: prompt-wise and promptbook-wise customization, enabling precise and targeted report generation. Extensive experiments on IU X-ray and MIMIC-CXR datasets demonstrate that MRG-LLM achieves state-of-the-art performance in medical report generation. Our code will be made publicly available.

[99] GenHOI: Generalizing Text-driven 4D Human-Object Interaction Synthesis for Unseen Objects

Shujia Li,Haiyu Zhang,Xinyuan Chen,Yaohui Wang,Yutong Ban

Main category: cs.CV

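TL;DR: This work proposes GenHOI, a two-stage framework for 4D human-object interaction (HOI) synthesis. Using Object-AnchorNet and a Contact-Aware Diffusion Model (ContactDM), it generalizes well to unseen objects and generates high-quality 4D HOI sequences. Experiments on public datasets show state-of-the-art results.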

Details

Motivation: Although diffusion models and large-scale motion datasets have advanced text-driven human motion synthesis, extending them to 4D human-object interaction (HOI) remains challenging, mainly because large-scale 4D HOI datasets are lacking. Method: The study introduces GenHOI, a novel two-stage framework. The first stage uses Object-AnchorNet, which learns from 3D HOI datasets, to reconstruct sparse 3D HOI keyframes for unseen objects. The second stage uses a Contact-Aware Diffusion Model (ContactDM) to interpolate the sparse 3D HOI keyframes into temporally coherent 4D HOI sequences, and proposes a Contact-Aware Encoder and Contact-Aware HOI Attention to improve generation quality. Result: Experiments show state-of-the-art results on the OMOMO and 3D-FUTURE datasets, demonstrating strong generalization to unseen objects and high-fidelity 4D HOI generation. Conclusion: GenHOI effectively reduces the dependence on large-scale 4D HOI datasets, generalizing well to unseen objects and generating high-quality 4D HOI sequences. Abstract: While diffusion models and large-scale motion datasets have advanced text-driven human motion synthesis, extending these advances to 4D human-object interaction (HOI) remains challenging, mainly due to the limited availability of large-scale 4D HOI datasets. In our study, we introduce GenHOI, a novel two-stage framework aimed at achieving two key objectives: 1) generalization to unseen objects and 2) the synthesis of high-fidelity 4D HOI sequences. In the initial stage of our framework, we employ an Object-AnchorNet to reconstruct sparse 3D HOI keyframes for unseen objects, learning solely from 3D HOI datasets, thereby mitigating the dependence on large-scale 4D HOI datasets. Subsequently, we introduce a Contact-Aware Diffusion Model (ContactDM) in the second stage to seamlessly interpolate sparse 3D HOI keyframes into densely temporally coherent 4D HOI sequences. To enhance the quality of generated 4D HOI sequences, we propose a novel Contact-Aware Encoder within ContactDM to extract human-object contact patterns and a novel Contact-Aware HOI Attention to effectively integrate the contact signals into diffusion models. Experimental results show that we achieve state-of-the-art results on the publicly available OMOMO and 3D-FUTURE datasets, demonstrating strong generalization abilities to unseen objects, while enabling high-fidelity 4D HOI generation.

[100] NTIRE 2025 Image Shadow Removal Challenge Report

Florin-Alexandru Vasluianu,Tim Seizinger,Zhuyun Zhou,Cailian Chen,Zongwei Wu,Radu Timofte,Mingjia Li,Jin Hu,Hainuo Wang,Hengxing Liu,Jiarui Wang,Qiming Hu,Xiaojie Guo,Xin Lu,Jiarong Yang,Yuanfei Bao,Anya Hu,Zihao Fan,Kunyu Wang,Jie Xiao,Xi Wang,Xueyang Fu,Zheng-Jun Zha,Yu-Fan Lin,Chia-Ming Lee,Chih-Chung Hsu,Xingbo Wang,Dong Li,Yuxu Chen,Bin Chen,Yuanbo Zhou,Yuanbin Chen,Hongwei Wang,Jiannan Lin,Qinquan Gao,Tong Tong,Zhao Zhang,Yanyan Wei,Wei Dong,Han Zhou,Seyed Amirreza Mousavi,Jun Chen,Haobo Liang,Jiajie Jing,Junyu Li,Yan Yang,Seoyeon Lee,Chaewon Kim,Ziyu Feng,Shidi Chen,Bowen Luan,Zewen Chen,Vijayalaxmi Ashok Aralikatti,G Gyaneshwar Rao,Nikhil Akalwadi,Chaitra Desai,Ramesh Ashok Tabib,Uma Mudenagudi,Anas M. Ali,Bilel Benjdira,Wadii Boulila,Alexandru Brateanu,Cosmin Ancuti,Tanmay Chaturvedi,Manish Kumar,Anmol Srivastav,Daksh Trivedi,Shashwat Thakur,Kishor Upla,Zeyu Xiao,Zhuoyuan Li,Boda Zhou,Shashank Shekhar,Kele Xu,Qisheng Xu,Zijian Gao,Tianjiao Wan,Suiyi Zhao,Bo Wang,Yan Luo,Mingshen Wang,Yilin Zhang

Main category: cs.CV

TL;DR: This paper explores the results of the NTIRE 2025 Shadow Removal Challenge which had two evaluation tracks and used images from the WSRD+ dataset.

Details

Motivation: To further develop techniques for shadow removal by evaluating different methods through a challenge with diverse objects, textures, and materials. Method: Conducting a challenge with two evaluation tracks - one focusing on reconstruction fidelity and the other on visual perception via user study. Result: A total of 306 participants registered and 17 teams successfully submitted their solutions. The methods were evaluated using the WSRD+ dataset. Conclusion: The NTIRE 2025 Shadow Removal Challenge provided valuable insights into effective shadow removal techniques. Abstract: This work examines the findings of the NTIRE 2025 Shadow Removal Challenge. A total of 306 participants have registered, with 17 teams successfully submitting their solutions during the final evaluation phase. Following the last two editions, this challenge had two evaluation tracks: one focusing on reconstruction fidelity and the other on visual perception through a user study. Both tracks were evaluated with images from the WSRD+ dataset, simulating interactions between self- and cast-shadows with a large number of diverse objects, textures, and materials.

[101] CLAIM: Clinically-Guided LGE Augmentation for Realistic and Diverse Myocardial Scar Synthesis and Segmentation

Farheen Ramzan,Yusuf Kiberu,Nikesh Jathanna,Shahnaz Jamil-Copley,Richard H. Clayton,Chen Chen

Main category: cs.CV

TL;DR: The paper presents CLAIM, a framework for myocardial scar synthesis and segmentation using clinically-guided LGE augmentation. It introduces the SMILE module for anatomically consistent scar generation and uses joint training to improve both synthesis realism and segmentation accuracy.

Details

Motivation: Deep learning-based myocardial scar segmentation from LGE cardiac MRI has great potential for diagnosis and treatment planning but is restricted by limited availability and variability of high-quality images with scar labels. Method: The CLAIM framework includes the SMILE module, which generates scar masks guided by clinical knowledge based on the AHA 17-segment model. A joint training strategy optimizes the scar segmentation network alongside the generator for enhanced realism and accuracy. Result: Experimental results indicate that CLAIM produces anatomically coherent scar patterns with higher Dice similarity to real scar distributions compared to baseline models. Conclusion: CLAIM enables controllable and realistic myocardial scar synthesis which can be useful for downstream medical imaging tasks. Abstract: Deep learning-based myocardial scar segmentation from late gadolinium enhancement (LGE) cardiac MRI has shown great potential for accurate and timely diagnosis and treatment planning for structural cardiac diseases. However, the limited availability and variability of LGE images with high-quality scar labels restrict the development of robust segmentation models. To address this, we introduce CLAIM (Clinically-Guided LGE Augmentation for Realistic and Diverse Myocardial Scar Synthesis and Segmentation), a framework for anatomically grounded scar generation and segmentation. At its core is the SMILE module (Scar Mask generation guided by cLinical knowledgE), which conditions a diffusion-based generator on the clinically adopted AHA 17-segment model to synthesize images with anatomically consistent and spatially diverse scar patterns. In addition, CLAIM employs a joint training strategy in which the scar segmentation network is optimized alongside the generator, aiming to enhance both the realism of synthesized scars and the accuracy of the scar segmentation performance. 
Experimental results show that CLAIM produces anatomically coherent scar patterns and achieves higher Dice similarity with real scar distributions compared to baseline models. Our approach enables controllable and realistic myocardial scar synthesis and has demonstrated utility for downstream medical imaging tasks.
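
For readers unfamiliar with the Dice similarity reported in the CLAIM entry above, a minimal sketch of the standard Dice coefficient between two binary masks follows. It shows only the generic metric, 2|A ∩ B| / (|A| + |B|), not the paper's evaluation protocol. The function name `dice_coefficient` and the convention of returning 1.0 for two empty masks are assumptions made for this illustration.

```python
import numpy as np


def dice_coefficient(pred_mask, true_mask):
    """Dice similarity between two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    total = pred.sum() + true.sum()
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return float(2.0 * intersection / total)


# Toy 2x2 masks: one overlapping pixel, three foreground pixels in total.
predicted = [[1, 1], [0, 0]]
reference = [[1, 0], [0, 0]]
score = dice_coefficient(predicted, reference)
```

The score ranges from 0 (no overlap) to 1 (identical masks), so higher values indicate closer agreement with the reference scar mask.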

[102] RaCalNet: Radar Calibration Network for Sparse-Supervised Metric Depth Estimation

Xingrui Qin,Wentao Zhao,Chuan Cao,Yihe Niu,Houcheng Jiang,Jingchuan Wang

Main category: cs.CV

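TL;DR: RaCalNet proposes a new framework that learns refined radar measurements using only sparse LiDAR supervision, eliminating the need for dense supervision. It performs strongly on the ZJU-4DRadarCam dataset and in real-world deployment scenarios, substantially reducing RMSE.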

Details

Motivation: Dense metric depth estimation typically requires dense LiDAR supervision, which is costly and data-intensive. The authors therefore aim for a method that does not need dense supervision. Method: RaCalNet first recalibrates and refines sparse radar points to build accurate depth priors, then uses these priors as reliable anchors to guide monocular depth prediction. It relies on sparse LiDAR supervision, with a supervision density of only about 1%. Result: Despite relying only on sparse supervision, RaCalNet surpasses state-of-the-art dense-supervised methods, producing depth maps with clear object contours and fine-grained textures. It reduces RMSE by 35.30% on the ZJU-4DRadarCam dataset and by 34.89% in real-world deployment scenarios. Conclusion: By using sparse LiDAR supervision to learn refined radar measurements, RaCalNet eliminates the need for dense supervision and demonstrates its effectiveness across multiple experiments. Abstract: Dense metric depth estimation using millimeter-wave radar typically requires dense LiDAR supervision, generated via multi-frame projection and interpolation, to guide the learning of accurate depth from sparse radar measurements and RGB images. However, this paradigm is both costly and data-intensive. To address this, we propose RaCalNet, a novel framework that eliminates the need for dense supervision by using sparse LiDAR to supervise the learning of refined radar measurements, resulting in a supervision density of merely around 1% compared to dense-supervised methods. Unlike previous approaches that associate radar points with broad image regions and rely heavily on dense labels, RaCalNet first recalibrates and refines sparse radar points to construct accurate depth priors. These priors then serve as reliable anchors to guide monocular depth prediction, enabling metric-scale estimation without resorting to dense supervision. This design improves structural consistency and preserves fine details. Despite relying solely on sparse supervision, RaCalNet surpasses state-of-the-art dense-supervised methods, producing depth maps with clear object contours and fine-grained textures. Extensive experiments on the ZJU-4DRadarCam dataset and real-world deployment scenarios demonstrate its effectiveness, reducing RMSE by 35.30% and 34.89%, respectively.
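
As a rough illustration of how a depth RMSE can be computed against sparse ground truth, and how a relative reduction such as the figures quoted in the RaCalNet entry above is expressed, consider the sketch below. It is not taken from the paper: the convention that a zero in the reference map marks a pixel without a LiDAR return is an assumption, and the names `masked_rmse` and `relative_reduction` are hypothetical.

```python
import numpy as np


def masked_rmse(pred_depth, sparse_depth):
    """RMSE over pixels that have a reference depth.

    Assumption: zeros in `sparse_depth` mark pixels with no measurement.
    """
    pred = np.asarray(pred_depth, dtype=float)
    ref = np.asarray(sparse_depth, dtype=float)
    valid = ref > 0
    if not valid.any():
        raise ValueError("reference depth map has no valid pixels")
    return float(np.sqrt(np.mean((pred[valid] - ref[valid]) ** 2)))


def relative_reduction(baseline_error, new_error):
    """Percentage by which `new_error` is lower than `baseline_error`."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Toy 2x2 maps in metres: only two pixels carry a reference depth.
predicted = [[1.0, 2.0], [3.0, 4.0]]
reference = [[0.0, 2.0], [0.0, 6.0]]
error = masked_rmse(predicted, reference)
```

Restricting the error to valid pixels is what lets a sparse reference map be used at all, since most pixels carry no measurement.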

[103] Control and Realism: Best of Both Worlds in Layout-to-Image without Training

Bonan Li,Yinhan Hu,Songhua Liu,Xinchao Wang

Main category: cs.CV

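TL;DR: WinWinLay is a training-free method that improves the precision and realism of layout-to-image generation through a non-local attention energy function and an adaptive update strategy.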

Details

Motivation: Existing text-to-image diffusion models suffer from imprecise localization and unrealistic artifacts in layout-to-image generation. Method: The paper proposes a non-local attention energy function that redistributes attention scores so that objects better conform to the specified spatial conditions, and introduces a Langevin dynamics-based adaptive update scheme that promotes in-domain updating while respecting layout constraints. Result: Extensive experiments show that WinWinLay excels at controlling element placement and achieving photorealistic visual fidelity, outperforming current state-of-the-art methods. Conclusion: WinWinLay effectively improves the precision and realism of layout-to-image generation without training on any specific data. Abstract: Layout-to-Image generation aims to create complex scenes with precise control over the placement and arrangement of subjects. Existing works have demonstrated that pre-trained Text-to-Image diffusion models can achieve this goal without training on any specific data; however, they often face challenges with imprecise localization and unrealistic artifacts. Focusing on these drawbacks, we propose a novel training-free method, WinWinLay. At its core, WinWinLay presents two key strategies, Non-local Attention Energy Function and Adaptive Update, that collaboratively enhance control precision and realism. On one hand, we theoretically demonstrate that the commonly used attention energy function introduces inherent spatial distribution biases, hindering objects from being uniformly aligned with layout instructions. To overcome this issue, non-local attention prior is explored to redistribute attention scores, facilitating objects to better conform to the specified spatial conditions. On the other hand, we identify that the vanilla backpropagation update rule can cause deviations from the pre-trained domain, leading to out-of-distribution artifacts. We accordingly introduce a Langevin dynamics-based adaptive update scheme as a remedy that promotes in-domain updating while respecting layout constraints. Extensive experiments demonstrate that WinWinLay excels in controlling element placement and achieving photorealistic visual fidelity, outperforming the current state-of-the-art methods.
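
The WinWinLay entry above mentions a Langevin dynamics-based update. As background only, the sketch below shows the textbook unadjusted Langevin step, x ← x − η∇E(x) + √(2η)·ξ with ξ drawn from a standard normal, applied to a toy quadratic energy. It is not WinWinLay's adaptive scheme, whose details the abstract does not give. The function name `langevin_step`, the `noise_scale` parameter, and the toy energy are assumptions for illustration.

```python
import numpy as np


def langevin_step(x, grad_energy, step_size, rng, noise_scale=1.0):
    """One unadjusted Langevin update on the energy whose gradient is given.

    `noise_scale` is a hypothetical knob: 1.0 gives the textbook update,
    0.0 reduces the step to plain gradient descent.
    """
    x = np.asarray(x, dtype=float)
    noise = rng.standard_normal(x.shape)
    drift = -step_size * grad_energy(x)
    diffusion = noise_scale * np.sqrt(2.0 * step_size) * noise
    return x + drift + diffusion


def toy_grad(x):
    """Gradient of the toy energy E(x) = 0.5 * ||x||^2 (an assumption)."""
    return x


rng = np.random.default_rng(0)
start = np.array([2.0, -4.0])
descent_only = langevin_step(start, toy_grad, 0.1, rng, noise_scale=0.0)
with_noise = langevin_step(start, toy_grad, 0.1, rng)
```

With the noise switched off, the step is ordinary gradient descent on the energy; the injected noise is what distinguishes Langevin sampling from pure optimization.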

[104] Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie,Zhenheng Yang,Mike Zheng Shou

Main category: cs.CV

TL;DR: This paper introduces Show-o2, an improved multimodal model using autoregressive modeling and flow matching for effective multimodal understanding and generation across text, images, and videos.

Details

Motivation: To create a scalable and versatile unified multimodal model that can effectively handle tasks involving text, images, and videos. Method: The model is built on a 3D causal variational autoencoder space with dual-path spatial(-temporal) fusion for constructing unified visual representations. Autoregressive modeling is applied to the language head for text token prediction, while flow matching is used in the flow head for image/video generation. A two-stage training approach is employed to facilitate learning and scaling. Result: Show-o2 models exhibit versatility in handling a wide array of multimodal understanding and generation tasks across diverse modalities. Conclusion: The Show-o2 models successfully demonstrate scalability and effectiveness in multimodal tasks, with code and models being publicly released. Abstract: This paper presents improved native unified multimodal models, i.e., Show-o2, that leverage autoregressive modeling and flow matching. Built upon a 3D causal variational autoencoder space, unified visual representations are constructed through a dual-path of spatial (-temporal) fusion, enabling scalability across image and video modalities while ensuring effective multimodal understanding and generation. Based on a language model, autoregressive modeling and flow matching are natively applied to the language head and flow head, respectively, to facilitate text token prediction and image/video generation. A two-stage training recipe is designed to effectively learn and scale to larger models. The resulting Show-o2 models demonstrate versatility in handling a wide range of multimodal understanding and generation tasks across diverse modalities, including text, images, and videos. Code and models are released at https://github.com/showlab/Show-o.

[105] Baltimore Atlas: FreqWeaver Adapter for Semi-supervised Ultra-high Spatial Resolution Land Cover Classification

Junhao Wu,Aboagye-Ntow Stephen,Chuyuan Wang,Gang Chen,Xin Huang

Main category: cs.CV

TL;DR: This paper presents a parameter-efficient semi-supervised segmentation framework for ultra-high (0.3 m) spatial resolution land cover classification imagery, improving upon current methods with better structural consistency and less reliance on annotated data.

Details

Motivation: Existing methods for high-resolution land cover classification focus on 1-meter resolution imagery and depend heavily on annotated data, which is impractical for higher-resolution imagery under weak supervision. Method: The authors propose a semi-supervised segmentation framework for 0.3 m spatial resolution imagery that uses the knowledge of SAM2 and incorporates a remote sensing-specific FreqWeaver Adapter to enhance fine-grained detail modeling while keeping a lightweight design. Result: The proposed method achieves robust segmentation results with superior structural consistency, showing a 1.78% improvement over existing parameter-efficient tuning strategies and a 3.44% gain compared to state-of-the-art high-resolution remote sensing segmentation approaches. Conclusion: The parameter-efficient semi-supervised segmentation framework effectively leverages unlabeled data and maintains minimal parameter overhead, offering an advancement in ultra-high spatial resolution land cover classification. Abstract: Ultra-high Spatial Resolution Land Cover Classification is essential for fine-grained land cover analysis, yet it remains challenging due to the high cost of pixel-level annotations, significant scale variation, and the limited adaptability of large-scale vision models. Existing methods typically focus on 1-meter spatial resolution imagery and rely heavily on annotated data, whereas practical applications often require processing higher-resolution imagery under weak supervision. To address this, we propose a parameter-efficient semi-supervised segmentation framework for 0.3 m spatial resolution imagery, which leverages the knowledge of SAM2 and introduces a remote sensing-specific FreqWeaver Adapter to enhance fine-grained detail modeling while maintaining a lightweight design at only 5.96% of the total model parameters. 
By effectively leveraging unlabeled data and maintaining minimal parameter overhead, the proposed method delivers robust segmentation results with superior structural consistency, achieving a 1.78% improvement over existing parameter-efficient tuning strategies and a 3.44% gain compared to state-of-the-art high-resolution remote sensing segmentation approaches.

[106] A Unified Graph-based Framework for Scalable 3D Tree Reconstruction and Non-Destructive Biomass Estimation from Point Clouds

Di Wang,Shi Li

Main category: cs.CV

TL;DR: This paper presents a novel unified framework for estimating forest above-ground biomass (AGB) through 3D tree structural reconstruction using an innovative graph-based pipeline, which enables end-to-end processing of large-scale point clouds and significantly reduces dependency on specialized pre-processing tools.

Details

Motivation: Estimating AGB is crucial for assessing carbon storage and supporting sustainable forest management. Current QSM methods face limitations as they are designed for individual trees, depend on high-quality TLS data, and require multiple pre-processing steps. Method: The study proposes a graph-based pipeline that integrates tree segmentation, leaf-wood separation, and 3D skeletal reconstruction through dedicated graph operations including pathing and abstracting for tree topology reasoning. This approach allows end-to-end processing of large-scale point clouds. Result: Experimental results show strong performance under challenging conditions with ~20% relative error in leaf-on scenarios and ~30% relative error in low-density ULS datasets with partial coverage. Conclusion: The proposed framework provides a robust and scalable solution for large-scale, non-destructive AGB estimation, reducing dependency on specialized pre-processing tools and establishing ULS as a viable alternative to TLS. Abstract: Estimating forest above-ground biomass (AGB) is crucial for assessing carbon storage and supporting sustainable forest management. Quantitative Structural Model (QSM) offers a non-destructive approach to AGB estimation through 3D tree structural reconstruction. However, current QSM methods face significant limitations, as they are primarily designed for individual trees, depend on high-quality point cloud data from terrestrial laser scanning (TLS), and also require multiple pre-processing steps that hinder scalability and practical deployment. This study presents a novel unified framework that enables end-to-end processing of large-scale point clouds using an innovative graph-based pipeline. The proposed approach seamlessly integrates tree segmentation, leaf-wood separation and 3D skeletal reconstruction through dedicated graph operations including pathing and abstracting for tree topology reasoning. 
Comprehensive validation was conducted on datasets with varying leaf conditions (leaf-on and leaf-off), spatial scales (tree- and plot-level), and data sources (TLS and UAV-based laser scanning, ULS). Experimental results demonstrate strong performance under challenging conditions, particularly in leaf-on scenarios (~20% relative error) and low-density ULS datasets with partial coverage (~30% relative error). These findings indicate that the proposed framework provides a robust and scalable solution for large-scale, non-destructive AGB estimation. It significantly reduces dependency on specialized pre-processing tools and establishes ULS as a viable alternative to TLS. To our knowledge, this is the first method capable of enabling seamless, end-to-end 3D tree reconstruction at operational scales. This advancement substantially improves the feasibility of QSM-based AGB estimation, paving the way for broader applications in forest inventory and climate change research.

[107] One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution

Yujing Sun,Lingchen Sun,Shuaizheng Liu,Rongyuan Wu,Zhengqiang Zhang,Lei Zhang

Main category: cs.CV

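TL;DR: DLoRAL is a video super-resolution method built on a dual-LoRA learning paradigm. Using a Cross-Frame Retrieval module together with consistency-enhancing and detail-enhancing LoRA branches, it restores high-quality video that is both temporally consistent and rich in detail.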

Details

Motivation: Existing Stable Diffusion (SD)-based video super-resolution methods often sacrifice spatial detail to preserve temporal consistency on real-world footage, which leads to suboptimal visual quality. A method is needed that enhances video detail while keeping temporal consistency. Method: The Dual LoRA Learning (DLoRAL) paradigm has two stages: (1) a Cross-Frame Retrieval (CFR) module aggregates information across frames, and a Consistency-LoRA (C-LoRA) learns robust temporal representations; (2) with the CFR and C-LoRA modules frozen, a Detail-LoRA (D-LoRA) enhances spatial detail while aligning with the temporal space defined by C-LoRA to stay consistent. The two stages alternate iteratively, and the two LoRA branches are finally merged into the SD model, giving efficient, high-quality video restoration in a single diffusion step. Result: Experiments show that DLoRAL performs strongly in both accuracy and speed and produces temporally consistent, detail-rich video. Code and models are publicly released. Conclusion: By combining temporal consistency with spatial detail enhancement, DLoRAL addresses the trade-off between temporal and spatial quality in existing video super-resolution methods and offers a new approach to efficient, high-quality video restoration. Abstract: It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic detail synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. 
During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.
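
The merging of LoRA branches at inference can be illustrated with the standard LoRA identity W' = W + s·B·A, where B and A are the low-rank factors and s is a scaling factor. The following is a minimal sketch under stated assumptions, not the DLoRAL implementation: it assumes two independent branches are added to one frozen weight matrix, and the names `merge_lora`, `c_branch`, and `d_branch` and all numeric values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(weight, branches):
    """Fold LoRA branches into a copy of a frozen weight matrix.

    Each branch is a tuple (B, A, scale); its update is scale * (B @ A).
    """
    merged = [row[:] for row in weight]  # leave the frozen weight untouched
    for b, a, scale in branches:
        delta = matmul(b, a)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += scale * value
    return merged


def apply_linear(weight, x):
    """Apply a weight matrix to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# Hypothetical rank-1 branches on a 2x2 layer (values chosen for illustration).
frozen = [[1.0, 0.0], [0.0, 1.0]]
c_branch = ([[1.0], [0.0]], [[0.0, 2.0]], 0.5)   # stands in for a consistency branch
d_branch = ([[0.0], [1.0]], [[4.0, 0.0]], 0.25)  # stands in for a detail branch

merged = merge_lora(frozen, [c_branch, d_branch])
print(merged)                            # [[1.0, 1.0], [1.0, 1.0]]
print(apply_linear(merged, [3.0, 5.0]))  # [8.0, 8.0]
```

After merging, a single matrix multiply replaces the base layer plus two adapter paths, which is why merged LoRA weights add no inference cost over the base model.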

Kyobin Choo,Hyunkyung Han,Jinyeong Kim,Chanyong Yoon,Seong Jae Hwang

Main category: cs.CV

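TL;DR: Proposes the M2M-Reg framework and the GradCyCon regularizer for registering highly heterogeneous imaging modalities; experiments show it outperforms prior methods on PET-MRI and FA-MRI registration.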

Details

Motivation: In clinical practice, functional imaging (e.g., PET, FA) must be aligned with a structural reference (e.g., MRI, CT) for accurate interpretation or group analysis, but conventional unsupervised deformable image registration (DIR) methods struggle to learn reliable mappings and tend to distort images. Method: The M2M-Reg framework trains multi-modal DIR models using only mono-modal similarity, and the GradCyCon regularizer promotes diffeomorphism. The framework extends naturally to a semi-supervised setting without requiring ground-truth transformations or segmentation masks. Result: Experiments on the ADNI dataset show that M2M-Reg achieves up to 2x higher DSC than prior methods on PET-MRI and FA-MRI registration. Conclusion: M2M-Reg effectively handles highly heterogeneous multi-modal DIR and shows promise for practical use. Abstract: In clinical practice, imaging modalities with functional characteristics, such as positron emission tomography (PET) and fractional anisotropy (FA), are often aligned with a structural reference (e.g., MRI, CT) for accurate interpretation or group analysis, necessitating multi-modal deformable image registration (DIR). However, due to the extreme heterogeneity of these modalities compared to standard structural scans, conventional unsupervised DIR methods struggle to learn reliable spatial mappings and often distort images. We find that the similarity metrics guiding these models fail to capture alignment between highly disparate modalities. To address this, we propose M2M-Reg (Multi-to-Mono Registration), a novel framework that trains multi-modal DIR models using only mono-modal similarity while preserving the established architectural paradigm for seamless integration into existing models. We also introduce GradCyCon, a regularizer that leverages M2M-Reg's cyclic training scheme to promote diffeomorphism. Furthermore, our framework naturally extends to a semi-supervised setting, integrating pre-aligned and unaligned pairs only, without requiring ground-truth transformations or segmentation masks. Experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset demonstrate that M2M-Reg achieves up to 2x higher DSC than prior methods for PET-MRI and FA-MRI registration, highlighting its effectiveness in handling highly heterogeneous multi-modal DIR. Our code is available at https://github.com/MICV-yonsei/M2M-Reg.
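
For readers unfamiliar with the DSC metric reported above, the Dice similarity coefficient between two binary masks X and Y is 2|X ∩ Y| / (|X| + |Y|). The sketch below is only an illustration of the metric on flattened masks; the function name `dice` and the example masks are hypothetical, and the paper's exact evaluation protocol (label sets, averaging) is not specified here.

```python
def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks.

    Masks are flat sequences of 0/1 values of equal length.
    Two empty masks are treated as a perfect match (a common convention).
    """
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical example: a fixed-image structure mask and a warped moving mask.
fixed_mask = [0, 1, 1, 1, 0, 0]
warped_mask = [0, 0, 1, 1, 1, 0]
print(round(dice(fixed_mask, warped_mask), 3))  # 0.667
```

A score of 1.0 means the warped structure overlaps the reference exactly, and 0.0 means no overlap at all.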

[109] BoxFusion: Reconstruction-Free Open-Vocabulary 3D Object Detection via Real-Time Multi-View Box Fusion

Yuqing Lan,Chenyang Zhu,Zhirui Gao,Jiazhao Zhang,Yihan Cao,Renjiao Yi,Yijie Wang,Kai Xu

Main category: cs.CV

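TL;DR: Proposes a reconstruction-free online framework for efficient 3D object detection that combines Cubify Anything with CLIP and fuses multi-view detections through association and optimization modules; experiments show strong performance and broad applicability.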

Details

Motivation: Existing 3D object detection methods rely on dense point cloud reconstruction, which brings heavy computational overhead and tight memory constraints and makes real-time use difficult. An efficient, real-time 3D detection framework that needs no reconstruction is required. Method: Cubify Anything serves as a pre-trained visual foundation model for single-view 3D object detection, combined with CLIP to capture open-vocabulary semantics. Multi-view detections are fused by an association module (3D NMS and box correspondence matching) and an optimization module (an IoU-guided random optimization technique based on particle filtering). Result: Extensive experiments on the ScanNetV2 and CA-1M datasets show that the method achieves state-of-the-art performance among online methods, generalizes well, and supports real-time perception even in environments exceeding 1000 square meters. Conclusion: The proposed reconstruction-free online framework delivers efficient, real-time 3D object detection with good generalization and practical potential. Abstract: Open-vocabulary 3D object detection has gained significant interest due to its critical applications in autonomous driving and embodied AI. Existing detection methods, whether offline or online, typically rely on dense point cloud reconstruction, which imposes substantial computational overhead and memory constraints, hindering real-time deployment in downstream tasks. To address this, we propose a novel reconstruction-free online framework tailored for memory-efficient and real-time 3D detection. Specifically, given streaming posed RGB-D video input, we leverage Cubify Anything as a pre-trained visual foundation model (VFM) for single-view 3D object detection by bounding boxes, coupled with CLIP to capture open-vocabulary semantics of detected objects. To fuse all detected bounding boxes across different views into a unified one, we employ an association module for correspondences of multi-views and an optimization module to fuse the 3D bounding boxes of the same instance predicted in multi-views. The association module utilizes 3D Non-Maximum Suppression (NMS) and a box correspondence matching module, while the optimization module uses an IoU-guided efficient random optimization technique based on particle filtering to enforce multi-view consistency of the 3D bounding boxes while minimizing computational complexity. Extensive experiments on ScanNetV2 and CA-1M datasets demonstrate that our method achieves state-of-the-art performance among online methods. 
Benefiting from this novel reconstruction-free paradigm for 3D object detection, our method exhibits great generalization abilities in various scenarios, enabling real-time perception even in environments exceeding 1000 square meters.
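
The 3D Non-Maximum Suppression step mentioned in the association module can be sketched generically as follows. This is an illustration of greedy NMS with a volumetric IoU, not the BoxFusion code: it assumes axis-aligned boxes encoded as (xmin, ymin, zmin, xmax, ymax, zmax), whereas the paper's boxes may be oriented, and the names `iou_3d` and `nms_3d`, the threshold, and the example boxes are hypothetical.

```python
def iou_3d(box_a, box_b):
    """Volumetric IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    intersection = 1.0
    for axis in range(3):
        overlap = min(box_a[axis + 3], box_b[axis + 3]) - max(box_a[axis], box_b[axis])
        if overlap <= 0:
            return 0.0  # disjoint along this axis, so no shared volume
        intersection *= overlap

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return intersection / (volume(box_a) + volume(box_b) - intersection)


def nms_3d(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping a kept one."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou_3d(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical detections of the same scene from several views (metres).
boxes = [
    (0.0, 0.0, 0.0, 2.0, 2.0, 2.0),  # chair seen in view 1
    (0.5, 0.0, 0.0, 2.5, 2.0, 2.0),  # same chair, shifted estimate from view 2
    (5.0, 5.0, 5.0, 6.0, 6.0, 6.0),  # a different object
]
scores = [0.9, 0.8, 0.7]
print(nms_3d(boxes, scores))  # [0, 2]
```

Here the first two boxes have an IoU of 0.6, so the lower-scoring duplicate is suppressed while the distant third box survives.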

[110] HOIDiNi: Human-Object Interaction through Diffusion Noise Optimization

Roey Ron,Guy Tevet,Haim Sawdayee,Amit H. Bermano

Main category: cs.CV

TL;DR: The paper introduces HOIDiNi, a text-driven diffusion framework for generating realistic human-object interactions (HOIs), which uses Diffusion Noise Optimization (DNO) to balance realism and physical correctness.

Details

Motivation: Current HOI generation methods struggle to balance realism and physical correctness. Method: HOIDiNi optimizes in the noise space of a pretrained diffusion model using DNO, separating the process into an object-centric phase for hand-object contact choices and a human-centric phase for refining full-body motion. Result: HOIDiNi outperforms prior works in contact accuracy, physical validity, and overall quality on the GRAB dataset, demonstrating the ability to generate complex interactions like grasping, placing, and full-body coordination via textual prompts. Conclusion: HOIDiNi successfully generates realistic and plausible HOIs while maintaining both realism and physical correctness. Abstract: We present HOIDiNi, a text-driven diffusion framework for synthesizing realistic and plausible human-object interaction (HOI). HOI generation is extremely challenging since it induces strict contact accuracies alongside a diverse motion manifold. While current literature trades off between realism and physical correctness, HOIDiNi optimizes directly in the noise space of a pretrained diffusion model using Diffusion Noise Optimization (DNO), achieving both. This is made feasible thanks to our observation that the problem can be separated into two phases: an object-centric phase, primarily making discrete choices of hand-object contact locations, and a human-centric phase that refines the full-body motion to realize this blueprint. This structured approach allows for precise hand-object contact without compromising motion naturalness. Quantitative, qualitative, and subjective evaluations on the GRAB dataset alone clearly indicate HOIDiNi outperforms prior works and baselines in contact accuracy, physical validity, and overall quality. Our results demonstrate the ability to generate complex, controllable interactions, including grasping, placing, and full-body coordination, driven solely by textual prompts. https://hoidini.github.io.

[111] FindingDory: A Benchmark to Evaluate Memory in Embodied Agents

Karmesh Yadav,Yusuf Ali,Gunshi Gupta,Yarin Gal,Zsolt Kira

Main category: cs.CV

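TL;DR: The authors introduce a new benchmark for evaluating long-horizon embodied tasks in the Habitat simulator. It contains 60 tasks that require sustained engagement and environmental awareness and can be extended to longer, more challenging versions. They also provide baselines that combine state-of-the-art vision-language models with low-level navigation policies to assess performance on these memory-intensive tasks.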

Details

Motivation: Large vision-language models (VLMs) perform well on planning and control tasks, but applying them to real-world robotics is limited because they struggle with long-term experience represented by large collections of images gathered over multiple days. Current models typically cannot process more than a few hundred images at once, so more efficient mechanisms for long-term memory in embodied settings are needed. In addition, existing long-video QA benchmarks overlook embodied challenges such as object manipulation and navigation. Method: The authors introduce a new benchmark for long-horizon embodied tasks in the Habitat simulator. It covers 60 tasks that require sustained engagement and contextual awareness in an environment, and the tasks can be procedurally extended to longer and more challenging versions. They also combine state-of-the-art VLMs with low-level navigation policies to provide baselines and assess performance on memory-intensive tasks. Result: The benchmark evaluates memory and reasoning, and procedurally extending task length and complexity makes that evaluation scalable. The baseline evaluation reveals areas for improvement. Conclusion: The work underscores the importance of integrating memory effectively in embodied agents and, through the new benchmark and baselines, advances research on long-term memory and reasoning. Abstract: Large vision-language models have recently demonstrated impressive performance in planning and control tasks, driving interest in their application to real-world robotics. However, deploying these models for reasoning in embodied contexts is limited by their ability to incorporate long-term experience collected across multiple days and represented by vast collections of images. Current VLMs typically struggle to process more than a few hundred images concurrently, highlighting the need for more efficient mechanisms to handle long-term memory in embodied settings. To effectively evaluate these models for long-horizon control, a benchmark must specifically target scenarios where memory is crucial for success. Existing long-video QA benchmarks overlook embodied challenges like object manipulation and navigation, which demand low-level skills and fine-grained reasoning over past interactions. Moreover, effective memory integration in embodied agents involves both recalling relevant historical information and executing actions based on that information, making it essential to study these aspects together rather than in isolation. In this work, we introduce a new benchmark for long-range embodied tasks in the Habitat simulator. This benchmark evaluates memory-based capabilities across 60 tasks requiring sustained engagement and contextual awareness in an environment. 
The tasks can also be procedurally extended to longer and more challenging versions, enabling scalable evaluation of memory and reasoning. We also present baselines that integrate state-of-the-art VLMs with low-level navigation policies, assessing their performance on these memory-intensive tasks and highlighting areas for improvement.

[112] Demystifying the Visual Quality Paradox in Multimodal Large Language Models

Shuo Xing,Lanqing Guo,Hongyuan Hua,Seoyoung Lee,Peiran Li,Yufei Wang,Zhangyang Wang,Zhengzhong Tu

Main category: cs.CV

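TL;DR: Recent multimodal large language models (MLLMs) perform well on vision-language benchmarks, but how input visual quality affects their responses is unclear. The study shows that higher perceptual image quality does not necessarily improve MLLM understanding, and that lowering image fidelity sometimes improves model performance. The authors propose Visual-Quality Test-Time Tuning (VQ-TTT), a lightweight adaptation module that dynamically adjusts input images to match task-specific model preferences and significantly raises average accuracy. This redefines what counts as a "better" visual input for MLLMs and underlines the importance of adaptive image processing.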

Details

Motivation: Although MLLMs perform well on vision-language tasks, little work has examined how input visual quality affects their responses. A systematic study of the effect of visual quality on MLLM performance is needed, along with ways to adjust input images to model preferences. Method: The authors systematically study leading MLLMs across a suite of vision-language benchmarks, applying controlled degradations and stylistic shifts to analyze how image quality affects performance. Having found a visual-quality paradox, they propose VQ-TTT, which (1) inserts a learnable low-rank kernel before the frozen vision encoder to modulate frequency content and (2) fine-tunes only shallow vision-encoder layers via LoRA. The method dynamically adjusts each input image in a single forward pass to match task-specific model preferences. Result: Without external models, cached features, or extra training data, VQ-TTT significantly raises average accuracy across all evaluated MLLMs and datasets, supporting the effectiveness of adaptive image processing. Conclusion: The study reveals the complex effect of visual quality on MLLM performance and proposes VQ-TTT as an effective adaptive image adjustment method. In an era where AI is the main consumer of data, adaptive imagery, rather than universally "clean" imagery, should be the focus for optimizing MLLMs. Abstract: Recent Multimodal Large Language Models (MLLMs) excel on benchmark vision-language tasks, yet little is known about how input visual quality shapes their responses. Does higher perceptual quality of images already translate to better MLLM understanding? We conduct the first systematic study spanning leading MLLMs and a suite of vision-language benchmarks, applying controlled degradations and stylistic shifts to each image. Surprisingly, we uncover a visual-quality paradox: model, task, and even individual-instance performance can improve when images deviate from human-perceived fidelity. Off-the-shelf restoration pipelines fail to reconcile these idiosyncratic preferences. To close the gap, we introduce Visual-Quality Test-Time Tuning (VQ-TTT), a lightweight adaptation module that: (1) inserts a learnable, low-rank kernel before the frozen vision encoder to modulate frequency content; and (2) fine-tunes only shallow vision-encoder layers via LoRA. VQ-TTT dynamically adjusts each input image in a single forward pass, aligning it with task-specific model preferences. Across the evaluated MLLMs and all datasets, VQ-TTT significantly lifts average accuracy, with no external models, cached features, or extra training data. These findings redefine "better" visual inputs for MLLMs and highlight the need for adaptive, rather than universally "clean", imagery, in the new era of AI being the main data customer.

[113] Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning

Ankan Deria,Adinath Madhavrao Dukre,Feilong Tang,Sara Atito,Sudipta Roy,Muhammad Awais,Muhammad Haris Khan,Imran Razzak

Main category: cs.CV

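TL;DR: Despite notable progress in inference-time search for vision-language models (VLMs), existing methods remain computationally expensive and prone to unpenalized, low-confidence generations that often cause persistent hallucinations. The paper proposes ViMaR, a two-stage inference framework based on value guidance and margin-based reward. It combines a temporal-difference value model with a margin-aware reward adjustment to improve both efficiency and output fidelity. Experiments show that ViMaR produces captions that are more reliable, factually accurate, detailed, and explanatory, while running over 4x faster than existing value-guided methods. ViMaR also generalizes across models, and self-training on its captions further improves the underlying models on a range of visual comprehension benchmarks.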

Details

Motivation: Current inference methods for vision-language models are computationally expensive and produce low-confidence generations, which can cause persistent hallucinations and limit practical use. A new inference framework is needed that improves generation quality and efficiency while reducing hallucination. Method: ViMaR is a two-stage inference framework. The first stage identifies the highest-value caption among diverse candidates in a single pass. The second stage selectively refines only the segments that were overlooked or weakly grounded visually, while a calibrated margin-based penalty suppresses low-confidence continuations and preserves descriptive richness. Result: ViMaR's captions are significantly better than those of existing methods in reliability, factual accuracy, detail, and explanatory quality, with a speedup of over 4x. ViMaR also generalizes well across models and can guide decoding in a stronger, unseen model. With self-training, captions generated by ViMaR substantially improve the underlying models on a range of visual comprehension benchmarks. Conclusion: As a scalable and transferable inference-time decoding strategy, ViMaR improves the quality and efficiency of VLM captioning and shows cross-model flexibility and modularity. Its use in self-training further demonstrates the potential of fast, accurate, and self-improving VLM pipelines. Abstract: Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations which often lead to persistent hallucinations. We introduce Value-guided Inference with Margin-based Reward (ViMaR), a two-stage inference framework that improves both efficiency and output fidelity by combining a temporal-difference value model with a margin-aware reward adjustment. In the first stage, we perform a single pass to identify the highest-value caption among diverse candidates. In the second stage, we selectively refine only those segments that were overlooked or exhibit weak visual grounding, thereby eliminating frequently rewarded evaluations. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. Extensive experiments across multiple VLM architectures demonstrate that ViMaR generates captions that are significantly more reliable, factually accurate, detailed, and explanatory, while achieving over 4x speedup compared to existing value-guided methods. Specifically, we show that ViMaR, trained solely on LLaVA Mistral-7B, generalizes effectively to guide decoding in a stronger unseen model. To further validate this, we adapt ViMaR to steer generation in LLaVA-OneVision-Qwen2-7B, leading to consistent improvements in caption quality and demonstrating robust cross-model guidance. 
This cross-model generalization highlights ViMaR's flexibility and modularity, positioning it as a scalable and transferable inference-time decoding strategy. Furthermore, when ViMaR-generated captions are used for self-training, the underlying models achieve substantial gains across a broad suite of visual comprehension benchmarks, underscoring the potential of fast, accurate, and self-improving VLM pipelines.

[114] UniRelight: Learning Joint Decomposition and Synthesis for Video Relighting

Kai He,Ruofan Liang,Jacob Munkberg,Jon Hasselgren,Nandita Vijaykumar,Alexander Keller,Sanja Fidler,Igor Gilitschenski,Zan Gojcic,Zian Wang

Main category: cs.CV

TL;DR: The paper presents a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in one pass using video diffusion models, trained on synthetic and real-world data to achieve strong generalization and surpass previous methods in visual fidelity and temporal consistency.

Details

Motivation: To overcome the limitations of existing end-to-end relighting models which are restricted by paired multi-illumination data scarcity and two-stage pipelines that suffer from error accumulation and inability to produce realistic outputs under complex conditions. Method: Introduced a single-pass method combining albedo estimation and relit output synthesis utilizing the generative capabilities of video diffusion models. The model is trained on both synthetic multi-illumination data and extensive automatically labeled real-world videos. Result: Demonstrates strong generalization across diverse domains and surpasses previous methods in both visual fidelity and temporal consistency. Conclusion: This joint formulation enhances implicit scene comprehension and facilitates the creation of realistic lighting effects and intricate material interactions. Abstract: We address the challenge of relighting a single image or video, a task that demands precise scene intrinsic understanding and high-quality light transport synthesis. Existing end-to-end relighting models are often limited by the scarcity of paired multi-illumination data, restricting their ability to generalize across diverse scenes. Conversely, two-stage pipelines that combine inverse and forward rendering can mitigate data requirements but are susceptible to error accumulation and often fail to produce realistic outputs under complex lighting conditions or with sophisticated materials. In this work, we introduce a general-purpose approach that jointly estimates albedo and synthesizes relit outputs in a single pass, harnessing the generative capabilities of video diffusion models. This joint formulation enhances implicit scene comprehension and facilitates the creation of realistic lighting effects and intricate material interactions, such as shadows, reflections, and transparency. 
Trained on synthetic multi-illumination data and extensive automatically labeled real-world videos, our model demonstrates strong generalization across diverse domains and surpasses previous methods in both visual fidelity and temporal consistency.

[115] Sekai: A Video Dataset towards World Exploration

Zhen Li,Chuanhao Li,Xiaofeng Mao,Shaoheng Lin,Ming Li,Shitian Zhao,Zhaopan Xu,Xinyue Li,Yukang Feng,Jianwen Sun,Zizhen Li,Fanrui Zhang,Jiaxin Ai,Zhixiang Wang,Yuwei Wu,Tong He,Jiangmiao Pang,Yu Qiao,Yunde Jia,Kaipeng Zhang

Main category: cs.CV

TL;DR: The paper presents Sekai, a large-scale first-person worldwide video dataset with rich annotations for world exploration, and demonstrates its use in training an interactive video world exploration model named YUME.

Details

Motivation: Existing video generation datasets are not well-suited for world exploration training because of limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. Method: Developed Sekai, a high-quality first-person view worldwide video dataset with rich annotations including location, scene, weather, crowd density, captions, and camera trajectories. The dataset consists of over 5,000 hours of videos from over 100 countries and regions across 750 cities. An efficient and effective toolbox was used to collect, pre-process, and annotate the videos. Result: Experiments demonstrate the quality of the Sekai dataset. A subset of the dataset was used to successfully train an interactive video world exploration model named YUME. Conclusion: Sekai will benefit the area of video generation and world exploration, and motivate valuable applications. Abstract: Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well-suited for world exploration training as they suffer from some limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process, and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset. We also use a subset to train an interactive video world exploration model, named YUME (meaning "dream" in Japanese). 
We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications.

[116] Evolutionary Caching to Accelerate Your Off-the-Shelf Diffusion Model

Anirud Aggarwal,Abhinav Shrivastava,Matthew Gwilliam

Main category: cs.CV

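TL;DR: ECAD uses a genetic algorithm to learn per-model caching schedules that accelerate diffusion model inference. It requires no modifications to network parameters or reference images, adapts to different models and resolutions, and performs well across multiple metrics and benchmarks.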

Details

Motivation: Diffusion models generate high-quality images but are slow and computationally expensive at inference, and existing caching methods rely on rigid heuristics that limit acceleration or generalize poorly. Method: ECAD uses a genetic algorithm to learn efficient, per-model caching schedules that form a Pareto frontier, using only a small set of calibration prompts. Result: Experiments on PixArt-alpha and other models show that ECAD outperforms prior methods on metrics such as COCO FID while also increasing inference speedup. Conclusion: ECAD is a scalable and generalizable approach for accelerating diffusion model inference. Abstract: Diffusion-based image generation models excel at producing high-quality synthetic content, but suffer from slow and computationally expensive inference. Prior work has attempted to mitigate this by caching and reusing features within diffusion transformers across inference steps. These methods, however, often rely on rigid heuristics that result in limited acceleration or poor generalization across architectures. We propose Evolutionary Caching to Accelerate Diffusion models (ECAD), a genetic algorithm that learns efficient, per-model, caching schedules forming a Pareto frontier, using only a small set of calibration prompts. ECAD requires no modifications to network parameters or reference images. It offers significant inference speedups, enables fine-grained control over the quality-latency trade-off, and adapts seamlessly to different diffusion models. Notably, ECAD's learned schedules can generalize effectively to resolutions and model variants not seen during calibration. We evaluate ECAD on PixArt-alpha, PixArt-Sigma, and FLUX-1.dev using multiple metrics (FID, CLIP, Image Reward) across diverse benchmarks (COCO, MJHQ-30k, PartiPrompts), demonstrating consistent improvements over previous approaches. On PixArt-alpha, ECAD identifies a schedule that outperforms the previous state-of-the-art method by 4.47 COCO FID while increasing inference speedup from 2.35x to 2.58x. Our results establish ECAD as a scalable and generalizable approach for accelerating diffusion inference. Our project website is available at https://aniaggarwal.github.io/ecad and our code is available at https://github.com/aniaggarwal/ecad.
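
The notion of a Pareto frontier over the quality-latency trade-off can be made concrete with a small filter. The sketch below is a generic illustration, not ECAD's genetic algorithm: it assumes each candidate schedule is summarized as a (latency, FID) pair where lower is better on both axes, and the name `pareto_front` and all numbers are hypothetical.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    Each point is (latency, fid); lower is better on both axes.
    A point is dominated if another, different point is at least as good on both.
    """
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical (latency in seconds, FID) results for five caching schedules.
candidates = [(1.0, 30.0), (0.5, 35.0), (0.8, 31.0), (0.9, 33.0), (0.4, 40.0)]
print(pareto_front(candidates))
# [(0.4, 40.0), (0.5, 35.0), (0.8, 31.0), (1.0, 30.0)]
```

The schedule at (0.9, 33.0) is dropped because (0.8, 31.0) is both faster and better in FID; every remaining point represents a distinct trade-off a user could pick.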

Table of Contents

cs.CL [Back]

[1] Adverse Event Extraction from Discharge Summaries: A New Dataset, Annotation Scheme, and Initial Findings

[2] Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction

[3] CrEst: Credibility Estimation for Contexts in LLMs via Weak Supervision

[4] MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance

[5] From Chat to Checkup: Can Large Language Models Assist in Diabetes Prediction?

[6] Memory Tokens: Large Language Models Can Generate Reversible Sentence Embeddings

[7] Identifying social isolation themes in NVDRS text narratives using topic modeling and text-classification methods

[8] Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation

[9] Learning-Time Encoding Shapes Unlearning in LLMs

[10] Improving Dialogue Discourse Parsing through Discourse-aware Utterance Clarification

[11] CKD-EHR:Clinical Knowledge Distillation for Electronic Health Records

[12] Modeling the One-to-Many Property in Open-Domain Dialogue with LLMs

[13] Thunder-Tok: Minimizing Tokens per Word in Tokenizing Korean Texts for Generative Language Models

[14] Emergence of Primacy and Recency Effect in Mamba: A Mechanistic Point of View

[15] A Comparative Study of Task Adaptation Techniques of Large Language Models for Identifying Sustainable Development Goals

[16] ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs

[17] MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs

[18] Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants

[19] Research on Graph-Retrieval Augmented Generation Based on Historical Text Knowledge Graphs

[20] TopClustRAG at SIGIR 2025 LiveRAG Challenge

[21] Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments

[22] Cohort Discovery: A Survey on LLM-Assisted Clinical Trial Recruitment

[23] ConLID: Supervised Contrastive Learning for Low-Resource Language Identification

[24] DeVisE: Behavioral Testing of Medical Large Language Models

[25] SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models' Knowledge of Indian Culture

[26] COSMMIC: Comment-Sensitive Multimodal Multilingual Indian Corpus for Summarization and Headline Generation

[27] Targeted Lexical Injection: Unlocking Latent Cross-Lingual Alignment in Lugha-Llama via Early-Layer LoRA Fine-Tuning

[28] Understanding GUI Agent Localization Biases through Logit Sharpness

[29] AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent System Need

[30] RE-IMAGINE: Symbolic Benchmark Synthesis for Reasoning Evaluation

[31] Context-Informed Grounding Supervision

[32] SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling

[33] Enhancing Hyperbole and Metaphor Detection with Their Bidirectional Dynamic Interaction and Emotion Knowledge

[34] Lessons from Training Grounded LLMs with Verifiable Rewards

[35] RATTENTION: Towards the Minimal Sliding Window Size in Local-Global Attention Models

[36] Approximating Language Model Training Data from Weights

[37] PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction

[38] Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models

[39] SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification

[40] DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement

[41] WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts

[42] From Model to Classroom: Evaluating Generated MCQs for Portuguese with Narrative and Difficulty Concerns

[43] The Compositional Architecture of Regret in Large Language Models

[44] Minding the Politeness Gap in Cross-cultural Communication

[45] Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability

[46] Oldies but Goldies: The Potential of Character N-grams for Romanian Texts

[47] CC-LEARN: Cohort-based Consistency Learning

[48] Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers

[49] Gender-Neutral Machine Translation Strategies in Practice

[50] GenRecal: Generation after Recalibration from Large to Small Vision-Language Models

[51] PhantomHunter: Detecting Unseen Privately-Tuned LLM-Generated Text via Family-Aware Learning

cs.CV [Back]

[52] SemIRNet: A Semantic Irony Recognition Network for Multimodal Sarcasm Detection

[53] Argus Inspection: Do Multimodal Large Language Models Possess the Eye of Panoptes?

[54] A Hybrid ConvNeXt-EfficientNet AI Solution for Precise Falcon Disease Detection

[55] ViLLa: A Neuro-Symbolic approach for Animal Monitoring

[56] GraphGSOcc: Semantic and Geometric Graph Transformer for 3D Gaussian Splating-based Occupancy Prediction

[57] DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning

[58] Recent Advances in Multi-Agent Human Trajectory Prediction: A Comprehensive Review

[59] ArchShapeNet:An Interpretable 3D-CNN Framework for Evaluating Architectural Shapes

[60] Real-Time, Low-Latency Surveillance Using Entropy-Based Adaptive Buffering and MobileNetV2 on Edge Devices

[61] MonoVQD: Monocular 3D Object Detection with Variational Query Denoising and Self-Distillation

[62] Improved Iterative Refinement for Chart-to-Code Generation via Structured Instruction

[63] PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers

[64] Finding Optimal Kernel Size and Dimension in Convolutional Neural Networks An Architecture Optimization Approach

[65] Efficient Retail Video Annotation: A Robust Key Frame Generation Approach for Product and Customer Interaction Analysis

[66] Peering into the Unknown: Active View Selection with Neural Uncertainty Maps for 3D Reconstruction

[67] DETONATE: A Benchmark for Text-to-Image Alignment and Kernelized Direct Preference Optimization

[68] PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning

[69] Frequency-Calibrated Membership Inference Attacks on Medical Image Diffusion Models

[70] Vision Transformers for End-to-End Quark-Gluon Jet Classification from Calorimeter Images

[71] Advances in Compliance Detection: Novel Models Using Vision-Based Tactile Sensors

[72] Hyper-Local Deformable Transformers for Text Spotting on Historical Maps

[73] Break Stylistic Sophon: Are We Really Meant to Confine the Imagination in Style Transfer?

[74] Enhancing Vector Quantization with Distributional Matching: A Theoretical and Empirical Study

[75] SynPo: Boosting Training-Free Few-Shot Medical Segmentation via High-Quality Negative Prompts

[76] Enhancing point cloud analysis via neighbor aggregation correction based on cross-stage structure correlation

[77] Echo-DND: A dual noise diffusion model for robust and precise left ventricle segmentation in echocardiography