cs.CL

[1] MALIBU Benchmark: Multi-Agent LLM Implicit Bias Uncovered

Imran Mirza,Cole Huang,Ishwara Vasista,Rohan Patil,Asli Akalin,Sean O'Brien,Kevin Zhu

Main category: cs.CL

TL;DR: This paper presents MALIBU, a benchmark for evaluating social biases in multi-agent AI systems, showing that bias mitigation may not achieve true neutrality and could favor marginalized groups.

Details Motivation: Multi-agent systems consisting of AI models interacting in shared environments are increasingly used for persona-based interactions. However, these systems can reinforce implicit biases in large language models (LLMs), raising concerns about fairness and equitable representation. Method: The study introduces MALIBU, a benchmark for assessing implicit social biases in LLM-based multi-agent systems. It uses scenario-based assessments where AI models complete tasks, and their responses are evaluated by an LLM-based multi-agent judging system in two phases: scoring responses labeled with demographic personas and comparing paired responses assigned to different personas. Result: The study quantifies biases in LLM-generated outputs using MALIBU, revealing that current bias mitigation approaches might not achieve true neutrality and could potentially favor marginalized personas. Conclusion: The study concludes that bias mitigation in LLM-based multi-agent systems may favor marginalized personas over true neutrality, highlighting the need for nuanced detection, balanced fairness strategies, and transparent evaluation benchmarks. Abstract: Multi-agent systems, which consist of multiple AI models interacting within a shared environment, are increasingly used for persona-based interactions. However, if not carefully designed, these systems can reinforce implicit biases in large language models (LLMs), raising concerns about fairness and equitable representation. We present MALIBU, a novel benchmark developed to assess the degree to which LLM-based multi-agent systems implicitly reinforce social biases and stereotypes. MALIBU evaluates bias in LLM-based multi-agent systems through scenario-based assessments. AI models complete tasks within predefined contexts, and their responses undergo evaluation by an LLM-based multi-agent judging system in two phases. 
In the first phase, judges score responses labeled with specific demographic personas (e.g., gender, race, religion) across four metrics. In the second phase, judges compare paired responses assigned to different personas, scoring them and selecting the superior response. Our study quantifies biases in LLM-generated outputs, revealing that bias mitigation may favor marginalized personas over true neutrality, emphasizing the need for nuanced detection, balanced fairness strategies, and transparent evaluation benchmarks in multi-agent systems.

[2] Event-based evaluation of abstractive news summarization

Huiling You,Samia Touileb,Erik Velldal,Lilja Øvrelid

Main category: cs.CL

TL;DR: This paper proposes a new method to evaluate abstractive news summaries by analyzing overlapping events in the summaries and original articles, providing deeper insight into their quality.

Details Motivation: The evaluation of abstractive summaries typically relies on human-authored summaries as gold references. However, these methods do not effectively capture whether the summaries report events accurately, which is crucial since news articles primarily describe events. Method: The authors calculated overlapping events between generated summaries, reference summaries, and original news articles on a richly annotated Norwegian dataset containing both event annotations and human-authored summaries. Result: The method allows for better evaluation and understanding of the event information captured in abstractive summaries by focusing on event overlaps rather than traditional similarity metrics. Conclusion: The proposed approach provides a deeper insight into the event information contained in abstractive summaries, offering a more effective way to evaluate their quality. Abstract: An abstractive summary of a news article contains its most important information in a condensed version. The evaluation of automatically generated summaries by generative language models relies heavily on human-authored summaries as gold references, by calculating overlapping units or similarity scores. News articles report events, and ideally so should the summaries. In this work, we propose to evaluate the quality of abstractive summaries by calculating overlapping events between generated summaries, reference summaries, and the original news articles. We experiment on a richly annotated Norwegian dataset comprising both events annotations and summaries authored by expert human annotators. Our approach provides more insight into the event information contained in the summaries.
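The event-overlap idea can be illustrated with a small sketch. It assumes each event is reduced to a hashable tuple such as (event type, main argument) and scores overlap with set-based precision, recall, and F1; the tuple representation, the function name `event_overlap`, and the choice of these three scores are illustrative assumptions, not the authors' exact procedure.

```python
def event_overlap(generated_events, reference_events):
    """Set-based precision, recall and F1 of generated events against reference events.

    Events are assumed to be hashable tuples, e.g. ("attack", "Oslo").
    """
    generated = set(generated_events)
    reference = set(reference_events)
    shared = generated & reference
    precision = len(shared) / len(generated) if generated else 0.0
    recall = len(shared) / len(reference) if reference else 0.0
    if not shared:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical events extracted from a generated summary and from the source article.
summary_events = [("attack", "Oslo"), ("arrest", "suspect"), ("resign", "minister")]
article_events = [("attack", "Oslo"), ("arrest", "suspect"),
                  ("trial", "suspect"), ("protest", "Bergen")]
precision, recall, f1 = event_overlap(summary_events, article_events)
```

The same function can be called a second time with the reference summary's events in place of the article's, which gives the two comparisons the abstract describes.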

[3] Matching and Linking Entries in Historical Swedish Encyclopedias

Simon Börjesson,Erik Ersmark,Pierre Nugues

Main category: cs.CL

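TL;DR: This paper analyzes the first and second editions of the Swedish encyclopedia Nordisk familjebok, traces how its geographic focus shifted between them, and links the shift to the First World War and the rise of new powers.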

Details Motivation: Studying how the geographic entries change across editions of Nordisk familjebok can reveal how the attention of Swedish intellectual circles to world geography evolved, and the historical context behind that evolution. Method: Using the digitized versions from Project Runeberg, the authors resegment the text into entries and match pairs of entries between the first and second editions with semantic sentence embeddings; they then extract the geographic entries with a transformer-based classifier and link them to Wikidata to analyze geographic trends. Result: From the first to the second edition, there is a small but significant shift in geographic focus away from Europe and towards North America, Africa, Asia, Australia, and northern Scandinavia. Conclusion: The shift in the geographic focus of Nordisk familjebok reflects the influence of the First World War and the rise of new powers. Abstract: The Nordisk familjebok is a Swedish encyclopedia from the 19th and 20th centuries. It was written by a team of experts and aimed to be an intellectual reference, stressing precision and accuracy. This encyclopedia had four main editions remarkable for their size, ranging from 20 to 38 volumes. As a consequence, the Nordisk familjebok had a considerable influence in universities, schools, the media, and society overall. As new editions were released, the selection of entries and their content evolved, reflecting intellectual changes in Sweden. In this paper, we used digitized versions from Project Runeberg. We first resegmented the raw text into entries and matched pairs of entries between the first and second editions using semantic sentence embeddings. We then extracted the geographical entries from both editions using a transformer-based classifier and linked them to Wikidata. This enabled us to identify geographic trends and possible shifts between the first and second editions, written between 1876-1899 and 1904-1926, respectively. Interpreting the results, we observe a small but significant shift in geographic focus away from Europe and towards North America, Africa, Asia, Australia, and northern Scandinavia from the first to the second edition, confirming the influence of the First World War and the rise of new powers. The code and data are available on GitHub at https://github.com/sibbo/nordisk-familjebok.

[4] MEGA: xLSTM with Multihead Exponential Gated Fusion for Precise Aspect-based Sentiment Analysis

Adamu Lawan,Juhua Pu,Haruna Yunusa,Jawad Muhammad,Muhammad Lawan

Main category: cs.CL

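TL;DR: This paper proposes MEGA, a new xLSTM-based framework that addresses the trade-off between efficiency and performance in aspect-based sentiment analysis (ABSA).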

Details Motivation: Existing ABSA methods struggle to balance computational efficiency with high performance: deep learning models lack global context, transformers consume substantial computational resources, and Mamba-based approaches suffer from CUDA dependency and weakened local correlations. Method: The framework combines a bi-directional mLSTM architecture with forward and partially flipped backward (PF-mLSTM) streams, and introduces an mLSTM-based multihead cross exponential gated fusion mechanism (MECGAF) to improve short-range dependency capture while maintaining global context and efficiency. Result: Experiments on three benchmark datasets show that MEGA outperforms state-of-the-art baselines, achieving higher accuracy and efficiency on ABSA tasks. Conclusion: The proposed MEGA framework addresses the trade-off between efficiency and performance in ABSA and offers a new direction for future research. Abstract: Aspect-based Sentiment Analysis (ABSA) is a critical Natural Language Processing (NLP) task that extracts aspects from text and determines their associated sentiments, enabling fine-grained analysis of user opinions. Existing ABSA methods struggle to balance computational efficiency with high performance: deep learning models often lack global context, transformers demand significant computational resources, and Mamba-based approaches face CUDA dependency and diminished local correlations. Recent advancements in Extended Long Short-Term Memory (xLSTM) models, particularly their efficient modeling of long-range dependencies, have significantly advanced the NLP community. However, their potential in ABSA remains untapped. To this end, we propose xLSTM with Multihead Exponential Gated Fusion (MEGA), a novel framework integrating a bi-directional mLSTM architecture with forward and partially flipped backward (PF-mLSTM) streams. The PF-mLSTM enhances localized context modeling by processing the initial sequence segment in reverse with dedicated parameters, preserving critical short-range patterns. We further introduce an mLSTM-based multihead cross exponential gated fusion mechanism (MECGAF) that dynamically combines forward mLSTM outputs as query and key with PF-mLSTM outputs as value, optimizing short-range dependency capture while maintaining global context and efficiency. Experimental results on three benchmark datasets demonstrate that MEGA outperforms state-of-the-art baselines, achieving superior accuracy and efficiency in ABSA tasks.

[5] The Medium Is Not the Message: Deconfounding Text Embeddings via Linear Concept Erasure

Yu Fan,Yang Tian,Shauli Ravfogel,Mrinmaya Sachan,Elliott Ash,Alexander Hoyle

Main category: cs.CL

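TL;DR: This paper presents a method that reduces bias in embedding-based text similarity metrics by removing confounder information from encoder representations; it improves document similarity and clustering metrics while preserving computational efficiency and embedding quality.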

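Details Motivation: Embedding-based similarity metrics between text sequences can be influenced not only by the content dimensions of interest but also by incidental attributes such as the text's source or language. These document confounders cause problems for many applications, especially those that pool texts from different corpora. Method: The paper proposes a debiasing algorithm that removes information about observed document confounders from the encoder representations. Result: Document similarity and clustering metrics improve across every embedding variant and task evaluated, sometimes dramatically. Performance on out-of-distribution benchmarks is not affected, indicating that embedding quality is not degraded. Conclusion: Removing information about observed confounders from the encoder representations substantially reduces bias in embedding-based similarity metrics between text sequences, at minimal computational cost. Abstract: Embedding-based similarity metrics between text sequences can be influenced not just by the content dimensions we most care about, but can also be biased by spurious attributes like the text's source or language. These document confounders cause problems for many applications, but especially those that need to pool texts from different corpora. This paper shows that a debiasing algorithm that removes information about observed confounders from the encoder representations substantially reduces these biases at a minimal computational cost. Document similarity and clustering metrics improve across every embedding variant and task we evaluate -- often dramatically. Interestingly, performance on out-of-distribution benchmarks is not impacted, indicating that the embeddings are not otherwise degraded.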
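To make the idea of removing a confounder from embeddings concrete, here is a deliberately simplified sketch. It assumes a single binary confounder (for example, source corpus A versus B) and erases only the direction separating the two group means by orthogonal projection. This is a stand-in for illustration, not the erasure algorithm used in the paper; the names `mean` and `erase_direction` are hypothetical.

```python
import math


def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def erase_direction(embeddings, labels):
    """Project out the direction separating the two confounder groups.

    labels holds 0 or 1 per embedding (assumed binary confounder).
    """
    group0 = [e for e, label in zip(embeddings, labels) if label == 0]
    group1 = [e for e, label in zip(embeddings, labels) if label == 1]
    diff = [a - b for a, b in zip(mean(group1), mean(group0))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        return [list(e) for e in embeddings]  # groups already share a mean
    unit = [x / norm for x in diff]
    cleaned = []
    for e in embeddings:
        coefficient = sum(a * u for a, u in zip(e, unit))
        cleaned.append([a - coefficient * u for a, u in zip(e, unit)])
    return cleaned


# Toy 2-D embeddings: the first coordinate mostly encodes the source corpus.
embeddings = [[1.0, 2.0], [1.2, 1.8], [3.0, 2.1], [3.2, 1.9]]
labels = [0, 0, 1, 1]
cleaned = erase_direction(embeddings, labels)
```

After the projection the two groups have identical means, so a similarity metric can no longer separate them along that direction, while components orthogonal to it are left untouched.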

Michał Matak,Jarosław A. Chudziak

Main category: cs.CL

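TL;DR: This paper introduces the gAIus architecture, which improves the retrieval mechanism for legal tasks in non-English and non-Chinese speaking countries and substantially improves LLM performance on a test of Polish legal questions.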

Details Motivation: The study addresses the lack of proper references when current large language models answer legal questions for non-English and non-Chinese speaking countries, and aims to improve the effectiveness and explainability of legal information retrieval. Method: The authors propose gAIus, an LLM-based cognitive agent architecture grounded in the Polish Civil Code, with a more explainable, human-friendly retrieval mechanism, and evaluate it on a purpose-built dataset of single-choice questions from Polish law apprenticeship entrance exams. Result: The method improves gpt-3.5-turbo-0125 by 419%, allowing it to beat gpt-4o, and lifts the gpt-4o-mini score from 31% to 86%. Conclusion: The gAIus architecture demonstrates the potential of large language models for legal tasks, especially legal retrieval in non-English and non-Chinese speaking countries, and points to future research directions and applications. Abstract: In this paper we discuss the capability of large language models to base their answers on, and provide proper references for, legal matters of a non-English and non-Chinese speaking country. We discuss the history of legal information retrieval, the difference between case law and statute law, its impact on the legal tasks and analyze the latest research in this field. Based on that background we introduce gAIus, the architecture of the cognitive LLM-based agent, whose responses are based on the knowledge retrieved from a specific legal act, the Polish Civil Code. We propose a retrieval mechanism which is more explainable, human-friendly and achieves better results than embedding-based approaches. To evaluate our method we create a special dataset based on single-choice questions from entrance exams for law apprenticeships conducted in Poland. The proposed architecture critically leveraged the abilities of the large language models used, improving the gpt-3.5-turbo-0125 by 419%, allowing it to beat gpt-4o and lifting gpt-4o-mini score from 31% to 86%. At the end of our paper we show the possible future path of research and potential applications of our findings.

[7] Evaluating Large Language Models for Multimodal Simulated Ophthalmic Decision-Making in Diabetic Retinopathy and Glaucoma Screening

Cindy Lie Tabuse,David Restepo,Carolina Gracitelli,Fernando Korn Malerbi,Caio Regatieri,Luis Filipe Nakayama

Main category: cs.CL

TL;DR: GPT-4 can simulate simple decisions in ophthalmology but is not suitable for clinical use because it lacks precision.

Details Motivation: The study aims to assess whether GPT-4 can simulate clinical decision-making for diabetic retinopathy and glaucoma from structured textual descriptions. Method: A retrospective diagnostic validation study with 300 annotated fundus images, evaluating GPT-4 on ICDR score classification, DR referral, and cup-to-disc ratio estimation using structured prompts with or without metadata. Result: GPT-4 reached moderate accuracy for ICDR classification (67.5%) and better results for binary DR referral (82.3%), while performance for glaucoma referral remained poor (accuracy ~78%, but F1 <0.04). Metadata had no significant effect on the results. Conclusion: GPT-4 can simulate basic decision-making in ophthalmology but is not precise enough for complex tasks. Abstract: Large language models (LLMs) can simulate clinical reasoning based on natural language prompts, but their utility in ophthalmology is largely unexplored. This study evaluated GPT-4's ability to interpret structured textual descriptions of retinal fundus photographs and simulate clinical decisions for diabetic retinopathy (DR) and glaucoma screening, including the impact of adding real or synthetic clinical metadata. We conducted a retrospective diagnostic validation study using 300 annotated fundus images. GPT-4 received structured prompts describing each image, with or without patient metadata. The model was tasked with assigning an ICDR severity score, recommending DR referral, and estimating the cup-to-disc ratio for glaucoma referral. Performance was evaluated using accuracy, macro and weighted F1 scores, and Cohen's kappa. McNemar's test and change rate analysis were used to assess the influence of metadata. 
GPT-4 showed moderate performance for ICDR classification (accuracy 67.5%, macro F1 0.33, weighted F1 0.67, kappa 0.25), driven mainly by correct identification of normal cases. Performance improved in the binary DR referral task (accuracy 82.3%, F1 0.54, kappa 0.44). For glaucoma referral, performance was poor across all settings (accuracy ~78%, F1 <0.04, kappa <0.03). Metadata inclusion did not significantly affect outcomes (McNemar p > 0.05), and predictions remained consistent across conditions. GPT-4 can simulate basic ophthalmic decision-making from structured prompts but lacks precision for complex tasks. While not suitable for clinical use, LLMs may assist in education, documentation, or image annotation workflows in ophthalmology.
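The glaucoma numbers (accuracy around 78% with kappa near zero) can look contradictory. The sketch below computes Cohen's kappa from scratch on made-up, imbalanced labels to show how a model that almost always predicts the majority class gets high accuracy but no agreement beyond chance. The data are invented for illustration and are not the study's data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and truth agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    classes = set(y_true) | set(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in classes) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: only one class appears anywhere
    return (observed - expected) / (1.0 - expected)


# Invented example: 78 non-referable and 22 referable cases,
# with a model that never recommends referral.
y_true = ["no"] * 78 + ["refer"] * 22
y_pred = ["no"] * 100
print(accuracy(y_true, y_pred))     # 0.78
print(cohen_kappa(y_true, y_pred))  # 0.0
```

Accuracy rewards the majority-class guess, whereas kappa subtracts the agreement expected by chance, which here is also 0.78.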

[8] Rethinking All Evidence: Enhancing Trustworthy Retrieval-Augmented Generation via Conflict-Driven Summarization

Juan Chen,Baolong Bi,Wei Zhang,Jingyan Sui,Xiaofei Zhu,Yuanzhuo Wang,Lingrui Mei,Shenghua Liu

Main category: cs.CL

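TL;DR: CARE-RAG is a new method that helps large language models generate reliable responses when knowledge conflicts are present, achieving higher accuracy through conflict-aware evidence summarization and a QA Repair step.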

Details Motivation: To address the loss of LLM generation reliability caused by internal inconsistencies or noisy retrieved content. Method: The paper proposes CARE-RAG, a framework that integrates parametric knowledge and external retrieved content through conflict-driven summarization, and introduces a QA Repair step to correct outdated or ambiguous benchmark answers. Result: Experiments on revised QA datasets show that CARE-RAG consistently outperforms strong RAG baselines, especially when the evidence is noisy or conflicting. Conclusion: CARE-RAG is a new framework that improves the trustworthiness of RAG systems and outperforms strong baselines particularly in scenarios with noisy or conflicting evidence. Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating their parametric knowledge with external retrieved content. However, knowledge conflicts caused by internal inconsistencies or noisy retrieved content can severely undermine the generation reliability of RAG systems. In this work, we argue that LLMs should rethink all evidence, including both retrieved content and internal knowledge, before generating responses. We propose CARE-RAG (Conflict-Aware and Reliable Evidence for RAG), a novel framework that improves trustworthiness through Conflict-Driven Summarization of all available evidence. CARE-RAG first derives parameter-aware evidence by comparing parameter records to identify diverse internal perspectives. It then refines retrieved evidence to produce context-aware evidence, removing irrelevant or misleading content. To detect and summarize conflicts, we distill a 3B LLaMA3.2 model to perform conflict-driven summarization, enabling reliable synthesis across multiple sources. To further ensure evaluation integrity, we introduce a QA Repair step to correct outdated or ambiguous benchmark answers. Experiments on revised QA datasets with retrieval data show that CARE-RAG consistently outperforms strong RAG baselines, especially in scenarios with noisy or conflicting evidence.

[9] Frustratingly Simple Retrieval Improves Challenging, Reasoning-Intensive Benchmarks

Xinxi Lyu,Michael Duan,Rulin Shao,Pang Wei Koh,Sewon Min

Main category: cs.CL

TL;DR: This paper introduces CompactDS, a high-quality, web-scale datastore that significantly boosts the performance of Retrieval-Augmented Generation on complex reasoning tasks, outperforming web search engines and complex RAG systems while maintaining simplicity and efficiency.

Details Motivation: The motivation stems from the observation that prior RAG approaches have struggled with reasoning-intensive benchmarks due to limited, non-diverse datastores. The authors aim to address this by developing a more effective, compact, and high-quality datastore aligned with pretraining data breadth. Method: The authors introduce CompactDS, a web-scale datastore designed for high retrieval accuracy and low latency, combining in-memory approximate nearest neighbor retrieval with on-disk exact search. They evaluate its performance on reasoning-intensive benchmarks like MMLU, MMLU Pro, AGI Eval, GPQA, and MATH. Result: Using CompactDS, the minimal RAG pipeline achieved relative accuracy gains of 10% on MMLU, 33% on MMLU Pro, 14% on GPQA, and 19% on MATH across different model sizes (8B–70B). The results also show that no single data source is sufficient, underscoring the need for diversity in data. Conclusion: The study concludes that a minimal RAG pipeline using CompactDS achieves significant accuracy improvements across various benchmarks and model sizes, emphasizing the importance of diverse data sources and efficient retrieval. Abstract: Retrieval-augmented Generation (RAG) has primarily been studied in limited settings, such as factoid question answering; more challenging, reasoning-intensive benchmarks have seen limited success from minimal RAG. In this work, we challenge this prevailing view on established, reasoning-intensive benchmarks: MMLU, MMLU Pro, AGI Eval, GPQA, and MATH. We identify a key missing component in prior work: a usable, web-scale datastore aligned with the breadth of pretraining data. To this end, we introduce CompactDS: a diverse, high-quality, web-scale datastore that achieves high retrieval accuracy and subsecond latency on a single-node. 
The key insights are (1) most web content can be filtered out without sacrificing coverage, and a compact, high-quality subset is sufficient; and (2) combining in-memory approximate nearest neighbor (ANN) retrieval and on-disk exact search balances speed and recall. Using CompactDS, we show that a minimal RAG pipeline achieves consistent accuracy improvements across all benchmarks and model sizes (8B--70B), with relative gains of 10% on MMLU, 33% on MMLU Pro, 14% on GPQA, and 19% on MATH. No single data source suffices alone, highlighting the importance of diversity of sources (web crawls, curated math, academic papers, textbooks). Finally, we show that our carefully designed in-house datastore matches or outperforms web search engines such as Google Search, as well as recently proposed, complex agent-based RAG systems--all while maintaining simplicity, reproducibility, and self-containment. We release CompactDS and our retrieval pipeline, supporting future research exploring retrieval-based AI systems.
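The second insight, pairing a fast approximate search with an exact search over a short candidate list, can be sketched as follows. This toy uses a cheap score over the first few vector dimensions as a stand-in for a real in-memory ANN index, and plain Python lists as a stand-in for the on-disk store; the function names and parameters are hypothetical and are not CompactDS's API.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(query, store, coarse_dims, n_candidates, top_k):
    """Coarse candidate generation followed by exact reranking.

    Stage 1 scores every vector on its first coarse_dims dimensions only
    (stand-in for approximate nearest neighbor retrieval).
    Stage 2 rescores just the candidates with the full vectors (exact search).
    """
    coarse_query = query[:coarse_dims]
    candidates = sorted(
        range(len(store)),
        key=lambda i: dot(coarse_query, store[i][:coarse_dims]),
        reverse=True,
    )[:n_candidates]
    reranked = sorted(candidates, key=lambda i: dot(query, store[i]), reverse=True)
    return reranked[:top_k]


# Toy datastore of four 4-dimensional passage vectors.
store = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.8, 0.0, 0.0, -1.0],
]
query = [1.0, 0.0, 0.0, 1.0]
print(two_stage_search(query, store, coarse_dims=2, n_candidates=3, top_k=2))  # [1, 0]
```

The coarse stage alone would rank passage 0 first; the exact stage promotes passage 1 because it uses the full vectors, which is the speed-versus-recall balance the abstract refers to.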

[10] La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation

Kai Liu,Bowen Xu,Shaoyu Wu,Xin Chen,Hao Zhou,Yongliang Tao,Lulu Hu

Main category: cs.CL

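TL;DR: This paper proposes LaRoSA, a method that improves LLM inference efficiency without additional training, achieving stable activation sparsification and substantial speed-up through layerwise orthogonal rotations and Top-K selection.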

Details Motivation: Existing activation sparsification methods are limited: they either require time-consuming recovery training or rely on unstable magnitude-based pruning, which restricts real-world use. The paper therefore proposes a new method that needs no additional training and delivers stable speed-up and efficient inference. Method: Layerwise orthogonal rotations transform input activations into a form better suited to sparsification, and Top-K selection within the rotated activations yields consistent model-level sparsity and reliable wall-clock time speed-up. Result: LaRoSA performs well across LLMs of various sizes and types. For LLaMA2-7B at 40% sparsity it achieves a perplexity gap of only 0.17 and a 1.30x wall-clock time speed-up, while reducing the accuracy gap on zero-shot tasks and outperforming TEAL and CATS. Conclusion: LaRoSA is an effective activation sparsification method for LLMs that improves model efficiency without additional training or magnitude-based pruning, achieving stable acceleration with little performance loss. Abstract: Activation sparsity can reduce the computational overhead and memory transfers during the forward pass of Large Language Model (LLM) inference. Existing methods face limitations, either demanding time-consuming recovery training that hinders real-world adoption, or relying on empirical magnitude-based pruning, which causes fluctuating sparsity and unstable inference speed-up. This paper introduces LaRoSA (Layerwise Rotated Sparse Activation), a novel method for activation sparsification designed to improve LLM efficiency without requiring additional training or magnitude-based pruning. We leverage layerwise orthogonal rotations to transform input activations into rotated forms that are more suitable for sparsification. By employing a Top-K selection approach within the rotated activations, we achieve consistent model-level sparsity and reliable wall-clock time speed-up. LaRoSA is effective across various sizes and types of LLMs, demonstrating minimal performance degradation and robust inference acceleration. Specifically, for LLaMA2-7B at 40% sparsity, LaRoSA achieves a mere 0.17 perplexity gap with a consistent 1.30x wall-clock time speed-up, and reduces the accuracy gap in zero-shot tasks compared to the dense model to just 0.54%, while surpassing TEAL by 1.77% and CATS by 17.14%.
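A minimal sketch of the rotate-then-Top-K idea follows. It assumes the rotation is a fixed orthogonal matrix (a normalized 4x4 Hadamard matrix here, chosen only for illustration) and that Top-K keeps the k entries with the largest absolute value; how LaRoSA actually derives its per-layer rotations is not described in this digest, so both choices are assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def top_k_sparsify(vector, k):
    """Keep the k entries with the largest absolute value, zero the rest."""
    order = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(vector)]


# Normalized 4x4 Hadamard matrix: orthogonal, so it preserves vector norms.
ROTATION = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
]


def rotated_sparse_activation(activation, rotation, k):
    """Rotate the activation, then keep exactly k entries per token."""
    return top_k_sparsify(matvec(rotation, activation), k)


print(rotated_sparse_activation([3.0, 1.0, -2.0, 0.5], ROTATION, k=2))
# [0.0, 0.0, 2.75, 2.25]
```

Because k is fixed, every token ends up with the same number of nonzero entries, which is what makes the sparsity level (and hence the speed-up) predictable. In a full layer the following weight matrix would be multiplied by the transpose of the rotation ahead of time so that the unsparsified output is unchanged.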

[11] Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs

Nifu Dan,Yujun Cai,Yiwei Wang

Main category: cs.CL

TL;DR: This paper explores the application of advanced reasoning models, such as Deepseek-R1, to complex physics problems, showing impressive accuracy and unique reasoning methods while highlighting improvements through few-shot prompting.

Details Motivation: Physics reasoning poses significant challenges for Large Language Models (LLMs), demanding deep conceptual understanding and effective problem-solving strategies. This study aims to explore how advanced reasoning models can better tackle complex physics problems. Method: The study employs advanced instruction-tuned reasoning models and evaluates their performance on the SciBench benchmark. It also investigates the impact of few-shot prompting on model accuracy. Result: Reasoning models achieved state-of-the-art accuracy in answering intricate physics questions and demonstrated distinct reasoning patterns centered around symbolic derivation. Few-shot prompting further improved model performance. Conclusion: Advanced instruction-tuned reasoning models like Deepseek-R1 demonstrate remarkable capabilities in solving complex physics problems, achieving state-of-the-art accuracy and showcasing unique symbolic derivation reasoning patterns. Abstract: Navigating the complexities of physics reasoning has long been a difficult task for Large Language Models (LLMs), requiring a synthesis of profound conceptual understanding and adept problem-solving techniques. In this study, we investigate the application of advanced instruction-tuned reasoning models, such as Deepseek-R1, to address a diverse spectrum of physics problems curated from the challenging SciBench benchmark. Our comprehensive experimental evaluation reveals the remarkable capabilities of reasoning models. Not only do they achieve state-of-the-art accuracy in answering intricate physics questions, but they also generate distinctive reasoning patterns that emphasize symbolic derivation. Furthermore, our findings indicate that even for these highly sophisticated reasoning models, the strategic incorporation of few-shot prompting can still yield measurable improvements in overall accuracy, highlighting the potential for continued performance gains.

[12] LEDOM: An Open and Fundamental Reverse Language Model

Xunjian Yin,Sitao Cheng,Yuxi Xie,Xinyu Hu,Li Lin,Xinyi Wang,Liangming Pan,William Yang Wang,Xiaojun Wan

Main category: cs.CL

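TL;DR: This paper proposes LEDOM, a new reverse language model, and demonstrates its potential across a range of tasks as well as the substantial gains of a novel application, Reverse Reward, on mathematical reasoning.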

Details Motivation: To introduce LEDOM, the first purely reverse language model, explore its potential as a foundational model across tasks, and propose a new application, Reverse Reward. Method: The LEDOM models, in 2B and 7B parameter variants, are trained autoregressively on 435B tokens and process sequences in reverse temporal order through previous token prediction. Result: Reverse Reward, a new application built on LEDOM, yields substantial performance improvements on mathematical reasoning tasks. Conclusion: LEDOM exhibits unique characteristics with broad application potential; the models, training code, and pre-training data will be released to facilitate future research. Abstract: We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants, which processes sequences in reverse temporal order through previous token prediction. For the first time, we present the reverse language model as a potential foundational model across general tasks, accompanied by a set of intriguing examples and insights. Based on LEDOM, we further introduce a novel application: Reverse Reward, where LEDOM-guided reranking of forward language model outputs leads to substantial performance improvements on mathematical reasoning tasks. This approach leverages LEDOM's unique backward reasoning capability to refine generation quality through posterior evaluation. Our findings suggest that LEDOM exhibits unique characteristics with broad application potential. We will release all models, training code, and pre-training data to facilitate future research.

[13] Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Chris Yuhao Liu,Liang Zeng,Yuzhen Xiao,Jujie He,Jiacai Liu,Chaojie Wang,Rui Yan,Wei Shen,Fuxiang Zhang,Jiacheng Xu,Yang Liu,Yahui Zhou

Main category: cs.CL

TL;DR: This paper introduces Skywork-Reward-V2, a new series of reward models trained on a large-scale curated dataset (SynPref-40M) using a human-AI synergistic approach, resulting in state-of-the-art performance and highlighting the untapped potential of improved data curation.

Details Motivation: Current state-of-the-art open reward models perform poorly on most benchmarks, failing to capture nuanced human preferences, which is hypothesized to be due to limitations in preference datasets such as narrow scope, synthetic labeling, or poor quality control. Method: A human-AI synergistic two-stage pipeline was designed for data curation, combining human annotation quality with AI scalability. The SynPref-40M dataset, containing 40 million preference pairs, was curated, and a subset of 26 million pairs was used to train Skywork-Reward-V2, a suite of eight reward models with varying parameter sizes. Result: Skywork-Reward-V2 demonstrated versatility across alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling, achieving state-of-the-art performance across seven major reward model benchmarks. Ablation studies confirmed that effectiveness stems from both data scale and high-quality curation. Conclusion: The Skywork-Reward-V2 series represents significant progress in open reward models by demonstrating the potential of existing preference datasets and human-AI synergistic curation to achieve high data quality and state-of-the-art performance. Abstract: Despite the critical role of reward models (RMs) in reinforcement learning from human feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture the spectrum of nuanced and sophisticated human preferences. Even approaches that incorporate advanced training techniques have not yielded meaningful performance improvements. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present a large-scale preference dataset comprising 40 million preference pairs, named SynPref-40M. 
To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while large language models perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling, achieving state-of-the-art performance across seven major reward model benchmarks. Ablation studies confirm that the effectiveness of our approach stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, highlighting the untapped potential of existing preference datasets and demonstrating how human-AI curation synergy can unlock significantly higher data quality.

[14] Clinical NLP with Attention-Based Deep Learning for Multi-Disease Prediction

Ting Xu,Xiaoxiao Deng,Xiandong Meng,Haifeng Yang,Yan Wu

Main category: cs.CL

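TL;DR: To handle the complexity of electronic health record text, this paper proposes a deep learning method that combines attention mechanisms with a Transformer architecture, unifying information extraction and multi-label disease prediction and showing strong performance and practical value.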

Details Motivation: To address the challenges posed by the unstructured nature and high-dimensional semantic complexity of electronic health record texts. Method: A Transformer-based architecture performs representation learning over clinical text, and multi-layer self-attention mechanisms capture key medical entities and their contextual relationships. A Sigmoid-based multi-label classifier then predicts multiple disease labels, and the model incorporates a context-aware semantic alignment mechanism. Result: Experiments show that the proposed method consistently outperforms representative existing approaches across multiple performance metrics and maintains strong generalization under varying data scales, interference levels, and model depth configurations. Conclusion: The paper proposes an attention-based deep learning method that offers an efficient algorithmic foundation for processing real-world clinical texts and has practical significance for multi-label medical text modeling tasks. Abstract: This paper addresses the challenges posed by the unstructured nature and high-dimensional semantic complexity of electronic health record texts. A deep learning method based on attention mechanisms is proposed to achieve unified modeling for information extraction and multi-label disease prediction. The study is conducted on the MIMIC-IV dataset. A Transformer-based architecture is used to perform representation learning over clinical text. Multi-layer self-attention mechanisms are employed to capture key medical entities and their contextual relationships. A Sigmoid-based multi-label classifier is then applied to predict multiple disease labels. The model incorporates a context-aware semantic alignment mechanism, enhancing its representational capacity in typical medical scenarios such as label co-occurrence and sparse information. To comprehensively evaluate model performance, a series of experiments were conducted, including baseline comparisons, hyperparameter sensitivity analysis, data perturbation studies, and noise injection tests. Results demonstrate that the proposed method consistently outperforms representative existing approaches across multiple performance metrics. The model maintains strong generalization under varying data scales, interference levels, and model depth configurations. The framework developed in this study offers an efficient algorithmic foundation for processing real-world clinical texts and presents practical significance for multi-label medical text modeling tasks.
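The sigmoid-based multi-label step can be illustrated in a few lines. The sketch assumes the encoder has already produced one logit per disease label; the label names, the logit values, and the 0.5 threshold are invented for illustration and do not come from the paper.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_labels(logits, label_names, threshold=0.5):
    """Return every label whose independent sigmoid probability meets the threshold."""
    return [
        name
        for name, z in zip(label_names, logits)
        if sigmoid(z) >= threshold
    ]


# Hypothetical per-label logits for one clinical note.
label_names = ["diabetes", "sepsis", "hypertension"]
logits = [2.0, -1.0, 0.3]
print(predict_labels(logits, label_names))  # ['diabetes', 'hypertension']
```

Unlike a softmax, each label is scored independently, so co-occurring diseases can all be predicted for the same note.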

[15] LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation

Tianyu Liu,Qitan Lv,Hao Li,Xing Gao,Xiao Sun

Main category: cs.CL

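TL;DR: LogitSpec is an efficient LLM inference acceleration method that uses logits to speculate future tokens and improve retrieval, increasing decoding speed and efficiency and showing promise for practical use.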

Details Motivation: Retrieval-based SD methods often fail to find matched and accurate draft tokens; LogitSpec is proposed to expand the retrieval range and improve draft-token accuracy. Method: LogitSpec generates draft tokens in two steps: (1) using the last logit to speculate the next next token; (2) retrieving relevant references for both the next token and the next next token. Result: Experiments show that LogitSpec achieves up to 2.61× speedup and 3.28 mean accepted tokens per decoding step. Conclusion: LogitSpec is a training-free, plug-and-play framework that effectively expands the retrieval range, integrates easily into existing LLM inference frameworks, and delivers notable speedups and decoding efficiency across a range of text generation benchmarks. Abstract: Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many endeavors to improve SD are to eliminate the need for a draft model and generate draft tokens in a retrieval-based manner in order to further alleviate the drafting overhead and significantly reduce the difficulty in deployment and applications. However, retrieval-based SD relies on a matching paradigm to retrieve the most relevant reference as the draft tokens, where these methods often fail to find matched and accurate draft tokens. To address this challenge, we propose LogitSpec to effectively expand the retrieval range and find the most relevant reference as drafts. Our LogitSpec is motivated by the observation that the logit of the last token can not only predict the next token, but also speculate the next next token. Specifically, LogitSpec generates draft tokens in two steps: (1) utilizing the last logit to speculate the next next token; (2) retrieving relevant reference for both the next token and the next next token. LogitSpec is training-free and plug-and-play, which can be easily integrated into existing LLM inference frameworks. Extensive experiments on a wide range of text generation benchmarks demonstrate that LogitSpec can achieve up to 2.61× speedup and 3.28 mean accepted tokens per decoding step. Our code is available at https://github.com/smart-lty/LogitSpec.
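
The matching paradigm that retrieval-based SD relies on can be sketched as a simple n-gram lookup over the existing context. This is a generic illustration, not LogitSpec's implementation: the function name `retrieve_draft`, the n-gram length, and the draft length are all assumptions.

```python
def retrieve_draft(context: list[str], ngram: int = 2, max_draft: int = 3) -> list[str]:
    """Return the tokens that followed the most recent earlier occurrence
    of the last `ngram` tokens of the context, or [] if there is no match."""
    if len(context) <= ngram:
        return []
    key = context[-ngram:]
    # Scan backwards over earlier positions, skipping the suffix itself.
    for start in range(len(context) - ngram - 1, -1, -1):
        if context[start:start + ngram] == key:
            return context[start + ngram:start + ngram + max_draft]
    return []


tokens = "the cat sat on the mat and the cat".split()
draft = retrieve_draft(tokens)
print(draft)
```

When the suffix has no earlier match, the lookup returns nothing, which is the failure case the abstract describes. Under the abstract's description, LogitSpec would additionally retrieve with a speculated next next token to widen the set of candidate matches.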

[16] Evaluating the Effectiveness of Direct Preference Optimization for Personalizing German Automatic Text Simplifications for Persons with Intellectual Disabilities

Yingqiang Gao,Kaede Johnson,David Froehlich,Luisa Carrer,Sarah Ebling

Main category: cs.CL

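TL;DR: This paper proposes improving automatic text simplification systems by incorporating target-group feedback, using direct preference optimization (DPO) to personalize LLM-based text simplification and improve its quality.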

Details Motivation: Current LLM-based automatic text simplification systems do not integrate preference feedback from target users, so they cannot be personalized. To overcome this, the authors bring target-group feedback into model training to improve the systems' adaptability and practical usefulness. Method: The paper extends standard supervised fine-tuning (SFT) with direct preference optimization (DPO), post-training models on human feedback collected from persons with intellectual disabilities. It also proposes a complete pipeline covering data collection, model selection, SFT and DPO post-training, and evaluation. Result: The study indicates that incorporating the target group's preference feedback into training can substantially improve the personalization of text simplification systems and better meet the needs of specific users. It also shows how developing AI systems in collaboration with the target group can align them more closely with human expectations. Conclusion: The paper concludes that direct preference optimization (DPO) using feedback from the target group (persons with intellectual disabilities) can effectively improve the personalization of LLM-based automatic text simplification systems. The approach underlines the importance of combining expert knowledge with the target group's actual needs when designing inclusive AI systems. Abstract: Automatic text simplification (ATS) aims to enhance language accessibility for various target groups, particularly persons with intellectual disabilities. Recent advancements in generative AI, especially large language models (LLMs), have substantially improved the quality of machine-generated text simplifications, thereby mitigating information barriers for the target group. However, existing LLM-based ATS systems do not incorporate preference feedback on text simplifications during training, resulting in a lack of personalization tailored to the specific needs of target group representatives. In this work, we extend the standard supervised fine-tuning (SFT) approach for adapting LLM-based ATS models by leveraging a computationally efficient LLM alignment technique -- direct preference optimization (DPO). Specifically, we post-train LLM-based ATS models using human feedback collected from persons with intellectual disabilities, reflecting their preferences on paired text simplifications generated by mainstream LLMs. Furthermore, we propose a pipeline for developing personalized LLM-based ATS systems, encompassing data collection, model selection, SFT and DPO post-training, and evaluation. Our findings underscore the necessity of active participation of target group persons in designing personalized AI accessibility solutions aligned with human expectations.
This work represents a step towards personalizing inclusive AI systems at the target-group level, incorporating insights not only from text simplification experts but also from target group persons themselves.
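
For readers unfamiliar with DPO, the per-pair loss can be sketched as follows. This is the standard DPO objective for one preferred/rejected pair, not code from the paper: the argument names, the example log-probabilities, and `beta=0.1` are assumptions for illustration.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Implicit rewards: how much more the policy likes each text than the reference does.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Hypothetical log-probabilities of two simplifications of the same source text.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)      # policy identical to reference
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)     # policy now favors the chosen text
print(neutral, improved)
```

The loss falls as the policy assigns relatively more probability to the simplification that the annotator preferred, measured against the frozen reference model.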

[17] Efficient Out-of-Scope Detection in Dialogue Systems via Uncertainty-Driven LLM Routing

Álvaro Zaera,Diana Nicoleta Popa,Ivan Sekulic,Paolo Rosso

Main category: cs.CL

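TL;DR: This paper proposes a modular framework that combines uncertainty modeling with fine-tuned large language models to detect out-of-scope intents in task-oriented dialogue systems.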

Details Motivation: Ensuring that task-oriented dialogue systems are robust to unseen and ambiguous queries is a critical challenge, and existing methods do not effectively balance computational efficiency and performance. Method: Uncertainty estimation is first applied to the output of an in-scope intent detection classifier already deployed in a real-world TODS; a fine-tuned LLM then makes the final decision on instances with high uncertainty. Result: A novel but simple modular framework is proposed that combines uncertainty modeling with fine-tuned large language models (LLMs) for efficient and accurate OOS detection. Conclusion: Experiments show that the method balances computational efficiency and performance, combining traditional approaches with LLMs and achieving state-of-the-art results on key OOS detection benchmarks. Abstract: Out-of-scope (OOS) intent detection is a critical challenge in task-oriented dialogue systems (TODS), as it ensures robustness to unseen and ambiguous queries. In this work, we propose a novel but simple modular framework that combines uncertainty modeling with fine-tuned large language models (LLMs) for efficient and accurate OOS detection. The first step applies uncertainty estimation to the output of an in-scope intent detection classifier, which is currently deployed in a real-world TODS handling tens of thousands of user interactions daily. The second step then leverages an emerging LLM-based approach, where a fine-tuned LLM is triggered to make a final decision on instances with high uncertainty. Unlike prior approaches, our method effectively balances computational efficiency and performance, combining traditional approaches with LLMs and yielding state-of-the-art results on key OOS detection benchmarks, including real-world OOS data acquired from a deployed TODS.
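
The two-step routing idea can be sketched as below. The abstract does not say which uncertainty measure or threshold is used, so predictive entropy, the threshold value, the intent names, and the `llm_decide` callback are all assumptions; the callback stands in for the fine-tuned LLM.

```python
import math
from typing import Callable, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def route(probs: Sequence[float], labels: Sequence[str], threshold: float,
          llm_decide: Callable[[], str]) -> str:
    """Trust the cheap classifier when it is confident; otherwise defer to the LLM."""
    if entropy(probs) <= threshold:
        best = max(range(len(probs)), key=lambda i: probs[i])
        return labels[best]
    return llm_decide()


intents = ["book_flight", "cancel_booking", "refund"]  # hypothetical in-scope intents


def fake_llm() -> str:
    """Stand-in for the fine-tuned LLM's final decision."""
    return "out_of_scope"


confident = route([0.90, 0.05, 0.05], intents, threshold=0.8, llm_decide=fake_llm)
uncertain = route([0.34, 0.33, 0.33], intents, threshold=0.8, llm_decide=fake_llm)
print(confident, uncertain)
```

Only the uncertain fraction of traffic reaches the expensive model, which is where the efficiency gain described in the abstract would come from.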

[18] Is External Information Useful for Stance Detection with LLMs?

Quang Minh Nguyen,Taegyoon Kim

Main category: cs.CL

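TL;DR: The study finds that, unlike BERT-based models, large language models (LLMs) perform worse on stance detection when given external information (such as Wikipedia and web search), mainly because they tend to follow the stance and sentiment of that information rather than predict the ground-truth label.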

Details Motivation: Prior work shows that external information improves BERT-based models on stance detection, but whether it also helps widely used large language models (LLMs) is unclear; this study aims to fill that gap. Method: The authors systematically evaluate eight LLMs on 12 targets from three datasets, comparing the effect of Wikipedia and web search external information on stance detection and analyzing experimentally why LLMs lean the way they do. They also test whether chain-of-thought prompting and fine-tuning mitigate the degradation. Result: In most cases external information degrades LLM stance detection performance, with macro F1 scores dropping by up to 27.9%. LLMs tend to predict according to the stance and sentiment of the provided information rather than the true stance of the text. The degradation persists with chain-of-thought prompting, and fine-tuning only partly mitigates it. Conclusion: Although external information (such as Wikipedia and web search) improves stance detection in BERT-based systems, it can degrade performance in large language models (LLMs), mainly because LLMs are easily swayed by the stance and sentiment of the provided information rather than relying on the true stance of the text. Abstract: In the stance detection task, a text is classified as either favorable, opposing, or neutral towards a target. Prior work suggests that the use of external information, e.g., excerpts from Wikipedia, improves stance detection performance. However, whether or not such information can benefit large language models (LLMs) remains an unanswered question, despite their wide adoption in many reasoning tasks. In this study, we conduct a systematic evaluation on how Wikipedia and web search external information can affect stance detection across eight LLMs and in three datasets with 12 targets. Surprisingly, we find that such information degrades performance in most cases, with macro F1 scores dropping by up to 27.9%. We explain this through experiments showing LLMs' tendency to align their predictions with the stance and sentiment of the provided information rather than the ground truth stance of the given text. We also find that performance degradation persists with chain-of-thought prompting, while fine-tuning mitigates but does not fully eliminate it. Our findings, in contrast to previous literature on BERT-based systems which suggests that external information enhances performance, highlight the risks of information biases in LLM-based stance classifiers. Code is available at https://github.com/ngqm/acl2025-stance-detection.
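
Since the headline result is reported in macro F1, a small worked computation may help. The sketch below is not the authors' evaluation code, and the gold and predicted stance labels are invented for illustration. Macro F1 averages the per-class F1 scores with equal weight, so a class that is never predicted pulls the score down sharply.

```python
def macro_f1(gold: list[str], pred: list[str], labels: list[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up example: the "neutral" text is misclassified as "favor".
gold = ["favor", "against", "neutral", "favor"]
pred = ["favor", "against", "favor", "favor"]
score = macro_f1(gold, pred, ["favor", "against", "neutral"])
print(round(score, 3))
```

Here the per-class scores are 0.8 (favor), 1.0 (against), and 0.0 (neutral), giving a macro F1 of 0.6 even though three of four predictions are correct.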

[19] Emotionally Intelligent Task-oriented Dialogue Systems: Architecture, Representation, and Optimisation

Shutong Feng,Hsien-chin Lin,Nurul Lubis,Carel van Niekerk,Michael Heck,Benjamin Ruppik,Renato Vukovic,Milica Gašić

Main category: cs.CL

TL;DR: This paper introduces LUSTER, a new method for building more effective and emotionally intelligent Task-oriented Dialogue systems by combining large language models with structured reward modeling.

Details Motivation: Despite advances in large language models, building effective and emotionally intelligent Task-oriented Dialogue (ToD) systems remains a challenge due to the need to optimize for task success, emotional understanding, and precise information conveyance in noisy environments. Method: The authors propose LUSTER, an LLM-based unified system for task-oriented dialogue using end-to-end reinforcement learning with both short-term and long-term rewards. Result: The findings show that combining LLMs with reward modeling improves the performance of ToD systems in terms of resilience and emotional responsiveness. Conclusion: The study concludes that integrating LLM capabilities with structured reward modeling enhances the resilience and emotional responsiveness of ToD systems. Abstract: Task-oriented dialogue (ToD) systems are designed to help users achieve specific goals through natural language interaction. While recent advances in large language models (LLMs) have significantly improved linguistic fluency and contextual understanding, building effective and emotionally intelligent ToD systems remains a complex challenge. Effective ToD systems must optimise for task success, emotional understanding and responsiveness, and precise information conveyance, all within inherently noisy and ambiguous conversational environments. In this work, we investigate architectural, representational, optimisational as well as emotional considerations of ToD systems. We set up systems covering these design considerations with a challenging evaluation environment composed of a natural-language user simulator coupled with an imperfect natural language understanding module. We propose LUSTER, an LLM-based Unified System for Task-oriented dialogue with End-to-end Reinforcement learning with both short-term (user sentiment) and long-term (task success) rewards.
Our findings demonstrate that combining LLM capability with structured reward modelling leads to more resilient and emotionally responsive ToD systems, offering a practical path forward for next-generation conversational agents.

[20] Chart Question Answering from Real-World Analytical Narratives

Maeve Hutchinson,Radu Jianu,Aidan Slingsby,Jo Wood,Pranava Madhyastha

Main category: cs.CL

TL;DR: This paper introduces a new, realistic dataset for chart question answering (CQA), revealing that even top models like GPT-4.1 struggle with only 69.3% accuracy, highlighting the challenge of real-world CQA tasks.

Details Motivation: The motivation behind this work is to create a more ecologically valid benchmark for chart question answering that reflects real-world reasoning workflows, unlike prior benchmarks. Method: A new dataset for chart question answering (CQA) was constructed from visualization notebooks, featuring real-world, multi-view charts and natural language questions based on analytical narratives. Result: State-of-the-art multimodal large language models were benchmarked, showing a significant performance gap in the new CQA setting, with GPT-4.1 scoring an accuracy of 69.3%. Conclusion: The paper concludes that there is a significant performance gap in handling real-world CQA tasks, with even advanced models like GPT-4.1 achieving only 69.3% accuracy. Abstract: We present a new dataset for chart question answering (CQA) constructed from visualization notebooks. The dataset features real-world, multi-view charts paired with natural language questions grounded in analytical narratives. Unlike prior benchmarks, our data reflects ecologically valid reasoning workflows. Benchmarking state-of-the-art multimodal large language models reveals a significant performance gap, with GPT-4.1 achieving an accuracy of 69.3%, underscoring the challenges posed by this more authentic CQA setting.

[21] Confidence and Stability of Global and Pairwise Scores in NLP Evaluation

Georgii Levtsov,Dmitry Ustalov

Main category: cs.CL

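TL;DR: This paper compares two model evaluation approaches in natural language processing, global scores and pairwise comparisons, finding that global scores are better suited to overall ranking, while pairwise comparisons are good at identifying strong models in specific settings but need more data.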

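Details Motivation: With the advent of highly capable instruction-tuned neural language models, benchmarking in natural language processing (NLP) is shifting from traditional global pointwise scores (e.g., GLUE, BIG-bench, SWE-bench) towards pairwise comparison leaderboards (such as LMSYS Arena). The paper empirically studies the strengths and weaknesses of global scores and pairwise comparisons to help in choosing an appropriate model evaluation strategy. Method: Computational experiments on synthetic and real-world datasets, using standard global metrics and the popular Bradley-Terry model for pairwise comparisons. Result: Global scores provide more reliable overall rankings but can underestimate strong models with rare, significant errors or low confidence. Pairwise comparisons are particularly effective for identifying strong contenders among models with lower global scores, especially where quality metrics are hard to define, but they need more comparisons to converge if ties are frequent. Conclusion: The paper concludes that although global scores are more reliable for overall ranking, they may underestimate strong models with rare, significant errors or low confidence. Conversely, where quality metrics are hard to define (e.g., text generation), pairwise comparisons are especially suited to identifying strong contenders among models with lower global scores, although they need more comparisons to converge when ties are frequent. Abstract: With the advent of highly capable instruction-tuned neural language models, benchmarking in natural language processing (NLP) is increasingly shifting towards pairwise comparison leaderboards, such as LMSYS Arena, from traditional global pointwise scores (e.g., GLUE, BIG-bench, SWE-bench). This paper empirically investigates the strengths and weaknesses of both global scores and pairwise comparisons to aid decision-making in selecting appropriate model evaluation strategies. Through computational experiments on synthetic and real-world datasets using standard global metrics and the popular Bradley-Terry model for pairwise comparisons, we found that while global scores provide more reliable overall rankings, they can underestimate strong models with rare, significant errors or low confidence. Conversely, pairwise comparisons are particularly effective for identifying strong contenders among models with lower global scores, especially where quality metrics are hard to define (e.g., text generation), though they require more comparisons to converge if ties are frequent. Our code and data are available at https://github.com/HSPyroblast/srw-ranking under a permissive license.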
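
As background on the Bradley-Terry model mentioned above, the following sketch fits item strengths from win counts using the classic iterative minorization-maximization update. It is an illustration only, not the authors' code: the function name, the iteration count, and the win counts are assumptions, and ties (which the abstract singles out as slowing convergence) are ignored here.

```python
def fit_bradley_terry(wins: dict[tuple[int, int], int], n_items: int,
                      iters: int = 200) -> list[float]:
    """Estimate Bradley-Terry strengths; wins[(i, j)] = times item i beat item j."""
    p = [1.0] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in range(n_items):
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)  # games between i and j
                denom += n_ij / (p[i] + p[j])
            new_p.append(total_wins / denom if denom else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]  # normalize so strengths sum to 1
    return p


# Two models: model 0 beat model 1 in 3 of 4 comparisons.
two = fit_bradley_terry({(0, 1): 3, (1, 0): 1}, n_items=2)

# Three models with made-up win counts.
three = fit_bradley_terry(
    {(0, 1): 8, (1, 0): 2, (1, 2): 7, (2, 1): 3, (0, 2): 9, (2, 0): 1}, n_items=3
)
print(two, three)
```

Under the model, the probability that item i beats item j is p[i] / (p[i] + p[j]), so the two-model case recovers strengths in the same 3:1 ratio as the observed wins.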

[22] Adapting Language Models to Indonesian Local Languages: An Empirical Study of Language Transferability on Zero-Shot Settings

Rifki Afina Putri

Main category: cs.CL

TL;DR: This paper shows that MAD-X improves sentiment analysis transferability to low-resource Indonesian languages, especially when the model has prior exposure to the language or related ones, without needing labeled target data.

Details Motivation: To understand how well pre-trained language models can be transferred to low-resource Indonesian local languages for sentiment analysis, particularly exploring if adapter modules can improve performance without labeled target data. Method: The paper evaluates zero-shot performance and adapter-based transfer using various models like monolingual Indonesian BERT, mBERT, XLM-R, and MAD-X on sentiment analysis across ten Indonesian local languages. They categorize languages into seen, partially seen, and unseen based on pre-training data. Result: Multilingual models perform best on seen languages, moderate on partially seen, and poorly on unseen languages. MAD-X notably improves performance across seen and partially seen languages without requiring target language data. Tokenization factors weakly correlate with performance; model exposure to the language proves a stronger predictor. Conclusion: The study concludes that MAD-X significantly enhances transferability to low-resource Indonesian local languages, especially for seen and partially seen languages, without needing labeled data. Model performance is largely influenced by prior exposure to the language or related languages. Abstract: In this paper, we investigate the transferability of pre-trained language models to low-resource Indonesian local languages through the task of sentiment analysis. We evaluate both zero-shot performance and adapter-based transfer on ten local languages using models of different types: a monolingual Indonesian BERT, multilingual models such as mBERT and XLM-R, and a modular adapter-based approach called MAD-X. To better understand model behavior, we group the target languages into three categories: seen (included during pre-training), partially seen (not included but linguistically related to seen languages), and unseen (absent and unrelated in pre-training data). 
Our results reveal clear performance disparities across these groups: multilingual models perform best on seen languages, moderately on partially seen ones, and poorly on unseen languages. We find that MAD-X significantly improves performance, especially for seen and partially seen languages, without requiring labeled data in the target language. Additionally, we conduct a further analysis on tokenization and show that while subword fragmentation and vocabulary overlap with Indonesian correlate weakly with prediction quality, they do not fully explain the observed performance. Instead, the most consistent predictor of transfer success is the model's prior exposure to the language, either directly or through a related language.

[23] AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness

Zixin Chen,Hongzhan Lin,Kaixin Li,Ziyang Luo,Zhen Ye,Guang Chen,Zhiyong Huang,Jing Ma

Main category: cs.CL

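TL;DR: This paper proposes AdamMeme, a new evaluation framework for more effectively assessing how well multimodal large language models understand harmful memes.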

Details Motivation: Existing benchmarks rely on static datasets and accuracy-based evaluation, so they cannot provide up-to-date and thorough assessments, because online memes evolve dynamically. Method: Through multi-agent collaboration, AdamMeme iteratively updates the meme data with challenging samples, exposing specific limitations in how multimodal large language models interpret harmfulness. Result: Experiments show that the framework systematically reveals the varying performance of different target multimodal large language models and provides in-depth, fine-grained analyses of model-specific weaknesses. Conclusion: AdamMeme is a flexible, agent-based evaluation framework that can comprehensively assess the ability of multimodal large language models (mLLMs) to understand harmful memes. Abstract: The proliferation of multimodal memes in the social media era demands that multimodal Large Language Models (mLLMs) effectively understand meme harmfulness. Existing benchmarks for assessing mLLMs on harmful meme understanding rely on accuracy-based, model-agnostic evaluations using static datasets. These benchmarks are limited in their ability to provide up-to-date and thorough assessments, as online memes evolve dynamically. To address this, we propose AdamMeme, a flexible, agent-based evaluation framework that adaptively probes the reasoning capabilities of mLLMs in deciphering meme harmfulness. Through multi-agent collaboration, AdamMeme provides comprehensive evaluations by iteratively updating the meme data with challenging samples, thereby exposing specific limitations in how mLLMs interpret harmfulness. Extensive experiments show that our framework systematically reveals the varying performance of different target mLLMs, offering in-depth, fine-grained analyses of model-specific weaknesses. Our code is available at https://github.com/Lbotirx/AdamMeme.

[24] Stereotype Detection as a Catalyst for Enhanced Bias Detection: A Multi-Task Learning Approach

Aditya Tomar,Rudra Murthy,Pushpak Bhattacharyya

Main category: cs.CL

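TL;DR: This paper introduces StereoBias, a new dataset for detecting bias and stereotypes, and shows that jointly training on both tasks significantly improves bias detection.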

Details Motivation: Bias and stereotypes in language models can cause harm in sensitive areas such as content moderation and decision-making, so it is worth studying how to improve bias detection. Method: The authors introduce a new dataset, StereoBias, and run experiments comparing encoder-only models with decoder-only models fine-tuned using QLoRA. Result: Joint training significantly outperforms separate training on bias detection, and additional experiments with sentiment analysis confirm that the connection between bias and stereotypes is what drives the improvement. Conclusion: The study establishes the effectiveness of joint training for bias and stereotype detection and stresses the value of leveraging stereotype information to build fairer and more effective AI systems. Abstract: Bias and stereotypes in language models can cause harm, especially in sensitive areas like content moderation and decision-making. This paper addresses bias and stereotype detection by exploring how jointly learning these tasks enhances model performance. We introduce StereoBias, a unique dataset labeled for bias and stereotype detection across five categories: religion, gender, socio-economic status, race, profession, and others, enabling a deeper study of their relationship. Our experiments compare encoder-only models and fine-tuned decoder-only models using QLoRA. While encoder-only models perform well, decoder-only models also show competitive results. Crucially, joint training on bias and stereotype detection significantly improves bias detection compared to training them separately. Additional experiments with sentiment analysis confirm that the improvements stem from the connection between bias and stereotypes, not multi-task learning alone. These findings highlight the value of leveraging stereotype information to build fairer and more effective AI systems.

Oliver Wardas,Florian Matthes

Main category: cs.CL

TL;DR: This paper explores how large language models can help assess the legality of employment contracts, finding that while they show potential, they're not as reliable as human lawyers.

Details Motivation: The need to address the interpretability and trustworthiness of data-driven NLP approaches in dynamic legal environments, particularly for evaluating legality in employment contracts. Method: Collaboration with legal experts, extension of an existing dataset, and evaluation of LLMs using in-context learning across three legal context variants. Result: Examination guidelines significantly improved recall for void clauses and weighted F1-Score (80%), while full-text sources only moderately enhanced performance; however, LLMs still underperformed compared to human lawyers. Conclusion: LLMs hold promise for aiding legal contract reviews but are not yet a substitute for human expertise. Abstract: Legal work, characterized by its text-heavy and resource-intensive nature, presents unique challenges and opportunities for NLP research. While data-driven approaches have advanced the field, their lack of interpretability and trustworthiness limits their applicability in dynamic legal environments. To address these issues, we collaborated with legal experts to extend an existing dataset and explored the use of Large Language Models (LLMs) and in-context learning to evaluate the legality of clauses in German employment contracts. Our work evaluates the ability of different LLMs to classify clauses as "valid," "unfair," or "void" under three legal context variants: no legal context, full-text sources of laws and court rulings, and distilled versions of these (referred to as examination guidelines). Results show that full-text sources moderately improve performance, while examination guidelines significantly enhance recall for void clauses and weighted F1-Score, reaching 80%. Despite these advancements, LLMs' performance when using full-text sources remains substantially below that of human lawyers.
We contribute an extended dataset, including examination guidelines, referenced legal sources, and corresponding annotations, alongside our code and all log files. Our findings highlight the potential of LLMs to assist lawyers in contract legality review while also underscoring the limitations of the methods presented.

[26] Data interference: emojis, homoglyphs, and issues of data fidelity in corpora and their results

Matteo Di Cristofaro

Main category: cs.CL

TL;DR: This paper explores the impact of tokenisation discrepancies on corpus linguistics, particularly focusing on emojis and homoglyphs, and proposes methods for accurate text representation to ensure reliable linguistic analysis.

Details Motivation: Tokenisation is crucial for corpus linguistics as it underpins quantitative methods and ensures reliability in qualitative approaches. This study examines how discrepancies in tokenisation can affect data representation and analytical validity. Method: The study investigates the challenges in tokenisation caused by emojis and homoglyphs and presents methods to preprocess these elements for accurate representation of digital texts in corpora. Result: The research highlights the necessity of preprocessing emojis and homoglyphs to maintain corpus fidelity and support reliable linguistic analysis, ensuring repeatability of interpretations. Conclusion: The paper concludes that a detailed understanding of linguistic and technical aspects of digital textual data is essential for enhancing the accuracy of corpus analysis, affecting both quantitative and qualitative approaches. Abstract: Tokenisation - "the process of splitting text into atomic parts" (Brezina & Timperley, 2017: 1) - is a crucial step for corpus linguistics, as it provides the basis for any applicable quantitative method (e.g. collocations) while ensuring the reliability of qualitative approaches. This paper examines how discrepancies in tokenisation affect the representation of language data and the validity of analytical findings: investigating the challenges posed by emojis and homoglyphs, the study highlights the necessity of preprocessing these elements to maintain corpus fidelity to the source data. The research presents methods for ensuring that digital texts are accurately represented in corpora, thereby supporting reliable linguistic analysis and guaranteeing the repeatability of linguistic interpretations. 
The findings emphasise the necessity of a detailed understanding of both linguistic and technical aspects involved in digital textual data to enhance the accuracy of corpus analysis, and have significant implications for both quantitative and qualitative approaches in corpus-based research.
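
To make the homoglyph problem concrete, the sketch below shows two strings that look the same on screen but count as different types in a frequency list. It illustrates the issue only and is not the preprocessing method the paper proposes; the example word and the helper name `describe_non_ascii` are assumptions.

```python
import unicodedata
from collections import Counter

latin = "paypal"
# Same visual shape, but with CYRILLIC SMALL LETTER A (U+0430) replacing Latin "a".
mixed = "p\u0430yp\u0430l"

# A naive frequency list treats the two spellings as different word types.
counts = Counter([latin, mixed])


def describe_non_ascii(token: str) -> list[tuple[str, str]]:
    """List each non-ASCII character in a token with its Unicode name."""
    return [(ch, unicodedata.name(ch)) for ch in token if ord(ch) > 127]


print(len(counts))                # number of distinct types
print(describe_non_ascii(mixed))  # reveals the Cyrillic characters

# Compatibility normalization (NFKC) does not merge cross-script look-alikes.
still_different = unicodedata.normalize("NFKC", mixed) != latin
```

Inspecting Unicode names in this way is one simple means of spotting such characters before tokenisation, so that frequency and collocation counts are not silently split across look-alike forms.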

[27] MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining

Zhixun Chen,Ping Guo,Wenhan Han,Yifan Zhang,Binbin Liu,Haobin Lin,Fengze Liu,Yan Zhao,Bingni Zhang,Taifeng Wang,Yin Zheng,Meng Fang

Main category: cs.CL

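TL;DR: MuRating is a scalable framework that transfers English data-quality signals into a single rater for 17 target languages, improving the performance of multilingual models in pretraining.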

Details Motivation: Existing model-based data selection methods focus mainly on English and offer little multilingual support, so a method is needed that can transfer high-quality English data-quality signals across languages. Method: MuRating aggregates multiple English "raters" via pairwise comparisons to learn unified document-quality scores, then projects these judgments through translation into the target languages to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs. Result: Applied to web data, MuRating selects balanced subsets of English and multilingual content to pretrain a 1.2B-parameter LLaMA model, achieving higher average accuracy across benchmarks than strong baselines such as QuRater, AskLLM, and DCLM, with especially large gains on knowledge-intensive tasks. Conclusion: MuRating effectively addresses the lack of data-quality assessment for non-English languages and points to translation fidelity, selection biases, and the representation of narrative material as directions for future work. Abstract: Data quality is a critical driver of large language model performance, yet existing model-based selection methods focus almost exclusively on English. We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a single rater for 17 target languages. MuRating aggregates multiple English "raters" via pairwise comparisons to learn unified document-quality scores, then projects these judgments through translation to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs. Applied to web data, MuRating selects balanced subsets of English and multilingual content to pretrain a 1.2B-parameter LLaMA model. Compared to strong baselines, including QuRater, AskLLM, DCLM and so on, our approach boosts average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on knowledge-intensive tasks. We further analyze translation fidelity, selection biases, and underrepresentation of narrative material, outlining directions for future work.

[28] Probing Evaluation Awareness of Language Models

Jord Nguyen,Khiem Hoang,Carlo Leonardo Attubato,Felix Hofstätter

Main category: cs.CL

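TL;DR: The paper studies evaluation awareness in Llama-3.3-70B-Instruct, finding that linear probes can separate evaluation prompts from deployment prompts and that current safety evaluations already appear inauthentic to the model. It stresses the importance of trustworthy evaluations and of understanding deceptive capabilities.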

Details Motivation: Language models may be evaluation-aware, which could undermine the reliability of evaluations used in AI governance frameworks and industry commitments. Method: Linear probes are used to analyze how well Llama-3.3-70B-Instruct separates real-world evaluation prompts from deployment prompts. Result: Linear probes can separate evaluation and deployment prompts, and current safety evaluations are recognized by the model as artificial or inauthentic. Conclusion: The study stresses the importance of ensuring trustworthy evaluations and understanding models' deceptive capabilities, and shows how model internals can support blackbox methods in safety audits. Abstract: Language models can distinguish between testing and deployment phases -- a capability known as evaluation awareness. This has significant safety and policy implications, potentially undermining the reliability of evaluations that are central to AI governance frameworks and voluntary industry commitments. In this paper, we study evaluation awareness in Llama-3.3-70B-Instruct. We show that linear probes can separate real-world evaluation and deployment prompts, suggesting that current models internally represent this distinction. We also find that current safety evaluations are correctly classified by the probes, suggesting that they already appear artificial or inauthentic to models. Our findings underscore the importance of ensuring trustworthy evaluations and understanding deceptive capabilities. More broadly, our work showcases how model internals may be leveraged to support blackbox methods in safety audits, especially for future models more competent at evaluation awareness and deception.

[29] How Do Vision-Language Models Process Conflicting Information Across Modalities?

Tianze Hua,Tian Yun,Ellie Pavlick

Main category: cs.CL

TL;DR: This paper investigates how multimodal AI models resolve conflicts between different input types, revealing that they tend to favor certain modalities, and that specific internal mechanisms can control and improve this decision-making process.

Details Motivation: As AI models become increasingly multimodal, understanding how they handle conflicting input streams is crucial for developing more reliable and controllable systems in complex environments. Method: The paper uses vision-language models, providing them with inconsistent inputs (e.g., an image of a dog paired with a caption saying 'A photo of a cat') and asks the model to report information from a specific modality. The researchers analyze the model's behavior and internal representational structure. Result: Models were found to favor one modality (e.g., visual or textual) when faced with conflicting inputs, and this preference was detectable in their internal representations. Specific attention heads could restructure these representations to shift modality preference, while modality-agnostic router heads helped direct responses based on the instruction's focus. Conclusion: The study concludes that multimodal AI models often favor one modality over another when presented with conflicting inputs, and this preference is evident in the model's internal structure. Specific attention mechanisms can manipulate this preference, leading to improved performance across modalities and datasets. Abstract: AI models are increasingly required to be multimodal, integrating disparate input streams into a coherent state representation on which subsequent behaviors and actions can be based. This paper seeks to understand how such models behave when input streams present conflicting information. Focusing specifically on vision-language models, we provide inconsistent inputs (e.g., an image of a dog paired with the caption "A photo of a cat") and ask the model to report the information present in one of the specific modalities (e.g., "What does the caption say / What is in the image?"). 
We find that models often favor one modality over the other, e.g., reporting the image regardless of what the caption says, but that different models differ in which modality they favor. We find evidence that the behaviorally preferred modality is evident in the internal representational structure of the model, and that specific attention heads can restructure the representations to favor one modality over the other. Moreover, we find modality-agnostic "router heads" which appear to promote answers about the modality requested in the instruction, and which can be manipulated or transferred in order to improve performance across datasets and modalities. Together, the work provides essential steps towards identifying and controlling if and how models detect and resolve conflicting signals within complex multimodal environments.

[30] The Anatomy of Evidence: An Investigation Into Explainable ICD Coding

Katharina Beckh,Elisa Studeny,Sujan Sai Gannamaneni,Dario Antweiler,Stefan Rüping

Main category: cs.CL

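TL;DR: This study uses the MDACE dataset to evaluate the plausibility of explainable medical coding systems and offers recommendations for developing and evaluating them.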

Details Motivation: Transparency matters in medical coding, but because annotated data is scarce, evaluation of existing explainability methods has been mostly limited to short texts and binary settings. The MDACE dataset introduced by Cheng et al. (2023) provides a valuable resource for this area. Method: The authors analyze the MDACE dataset in depth and investigate how much state-of-the-art approaches overlap with the ground truth evidence. Result: Ground truth evidence aligns with code descriptions to a certain degree, and state-of-the-art approaches show a high overlap with ground truth evidence. The study also proposes match measures and highlights success and failure cases. Conclusion: The study provides an in-depth analysis of the MDACE dataset and a plausibility evaluation of current explainable medical coding systems from an applied perspective, showing that ground truth evidence aligns with code descriptions to a certain degree, and it proposes match measures along with recommendations for developing and evaluating explainable medical coding systems. Abstract: Automatic medical coding has the potential to ease documentation and billing processes. For this task, transparency plays an important role for medical coders and regulatory bodies, which can be achieved using explainability methods. However, the evaluation of these approaches has been mostly limited to short text and binary settings due to a scarcity of annotated data. Recent efforts by Cheng et al. (2023) have introduced the MDACE dataset, which provides a valuable resource containing code evidence in clinical records. In this work, we conduct an in-depth analysis of the MDACE dataset and perform plausibility evaluation of current explainable medical coding systems from an applied perspective. With this, we contribute to a deeper understanding of automatic medical coding and evidence extraction. Our findings reveal that ground truth evidence aligns with code descriptions to a certain degree. An investigation into state-of-the-art approaches shows a high overlap with ground truth evidence. We propose match measures and highlight success and failure cases. Based on our findings, we provide recommendations for developing and evaluating explainable medical coding systems.

[31] Evaluating Structured Output Robustness of Small Language Models for Open Attribute-Value Extraction from Clinical Notes

Nikita Neveditsin,Pawan Lingras,Vijay Mago

Main category: cs.CL

TL;DR: This paper compares JSON, YAML, and XML for structured output parsing in clinical NLP tasks, finding JSON as the most effective format.

Details Motivation: To determine which serialization format offers the best performance for structured output generation from small language models in privacy-sensitive clinical environments. Method: A comparative analysis was conducted on three serialization formats (JSON, YAML, XML) to assess their parseability when used for structured output generation by small language models in clinical note processing. Result: JSON consistently showed the highest parseability; structural robustness improved with targeted prompting and larger models but declined with longer documents and specific note types. Conclusion: The study concludes that JSON is the most parseable serialization format for structured outputs from small language models used in clinical attribute-value extraction, with practical guidance offered for format selection and prompt design. Abstract: We present a comparative analysis of the parseability of structured outputs generated by small language models for open attribute-value extraction from clinical notes. We evaluate three widely used serialization formats: JSON, YAML, and XML, and find that JSON consistently yields the highest parseability. Structural robustness improves with targeted prompting and larger models, but declines for longer documents and certain note types. Our error analysis identifies recurring format-specific failure patterns. These findings offer practical guidance for selecting serialization formats and designing prompts when deploying language models in privacy-sensitive clinical settings.
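
As a rough illustration of what "parseability" means here, the sketch below checks whether a model output parses cleanly and reports the fraction that do. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the function names `is_parseable` and `parseability_rate` are hypothetical, only JSON and XML are covered because they have parsers in the Python standard library (YAML would need a third-party parser), and the sample strings are made up.

```python
import json
import xml.etree.ElementTree as ET


def is_parseable(text: str, fmt: str) -> bool:
    """Return True if `text` parses without error in the given format.

    Hypothetical helper: supports only "json" and "xml" (stdlib parsers).
    """
    try:
        if fmt == "json":
            json.loads(text)
        elif fmt == "xml":
            ET.fromstring(text)
        else:
            raise ValueError(f"unsupported format: {fmt}")
    except (json.JSONDecodeError, ET.ParseError):
        # The parser rejected the output, so it counts as unparseable.
        return False
    return True


def parseability_rate(outputs: list[str], fmt: str) -> float:
    """Fraction of outputs that parse; 0.0 for an empty list."""
    if not outputs:
        return 0.0
    return sum(is_parseable(o, fmt) for o in outputs) / len(outputs)
```

A truncated object such as `{"age": 54` fails the JSON check, which is the kind of structural failure a parseability metric counts.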

[32] Low-Perplexity LLM-Generated Sequences and Where To Find Them

Arthur Wuhrmann,Anastasiia Kucherenko,Andrei Kucharavy

Main category: cs.CL

TL;DR: This paper investigates how Large Language Models use their training data by analyzing low-perplexity text sequences, revealing that many outputs cannot be traced back to the training set.

Details Motivation: Understanding how training data shapes LLM outputs is crucial for improving transparency, accountability, privacy, and fairness in these models. Method: The paper introduces a systematic approach involving the analysis of low-perplexity sequences in LLM outputs, extracting these sequences without degeneration and tracing them back to their training data sources. Result: A substantial portion of low-perplexity text spans could not be linked to the training corpus, and for the ones that could, the study quantified their source distribution, shedding light on verbatim recall mechanisms. Conclusion: The study concludes that a significant portion of low-perplexity text spans generated by LLMs cannot be traced back to the training data, and for those that can, it provides insights into how training data influences model behavior. Abstract: As Large Language Models (LLMs) become increasingly widespread, understanding how specific training data shapes their outputs is crucial for transparency, accountability, privacy, and fairness. To explore how LLMs leverage and replicate their training data, we introduce a systematic approach centered on analyzing low-perplexity sequences - high-probability text spans generated by the model. Our pipeline reliably extracts such long sequences across diverse topics while avoiding degeneration, then traces them back to their sources in the training data. Surprisingly, we find that a substantial portion of these low-perplexity spans cannot be mapped to the corpus. For those that do match, we quantify the distribution of occurrences across source documents, highlighting the scope and nature of verbatim recall and paving a way toward better understanding of how LLMs training data impacts their behavior.
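
For readers unfamiliar with the term, perplexity over a span is the exponential of the negative mean token log-probability, so a low-perplexity span is one the model assigns high probability. The sketch below computes it and finds low-perplexity windows in a sequence of per-token log-probabilities. The window size and threshold are illustrative assumptions, and `low_perplexity_spans` is a hypothetical helper, not the paper's extraction pipeline.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp(-mean log-probability) over a non-empty list of natural-log probs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def low_perplexity_spans(token_logprobs: list[float], window: int,
                         threshold: float) -> list[int]:
    """Start indices of fixed-size windows whose perplexity is below threshold.

    `window` and `threshold` are assumed values chosen by the caller.
    """
    return [
        start
        for start in range(len(token_logprobs) - window + 1)
        if perplexity(token_logprobs[start:start + window]) < threshold
    ]
```

A span in which every token has probability 0.5 has perplexity 2, and a span of near-certain tokens approaches the minimum of 1.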

[33] Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages

Samridhi Raj Sinha,Rajvee Sheth,Abhishek Upperwal,Mayank Singh

Main category: cs.CL

TL;DR: EKA-EVAL is a comprehensive, multilingual evaluation framework designed specifically for global and Indic Large Language Models (LLMs), offering extensive benchmark coverage and advanced computing capabilities.

Details Motivation: The rapid advancement of Large Language Models (LLMs) has created a need for evaluation frameworks that go beyond English-centric benchmarks and address the requirements of linguistically diverse regions such as India. Method: The paper introduces EKA-EVAL, a unified and production-ready evaluation framework integrating over 35 benchmarks, including 10 Indic-specific datasets across various categories like reasoning, mathematics, tool use, long-context understanding, and reading comprehension. Result: EKA-EVAL provides broader benchmark coverage compared to existing Indian language evaluation tools, with built-in support for distributed inference, quantization, and multi-GPU usage. The framework is open-source and publicly available. Conclusion: EKA-EVAL is positioned as the first end-to-end, extensible evaluation framework for both global and Indic LLMs, offering broad benchmark coverage and built-in support for advanced computing features. It aims to establish a robust multilingual evaluation ecosystem as part of the ongoing EKA initiative. Abstract: The rapid advancement of Large Language Models (LLMs) has intensified the need for evaluation frameworks that go beyond English centric benchmarks and address the requirements of linguistically diverse regions such as India. We present EKA-EVAL, a unified and production-ready evaluation framework that integrates over 35 benchmarks, including 10 Indic-specific datasets, spanning categories like reasoning, mathematics, tool use, long-context understanding, and reading comprehension. Compared to existing Indian language evaluation tools, EKA-EVAL offers broader benchmark coverage, with built-in support for distributed inference, quantization, and multi-GPU usage. Our systematic comparison positions EKA-EVAL as the first end-to-end, extensible evaluation suite tailored for both global and Indic LLMs, significantly lowering the barrier to multilingual benchmarking. 
The framework is open-source and publicly available at https://github.com/lingo-iitgn/eka-eval and is part of the ongoing EKA initiative (https://eka.soket.ai), which aims to scale up to over 100 benchmarks and establish a robust, multilingual evaluation ecosystem for LLMs.

[34] DIY-MKG: An LLM-Based Polyglot Language Learning System

Kenan Tang,Yanhong Li,Yao Qin

Main category: cs.CL

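TL;DR: DIY-MKG is an LLM-based system for personalized polyglot language learning that lets users build vocabulary knowledge graphs and is designed to limit detrimental cognitive offloading.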

Details Motivation: Existing language learning tools lack support for polyglot learners, offer limited customization for individual needs, and suffer from detrimental cognitive offloading. Method: The authors design DIY-MKG, a system that builds personalized vocabulary knowledge graphs through selective expansion with LLM-suggested words and generates quizzes dynamically. Result: The evaluation shows that vocabulary expansion in DIY-MKG is reliable and fair across multiple languages and that the generated quizzes are highly accurate; letting users flag incorrect quiz questions increases engagement and provides a feedback loop. Conclusion: DIY-MKG is an open-source system that supports polyglot language learning, using an LLM to build personalized vocabulary knowledge graphs and offering rich annotation and an adaptive review module. Abstract: Existing language learning tools, even those powered by Large Language Models (LLMs), often lack support for polyglot learners to build linguistic connections across vocabularies in multiple languages, provide limited customization for individual learning paces or needs, and suffer from detrimental cognitive offloading. To address these limitations, we design Do-It-Yourself Multilingual Knowledge Graph (DIY-MKG), an open-source system that supports polyglot language learning. DIY-MKG allows the user to build personalized vocabulary knowledge graphs, which are constructed by selective expansion with related words suggested by an LLM. The system further enhances learning through rich annotation capabilities and an adaptive review module that leverages LLMs for dynamic, personalized quiz generation. In addition, DIY-MKG allows users to flag incorrect quiz questions, simultaneously increasing user engagement and providing a feedback loop for prompt refinement. Our evaluation of LLM-based components in DIY-MKG shows that vocabulary expansion is reliable and fair across multiple languages, and that the generated quizzes are highly accurate, validating the robustness of DIY-MKG.

[35] MiCoTA: Bridging the Learnability Gap with Intermediate CoT and Teacher Assistants

Dongyi Ding,Tiannan Wang,Chenghao Zhu,Meiling Tao,Yuchen Eleanor Jiang,Wangchunshu Zhou

Main category: cs.CL

TL;DR: This paper introduces MiCoTA, a framework that enhances the reasoning capabilities of small language models by leveraging intermediate-sized teachers and intermediate-length reasoning sequences, significantly improving performance on complex reasoning tasks.

Details Motivation: The motivation stems from the observation that while large language models (LLMs) excel at long reasoning tasks, their computational demands limit deployment. Meanwhile, small language models (SLMs) struggle with long-form reasoning due to limited capacity, a problem termed the 'SLMs Learnability Gap.' Method: The authors introduce the MiCoTA framework, which uses intermediate-sized language models as teacher assistants and leverages intermediate-length Chain-of-Thought (CoT) sequences to distill knowledge into SLMs. They evaluate the framework across multiple reasoning benchmarks and conduct a quantitative analysis to understand how the method aligns with SLM data distributions. Result: Qwen2.5-7B-Instruct and Qwen2.5-3B-Instruct models distilled using MiCoTA achieved average improvements of 3.47 and 3.93 respectively on key reasoning benchmarks including AIME2024, AMC, Olympiad, MATH-500, and GSM8K. Conclusion: The study concludes that the proposed MiCoTA framework significantly improves the reasoning performance of small language models (SLMs) by bridging the capacity and reasoning length gaps through intermediate-sized teacher assistants and intermediate-length CoT sequences. Abstract: Large language models (LLMs) excel at reasoning tasks requiring long thought sequences for planning, reflection, and refinement. However, their substantial model size and high computational demands are impractical for widespread deployment. Yet, small language models (SLMs) often struggle to learn long-form CoT reasoning due to their limited capacity, a phenomenon we refer to as the "SLMs Learnability Gap". To address this, we introduce Mid-CoT Teacher Assistant Distillation (MiCoTA), a framework for improving long CoT distillation for SLMs. MiCoTA employs intermediate-sized models as teacher assistants and utilizes intermediate-length CoT sequences to bridge both the capacity and reasoning length gaps. 
Our experiments on downstream tasks demonstrate that although SLMs distilled from large teachers can perform poorly, by applying MiCoTA, they achieve significant improvements in reasoning performance. Specifically, Qwen2.5-7B-Instruct and Qwen2.5-3B-Instruct achieve an improvement of 3.47 and 3.93 respectively on average score on AIME2024, AMC, Olympiad, MATH-500 and GSM8K benchmarks. To better understand the mechanism behind MiCoTA, we perform a quantitative experiment demonstrating that our method produces data more closely aligned with base SLM distributions. Our insights pave the way for future research into long-CoT data distillation for SLMs.

[36] High-Layer Attention Pruning with Rescaling

Songtao Liu,Peng Liu

Main category: cs.CL

TL;DR: This paper introduces a new pruning method for large language models that selectively prunes attention heads in higher layers and adjusts representation scale, resulting in improved performance over existing methods.

Details Motivation: Conventional structured pruning methods indiscriminately remove attention heads without considering their positions, which can negatively impact model performance. Method: A novel pruning algorithm that strategically prunes attention heads in higher layers and uses an adaptive rescaling parameter to maintain representation scale. Result: Experiments on various LLMs and 27 datasets show consistent outperformance over existing structured pruning methods, especially in generation tasks. Conclusion: The proposed pruning algorithm improves the performance of structured pruning methods, particularly in generation tasks. Abstract: Pruning is a highly effective approach for compressing large language models (LLMs), significantly reducing inference latency. However, conventional training-free structured pruning methods often employ a heuristic metric that indiscriminately removes some attention heads across all pruning layers, without considering their positions within the network architecture. In this work, we propose a novel pruning algorithm that strategically prunes attention heads in the model's higher layers. Since the removal of attention heads can alter the magnitude of token representations, we introduce an adaptive rescaling parameter that calibrates the representation scale post-pruning to counteract this effect. We conduct comprehensive experiments on a wide range of LLMs, including LLaMA3.1-8B, Mistral-7B-v0.3, Qwen2-7B, and Gemma2-9B. Our evaluation includes both generation and discriminative tasks across 27 datasets. The results consistently demonstrate that our method outperforms existing structured pruning methods. This improvement is particularly notable in generation tasks, where our approach significantly outperforms existing baselines.

[37] AI4Research: A Survey of Artificial Intelligence for Scientific Research

Qiguang Chen,Mingda Yang,Libo Qin,Jinhao Liu,Zheng Yan,Jiannan Guan,Dengyun Peng,Yiyan Ji,Hanjing Li,Mengkang Hu,Yimeng Zhang,Yihao Liang,Yuhang Zhou,Jiaqi Wang,Zhi Chen,Wanxiang Che

Main category: cs.CL

TL;DR: This paper provides a comprehensive survey of AI's role in scientific research (AI4Research), offering a systematic classification of tasks, highlighting future research directions, and compiling valuable resources to advance the field.

Details Motivation: Motivated by recent advancements in AI, particularly large language models, which have shown potential in scientific innovation processes, the paper aims to address the lack of a comprehensive survey in the field of AI4Research. Method: The authors present a comprehensive survey on AI4Research, introducing a systematic taxonomy of five mainstream tasks, identifying key research gaps and future directions, and compiling multidisciplinary applications, data, and tools. Result: The main contributions include a systematic taxonomy of AI4Research tasks, identification of new frontiers like scalability and societal impact, and the compilation of extensive resources for researchers. Conclusion: The paper concludes that a comprehensive survey on AI for Research (AI4Research) is necessary to enhance understanding and further development in the field, providing systematic taxonomy, identifying research gaps, and compiling resources for future exploration. Abstract: Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs) such as OpenAI-o1 and DeepSeek-R1, have demonstrated remarkable capabilities in complex domains such as logical reasoning and experimental coding. Motivated by these advancements, numerous studies have explored the application of AI in the innovation process, particularly in the context of scientific research. These AI technologies primarily aim to develop systems that can autonomously conduct research processes across a wide range of scientific disciplines. Despite these significant strides, a comprehensive survey on AI for Research (AI4Research) remains absent, which hampers our understanding and impedes further development in this field. To address this gap, we present a comprehensive survey and offer a unified perspective on AI4Research. 
Specifically, the main contributions of our work are as follows: (1) Systematic taxonomy: We first introduce a systematic taxonomy to classify five mainstream tasks in AI4Research. (2) New frontiers: Then, we identify key research gaps and highlight promising future directions, focusing on the rigor and scalability of automated experiments, as well as the societal impact. (3) Abundant applications and resources: Finally, we compile a wealth of resources, including relevant multidisciplinary applications, data corpora, and tools. We hope our work will provide the research community with quick access to these resources and stimulate innovative breakthroughs in AI4Research.

[38] Gradient-Adaptive Policy Optimization: Towards Multi-Objective Alignment of Large Language Models

Chengao Li,Hanyu Zhang,Yunkun Xu,Hongyan Xue,Xiang Ao,Qing He

Main category: cs.CL

TL;DR: This paper proposes GAPO and P-GAPO, two novel fine-tuning paradigms for aligning large language models with diverse human preferences by framing it as a multi-objective optimization problem.

Details Motivation: Effectively aligning LLMs with diverse human preferences remains a significant challenge, particularly when they conflict. Method: Gradient-Adaptive Policy Optimization (GAPO) and P-GAPO were introduced as novel fine-tuning paradigms that employ multiple-gradient descent to align LLMs with diverse preference distributions. Result: Empirical results on Mistral-7B show that GAPO outperforms current state-of-the-art methods, achieving superior performance in both helpfulness and harmlessness. Conclusion: GAPO converges towards a Pareto optimal solution for multiple objectives and P-GAPO achieves Pareto solutions that better align with the user's specific needs. Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful technique for aligning large language models (LLMs) with human preferences. However, effectively aligning LLMs with diverse human preferences remains a significant challenge, particularly when they conflict. To address this issue, we frame human value alignment as a multi-objective optimization problem, aiming to maximize a set of potentially conflicting objectives. We introduce Gradient-Adaptive Policy Optimization (GAPO), a novel fine-tuning paradigm that employs multiple-gradient descent to align LLMs with diverse preference distributions. GAPO adaptively rescales the gradients for each objective to determine an update direction that optimally balances the trade-offs between objectives. Additionally, we introduce P-GAPO, which incorporates user preferences across different objectives and achieves Pareto solutions that better align with the user's specific needs. Our theoretical analysis demonstrates that GAPO converges towards a Pareto optimal solution for multiple objectives. Empirical results on Mistral-7B show that GAPO outperforms current state-of-the-art methods, achieving superior performance in both helpfulness and harmlessness.
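
To make "multiple-gradient descent" concrete, the sketch below shows the classic two-objective case: it finds the minimum-norm convex combination of two gradients, which yields a direction that does not increase either objective to first order. This is the textbook closed form only. The abstract does not describe how GAPO rescales gradients, so that step is not reproduced, and `min_norm_direction` is a hypothetical name.

```python
def dot(u: list[float], v: list[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def min_norm_direction(g1: list[float], g2: list[float]) -> list[float]:
    """Minimum-norm point alpha*g1 + (1-alpha)*g2 with alpha in [0, 1].

    Setting the derivative of ||g2 + alpha*(g1 - g2)||^2 to zero gives
    alpha = (g2 - g1) . g2 / ||g1 - g2||^2, then alpha is clipped to [0, 1].
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        # Identical gradients: any convex combination equals g1.
        return list(g1)
    alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    alpha = max(0.0, min(1.0, alpha))
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]
```

For orthogonal unit gradients such as `[1.0, 0.0]` and `[0.0, 1.0]`, the result weights both equally, so the update makes progress on both objectives.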

[39] NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks

Yang Li,Youssef Emad,Karthik Padthe,Jack Lanchantin,Weizhe Yuan,Thao Nguyen,Jason Weston,Shang-Wen Li,Dong Wang,Ilia Kulikov,Xian Li

Main category: cs.CL

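TL;DR: This paper presents NaturalThoughts, which selects high-quality reasoning traces from a teacher model to improve a student model's reasoning, and performs strongly on several benchmarks.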

Details Motivation: To identify which kinds of reasoning demonstrations most effectively improve a student model's reasoning capabilities, since this had not been studied systematically. Method: The authors select high-quality "NaturalThoughts" reasoning traces from a strong teacher model and systematically analyze the factors that affect distilling reasoning capabilities. Result: Student models trained with NaturalThoughts outperform those trained on existing datasets such as OpenThoughts and LIMO on general STEM reasoning benchmarks including GPQA-Diamond, MMLU-Pro and SuperGPQA. Conclusion: Training with NaturalThoughts improves student reasoning more than existing datasets do, and selecting difficult examples that require diverse reasoning strategies transfers the teacher model's reasoning skills more sample-efficiently. Abstract: Recent work has shown that distilling reasoning traces from a larger teacher model via supervised finetuning outperforms reinforcement learning with the smaller student model alone (Guo et al. 2025). However, there has not been a systematic study of what kind of reasoning demonstrations from the teacher are most effective in improving the student model's reasoning capabilities. In this work we curate high-quality "NaturalThoughts" by selecting reasoning traces from a strong teacher model based on a large pool of questions from NaturalReasoning (Yuan et al. 2025). We first conduct a systematic analysis of factors that affect distilling reasoning capabilities, in terms of sample efficiency and scalability for general reasoning tasks. We observe that simply scaling up data size with random sampling is a strong baseline with steady performance gains. Further, we find that selecting difficult examples that require more diverse reasoning strategies is more sample-efficient to transfer the teacher model's reasoning skills. Evaluated on both Llama and Qwen models, training with NaturalThoughts outperforms existing reasoning datasets such as OpenThoughts, LIMO, etc. on general STEM reasoning benchmarks including GPQA-Diamond, MMLU-Pro and SuperGPQA.

[40] Decision-oriented Text Evaluation

Yu-Shiang Huang,Chuan-Ju Wang,Chung-Chi Chen

Main category: cs.CL

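TL;DR: This paper proposes a new evaluation framework for natural language generation that measures a text's effect on human and LLM decision-making rather than relying only on traditional intrinsic metrics.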

Details Motivation: Existing intrinsic evaluation methods, such as n-gram overlap or sentence plausibility, correlate only weakly with actual decision-making efficacy, so a more effective way to evaluate is needed. Method: The authors propose a decision-oriented framework for evaluating generated text and validate it by analyzing the trading performance of human investors and autonomous LLM agents informed exclusively by different texts. Result: Neither humans nor LLM agents consistently surpass random performance when relying solely on summaries; richer analytical commentaries, however, significantly improve the performance of collaborative human-LLM teams. Conclusion: Generated text should be evaluated by its ability to facilitate synergistic decision-making between humans and LLMs; traditional intrinsic metrics have critical limitations. Abstract: Natural language generation (NLG) is increasingly deployed in high-stakes domains, yet common intrinsic evaluation methods, such as n-gram overlap or sentence plausibility, weakly correlate with actual decision-making efficacy. We propose a decision-oriented framework for evaluating generated text by directly measuring its influence on human and large language model (LLM) decision outcomes. Using market digest texts--including objective morning summaries and subjective closing-bell analyses--as test cases, we assess decision quality based on the financial performance of trades executed by human investors and autonomous LLM agents informed exclusively by these texts. Our findings reveal that neither humans nor LLM agents consistently surpass random performance when relying solely on summaries. However, richer analytical commentaries enable collaborative human-LLM teams to outperform individual human or agent baselines significantly. Our approach underscores the importance of evaluating generated text by its ability to facilitate synergistic decision-making between humans and LLMs, highlighting critical limitations of traditional intrinsic metrics.

[41] Adaptability of ASR Models on Low-Resource Language: A Comparative Study of Whisper and Wav2Vec-BERT on Bangla

Md Sazzadul Islam Ridoy,Sumi Akter,Md. Aminur Rahman

Main category: cs.CL

TL;DR: This study compares Whisper and Wav2Vec-BERT on Bangla, a low-resource language, and finds that Wav2Vec-BERT is better in both accuracy and computational efficiency.

Details Motivation: In recent years, neural models trained on large multilingual text and speech datasets have shown great potential for supporting low-resource languages. This study compares two state-of-the-art Automatic Speech Recognition (ASR) models on Bangla, a low-resource language, in search of more efficient solutions. Method: The models are systematically fine-tuned with hyperparameter optimization, covering learning rate, epochs, and model checkpoint selection, and are evaluated on Word Error Rate (WER), Character Error Rate (CER), training time, and computational efficiency. Result: Wav2Vec-BERT outperformed Whisper on all key evaluation metrics while requiring fewer computational resources, offering valuable insights for developing robust speech recognition systems in low-resource language settings. Conclusion: Wav2Vec-BERT outperforms Whisper on all key evaluation metrics, showing stronger performance and higher computational efficiency in a low-resource language setting. Abstract: In recent years, neural models trained on large multilingual text and speech datasets have shown great potential for supporting low-resource languages. This study investigates the performances of two state-of-the-art Automatic Speech Recognition (ASR) models, OpenAI's Whisper (Small & Large-V2) and Facebook's Wav2Vec-BERT on Bangla, a low-resource language. We have conducted experiments using two publicly available datasets: Mozilla Common Voice-17 and OpenSLR to evaluate model performances. Through systematic fine-tuning and hyperparameter optimization, including learning rate, epochs, and model checkpoint selection, we have compared the models based on Word Error Rate (WER), Character Error Rate (CER), Training Time, and Computational Efficiency. The Wav2Vec-BERT model outperformed Whisper across all key evaluation metrics, demonstrated superior performance while requiring fewer computational resources, and offered valuable insights to develop robust speech recognition systems in low-resource linguistic settings.
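
The two accuracy metrics named above have standard definitions: WER is the word-level edit distance between reference and hypothesis divided by the number of reference words, and CER is the same at character level. The sketch below implements both with a plain Levenshtein distance. It assumes whitespace tokenization and no text normalization, which may differ from the study's setup, and the function names are illustrative.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate; assumes whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate over raw characters, spaces included."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Because insertions count as errors, both rates can exceed 1.0 when the hypothesis is much longer than the reference.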

[42] The Thin Line Between Comprehension and Persuasion in LLMs

Adrian de Wynter,Tangming Yuan

Main category: cs.CL

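TL;DR: Large language models can carry on persuasive debates but lack deeper comprehension of dialogue, and people become more critical when they are aware of AI involvement.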

Details Motivation: Given the rapid deployment of large language models (LLMs) in sensitive areas and the disparate accounts of their reasoning capabilities, a closer examination of LLMs and their comprehension of dialogue is needed. Method: The authors first evaluate LLMs' ability to maintain a debate, then measure how this ability relates to their comprehension of dialogical structures and the pragmatic context. Result: LLMs can maintain coherent, persuasive debates and often sway the beliefs of participants and audiences; when polled on deeper structures of dialogue, however, they cannot demonstrate that understanding. People are more critical of the arguments made when AI involvement is known or suspected. Conclusion: The shortcomings of LLMs-as-evaluators are tied to their (in)ability to understand context. If an agent can convincingly maintain a dialogue, it need not know what it is talking about; for argumentation theory, the modelling of pragmatic context and coherence is therefore secondary to effectiveness. Abstract: Large language models (LLMs) are excellent at maintaining high-level, convincing dialogues. They are being fast deployed as chatbots and evaluators in sensitive areas, such as peer review and mental health applications. This, along with the disparate accounts on their reasoning capabilities, calls for a closer examination of LLMs and their comprehension of dialogue. In this work we begin by evaluating LLMs' ability to maintain a debate--one of the purest yet most complex forms of human communication. Then we measure how this capability relates to their understanding of what is being talked about, namely, their comprehension of dialogical structures and the pragmatic context. We find that LLMs are capable of maintaining coherent, persuasive debates, often swaying the beliefs of participants and audiences alike. We also note that awareness or suspicion of AI involvement encourages people to be more critical of the arguments made. When polling LLMs on their comprehension of deeper structures of dialogue, however, they cannot demonstrate said understanding. Our findings tie the shortcomings of LLMs-as-evaluators to their (in)ability to understand the context. More broadly, for the field of argumentation theory we posit that, if an agent can convincingly maintain a dialogue, it is not necessary for it to know what it is talking about. Hence, the modelling of pragmatic context and coherence are secondary to effectiveness.

cs.CV [Back]

[43] Geometry-aware 4D Video Generation for Robot Manipulation

Zeyi Liu,Shuang Li,Eric Cousineau,Siyuan Feng,Benjamin Burchfiel,Shuran Song

Main category: cs.CV

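TL;DR: This paper proposes a new 4D video generation model that predicts future video sequences from RGB-D observations and performs well on robot manipulation and generalization to novel viewpoints.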

Details Motivation: Video generation models that capture dynamic scenes could improve a robot's ability to plan and interact in complex environments, but generating videos that are both temporally coherent and geometrically consistent remains a challenge. Method: The authors propose a 4D video generation model trained with geometric supervision from cross-view pointmap alignment, so that it learns a shared 3D representation of the scene. Result: Compared to existing baselines, the method produces more stable and spatially aligned predictions across multiple simulated and real-world robotic datasets, and robot end-effector trajectories can be recovered from its output with an off-the-shelf 6DoF pose tracker. Conclusion: The paper proposes a 4D video generation model that achieves multi-view 3D consistency by supervising training with cross-view pointmap alignment, which improves the robustness of robot manipulation and generalization to novel viewpoints. Abstract: Understanding and predicting the dynamics of the physical world can enhance a robot's ability to plan and interact effectively in complex environments. While recent video generation models have shown strong potential in modeling dynamic scenes, generating videos that are both temporally coherent and geometrically consistent across camera views remains a significant challenge. To address this, we propose a 4D video generation model that enforces multi-view 3D consistency of videos by supervising the model with cross-view pointmap alignment during training. This geometric supervision enables the model to learn a shared 3D representation of the scene, allowing it to predict future video sequences from novel viewpoints based solely on the given RGB-D observations, without requiring camera poses as inputs. Compared to existing baselines, our method produces more visually stable and spatially aligned predictions across multiple simulated and real-world robotic datasets. We further show that the predicted 4D videos can be used to recover robot end-effector trajectories using an off-the-shelf 6DoF pose tracker, supporting robust robot manipulation and generalization to novel camera viewpoints.

[44] Landslide Detection and Mapping Using Deep Learning Across Multi-Source Satellite Data and Geographic Regions

Rahul A. Burange,Harsh K. Shinde,Omkar Mutyalwar

Main category: cs.CV

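TL;DR: This paper presents an approach that integrates multi-source satellite imagery with deep learning models to improve landslide identification and prediction, in support of disaster risk management and land-use planning.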

Details Motivation: Landslides pose severe threats to infrastructure, economies, and human lives, so accurate detection and prediction across geographic regions is needed; advances in deep learning and remote sensing have made automated landslide detection more effective. Method: The authors combine Sentinel-2 multispectral data with ALOS PALSAR-derived slope and Digital Elevation Model (DEM) layers and apply deep learning segmentation models, including U-Net, DeepLabV3+, and Res-Net, for landslide identification and prediction. Result: The study assesses the impact of terrain characteristics, vegetation cover, and rainfall on detection accuracy and analyzes the effectiveness of several deep learning models for landslide detection. Conclusion: The study shows the potential of deep learning and multi-source remote sensing for building reliable landslide prediction models, in support of early warning systems and disaster risk management. Abstract: Landslides pose severe threats to infrastructure, economies, and human lives, necessitating accurate detection and predictive mapping across diverse geographic regions. With advancements in deep learning and remote sensing, automated landslide detection has become increasingly effective. This study presents a comprehensive approach integrating multi-source satellite imagery and deep learning models to enhance landslide identification and prediction. We leverage Sentinel-2 multispectral data and ALOS PALSAR-derived slope and Digital Elevation Model (DEM) layers to capture critical environmental features influencing landslide occurrences. Various geospatial analysis techniques are employed to assess the impact of terrain characteristics, vegetation cover, and rainfall on detection accuracy. Additionally, we evaluate the performance of multiple state-of-the-art deep learning segmentation models, including U-Net, DeepLabV3+, and Res-Net, to determine their effectiveness in landslide detection. The proposed framework contributes to the development of reliable early warning systems, improved disaster risk management, and sustainable land-use planning. Our findings provide valuable insights into the potential of deep learning and multi-source remote sensing in creating robust, scalable, and transferable landslide prediction models.

[45] cp_measure: API-first feature extraction for image-based profiling workflows

Alán F. Muñoz,Tim Treis,Alexandr A. Kalinin,Shatavisha Dasgupta,Fabian Theis,Anne E. Carpenter,Shantanu Singh

Main category: cs.CV

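TL;DR: This paper introduces cp_measure, a new Python tool that makes biological image analysis more automated and reproducible, with a particular fit for machine learning tasks.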

Details Motivation: Current tools such as CellProfiler can generate feature sets but pose significant barriers to automated and reproducible analyses, which limits their use in machine learning workflows. Method: The authors develop cp_measure, a Python library that extracts CellProfiler's core measurement capabilities and exposes them through an API-first design for programmatic feature extraction. Result: cp_measure reproduces CellProfiler features with high fidelity and is applied to 3D astrocyte imaging and spatial transcriptomics, enabling scalable image analysis pipelines. Conclusion: cp_measure is a modular tool that integrates seamlessly with the scientific Python ecosystem, making image-based profiling more automated, reproducible, and suitable for machine learning applications in computational biology. Abstract: Biological image analysis has traditionally focused on measuring specific visual properties of interest for cells or other entities. A complementary paradigm gaining increasing traction is image-based profiling - quantifying many distinct visual features to form comprehensive profiles which may reveal hidden patterns in cellular states, drug responses, and disease mechanisms. While current tools like CellProfiler can generate these feature sets, they pose significant barriers to automated and reproducible analyses, hindering machine learning workflows. Here we introduce cp_measure, a Python library that extracts CellProfiler's core measurement capabilities into a modular, API-first tool designed for programmatic feature extraction. We demonstrate that cp_measure features retain high fidelity with CellProfiler features while enabling seamless integration with the scientific Python ecosystem. Through applications to 3D astrocyte imaging and spatial transcriptomics, we showcase how cp_measure enables reproducible, automated image-based profiling pipelines that scale effectively for machine learning applications in computational biology.

[46] Rapid Salient Object Detection with Difference Convolutional Neural Networks

Zhuo Su,Li Liu,Matthias Müller,Jiehua Zhang,Diana Wofk,Ming-Ming Cheng,Matti Pietikäinen

Main category: cs.CV

TL;DR: This paper introduces SDNet and STDNet for efficient salient object detection in images and videos, combining traditional contrast-based methods with modern CNNs to achieve high speed and accuracy on resource-constrained devices.

Details Motivation: The motivation is to address the challenge of deploying salient object detection (SOD) on resource-constrained devices while maintaining real-time performance, as existing top-leading models are computationally expensive. Method: The authors proposed an efficient network design using Pixel Difference Convolutions (PDCs) with a difference convolution reparameterization (DCR) strategy for image SOD (SDNet) and SpatioTemporal Difference Convolution (STDC) for video SOD (STDNet). Result: On a Jetson Orin device, the proposed models achieved 46 FPS for images and 150 FPS for videos, surpassing the second-best lightweight models by more than 2× and 3× in speed while maintaining superior accuracy. Conclusion: The paper concludes that SDNet and STDNet offer significant improvements in efficiency-accuracy trade-offs for image and video SOD, operating at high speeds on resource-constrained devices. Abstract: This paper addresses the challenge of deploying salient object detection (SOD) on resource-constrained devices with real-time performance. While recent advances in deep neural networks have improved SOD, existing top-leading models are computationally expensive. We propose an efficient network design that combines traditional wisdom on SOD and the representation power of modern CNNs. Like biologically-inspired classical SOD methods relying on computing contrast cues to determine saliency of image regions, our model leverages Pixel Difference Convolutions (PDCs) to encode the feature contrasts. Differently, PDCs are incorporated in a CNN architecture so that the valuable contrast cues are extracted from rich feature maps. For efficiency, we introduce a difference convolution reparameterization (DCR) strategy that embeds PDCs into standard convolutions, eliminating computation and parameters at inference. 
Additionally, we introduce SpatioTemporal Difference Convolution (STDC) for video SOD, enhancing the standard 3D convolution with spatiotemporal contrast capture. Our models, SDNet for image SOD and STDNet for video SOD, achieve significant improvements in efficiency-accuracy trade-offs. On a Jetson Orin device, our models with $<$ 1M parameters operate at 46 FPS and 150 FPS on streamed images and videos, surpassing the second-best lightweight models in our experiments by more than $2\times$ and $3\times$ in speed with superior accuracy. Code will be available at https://github.com/hellozhuo/stdnet.git.
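
The reparameterization idea in the abstract can be illustrated with a small sketch. It assumes the simplest "central" form of a pixel difference convolution, where each weight acts on the difference between a neighbor and the center pixel; the paper's actual PDC variants and DCR procedure may differ, and the function names below are hypothetical. Because the difference operation is linear, it can be folded into an ordinary kernel, so the difference computation costs nothing extra at inference.

```python
def conv_at_center(patch, kernel):
    """Standard 3x3 convolution (cross-correlation) response at the patch center."""
    return sum(patch[r][c] * kernel[r][c] for r in range(3) for c in range(3))


def central_pdc_at_center(patch, kernel):
    """Central pixel-difference response: weights act on (neighbor - center)."""
    center = patch[1][1]
    return sum((patch[r][c] - center) * kernel[r][c]
               for r in range(3) for c in range(3))


def reparameterize_central(kernel):
    """Fold the difference operation into an ordinary kernel.

    sum_i w_i * (x_i - x_c) = sum_i w_i * x_i - (sum_i w_i) * x_c,
    so subtracting the kernel sum from the center weight yields a standard
    kernel that produces the same output.
    """
    total = sum(sum(row) for row in kernel)
    merged = [row[:] for row in kernel]
    merged[1][1] -= total
    return merged


# Toy 3x3 feature patch and learned difference kernel (illustrative values).
patch = [[0.2, 0.4, 0.1],
         [0.9, 0.5, 0.3],
         [0.7, 0.6, 0.8]]
kernel = [[0.5, -1.0, 0.25],
          [1.5, 2.0, -0.75],
          [0.0, 1.0, -0.5]]

pdc_out = central_pdc_at_center(patch, kernel)
std_out = conv_at_center(patch, reparameterize_central(kernel))
```

Here `pdc_out` and `std_out` agree up to floating-point rounding, which is why the merged kernel can replace the difference convolution after training.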

[47] Robust Brain Tumor Segmentation with Incomplete MRI Modalities Using Hölder Divergence and Mutual Information-Enhanced Knowledge Transfer

Runze Cheng,Xihang Qiu,Ming Li,Ye Zhang,Chun Li,Fei Yu

Main category: cs.CV

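TL;DR: This study proposes a new framework for handling missing modalities in multimodal MRI brain tumor segmentation, achieving high-accuracy segmentation even with a single-modality input.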

Details Motivation: Conventional methods perform poorly when certain modalities are missing because of image quality, protocol inconsistencies, patient allergies, or financial constraints, so a more robust method is needed. Method: A robust single-modality parallel processing framework is proposed that combines Hölder divergence and mutual information to preserve modality-specific features and dynamically adjust network parameters. Result: Extensive evaluation on the BraTS 2018 and BraTS 2020 datasets shows that the framework performs better when modalities are missing. Conclusion: The framework outperforms existing methods in handling missing modalities, and using Hölder divergence and mutual information yields consistently accurate segmentation. Abstract: Multimodal MRI provides critical complementary information for accurate brain tumor segmentation. However, conventional methods struggle when certain modalities are missing due to issues such as image quality, protocol inconsistencies, patient allergies, or financial constraints. To address this, we propose a robust single-modality parallel processing framework that achieves high segmentation accuracy even with incomplete modalities. Leveraging Hölder divergence and mutual information, our model maintains modality-specific features while dynamically adjusting network parameters based on the available inputs. By using these divergence- and information-based loss functions, the framework effectively quantifies discrepancies between predictions and ground-truth labels, resulting in consistently accurate segmentation. Extensive evaluations on the BraTS 2018 and BraTS 2020 datasets demonstrate superior performance over existing methods in handling missing modalities.
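
As a rough illustration of a divergence-based loss, the sketch below computes one standard form of the Hölder pseudo-divergence between a predicted class distribution and a label distribution. The abstract does not state which Hölder divergence variant or exponents the authors use, so the formula choice, the function name, and the default `alpha` are assumptions made for illustration only.

```python
import math


def holder_pseudo_divergence(p, q, alpha=2.0):
    """Hölder pseudo-divergence between two positive vectors.

    D(p:q) = -log( sum(p*q) / (||p||_alpha * ||q||_beta) ),
    with conjugate exponents 1/alpha + 1/beta = 1.
    Hölder's inequality bounds the ratio by 1, so D >= 0.
    """
    if alpha <= 1.0:
        raise ValueError("alpha must be greater than 1")
    beta = alpha / (alpha - 1.0)  # conjugate exponent
    inner = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = sum(pi ** alpha for pi in p) ** (1.0 / alpha)
    norm_q = sum(qi ** beta for qi in q) ** (1.0 / beta)
    return -math.log(inner / (norm_p * norm_q))


# Toy per-voxel class probabilities (illustrative values, not from the paper).
prediction = [0.7, 0.2, 0.1]
smoothed_label = [0.9, 0.05, 0.05]

d_same = holder_pseudo_divergence(prediction, prediction)
d_diff = holder_pseudo_divergence(prediction, smoothed_label)
```

With `alpha = 2` the expression reduces to the Cauchy-Schwarz divergence, which is zero when the two vectors are proportional and positive otherwise.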

[48] AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation

Xiao Liu,Jiawei Zhang

Main category: cs.CV

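TL;DR: This paper proposes AIGVE-MACS, a new evaluation model for AI-generated video that provides both numerical scores and multi-aspect language feedback, improving interpretability and alignment with human evaluation.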

Details Motivation: Existing evaluation metrics for AI-generated video produce only numerical scores without explanatory comments, resulting in low interpretability and poor alignment with human evaluation. Method: AIGVE-MACS is trained on AIGVE-BENCH 2, a large-scale benchmark of 2,500 AI-generated videos and 22,500 human-annotated comments and scores, combining recent vision-language models with a novel token-wise weighted loss and a dynamic frame sampling strategy. Result: AIGVE-MACS achieves state-of-the-art performance on both supervised and zero-shot benchmarks, significantly outperforming prior baselines including GPT-4o and VideoScore, and a multi-agent refinement framework driven by its feedback yields a 53.5% quality improvement. Conclusion: AIGVE-MACS not only performs well in scoring correlation but also, by providing multi-aspect language feedback, establishes a new paradigm for evaluating AI-generated video. Abstract: The rapid advancement of AI-generated video models has created a pressing need for robust and interpretable evaluation frameworks. Existing metrics are limited to producing numerical scores without explanatory comments, resulting in low interpretability and human evaluation alignment. To address those challenges, we introduce AIGVE-MACS, a unified model for AI-Generated Video Evaluation (AIGVE), which can provide not only numerical scores but also multi-aspect language comment feedback in evaluating these generated videos. Central to our approach is AIGVE-BENCH 2, a large-scale benchmark comprising 2,500 AI-generated videos and 22,500 human-annotated detailed comments and numerical scores across nine critical evaluation aspects. Leveraging AIGVE-BENCH 2, AIGVE-MACS incorporates recent Vision-Language Models with a novel token-wise weighted loss and a dynamic frame sampling strategy to better align with human evaluators. Comprehensive experiments across supervised and zero-shot benchmarks demonstrate that AIGVE-MACS achieves state-of-the-art performance in both scoring correlation and comment quality, significantly outperforming prior baselines including GPT-4o and VideoScore. In addition, we further showcase a multi-agent refinement framework where feedback from AIGVE-MACS drives iterative improvements in video generation, leading to 53.5% quality enhancement. This work establishes a new paradigm for comprehensive, human-aligned evaluation of AI-generated videos. We release the AIGVE-BENCH 2 and AIGVE-MACS at https://huggingface.co/xiaoliux/AIGVE-MACS.

[49] Advancements in Weed Mapping: A Systematic Review

Mohammad Jahanbakht,Alex Olsen,Ross Marchant,Emilie Fillols,Mostafa Rahimi Azghadi

Main category: cs.CV

TL;DR: This paper reviews recent advances in weed mapping technologies and provides a comprehensive analysis of the entire mapping pipeline to guide future research and improve weed management systems.

Details Motivation: Recent advances in weed mapping have not been comprehensively reviewed, particularly with a structured analysis spanning the entire mapping pipeline from data acquisition to processing techniques and mapping tools. Method: Following PRISMA guidelines, the authors systematically examined state-of-the-art methods in data acquisition, data processing, and mapping techniques. Result: A holistic understanding of the weed mapping landscape was provided by critically evaluating and synthesizing key findings from the literature. Conclusion: This review serves as a foundational reference to guide future research and support the development of efficient, scalable, and sustainable weed management systems. Abstract: Weed mapping plays a critical role in precision management by providing accurate and timely data on weed distribution, enabling targeted control and reduced herbicide use. This minimizes environmental impacts, supports sustainable land management, and improves outcomes across agricultural and natural environments. Recent advances in weed mapping leverage ground-vehicle Red Green Blue (RGB) cameras, satellite and drone-based remote sensing combined with sensors such as spectral, Near Infra-Red (NIR), and thermal cameras. The resulting data are processed using advanced techniques including big data analytics and machine learning, significantly improving the spatial and temporal resolution of weed maps and enabling site-specific management decisions. Despite a growing body of research in this domain, there is a lack of comprehensive literature reviews specifically focused on weed mapping. In particular, the absence of a structured analysis spanning the entire mapping pipeline, from data acquisition to processing techniques and mapping tools, limits progress in the field. 
This review addresses these gaps by systematically examining state-of-the-art methods in data acquisition (sensor and platform technologies), data processing (including annotation and modelling), and mapping techniques (such as spatiotemporal analysis and decision support tools). Following PRISMA guidelines, we critically evaluate and synthesize key findings from the literature to provide a holistic understanding of the weed mapping landscape. This review serves as a foundational reference to guide future research and support the development of efficient, scalable, and sustainable weed management systems.

[50] Frequency Domain-Based Diffusion Model for Unpaired Image Dehazing

Chengxu Liu,Lu Qi,Jinshan Pan,Xueming Qian,Ming-Hsuan Yang

Main category: cs.CV

TL;DR: This paper proposes a frequency domain-based diffusion model for unpaired image dehazing that achieves strong results on multiple datasets.

Details Motivation: Contrastive learning-based methods introduce haze-unrelated content information and ignore haze-specific properties in the frequency domain, which motivates a new approach. Method: A frequency domain-based diffusion model is proposed, comprising an Amplitude Residual Encoder (ARE) and a Phase Correction Module (PCM). Result: The method addresses unpaired image dehazing and achieves better performance. Conclusion: Experiments show that the proposed method outperforms other state-of-the-art methods on both synthetic and real-world datasets. Abstract: Unpaired image dehazing has attracted increasing attention due to its flexible data requirements during model training. Dominant methods based on contrastive learning not only introduce haze-unrelated content information, but also ignore haze-specific properties in the frequency domain (i.e., haze-related degradation is mainly manifested in the amplitude spectrum). To address these issues, we propose a novel frequency domain-based diffusion model for fully exploiting the beneficial knowledge in unpaired clear data. In particular, inspired by the strong generative ability shown by Diffusion Models (DMs), we tackle the dehazing task from the perspective of frequency domain reconstruction and use DMs to yield an amplitude spectrum consistent with the distribution of clear images. To implement it, we propose an Amplitude Residual Encoder (ARE) to extract the amplitude residuals, which effectively compensates for the amplitude gap from the hazy to clear domains and provides supervision for DM training. In addition, we propose a Phase Correction Module (PCM) to eliminate artifacts by further refining the phase spectrum during dehazing with a simple attention mechanism. Experimental results demonstrate that our method outperforms other state-of-the-art methods on both synthetic and real-world datasets.
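
The claim that haze mostly affects the amplitude spectrum can be checked on a toy 1D signal. The sketch below is not the paper's diffusion model: it assumes a spatially uniform haze (a fixed contrast reduction plus an offset) and simply swaps in the clear signal's amplitude spectrum, whereas the paper generates that amplitude with a diffusion model and refines the phase with its PCM. All function names are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def amplitude_phase(signal):
    """Split a signal's spectrum into per-bin amplitude and phase."""
    spectrum = dft(signal)
    return [abs(z) for z in spectrum], [cmath.phase(z) for z in spectrum]


def recombine(amplitude, phase):
    """Rebuild a real signal from an amplitude spectrum and a phase spectrum."""
    spectrum = [cmath.rect(a, ph) for a, ph in zip(amplitude, phase)]
    return [z.real for z in idft(spectrum)]


# Toy "clear" scanline and a uniformly hazed copy: I = 0.5 * J + 0.4.
clear = [0.1, 0.8, 0.3, 0.9, 0.2, 0.7, 0.4, 0.6]
hazy = [0.5 * v + 0.4 for v in clear]

clear_amp, _ = amplitude_phase(clear)
_, hazy_phase = amplitude_phase(hazy)

# Clear amplitude combined with hazy phase recovers the clear signal here.
restored = recombine(clear_amp, hazy_phase)
```

Under this uniform-haze assumption the phase is untouched, so replacing the amplitude alone restores the signal. Real haze varies with scene depth, which is one reason a separate phase refinement step is still useful.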

[51] Learning an Ensemble Token from Task-driven Priors in Facial Analysis

Sunyong Seo,Semin Kim,Jongha Lee

Main category: cs.CV

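TL;DR: This paper proposes ET-Fuser, a new method that improves feature representations for facial analysis tasks through an ensemble token and attention mechanisms.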

Details Motivation: Although conventional methods have advanced visual interpretability, little research preserves a unified feature representation in single-task learning during training. Method: A new method is proposed that uses attention mechanisms to generate an ensemble token based on task priors derived from pre-trained models. Result: The method improves a variety of facial analysis tasks, with statistically significant enhancements observed in the feature representations. Conclusion: ET-Fuser offers an efficient approach to facial analysis by learning an ensemble token through attention mechanisms based on task priors. Abstract: Facial analysis exhibits task-specific feature variations. While Convolutional Neural Networks (CNNs) have enabled the fine-grained representation of spatial information, Vision Transformers (ViTs) have facilitated the representation of semantic information at the patch level. Although the generalization of conventional methodologies has advanced visual interpretability, there remains a paucity of research that preserves the unified feature representation on single task learning during the training process. In this work, we introduce ET-Fuser, a novel methodology for learning an ensemble token by leveraging attention mechanisms based on task priors derived from pre-trained models for facial analysis. Specifically, we propose a robust prior unification learning method that generates an ensemble token within a self-attention mechanism, which shares the mutual information along the pre-trained encoders. This ensemble token approach offers high efficiency with negligible computational cost. Our results show improvements across a variety of facial analysis tasks, with statistically significant enhancements observed in the feature representations.

[52] DiffusionLight-Turbo: Accelerated Light Probes for Free via Single-Pass Chrome Ball Inpainting

Worameth Chinchuthakun,Pakkapon Phongthawee,Amit Raj,Varun Jampani,Pramook Khungurn,Supasorn Suwajanakorn

Main category: cs.CV

TL;DR: This paper proposes DiffusionLight and DiffusionLight-Turbo for lighting estimation from LDR images, achieving high-quality results and fast inference times using diffusion models.

Details Motivation: Existing methods rely on limited HDR panorama datasets and suffer from generalization failures. The proposed method aims to overcome these limitations by leveraging a pre-trained diffusion model. Method: The technique reframes lighting estimation as a chrome ball inpainting problem using a pre-trained diffusion model (Stable Diffusion XL). DiffusionLight uses iterative inpainting to compute a stable lighting prior, while DiffusionLight-Turbo employs a Turbo LoRA for faster results. Result: Experimental results show convincing light estimates across diverse settings and superior generalization in real-world scenarios. DiffusionLight-Turbo achieves a 60x speedup over DiffusionLight with minimal quality loss. Conclusion: DiffusionLight and DiffusionLight-Turbo are effective techniques for estimating lighting from a single LDR image, with the latter offering significant time reduction while maintaining quality. Abstract: We introduce a simple yet effective technique for estimating lighting from a single low-dynamic-range (LDR) image by reframing the task as a chrome ball inpainting problem. This approach leverages a pre-trained diffusion model, Stable Diffusion XL, to overcome the generalization failures of existing methods that rely on limited HDR panorama datasets. While conceptually simple, the task remains challenging because diffusion models often insert incorrect or inconsistent content and cannot readily generate chrome balls in HDR format. Our analysis reveals that the inpainting process is highly sensitive to the initial noise in the diffusion process, occasionally resulting in unrealistic outputs. To address this, we first introduce DiffusionLight, which uses iterative inpainting to compute a median chrome ball from multiple outputs to serve as a stable, low-frequency lighting prior that guides the generation of a high-quality final result. 
To generate high-dynamic-range (HDR) light probes, an Exposure LoRA is fine-tuned to create LDR images at multiple exposure values, which are then merged. While effective, DiffusionLight is time-intensive, requiring approximately 30 minutes per estimation. To reduce this overhead, we introduce DiffusionLight-Turbo, which reduces the runtime to about 30 seconds with minimal quality loss. This 60x speedup is achieved by training a Turbo LoRA to directly predict the averaged chrome balls from the iterative process. Inference is further streamlined into a single denoising pass using a LoRA swapping technique. Experimental results show that our method produces convincing light estimates across diverse settings and demonstrates superior generalization to in-the-wild scenarios. Our code is available at https://diffusionlight.github.io/turbo
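
The step of computing a median chrome ball from multiple inpainting outputs can be sketched as a per-pixel median. The example assumes tiny grayscale images stored as nested lists and a hypothetical helper name; the actual pipeline operates on diffusion outputs and is not reproduced here.

```python
import statistics


def median_image(samples):
    """Per-pixel median over a list of equally sized grayscale images."""
    height, width = len(samples[0]), len(samples[0][0])
    return [[statistics.median(img[r][c] for img in samples)
             for c in range(width)]
            for r in range(height)]


# Three toy 2x2 "chrome ball" samples; the third contains outlier pixels,
# standing in for an unrealistic output caused by an unlucky noise seed.
samples = [
    [[0.2, 0.5], [0.4, 0.9]],
    [[0.3, 0.5], [0.5, 0.8]],
    [[0.9, 0.6], [0.4, 0.1]],
]

stable_prior = median_image(samples)
```

The median ignores the outlier pixels in the third sample, which is the property that makes it a stable prior when individual samples are sensitive to the initial noise.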

[53] Physics-informed Ground Reaction Dynamics from Human Motion Capture

Cuong Le,Huy-Phuong Le,Duc Le,Minh-Thien Duong,Van-Binh Nguyen,My-Ha Le

Main category: cs.CV

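TL;DR: This paper proposes a new method for estimating human ground reaction forces from motion capture data and physical laws, computing the forces with Euler's integration scheme and a PD algorithm; on the GroundLink dataset it outperforms the baseline model in both force estimation accuracy and simulated root trajectory precision.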

Details Motivation: Specialized devices such as force plates can only be installed in laboratory setups, which significantly limits the learning of human dynamics, so a more reliable way to estimate human ground reaction dynamics is needed. Method: The paper proposes a highly accurate and robust method that computes ground reaction forces from motion capture data using Euler's integration scheme and a PD algorithm, and uses these physics-based reaction forces to improve the learning model's estimate of motion dynamics. Result: Tested on the GroundLink dataset, the method outperforms the baseline model in both ground reaction force estimation accuracy and simulated root trajectory precision. Conclusion: The paper develops and validates a new method for estimating human ground reaction forces based on physical laws and motion capture data, improving estimation accuracy. Abstract: Body dynamics are crucial information for the analysis of human motions in important research fields, ranging from biomechanics and sports science to computer vision and graphics. Modern approaches collect the body dynamics, external reactive force specifically, via force plates, synchronizing with human motion capture data, and learn to estimate the dynamics from a black-box deep learning model. Being specialized devices, force plates can only be installed in laboratory setups, imposing a significant limitation on the learning of human dynamics. To this end, we propose a novel method for estimating human ground reaction dynamics directly from the more reliable motion capture data with physics laws and computational simulation as constraints. We introduce a highly accurate and robust method for computing ground reaction forces from motion capture data using Euler's integration scheme and PD algorithm. The physics-based reactive forces are used to inform the learning model about the physics-informed motion dynamics thus improving the estimation accuracy. The proposed approach was tested on the GroundLink dataset, outperforming the baseline model on: 1) the ground reaction force estimation accuracy compared to the force plate measurements; and 2) our simulated root trajectory precision. The implementation code is available at https://github.com/cuongle1206/Phys-GRD
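
A minimal sketch of how a PD rule and Euler integration can yield a reaction force from a target trajectory is shown below. It assumes a 1D point mass tracking a vertical root height, with hypothetical gains chosen for critical damping; the paper's full-body formulation is not described in the abstract and will differ.

```python
def simulate_pd_tracking(target_heights, mass=70.0, kp=7000.0, kd=1400.0,
                         dt=0.01, gravity=9.81):
    """Track a target height with a PD force and explicit Euler integration.

    Returns the PD force applied at each step (the external force needed to
    follow the target) and the simulated heights.
    """
    height = target_heights[0]
    velocity = 0.0
    forces, heights = [], []
    for i, target in enumerate(target_heights):
        previous = target_heights[i - 1] if i > 0 else target
        target_velocity = (target - previous) / dt
        # PD rule: proportional to position error, damped by velocity error.
        force = kp * (target - height) + kd * (target_velocity - velocity)
        acceleration = force / mass - gravity
        # Explicit Euler step: position uses the velocity from before the update.
        height += dt * velocity
        velocity += dt * acceleration
        forces.append(force)
        heights.append(height)
    return forces, heights


# Standing still for 10 seconds at a root height of 1 m (toy motion capture data).
targets = [1.0] * 1000
forces, heights = simulate_pd_tracking(targets)
body_weight = 70.0 * 9.81
```

For a stationary target the PD force settles at the body weight, the expected ground reaction force when standing. A pure PD rule leaves a small steady-state height offset of `mass * gravity / kp`, which is one reason practical systems add feed-forward terms or tune gains per joint.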

[54] Learning Camera-Agnostic White-Balance Preferences

Luxi Zhao,Mahmoud Afifi,Michael S. Brown

Main category: cs.CV

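TL;DR: This paper proposes a lightweight cross-camera aesthetic white-balance method that enables consistent, stylized color rendering across different cameras.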

Details Motivation: Commercial auto white balance systems often aim for aesthetic white-balance preferences rather than accurate neutral color correction, while existing learning-based methods struggle to generalize across camera sensors, especially on multi-camera smartphones. Method: The method learns a mapping that transforms neutral illuminant corrections into aesthetically preferred corrections, trained in a camera-agnostic space and applied after the white-balance module of any camera. Result: The model contains only about 500 parameters and runs in 0.024 milliseconds on a flagship mobile CPU; on a dataset of 771 smartphone images it achieves state-of-the-art performance. Conclusion: The paper presents a new post-illuminant-estimation mapping for aesthetically consistent white balance across cameras; it is lightweight and compatible with existing cross-camera auto white balance techniques. Abstract: The image signal processor (ISP) pipeline in modern cameras consists of several modules that transform raw sensor data into visually pleasing images in a display color space. Among these, the auto white balance (AWB) module is essential for compensating for scene illumination. However, commercial AWB systems often strive to compute aesthetic white-balance preferences rather than accurate neutral color correction. While learning-based methods have improved AWB accuracy, they typically struggle to generalize across different camera sensors -- an issue for smartphones with multiple cameras. Recent work has explored cross-camera AWB, but most methods remain focused on achieving neutral white balance. In contrast, this paper is the first to address aesthetic consistency by learning a post-illuminant-estimation mapping that transforms neutral illuminant corrections into aesthetically preferred corrections in a camera-agnostic space. Once trained, our mapping can be applied after any neutral AWB module to enable consistent and stylized color rendering across unseen cameras. Our proposed model is lightweight -- containing only $\sim$500 parameters -- and runs in just 0.024 milliseconds on a typical flagship mobile CPU. Evaluated on a dataset of 771 smartphone images from three different cameras, our method achieves state-of-the-art performance while remaining fully compatible with existing cross-camera AWB techniques, introducing minimal computational and memory overhead.

[55] Learning from Random Subspace Exploration: Generalized Test-Time Augmentation with Self-supervised Distillation

Andrei Jelea,Ahmed Nabil Belbachir,Marius Leordeanu

Main category: cs.CV

TL;DR: Generalized Test-Time Augmentation (GTTA) enhances model performance across diverse tasks by forming robust ensembles through novel data transformations and a self-supervised learning stage, reducing computational cost without sacrificing accuracy.

Details Motivation: The motivation is to develop a versatile Test-Time Augmentation approach applicable to many tasks beyond what existing methods can offer, with enhanced performance and reduced computational costs. Method: GTTA employs a new general data transformation that randomly perturbs the PCA subspace projection of a test input multiple times to form robust ensembles. It also introduces a self-supervised learning stage where the ensemble output trains the initial model. Result: GTTA was validated on numerous datasets and tasks including image classification, segmentation, speech recognition, house price prediction, and specifically on salmon segmentation in underwater videos using the newly introduced DeepSalmon dataset. Conclusion: GTTA proves to be a highly effective and general method for improving the performance of trained models across various vision and non-vision tasks, offering reduced test time computational costs without compromising accuracy. Abstract: We introduce Generalized Test-Time Augmentation (GTTA), a highly effective method for improving the performance of a trained model, which unlike other existing Test-Time Augmentation approaches from the literature is general enough to be used off-the-shelf for many vision and non-vision tasks, such as classification, regression, image segmentation and object detection. By applying a new general data transformation that randomly perturbs the PCA subspace projection of a test input multiple times, GTTA forms robust ensembles at test time in which, due to sound statistical properties, the structural and systematic noise in the initial input data is filtered out and final estimator errors are reduced. Different from other existing methods, we also propose a final self-supervised learning stage in which the ensemble output, acting as an unsupervised teacher, is used to train the initial single student model, thus significantly reducing the test time computational cost, at no loss in accuracy. 
Our tests and comparisons to strong TTA approaches and SoTA models on various vision and non-vision well-known datasets and tasks, such as image classification and segmentation, speech recognition and house price prediction, validate the generality of the proposed GTTA. Furthermore, we also prove its effectiveness on the more specific real-world task of salmon segmentation and detection in low-visibility underwater videos, for which we introduce DeepSalmon, the largest dataset of its kind in the literature.
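
The ensemble-forming step can be sketched as follows. The example assumes the PCA mean and an orthonormal basis are already available (here a hand-written stand-in rather than a fitted PCA), uses Gaussian perturbations with a hypothetical scale `sigma`, and omits the self-supervised distillation stage; none of these choices are confirmed by the abstract.

```python
import random


def project(x, mean, basis):
    """Coefficients of x in the subspace spanned by orthonormal basis vectors."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    return [sum(c * b for c, b in zip(centered, axis)) for axis in basis]


def reconstruct(coeffs, mean, basis):
    """Map subspace coefficients back to the input space."""
    x = list(mean)
    for coeff, axis in zip(coeffs, basis):
        for i, b in enumerate(axis):
            x[i] += coeff * b
    return x


def subspace_tta(model, x, mean, basis, n_aug=2000, sigma=0.1, seed=0):
    """Average model outputs over random perturbations of the projection."""
    rng = random.Random(seed)
    coeffs = project(x, mean, basis)
    outputs = []
    for _ in range(n_aug):
        noisy = [c + rng.gauss(0.0, sigma) for c in coeffs]
        outputs.append(model(reconstruct(noisy, mean, basis)))
    return sum(outputs) / len(outputs)


# Stand-in for a fitted PCA: zero mean and the first two coordinate axes.
mean = [0.0, 0.0, 0.0]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]


def toy_model(x):
    """Hypothetical trained regressor; any callable on a feature list works."""
    return 2.0 * x[0] - x[1] + 0.5 * x[2]


# The third coordinate carries a large off-subspace value that projection removes.
test_input = [1.0, 0.5, 5.0]
ensemble_prediction = subspace_tta(toy_model, test_input, mean, basis)
```

In this toy case the ensemble prediction lands close to 1.5, the model output on the projected input, because the component outside the subspace is discarded and the zero-mean perturbations average out.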

[56] Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-Language Model

Chaoxiang Cai,Longrong Yang,Kaibing Chen,Fan Yang,Xi Li

Main category: cs.CV

TL;DR: This paper proposes a novel Long-Tailed Distribution-aware Router (LTDR) for vision-language token-to-expert routing in mixture-of-experts frameworks, which takes into account the distributional differences between vision and language modalities and enhances expert activation for vision tail tokens, leading to improved performance as validated by experiments on extensive benchmarks.

Details Motivation: Existing MoE frameworks for LVLMs often rely on load balancing mechanisms, overlooking the inherent distributional differences between vision and language. This can lead to suboptimal performance due to the distinct nature of these modalities. Method: Long-Tailed Distribution-aware Router (LTDR) for vision-language token-to-expert routing (TER), with a focus on modality-specific routing strategies and an oversampling-like strategy for vision tail tokens. Result: Experiments on extensive benchmarks validate the effectiveness of the proposed approach in addressing the challenges of vision-language TER. Conclusion: The proposed LTDR method effectively addresses the challenges in vision-language TER by considering the distributional differences between modalities and enhancing expert activation for vision tail tokens. Abstract: The mixture-of-experts (MoE), which replaces dense models with sparse architectures, has gained attention in large vision-language models (LVLMs) for achieving comparable performance with fewer activated parameters. Existing MoE frameworks for LVLMs focus on token-to-expert routing (TER), encouraging different experts to specialize in processing distinct tokens. However, these frameworks often rely on the load balancing mechanism, overlooking the inherent distributional differences between vision and language. To this end, we propose a Long-Tailed Distribution-aware Router (LTDR) for vision-language TER, tackling two challenges: (1) Distribution-aware router for modality-specific routing. We observe that language TER follows a uniform distribution, whereas vision TER exhibits a long-tailed distribution. This discrepancy necessitates distinct routing strategies tailored to each modality. (2) Enhancing expert activation for vision tail tokens. Recognizing the importance of vision tail tokens, we introduce an oversampling-like strategy by increasing the number of activated experts for these tokens. 
Experiments on extensive benchmarks validate the effectiveness of our approach.
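
The oversampling-like strategy can be pictured as top-k routing where k grows for vision tail tokens. The sketch below assumes a softmax router, a boolean flag marking tail tokens, and the values `base_k=2` and `tail_k=4`; how the paper identifies tail tokens and how many extra experts it activates are not stated in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, is_vision_tail, base_k=2, tail_k=4):
    """Return (expert_index, normalized_weight) pairs for one token.

    Tail tokens activate more experts, mimicking an oversampling effect.
    """
    probs = softmax(router_logits)
    k = tail_k if is_vision_tail else base_k
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]


# Router logits over six experts for a single token (illustrative values).
logits = [2.0, 0.5, 1.0, -1.0, 0.0, 1.5]
head_route = route_token(logits, is_vision_tail=False)
tail_route = route_token(logits, is_vision_tail=True)
```

With these logits the ordinary token is sent to experts 0 and 5, while the tail token is additionally sent to experts 2 and 1, with weights renormalized over the selected experts in both cases.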

[57] 3D Gaussian Splatting Driven Multi-View Robust Physical Adversarial Camouflage Generation

Tianrui Lou,Xiaojun Jia,Siyuan Liang,Jiawei Liang,Ming Zhang,Yanjun Xiao,Xiaochun Cao

Main category: cs.CV

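TL;DR: This paper proposes PGA, a new physical adversarial attack framework based on 3D Gaussian Splatting that addresses the efficiency and effectiveness problems of existing methods in real-world use.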

Details Motivation: Existing attack methods, whether patch-based or camouflage-based approaches that rely on mesh priors and simulator-built virtual environments, are insufficiently effective in complex physical environments and time-consuming to set up. Method: Adversarial examples are modeled with 3D Gaussian Splatting, and cross-view robustness and adversarial effectiveness are enhanced by preventing mutual and self-occlusion among Gaussians and by using min-max optimization. Result: Experiments confirm that PGA achieves higher adversarial effectiveness and robustness across diverse viewpoints and physical environments. Conclusion: PGA provides an effective physical attack framework that uses 3D Gaussian Splatting for fast and precise adversarial example generation. Abstract: Physical adversarial attack methods expose the vulnerabilities of deep neural networks and pose a significant threat to safety-critical scenarios such as autonomous driving. Camouflage-based physical attack is a more promising approach compared to the patch-based attack, offering stronger adversarial effectiveness in complex physical environments. However, most prior work relies on mesh priors of the target object and virtual environments constructed by simulators, which are time-consuming to obtain and inevitably differ from the real world. Moreover, due to the limitations of the backgrounds in training images, previous methods often fail to produce multi-view robust adversarial camouflage and tend to fall into sub-optimal solutions. For these reasons, prior work lacks adversarial effectiveness and robustness across diverse viewpoints and physical environments. We propose a physical attack framework based on 3D Gaussian Splatting (3DGS), named PGA, which provides rapid and precise reconstruction with few images, along with photo-realistic rendering capabilities. Our framework further enhances cross-view robustness and adversarial effectiveness by preventing mutual and self-occlusion among Gaussians and employing a min-max optimization approach that adjusts the imaging background of each viewpoint, helping the algorithm filter out non-robust adversarial features. Extensive experiments validate the effectiveness and superiority of PGA. Our code is available at https://github.com/TRLou/PGA.

[58] Activation Reward Models for Few-Shot Model Alignment

Tianning Chai,Chancharik Mitra,Brandon Huang,Gautam Rajendrakumar Gare,Zhiqiu Lin,Assaf Arbelle,Leonid Karlinsky,Rogerio Feris,Trevor Darrell,Deva Ramanan,Roei Herzig

Main category: cs.CV

TL;DR: Activation RMs offer a novel, efficient way to align models with human preferences using minimal supervision.

Details Motivation: Traditional reward modeling is not easily adaptable to new preferences and requires large preference datasets. A more flexible approach is needed. Method: Activation Reward Models (Activation RMs) use activation steering to create aligned reward signals without additional model finetuning. Result: Activation RMs outperform existing few-shot reward modeling approaches and achieve state-of-the-art performance on the PreferenceHack benchmark. Conclusion: Activation RMs are effective in aligning models with human preferences, especially in safety-critical applications by mitigating reward hacking behaviors. Abstract: Aligning Large Language Models (LLMs) and Large Multimodal Models (LMMs) to human preferences is a central challenge in improving the quality of the models' generative outputs for real-world applications. A common approach is to use reward modeling to encode preferences, enabling alignment via post-training using reinforcement learning. However, traditional reward modeling is not easily adaptable to new preferences because it requires a separate reward model, commonly trained on large preference datasets. To address this, we introduce Activation Reward Models (Activation RMs) -- a novel few-shot reward modeling method that leverages activation steering to construct well-aligned reward signals using minimal supervision and no additional model finetuning. Activation RMs outperform existing few-shot reward modeling approaches such as LLM-as-a-judge with in-context learning, voting-based scoring, and token probability scoring on standard reward modeling benchmarks. Furthermore, we demonstrate the effectiveness of Activation RMs in mitigating reward hacking behaviors, highlighting their utility for safety-critical applications. Toward this end, we propose PreferenceHack, a novel few-shot setting benchmark, the first to test reward models on reward hacking in a paired preference format. 
Finally, we show that Activation RM achieves state-of-the-art performance on this benchmark, surpassing even GPT-4o.

[59] Active Measurement: Efficient Estimation at Scale

Max Hamilton,Jinlin Lai,Wenlong Zhao,Subhransu Maji,Daniel Sheldon

Main category: cs.CV

TL;DR: Active measurement combines AI predictions with targeted human labeling to improve accuracy in scientific measurements with less effort.

Details Motivation: Current AI workflows in scientific discovery often lack sufficient accuracy and statistical guarantees, necessitating a more reliable approach to measurement. Method: The method involves using an AI model to predict measurements for individual units, which are then sampled for human labeling through importance sampling. The AI model improves with each new set of labels, refining the Monte Carlo estimate of the total measurement. Result: Active measurement reduces estimation error compared to alternative methods and can yield precise estimates even with an imperfect AI model, requiring minimal human effort when the model is highly accurate. Conclusion: Active measurement is an effective human-in-the-loop AI framework that provides precise estimates in scientific measurements, especially when the AI model is accurate. Abstract: AI has the potential to transform scientific discovery by analyzing vast datasets with little human effort. However, current workflows often do not provide the accuracy or statistical guarantees that are needed. We introduce active measurement, a human-in-the-loop AI framework for scientific measurement. An AI model is used to predict measurements for individual units, which are then sampled for human labeling using importance sampling. With each new set of human labels, the AI model is improved and an unbiased Monte Carlo estimate of the total measurement is refined. Active measurement can provide precise estimates even with an imperfect AI model, and requires little human effort when the AI model is very accurate. We derive novel estimators, weighting schemes, and confidence intervals, and show that active measurement reduces estimation error compared to alternatives in several measurement tasks.
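
The core estimator can be sketched with plain importance sampling: units are drawn for human labeling with probability proportional to the AI prediction, and each label is reweighted by its sampling probability. The sketch assumes sampling with replacement and a fixed model; the paper's specific estimators, weighting schemes, confidence intervals, and model-update loop are not reproduced, and all names are hypothetical.

```python
import random


def importance_sampled_total(predictions, label_fn, n_labels, seed=0):
    """Unbiased Monte Carlo estimate of sum(label_fn(i)) over all units.

    Units are sampled with probability proportional to the AI prediction,
    so predictions must be positive wherever the true value is nonzero.
    """
    rng = random.Random(seed)
    total_pred = sum(predictions)
    probs = [p / total_pred for p in predictions]
    picks = rng.choices(range(len(predictions)), weights=probs, k=n_labels)
    return sum(label_fn(i) / probs[i] for i in picks) / n_labels


# Toy ground truth, e.g. object counts per image tile (only seen via labeling).
true_counts = [3, 0, 12, 7, 1, 25, 4, 9]


def human_label(i):
    """Stand-in for asking a person to label unit i."""
    return true_counts[i]


# A model whose predictions are exactly proportional to the truth.
perfect_preds = [0.5 * c for c in true_counts]
# A plausible imperfect model.
rough_preds = [4, 1, 10, 8, 2, 20, 5, 10]

estimate_perfect = importance_sampled_total(perfect_preds, human_label, n_labels=5)
estimate_rough = importance_sampled_total(rough_preds, human_label, n_labels=5000)
```

When predictions are proportional to the truth, every sampled term equals the true total, so a handful of labels suffices. With the rougher model the estimate is still unbiased but needs more labels to tighten, which mirrors the abstract's point about human effort shrinking as the model improves.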

[60] MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video Parsing

Langyu Wang,Bingke Zhu,Yingying Chen,Yiyuan Zhang,Ming Tang,Jinqiao Wang

Main category: cs.CV

TL;DR: This paper proposes MUG, a novel approach for weakly-supervised audio-visual video parsing that combines pseudo-labeling with an audio-visual Mamba network to enhance both segment-level and event-level prediction performance.

Details Motivation: Existing weakly-supervised AVVP methods struggle to simultaneously improve both segment-level and event-level predictions due to limitations in supervision and model architecture. This work aims to address these shortcomings by emphasizing segment uniqueness and reducing interference from alternate modalities. Method: An audio-visual Mamba network with pseudo labeling aUGmentation (MUG) is proposed. Pseudo-labels are generated based on unimodal predictions, and cross-modal random combinations are used to generate new data. The AV-Mamba architecture enhances feature processing and interaction while filtering out modal noise. Result: Extensive experiments show that MUG achieves improved performance on the LLP dataset across all metrics, including gains of 2.1% in visual Segment-level and 1.2% in audio Segment-level metrics. Conclusion: The proposed MUG method improves the state-of-the-art results on the LLP dataset for AVVP tasks by enhancing segment-level and event-level predictions through pseudo-labeling augmentation and an audio-visual Mamba network. Abstract: Weakly-supervised audio-visual video parsing (AVVP) aims to predict all modality-specific events and locate their temporal boundaries. Despite significant progress, due to the limitations of weak supervision and the deficiencies of the model architecture, existing methods fall short of simultaneously improving both the segment-level prediction and the event-level prediction. In this work, we propose an audio-visual Mamba network with pseudo labeling aUGmentation (MUG) for emphasising the uniqueness of each segment and excluding the noise interference from the alternate modalities. Specifically, we annotate some of the pseudo-labels based on previous work. Using unimodal pseudo-labels, we perform cross-modal random combinations to generate new data, which can enhance the model's ability to parse various segment-level event combinations. 
For feature processing and interaction, we employ an audio-visual Mamba network. The AV-Mamba enhances the ability to perceive different segments and excludes additional modal noise while sharing similar modal information. Our extensive experiments demonstrate that MUG improves state-of-the-art results on the LLP dataset in all metrics (e.g., gains of 2.1% and 1.2% in terms of visual Segment-level and audio Segment-level metrics). Our code is available at https://github.com/WangLY136/MUG.
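
The cross-modal random combination mentioned in the MUG abstract can be illustrated with a minimal sketch. It assumes each training video is a dict holding per-segment audio and visual features together with unimodal pseudo-labels; the field names and the `cross_modal_mix` helper are hypothetical and are not taken from the MUG code base.

```python
import random


def cross_modal_mix(samples, rng):
    """Pair the audio stream of one video with the visual stream of another.

    Hypothetical sketch: each sample carries per-segment pseudo-labels as sets
    of event names, one set per segment and per modality.
    """
    audio_src, visual_src = rng.sample(samples, 2)  # two distinct videos
    return {
        "audio_source": audio_src["id"],
        "visual_source": visual_src["id"],
        "audio_feats": audio_src["audio_feats"],
        "visual_feats": visual_src["visual_feats"],
        # Segment-level labels follow the modality they came from.
        "audio_labels": audio_src["audio_labels"],
        "visual_labels": visual_src["visual_labels"],
        # The video-level label of the new sample is the union of both streams.
        "video_labels": sorted(
            set().union(*audio_src["audio_labels"], *visual_src["visual_labels"])
        ),
    }
```

Because the labels travel with their modality, the synthetic sample keeps consistent segment-level supervision even though its two streams never co-occurred.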

[61] FixTalk: Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases

Shuai Tan,Bill Gong,Bin Ji,Ye Pan

Main category: cs.CV

TL;DR: This paper introduces FixTalk, a new framework that addresses identity leakage and rendering artifacts in talking head generation, thereby improving generation quality.

Details Motivation: Existing talking head generation methods often suffer from identity leakage and rendering artifacts, particularly in extreme cases. Method: An Enhanced Motion Indicator (EMI) is introduced to decouple identity information from motion features, and an Enhanced Detail Indicator (EDI) is proposed to use the leaked identity information to supplement missing details. Result: Extensive experiments show that FixTalk effectively mitigates identity leakage and rendering artifacts and outperforms state-of-the-art methods. Conclusion: FixTalk effectively mitigates identity leakage and rendering artifacts, achieving higher-quality talking head generation than existing methods. Abstract: Talking head generation is gaining significant importance across various domains, with a growing demand for high-quality rendering. However, existing methods often suffer from identity leakage (IL) and rendering artifacts (RA), particularly in extreme cases. Through an in-depth analysis of previous approaches, we identify two key insights: (1) IL arises from identity information embedded within motion features, and (2) this identity information can be leveraged to address RA. Building on these findings, this paper introduces FixTalk, a novel framework designed to simultaneously resolve both issues for high-quality talking head generation. Firstly, we propose an Enhanced Motion Indicator (EMI) to effectively decouple identity information from motion features, mitigating the impact of IL on generated talking heads. To address RA, we introduce an Enhanced Detail Indicator (EDI), which utilizes the leaked identity information to supplement missing details, thus fixing the artifacts. Extensive experiments demonstrate that FixTalk effectively mitigates IL and RA, achieving superior performance compared to state-of-the-art methods.

[62] Coherent Online Road Topology Estimation and Reasoning with Standard-Definition Maps

Khanh Son Pham,Christian Witte,Jens Behley,Johannes Betz,Cyrill Stachniss

Main category: cs.CV

TL;DR: This paper proposes a network architecture that uses standard-definition maps and sensor data to coherently construct high-definition maps online, reducing autonomous cars' reliance on HD maps.

Details Motivation: Most autonomous cars currently rely on HD maps, but constructing them online remains challenging, which calls for a method that models the complexity of road topologies in a unified and consistent manner. Method: A network architecture is proposed that uses hybrid lane segment encodings combining prior information and denoising techniques, and leverages past frames for temporal consistency. Result: Experimental evaluation shows that the approach outperforms previous methods by a large margin, demonstrating the benefits of the proposed modeling scheme. Conclusion: The paper proposes a coherent approach that predicts lane segments, their topology, and road boundaries by leveraging prior map information (standard-definition maps), addressing autonomous cars' reliance on HD maps. Abstract: Most autonomous cars rely on the availability of high-definition (HD) maps. Current research aims to address this constraint by directly predicting HD map elements from onboard sensors and reasoning about the relationships between the predicted map and traffic elements. Despite recent advancements, the coherent online construction of HD maps remains a challenging endeavor, as it necessitates modeling the high complexity of road topologies in a unified and consistent manner. To address this challenge, we propose a coherent approach to predict lane segments and their corresponding topology, as well as road boundaries, all by leveraging prior map information represented by commonly available standard-definition (SD) maps. We propose a network architecture, which leverages hybrid lane segment encodings comprising prior information and denoising techniques to enhance training stability and performance. Furthermore, we leverage past frames for temporal consistency. Our experimental evaluation demonstrates that our approach outperforms previous methods by a large margin, highlighting the benefits of our modeling scheme.

[63] Medical-Knowledge Driven Multiple Instance Learning for Classifying Severe Abdominal Anomalies on Prenatal Ultrasound

Huanwen Liang,Jingxian Xu,Yuanji Zhang,Yuhao Huang,Yuhan Zhang,Xin Yang,Ran Li,Xuedong Deng,Yanjun Liu,Guowei Tao,Yun Wu,Sheng Zhao,Xinru Gao,Dong Ni

Main category: cs.CV

TL;DR: This paper develops a new AI method for more accurate diagnosis of fetal abdominal malformations, reducing risks to mother and fetus, and validates its superior performance on a large dataset.

Details Motivation: Fetal abdominal malformations are serious congenital anomalies that require accurate diagnosis to guide pregnancy management and reduce mortality, yet the application of AI to prenatal abdominal anomalies remains limited. Method: Classification uses a mixture-of-attention-experts module (MoAE), a medical-knowledge-driven feature selection module (MFS), and prompt-based prototype learning (PPL). Result: Extensive validation on a large prenatal abdominal ultrasound dataset containing 2,419 cases, 24,748 images, and 6 categories shows that the proposed method outperforms state-of-the-art competitors. Conclusion: The paper proposes a case-level multiple instance learning method for classifying fetal abdominal malformations that outperforms existing techniques. Abstract: Fetal abdominal malformations are serious congenital anomalies that require accurate diagnosis to guide pregnancy management and reduce mortality. Although AI has demonstrated significant potential in medical diagnosis, its application to prenatal abdominal anomalies remains limited. Most existing studies focus on image-level classification and rely on standard plane localization, placing less emphasis on case-level diagnosis. In this paper, we develop a case-level multiple instance learning (MIL)-based method, free of standard plane localization, for classifying fetal abdominal anomalies in prenatal ultrasound. Our contribution is three-fold. First, we adopt a mixture-of-attention-experts module (MoAE) to weight different attention heads for various planes. Second, we propose a medical-knowledge-driven feature selection module (MFS) to align image features with medical knowledge, performing self-supervised image token selection at the case level. Finally, we propose prompt-based prototype learning (PPL) to enhance the MFS. Extensively validated on a large prenatal abdominal ultrasound dataset containing 2,419 cases, with a total of 24,748 images and 6 categories, our proposed method outperforms the state-of-the-art competitors. Code is available at: https://github.com/LL-AC/AAcls.

[64] CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning

Kuniaki Saito,Donghyun Kim,Kwanyong Park,Atsushi Hashimoto,Yoshitaka Ushiku

Main category: cs.CV

TL;DR: This paper proposes CaptionSmiths, a new image captioning model that can flexibly switch its language pattern, enabling fine-grained control over generated captions.

Details Motivation: Existing generative vision-language models are not conditioned on caption properties during training and cannot smoothly transition between language patterns, so a new approach is needed for fine-grained control over the properties of generated captions. Method: The length, descriptiveness, and word uniqueness of each caption are quantified as continuous scalar values, and the conditioning is represented by interpolating between two endpoint vectors. Result: Experiments show that CaptionSmiths can smoothly change the properties of output captions and achieves higher lexical alignment than baselines. For example, CaptionSmiths reduces the error in controlling caption length by 506% despite better lexical alignment. Conclusion: CaptionSmiths is an image captioning model that handles diverse language patterns, addressing the shortcomings of existing models in controlling caption properties. Abstract: An image captioning model flexibly switching its language pattern, e.g., descriptiveness and length, should be useful since it can be applied to diverse applications. However, despite the dramatic improvement in generative vision-language models, fine-grained control over the properties of generated captions is not easy for two reasons: (i) existing models are not given the properties as a condition during training and (ii) existing models cannot smoothly transition their language pattern from one state to the other. Given this challenge, we propose a new approach, CaptionSmiths, to acquire a single captioning model that can handle diverse language patterns. First, our approach quantifies three properties of each caption, length, descriptiveness, and uniqueness of a word, as continuous scalar values, without human annotation. Given the values, we represent the conditioning via interpolation between two endpoint vectors corresponding to the extreme states, e.g., one for a very short caption and one for a very long caption. Empirical results demonstrate that the resulting model can smoothly change the properties of the output captions and show higher lexical alignment than baselines. For instance, CaptionSmiths reduces the error in controlling caption length by 506\% despite better lexical alignment. Code will be available on https://github.com/omron-sinicx/captionsmiths.
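
Conditioning by interpolation between two endpoint vectors, as the CaptionSmiths abstract describes, can be sketched in a few lines. This is only an illustration under stated assumptions: in the paper the endpoint vectors would be learned, whereas here `e_lo` and `e_hi` are fixed toy lists, and the function name `interpolate_condition` and the min-max normalisation are hypothetical.

```python
def interpolate_condition(value, lo, hi, e_lo, e_hi):
    """Map a scalar caption property to a conditioning vector.

    value: raw property (e.g. caption length); lo/hi: assumed range bounds.
    e_lo/e_hi: endpoint vectors for the two extreme states (toy values here).
    """
    # Normalise to [0, 1] and clamp so out-of-range values hit an endpoint.
    t = min(max((value - lo) / (hi - lo), 0.0), 1.0)
    # Linear interpolation between the two endpoint vectors.
    return [(1 - t) * a + t * b for a, b in zip(e_lo, e_hi)]


# Usage: a mid-length caption lands halfway between "very short" and "very long".
mid_condition = interpolate_condition(5, 0, 10, [0.0, 0.0], [2.0, 4.0])
```

Because the condition varies continuously with the scalar, nearby property values yield nearby conditioning vectors, which is what allows a smooth transition between language patterns.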

[65] Gradient Short-Circuit: Efficient Out-of-Distribution Detection via Feature Intervention

Jiawei Gu,Ziyue Qiao,Zechao Li

Main category: cs.CV

TL;DR: This paper proposes a lightweight out-of-distribution detection method that analyzes gradient directions and applies an adjustment at inference time to improve the safety and reliability of models in open-world environments.

Details Motivation: In a model trained only on in-distribution data, gradient directions around in-distribution samples are relatively consistent, whereas out-of-distribution samples exhibit disorganized or conflicting gradient directions; this observation motivates an improved OOD detection method. Method: The difference in gradient-direction consistency is exploited at inference time, and a local first-order approximation is introduced to avoid recomputing the logits while effectively detecting OOD samples. Result: Experiments show substantial improvements on standard OOD benchmarks with low computational overhead and easy integration into existing inference pipelines. Conclusion: The paper proposes an inference-stage OOD detection method that short-circuits gradient directions to reduce confidence on OOD samples while preserving classification accuracy on ID samples. Abstract: Out-of-Distribution (OOD) detection is critical for safely deploying deep models in open-world environments, where inputs may lie outside the training distribution. During inference on a model trained exclusively with In-Distribution (ID) data, we observe a salient gradient phenomenon: around an ID sample, the local gradient directions for "enhancing" that sample's predicted class remain relatively consistent, whereas OOD samples--unseen in training--exhibit disorganized or conflicting gradient directions in the same neighborhood. Motivated by this observation, we propose an inference-stage technique to short-circuit those feature coordinates that spurious gradients exploit to inflate OOD confidence, while leaving ID classification largely intact. To circumvent the expense of recomputing the logits after this gradient short-circuit, we further introduce a local first-order approximation that accurately captures the post-modification outputs without a second forward pass. Experiments on standard OOD benchmarks show our approach yields substantial improvements. Moreover, the method is lightweight and requires minimal changes to the standard inference pipeline, offering a practical path toward robust OOD detection in real-world applications.
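
The idea of approximating the post-modification logits to first order, instead of running a second forward pass, can be sketched for the simplest case of a linear classification head. Everything specific here is an assumption for illustration: the abstract does not say how coordinates are chosen, so the rule below (zero the `k` coordinates contributing most to the predicted class) is hypothetical, as are the function names.

```python
def linear_logits(W, b, f):
    """Logits of a linear head: W @ f + b, written with plain lists."""
    return [sum(w * x for w, x in zip(row, f)) + bi for row, bi in zip(W, b)]


def short_circuit_first_order(W, b, f, k):
    """Zero k feature coordinates and estimate the new logits without a re-run.

    First-order update: logits' ~= logits + J @ (f' - f), where J is the
    Jacobian of the logits w.r.t. the features (J = W for a linear head).
    """
    logits = linear_logits(W, b, f)
    pred = max(range(len(logits)), key=logits.__getitem__)
    grad_pred = W[pred]  # gradient of the predicted-class logit w.r.t. f
    # Hypothetical selection rule: largest gradient-times-activation terms.
    order = sorted(range(len(f)), key=lambda j: grad_pred[j] * f[j], reverse=True)
    chosen = set(order[:k])
    delta = [-f[j] if j in chosen else 0.0 for j in range(len(f))]
    approx = [l + sum(w * d for w, d in zip(row, delta)) for l, row in zip(logits, W)]
    return approx, sorted(chosen)
```

For a linear head the approximation is exact; with a nonlinear head it would only hold locally, which is why the abstract calls it a local approximation.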

[66] DocShaDiffusion: Diffusion Model in Latent Space for Document Image Shadow Removal

Wenjie Liu,Bingshu Wang,Ze Wang,C. L. Philip Chen

Main category: cs.CV

TL;DR: This paper proposes DocShaDiffusion, a novel diffusion-based approach for document image shadow removal that addresses the issue of color shadows. It introduces modules for shadow mask generation and guided diffusion, along with a new dataset, achieving superior performance over existing methods.

Details Motivation: Existing methods primarily focus on removing shadows with constant color backgrounds and overlook color shadows, which limits their effectiveness in real-world scenarios. Method: A diffusion model in latent space was designed for shadow removal, along with a shadow soft-mask generation module (SSGM) and a shadow mask-aware guided diffusion module (SMGDM). A shadow-robust perceptual feature loss was also proposed. Additionally, a large-scale synthetic dataset (SDCSRD) was developed. Result: The proposed method outperforms state-of-the-art techniques on three public datasets, demonstrating its superiority in document image shadow removal. Conclusion: DocShaDiffusion is able to effectively remove shadows from document images while preserving details and structures, as validated by experiments on three public datasets. Abstract: Document shadow removal is a crucial task in the field of document image enhancement. However, existing methods tend to remove shadows on constant-color backgrounds and ignore color shadows. In this paper, we first design a diffusion model in latent space for document image shadow removal, called DocShaDiffusion. It translates shadow images from pixel space to latent space, enabling the model to more easily capture essential features. To address the issue of color shadows, we design a shadow soft-mask generation module (SSGM). It is able to produce an accurate shadow mask and add noise specifically to shadow regions. Guided by the shadow mask, a shadow mask-aware guided diffusion module (SMGDM) is proposed to remove shadows from document images by supervising the diffusion and denoising process. We also propose a shadow-robust perceptual feature loss to preserve details and structures in document images. Moreover, we develop a large-scale synthetic document color shadow removal dataset (SDCSRD). It simulates the distribution of realistic color shadows and provides strong support for the training of models. 
Experiments on three public datasets validate the proposed method's superiority over state-of-the-art. Our code and dataset will be publicly available.

[67] DiffMark: Diffusion-based Robust Watermark Against Deepfakes

Chen Sun,Haiyang Sun,Zhiqing Guo,Yunfeng Diao,Liejun Wang,Dan Ma,Gaobo Yang,Keqin Li

Main category: cs.CV

TL;DR: This paper proposes DiffMark, a novel robust watermarking framework based on diffusion models that seamlessly fuses the watermark with the image during generation and withstands Deepfake manipulations.

Details Motivation: Existing watermarking methods lack sufficient robustness against malicious Deepfake facial manipulations, so a stronger watermarking technique is needed to verify image authenticity and trace sources. Method: A novel robust watermarking framework based on a diffusion model, DiffMark, is proposed; it achieves seamless fusion of watermark and image through a modified training and sampling scheme, a cross information fusion module, and Deepfake-resistant guidance. Result: Experimental results show that DiffMark is effective on typical Deepfakes and adapts better to the sampling process of diffusion models, enhancing watermark robustness. Conclusion: DiffMark achieves watermarking that is robust to Deepfake manipulations by combining a diffusion model with dedicated modules (such as the CIF module) and Deepfake-resistant guidance. Abstract: Deepfakes pose significant security and privacy threats through malicious facial manipulations. While robust watermarking can aid in authenticity verification and source tracking, existing methods often lack sufficient robustness against Deepfake manipulations. Diffusion models have demonstrated remarkable performance in image generation, enabling the seamless fusion of watermark with image during generation. In this study, we propose a novel robust watermarking framework based on a diffusion model, called DiffMark. By modifying the training and sampling scheme, we take the facial image and watermark as conditions to guide the diffusion model to progressively denoise and generate the corresponding watermarked image. In the construction of the facial condition, we weight the facial image by a timestep-dependent factor that gradually reduces the guidance intensity as the noise decreases, thus better adapting to the sampling process of the diffusion model. To achieve the fusion of the watermark condition, we introduce a cross information fusion (CIF) module that leverages a learnable embedding table to adaptively extract watermark features and integrates them with image features via cross-attention. To enhance the robustness of the watermark against Deepfake manipulations, we integrate a frozen autoencoder during the training phase to simulate Deepfake manipulations. Additionally, we introduce Deepfake-resistant guidance that employs a specific Deepfake model to adversarially guide the diffusion sampling process to generate more robust watermarked images. 
Experimental results demonstrate the effectiveness of the proposed DiffMark on typical Deepfakes. Our code will be available at https://github.com/vpsg-research/DiffMark.

[68] TurboReg: TurboClique for Robust and Efficient Point Cloud Registration

Shaocheng Yan,Pengcheng Shi,Zhenjun Zhao,Kaixin Wang,Kuang Cao,Ji Wu,Jiayuan Li

Main category: cs.CV

TL;DR: TurboReg is proposed as a fast and robust method for Point Cloud Registration, outperforming existing techniques in both speed and accuracy by leveraging a lightweight clique and an efficient search algorithm.

Details Motivation: Existing correspondence-based Point Cloud Registration methods using maximal clique search suffer from exponential time complexity, limiting their effectiveness in time-sensitive applications. Method: A fast and robust estimator named TurboReg is introduced, which uses a lightweight clique (TurboClique) and a Pivot-Guided Search (PGS) algorithm for efficient parallel searching and transformation estimation. Result: TurboReg demonstrates substantial speed improvements and better recall, such as being 208.22 times faster than 3DMAC on the 3DMatch+FCGF dataset while achieving high accuracy. Conclusion: TurboReg achieves state-of-the-art performance in Point Cloud Registration with significant speed improvements over existing methods. Abstract: Robust estimation is essential in correspondence-based Point Cloud Registration (PCR). Existing methods using maximal clique search in compatibility graphs achieve high recall but suffer from exponential time complexity, limiting their use in time-sensitive applications. To address this challenge, we propose a fast and robust estimator, TurboReg, built upon a novel lightweight clique, TurboClique, and a highly parallelizable Pivot-Guided Search (PGS) algorithm. First, we define the TurboClique as a 3-clique within a highly-constrained compatibility graph. The lightweight nature of the 3-clique allows for efficient parallel searching, and the highly-constrained compatibility graph ensures robust spatial consistency for stable transformation estimation. Next, PGS selects matching pairs with high SC$^2$ scores as pivots, effectively guiding the search toward TurboCliques with higher inlier ratios. Moreover, the PGS algorithm has linear time complexity and is significantly more efficient than the maximal clique search with exponential time complexity. Extensive experiments show that TurboReg achieves state-of-the-art performance across multiple real-world datasets, with substantial speed improvements. 
For example, on the 3DMatch+FCGF dataset, TurboReg (1K) operates $208.22\times$ faster than 3DMAC while also achieving higher recall. Our code is accessible at https://github.com/Laka-3DV/TurboReg.
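
The pivot-guided search for 3-cliques described in the TurboReg abstract can be sketched as follows. This is a simplified, sequential illustration, not the authors' parallel implementation: the distance-preservation test with threshold `tau`, the use of common-neighbour counts as an SC^2-style edge score, and both function names are assumptions.

```python
from itertools import combinations
from math import dist


def compatibility_graph(src, dst, tau):
    """Connect correspondences i, j whose pairwise distances agree within tau."""
    adj = [set() for _ in src]
    for i, j in combinations(range(len(src)), 2):
        if abs(dist(src[i], src[j]) - dist(dst[i], dst[j])) < tau:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def pivot_guided_3cliques(adj, num_pivots):
    """Expand the best-scoring edges (pivots) into 3-cliques."""
    edges = [(i, j) for i in range(len(adj)) for j in sorted(adj[i]) if i < j]
    # SC^2-style score (assumed): number of neighbours shared by both endpoints.
    edges.sort(key=lambda e: len(adj[e[0]] & adj[e[1]]), reverse=True)
    cliques = []
    for i, j in edges[:num_pivots]:
        # Any vertex adjacent to both pivot endpoints closes a 3-clique.
        for k in sorted(adj[i] & adj[j]):
            cliques.append((i, j, k))
    return cliques
```

Each pivot is expanded independently of the others, which is what makes this kind of search easy to parallelise, and each resulting triple provides three correspondences from which a rigid transformation hypothesis could be estimated.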

[69] OoDDINO:A Multi-level Framework for Anomaly Segmentation on Complex Road Scenes

Yuxing Liu,Ji Zhang,Zhou Xuchuan,Jingzhong Xiao,Huimin Yang,Jiaxin Zhong

Main category: cs.CV

TL;DR: This paper proposes OoDDINO, a new multi-level anomaly segmentation framework that improves anomaly detection accuracy by addressing spatial correlation and threshold selection.

Details Motivation: Existing methods neglect spatial correlations among pixels, and global thresholds lead to false positives or missed detections. Method: A coarse-to-fine multi-level anomaly segmentation framework is proposed, comprising an Orthogonal Uncertainty-Aware Fusion Strategy and an Adaptive Dual-Threshold Network. Result: The framework's effectiveness and compatibility are validated on two benchmark datasets. Conclusion: OoDDINO is a highly compatible framework that can boost the performance of other pixel-wise anomaly detection models. Abstract: Anomaly segmentation aims to identify Out-of-Distribution (OoD) anomalous objects within images. Existing pixel-wise methods typically assign anomaly scores individually and employ a global thresholding strategy to segment anomalies. Despite their effectiveness, these approaches encounter significant challenges in real-world applications: (1) neglecting spatial correlations among pixels within the same object, resulting in fragmented segmentation; (2) variability in anomaly score distributions across image regions, causing global thresholds to either generate false positives in background areas or miss segments of anomalous objects. In this work, we introduce OoDDINO, a novel multi-level anomaly segmentation framework designed to address these limitations through a coarse-to-fine anomaly detection strategy. OoDDINO combines an uncertainty-guided anomaly detection model with a pixel-level segmentation model within a two-stage cascade architecture. Initially, we propose an Orthogonal Uncertainty-Aware Fusion Strategy (OUAFS) that sequentially integrates multiple uncertainty metrics with visual representations, employing orthogonal constraints to strengthen the detection model's capacity for localizing anomalous regions accurately. Subsequently, we develop an Adaptive Dual-Threshold Network (ADT-Net), which dynamically generates region-specific thresholds based on object-level detection outputs and pixel-wise anomaly scores. This approach allows for distinct thresholding strategies within foreground and background areas, achieving fine-grained anomaly segmentation. The proposed framework is compatible with other pixel-wise anomaly detection models and acts as a plug-in to boost their performance. 
Extensive experiments on two benchmark datasets validate our framework's superiority and compatibility over state-of-the-art methods.
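
The contrast between a single global threshold and region-specific thresholds can be illustrated with a small sketch. It is deliberately simplified: ADT-Net predicts its thresholds dynamically, whereas the two constants `t_fg` and `t_bg` below are fixed by hand, and the function name and box format `(x0, y0, x1, y1)` are assumptions.

```python
def dual_threshold_mask(scores, boxes, t_fg, t_bg):
    """Threshold pixel-wise anomaly scores differently inside detected boxes.

    scores: 2D list of anomaly scores; boxes: detections as (x0, y0, x1, y1)
    with exclusive upper bounds. Pixels inside any box use t_fg, others t_bg.
    """
    height, width = len(scores), len(scores[0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            in_box = any(x0 <= x < x1 and y0 <= y < y1 for x0, y0, x1, y1 in boxes)
            row.append(scores[y][x] > (t_fg if in_box else t_bg))
        mask.append(row)
    return mask
```

A lenient threshold inside detected regions helps keep anomalous objects whole, while a strict threshold elsewhere suppresses background false positives.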

[70] NOCTIS: Novel Object Cyclic Threshold based Instance Segmentation

Max Gandyra,Alessandro Santonicola,Michael Beetz

Main category: cs.CV

TL;DR: NOCTIS improves instance segmentation for novel objects without retraining by combining Grounded-SAM 2 and DINOv2 with a novel matching strategy.

Details Motivation: To design a model capable of instance segmentation for novel objects in RGB images without retraining, addressing the limitations of previous approaches like CNOS, SAM-6D, and NIDS-Net. Method: NOCTIS uses Grounded-SAM 2 to generate precise bounding boxes and masks while leveraging DINOv2's zero-shot capabilities for image embeddings. Matching is done using class embeddings, patch embeddings, cyclic patch filtering, and confidence weighting. Result: NOCTIS achieves superior performance compared to existing RGB and RGB-D methods on seven core datasets from the BOP 2023 challenge for the task of 'Model-based 2D segmentation of unseen objects'. Conclusion: The paper concludes that NOCTIS, without further training or fine-tuning, outperforms the best RGB and RGB-D methods on the BOP 2023 challenge datasets for unseen object segmentation. Abstract: Instance segmentation of novel object instances in RGB images, given some example images for each object, is a well-known problem in computer vision. Designing a model general enough to be employed for all kinds of novel objects without (re-)training has proven to be a difficult task. To handle this, we propose a simple yet powerful framework called Novel Object Cyclic Threshold based Instance Segmentation (NOCTIS). This work stems from and improves upon previous ones like CNOS, SAM-6D and NIDS-Net; thus, it also leverages recent vision foundation models, namely Grounded-SAM 2 and DINOv2. It utilises Grounded-SAM 2 to obtain object proposals with precise bounding boxes and their corresponding segmentation masks, while DINOv2's zero-shot capabilities are employed to generate the image embeddings. The quality of those masks, together with their embeddings, is of vital importance to our approach, as the proposal-object matching is realized by determining an object matching score based on the similarity of the class embeddings and the average maximum similarity of the patch embeddings. 
Unlike in SAM-6D, calculating the latter involves a prior patch filtering based on the distance between each patch and its corresponding cyclic/roundtrip patch in the image grid. Furthermore, the average confidence of the proposals' bounding box and mask is used as an additional weighting factor for the object matching score. We empirically show that NOCTIS, without further training/fine-tuning, outperforms the best RGB and RGB-D methods on the seven core datasets of the BOP 2023 challenge for the "Model-based 2D segmentation of unseen objects" task.
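
The matching score outlined in the NOCTIS abstract (class-embedding similarity, average maximum patch similarity after cyclic filtering, and confidence weighting) can be sketched as below. The abstract does not give the exact formula, so several details are assumptions: cosine similarity as the measure, an equal-weight average of the two terms, the `max_cycle_dist` threshold, and all names.

```python
from math import dist, sqrt


def cos(a, b):
    """Cosine similarity of two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def matching_score(prop_cls, tmpl_cls, prop_patches, tmpl_patches,
                   prop_xy, conf, max_cycle_dist=1.0):
    """Score one proposal against one template (hypothetical combination)."""
    sims = [[cos(p, t) for t in tmpl_patches] for p in prop_patches]
    kept = []
    for i, row in enumerate(sims):
        j = max(range(len(row)), key=row.__getitem__)           # best template patch
        i_back = max(range(len(sims)), key=lambda r: sims[r][j])  # roundtrip patch
        # Cyclic filter: keep patch i only if the roundtrip lands near it.
        if dist(prop_xy[i], prop_xy[i_back]) <= max_cycle_dist:
            kept.append(row[j])
    patch_score = sum(kept) / len(kept) if kept else 0.0
    # Assumed combination: mean of both terms, weighted by proposal confidence.
    return conf * 0.5 * (cos(prop_cls, tmpl_cls) + patch_score)
```

The roundtrip check discards patches whose best match does not map back to roughly the same grid location, so ambiguous or background patches contribute less to the score.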

[71] Representation Entanglement for Generation:Training Diffusion Transformers Is Much Easier Than You Think

Ge Wu,Shen Zhang,Ruijing Shi,Shanghua Gao,Zhenyuan Chen,Lei Wang,Zhaowei Chen,Hongcheng Gao,Yao Tang,Jian Yang,Ming-Ming Cheng,Xiang Li

Main category: cs.CV

Details Abstract: REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. In this work, we propose a straightforward method called Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising. REG acquires the capability to produce coherent image-class pairs directly from pure noise, substantially improving both generation quality and training efficiency. This is accomplished with negligible additional inference overhead, requiring only one single additional token for denoising (<0.5\% increase in FLOPs and latency). The inference process concurrently reconstructs both image latents and their corresponding global semantics, where the acquired semantic knowledge actively guides and enhances the image generation process. On ImageNet 256$\times$256, SiT-XL/2 + REG demonstrates remarkable convergence acceleration, achieving $\textbf{63}\times$ and $\textbf{23}\times$ faster training than SiT-XL/2 and SiT-XL/2 + REPA, respectively. More impressively, SiT-L/2 + REG trained for merely 400K iterations outperforms SiT-XL/2 + REPA trained for 4M iterations ($\textbf{10}\times$ longer). Code is available at: https://github.com/Martinser/REG.

[72] Optimizing Methane Detection On Board Satellites: Speed, Accuracy, and Low-Power Solutions for Resource-Constrained Hardware

Jonáš Herec,Vít Růžička,Rado Pitoňák

Main category: cs.CV

TL;DR: This paper explores efficient algorithms like Mag1c-SAS and CEM for onboard methane detection via hyperspectral satellite imagery, achieving significantly faster computation times without sacrificing accuracy, thus enabling early methane leak detection to mitigate climate change.

Details Motivation: Early detection of methane leaks through hyperspectral satellite imagery is crucial for mitigating climate change. However, traditional methods are computationally demanding for resource-limited onboard hardware, necessitating more efficient solutions. Method: This work tests fast target detection methods (ACE, CEM) and proposes Mag1c-SAS, a faster variant of Mag1c. These methods are integrated with machine learning models (U-Net, LinkNet) to explore detection potential. Three band selection strategies are also proposed and evaluated. Result: Two promising methods, Mag1c-SAS and CEM, were identified as acceptably accurate for detecting strong plumes and efficient enough for onboard deployment. One strategy prioritizes accuracy while the other prioritizes speed, achieving up to ~100x and ~230x faster computation than the original Mag1c on resource-limited hardware. One proposed band selection strategy outperforms traditional methods using fewer channels. Conclusion: The study successfully identifies and evaluates efficient, low-power algorithms for onboard methane detection, establishing a foundation for future advancements in this area with minimal hardware requirements. Abstract: Methane is a potent greenhouse gas, and detecting its leaks early via hyperspectral satellite imagery can help mitigate climate change. Meanwhile, many existing missions operate in manual tasking regimes only, thus missing potential events of interest. To overcome slow downlink rates cost-effectively, onboard detection is a viable solution. However, traditional methane enhancement methods are too computationally demanding for resource-limited onboard hardware. This work accelerates methane detection by focusing on efficient, low-power algorithms. 
We test fast target detection methods (ACE, CEM) that have not been previously used for methane detection and propose Mag1c-SAS, a significantly faster variant of Mag1c, the current state-of-the-art algorithm for methane detection. To explore their true detection potential, we integrate them with a machine learning model (U-Net, LinkNet). Our results identify two promising candidates (Mag1c-SAS and CEM), both acceptably accurate for the detection of strong plumes and computationally efficient enough for onboard deployment: one optimized more for accuracy, the other more for speed, achieving up to ~100x and ~230x faster computation than the original Mag1c on resource-limited hardware. Additionally, we propose and evaluate three band selection strategies. One of them can outperform the method traditionally used in the field while using fewer channels, leading to even faster processing without compromising accuracy. This research lays the foundation for future advancements in onboard methane detection with minimal hardware requirements, improving timely data delivery. The produced code, data, and models are open-sourced and can be accessed from https://github.com/zaitra/methane-filters-benchmark.
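
For readers unfamiliar with CEM (constrained energy minimization), its textbook form computes a filter w = R^{-1} d / (d^T R^{-1} d), where R is the sample correlation matrix of the pixel spectra and d is the target signature. The sketch below follows that textbook formula in plain Python; it is not the implementation from the linked repository, and the toy spectra, function names, and the absence of any regularisation are assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [a - factor * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def cem_filter(pixels, target):
    """Return CEM weights w = R^-1 d / (d^T R^-1 d) for target signature d."""
    n, bands = len(pixels), len(target)
    # Sample correlation matrix R = (1/N) * sum of x x^T over all pixels.
    R = [[sum(x[i] * x[j] for x in pixels) / n for j in range(bands)]
         for i in range(bands)]
    r_inv_d = solve(R, target)
    denom = sum(t * v for t, v in zip(target, r_inv_d))
    return [v / denom for v in r_inv_d]


def cem_score(weights, pixel):
    """Detection score of a single pixel: w^T x."""
    return sum(w * x for w, x in zip(weights, pixel))
```

By construction the filter responds with exactly 1 to the target signature while minimising average output energy over the scene, and the whole detector reduces to one linear solve plus a dot product per pixel, which is consistent with the low-compute setting the abstract targets.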

[73] Active Control Points-based 6DoF Pose Tracking for Industrial Metal Objects

Chentao Shen,Ding Pan,Mingyu Mei,Zaixing He,Xinyue Zhao

Main category: cs.CV

TL;DR: This paper proposes a new 6DoF pose tracking method for industrial metal objects using active control points, enhancing robustness and providing a viable real-time solution.

Details Motivation: The motivation is to address the challenge of pose tracking for industrial metal objects in real-world environments due to their reflective characteristics. Method: The method involves using image control points to generate edge features for optimization and introduces an optimal control point regression method to improve robustness. Result: The proposed tracking method performs effectively in both dataset evaluation and real-world tasks. Conclusion: The paper concludes that the proposed 6DoF pose tracking method based on active control points is effective for real-time tracking of industrial metal objects. Abstract: Visual pose tracking has played an increasingly vital role in industrial contexts in recent years. However, pose tracking for industrial metal objects remains a challenging task, especially in real-world environments, due to the reflective characteristics of metal objects. To address this issue, we propose a novel 6DoF pose tracking method based on active control points. Instead of relying on 6DoF pose-based rendering, the method uses image control points to actively generate edge features for optimization and treats these points as optimization variables. We also introduce an optimal control point regression method to improve robustness. The proposed tracking method performs effectively in both dataset evaluation and real-world tasks, providing a viable solution for real-time tracking of industrial metal objects. Our source code is made publicly available at: https://github.com/tomatoma00/ACPTracking.

[74] What Really Matters for Robust Multi-Sensor HD Map Construction?

Xiaoshuai Hao,Yuting Zhao,Yuheng Ji,Luanyuan Dai,Peng Hao,Dingzhe Li,Shuai Cheng,Rong Yin

Main category: cs.CV

TL;DR: This paper enhances the robustness of multi-modal fusion techniques for HD map construction, introducing data augmentation, a new fusion module, and modality dropout training, achieving strong results on the NuScenes dataset.

Details Motivation: Existing Camera-LiDAR fusion techniques primarily focus on model accuracy while neglecting robustness, which is critical for real-world autonomous driving applications. This work aims to address this gap by enhancing the robustness of multi-modal fusion methods. Method: The paper proposes three components: data augmentation, a novel multi-modal fusion module, and a modality dropout training strategy, which are evaluated on a dataset containing 10 days of NuScenes data. Result: The proposed methods significantly improve the robustness of baseline approaches and achieve state-of-the-art performance on the clean validation set of the NuScenes dataset. Conclusion: The paper concludes that the proposed methods significantly enhance the robustness of multi-modal fusion approaches in HD map construction while maintaining high accuracy, offering valuable insights for developing reliable models for autonomous driving. Abstract: High-definition (HD) map construction methods are crucial for providing precise and comprehensive static environmental information, which is essential for autonomous driving systems. While Camera-LiDAR fusion techniques have shown promising results by integrating data from both modalities, existing approaches primarily focus on improving model accuracy and often neglect the robustness of perception models, which is a critical aspect for real-world applications. In this paper, we explore strategies to enhance the robustness of multi-modal fusion methods for HD map construction while maintaining high accuracy. We propose three key components: data augmentation, a novel multi-modal fusion module, and a modality dropout training strategy. These components are evaluated on a challenging dataset containing 10 days of NuScenes data. Our experimental results demonstrate that our proposed methods significantly enhance the robustness of baseline methods. 
Furthermore, our approach achieves state-of-the-art performance on the clean validation set of the NuScenes dataset. Our findings provide valuable insights for developing more robust and reliable HD map construction models, advancing their applicability in real-world autonomous driving scenarios. Project website: https://robomap-123.github.io.
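
A modality dropout training strategy of the kind named in the abstract can be sketched as follows. The abstract gives no details, so the specifics are assumptions: features are zeroed rather than removed, at most one modality is dropped per sample, camera and LiDAR are equally likely to be dropped, and the function name is hypothetical.

```python
import random


def modality_dropout(cam_feats, lidar_feats, p_drop, rng):
    """Randomly blank one modality during training (illustrative sketch).

    With probability p_drop, either the camera or the LiDAR features are
    replaced by zeros, so the fusion model learns not to rely on one sensor.
    """
    if rng.random() < p_drop:
        if rng.random() < 0.5:
            cam_feats = [0.0] * len(cam_feats)
        else:
            lidar_feats = [0.0] * len(lidar_feats)
    return cam_feats, lidar_feats


# Usage: apply per training sample before the fusion module.
cam, lidar = modality_dropout([0.3, 0.7], [1.2, 0.4], 0.5, random.Random(0))
```

At inference time the function would simply not be called, so a missing or corrupted sensor resembles a condition the model has already encountered during training.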

[75] AVC-DPO: Aligned Video Captioning via Direct Preference Optimization

Jiyang Tang,Hengyi Li,Yifan Du,Wayne Xin Zhao

Main category: cs.CV

TL;DR: This paper proposes AVC-DPO, a post-training framework for video multimodal large language models that aligns video captions with human preferences by focusing on temporal dynamics and spatial information, achieving top performance in a video captioning challenge.

Details Motivation: Despite progress in video multimodal large language models (video MLLMs), existing methods struggle to adjust video captions based on human preferences. This limitation motivates the development of a framework that incorporates human-centric preferences into video captioning. Method: The study introduces AVC-DPO, a post-training framework that uses enhanced prompts targeting temporal dynamics and spatial information to align video captions with human preferences. It leverages responses from varied prompt conditions for preference-aware training. Result: The proposed AVC-DPO framework achieved first place in the LOVE@CVPR'25 Workshop Track 1A: Video Detailed Captioning Challenge, particularly excelling in the Video Detailed Captioning (VDC) benchmark using the VDCSCORE evaluation metric. Conclusion: AVC-DPO is effective in enhancing video captioning capabilities by aligning with human preferences, as demonstrated by its top performance on the VDC benchmark. Abstract: Although video multimodal large language models (video MLLMs) have achieved substantial progress in video captioning tasks, it remains challenging to adjust the focal emphasis of video captions according to human preferences. To address this limitation, we propose Aligned Video Captioning via Direct Preference Optimization (AVC-DPO), a post-training framework designed to enhance captioning capabilities in video MLLMs through preference alignment. Our approach designs enhanced prompts that specifically target temporal dynamics and spatial information, two key factors that humans care about when watching a video, thereby incorporating human-centric preferences. AVC-DPO leverages the same foundation model's caption generation responses under varied prompt conditions to conduct preference-aware training and caption alignment. 
Using this framework, we have achieved exceptional performance in the LOVE@CVPR'25 Workshop Track 1A: Video Detailed Captioning Challenge, achieving first place on the Video Detailed Captioning (VDC) benchmark according to the VDCSCORE evaluation metric.
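
For readers unfamiliar with Direct Preference Optimization, the standard DPO objective for a single preference pair can be sketched as follows. This shows the generic DPO loss, not AVC-DPO's training code. How the framework assigns "chosen" and "rejected" captions among the responses generated under different prompt conditions is not described here, so the sketch takes sequence log-probabilities as inputs. The function name and the default `beta` are assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Generic DPO loss for one (chosen, rejected) caption pair.

    Inputs are sequence log-probabilities under the trained policy and
    under the frozen reference model. Returns -log(sigmoid(margin)).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated in a numerically
    # stable way for both signs of m.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls below log 2 once the policy favors the chosen caption more strongly than the reference model does, and rises above it in the opposite case.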

[76] Crop Pest Classification Using Deep Learning Techniques: A Review

Muhammad Hassam Ejaz,Muhammad Bilal,Usman Habib

Main category: cs.CV

TL;DR: This review explores the application of deep learning techniques like CNNs, vision transformers, and hybrid models for AI-based pest monitoring, highlighting progress, current challenges, and future directions.

Details Motivation: Insect pests pose a serious threat to global crop yields, and traditional monitoring methods are slow, manual, and difficult to scale. This motivates the exploration of deep learning techniques for automated pest detection. Method: The study reviewed 37 selected papers published between 2018 and 2025 focusing on AI-based pest classification. The research was organized based on crop type, pest species, model architecture, dataset usage, and technical challenges. Result: Early studies primarily used CNNs, but recent trends show a shift towards hybrid and transformer-based models that offer better accuracy and contextual understanding in pest detection. Conclusion: The review concludes that while deep learning models, especially hybrid and transformer-based ones, have shown improved performance in pest detection, there are still significant challenges to overcome, including imbalanced datasets, small pest detection, limited generalizability, and deployment constraints. Abstract: Insect pests continue to pose a serious threat to crop yields around the world, and traditional methods for monitoring them are often slow, manual, and difficult to scale. In recent years, deep learning has emerged as a powerful solution, with techniques like convolutional neural networks (CNNs), vision transformers (ViTs), and hybrid models gaining popularity for automating pest detection. This review looks at 37 carefully selected studies published between 2018 and 2025, all focused on AI-based pest classification. The selected research is organized by crop type, pest species, model architecture, dataset usage, and key technical challenges. The early studies relied heavily on CNNs, but the latest work is shifting toward hybrid and transformer-based models that deliver higher accuracy and better contextual understanding. 
Still, challenges like imbalanced datasets, difficulty in detecting small pests, limited generalizability, and deployment on edge devices remain significant hurdles. Overall, this review offers a structured overview of the field, highlights useful datasets, and outlines the key challenges and future directions for AI-based pest monitoring systems.

[77] ReFlex: Text-Guided Editing of Real Images in Rectified Flow via Mid-Step Feature Extraction and Attention Adaptation

Jimyeong Kim,Jungwon Park,Yeji Song,Nojun Kwak,Wonjong Rhee

Main category: cs.CV

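TL;DR: This paper proposes a new real-image editing method for ReFlow, built on analyzing the intermediate representations of multimodal transformer blocks and identifying three key features.

Details Motivation: ReFlow text-to-image models surpass diffusion models in image quality and text alignment, but adapting ReFlow to real-image editing remains challenging. Method: The method analyzes the intermediate representations of multimodal transformer blocks and identifies three key features, extracts these features from real images using a mid-step latent, and adapts attention during injection to improve editability and alignment with the target text. Result: Experiments show that the method outperforms nine prior baselines on two benchmarks, and user preference strongly favors the approach. Conclusion: The proposed method, which is training-free, requires no user-provided mask, and can be applied even without a source prompt, performs well on real-image editing. Abstract: Rectified Flow text-to-image models surpass diffusion models in image quality and text alignment, but adapting ReFlow for real-image editing remains challenging. We propose a new real-image editing method for ReFlow by analyzing the intermediate representations of multimodal transformer blocks and identifying three key features. To extract these features from real images with sufficient structural preservation, we leverage a mid-step latent, which is inverted only up to the mid-step. We then adapt attention during injection to improve editability and enhance alignment to the target text. Our method is training-free, requires no user-provided mask, and can be applied even without a source prompt. Extensive experiments on two benchmarks with nine baselines demonstrate its superior performance over prior methods, further validated by human evaluations confirming a strong user preference for our approach.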

[78] Integrating Traditional and Deep Learning Methods to Detect Tree Crowns in Satellite Images

Ozan Durgut,Beril Kallfelz-Sirmacek,Cem Unsalan

Main category: cs.CV

TL;DR: This paper proposes a hybrid method combining traditional and deep learning techniques with a rule-based system to improve automatic tree crown detection.

Details Motivation: The motivation is to address the challenge of monitoring forests to protect them, which is crucial for tackling global environmental issues like biodiversity loss and climate change. Method: The study introduces two tree crown detection methods—traditional methods for feature extraction and segmentation, and deep learning methods for detection—and combines them into a rule-based post-processing approach. Result: The proposed method increases the number of detected tree crowns by using neighboring trees and localized operations in a rule-based post-processing step. Conclusion: The study concludes that integrating traditional and deep learning methods through a rule-based approach improves the robustness and accuracy of tree crown detection. Abstract: Global warming, loss of biodiversity, and air pollution are among the most significant problems facing Earth. One of the primary challenges in addressing these issues is the lack of monitoring forests to protect them. To tackle this problem, it is important to leverage remote sensing and computer vision methods to automate monitoring applications. Hence, automatic tree crown detection algorithms emerged based on traditional and deep learning methods. In this study, we first introduce two different tree crown detection methods based on these approaches. Then, we form a novel rule-based approach that integrates these two methods to enhance robustness and accuracy of tree crown detection results. While traditional methods are employed for feature extraction and segmentation of forested areas, deep learning methods are used to detect tree crowns in our method. With the proposed rule-based approach, we post-process these results, aiming to increase the number of detected tree crowns through neighboring trees and localized operations. 
We compare the obtained results with the proposed method in terms of the number of detected tree crowns and report the advantages, disadvantages, and areas for improvement of the obtained outcomes.

[79] Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence

Robert Aufschläger,Youssef Shoeb,Azarm Nowzad,Michael Heigl,Fabian Bally,Martin Schramm

Main category: cs.CV

TL;DR: This paper proposes cRID, a cross-modal framework for detecting semantic PII in person image datasets and improving cross-dataset Re-ID performance.

Details Motivation: Open datasets of street-level recordings play a key role in advancing autonomous driving systems and AI research, but the personally identifiable information (PII) they contain poses significant privacy risks to pedestrians, so effective methods are needed to detect and manage PII in such data. Method: The paper proposes cRID, a new method that uses Large Vision-Language Models, Graph Attention Networks, and representation learning to identify interpretable PII features, and evaluates it in cross-dataset Re-ID experiments. Result: Experiments show that the cRID framework improves performance in cross-dataset Re-ID scenarios, particularly on the Market-1501 to CUHK03-np (detected) transfer task, demonstrating its practical utility. Conclusion: cRID is a cross-modal framework combining Large Vision-Language Models, Graph Attention Networks, and representation learning that aims to detect textually describable PII clues and enhance person re-identification (Re-ID) performance. Abstract: The collection and release of street-level recordings as Open Data play a vital role in advancing autonomous driving systems and AI research. However, these datasets pose significant privacy risks, particularly for pedestrians, due to the presence of Personally Identifiable Information (PII) that extends beyond biometric traits such as faces. In this paper, we present cRID, a novel cross-modal framework combining Large Vision-Language Models, Graph Attention Networks, and representation learning to detect textually describable clues of PII and enhance person re-identification (Re-ID). Our approach focuses on identifying and leveraging interpretable features, enabling the detection of semantically meaningful PII beyond low-level appearance cues. We conduct a systematic evaluation of PII presence in person image datasets. Our experiments show improved performance in practical cross-dataset Re-ID scenarios, notably from Market-1501 to CUHK03-np (detected), highlighting the framework's practical utility. Code is available at https://github.com/RAufschlaeger/cRID.

[80] Mamba Guided Boundary Prior Matters: A New Perspective for Generalized Polyp Segmentation

Tapas K. Dutta,Snehashis Majhi,Deepak Ranjan Nayak,Debesh Jha

Main category: cs.CV

TL;DR: SAM-MaGuP is a novel method for more accurate and robust polyp segmentation in colonoscopy images, addressing key limitations of existing approaches.

Details Motivation: Polyp segmentation is critical for early detection of colorectal cancer but remains challenging due to variations in polyp appearance and indistinct boundaries. Existing methods struggle with weak boundaries and generalizability for clinical use. Method: SAM-MaGuP integrates a boundary distillation module and a 1D-2D Mamba adapter into the Segment Anything Model (SAM) to enhance feature learning and resolve weak boundary challenges. Result: Extensive evaluations show that SAM-MaGuP outperforms state-of-the-art methods across five diverse datasets, achieving superior segmentation performance. Conclusion: The proposed SAM-MaGuP method significantly improves the accuracy and robustness of polyp segmentation in colonoscopy images, setting a new benchmark in the field. Abstract: Polyp segmentation in colonoscopy images is crucial for early detection and diagnosis of colorectal cancer. However, this task remains a significant challenge due to the substantial variations in polyp shape, size, and color, as well as the high similarity between polyps and surrounding tissues, often compounded by indistinct boundaries. While existing encoder-decoder CNN and transformer-based approaches have shown promising results, they struggle with stable segmentation performance on polyps with weak or blurry boundaries. These methods exhibit limited abilities to distinguish between polyps and non-polyps and capture essential boundary cues. Moreover, their generalizability still falls short of meeting the demands of real-time clinical applications. To address these limitations, we propose SAM-MaGuP, a groundbreaking approach for robust polyp segmentation. By incorporating a boundary distillation module and a 1D-2D Mamba adapter within the Segment Anything Model (SAM), SAM-MaGuP excels at resolving weak boundary challenges and amplifies feature learning through enriched global contextual interactions. 
Extensive evaluations across five diverse datasets reveal that SAM-MaGuP outperforms state-of-the-art methods, achieving unmatched segmentation accuracy and robustness. Our key innovations, a Mamba-guided boundary prior and a 1D-2D Mamba block, set a new benchmark in the field, pushing the boundaries of polyp segmentation to new heights.

[81] Exploring Pose-based Sign Language Translation: Ablation Studies and Attention Insights

Tomas Zelezny,Jakub Straka,Vaclav Javorek,Ondrej Valach,Marek Hruz,Ivan Gruber

Main category: cs.CV

TL;DR: This paper demonstrates how pose-based preprocessing techniques improve sign language translation models, showing significant gains in accuracy and robustness, with insights into attention behavior suggesting further enhancements via register tokens.

Details Motivation: The motivation stems from the evolution of Sign Language Translation systems from isolated recognition methods to more sophisticated continuous gloss-free translation frameworks. The study aims to understand how pose-based data preprocessing affects SLT performance. Method: The study employs a transformer-based architecture using a modified T5 encoder-decoder model to process pose representations. It conducts extensive ablation studies on the YouTubeASL and How2Sign datasets to evaluate the impact of different preprocessing strategies. Result: The results show that appropriate data preprocessing techniques can significantly improve translation accuracy and model robustness. Additionally, an in-depth analysis of model attention suggests that introducing a dedicated register token enhances overall performance. Conclusion: The paper concludes that pose-based data preprocessing techniques such as normalization, interpolation, and augmentation can significantly enhance the robustness and generalization abilities of Sign Language Translation systems. Moreover, incorporating a dedicated register token improves model performance. Abstract: Sign Language Translation (SLT) has evolved significantly, moving from isolated recognition approaches to complex, continuous gloss-free translation systems. This paper explores the impact of pose-based data preprocessing techniques - normalization, interpolation, and augmentation - on SLT performance. We employ a transformer-based architecture, adapting a modified T5 encoder-decoder model to process pose representations. Through extensive ablation studies on YouTubeASL and How2Sign datasets, we analyze how different preprocessing strategies affect translation accuracy. Our results demonstrate that appropriate normalization, interpolation, and augmentation techniques can significantly improve model robustness and generalization abilities. 
Additionally, we provide a deep analysis of the model's attentions and reveal interesting behavior suggesting that adding a dedicated register token can improve overall model performance. We publish our code on our GitHub repository, including the preprocessed YouTubeASL data.
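
Two of the preprocessing steps named above, interpolation and normalization, can be sketched in a few lines. This is a hedged illustration and not the paper's pipeline: filling gaps linearly, centering on the shoulder midpoint, and scaling by shoulder width are common conventions assumed here, and the function names are hypothetical.

```python
import math


def interpolate_missing(track):
    """Fill None entries in one coordinate track (one value per frame).

    Interior gaps are filled linearly between the nearest known frames;
    leading and trailing gaps copy the nearest known value.
    """
    known = [i for i, v in enumerate(track) if v is not None]
    out = list(track)
    if not known:
        return out  # nothing to interpolate from
    for i, v in enumerate(track):
        if v is not None:
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None:
            out[i] = track[nxt]
        elif nxt is None:
            out[i] = track[prev]
        else:
            t = (i - prev) / (nxt - prev)
            out[i] = track[prev] + t * (track[nxt] - track[prev])
    return out


def normalize_pose(keypoints, left_shoulder, right_shoulder):
    """Center (x, y) keypoints on the shoulder midpoint and divide by
    shoulder width, so signers at different distances are comparable."""
    (lx, ly), (rx, ry) = keypoints[left_shoulder], keypoints[right_shoulder]
    cx, cy = (lx + rx) / 2.0, (ly + ry) / 2.0
    width = math.hypot(lx - rx, ly - ry) or 1.0  # guard against zero width
    return [((x - cx) / width, (y - cy) / width) for x, y in keypoints]
```

For instance, a track of `[0.0, None, None, 3.0]` is filled with evenly spaced values between the two known frames.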

[82] TrackingMiM: Efficient Mamba-in-Mamba Serialization for Real-time UAV Object Tracking

Bingxi Liu,Calvin Chen,Junhao Li,Guyang Yu,Haoqian Song,Xuchen Liu,Jinqiang Cui,Hong Zhang

Main category: cs.CV

TL;DR: This paper introduces TrackingMiM, a Mamba-based architecture that improves temporal consistency and computational efficiency for UAV tracking, achieving high precision and faster performance.

Details Motivation: The Vision Transformer (ViT) suffers from quadratic complexity, which is problematic for real-time processing in unmanned aerial vehicle (UAV) tracking systems. This limitation motivates the exploration of alternative models like the State-Space Model (Mamba), which offers computational efficiency and long-sequence modeling capabilities. Method: The study proposes TrackingMiM, a Mamba-in-Mamba architecture designed to handle image sequences with minimal computational burden. It uses a nested Mamba scanning mechanism to process temporally and spatially coherent patch tokens independently while using the template frame as a query token for tracking. Result: Extensive experiments on five UAV tracking benchmarks demonstrate that TrackingMiM achieves state-of-the-art precision while significantly improving tracking speed compared to existing methods. Conclusion: The proposed TrackingMiM model effectively addresses the temporal inconsistency issue in existing Mamba-based methods by introducing a nested Mamba scan mechanism that independently processes temporal and spatial information, making it suitable for UAV tracking tasks. Abstract: The Vision Transformer (ViT) model has long struggled with the challenge of quadratic complexity, a limitation that becomes especially critical in unmanned aerial vehicle (UAV) tracking systems, where data must be processed in real time. In this study, we explore the recently proposed State-Space Model, Mamba, leveraging its computational efficiency and capability for long-sequence modeling to effectively process dense image sequences in tracking tasks. First, we highlight the issue of temporal inconsistency in existing Mamba-based methods, specifically the failure to account for temporal continuity in the Mamba scanning mechanism. Second, building upon this insight, we propose TrackingMiM, a Mamba-in-Mamba architecture with minimal computational burden for handling image sequences in tracking problems. 
In our framework, the Mamba scan is performed in a nested way, independently processing temporally and spatially coherent patch tokens, while the template frame is encoded as a query token and used for tracking in every scan. Extensive experiments conducted on five UAV tracking benchmarks confirm that the proposed TrackingMiM achieves state-of-the-art precision while offering noticeably higher speed in UAV tracking.

[83] ECCV 2024 W-CODA: 1st Workshop on Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving

Kai Chen,Ruiyuan Gao,Lanqing Hong,Hang Xu,Xu Jia,Holger Caesar,Dengxin Dai,Bingbing Liu,Dzmitry Tsishkou,Songcen Xu,Chunjing Xu,Qiang Xu,Huchuan Lu,Dit-Yan Yeung

Main category: cs.CV

TL;DR: The W-CODA workshop at ECCV 2024 focused on exploring cutting-edge solutions for autonomous driving challenges through discussions, paper submissions, and a dual-track challenge on corner case scenarios.

Details Motivation: To address the challenges in autonomous driving, particularly in handling corner cases, by leveraging advanced multimodal perception and comprehension techniques. Method: The workshop featured presentations from five invited speakers, a collection of research papers, and a dual-track challenge focused on corner case scene understanding and generation. Result: W-CODA established a platform for sharing progress and opinions on autonomous driving, highlighting the importance of frontier techniques in achieving reliable self-driving agents. Conclusion: The W-CODA workshop successfully brought together experts from academia and industry to discuss next-generation solutions for autonomous driving corner cases, aiming to bridge the gap between current techniques and fully intelligent self-driving agents. Abstract: In this paper, we present details of the 1st W-CODA workshop, held in conjunction with ECCV 2024. W-CODA aims to explore next-generation solutions for autonomous driving corner cases, empowered by state-of-the-art multimodal perception and comprehension techniques. Five speakers from both academia and industry are invited to share their latest progress and opinions. We collect research papers and hold a dual-track challenge, including both corner case scene understanding and generation. As the pioneering effort, we will continuously bridge the gap between frontier autonomous driving techniques and fully intelligent, reliable self-driving agents robust towards corner cases.

[84] A Multi-Centric Anthropomorphic 3D CT Phantom-Based Benchmark Dataset for Harmonization

Mohammadreza Amirian,Michael Bach,Oscar Jimenez-del-Toro,Christoph Aberle,Roger Schaer,Vincent Andrearczyk,Jean-Félix Maestrati,Maria Martin Asiain,Kyriakos Flouris,Markus Obmann,Clarisse Dromain,Benoît Dufour,Pierre-Alexandre Alois Poletti,Hendrik von Tengg-Kobligk,Rolf Hügli,Martin Kretzschmar,Hatem Alkadhi,Ender Konukoglu,Henning Müller,Bram Stieltjes,Adrien Depeursinge

Main category: cs.CV

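TL;DR: This paper introduces a new open-source benchmark dataset and accompanying methodology that address data distribution shifts in AI-based CT analysis, aiming to improve model generalization by fostering the development of AI harmonization techniques.

Details Motivation: Although artificial intelligence (AI) has created many opportunities for human assistance and task automation in medicine, it generalizes poorly when the data distribution shifts. In AI-based CT analysis in particular, changes in scanner manufacturer, reconstruction technique, and dose can cause significant data distribution shifts. Method: The dataset contains 1378 image series acquired with 13 scanners from 4 manufacturers across 8 institutions, using a harmonized protocol and several acquisition doses. The authors also present a methodology for assessing image- and feature-level stability, along with baseline results for liver tissue classification and open-source code. Result: The paper presents an open-source benchmark dataset together with evaluation methods and baseline results, intended to foster the development of AI harmonization strategies and improve the stability and adaptability of AI models in CT analysis. Conclusion: The paper proposes an open-source benchmark dataset to foster the development of AI harmonization techniques that reduce data distribution shifts in CT analysis caused by differing acquisition settings. Abstract: Artificial intelligence (AI) has introduced numerous opportunities for human assistance and task automation in medicine. However, it suffers from poor generalization in the presence of shifts in the data distribution. In the context of AI-based computed tomography (CT) analysis, significant data distribution shifts can be caused by changes in scanner manufacturer, reconstruction technique or dose. AI harmonization techniques can address this problem by reducing distribution shifts caused by various acquisition settings. This paper presents an open-source benchmark dataset containing CT scans of an anthropomorphic phantom acquired with various scanners and settings, whose purpose is to foster the development of AI harmonization techniques. Using a phantom allows fixing variations attributed to inter- and intra-patient variations. The dataset includes 1378 image series acquired with 13 scanners from 4 manufacturers across 8 institutions using a harmonized protocol as well as several acquisition doses. Additionally, we present a methodology, baseline results and open-source code to assess image- and feature-level stability and liver tissue classification, promoting the development of AI harmonization strategies.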

[85] Interpolation-Based Event Visual Data Filtering Algorithms

Marcin Kowlaczyk,Tomasz Kryjak

Main category: cs.CV

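TL;DR: This paper proposes a new method for denoising event camera data that removes about 99% of noise and is suitable for embedded devices.

Details Motivation: With the growth of neuromorphic vision, event cameras are used in more and more applications, but their data streams contain significant noise, so an efficient denoising method is needed. Method: The authors propose four algorithms based on a matrix of infinite impulse response (IIR) filters and compare them on several event datasets to which artificially generated noise and noise recorded with a dynamic vision sensor were added. Result: The proposed methods use only about 30KB of memory for a sensor with a resolution of 1280 x 720 and remove approximately 99% of noise while preserving most of the valid signal. Conclusion: The paper proposes a method based on a matrix of IIR filters that removes most of the noise in event data and is suitable for implementation on embedded devices. Abstract: The field of neuromorphic vision is developing rapidly, and event cameras are finding their way into more and more applications. However, the data stream from these sensors is characterised by significant noise. In this paper, we propose a method for event data that is capable of removing approximately 99% of noise while preserving the majority of the valid signal. We have proposed four algorithms based on the matrix of infinite impulse response (IIR) filters method. We compared them on several event datasets that were further modified by adding artificially generated noise and noise recorded with a dynamic vision sensor. The proposed methods use about 30KB of memory for a sensor with a resolution of 1280 x 720 and are therefore well suited for implementation in embedded devices.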
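
To make the idea of a "matrix of IIR filters" concrete, here is a minimal sketch of one possible design. It is not any of the four algorithms from the paper, whose details are not given here. The class name, the 8-pixel cell size, the time constant, and the threshold are all illustrative assumptions. Each cell of a coarse grid holds a first-order IIR (exponentially decaying) activity value, and an event is kept only if its cell was recently active.

```python
import math


class IIRActivityFilter:
    """Hypothetical event filter: one first-order IIR state per grid cell."""

    def __init__(self, width, height, cell=8, tau_us=10_000.0, threshold=0.5):
        self.cell = cell
        self.cols = (width + cell - 1) // cell    # cells per row
        self.rows = (height + cell - 1) // cell   # cells per column
        self.tau_us = tau_us                      # decay time constant
        self.threshold = threshold
        self.activity = [0.0] * (self.cols * self.rows)
        self.last_t = [0] * (self.cols * self.rows)

    def process(self, x, y, t_us):
        """Return True if the event at pixel (x, y), time t_us is kept."""
        idx = (y // self.cell) * self.cols + (x // self.cell)
        # First-order IIR update: decay the old state, then add the input.
        decay = math.exp(-(t_us - self.last_t[idx]) / self.tau_us)
        decayed = self.activity[idx] * decay
        keep = decayed >= self.threshold  # supported by recent nearby events
        self.activity[idx] = decayed + 1.0
        self.last_t[idx] = t_us
        return keep
```

Keeping the state on a grid coarser than the sensor is what keeps memory small in a design like this. An isolated event finds no recent activity in its cell and is dropped, while events that arrive in spatio-temporal clusters pass.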

[86] A Gift from the Integration of Discriminative and Diffusion-based Generative Learning: Boundary Refinement Remote Sensing Semantic Segmentation

Hao Wang,Keyan Hu,Xin Guo,Haifeng Li,Chao Tao

Main category: cs.CV

TL;DR: This paper proposes the IDGBR framework for remote sensing semantic segmentation, combining discriminative and generative learning to improve boundary refinement.

Details Motivation: Remote sensing semantic segmentation requires accurate identification and localization of ground objects, but existing methods struggle with capturing high-frequency features necessary for precise boundary localization. Method: The IDGBR framework generates a coarse segmentation map using a discriminative backbone model, then utilizes a conditioning guidance network and an iterative denoising diffusion process to refine the segmentation. Result: Extensive experiments on five remote sensing datasets demonstrate that the IDGBR framework consistently refines boundaries in coarse segmentation results from diverse discriminative architectures. Conclusion: The proposed IDGBR framework successfully integrates discriminative and generative learning to refine boundaries in remote sensing semantic segmentation, showing consistent improvement across multiple datasets. Abstract: Remote sensing semantic segmentation must address both what the ground objects are within an image and where they are located. Consequently, segmentation models must ensure not only the semantic correctness of large-scale patches (low-frequency information) but also the precise localization of boundaries between patches (high-frequency information). However, most existing approaches rely heavily on discriminative learning, which excels at capturing low-frequency features, while overlooking its inherent limitations in learning high-frequency features for semantic segmentation. Recent studies have revealed that diffusion generative models excel at generating high-frequency details. Our theoretical analysis confirms that the diffusion denoising process significantly enhances the model's ability to learn high-frequency features; however, we also observe that these models exhibit insufficient semantic inference for low-frequency features when guided solely by the original image. 
Therefore, we integrate the strengths of both discriminative and generative learning, proposing the Integration of Discriminative and diffusion-based Generative learning for Boundary Refinement (IDGBR) framework. The framework first generates a coarse segmentation map using a discriminative backbone model. This map and the original image are fed into a conditioning guidance network to jointly learn a guidance representation subsequently leveraged by an iterative denoising diffusion process refining the coarse segmentation. Extensive experiments across five remote sensing semantic segmentation datasets (binary and multi-class segmentation) confirm our framework's capability of consistent boundary refinement for coarse results from diverse discriminative architectures. The source code will be available at https://github.com/KeyanHu-git/IDGBR.

[87] SketchColour: Channel Concat Guided DiT-based Sketch-to-Colour Pipeline for 2D Animation

Bryan Constantine Sadihin,Michael Hua Wang,Shei Pern Chua,Hang Su

Main category: cs.CV

TL;DR: SketchColour is a new method for 2D animation colorization that improves efficiency and quality by utilizing a diffusion transformer and specialized adapters.

Details Motivation: The process of creating high-quality 2D animations is labor-intensive due to the need for animators to manually draw and color numerous frames. Method: Replaced conventional U-Net denoiser with a DiT-style architecture and injected sketch information using lightweight channel-concatenation adapters and LoRA finetuning. Result: Evaluated on the SAKUGA dataset, SketchColour outperforms existing methods in video colorization while using only half the training data and significantly reducing parameter count and GPU memory usage. Conclusion: SketchColour offers a more efficient and effective approach to 2D animation colorization, leveraging a diffusion transformer backbone with lightweight adapters and LoRA finetuning. Abstract: The production of high-quality 2D animation is a highly labor-intensive process, as animators are currently required to draw and color a large number of frames by hand. We present SketchColour, the first sketch-to-colour pipeline for 2D animation built on a diffusion transformer (DiT) backbone. By replacing the conventional U-Net denoiser with a DiT-style architecture and injecting sketch information via lightweight channel-concatenation adapters accompanied by LoRA finetuning, our method natively integrates conditioning without the parameter and memory bloat of a duplicated ControlNet, greatly reducing parameter count and GPU memory usage. Evaluated on the SAKUGA dataset, SketchColour outperforms previous state-of-the-art video colourization methods across all metrics, despite using only half the training data of competing models. Our approach produces temporally coherent animations with minimal artifacts such as colour bleeding or object deformation. Our code is available at: https://bconstantine.github.io/SketchColour .
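
The two conditioning ingredients named in the abstract, channel concatenation and LoRA, can be shown in a small dependency-free sketch. It illustrates the generic techniques only. How SketchColour wires them into its DiT backbone is not described here, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A for a LoRA-adapted layer.

    w: frozen weight (out x in); b: (out x r); a: (r x in); r is the rank.
    Only a and b are trained, so the trainable parameter count is
    r * (in + out) instead of in * out.
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


def channel_concat(noisy_latent, sketch_latent):
    """Stack sketch channels after the noisy-latent channels.

    Each argument is a list of channels; the denoiser's input layer
    must accept the combined channel count.
    """
    return noisy_latent + sketch_latent
```

With `b` initialized to zeros, as is usual for LoRA, the adapted layer starts out identical to the frozen one, so finetuning begins from the pretrained behavior.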

[88] Towards Controllable Real Image Denoising with Camera Parameters

Youngjin Oh,Junhyeong Kwon,Keuntek Lee,Nam Ik Cho

Main category: cs.CV

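TL;DR: This paper introduces a new controllable image denoising framework that uses camera parameters to adjust denoising strength and improves denoising performance.

Details Motivation: Existing deep learning-based image denoising methods lack the flexibility to adjust denoising strength according to noise level, camera settings, and user preferences. Method: ISO, shutter speed, and F-number are converted into a vector that controls and enhances the performance of the denoising network. Result: Experiments show that the proposed framework seamlessly adds controllability to standard denoising neural networks and improves their performance. Conclusion: The paper proposes a controllable denoising framework that uses camera parameters to adjust denoising strength, and experiments confirm that the method improves denoising results. Abstract: Recent deep learning-based image denoising methods have shown impressive performance; however, many lack the flexibility to adjust the denoising strength based on the noise levels, camera settings, and user preferences. In this paper, we introduce a new controllable denoising framework that adaptively removes noise from images by utilizing information from camera parameters. Specifically, we focus on ISO, shutter speed, and F-number, which are closely related to noise levels. We convert these selected parameters into a vector to control and enhance the performance of the denoising network. Experimental results show that our method seamlessly adds controllability to standard denoising neural networks and improves their performance. Code is available at https://github.com/OBAKSA/CPADNet.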
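
The step of converting camera parameters into a vector can be illustrated as below. The abstract does not specify the encoding, so the log2 "stops" scaling used here is purely an assumption chosen because exposure settings are conventionally compared in stops. The function name is hypothetical and this is not the CPADNet code.

```python
import math


def camera_param_vector(iso, shutter_s, f_number):
    """Encode (ISO, shutter speed in seconds, F-number) in photographic stops.

    Assumed encoding: ISO relative to 100, shutter time relative to 1 s,
    and aperture as 2 * log2(N), since light gathered scales with 1 / N^2.
    """
    return [
        math.log2(iso / 100.0),
        math.log2(shutter_s),
        2.0 * math.log2(f_number),
    ]
```

A vector like this could then be fed to the denoising network as a conditioning input, for example through a small embedding layer.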

[89] Autonomous AI Surveillance: Multimodal Deep Learning for Cognitive and Behavioral Monitoring

Ameer Hamza,Zuhaib Hussain But,Umar Arif,Samiya,M. Abdullah Asad,Muhammad Naeem

Main category: cs.CV

TL;DR: A novel multi-modal classroom surveillance system has been developed to accurately assess student attentiveness by integrating drowsiness detection, mobile phone tracking, and face recognition technologies, achieving high performance metrics and enabling automatic attendance.

Details Motivation: To assess student attentiveness with enhanced precision through comprehensive, real-time monitoring of engagement and behavior. Method: The system integrates multiple modalities such as drowsiness detection, mobile phone usage tracking, and face recognition. It uses the YOLOv8 model to detect mobile phone usage and sleeping, LResNet Occ FC for facial recognition using YOLO and MTCNN, and is implemented within a core PHP web application with ESP32-CAM hardware for data capture. Result: Sleep detection achieves 97.42% mAP@50, face recognition reaches 86.45% validation accuracy, and mobile phone detection attains 85.89% mAP@50. Conclusion: The system not only enhances classroom monitoring but also ensures automatic attendance recording via face recognition, offering scalability for diverse educational environments. Abstract: This study presents a novel classroom surveillance system that integrates multiple modalities, including drowsiness detection, tracking of mobile phone usage, and face recognition, to assess student attentiveness with enhanced precision. The system leverages the YOLOv8 model to detect both mobile phone usage and sleeping (Ghatge et al., 2024), while facial recognition is achieved through LResNet Occ FC body tracking using YOLO and MTCNN (Durai et al., 2024). These models work in synergy to provide comprehensive, real-time monitoring, offering insights into student engagement and behavior (S et al., 2023). The framework is trained on specialized datasets, such as the RMFD dataset for face recognition and a Roboflow dataset for mobile phone detection. The extensive evaluation of the system shows promising results. Sleep detection achieves 97.42% mAP@50, face recognition achieves 86.45% validation accuracy, and mobile phone detection reaches 85.89% mAP@50. 
The system is implemented within a core PHP web application and utilizes ESP32-CAM hardware for seamless data capture (Neto et al., 2024). This integrated approach not only enhances classroom monitoring, but also ensures automatic attendance recording via face recognition as students remain seated in the classroom, offering scalability for diverse educational environments (Banada, 2025).
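
The mAP@50 figures reported here count a detection as correct when its box overlaps a ground-truth box with an intersection-over-union (IoU) of at least 0.5. A minimal sketch of that matching test follows. The function names and the `(x0, y0, x1, y1)` box convention are assumptions made for illustration, and the full mAP computation (ranking by confidence, averaging precision over classes) is not shown.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(pred_box, gt_box):
    """True when a predicted box would count as a hit under the @50 criterion."""
    return iou(pred_box, gt_box) >= 0.5
```

A prediction that covers only half of a same-sized ground-truth box has an IoU of 1/3 and is therefore rejected at this threshold.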

[90] DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation

Yue-Jiang Dong,Wang Zhao,Jiale Xu,Ying Shan,Song-Hai Zhang

Main category: cs.CV

TL;DR: This paper proposes DepthSync, a new framework for scale- and geometry-consistent depth prediction on long videos.

Details Motivation: Existing diffusion-based video depth estimation methods suffer from scale discrepancies and inconsistent geometric structure when processing long videos. Method: Scale guidance and geometry guidance are introduced to synchronize the depth scale across windows and to enforce geometric alignment within windows. Result: Experiments show that the method effectively improves the scale and geometry consistency of long-video depth estimation on various datasets. Conclusion: DepthSync is an effective training-free framework that uses diffusion guidance to achieve scale- and geometry-consistent depth prediction for long videos. Abstract: Diffusion-based video depth estimation methods have achieved remarkable success with strong generalization ability. However, predicting depth for long videos remains challenging. Existing methods typically split videos into overlapping sliding windows, leading to accumulated scale discrepancies across different windows, particularly as the number of windows increases. Additionally, these methods rely solely on 2D diffusion priors, overlooking the inherent 3D geometric structure of video depths, which results in geometrically inconsistent predictions. In this paper, we propose DepthSync, a novel, training-free framework using diffusion guidance to achieve scale- and geometry-consistent depth predictions for long videos. Specifically, we introduce scale guidance to synchronize the depth scale across windows and geometry guidance to enforce geometric alignment within windows based on the inherent 3D constraints in video depths. These two terms work synergistically, steering the denoising process toward consistent depth predictions. Experiments on various datasets validate the effectiveness of our method in producing depth estimates with improved scale and geometry consistency, particularly for long videos.
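
To make the cross-window scale discrepancy concrete, the sketch below rescales each sliding window so that its overlapping frames agree with the previous window in a least-squares sense. This is a simple post-hoc alignment chosen for illustration, not the diffusion guidance described in the abstract. The function names and the simplification of one depth value per frame are assumptions.

```python
def least_squares_scale(reference, target):
    """Scale s minimising sum((s * t - r) ** 2) over paired depth values."""
    num = sum(r * t for r, t in zip(reference, target))
    den = sum(t * t for t in target)
    if den == 0:
        raise ValueError("target depths are all zero")
    return num / den


def align_windows(windows, overlap):
    """Rescale each window so its first `overlap` values match the
    last `overlap` values of the previously aligned window."""
    aligned = [list(windows[0])]
    for win in windows[1:]:
        prev_tail = aligned[-1][-overlap:]
        head = win[:overlap]
        s = least_squares_scale(prev_tail, head)
        aligned.append([s * d for d in win])
    return aligned
```

Because each window is aligned to the one before it, any estimation error in the scale propagates down the chain, which is one way to picture the accumulated discrepancies the abstract mentions.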

[91] Survivability of Backdoor Attacks on Unconstrained Face Recognition Systems

Quentin Le Roux,Yannick Teglia,Teddy Furon,Philippe Loubet-Moundi,Eric Bourbao

Main category: cs.CV

TL;DR: This paper investigates DNN backdoor attacks on deep learning-based face recognition systems, demonstrating their vulnerability and offering countermeasures for stakeholders.

Details Motivation: The motivation is to address the lack of research on DNN backdoor attacks in real-life, unconstrained deep learning face recognition systems, which raises security concerns. Method: The study conducts a system-level analysis of backdoors in deep learning-based face recognition systems, exploring the feasibility of DNN backdoors through four contributions involving face detection tasks, feature extractors, and pipeline configurations. Result: The study demonstrates two backdoor attacks on face detection tasks, shows that feature extractors are vulnerable, and proves that a single backdoor can compromise entire system functionality across multiple pipeline configurations. Conclusion: This paper concludes that DNN backdoor attacks pose a significant threat to deep learning-based face recognition systems and provides best practices and countermeasures to address these vulnerabilities. Abstract: The widespread use of deep learning face recognition raises several security concerns. Although prior works point at existing vulnerabilities, DNN backdoor attacks against real-life, unconstrained systems dealing with images captured in the wild remain a blind spot of the literature. This paper conducts the first system-level study of backdoors in deep learning-based face recognition systems. This paper yields four contributions by exploring the feasibility of DNN backdoors on these pipelines in a holistic fashion. We demonstrate for the first time two backdoor attacks on the face detection task: face generation and face landmark shift attacks. We then show that face feature extractors trained with large margin losses also fall victim to backdoor attacks. Combining our models, we then show using 20 possible pipeline configurations and 15 attack cases that a single backdoor enables an attacker to bypass the entire function of a system. Finally, we provide stakeholders with several best practices and countermeasures.

[92] Perception-Oriented Latent Coding for High-Performance Compressed Domain Semantic Inference

Xu Zhang,Ming Lu,Yan Chen,Zhan Ma

Main category: cs.CV

TL;DR: This paper proposes POLC, a new image coding method that improves performance on vision tasks while reducing the fine-tuning effort required.

Details Motivation: To address the limited semantic richness of latent spaces produced by MSE-oriented optimization, and to avoid computationally intensive fine-tuning of the entire vision model. Method: Perception-Oriented Latent Coding (POLC) is introduced, which enriches the semantic content of latent features to enable high-performance compressed domain semantic inference. Result: Experiments show that POLC achieves rate-perception performance comparable to state-of-the-art generative image coding methods while markedly improving performance on vision tasks. Conclusion: POLC is a new perception-oriented latent coding method that markedly improves vision-task performance while requiring only minimal fine-tuning. Abstract: In recent years, compressed domain semantic inference has primarily relied on learned image coding models optimized for mean squared error (MSE). However, MSE-oriented optimization tends to yield latent spaces with limited semantic richness, which hinders effective semantic inference in downstream tasks. Moreover, achieving high performance with these models often requires fine-tuning the entire vision model, which is computationally intensive, especially for large models. To address these problems, we introduce Perception-Oriented Latent Coding (POLC), an approach that enriches the semantic content of latent features for high-performance compressed domain semantic inference. With the semantically rich latent space, POLC requires only a plug-and-play adapter for fine-tuning, significantly reducing the parameter count compared to previous MSE-oriented methods. Experimental results demonstrate that POLC achieves rate-perception performance comparable to state-of-the-art generative image coding methods while markedly enhancing performance in vision tasks, with minimal fine-tuning overhead. Code is available at https://github.com/NJUVISION/POLC.

[93] Prompt Guidance and Human Proximal Perception for HOT Prediction with Regional Joint Loss

Yuxiao Wang,Yu Lei,Zhenao Wei,Weiying Xue,Xinyu Jiang,Nan Zhuang,Qi Liu

Main category: cs.CV

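TL;DR: This paper proposes P3HOT, a human-object contact detection framework that combines prompt guidance and human proximal perception with a new loss function (RJLoss) and a new evaluation metric (AD-Acc.), significantly improving detection performance.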

Details Motivation: Existing HOT detection models are restricted to a single image type, struggle to maintain category consistency within regions, and over-segment areas with little interaction. Method: The P3HOT framework combines prompt guidance with human proximal perception, and a new loss function and evaluation metric are designed. Result: The method achieves state-of-the-art performance on all four metrics, with SC-Acc., mIoU, wIoU, and AD-Acc. improving by 0.7, 2.0, 1.6, and 11.0, respectively, on the HOT-Annotated dataset. Conclusion: P3HOT performs strongly on HOT detection and addresses the limitations of existing methods. Abstract: The task of Human-Object conTact (HOT) detection involves identifying the specific areas of the human body that are touching objects. Nevertheless, current models are restricted to just one type of image, often leading to too much segmentation in areas with little interaction, and struggling to maintain category consistency within specific regions. To tackle this issue, a HOT framework, termed P3HOT, is proposed, which blends Prompt guidance and human Proximal Perception. To begin with, we utilize a semantic-driven prompt mechanism to direct the network's attention towards the relevant regions based on the correlation between image and text. Then a human proximal perception mechanism is employed to dynamically perceive key depth range around the human, using learnable parameters to effectively eliminate regions where interactions are not expected. Calculating depth resolves the uncertainty of the overlap between humans and objects in a 2D perspective, providing a quasi-3D viewpoint. Moreover, a Regional Joint Loss (RJLoss) has been created as a new loss to inhibit abnormal categories in the same area. A new evaluation metric called "AD-Acc." is introduced to address the shortcomings of existing methods in addressing negative samples. Comprehensive experimental results demonstrate that our approach achieves state-of-the-art performance in four metrics across two benchmark datasets. Specifically, our model achieves improvements of 0.7, 2.0, 1.6, and 11.0 in the SC-Acc., mIoU, wIoU, and AD-Acc. metrics, respectively, on the HOT-Annotated dataset. 
Code is available at https://github.com/YuxiaoWang-AI/P3HOT.

[94] Tile and Slide : A New Framework for Scaling NeRF from Local to Global 3D Earth Observation

Camille Billouard,Dawa Derksen,Alexandre Constantin,Bruno Vallet

Main category: cs.CV

TL;DR: Snake-NeRF is a scalable framework for 3D reconstruction from large satellite images, achieving efficiency and quality on a single GPU.

Details Motivation: State-of-the-art NeRF methods are limited to small scenes due to memory constraints during training, which motivated the development of a scalable solution like Snake-NeRF. Method: Snake-NeRF uses an out-of-core method, dividing the region of interest into non-overlapping 3D-tiled NeRFs while cropping images with overlap. It employs a $2\times 2$ 3D tile progression strategy and a segmented sampler to prevent reconstruction errors. Result: Snake-NeRF successfully scales NeRFs to large scenes, ensuring quality while reducing memory footprint by operating on a single GPU with linear time complexity. Conclusion: Snake-NeRF enables the processing of large satellite images with linear time complexity on a single GPU without quality compromise. Abstract: Neural Radiance Fields (NeRF) have recently emerged as a paradigm for 3D reconstruction from multiview satellite imagery. However, state-of-the-art NeRF methods are typically constrained to small scenes due to the memory footprint during training, which we study in this paper. Previous work on large-scale NeRFs palliates this by dividing the scene into NeRFs. This paper introduces Snake-NeRF, a framework that scales to large scenes. Our out-of-core method eliminates the need to load all images and networks simultaneously, and operates on a single device. We achieve this by dividing the region of interest into NeRFs that 3D tile without overlap. Importantly, we crop the images with overlap to ensure each NeRF is trained with all the necessary pixels. We introduce a novel $2\times 2$ 3D tile progression strategy and segmented sampler, which together prevent 3D reconstruction errors along the tile edges. Our experiments conclude that large satellite images can effectively be processed with linear time complexity, on a single GPU, and without compromise in quality.
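
The combination of non-overlapping tiles with overlapping image crops can be sketched on a 2D ground-plane grid. The code below is only an illustration under that simplification: the function name, the fixed `margin` parameter, and the pixel-grid coordinates are assumptions, and the $2\times 2$ tile progression and segmented sampler from the abstract are not reproduced.

```python
def tile_with_overlapping_crops(width, height, tile, margin):
    """Split a width x height region into non-overlapping tiles and pair
    each tile with a crop window enlarged by `margin` on every side.

    Returns a list of (tile_bounds, crop_bounds), each as (x0, y0, x1, y1).
    """
    tiles = []
    for y0 in range(0, height, tile):
        for x0 in range(0, width, tile):
            # Tile bounds are clipped to the region so edge tiles may be smaller.
            x1, y1 = min(x0 + tile, width), min(y0 + tile, height)
            # The crop extends past the tile but never past the region.
            crop = (max(x0 - margin, 0), max(y0 - margin, 0),
                    min(x1 + margin, width), min(y1 + margin, height))
            tiles.append(((x0, y0, x1, y1), crop))
    return tiles
```

Since the tiles partition the region, the number of tiles grows linearly with the covered area, and only one tile and its crop need to be held in memory at a time.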

[95] Depth Anything at Any Condition

Boyuan Sun,Modi Jin,Bowen Yin,Qibin Hou

Main category: cs.CV

TL;DR: DepthAnything-AC improves monocular depth estimation in challenging environments using an unsupervised approach and spatial constraints, achieving robust zero-shot performance.

Details Motivation: Previous foundation models for monocular depth estimation perform well in general scenes but struggle in complex open-world environments with conditions like illumination variations, adverse weather, and sensor-induced distortions. Method: The paper proposes an unsupervised consistency regularization finetuning paradigm and introduces the Spatial Distance Constraint to improve performance in challenging conditions. Result: Experimental results demonstrate that DepthAnything-AC achieves strong performance across real-world adverse weather benchmarks, synthetic corruption benchmarks, and general benchmarks without requiring large labeled datasets. Conclusion: DepthAnything-AC is a robust monocular depth estimation model capable of handling diverse environmental conditions, showing zero-shot capabilities across various benchmarks. Abstract: We present Depth Anything at Any Condition (DepthAnything-AC), a foundation monocular depth estimation (MDE) model capable of handling diverse environmental conditions. Previous foundation MDE models achieve impressive performance across general scenes but do not perform well in complex open-world environments that involve challenging conditions, such as illumination variations, adverse weather, and sensor-induced distortions. To overcome the challenges of data scarcity and the inability to generate high-quality pseudo-labels from corrupted images, we propose an unsupervised consistency regularization finetuning paradigm that requires only a relatively small amount of unlabeled data. Furthermore, we propose the Spatial Distance Constraint to explicitly enforce the model to learn patch-level relative relationships, resulting in clearer semantic boundaries and more accurate details. Experimental results demonstrate the zero-shot capabilities of DepthAnything-AC across diverse benchmarks, including real-world adverse weather benchmarks, synthetic corruption benchmarks, and general benchmarks. 
Project Page: https://ghost233lism.github.io/depthanything-AC-page Code: https://github.com/HVision-NKU/DepthAnythingAC

[96] SAILViT: Towards Robust and Generalizable Visual Backbones for MLLMs via Gradual Feature Refinement

Weijie Yin,Dingkang Yang,Hongyuan Dong,Zijian Kang,Jiacong Wang,Xiao Liang,Chao Feng,Jiao Ran

Main category: cs.CV

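TL;DR: This study proposes SAILViT, a new method that addresses the challenges conventional vision transformers face when co-trained with large language models, and demonstrates strong performance on multimodal tasks.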

Details Motivation: Conventional ViTs suffer from parameter initialization conflicts and modality semantic gaps during connector-based co-training with LLMs, so a new approach is needed to break through the performance bottlenecks of MLLMs in complex multimodal interactions. Method: The paper proposes SAILViT, a gradual feature learning-enhanced ViT that achieves coarse-to-fine-grained feature alignment and world knowledge infusion through gradual feature refinement. Result: Thorough empirical analyses confirm the robustness and generalizability of SAILViT across parameter sizes, model architectures, training strategies, and data scales. Conclusion: SAILViT series models show strong robustness and generalizability across multiple dimensions and significantly improve the performance of existing MLLMs on the OpenCompass benchmark. Abstract: Vision Transformers (ViTs) are essential as foundation backbones in establishing the visual comprehension capabilities of Multimodal Large Language Models (MLLMs). Although most ViTs achieve impressive performance through image-text pair-based contrastive learning or self-supervised mechanisms, they struggle to engage in connector-based co-training directly with LLMs due to potential parameter initialization conflicts and modality semantic gaps. To address the above challenges, this paper proposes SAILViT, a gradual feature learning-enhanced ViT for facilitating MLLMs to break through performance bottlenecks in complex multimodal interactions. SAILViT achieves coarse-to-fine-grained feature alignment and world knowledge infusion with gradual feature refinement, which better serves target training demands. We perform thorough empirical analyses to confirm the powerful robustness and generalizability of SAILViT across different dimensions, including parameter sizes, model architectures, training strategies, and data scales. Equipped with SAILViT, existing MLLMs show significant and consistent performance improvements on the OpenCompass benchmark across extensive downstream tasks. SAILViT series models are released at https://huggingface.co/BytedanceDouyinContent.

[97] Autoregressive Image Generation with Linear Complexity: A Spatial-Aware Decay Perspective

Yuxin Mao,Zhen Qin,Jinxing Zhou,Hui Deng,Xuyang Shen,Bin Fan,Jing Zhang,Yiran Zhong,Yuchao Dai

Main category: cs.CV

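TL;DR: This paper proposes LASAD, a new linear attention mechanism, and the corresponding image generation model LASADGen, which reduce computational cost while maintaining image generation quality.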

Details Motivation: Existing Transformer-based autoregressive image generation models are limited by high computational complexity and large memory overhead, while linear attention mechanisms significantly degrade quality in image generation because they struggle to capture long-range dependencies in visual data. Method: Linear Attention with Spatial-Aware Decay (LASAD) is proposed, and the autoregressive image generator LASADGen is built on it, preserving 2D spatial relationships through position-dependent decay factors. Result: Experiments show that LASADGen achieves high-quality image generation while keeping linear computational complexity, reaching state-of-the-art results on ImageNet. Conclusion: LASADGen combines the efficiency of linear attention with the spatial understanding needed for long-range dependencies in visual data, achieving state-of-the-art image generation performance and computational efficiency on ImageNet. Abstract: Autoregressive (AR) models have garnered significant attention in image generation for their ability to effectively capture both local and global structures within visual data. However, prevalent AR models predominantly rely on the transformer architectures, which are beset by quadratic computational complexity concerning input sequence length and substantial memory overhead due to the necessity of maintaining key-value caches. Although linear attention mechanisms have successfully reduced this burden in language models, our initial experiments reveal that they significantly degrade image generation quality because of their inability to capture critical long-range dependencies in visual data. We propose Linear Attention with Spatial-Aware Decay (LASAD), a novel attention mechanism that explicitly preserves genuine 2D spatial relationships within the flattened image sequences by computing position-dependent decay factors based on true 2D spatial location rather than 1D sequence positions. Based on this mechanism, we present LASADGen, an autoregressive image generator that enables selective attention to relevant spatial contexts with linear complexity. Experiments on ImageNet show LASADGen achieves state-of-the-art image generation performance and computational efficiency, bridging the gap between linear attention's efficiency and spatial understanding needed for high-quality generation.
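
The difference between decaying over 1D sequence positions and over true 2D locations can be shown with a small comparison. The abstract does not give the exact form of the decay, so the exponential decay `gamma ** distance` and the Manhattan distance used below are assumptions for illustration, as are all function names. The sketch computes decay factors only and is not a linear attention implementation.

```python
def to_2d(index, width):
    """Map a flattened raster-scan index to (row, col) on a grid of given width."""
    return divmod(index, width)


def decay_1d(i, j, gamma):
    """Decay based on the distance between positions in the flattened sequence."""
    return gamma ** abs(i - j)


def decay_2d(i, j, width, gamma):
    """Decay based on the Manhattan distance between the two grid locations."""
    (ri, ci), (rj, cj) = to_2d(i, width), to_2d(j, width)
    return gamma ** (abs(ri - rj) + abs(ci - cj))
```

On a 16-wide grid, token 16 sits directly below token 0. A 1D decay treats the two as 16 steps apart, while the 2D variant treats them as immediate neighbours, so vertically adjacent tokens keep a much larger weight.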

[98] RobuSTereo: Robust Zero-Shot Stereo Matching under Adverse Weather

Yuran Wang,Yingping Liang,Yutao Hu,Ying Fu

Main category: cs.CV

TL;DR: RobuSTereo improves stereo matching in adverse weather by generating synthetic training data with a diffusion model and designing a robust feature encoder using ConvNet and denoising transformer.

Details Motivation: Learning-based stereo matching models struggle in adverse weather due to limited training data and difficulties extracting reliable features from degraded images. This hinders their ability to generalize to unseen weather conditions. Method: The paper proposes a diffusion-based simulation pipeline with a stereo consistency module to generate synthetic adverse-weather stereo data, and a robust feature encoder combining ConvNet and denoising transformer for improved feature extraction from degraded images. Result: Extensive experiments show that RobuSTereo significantly improves stereo matching performance across various adverse weather scenarios, reducing domain gaps and enhancing depth estimation accuracy. Conclusion: The paper concludes that RobuSTereo significantly enhances the zero-shot generalization and robustness of stereo matching models in adverse weather conditions by addressing data scarcity and feature extraction challenges. Abstract: Learning-based stereo matching models struggle in adverse weather conditions due to the scarcity of corresponding training data and the challenges in extracting discriminative features from degraded images. These limitations significantly hinder zero-shot generalization to out-of-distribution weather conditions. In this paper, we propose RobuSTereo, a novel framework that enhances the zero-shot generalization of stereo matching models under adverse weather by addressing both data scarcity and feature extraction challenges. First, we introduce a diffusion-based simulation pipeline with a stereo consistency module, which generates high-quality stereo data tailored for adverse conditions. By training stereo matching models on our synthetic datasets, we reduce the domain gap between clean and degraded images, significantly improving the models' robustness to unseen weather conditions. 
The stereo consistency module ensures structural alignment across synthesized image pairs, preserving geometric integrity and enhancing depth estimation accuracy. Second, we design a robust feature encoder that combines a specialized ConvNet with a denoising transformer to extract stable and reliable features from degraded images. The ConvNet captures fine-grained local structures, while the denoising transformer refines global representations, effectively mitigating the impact of noise, low visibility, and weather-induced distortions. This enables more accurate disparity estimation even under challenging visual conditions. Extensive experiments demonstrate that RobuSTereo significantly improves the robustness and generalization of stereo matching models across diverse adverse weather scenarios.

[99] SPoT: Subpixel Placement of Tokens in Vision Transformers

Martine Hjelkrem-Tan,Marius Aasan,Gabriel Y. Arteaga,Adín Ramírez Rivera

Main category: cs.CV

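TL;DR: This paper proposes SPoT, a novel tokenization strategy that lets ViTs sidestep grid-based limitations, and uses oracle-guided search to achieve strong predictive performance with fewer tokens.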

Details Motivation: Standard patch-based tokenization confines features to discrete patch grids, which prevents models from fully exploiting sparse regimes and forces awkward compromises. Method: A new tokenization strategy, Subpixel Placement of Tokens (SPoT), is proposed, and an oracle-guided search is used to uncover the substantial performance gains achievable with ideal subpixel token positioning. Result: SPoT drastically reduces the number of tokens needed for accurate predictions during inference. Conclusion: SPoT offers a new direction for flexible, efficient, and interpretable ViT architectures, redefining sparsity as a strategic advantage. Abstract: Vision Transformers naturally accommodate sparsity, yet standard tokenization methods confine features to discrete patch grids. This constraint prevents models from fully exploiting sparse regimes, forcing awkward compromises. We propose Subpixel Placement of Tokens (SPoT), a novel tokenization strategy that positions tokens continuously within images, effectively sidestepping grid-based limitations. With our proposed oracle-guided search, we uncover substantial performance gains achievable with ideal subpixel token positioning, drastically reducing the number of tokens necessary for accurate predictions during inference. SPoT provides a new direction for flexible, efficient, and interpretable ViT architectures, redefining sparsity as a strategic advantage rather than an imposed limitation.

[100] What does really matter in image goal navigation?

Gianluca Monaci,Philippe Weinzaepfel,Christian Wolf

Main category: cs.CV

TL;DR: This paper explores whether image goal navigation can be effectively addressed through end-to-end reinforcement learning, showing that while simulation settings affect outcomes, skills learned can transfer to real-world scenarios and are linked to improved pose estimation.

Details Motivation: The authors want to determine if image goal navigation can be efficiently solved with end-to-end training of full agents with RL, which could have implications beyond Embodied AI by enabling the training of relative pose estimation from reward for navigation alone. Method: The authors conducted a large study investigating the effect of architectural choices such as late fusion, channel stacking, space-to-depth projections, and cross-attention on the emergence of relative pose estimators from navigation training using end-to-end training of full agents with RL. Result: The results show that simulator settings influence the success of recent methods, creating shortcuts in simulation, but these capabilities can be partially transferred to more realistic settings. The study also found correlations between navigation performance and emerging relative pose estimation performance. Conclusion: The paper concludes that while recent methods for image goal navigation are influenced by simulator settings, they can still be transferred to more realistic settings up to some extent, and there is a correlation between navigation performance and emerging relative pose estimation performance. Abstract: Image goal navigation requires two different skills: firstly, core navigation skills, including the detection of free space and obstacles, and taking decisions based on an internal representation; and secondly, computing directional information by comparing visual observations to the goal image. Current state-of-the-art methods either rely on dedicated image-matching, or pre-training of computer vision modules on relative pose estimation. In this paper, we study whether this task can be efficiently solved with end-to-end training of full agents with RL, as has been claimed by recent work. A positive answer would have impact beyond Embodied AI and allow training of relative pose estimation from reward for navigation alone. 
In a large study we investigate the effect of architectural choices like late fusion, channel stacking, space-to-depth projections and cross-attention, and their role in the emergence of relative pose estimators from navigation training. We show that the success of recent methods is influenced to a certain extent by simulator settings, leading to shortcuts in simulation. However, we also show that these capabilities can be transferred to more realistic settings, to some extent. We also find evidence for correlations between navigation performance and probed (emerging) relative pose estimation performance, an important sub-skill.

[101] Facial Emotion Learning with Text-Guided Multiview Fusion via Vision-Language Model for 3D/4D Facial Expression Recognition

Muzammil Behzad

Main category: cs.CV

TL;DR: This paper proposes FACET-VLM, a vision-language framework for 3D/4D facial expression recognition that achieves state-of-the-art results across multiple benchmarks by integrating multiview facial representation learning with semantic guidance from natural language prompts.

Details Motivation: Facial expression recognition (FER) in 3D and 4D domains is challenging due to the complexity of spatial and temporal facial dynamics. Its success is crucial for advancing applications in human behavior understanding, healthcare monitoring, and human-computer interaction. Method: The proposed FACET-VLM framework integrates multiview facial representation learning with semantic guidance from natural language prompts. It includes Cross-View Semantic Aggregation (CVSA), Multiview Text-Guided Fusion (MTGF), and a multiview consistency loss. Result: The model achieves state-of-the-art accuracy across multiple benchmarks including BU-3DFE, Bosphorus, BU-4DFE, and BP4D-Spontaneous. It also demonstrates strong performance in 4D micro-expression recognition (MER) on the 4DME dataset. Conclusion: FACET-VLM offers a robust, extensible, and high-performing solution for multimodal FER in both posed and spontaneous settings. Abstract: Facial expression recognition (FER) in 3D and 4D domains presents a significant challenge in affective computing due to the complexity of spatial and temporal facial dynamics. Its success is crucial for advancing applications in human behavior understanding, healthcare monitoring, and human-computer interaction. In this work, we propose FACET-VLM, a vision-language framework for 3D/4D FER that integrates multiview facial representation learning with semantic guidance from natural language prompts. FACET-VLM introduces three key components: Cross-View Semantic Aggregation (CVSA) for view-consistent fusion, Multiview Text-Guided Fusion (MTGF) for semantically aligned facial emotions, and a multiview consistency loss to enforce structural coherence across views. Our model achieves state-of-the-art accuracy across multiple benchmarks, including BU-3DFE, Bosphorus, BU-4DFE, and BP4D-Spontaneous. 
We further extend FACET-VLM to 4D micro-expression recognition (MER) on the 4DME dataset, demonstrating strong performance in capturing subtle, short-lived emotional cues. The extensive experimental results confirm the effectiveness and substantial contributions of each individual component within the framework. Overall, FACET-VLM offers a robust, extensible, and high-performing solution for multimodal FER in both posed and spontaneous settings.

[102] Component Adaptive Clustering for Generalized Category Discovery

Mingfu Yan,Jiancheng Huang,Yifan Liu,Shifeng Chen

Main category: cs.CV

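TL;DR: This paper proposes AdaGCD, a new framework for generalized category discovery that incorporates an adaptive slot attention mechanism to dynamically adjust the number of clusters, enabling more effective categorization of unlabeled images.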

Details Motivation: Existing methods often rely on rigid assumptions such as a predefined number of classes, which limits their ability to handle the inherent variability and complexity of real-world data. Method: A cluster-centric contrastive learning framework, AdaGCD, is proposed, incorporating Adaptive Slot Attention (AdaSlot), which dynamically determines the optimal number of slots based on data complexity. Result: Extensive experiments on public and fine-grained datasets validate the framework's effectiveness and highlight the advantages of leveraging spatial local information for category discovery in unlabeled image datasets. Conclusion: By integrating adaptive representation with dynamic slot allocation, AdaGCD captures both instance-specific and spatially clustered features, improving category discovery in open-world scenarios. Abstract: Generalized Category Discovery (GCD) tackles the challenging problem of categorizing unlabeled images into both known and novel classes within a partially labeled dataset, without prior knowledge of the number of unknown categories. Traditional methods often rely on rigid assumptions, such as predefining the number of classes, which limits their ability to handle the inherent variability and complexity of real-world data. To address these shortcomings, we propose AdaGCD, a cluster-centric contrastive learning framework that incorporates Adaptive Slot Attention (AdaSlot) into the GCD framework. AdaSlot dynamically determines the optimal number of slots based on data complexity, removing the need for predefined slot counts. This adaptive mechanism facilitates the flexible clustering of unlabeled data into known and novel categories by dynamically allocating representational capacity. By integrating adaptive representation with dynamic slot allocation, our method captures both instance-specific and spatially clustered features, improving class discovery in open-world scenarios. Extensive experiments on public and fine-grained datasets validate the effectiveness of our framework, emphasizing the advantages of leveraging spatial local information for category discovery in unlabeled image datasets.

[103] Using Wavelet Domain Fingerprints to Improve Source Camera Identification

Xinle Tian,Matthew Nunes,Emiko Dupont,Shaunagh Downing,Freddie Lichtenstein,Matt Burns

Main category: cs.CV

TL;DR: This paper introduces a wavelet domain fingerprint approach for camera fingerprint detection, improving accuracy and processing speed by eliminating the need for image reconstruction.

Details Motivation: To streamline the SPN extraction and comparison process in camera fingerprint detection for improved efficiency and accuracy. Method: Modification of wavelet-based SPN extraction by constructing a wavelet domain fingerprint, avoiding the final inversion step of the denoising algorithm. Result: Experimental results show enhanced performance in both detection accuracy and processing speed on real-world datasets. Conclusion: The proposed wavelet domain fingerprint method achieves higher detection accuracy and significantly improves processing speed compared to traditional approaches. Abstract: Camera fingerprint detection plays a crucial role in source identification and image forensics, with wavelet denoising approaches proving to be particularly effective in extracting sensor pattern noise (SPN). In this article, we propose a modification to wavelet-based SPN extraction. Rather than constructing the fingerprint as an image, we introduce the notion of a wavelet domain fingerprint. This avoids the final inversion step of the denoising algorithm and allows fingerprint comparisons to be made directly in the wavelet domain. As such, our modification streamlines the extraction and comparison process. Experimental results on real-world datasets demonstrate that our method not only achieves higher detection accuracy but can also significantly improve processing speed.
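
Comparing fingerprints directly on wavelet coefficients, without inverting back to an image, can be sketched on a 1D signal. The code below is a toy illustration under several assumptions not taken from the abstract: a one-level Haar transform, soft thresholding as the denoiser, and normalized correlation as the similarity measure. All function names are hypothetical.

```python
import math


def haar_detail(signal):
    """One-level Haar detail coefficients of an even-length 1D signal."""
    return [(signal[k] - signal[k + 1]) / math.sqrt(2)
            for k in range(0, len(signal) - 1, 2)]


def noise_residual(coeffs, threshold):
    """Each coefficient minus its soft-thresholded (denoised) value.

    The residual stays in the wavelet domain, so no inverse transform is needed.
    """
    residual = []
    for c in coeffs:
        shrunk = math.copysign(max(abs(c) - threshold, 0.0), c)
        residual.append(c - shrunk)
    return residual


def normalized_correlation(a, b):
    """Pearson-style correlation between two equal-length coefficient lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0
```

In this toy setting, a reference fingerprint would be an average of `noise_residual` outputs over several signals from one source, and a query residual would be scored against it with `normalized_correlation`.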

[104] Soft Self-labeling and Potts Relaxations for Weakly-Supervised Segmentation

Zhongwen Zhang,Yuri Boykov

Main category: cs.CV

TL;DR: This paper introduces soft self-labeling for weakly supervised segmentation, improving performance while handling class uncertainty and outperforming complex methods.

Details Motivation: Hard pseudo-labels cannot represent class uncertainty or errors in weakly supervised segmentation, motivating the need for soft self-labeling. Method: A principled auxiliary loss is derived for soft self-labeling, evaluating CRF relaxations and connecting network predictions with soft pseudo-labels. A continuous sub-problem solver is also proposed. Result: Soft self-labeling consistently enhances performance in scribble-based training using standard architectures, without requiring specialized systems. Conclusion: Soft self-labeling improves scribble-based training and outperforms more complex WSSS systems, even surpassing full pixel-precise supervision. Abstract: We consider weakly supervised segmentation where only a fraction of pixels have ground truth labels (scribbles) and focus on a self-labeling approach optimizing relaxations of the standard unsupervised CRF/Potts loss on unlabeled pixels. While WSSS methods can directly optimize such losses via gradient descent, prior work suggests that higher-order optimization can improve network training by introducing hidden pseudo-labels and powerful CRF sub-problem solvers, e.g. graph cut. However, previously used hard pseudo-labels cannot represent class uncertainty or errors, which motivates soft self-labeling. We derive a principled auxiliary loss and systematically evaluate standard and new CRF relaxations (convex and non-convex), neighborhood systems, and terms connecting network predictions with soft pseudo-labels. We also propose a general continuous sub-problem solver. Using only standard architectures, soft self-labeling consistently improves scribble-based training and outperforms significantly more complex specialized WSSS systems. It can outperform full pixel-precise supervision. Our general ideas apply to other weakly-supervised problems/systems.
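
A relaxation of the Potts loss extends the pairwise penalty from hard labels to soft label distributions. The sketch below shows two common choices, a bilinear (non-convex) and a quadratic (convex) relaxation; both reduce to the Potts penalty when the labels are one-hot. The abstract does not say which relaxations the paper evaluates, so these two, and all function names, are assumptions for illustration.

```python
def potts_bilinear(p, q):
    """Bilinear relaxation: 1 - <p, q> for two class-probability vectors."""
    return 1.0 - sum(a * b for a, b in zip(p, q))


def potts_quadratic(p, q):
    """Quadratic relaxation: 0.5 * ||p - q||^2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(p, q))


def pairwise_loss(labels, edges, relaxation):
    """Weighted sum of a relaxation over neighbouring pixel pairs.

    `labels` maps a pixel index to its soft label; `edges` holds (i, j, weight).
    """
    return sum(w * relaxation(labels[i], labels[j]) for i, j, w in edges)
```

The two relaxations differ on uncertain labels. For two neighbours that both hold the uniform distribution `[0.5, 0.5]`, the quadratic form gives 0 while the bilinear form gives 0.5, so the bilinear form penalizes agreeing but uncertain neighbours and the quadratic form does not.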

[105] When Does Pruning Benefit Vision Representations?

Enrico Cassano,Riccardo Renzulli,Andrea Bragagnolo,Marco Grangetto

Main category: cs.CV

TL;DR: This paper explores how pruning affects vision models in terms of interpretability, object discovery, and alignment with human perception, identifying optimal sparsity levels for improved performance across these dimensions.

Details Motivation: Although pruning is commonly used to reduce the complexity of deep learning models, its effects on interpretability and representation learning are not well understood. This paper aims to investigate these effects systematically. Method: The researchers analyzed different vision network architectures to determine how varying levels of sparsity affect feature attribution interpretability methods. They also explored whether pruning leads to more structured representations and improves unsupervised object discovery while examining if pruning enhances the alignment between model representations and human perception. Result: The findings reveal that sparse models can exhibit higher interpretability, downstream generalization, and human alignment at certain 'sweet spots'. However, these benefits depend significantly on the network's architecture and parameter count. Pruning may promote succinct and structured representations aligned with human perception by removing redundant information while preserving essential features. Conclusion: The study concludes that pruning has a complex interplay with interpretability, unsupervised object discovery, and alignment with human perception in vision models. There are 'sweet spots' where sparse models perform better on these dimensions, but this performance heavily depends on the architecture and size of the network. Abstract: Pruning is widely used to reduce the complexity of deep learning models, but its effects on interpretability and representation learning remain poorly understood. This paper investigates how pruning influences vision models across three key dimensions: (i) interpretability, (ii) unsupervised object discovery, and (iii) alignment with human perception. We first analyze different vision network architectures to examine how varying sparsity levels affect feature attribution interpretability methods. 
Additionally, we explore whether pruning promotes more succinct and structured representations, potentially improving unsupervised object discovery by discarding redundant information while preserving essential features. Finally, we assess whether pruning enhances the alignment between model representations and human perception, investigating whether sparser models focus on more discriminative features similarly to humans. Our findings also reveal the presence of sweet spots, where sparse models exhibit higher interpretability, downstream generalization and human alignment. However, these spots highly depend on the network architectures and their size in terms of trainable parameters. Our results suggest a complex interplay between these three dimensions, highlighting the importance of investigating when and how pruning benefits vision representations.
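As a rough illustration of what "varying sparsity levels" means in practice, the sketch below prunes a flat list of weights by magnitude and sweeps a few sparsity targets. The abstract does not say which pruning criterion the authors use, so magnitude pruning is an assumption here, and the names `magnitude_prune` and `measured_sparsity` are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights.

    Assumption: plain magnitude pruning; the paper's actual criterion
    is not stated in the abstract.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def measured_sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4, -0.2, 0.6]
# Sweep several sparsity targets, as one would when looking for a "sweet spot".
sweep = {s: magnitude_prune(weights, s) for s in (0.0, 0.25, 0.5, 0.75)}
```

In a study like the one summarized above, each pruned model in such a sweep would then be scored for interpretability, object discovery, and human alignment; those evaluation steps are outside the scope of this sketch.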

[106] HOI-Dyn: Learning Interaction Dynamics for Human-Object Motion Diffusion

Lin Wu,Zhixiang Chen,Jianglin Lan

Main category: cs.CV

TL;DR: HOI-Dyn improves 3D human-object interaction generation by modeling their dynamics together with a transformer-based approach and ensures training stability through a novel residual loss.

Details Motivation: Existing methods model human and object motions independently, leading to physically implausible and causally inconsistent interactions. A better approach is needed to capture detailed interaction dynamics. Method: HOI-Dyn uses a transformer-based interaction dynamics model and a residual-based dynamics loss to jointly optimize human and object motions as a driver-responder system. Result: The method achieves superior performance in both qualitative and quantitative evaluations, while also providing a feasible metric for assessing generated interaction quality. Conclusion: The proposed HOI-Dyn framework effectively improves the quality of 3D human-object interaction generation by modeling interaction dynamics and ensures inference efficiency. Abstract: Generating realistic 3D human-object interactions (HOIs) remains a challenging task due to the difficulty of modeling detailed interaction dynamics. Existing methods treat human and object motions independently, resulting in physically implausible and causally inconsistent behaviors. In this work, we present HOI-Dyn, a novel framework that formulates HOI generation as a driver-responder system, where human actions drive object responses. At the core of our method is a lightweight transformer-based interaction dynamics model that explicitly predicts how objects should react to human motion. To further enforce consistency, we introduce a residual-based dynamics loss that mitigates the impact of dynamics prediction errors and prevents misleading optimization signals. The dynamics model is used only during training, preserving inference efficiency. Through extensive qualitative and quantitative experiments, we demonstrate that our approach not only enhances the quality of HOI generation but also establishes a feasible metric for evaluating the quality of generated interactions.

[107] DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy

Ming Dai,Wenxuan Cheng,Jiang-jiang Liu,Sen Yang,Wenxiao Cai,Yanpeng Sun,Wankou Yang

Main category: cs.CV

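TL;DR: This paper proposes DeRIS, a novel framework that decomposes Referring Image Segmentation (RIS) into perception and cognition components in order to systematically analyze the main bottlenecks of existing RIS frameworks.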

Details Motivation: Existing RIS frameworks lack a systematic analysis of their fundamental bottlenecks, so a new framework is needed to analyze them more effectively. Method: The paper proposes a framework that decomposes RIS into two key components, perception and cognition, and introduces a Loopback Synergy mechanism to strengthen the synergy between the two modules. It also analyzes and introduces a simple non-referent sample conversion data augmentation to address the long-tail distribution problem in target existence judgement. Result: The study finds that the main limitation is not perceptual deficiency but the insufficient multi-modal cognitive capacity of current models. The Loopback Synergy mechanism improves both precise segmentation and robust image-text comprehension. Conclusion: DeRIS shows inherent adaptability to both non-referent and multi-referent scenarios without specialized architectural modifications, which enhances its general applicability. Abstract: Referring Image Segmentation (RIS) is a challenging task that aims to segment objects in an image based on natural language expressions. While prior studies have predominantly concentrated on improving vision-language interactions and achieving fine-grained localization, a systematic analysis of the fundamental bottlenecks in existing RIS frameworks remains underexplored. To bridge this gap, we propose DeRIS, a novel framework that decomposes RIS into two key components: perception and cognition. This modular decomposition facilitates a systematic analysis of the primary bottlenecks impeding RIS performance. Our findings reveal that the predominant limitation lies not in perceptual deficiencies, but in the insufficient multi-modal cognitive capacity of current models. To mitigate this, we propose a Loopback Synergy mechanism, which enhances the synergy between the perception and cognition modules, thereby enabling precise segmentation while simultaneously improving robust image-text comprehension. Additionally, we analyze and introduce a simple non-referent sample conversion data augmentation to address the long-tail distribution issue related to target existence judgement in general scenarios. Notably, DeRIS demonstrates inherent adaptability to both non- and multi-referents scenarios without requiring specialized architectural modifications, enhancing its general applicability. The codes and models are available at https://github.com/Dmmm1997/DeRIS.

[108] Calibrated Self-supervised Vision Transformers Improve Intracranial Arterial Calcification Segmentation from Clinical CT Head Scans

Benjamin Jin,Grant Mair,Joanna M. Wardlaw,Maria del C. Valdés Hernández

Main category: cs.CV

TL;DR: This paper demonstrates that self-supervised Vision Transformers (ViTs) trained within a masked autoencoder (MAE) framework significantly improve IAC segmentation and clinical risk assessment compared to traditional supervised methods.

Details Motivation: Vision Transformers are efficient in self-supervised learning via MAE, making them suitable for 3D medical image segmentation without manual annotations. This is particularly important for assessing intracranial arterial calcification (IAC) as a biomarker for neurovascular diseases. Method: ViTs were pre-trained using the MAE framework and fine-tuned for IAC segmentation using data from the IST-3 clinical trial. Model performance was evaluated based on segmentation accuracy and clinical implications. Result: 1) Self-supervised ViT outperformed a supervised nnU-Net baseline by 3.2 Dice points, 2) low patch sizes and convolution-based upsampling improved model performance, and 3) ViTs increased robustness to higher slice thicknesses and improved clinical risk group classification by 46%. Conclusion: ViTs pre-trained with MAE show promising results for IAC segmentation, outperforming supervised baselines and improving clinical risk assessment. Abstract: Vision Transformers (ViTs) have gained significant popularity in the natural image domain but have been less successful in 3D medical image segmentation. Nevertheless, 3D ViTs are particularly interesting for large medical imaging volumes due to their efficient self-supervised training within the masked autoencoder (MAE) framework, which enables the use of imaging data without the need for expensive manual annotations. Intracranial arterial calcification (IAC) is an imaging biomarker visible on routinely acquired CT scans linked to neurovascular diseases such as stroke and dementia, and automated IAC quantification could enable their large-scale risk assessment. We pre-train ViTs with MAE and fine-tune them for IAC segmentation for the first time. To develop our models, we use highly heterogeneous data from a large clinical trial, the third International Stroke Trial (IST-3). We evaluate key aspects of MAE pre-trained ViTs in IAC segmentation, and analyse the clinical implications. 
We show: 1) our calibrated self-supervised ViT beats a strong supervised nnU-Net baseline by 3.2 Dice points, 2) low patch sizes are crucial for ViTs for IAC segmentation and interpolation upsampling with regular convolutions is preferable to transposed convolutions for ViT-based models, and 3) our ViTs increase robustness to higher slice thicknesses and improve risk group classification in a clinical scenario by 46%. Our code is available online.
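For readers unfamiliar with the metric behind the "3.2 Dice points" figure, here is a minimal sketch of the Dice coefficient on binary masks. It assumes that "Dice points" means percentage points of this coefficient, which the abstract does not spell out; the function name `dice_score` and the empty-mask convention are illustrative choices, not taken from the paper.

```python
def dice_score(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks empty: returning 1.0 is a convention assumed here.
        return 1.0
    return 2.0 * intersection / total


pred = [1, 1, 0, 0, 1, 0]    # predicted calcification voxels (toy example)
target = [1, 0, 0, 0, 1, 1]  # reference annotation (toy example)
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
```

Under the stated assumption, a gain of 3.2 Dice points corresponds to an increase of 0.032 on this 0-to-1 scale.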

[109] SSL4SAR: Self-Supervised Learning for Glacier Calving Front Extraction from SAR Imagery

Nora Gourmelon,Marcel Dreier,Martin Mayr,Thorsten Seehaus,Dakota Pyles,Matthias Braun,Andreas Maier,Vincent Christlein

Main category: cs.CV

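TL;DR: This study addresses the shortcomings of current deep learning models for extracting glacier calving front positions by proposing two new self-supervised multimodal pretraining techniques and a hybrid model architecture that combines a Swin Transformer encoder with a residual convolutional neural network decoder.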

Details Motivation: Glaciers are losing ice mass at unprecedented rates, which calls for accurate, year-round monitoring to understand frontal ablation, particularly the factors driving calving. The current state-of-the-art model relies on ImageNet-pretrained weights, which are suboptimal because of the domain shift between the natural images in ImageNet and remote sensing imagery, especially Synthetic Aperture Radar (SAR) imagery. Method: The paper proposes two new self-supervised multimodal pretraining techniques and introduces a hybrid model architecture that combines a Swin Transformer encoder with a residual Convolutional Neural Network (CNN) decoder. Result: When pretrained on SSL4SAR, the model achieves a mean distance error of 293 m on the "CAlving Fronts and where to Find thEm" (CaFFe) benchmark dataset, improving on the previous best model by 67 m. An ensemble of the model, evaluated on a multi-annotator study of the benchmark dataset, reaches a mean distance error of 75 m, approaching the human performance of 38 m. Conclusion: Pretraining on SSL4SAR lets the proposed hybrid model reach a mean distance error of 293 m, 67 m better than the previous best model, and an ensemble reaches 75 m on the multi-annotator study, approaching the human performance of 38 m. This enables precise monitoring of seasonal changes in glacier calving fronts. Abstract: Glaciers are losing ice mass at unprecedented rates, increasing the need for accurate, year-round monitoring to understand frontal ablation, particularly the factors driving the calving process. Deep learning models can extract calving front positions from Synthetic Aperture Radar imagery to track seasonal ice losses at the calving fronts of marine- and lake-terminating glaciers. The current state-of-the-art model relies on ImageNet-pretrained weights. However, they are suboptimal due to the domain shift between the natural images in ImageNet and the specialized characteristics of remote sensing imagery, in particular for Synthetic Aperture Radar imagery. To address this challenge, we propose two novel self-supervised multimodal pretraining techniques that leverage SSL4SAR, a new unlabeled dataset comprising 9,563 Sentinel-1 and 14 Sentinel-2 images of Arctic glaciers, with one optical image per glacier in the dataset. Additionally, we introduce a novel hybrid model architecture that combines a Swin Transformer encoder with a residual Convolutional Neural Network (CNN) decoder. When pretrained on SSL4SAR, this model achieves a mean distance error of 293 m on the "CAlving Fronts and where to Find thEm" (CaFFe) benchmark dataset, outperforming the prior best model by 67 m. 
Evaluating an ensemble of the proposed model on a multi-annotator study of the benchmark dataset reveals a mean distance error of 75 m, approaching the human performance of 38 m. This advancement enables precise monitoring of seasonal changes in glacier calving fronts.
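To make the "mean distance error" figures above more concrete, the sketch below computes one plausible version of the metric: the symmetric average of nearest-point distances between a predicted front and a reference front. The abstract does not define the metric, so this formulation, the function name `mean_distance_error`, and the `metres_per_pixel` parameter are all assumptions for illustration.

```python
import math


def mean_distance_error(pred_front, true_front, metres_per_pixel=1.0):
    """Symmetric mean nearest-point distance between two fronts.

    Each front is a list of (row, col) pixel coordinates. This is an
    assumed formulation, not a definition taken from the abstract.
    """
    if not pred_front or not true_front:
        raise ValueError("both fronts must contain at least one point")

    def nearest(point, others):
        # Euclidean distance from `point` to its closest point in `others`.
        return min(math.dist(point, other) for other in others)

    total = sum(nearest(p, true_front) for p in pred_front)
    total += sum(nearest(q, pred_front) for q in true_front)
    return metres_per_pixel * total / (len(pred_front) + len(true_front))


true_front = [(0, 0), (1, 0), (2, 0)]
pred_front = [(0, 3), (1, 3), (2, 3)]  # shifted by 3 pixels
error_m = mean_distance_error(pred_front, true_front, metres_per_pixel=10.0)
```

In this toy case every point lies 3 pixels from its nearest counterpart, so with a 10 m pixel size the error is 30 m.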

[110] Rethinking Discrete Tokens: Treating Them as Conditions for Continuous Autoregressive Image Synthesis

Peng Zheng,Junke Wang,Yi Chang,Yizhou Yu,Rui Ma,Zuxuan Wu

Main category: cs.CV

TL;DR: This paper proposes DisCon, a novel framework for autoregressive visual generation that improves image fidelity by modeling continuous representations conditioned on discrete tokens.

Details Motivation: Current AR-based visual generation models suffer from information loss due to quantization when encoding images as discrete tokens. Modeling continuous tokens directly is challenging due to high-dimensional, unbounded space and out-of-distribution artifacts. Method: The paper introduces DisCon, which uses discrete tokens as conditional signals to model the conditional probability of continuous representations, avoiding information loss and optimization challenges. Result: DisCon achieves a gFID score of 1.38 on ImageNet 256×256 generation, outperforming state-of-the-art autoregressive approaches. Conclusion: DisCon effectively addresses the limitations of existing autoregressive visual generation models by modeling continuous representations conditioned on discrete tokens, thereby achieving superior performance in image fidelity. Abstract: Recent advances in large language models (LLMs) have spurred interests in encoding images as discrete tokens and leveraging autoregressive (AR) frameworks for visual generation. However, the quantization process in AR-based visual generation models inherently introduces information loss that degrades image fidelity. To mitigate this limitation, recent studies have explored to autoregressively predict continuous tokens. Unlike discrete tokens that reside in a structured and bounded space, continuous representations exist in an unbounded, high-dimensional space, making density estimation more challenging and increasing the risk of generating out-of-distribution artifacts. Based on the above findings, this work introduces DisCon (Discrete-Conditioned Continuous Autoregressive Model), a novel framework that reinterprets discrete tokens as conditional signals rather than generation targets. 
By modeling the conditional probability of continuous representations conditioned on discrete tokens, DisCon circumvents the optimization challenges of continuous token modeling while avoiding the information loss caused by quantization. DisCon achieves a gFID score of 1.38 on ImageNet 256×256 generation, outperforming state-of-the-art autoregressive approaches by a clear margin.

[111] Are Vision Transformer Representations Semantically Meaningful? A Case Study in Medical Imaging

Montasir Shams,Chashi Mahiul Islam,Shaeke Salman,Phat Tran,Xiuwen Liu

Main category: cs.CV

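TL;DR: This paper finds that vision transformer representations in medical image classification lack semantic meaning and are vulnerable to small changes, which poses a major challenge for their use in safety-critical systems.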

Details Motivation: Vision transformers achieve higher accuracy than conventional deep learning models in medical imaging tasks, but their size and complex interactions through the self-attention mechanism make them poorly understood. This study examines whether the representations of these models are semantically meaningful. Method: The paper uses a projected gradient-based algorithm to analyze whether vision transformer representations are semantically meaningful. Result: The study shows that vision transformer representations are not semantically meaningful and are inherently vulnerable to small changes. Images with imperceptible differences can have very different representations, while images that should belong to different semantic classes can have nearly identical representations. This vulnerability can lead to unreliable classification results. Conclusion: The paper concludes that vision transformer (ViT) representations in medical image classification lack semantic meaning and are inherently vulnerable to small changes, which poses a serious challenge for their deployment in safety-critical systems. Abstract: Vision transformers (ViTs) have rapidly gained prominence in medical imaging tasks such as disease classification, segmentation, and detection due to their superior accuracy compared to conventional deep learning models. However, due to their size and complex interactions via the self-attention mechanism, they are not well understood. In particular, it is unclear whether the representations produced by such models are semantically meaningful. In this paper, using a projected gradient-based algorithm, we show that their representations are not semantically meaningful and they are inherently vulnerable to small changes. Images with imperceptible differences can have very different representations; on the other hand, images that should belong to different semantic classes can have nearly identical representations. Such vulnerability can lead to unreliable classification results; for example, unnoticeable changes cause the classification accuracy to be reduced by over 60%. To the best of our knowledge, this is the first work to systematically demonstrate this fundamental lack of semantic meaningfulness in ViT representations for medical image classification, revealing a critical challenge for their deployment in safety-critical systems.

[112] Boosting Adversarial Transferability Against Defenses via Multi-Scale Transformation

Zihong Guo,Chen Wan,Yayin Zheng,Hailing Kuang,Xiaohai Lu

Main category: cs.CV

TL;DR: This paper proposes a Segmented Gaussian Pyramid (SGP) attack method to improve the transferability of adversarial examples, achieving higher attack success rates against black-box defense models.

Details Motivation: The transferability of adversarial examples poses a significant security threat to deep neural networks, particularly when attackers have no prior knowledge of the target model. Method: A new Segmented Gaussian Pyramid (SGP) attack method was proposed that uses Gaussian filtering and three types of downsampling to create multi-scale examples. Gradients of the loss function at each scale are averaged to determine perturbations. Result: Extensive experiments show that SGP significantly enhances attack success rates compared to state-of-the-art methods, with an increase of 2.3% to 32.6% based on transferability alone. Conclusion: The SGP method effectively improves the transferability of adversarial examples, significantly increasing attack success rates against black-box defense models. Abstract: The transferability of adversarial examples poses a significant security challenge for deep neural networks, which can be attacked without knowing anything about them. In this paper, we propose a new Segmented Gaussian Pyramid (SGP) attack method to enhance the transferability, particularly against defense models. Unlike existing methods that generally focus on single-scale images, our approach employs Gaussian filtering and three types of downsampling to construct a series of multi-scale examples. Then, the gradients of the loss function with respect to each scale are computed, and their average is used to determine the adversarial perturbations. The proposed SGP can be considered an input transformation with high extensibility that is easily integrated into most existing adversarial attacks. Extensive experiments demonstrate that in contrast to the state-of-the-art methods, SGP significantly enhances attack success rates against black-box defense models, with average attack success rates increasing by 2.3% to 32.6%, based only on transferability.

[113] FreeLoRA: Enabling Training-Free LoRA Fusion for Autoregressive Multi-Subject Personalization

Peng Zheng,Ye Wang,Rui Ma,Zuxuan Wu

Main category: cs.CV

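TL;DR: FreeLoRA is a novel training-free method that fuses multiple LoRA modules to generate multi-subject personalized content within a single image, while mitigating overfitting and mutual interference between subjects.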

Details Motivation: Existing methods struggle with multi-subject personalization and require complex re-tuning or joint optimization; FreeLoRA aims to provide a simple and generalizable solution. Method: FreeLoRA adapts each module with a Full Token Tuning strategy and, at inference, uses Subject-Aware Inference to activate each module only on its corresponding subject tokens. Result: Experiments show that FreeLoRA performs strongly in both subject fidelity and prompt consistency. Conclusion: FreeLoRA is a training-free framework that effectively fuses multiple subject-specific LoRA modules for multi-subject personalized image generation. Abstract: Subject-driven image generation plays a crucial role in applications such as virtual try-on and poster design. Existing approaches typically fine-tune pretrained generative models or apply LoRA-based adaptations for individual subjects. However, these methods struggle with multi-subject personalization, as combining independently adapted modules often requires complex re-tuning or joint optimization. We present FreeLoRA, a simple and generalizable framework that enables training-free fusion of subject-specific LoRA modules for multi-subject personalization. Each LoRA module is adapted on a few images of a specific subject using a Full Token Tuning strategy, where it is applied across all tokens in the prompt to encourage weakly supervised token-content alignment. At inference, we adopt Subject-Aware Inference, activating each module only on its corresponding subject tokens. This enables training-free fusion of multiple personalized subjects within a single image, while mitigating overfitting and mutual interference between subjects. Extensive experiments show that FreeLoRA achieves strong performance in both subject fidelity and prompt consistency.
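The idea of activating each LoRA module only on its own subject tokens can be illustrated with a toy linear layer. The sketch below is a simplified reading of the abstract, not the authors' implementation: the function names, the per-token subject labels, and the choice to apply the update to a single weight matrix are all assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_delta(A, B, x):
    """Low-rank update B(Ax), where A maps d -> r and B maps r -> d."""
    return matvec(B, matvec(A, x))


def subject_aware_forward(W, adapters, tokens, token_subjects):
    """Apply the base weight to every token; add a subject's LoRA update
    only on tokens labeled with that subject (None means no adapter)."""
    outputs = []
    for x, subject in zip(tokens, token_subjects):
        y = matvec(W, x)
        if subject is not None:
            A, B = adapters[subject]
            y = [yi + di for yi, di in zip(y, lora_delta(A, B, x))]
        outputs.append(y)
    return outputs


W = [[1, 0], [0, 1]]  # toy base weight (identity)
adapters = {
    "dog": ([[1, 0]], [[1], [0]]),  # rank-1 adapter (hypothetical values)
    "cat": ([[0, 1]], [[0], [2]]),
}
tokens = [[1, 1], [1, 1], [1, 1]]
token_subjects = [None, "dog", "cat"]  # which adapter each token activates
outputs = subject_aware_forward(W, adapters, tokens, token_subjects)
```

Because each adapter touches only its own tokens, adding a second subject leaves the first subject's outputs unchanged, which is one way to read the claim about reduced mutual interference.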

[114] HCNQA: Enhancing 3D VQA with Hierarchical Concentration Narrowing Supervision

Shengli Zhou,Jianuo Zhu,Qilin Huang,Fangjing Wang,Yanfu Zhang,Feng Zheng

Main category: cs.CV

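TL;DR: This paper proposes HCNQA, a new 3D Visual Question-Answering (3D VQA) model that uses a hierarchical concentration narrowing supervision method to guide the model toward rational reasoning and away from superficial patterns in question-answer pairs.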

Details Motivation: Existing answer-centric supervision can lead models to develop superficial shortcuts, and the lack of supervision on the reasoning pathway can make model reasoning irrational or inefficient. Method: The paper proposes a hierarchical concentration narrowing supervision method that mimics the human process of gradually focusing from a broad area to specific objects, guiding the model's reasoning through three phases of concentration narrowing. Result: Experimental results show that the method effectively ensures that the model develops a rational reasoning pathway and achieves better performance on 3D VQA tasks. Conclusion: HCNQA effectively ensures that the model develops a rational reasoning pathway and performs better on 3D VQA tasks. Abstract: 3D Visual Question-Answering (3D VQA) is pivotal for models to perceive the physical world and perform spatial reasoning. Answer-centric supervision is a commonly used training method for 3D VQA models. Many models that utilize this strategy have achieved promising results in 3D VQA tasks. However, the answer-centric approach only supervises the final output of models and allows models to develop reasoning pathways freely. The absence of supervision on the reasoning pathway enables the potential for developing superficial shortcuts through common patterns in question-answer pairs. Moreover, although slow-thinking methods advance large language models, they suffer from underthinking. To address these issues, we propose HCNQA, a 3D VQA model leveraging a hierarchical concentration narrowing supervision method. By mimicking the human process of gradually focusing from a broad area to specific objects while searching for answers, our method guides the model to perform three phases of concentration narrowing through hierarchical supervision. By supervising key checkpoints on a general reasoning pathway, our method can ensure the development of a rational and effective reasoning pathway. Extensive experimental results demonstrate that our method can effectively ensure that the model develops a rational reasoning pathway and performs better. The code is available at https://github.com/JianuoZhu/HCNQA.

[115] AMD: Adaptive Momentum and Decoupled Contrastive Learning Framework for Robust Long-Tail Trajectory Prediction

Bin Rao,Haicheng Liao,Yanchen Guan,Chengyue Wang,Bonan Wang,Jiaxun Zhang,Zhenning Li

Main category: cs.CV

TL;DR: This paper proposes an adaptive momentum and decoupled contrastive learning framework (AMD) to enhance trajectory prediction for autonomous driving, particularly addressing rare and complex long-tail trajectory patterns.

Details Motivation: Accurate trajectory prediction is crucial for autonomous driving; however, existing studies neglect the diversity and uncertainty of long-tail trajectory patterns, which represent complex and hazardous scenarios due to inherent imbalance in trajectory distributions. Method: An adaptive momentum and decoupled contrastive learning framework (AMD) was developed, incorporating unsupervised and supervised strategies. It includes an improved momentum contrast learning (MoCo-DT), a decoupled contrastive learning (DCL) module, four types of trajectory random augmentation methods, and an online iterative clustering strategy. Result: AMD showed optimal performance in long-tail trajectory prediction and superior overall prediction accuracy through extensive experiments on nuScenes and ETH/UCY datasets. Conclusion: The proposed AMD framework achieves optimal performance in long-tail trajectory prediction and demonstrates outstanding overall prediction accuracy. Abstract: Accurately predicting the future trajectories of traffic agents is essential in autonomous driving. However, due to the inherent imbalance in trajectory distributions, tail data in natural datasets often represents more complex and hazardous scenarios. Existing studies typically rely solely on a base model's prediction error, without considering the diversity and uncertainty of long-tail trajectory patterns. We propose an adaptive momentum and decoupled contrastive learning framework (AMD), which integrates unsupervised and supervised contrastive learning strategies. By leveraging an improved momentum contrast learning (MoCo-DT) and decoupled contrastive learning (DCL) module, our framework enhances the model's ability to recognize rare and complex trajectories. 
Additionally, we design four types of trajectory random augmentation methods and introduce an online iterative clustering strategy, allowing the model to dynamically update pseudo-labels and better adapt to the distributional shifts in long-tail data. We propose three different criteria to define long-tail trajectories and conduct extensive comparative experiments on the nuScenes and ETH/UCY datasets. The results show that AMD not only achieves optimal performance in long-tail trajectory prediction but also demonstrates outstanding overall prediction accuracy.

[116] Modulate and Reconstruct: Learning Hyperspectral Imaging from Misaligned Smartphone Views

Daniil Reutsky,Daniil Vladimirov,Yasin Mamedov,Georgy Perevozchikov,Nancy Mehta,Egor Ershov,Radu Timofte

Main category: cs.CV

TL;DR: This study introduces a triple-camera smartphone-based hyperspectral reconstruction method with spectral filters, significantly improving spectral estimation accuracy compared to traditional RGB cameras.

Details Motivation: Hyperspectral reconstruction from RGB images is fundamentally ill-posed due to significant spectral information loss. Existing methods are limited by relying on a single RGB image, prompting the need for a multi-image approach to improve accuracy. Method: A novel MI-HSR framework was developed using a triple-camera smartphone system, where two lenses are equipped with spectral filters. The method leverages richer and more diverse spectral observations, supported by the introduction of the Doomer dataset for training and benchmarking. Result: The proposed HSR model demonstrated consistent improvements over existing methods on the newly introduced Doomer dataset, achieving approximately 30% better accuracy in spectral estimation compared to ordinary RGB cameras. Conclusion: Multi-image-to-hyperspectral reconstruction using a triple-camera smartphone system with spectral filters achieves more accurate and practical hyperspectral imaging compared to conventional single-camera setups. Abstract: Hyperspectral reconstruction (HSR) from RGB images is a fundamentally ill-posed problem due to severe spectral information loss. Existing approaches typically rely on a single RGB image, limiting reconstruction accuracy. In this work, we propose a novel multi-image-to-hyperspectral reconstruction (MI-HSR) framework that leverages a triple-camera smartphone system, where two lenses are equipped with carefully selected spectral filters. Our configuration, grounded in theoretical and empirical analysis, enables richer and more diverse spectral observations than conventional single-camera setups. To support this new paradigm, we introduce Doomer, the first dataset for MI-HSR, comprising aligned images from three smartphone cameras and a hyperspectral reference camera across diverse scenes. We show that the proposed HSR model achieves consistent improvements over existing methods on the newly proposed benchmark. 
In a nutshell, our setup yields spectra that are estimated 30% more accurately than with an ordinary RGB camera. Our findings suggest that multi-view spectral filtering with commodity hardware can unlock more accurate and practical hyperspectral imaging solutions.

[117] MobileIE: An Extremely Lightweight and Effective ConvNet for Real-Time Image Enhancement on Mobile Devices

Hailong Yan,Ao Li,Xiangtao Zhang,Zhe Liu,Zenglin Shi,Ce Zhu,Le Zhang

Main category: cs.CV

TL;DR: This paper introduces a highly efficient CNN-based framework for real-time image enhancement on mobile devices, using innovative optimization techniques and achieving a strong balance between speed and performance.

Details Motivation: The motivation is to overcome the computational and memory constraints of deploying deep learning models for image enhancement on mobile devices. Method: The method combines a lightweight CNN with reparameterization, Incremental Weight Optimization, Feature Self-Transform module, Hierarchical Dual-Path Attention mechanism, and Local Variance-Weighted loss. Result: The proposed framework achieves real-time inference at up to 1,100 FPS while maintaining competitive image quality across multiple IE tasks. Conclusion: This paper proposes an efficient CNN framework for real-time image enhancement on mobile devices, achieving high speed and competitive quality. Abstract: Recent advancements in deep neural networks have driven significant progress in image enhancement (IE). However, deploying deep learning models on resource-constrained platforms, such as mobile devices, remains challenging due to high computation and memory demands. To address these challenges and facilitate real-time IE on mobile, we introduce an extremely lightweight Convolutional Neural Network (CNN) framework with around 4K parameters. Our approach integrates reparameterization with an Incremental Weight Optimization strategy to ensure efficiency. Additionally, we enhance performance with a Feature Self-Transform module and a Hierarchical Dual-Path Attention mechanism, optimized with a Local Variance-Weighted loss. With this efficient framework, we are the first to achieve real-time IE inference at up to 1,100 frames per second (FPS) while delivering competitive image quality, achieving the best trade-off between speed and performance across multiple IE tasks. The code will be available at https://github.com/AVC2-UESTC/MobileIE.git.

[118] Future Slot Prediction for Unsupervised Object Discovery in Surgical Video

Guiqiu Liao,Matjaz Jogan,Marcel Hussing,Edward Zhang,Eric Eaton,Daniel A. Hashimoto

Main category: cs.CV

TL;DR: This paper introduces a dynamic temporal slot transformer (DTST) module that enhances unsupervised object-centric learning for structured representations, achieving top performance on surgical video data and enabling practical use in healthcare.

Details Motivation: The motivation stems from the need to effectively parse complex, heterogeneous scenes in real-world healthcare applications like surgery, where current approaches with adaptive slot counts perform poorly on surgical videos. Method: The researchers proposed a dynamic temporal slot transformer (DTST) module, which is trained for both temporal reasoning and predicting optimal future slot initialization, to address the challenges of parsing heterogeneous scenes in surgical videos. Result: The model achieved state-of-the-art performance on multiple surgical databases, demonstrating its effectiveness in applying unsupervised object-centric methods to real-world data. Conclusion: The study concludes that the dynamic temporal slot transformer (DTST) module significantly improves the application of unsupervised object-centric methods on real-world surgical data, making them viable for healthcare applications. Abstract: Object-centric slot attention is an emerging paradigm for unsupervised learning of structured, interpretable object-centric representations (slots). This enables effective reasoning about objects and events at a low computational cost and is thus applicable to critical healthcare applications, such as real-time interpretation of surgical video. The heterogeneous scenes in real-world applications like surgery are, however, difficult to parse into a meaningful set of slots. Current approaches with an adaptive slot count perform well on images, but their performance on surgical videos is low. To address this challenge, we propose a dynamic temporal slot transformer (DTST) module that is trained both for temporal reasoning and for predicting the optimal future slot initialization. The model achieves state-of-the-art performance on multiple surgical databases, demonstrating that unsupervised object-centric methods can be applied to real-world data and become part of the common arsenal in healthcare applications.

[119] Self-Reinforcing Prototype Evolution with Dual-Knowledge Cooperation for Semi-Supervised Lifelong Person Re-Identification

Kunlun Xu,Fan Zhuo,Jiangmeng Li,Xu Zou,Jiahuan Zhou

Main category: cs.CV

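TL;DR: This paper proposes SPRED, a new semi-supervised lifelong person re-identification method that tackles noise in the use of unlabeled data through dynamic prototype-guided pseudo-label generation and a new-old knowledge collaborative purification framework, improving long-term adaptation performance.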

Details Motivation: Existing lifelong person re-identification (LReID) methods rely mainly on fully labeled data streams, but in real-world scenarios annotation resources are limited, so large amounts of unlabeled data coexist with a small number of labeled samples and performance degrades severely. This motivates the semi-supervised lifelong person re-identification (Semi-LReID) problem. Method: The paper introduces learnable identity prototypes that dynamically capture identity distributions and generate high-quality pseudo-labels, combined with a dual-knowledge cooperation scheme that integrates current-model specialization and historical-model generalization to purify noisy pseudo-labels. Result: Experiments show that SPRED achieves state-of-the-art performance on the established Semi-LReID benchmarks. Conclusion: Through its self-reinforcing cyclic design, SPRED progressively mines reliable pseudo-labels during lifelong learning, improving current-stage learning and ensuring positive long-term knowledge propagation. Abstract: Current lifelong person re-identification (LReID) methods predominantly rely on fully labeled data streams. However, in real-world scenarios where annotation resources are limited, a vast amount of unlabeled data coexists with scarce labeled samples, leading to the Semi-Supervised LReID (Semi-LReID) problem where LReID methods suffer severe performance degradation. Existing LReID methods, even when combined with semi-supervised strategies, suffer from limited long-term adaptation performance due to struggling with the noisy knowledge occurring during unlabeled data utilization. In this paper, we pioneer the investigation of Semi-LReID, introducing a novel Self-Reinforcing Prototype Evolution with Dual-Knowledge Cooperation framework (SPRED). Our key innovation lies in establishing a self-reinforcing cycle between dynamic prototype-guided pseudo-label generation and new-old knowledge collaborative purification to enhance the utilization of unlabeled data. Specifically, learnable identity prototypes are introduced to dynamically capture the identity distributions and generate high-quality pseudo-labels. Then, the dual-knowledge cooperation scheme integrates current model specialization and historical model generalization, refining noisy pseudo-labels. Through this cyclic design, reliable pseudo-labels are progressively mined to improve current-stage learning and ensure positive knowledge propagation over long-term learning. Experiments on the established Semi-LReID benchmarks show that our SPRED achieves state-of-the-art performance. Our source code is available at https://github.com/zhoujiahuan1991/ICCV2025-SPRED
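To make the idea of prototype-guided pseudo-labeling with new-old model agreement more concrete, here is a minimal sketch. It is not SPRED's actual algorithm: the cosine-similarity matching, the agreement rule, the `threshold` value, and all function names are assumptions chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_prototype(feature, prototypes):
    """Return (identity, similarity) of the most similar identity prototype."""
    best_id, best_sim = None, float("-inf")
    for identity, proto in prototypes.items():
        sim = cosine(feature, proto)
        if sim > best_sim:
            best_id, best_sim = identity, sim
    return best_id, best_sim


def purified_pseudo_label(feat_new, feat_old, protos_new, protos_old, threshold=0.8):
    """Hypothetical purification rule: keep a pseudo-label only when the
    current model and the historical model pick the same identity and both
    are confident; otherwise return None (sample treated as unreliable)."""
    id_new, sim_new = nearest_prototype(feat_new, protos_new)
    id_old, sim_old = nearest_prototype(feat_old, protos_old)
    if id_new == id_old and min(sim_new, sim_old) >= threshold:
        return id_new
    return None


# Toy prototypes for two identities (illustrative values only).
protos = {"id_a": [1.0, 0.0], "id_b": [0.0, 1.0]}
print(purified_pseudo_label([0.9, 0.1], [1.0, 0.05], protos, protos))  # agreement
print(purified_pseudo_label([0.9, 0.1], [0.1, 0.9], protos, protos))   # disagreement
```

In this toy rule, samples on which the two models disagree are simply dropped; the paper describes a learned, cyclic refinement rather than a fixed threshold.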

[120] Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning

Qingdong He,Xueqin Chen,Chaoyi Wang,Yanjie Pan,Xiaobin Hu,Zhenye Gan,Yabiao Wang,Chengjie Wang,Xiangtai Li,Jiangning Zhang

Main category: cs.CV

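TL;DR: This paper proposes the Reason50K dataset and the ReasonBrain framework for image editing under complex implicit instructions, significantly improving reasoning ability and editing quality.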

Details Motivation: Existing IIE methods struggle with implicit instructions that require deeper reasoning, and datasets and architectures that support such tasks are lacking. Method: The paper proposes the Reason50K dataset and the ReasonBrain framework, which includes an FRCE module and a CME module, to support reasoning over and editing with complex hypothetical instructions. Result: Experiments show that ReasonBrain performs strongly on reasoning scenarios and generalizes zero-shot to conventional IIE tasks. Conclusion: By combining MLLMs with a diffusion model, ReasonBrain addresses image editing under complex implicit instructions and outperforms existing methods on reasoning scenarios. Abstract: Instruction-based image editing (IIE) has advanced rapidly with the success of diffusion models. However, existing efforts primarily focus on simple and explicit instructions to execute editing operations such as adding, deleting, moving, or swapping objects. They struggle to handle more complex implicit hypothetical instructions that require deeper reasoning to infer plausible visual changes and user intent. Additionally, current datasets provide limited support for training and evaluating reasoning-aware editing capabilities. Architecturally, these methods also lack mechanisms for fine-grained detail extraction that support such reasoning. To address these limitations, we propose Reason50K, a large-scale dataset specifically curated for training and evaluating hypothetical instruction reasoning image editing, along with ReasonBrain, a novel framework designed to reason over and execute implicit hypothetical instructions across diverse scenarios. Reason50K includes over 50K samples spanning four key reasoning scenarios: Physical, Temporal, Causal, and Story reasoning. ReasonBrain leverages Multimodal Large Language Models (MLLMs) for editing guidance generation and a diffusion model for image synthesis, incorporating a Fine-grained Reasoning Cue Extraction (FRCE) module to capture detailed visual and textual semantics essential for supporting instruction reasoning. To mitigate the semantic loss, we further introduce a Cross-Modal Enhancer (CME) that enables rich interactions between the fine-grained cues and MLLM-derived features.
Extensive experiments demonstrate that ReasonBrain consistently outperforms state-of-the-art baselines on reasoning scenarios while exhibiting strong zero-shot generalization to conventional IIE tasks. Our dataset and code will be released publicly.

[121] Modality Agnostic, patient-specific digital twins modeling temporally varying digestive motion

Jorge Tapias Gomez,Nishant Nadkarni,Lando S. Bosma,Jue Jiang,Ergys D. Subashi,William P. Segars,James M. Balter,Mert R Sabuncu,Neelam Tyagi,Harini Veeraraghavan

Main category: cs.CV

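TL;DR: This paper presents a semi-automated pipeline for generating patient-specific digital twins that simulate gastrointestinal motion, used to assess the spatial accuracy of deformable image registration (DIR) methods and to validate dose-mapping accuracy.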

Details Motivation: The high mobility of gastrointestinal (GI) organs makes voxel-based spatial accuracy assessment difficult, so a reliable way is needed to test DIR tools in dynamic anatomical regions. Method: Using published GI motion models, 21 motion phases were generated as 4D sequences from static 3D patient scans, and six DIR methods were evaluated using target registration error, Dice similarity coefficient, and the 95th percentile Hausdorff distance. In addition, dose distributions were warped and accumulated for T2-weighted MRI scans from patients treated with MR-guided radiation therapy. Result: The proposed pipeline synthesized digital twins that model realistic GI motion, with mean and maximum motion amplitudes and a mean log Jacobian determinant within 0.8 mm and 0.01, respectively, consistent with existing clinical data. The approach supports detailed quantitative evaluation of DIR performance and rigorous validation of dose-mapping accuracy. Conclusion: The study provides a reliable framework for testing DIR tools in dynamic, anatomically complex regions, enabling fine-grained validation of spatial and dosimetric accuracy. Abstract: Objective: Clinical implementation of deformable image registration (DIR) requires voxel-based spatial accuracy metrics such as manually identified landmarks, which are challenging to implement for highly mobile gastrointestinal (GI) organs. To address this, patient-specific digital twins (DT) modeling temporally varying motion were created to assess the accuracy of DIR methods. Approach: 21 motion phases simulating digestive GI motion as 4D sequences were generated from static 3D patient scans using published analytical GI motion models through a semi-automated pipeline. Eleven datasets were used, including six T2w FSE MRI (T2w MRI), two T1w 4D golden-angle stack-of-stars, and three contrast-enhanced CT scans. The motion amplitudes of the DTs were assessed against real patient stomach motion amplitudes extracted from independent 4D MRI datasets. The generated DTs were then used to assess six different DIR methods using target registration error, Dice similarity coefficient, and the 95th percentile Hausdorff distance using summary metrics and voxel-level granular visualizations. Finally, for a subset of T2w MRI scans from patients treated with MR-guided radiation therapy, dose distributions were warped and accumulated to assess dose warping errors, including evaluations of DIR performance in both low- and high-dose regions for patient-specific error estimation.
Main results: Our proposed pipeline synthesized DTs modeling realistic GI motion, achieving mean and maximum motion amplitudes and a mean log Jacobian determinant within 0.8 mm and 0.01, respectively, similar to published real-patient gastric motion data. It also enables the extraction of detailed quantitative DIR performance metrics and rigorous validation of dose mapping accuracy. Significance: The pipeline enables rigorously testing DIR tools for dynamic, anatomically complex regions enabling granular spatial and dosimetric accuracies.
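Two of the DIR accuracy metrics named in this entry have simple standard definitions, sketched below. This is a generic illustration rather than the authors' evaluation code; representing masks as sets of voxel indices and landmarks as coordinate tuples in millimetres is an assumption made to keep the example dependency-free.

```python
import math


def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks,
    each given as a set of voxel index tuples."""
    if not mask_a and not mask_b:
        return 1.0  # two empty masks are treated as identical
    return 2 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))


def target_registration_error(warped_points, reference_points):
    """Mean Euclidean distance between corresponding landmarks
    after registration (same units as the inputs, e.g. mm)."""
    if len(warped_points) != len(reference_points) or not warped_points:
        raise ValueError("need two non-empty landmark lists of equal length")
    distances = [math.dist(p, q) for p, q in zip(warped_points, reference_points)]
    return sum(distances) / len(distances)


# Toy inputs (illustrative values, not data from the paper).
organ_warped = {(0, 0, 0), (0, 0, 1)}
organ_truth = {(0, 0, 1), (0, 0, 2)}
print(dice(organ_warped, organ_truth))

landmarks_warped = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
landmarks_truth = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(target_registration_error(landmarks_warped, landmarks_truth))
```

With a digital twin, the reference landmarks and masks come from the known simulated deformation, which is what makes voxel-level error maps possible without manual annotation.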

[122] 3D Reconstruction and Information Fusion between Dormant and Canopy Seasons in Commercial Orchards Using Deep Learning and Fast GICP

Ranjan Sapkota,Zhichao Meng,Martin Churuvija,Xiaoqiang Du,Zenghong Ma,Manoj Karkee

Main category: cs.CV

TL;DR: This paper introduces an information fusion framework that integrates multi-seasonal structural data to enhance robotic crop load management in orchards, combining RGB-D imagery, YOLOv9-Seg segmentation, Kinect Fusion for 3D modeling, and Fast GICP for alignment, achieving high precision despite seasonal foliage challenges.

Details Motivation: Dense foliage during the canopy season severely occludes tree structures, limiting the effectiveness of machine vision systems. In contrast, the canopy structure is more visible during the dormant season, which inspired the integration of multi-seasonal data to improve robotic crop load management. Method: The framework combines high-resolution RGB-D imagery from both dormant and canopy periods using YOLOv9-Seg for instance segmentation, Kinect Fusion for 3D reconstruction, and Fast Generalized Iterative Closest Point (Fast GICP) for model alignment. Result: The YOLOv9-Seg model achieved a mean squared error (MSE) of 0.0047 and segmentation mAP@50 scores up to 0.78 for trunks in the dormant season dataset. Kinect Fusion produced accurate reconstructions validated with RMSEs of 5.23 mm for trunk diameter, 4.50 mm for branch diameter, and 13.72 mm for branch spacing. Fast GICP achieved precise cross-seasonal registration with a minimum fitness score of 0.00197. Conclusion: The study concludes that integrating multi-seasonal structural data through an information fusion framework significantly enhances the ability to manage crop load using robotics throughout the growing season, even when visibility is limited by dense foliage. Abstract: In orchard automation, dense foliage during the canopy season severely occludes tree structures, minimizing visibility to various canopy parts such as trunks and branches, which limits the ability of a machine vision system. However, canopy structure is more open and visible during the dormant season when trees are defoliated. In this work, we present an information fusion framework that integrates multi-seasonal structural data to support robotic and automated crop load management during the entire growing season. 
The framework combines high-resolution RGB-D imagery from both dormant and canopy periods using YOLOv9-Seg for instance segmentation, Kinect Fusion for 3D reconstruction, and Fast Generalized Iterative Closest Point (Fast GICP) for model alignment. Segmentation outputs from YOLOv9-Seg were used to extract depth-informed masks, which enabled accurate 3D point cloud reconstruction via Kinect Fusion; these reconstructed models from each season were subsequently aligned using Fast GICP to achieve spatially coherent multi-season fusion. The YOLOv9-Seg model, trained on manually annotated images, achieved a mean squared error (MSE) of 0.0047 and segmentation mAP@50 scores up to 0.78 for trunks in dormant season dataset. Kinect Fusion enabled accurate reconstruction of tree geometry, validated with field measurements resulting in root mean square errors (RMSE) of 5.23 mm for trunk diameter, 4.50 mm for branch diameter, and 13.72 mm for branch spacing. Fast GICP achieved precise cross-seasonal registration with a minimum fitness score of 0.00197, allowing integrated, comprehensive tree structure modeling despite heavy occlusions during the growing season. This fused structural representation enables robotic systems to access otherwise obscured architectural information, improving the precision of pruning, thinning, and other automated orchard operations.
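The reconstruction accuracy above is reported as root mean square error (RMSE) against field measurements. A minimal sketch of that computation follows; the diameters are made-up illustrative values, not data from the paper.

```python
import math


def rmse(estimates, references):
    """Root mean square error between paired estimates and reference values."""
    if len(estimates) != len(references) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(e - r) ** 2 for e, r in zip(estimates, references)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical trunk diameters in mm: values read from a 3D reconstruction
# versus caliper measurements in the field.
reconstructed_mm = [52.0, 61.0, 47.0]
measured_mm = [50.0, 65.0, 47.0]
print(f"trunk diameter RMSE: {rmse(reconstructed_mm, measured_mm):.2f} mm")
```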

[123] IC-Custom: Diverse Image Customization via In-Context Learning

Yaowei Li,Xiaoyu Li,Zhaoyang Zhang,Yuxuan Bian,Gan Liu,Xinyuan Li,Jiale Xu,Wenbo Hu,Yating Liu,Lingen Li,Jing Cai,Yuexian Zou,Yancheng He,Ying Shan

Main category: cs.CV

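TL;DR: This paper proposes IC-Custom, a unified image customization framework that integrates position-aware and position-free customization through in-context learning and performs strongly on real and synthetic data.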

Details Motivation: Current image customization methods typically split the task into separate paradigms and lack a universal framework, which limits their application scenarios. Method: The paper proposes IC-Custom, a unified framework that seamlessly integrates position-aware and position-free image customization through in-context learning, and introduces the ICMA mechanism. Result: IC-Custom achieves approximately 73% higher human preference across identity consistency, harmonicity, and text alignment metrics while training only 0.4% of the original model parameters. Conclusion: IC-Custom supports a variety of industrial applications and significantly outperforms existing methods on multiple evaluation benchmarks. Abstract: Image customization, a crucial technique for industrial media production, aims to generate content that is consistent with reference images. However, current approaches conventionally separate image customization into position-aware and position-free customization paradigms and lack a universal framework for diverse customization, limiting their applications across various scenarios. To overcome these limitations, we propose IC-Custom, a unified framework that seamlessly integrates position-aware and position-free image customization through in-context learning. IC-Custom concatenates reference images with target images to a polyptych, leveraging DiT's multi-modal attention mechanism for fine-grained token-level interactions. We introduce the In-context Multi-Modal Attention (ICMA) mechanism with learnable task-oriented register tokens and boundary-aware positional embeddings to enable the model to correctly handle different task types and distinguish various inputs in polyptych configurations. To bridge the data gap, we carefully curated a high-quality dataset of 12k identity-consistent samples with 8k from real-world sources and 4k from high-quality synthetic data, avoiding the overly glossy and over-saturated synthetic appearance. IC-Custom supports various industrial applications, including try-on, accessory placement, furniture arrangement, and creative IP customization. Extensive evaluations on our proposed ProductBench and the publicly available DreamBench demonstrate that IC-Custom significantly outperforms community workflows, closed-source models, and state-of-the-art open-source approaches.
IC-Custom achieves approximately 73% higher human preference across identity consistency, harmonicity, and text alignment metrics, while training only 0.4% of the original model parameters. Project page: https://liyaowei-stu.github.io/project/IC_Custom

[124] evMLP: An Efficient Event-Driven MLP Architecture for Vision

Zhentan Zheng

Main category: cs.CV

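TL;DR: evMLP incorporates an event-driven local update mechanism that significantly reduces the computational cost of video processing while maintaining model accuracy.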

Details Motivation: Inspired by the development of CNNs, ViTs, and MLPs in vision architectures, the paper explores how to process sequential image data more efficiently by reducing redundant computation. Method: It proposes the evMLP model and an event-driven local update mechanism that processes only the image patches where "events" occur (that is, regions that change between frames). Result: evMLP reaches accuracy comparable to SOTA models on ImageNet classification, and experiments on multiple video datasets confirm that it reduces computational cost while keeping outputs consistent with the non-event-driven baseline. Conclusion: Through its event-driven mechanism, evMLP effectively reduces redundant computation in video processing and offers a new direction for future vision model design. Abstract: Deep neural networks have achieved remarkable results in computer vision tasks. In the early days, Convolutional Neural Networks (CNNs) were the mainstream architecture. In recent years, Vision Transformers (ViTs) have become increasingly popular. In addition, exploring applications of multi-layer perceptrons (MLPs) has provided new perspectives for research into vision model architectures. In this paper, we present evMLP accompanied by a simple event-driven local update mechanism. The proposed evMLP can independently process patches on images or feature maps via MLPs. We define changes between consecutive frames as "events". Under the event-driven local update mechanism, evMLP selectively processes patches where events occur. For sequential image data (e.g., video processing), this approach improves computational performance by avoiding redundant computations. Through ImageNet image classification experiments, evMLP attains accuracy competitive with state-of-the-art models. More significantly, experimental results on multiple video datasets demonstrate that evMLP reduces computational cost via its event-driven local update mechanism while maintaining output consistency with its non-event-driven baseline. The code and trained models are available at https://github.com/i-evi/evMLP.
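The event-driven local update can be illustrated with a small cache that recomputes a per-patch function only for patches that changed since the previous frame. This is a sketch of the general idea under stated assumptions, not the evMLP implementation: the class name, the `tol` change threshold, and the use of a plain Python function in place of an MLP are all hypothetical.

```python
def iter_patches(frame, patch):
    """Yield ((row, col), patch_pixels) for non-overlapping square patches
    of a 2D frame given as a list of rows."""
    for r in range(0, len(frame), patch):
        for c in range(0, len(frame[0]), patch):
            yield (r, c), tuple(tuple(row[c:c + patch]) for row in frame[r:r + patch])


class EventDrivenProcessor:
    """Recompute patch_fn only where an 'event' (inter-frame change) occurs."""

    def __init__(self, patch_fn, patch_size, tol=0.0):
        self.patch_fn = patch_fn      # stand-in for the per-patch network
        self.patch_size = patch_size
        self.tol = tol                # hypothetical change threshold
        self._inputs = {}             # last processed pixels per patch
        self._outputs = {}            # cached outputs per patch

    def _is_event(self, key, patch):
        old = self._inputs.get(key)
        if old is None:
            return True               # first frame: every patch is new
        return any(abs(a - b) > self.tol
                   for row_new, row_old in zip(patch, old)
                   for a, b in zip(row_new, row_old))

    def process(self, frame):
        """Return (outputs for all patches, number of patches recomputed)."""
        recomputed = 0
        for key, patch in iter_patches(frame, self.patch_size):
            if self._is_event(key, patch):
                self._inputs[key] = patch
                self._outputs[key] = self.patch_fn(patch)
                recomputed += 1
        return dict(self._outputs), recomputed


def patch_sum(patch):
    """Toy per-patch function: sum of all pixel values."""
    return sum(sum(row) for row in patch)


frame1 = [[0] * 4 for _ in range(4)]
frame2 = [row[:] for row in frame1]
frame2[0][0] = 5                      # a single pixel changes

proc = EventDrivenProcessor(patch_sum, patch_size=2)
_, n_first = proc.process(frame1)
outputs, n_second = proc.process(frame2)
print(n_first, n_second)
```

Because patches are processed independently, reusing cached outputs for unchanged patches gives the same result as recomputing everything when `tol` is 0, which mirrors the output-consistency claim in the abstract.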

[125] CI-VID: A Coherent Interleaved Text-Video Dataset

Yiming Ju,Jijin Hu,Zhengxiong Luo,Haoge Deng,hanyu Zhao,Li Du,Chengwei Wu,Donglin Hao,Xinlong Wang,Tengfei Pan

Main category: cs.CV

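TL;DR: This study presents the CI-VID dataset, which supports text-and-video-to-video generation and significantly improves model performance on multi-clip video generation.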

Details Motivation: Existing public text-to-video datasets consist mainly of isolated text-video pairs and cannot support the modeling of coherent multi-clip video sequences, so a new dataset is needed to address this limitation. Method: The authors build CI-VID, a new dataset of more than 340,000 samples, each comprising a coherent sequence of video clips with text captions describing the clip content and the transitions between clips, and design a comprehensive multi-dimensional benchmark to validate its effectiveness. Result: Experiments show that models trained on CI-VID exhibit significantly improved accuracy and content consistency when generating video sequences, enabling story-driven content with smooth visual transitions and temporal coherence. Conclusion: CI-VID significantly improves the accuracy and content consistency of video generation models in multi-scene sequential generation, facilitates story-driven content creation, and has practical utility. Abstract: Text-to-video (T2V) generation has recently attracted considerable attention, resulting in the development of numerous high-quality datasets that have propelled progress in this area. However, existing public datasets are primarily composed of isolated text-video (T-V) pairs and thus fail to support the modeling of coherent multi-clip video sequences. To address this limitation, we introduce CI-VID, a dataset that moves beyond isolated text-to-video (T2V) generation toward text-and-video-to-video (TV2V) generation, enabling models to produce coherent, multi-scene video sequences. CI-VID contains over 340,000 samples, each featuring a coherent sequence of video clips with text captions that capture both the individual content of each clip and the transitions between them, enabling visually and textually grounded generation. To further validate the effectiveness of CI-VID, we design a comprehensive, multi-dimensional benchmark incorporating human evaluation, VLM-based assessment, and similarity-based metrics. Experimental results demonstrate that models trained on CI-VID exhibit significant improvements in both accuracy and content consistency when generating video sequences. This facilitates the creation of story-driven content with smooth visual transitions and strong temporal coherence, underscoring the quality and practical utility of the CI-VID dataset. We release the CI-VID dataset and the accompanying code for data construction and evaluation at: https://github.com/ymju-BAAI/CI-VID

[126] LongAnimation: Long Animation Generation with Dynamic Global-Local Memory

Nan Chen,Mengqi Huang,Yihao Meng,Zhendong Mao

Main category: cs.CV

TL;DR: This paper proposes LongAnimation, a new framework for automated animation colorization that achieves both short-term and long-term color consistency by using a dynamic global-local paradigm.

Details Motivation: Long animation colorization involves high labor costs in the real animation industry, and existing studies are limited to short-term colorization while neglecting global information, failing to maintain long-term color consistency. Method: The study proposes a novel framework named LongAnimation, which includes three main components: SketchDiT for capturing hybrid reference features, Dynamic Global-Local Memory (DGLM) for dynamically compressing global historical features, and Color Consistency Reward to refine color consistency. During inference, a color consistency fusion method is introduced to smooth video segment transitions. Result: Extensive experiments on both short-term (14 frames) and long-term (average 500 frames) animations demonstrate the effectiveness of LongAnimation in achieving ideal short-term and long-term color consistency. Conclusion: The study concludes that the proposed LongAnimation framework effectively maintains both short-term and long-term color consistency in open-domain animation colorization tasks, showing significant research value. Abstract: Animation colorization is a crucial part of real animation industry production. Long animation colorization has high labor costs. Therefore, automated long animation colorization based on the video generation model has significant research value. Existing studies are limited to short-term colorization. These studies adopt a local paradigm, fusing overlapping features to achieve smooth transitions between local segments. However, the local paradigm neglects global information, failing to maintain long-term color consistency. In this study, we argue that ideal long-term color consistency can be achieved through a dynamic global-local paradigm, i.e., dynamically extracting global color-consistent features relevant to the current generation. 
Specifically, we propose LongAnimation, a novel framework, which mainly includes a SketchDiT, a Dynamic Global-Local Memory (DGLM), and a Color Consistency Reward. The SketchDiT captures hybrid reference features to support the DGLM module. The DGLM module employs a long video understanding model to dynamically compress global historical features and adaptively fuse them with the current generation features. To refine the color consistency, we introduce a Color Consistency Reward. During inference, we propose a color consistency fusion to smooth the video segment transition. Extensive experiments on both short-term (14 frames) and long-term (average 500 frames) animations show the effectiveness of LongAnimation in maintaining short-term and long-term color consistency for open-domain animation colorization task. The code can be found at https://cn-makers.github.io/long_animation_web/.

[127] Kwai Keye-VL Technical Report

Kwai Keye Team,Biao Yang,Bin Wen,Changyi Liu,Chenglong Chu,Chengru Song,Chongling Rao,Chuan Yi,Da Li,Dunju Zang,Fan Yang,Guorui Zhou,Hao Peng,Haojie Ding,Jiaming Huang,Jiangxia Cao,Jiankang Chen,Jingyun Hua,Jin Ouyang,Kaibing Chen,Kaiyu Jiang,Kaiyu Tang,Kun Gai,Shengnan Zhang,Siyang Mao,Sui Huang,Tianke Zhang,Tingting Gao,Wei Chen,Wei Yuan,Xiangyu Wu,Xiao Hu,Xingyu Lu,Yang Zhou,Yi-Fan Zhang,Yiping Yang,Yulong Chen,Zhenhua Wu,Zhenyu Li,Zhixin Ling,Ziming Li,Dehua Ma,Di Xu,Haixuan Gao,Hang Li,Jiawei Guo,Jing Wang,Lejian Ren,Muhao Wei,Qianqian Wang,Qigen Hu,Shiyao Wang,Tao Yu,Xinchen Luo,Yan Li,Yiming Liang,Yuhang Hu,Zeyi Lu,Zhuoran Yang,Zixing Zhang

Main category: cs.CV

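TL;DR: This paper presents Kwai Keye-VL, a multimodal foundation model designed for short-video understanding that achieves strong performance through large-scale data and an innovative training recipe.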

Details Motivation: To address the shortcomings of current multimodal large language models (MLLMs) in understanding dynamic, information-dense short videos and to meet the demand for short-video understanding in today's digital landscape. Method: The authors build a large-scale, high-quality dataset and use a training recipe that combines four-stage pre-training with two-phase post-training; the second phase uses a five-mode "cold-start" data mixture, followed by reinforcement learning and alignment steps to further improve model behavior. Result: Kwai Keye-VL achieves leading results on video benchmarks while remaining competitive on image tasks, and performs strongly on the newly proposed KC-MMBench benchmark. Conclusion: Kwai Keye-VL achieves state-of-the-art results on public video benchmarks, remains competitive on general image tasks, and shows a significant advantage in real-world short-video scenarios on KC-MMBench. Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce Kwai Keye-VL, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video understanding while maintaining robust general-purpose vision-language abilities. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset exceeding 600 billion tokens with a strong emphasis on video, and an innovative training recipe. This recipe features a four-stage pre-training process for solid vision-language alignment, followed by a meticulous two-phase post-training process. The first post-training stage enhances foundational capabilities like instruction following, while the second phase focuses on stimulating advanced reasoning. In this second phase, a key innovation is our five-mode "cold-start" data mixture, which includes "thinking", "non-thinking", "auto-think", "think with image", and high-quality video data. This mixture teaches the model to decide when and how to reason. Subsequent reinforcement learning (RL) and alignment steps further enhance these reasoning capabilities and correct abnormal model behaviors, such as repetitive outputs. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks (Figure 1).
Furthermore, we develop and release the KC-MMBench, a new benchmark tailored for real-world short-video scenarios, where Keye-VL shows a significant advantage.

[128] FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model

Yukang Cao,Chenyang Si,Jinghao Wang,Ziwei Liu

Main category: cs.CV

TL;DR: FreeMorph is a novel, efficient method for image morphing that accommodates inputs with different semantics or layouts without requiring per-instance training, thereby setting a new state-of-the-art in the field.

Details Motivation: The motivation is to overcome the limitations of existing methods that require finetuning pre-trained diffusion models, which are time-consuming and ineffective when dealing with semantic/layout discrepancies. Method: FreeMorph integrates two key innovations: a guidance-aware spherical interpolation design and a step-oriented variation trend to blend self-attention modules for achieving controlled transitions. Result: FreeMorph outperforms existing methods in efficiency, being 10x ~ 50x faster, while delivering high-fidelity image morphing without per-instance training. Conclusion: FreeMorph successfully addresses the challenges of tuning-free image morphing by introducing innovative techniques that ensure high fidelity and efficient transitions between images with different semantics or layouts. Abstract: We present FreeMorph, the first tuning-free method for image morphing that accommodates inputs with different semantics or layouts. Unlike existing methods that rely on finetuning pre-trained diffusion models and are limited by time constraints and semantic/layout discrepancies, FreeMorph delivers high-fidelity image morphing without requiring per-instance training. Despite their efficiency and potential, tuning-free methods face challenges in maintaining high-quality results due to the non-linear nature of the multi-step denoising process and biases inherited from the pre-trained diffusion model. In this paper, we introduce FreeMorph to address these challenges by integrating two key innovations. 1) We first propose a guidance-aware spherical interpolation design that incorporates explicit guidance from the input images by modifying the self-attention modules, thereby addressing identity loss and ensuring directional transitions throughout the generated sequence. 
2) We further introduce a step-oriented variation trend that blends self-attention modules derived from each input image to achieve controlled and consistent transitions that respect both inputs. Our extensive evaluations demonstrate that FreeMorph outperforms existing methods, being 10x ~ 50x faster and establishing a new state-of-the-art for image morphing.
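For readers unfamiliar with spherical interpolation, the sketch below shows plain spherical linear interpolation (slerp) between two vectors, the standard building block that such morphing methods start from. It is not FreeMorph's guidance-aware variant, which additionally modifies self-attention; the function name and the fallback threshold are assumptions.

```python
import math


def slerp(v0, v1, t):
    """Spherical linear interpolation between two non-zero vectors.
    t = 0 returns v0, t = 1 returns v1."""
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    if norm0 == 0.0 or norm1 == 0.0:
        raise ValueError("slerp is undefined for zero vectors")
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))   # guard against rounding
    omega = math.acos(cos_omega)                 # angle between the vectors
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        # Nearly parallel or antiparallel: fall back to linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(v0, v1)]
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


# Interpolating halfway between two orthogonal unit vectors stays on the unit circle.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print(midpoint)
```

Unlike linear interpolation, slerp keeps the interpolated vector's norm on the arc between the endpoints, which is why it is commonly used for interpolating diffusion latents.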

[129] How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

Rahul Ramachandran,Ali Garjani,Roman Bachmann,Andrei Atanov,Oğuzhan Fatih Kar,Amir Zamir

Main category: cs.CV

TL;DR: This paper evaluates how well large multimodal AI models (like GPT-4o) perform on standard vision tasks, finding they are decent generalists but still fall short of specialized models.

Details Motivation: To understand how well recent multimodal foundation models perform on traditional computer vision tasks compared to specialized models, especially given their primarily image-text training. Method: The paper benchmarks popular multimodal foundation models (e.g., GPT-4o, Gemini 1.5 Pro) on established computer vision tasks (e.g., object detection, image classification) using a prompt chaining approach to make these tasks compatible with the models' text-based outputs and API constraints. Result: Multimodal foundation models lag behind specialist models but are competent generalists. Semantic tasks are handled better than geometric ones, and some models like GPT-4o outperform others in non-reasoning tasks, while reasoning models improve performance in geometric tasks. Conclusion: Multimodal foundation models, while not matching state-of-the-art specialist models, perform reasonably well as generalists on standard vision tasks. They excel at semantic tasks more than geometric ones, and show varying sensitivity to prompt engineering techniques. Abstract: Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants, etc). The main challenges to performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. 
We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining to create a standardized benchmarking framework. We observe that 1) the models are not close to the state-of-the-art specialist models at any task. However, 2) they are respectable generalists; this is remarkable as they are presumably trained on primarily image-text-based tasks. 3) They perform semantic tasks notably better than geometric ones. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks, 6) reasoning models, e.g. o3, show improvements in geometric tasks, and 7) a preliminary analysis of models with native image generation, like the latest GPT-4o, shows they exhibit quirks like hallucinations and spatial misalignments.

[130] Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation

Zhuoyang Zhang,Luke J. Huang,Chengyue Wu,Shang Yang,Kelly Peng,Yao Lu,Song Han

Main category: cs.CV

TL;DR: This paper proposes LPD for faster autoregressive image generation by introducing flexible parallel modeling and locality-aware scheduling, significantly reducing latency without sacrificing quality.

Details Motivation: Traditional autoregressive image generation suffers from high latency due to memory-bound next-patch prediction, and existing methods achieve limited parallelization. Method: LPD introduces two techniques: Flexible Parallelized Autoregressive Modeling for arbitrary generation ordering and learnable position query tokens, and Locality-aware Generation Ordering to optimize group dependencies and contextual support. Result: Generation steps are reduced from 256 to 20 (256×256 resolution) and 1024 to 48 (512×512 resolution), with at least 3.4× lower latency than previous models on ImageNet class-conditional generation. Conclusion: The proposed LPD method significantly accelerates autoregressive image generation while maintaining quality, achieving higher parallelization and reduced latency compared to existing models. Abstract: We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation. Traditional autoregressive image generation relies on next-patch prediction, a memory-bound process that leads to high latency. Existing works have tried to parallelize next-patch prediction by shifting to multi-patch prediction to accelerate the process, but only achieved limited parallelization. To achieve high parallelization while maintaining generation quality, we introduce two key techniques: (1) Flexible Parallelized Autoregressive Modeling, a novel architecture that enables arbitrary generation ordering and degrees of parallelization. It uses learnable position query tokens to guide generation at target positions while ensuring mutual visibility among concurrently generated tokens for consistent parallel decoding. (2) Locality-aware Generation Ordering, a novel schedule that forms groups to minimize intra-group dependencies and maximize contextual support, enhancing generation quality. With these designs, we reduce the generation steps from 256 to 20 (256×256 res.) and 1024 to 48 (512×512 res.) without compromising quality on the ImageNet class-conditional generation, and achieving at least 3.4× lower latency than previous parallelized autoregressive models.
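The step counts quoted above imply a simple figure for how many tokens are decoded per step on average. The arithmetic below assumes, as the "256 to 20" and "1024 to 48" phrasing suggests, that the baseline decodes one token per step; the function name is hypothetical.

```python
def avg_tokens_per_step(num_tokens, num_steps):
    """Average number of tokens generated in parallel per decoding step."""
    return num_tokens / num_steps


# (tokens, steps) pairs taken from the reported step reductions.
for tokens, steps in [(256, 20), (1024, 48)]:
    ratio = avg_tokens_per_step(tokens, steps)
    print(f"{tokens} tokens in {steps} steps: {ratio:.1f} tokens per step on average")
```

Note that a reduction in step count is not the same as the latency improvement: the reported 3.4× figure is measured against earlier parallelized autoregressive models, not against the one-token-per-step baseline.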