cs.CL

[1] Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training

Jianfeng Si,Lin Sun,Zhewen Tan,Xiangzheng Zhang

Main category: cs.CL

TL;DR: This paper proposes a unified co-training framework for Large Language Models that enables fine-grained, post-deployment control of safety behaviors (positive, negative, and rejective) through a single SFT stage, offering superior performance and efficiency compared to existing methods.

Details Motivation: Current methods like SFT and RLHF rely on multi-stage pipelines and lack post-deployment controllability. This work addresses these limitations by proposing a more flexible and efficient solution. Method: A unified co-training framework integrating multiple safety behaviors (positive, negative, and rejective) within a single SFT stage, activated via system-level instructions or magic tokens. Result: The method matches the safety alignment quality of SFT+DPO, with the 8B model surpassing DeepSeek-R1 (671B) in safety performance while significantly reducing training complexity and deployment cost. Conclusion: The proposed co-training framework is a scalable, efficient, and highly controllable solution for LLM content safety, offering superior safety performance with reduced training and deployment costs. Abstract: Current methods for content safety in Large Language Models (LLMs), such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), often rely on multi-stage training pipelines and lack fine-grained, post-deployment controllability. To address these limitations, we propose a unified co-training framework that efficiently integrates multiple safety behaviors: positive (lawful/prosocial), negative (unfiltered/risk-prone) and rejective (refusal-oriented/conservative) within a single SFT stage. Notably, each behavior is dynamically activated via a simple system-level instruction, or magic token, enabling stealthy and efficient behavioral switching at inference time. This flexibility supports diverse deployment scenarios, such as positive for safe user interaction, negative for internal red-teaming, and rejective for context-aware refusals triggered by upstream moderation signals. This co-training strategy induces a distinct Safety Alignment Margin in the output space, characterized by well-separated response distributions corresponding to each safety mode. 
The existence of this margin provides empirical evidence for the model's safety robustness and enables unprecedented fine-grained control. Experiments show that our method matches the safety alignment quality of SFT+DPO, with our 8B model notably surpassing DeepSeek-R1 (671B) in safety performance, while significantly reducing both training complexity and deployment costs. This work presents a scalable, efficient, and highly controllable solution for LLM content safety.

[2] Preliminary Ranking of WMT25 General Machine Translation Systems

Tom Kocmi,Eleftherios Avramidis,Rachel Bawden,Ondřej Bojar,Konstantin Dranch,Anton Dvorkovich,Sergey Dukanov,Natalia Fedorova,Mark Fishel,Markus Freitag,Thamme Gowda,Roman Grundkiewicz,Barry Haddow,Marzena Karpinska,Philipp Koehn,Howard Lakougna,Jessica Lundin,Kenton Murray,Masaaki Nagata,Stefano Perrella,Lorenzo Proietti,Martin Popel,Maja Popović,Parker Riley,Mariya Shmatova,Steinþór Steingrímsson,Lisa Yankovskaya,Vilém Zouhar

Main category: cs.CL

TL;DR: This report presents the preliminary ranking of the WMT25 General Machine Translation Shared Task based on automatic evaluations, which may favor systems with re-ranking techniques. The official ranking will use more reliable human evaluations.

Details Motivation: To share preliminary results with task participants so they can use the information when preparing their system submission papers. Method: Automatic metrics were used to evaluate MT systems for the preliminary ranking, while human evaluation will be used for the official ranking. Result: The preliminary ranking may be biased towards systems using re-ranking techniques, such as Quality Estimation re-ranking or Minimum Bayes Risk decoding. Conclusion: The preliminary ranking of the WMT25 General Machine Translation Shared Task is based on automatic evaluations and may be biased in favor of systems that employ re-ranking techniques. The official ranking will be based on more reliable human evaluations. Abstract: We present the preliminary ranking of the WMT25 General Machine Translation Shared Task, in which MT systems have been evaluated using automatic metrics. As this ranking is based on automatic evaluations, it may be biased in favor of systems that employ re-ranking techniques, such as Quality Estimation re-ranking or Minimum Bayes Risk decoding. The official WMT25 ranking will be based on human evaluation, which is more reliable and will supersede the automatic ranking. The purpose of this report is not to present the final findings of the General MT task, but rather to share preliminary results with task participants, which may be useful when preparing their system submission papers.

[3] Bridging the Culture Gap: A Framework for LLM-Driven Socio-Cultural Localization of Math Word Problems in Low-Resource Languages

Israel Abebe Azime,Tadesse Destaw Belay,Dietrich Klakow,Philipp Slusallek,Anshuman Chhabra

Main category: cs.CL

TL;DR: This paper introduces an LLM-driven framework to create culturally localized math datasets, addressing biases in multilingual mathematical reasoning and improving model robustness across languages.

Details Motivation: The motivation stems from the scarcity of socio-cultural datasets reflecting native entities in low-resource languages, which hampers multilingual and culturally-grounded mathematical reasoning compared to English. Method: The researchers introduced a framework for LLM-driven cultural localization of math word problems, which automatically constructs datasets using native names, organizations, and currencies from existing sources. Result: The framework successfully mitigates English-centric entity bias and enhances robustness when native entities are incorporated across multiple languages. Additionally, it was found that translated benchmarks may misrepresent true multilingual math abilities in appropriate socio-cultural contexts. Conclusion: The study concludes that the proposed LLM-driven framework effectively addresses the lack of culturally-grounded datasets for multilingual mathematical reasoning, reducing English-centric bias and improving model robustness with native entities. Abstract: Large language models (LLMs) have demonstrated significant capabilities in solving mathematical problems expressed in natural language. However, multilingual and culturally-grounded mathematical reasoning in low-resource languages lags behind English due to the scarcity of socio-cultural task datasets that reflect accurate native entities such as person names, organization names, and currencies. Existing multilingual benchmarks are predominantly produced via translation and typically retain English-centric entities, owing to the high cost associated with human annotator-based localization. Moreover, automated localization tools are limited, and hence, truly localized datasets remain scarce. To bridge this gap, we introduce a framework for LLM-driven cultural localization of math word problems that automatically constructs datasets with native names, organizations, and currencies from existing sources.
We find that translated benchmarks can obscure true multilingual math ability under appropriate socio-cultural contexts. Through extensive experiments, we also show that our framework can help mitigate English-centric entity bias and improve robustness when native entities are introduced across various languages.

[4] Improving LLMs for Machine Translation Using Synthetic Preference Data

Dario Vajda,Domen Vreš,Marko Robnik-Šikonja

Main category: cs.CL

TL;DR: This paper demonstrates that fine-tuning an LLM with DPO and a curated dataset improves machine translation performance and reduces errors.

Details Motivation: The motivation was to enhance the machine translation capabilities of a general instruction-tuned large language model using limited and easily produced data resources. Method: The researchers employed Direct Preference Optimization (DPO) training on a programmatically curated dataset. Translations were generated using two LLMs, GaMS-9B-Instruct and EuroLLM-9B-Instruct, and ranked using heuristics and automatic metrics like COMET. Result: The fine-tuned model outperformed both baseline models in COMET scores by approximately 0.04 and 0.02, respectively, and reduced language and formatting errors more consistently. Conclusion: The study concludes that fine-tuning a large language model using DPO and a curated dataset can consistently improve translation quality and reduce language and formatting errors. Abstract: Large language models have emerged as effective machine translation systems. In this paper, we explore how a general instruction-tuned large language model can be improved for machine translation using relatively few easily produced data resources. Using Slovene as a use case, we improve the GaMS-9B-Instruct model using Direct Preference Optimization (DPO) training on a programmatically curated and enhanced subset of a public dataset. As DPO requires pairs of quality-ranked instances, we generated its training dataset by translating English Wikipedia articles using two LLMs, GaMS-9B-Instruct and EuroLLM-9B-Instruct. We ranked the resulting translations based on heuristics coupled with automatic evaluation metrics such as COMET. The evaluation shows that our fine-tuned model outperforms both models involved in the dataset generation. In comparison to the baseline models, the fine-tuned model achieved a COMET score gain of around 0.04 and 0.02, respectively, on translating Wikipedia articles. It also more consistently avoids language and formatting errors.
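
The ranking step described in the abstract above (two systems translate the same source, and a metric decides which output is preferred) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Candidate` class, the `build_preference_pair` helper, the `min_margin` threshold, and the scores are all hypothetical, and the `score` field merely stands in for an automatic metric such as COMET.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    system: str   # which model produced the translation
    text: str     # the translation itself
    score: float  # stand-in for an automatic metric such as COMET (made-up values below)


def build_preference_pair(source, candidates, min_margin=0.02):
    """Return a DPO-style record, or None when the candidates are too close to rank."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    best, worst = ranked[0], ranked[-1]
    # Assumption: skip pairs whose scores differ by less than min_margin.
    if best.score - worst.score < min_margin:
        return None
    return {"prompt": source, "chosen": best.text, "rejected": worst.text}


clear_pair = build_preference_pair(
    "source sentence",
    [Candidate("model_a", "translation A", 0.91),
     Candidate("model_b", "translation B", 0.84)],
)
close_pair = build_preference_pair(
    "source sentence",
    [Candidate("model_a", "translation A", 0.90),
     Candidate("model_b", "translation B", 0.89)],
)
```

Dropping near-ties is one plausible heuristic for keeping only pairs with a clear quality gap; the abstract does not specify which heuristics the authors used.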

[5] Multilingual Datasets for Custom Input Extraction and Explanation Requests Parsing in Conversational XAI Systems

Qianli Wang,Tatiana Anikina,Nils Feldhus,Simon Ostermann,Fedor Splitt,Jiaao Li,Yoana Tsoneva,Sebastian Möller,Vera Schmitt

Main category: cs.CL

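TL;DR: This paper introduces two datasets, MultiCoXQL and Compass, together with a new parsing approach, to address the challenges ConvXAI systems face in multilingual generalization and support for custom inputs.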

Details Motivation: Current ConvXAI systems based on intent recognition face significant challenges in multilingual generalization: training data is scarce and support for free-form custom inputs is limited. Method: The authors introduce MultiCoXQL, a multilingual extension of the CoXQL dataset, design a new parsing approach, and present Compass, a new multilingual dataset for custom input extraction. Result: Three LLMs are evaluated on MultiCoXQL using various parsing strategies, and monolingual, cross-lingual, and multilingual evaluations are conducted on Compass using LLMs of varying sizes alongside BERT-type models. Conclusion: The paper presents the MultiCoXQL and Compass datasets and a new parsing approach aimed at addressing the challenges of multilingual generalization and custom input support in ConvXAI systems. Abstract: Conversational explainable artificial intelligence (ConvXAI) systems based on large language models (LLMs) have garnered considerable attention for their ability to enhance user comprehension through dialogue-based explanations. Current ConvXAI systems are often based on intent recognition to accurately identify the user's desired intention and map it to an explainability method. While such methods offer great precision and reliability in discerning users' underlying intentions for English, a significant challenge in the scarcity of training data persists, which impedes multilingual generalization. Besides, the support for free-form custom inputs, which are user-defined data distinct from pre-configured dataset instances, remains largely limited. To bridge these gaps, we first introduce MultiCoXQL, a multilingual extension of the CoXQL dataset spanning five typologically diverse languages, including one low-resource language. Subsequently, we propose a new parsing approach aimed at enhancing multilingual parsing performance, and evaluate three LLMs on MultiCoXQL using various parsing strategies. Furthermore, we present Compass, a new multilingual dataset designed for custom input extraction in ConvXAI systems, encompassing 11 intents across the same five languages as MultiCoXQL. We conduct monolingual, cross-lingual, and multilingual evaluations on Compass, employing three LLMs of varying sizes alongside BERT-type models.

[6] Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner

Bolian Li,Yanran Wu,Xinyu Luo,Ruqi Zhang

Main category: cs.CL

TL;DR: The paper introduces a new algorithm, called reward-Shifted Speculative Sampling, which efficiently aligns large language models with human preferences, achieving high performance with reduced computational costs.

Details Motivation: Aligning large language models with human preferences is a critical step in their development. However, current test-time alignment techniques often incur substantial inference costs, limiting their practical application. The authors were inspired by speculative sampling acceleration to address this efficiency bottleneck. Method: The authors introduced the reward-Shifted Speculative Sampling (SSS) algorithm, which utilizes an aligned draft model to predict future tokens while keeping the target model unchanged. They theoretically demonstrated how the distributional shift between the aligned draft model and the unaligned target model can be exploited to recover the RLHF optimal solution by modifying the acceptance criterion and bonus token distribution. Result: The algorithm achieves superior gold reward scores at a significantly reduced inference cost in test-time weak-to-strong alignment experiments. Conclusion: The reward-Shifted Speculative Sampling algorithm can effectively and efficiently align large language models with human preferences, achieving superior gold reward scores at reduced inference costs. Abstract: Aligning large language models (LLMs) with human preferences has become a critical step in their development. Recent research has increasingly focused on test-time alignment, where additional compute is allocated during inference to enhance LLM safety and reasoning capabilities. However, these test-time alignment techniques often incur substantial inference costs, limiting their practical application. We are inspired by the speculative sampling acceleration, which leverages a small draft model to efficiently predict future tokens, to address the efficiency bottleneck of test-time alignment. We introduce the reward-Shifted Speculative Sampling (SSS) algorithm, in which the draft model is aligned with human preferences, while the target model remains unchanged. 
We theoretically demonstrate that the distributional shift between the aligned draft model and the unaligned target model can be exploited to recover the RLHF optimal solution without actually obtaining it, by modifying the acceptance criterion and bonus token distribution. Our algorithm achieves superior gold reward scores at a significantly reduced inference cost in test-time weak-to-strong alignment experiments, thereby validating both its effectiveness and efficiency.
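
For context, the baseline that this abstract builds on is ordinary speculative sampling: a draft model proposes a token, the target model accepts it with probability min(1, p_target / p_draft), and a rejection is followed by a resample from the normalised residual. The sketch below shows only that standard rule on a toy vocabulary. It does not reproduce the paper's modified acceptance criterion or bonus token distribution, and the function names and probabilities are illustrative assumptions.

```python
import random


def speculative_step(draft_probs, target_probs, rng):
    """One token of standard speculative sampling over a shared vocabulary.

    Returns (token_index, accepted), where accepted says whether the draft
    proposal was kept.
    """
    vocab = range(len(draft_probs))
    # 1. The draft model proposes a token.
    proposal = rng.choices(vocab, weights=draft_probs)[0]
    # 2. Accept with probability min(1, p_target / p_draft).
    ratio = target_probs[proposal] / draft_probs[proposal]
    if rng.random() < min(1.0, ratio):
        return proposal, True
    # 3. On rejection, resample from the residual max(0, p_target - p_draft).
    residual = [max(0.0, t - d) for t, d in zip(target_probs, draft_probs)]
    return rng.choices(vocab, weights=residual)[0], False


def empirical_distribution(draft_probs, target_probs, n, seed=0):
    """Frequency of each emitted token over n independent steps."""
    rng = random.Random(seed)
    counts = [0] * len(target_probs)
    for _ in range(n):
        token, _accepted = speculative_step(draft_probs, target_probs, rng)
        counts[token] += 1
    return [c / n for c in counts]


toy_draft = [0.2, 0.5, 0.3]   # made-up draft distribution
toy_target = [0.7, 0.2, 0.1]  # made-up target distribution
freqs = empirical_distribution(toy_draft, toy_target, n=20000)
```

With the standard rule, the emitted tokens follow the target distribution regardless of the draft. The abstract's point is that changing the acceptance and resampling steps lets an aligned draft shift the output toward the RLHF-optimal distribution instead.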

[7] LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text

MohamamdJavad Ardestani,Ehsan Kamalloo,Davood Rafiei

Main category: cs.CL

TL;DR: LongRecall is a new framework for evaluating recall in machine-generated text, combining lexical, semantic, and structured entailment checks to improve accuracy and reduce errors.

Details Motivation: Ensuring the completeness of machine-generated text is crucial in critical domains like medicine and law, and existing recall metrics suffer from issues like lexical dependency, errors with paraphrased answers, and hallucinations in LLM-as-a-Judge methods. Method: LongRecall is a general three-stage recall evaluation framework that decomposes answers into self-contained facts, filters plausible candidate matches using lexical and semantic methods, and verifies alignment through structured entailment checks. Result: LongRecall demonstrates substantial improvements in recall accuracy over strong lexical and LLM-as-a-Judge baselines on three challenging long-form QA benchmarks. Conclusion: LongRecall serves as a foundational building block for systematic recall assessment, showing substantial improvements in recall accuracy on challenging long-form QA benchmarks using both human annotations and LLM-based judges. Abstract: The completeness of machine-generated text, ensuring that it captures all relevant information, is crucial in domains such as medicine and law and in tasks like list-based question answering (QA), where omissions can have serious consequences. However, existing recall metrics often depend on lexical overlap, leading to errors with unsubstantiated entities and paraphrased answers, while LLM-as-a-Judge methods with long holistic prompts capture broader semantics but remain prone to misalignment and hallucinations without structured verification. We introduce LongRecall, a general three-stage recall evaluation framework that decomposes answers into self-contained facts, successively narrows plausible candidate matches through lexical and semantic filtering, and verifies their alignment through structured entailment checks. This design reduces false positives and false negatives while accommodating diverse phrasings and contextual variations, serving as a foundational building block for systematic recall assessment.
We evaluate LongRecall on three challenging long-form QA benchmarks using both human annotations and LLM-based judges, demonstrating substantial improvements in recall accuracy over strong lexical and LLM-as-a-Judge baselines.

[8] Mapping the Course for Prompt-based Structured Prediction

Matt Pauk,Maria Leonor Pacheco

Main category: cs.CL

TL;DR: This paper proposes combining LLMs with combinatorial inference to improve structured prediction, showing that structured learning still adds value in the LLM era.

Details Motivation: LLMs struggle with hallucinations and complex reasoning due to their autoregressive nature, motivating the need for structural consistency through inference methods. Method: The study combines LLMs with combinatorial inference to enhance structured prediction. It explores prompting strategies, calibration, and fine-tuning using structured prediction objectives. Result: Experiments show that symbolic inference improves predictions regardless of prompting strategy, and calibration and fine-tuning further enhance performance on challenging tasks. Conclusion: Structured learning remains valuable in the era of LLMs, as combining LLMs with combinatorial inference improves prediction consistency and accuracy. Abstract: LLMs have been shown to be useful for a variety of language tasks, without requiring task-specific fine-tuning. However, these models often struggle with hallucinations and complex reasoning problems due to their autoregressive nature. We propose to address some of these issues, specifically in the area of structured prediction, by combining LLMs with combinatorial inference in an attempt to marry the predictive power of LLMs with the structural consistency provided by inference methods. We perform exhaustive experiments in an effort to understand which prompting strategies can effectively estimate LLM confidence values for use with symbolic inference, and show that, regardless of the prompting strategy, the addition of symbolic inference on top of prompting alone leads to more consistent and accurate predictions. Additionally, we show that calibration and fine-tuning using structured prediction objectives leads to increased performance for challenging tasks, showing that structured learning is still valuable in the era of LLMs.

[9] Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

Rabeeh Karimi Mahabadi,Sanjeev Satheesh,Shrimai Prabhumoye,Mostofa Patwary,Mohammad Shoeybi,Bryan Catanzaro

Main category: cs.CL

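TL;DR: This paper proposes a new method for extracting mathematical content from large-scale web data and builds a high-quality math corpus that significantly improves models' mathematical reasoning ability.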

Details Motivation: Existing math datasets are of low quality because of unreliable extraction methods, lossy conversion, and the loss of mathematical structure, so a more robust extraction method is needed. Method: Mathematical content is extracted from Common Crawl through a domain-agnostic pipeline that uses lynx rendering and LLM-based cleaning to preserve mathematical structure. Result: The authors build Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens); Nemotron-CC-Math-4+ surpasses all prior open math datasets in both quality and scale. Conclusion: Nemotron-CC-Math provides a large-scale, high-quality math corpus that significantly improves model performance on math, code, and general reasoning tasks. Abstract: Pretraining large language models (LLMs) on high-quality, structured data such as mathematics and code substantially enhances reasoning capabilities. However, existing math-focused datasets built from Common Crawl suffer from degraded quality due to brittle extraction heuristics, lossy HTML-to-text conversion, and the failure to reliably preserve mathematical structure. In this work, we introduce Nemotron-CC-Math, a large-scale, high-quality mathematical corpus constructed from Common Crawl using a novel, domain-agnostic pipeline specifically designed for robust scientific text extraction. Unlike previous efforts, our pipeline recovers math across various formats (e.g., MathJax, KaTeX, MathML) by leveraging layout-aware rendering with lynx and a targeted LLM-based cleaning stage. This approach preserves the structural integrity of equations and code blocks while removing boilerplate, standardizing notation into LaTeX representation, and correcting inconsistencies. We collected a large, high-quality math corpus, namely Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens). Notably, Nemotron-CC-Math-4+ not only surpasses all prior open math datasets, including MegaMath, FineMath, and OpenWebMath, but also contains 5.5 times more tokens than FineMath-4+, which was previously the highest-quality math pretraining dataset. When used to pretrain a Nemotron-T 8B model, our corpus yields +4.8 to +12.6 gains on MATH and +4.6 to +14.3 gains on MBPP+ over strong baselines, while also improving general-domain performance on MMLU and MMLU-Stem.
We present the first pipeline to reliably extract scientific content, including math, from noisy web-scale data, yielding measurable gains in math, code, and general reasoning, and setting a new state of the art among open math pretraining corpora. To support open-source efforts, we release our code and datasets.

[10] Identifying and Answering Questions with False Assumptions: An Interpretable Approach

Zijie Wang,Eduardo Blanco

Main category: cs.CL

TL;DR: This paper presents an approach to identify and answer questions with false assumptions by leveraging external evidence and validating atomic assumptions, thereby reducing hallucinations in LLMs.

Details Motivation: The motivation is to address the issue of misleading answers generated by LLMs due to hallucinations when answering questions with false assumptions. Method: The method involves reducing the problem to fact verification and leveraging external evidence to mitigate hallucinations in LLMs. This includes generating and validating atomic assumptions. Result: Experiments with five LLMs showed that incorporating retrieved evidence is beneficial, and generating and validating atomic assumptions leads to more improvements and provides interpretable answers by specifying the false assumptions. Conclusion: The paper concludes that by generating and validating atomic assumptions and incorporating retrieved evidence, the identification and answering of questions with false assumptions can be significantly improved, providing interpretable answers. Abstract: People often ask questions with false assumptions, a type of question that does not have regular answers. Answering such questions requires first identifying the false assumptions. Large Language Models (LLMs) often generate misleading answers because of hallucinations. In this paper, we focus on identifying and answering questions with false assumptions in several domains. We first investigate whether the problem can be reduced to fact verification. Then, we present an approach leveraging external evidence to mitigate hallucinations. Experiments with five LLMs demonstrate that (1) incorporating retrieved evidence is beneficial and (2) generating and validating atomic assumptions yields more improvements and provides an interpretable answer by specifying the false assumptions.

[11] ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following

Seungmin Han,Haeun Kwon,Ji-jun Park,Taeyang Yoon

Main category: cs.CL

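TL;DR: This paper proposes CoLVLM Agent, an iterative framework that requires no extensive re-training of the underlying models and significantly improves existing LVLMs on complex multi-modal dialogue tasks, including reasoning depth, instruction adherence, and error suppression.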

Details Motivation: Despite significant progress in LLMs and LVLMs, existing models still face major challenges on complex multi-modal tasks that require deep reasoning, sustained contextual understanding, entity tracking, and multi-step instruction following. Current benchmarks often fail to capture the dynamism and complexity of real-world multi-modal interactions, leading to context loss and visual hallucinations. Method: The authors propose the CoLVLM Agent framework, built on an iterative "memory-perception-planning-execution" cycle, to enhance existing LVLMs in complex multi-modal dialogue. Result: Extensive experiments on MMDR-Bench show that CoLVLM Agent attains an average human evaluation score of 4.03, significantly outperforming state-of-the-art commercial models such as GPT-4o (3.92) and Gemini 1.5 Pro (3.85) in reasoning depth, instruction adherence, and error suppression, and that it maintains robust performance over extended dialogues. Conclusion: CoLVLM Agent is effective on complex multi-modal interaction tasks: through its iterative "memory-perception-planning-execution" cycle, it significantly improves the reasoning and instruction-following abilities of existing LVLMs without extensive re-training of the underlying models. Abstract: Despite significant advancements in Large Language Models (LLMs) and Large Vision-Language Models (LVLMs), current models still face substantial challenges in handling complex, multi-turn, and visually-grounded tasks that demand deep reasoning, sustained contextual understanding, entity tracking, and multi-step instruction following. Existing benchmarks often fall short in capturing the dynamism and intricacies of real-world multi-modal interactions, leading to issues such as context loss and visual hallucinations. To address these limitations, we introduce MMDR-Bench (Multi-Modal Dialogue Reasoning Benchmark), a novel dataset comprising 300 meticulously designed complex multi-turn dialogue scenarios, each averaging 5-7 turns and evaluated across six core dimensions including visual entity tracking and reasoning depth. Furthermore, we propose CoLVLM Agent (Contextual LVLM Agent), a holistic framework that enhances existing LVLMs with advanced reasoning and instruction following capabilities through an iterative "memory-perception-planning-execution" cycle, requiring no extensive re-training of the underlying models. Our extensive experiments on MMDR-Bench demonstrate that CoLVLM Agent consistently achieves superior performance, attaining an average human evaluation score of 4.03, notably surpassing state-of-the-art commercial models like GPT-4o (3.92) and Gemini 1.5 Pro (3.85).
The framework exhibits significant advantages in reasoning depth, instruction adherence, and error suppression, and maintains robust performance over extended dialogue turns, validating the effectiveness of its modular design and iterative approach for complex multi-modal interactions.

[12] SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling

Dong Liu,Yanxuan Yu

Main category: cs.CL

TL;DR: SemToken is a semantic-aware tokenization framework that improves computation efficiency in language models by reducing token redundancy while maintaining performance.

Details Motivation: Existing tokenization approaches like BPE or WordPiece rely solely on frequency statistics, ignoring the semantic structure of text, which leads to over-tokenization of semantically redundant spans and underutilization of contextual coherence. Method: The authors propose SemToken, a semantic-aware tokenization framework that reduces token redundancy and improves computation efficiency by extracting contextual semantic embeddings and performing local semantic clustering. Result: Experiments show that SemToken achieves up to 2.4× reduction in token count and 1.9× speedup, with negligible or no degradation in perplexity and downstream accuracy. Conclusion: Semantic structure provides a promising new axis for optimizing tokenization and computation in large language models. Abstract: Tokenization plays a critical role in language modeling, yet existing approaches such as Byte-Pair Encoding (BPE) or WordPiece operate purely on frequency statistics, ignoring the underlying semantic structure of text. This leads to over-tokenization of semantically redundant spans and underutilization of contextual coherence, particularly in long-context scenarios. In this work, we propose SemToken, a semantic-aware tokenization framework that jointly reduces token redundancy and improves computation efficiency. SemToken first extracts contextual semantic embeddings via lightweight encoders and performs local semantic clustering to merge semantically equivalent tokens. Then, it allocates heterogeneous token granularity based on semantic density, allowing finer-grained tokenization in content-rich regions and coarser compression in repetitive or low-entropy spans. SemToken can be seamlessly integrated with modern language models and attention acceleration methods.
Experiments on long-context language modeling benchmarks such as WikiText-103 and LongBench show that SemToken achieves up to 2.4× reduction in token count and 1.9× speedup, with negligible or no degradation in perplexity and downstream accuracy. Our findings suggest that semantic structure offers a promising new axis for optimizing tokenization and computation in large language models.
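
The "local semantic clustering" idea in this abstract can be illustrated with a toy merge rule: walk the token sequence and fold a token into the current span whenever its embedding is nearly parallel to the span's first embedding. This is only a sketch under stated assumptions. The `merge_adjacent` function, the 0.95 cosine threshold, and the hand-written 2-D embeddings are hypothetical, and the actual method additionally adapts granularity to semantic density.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_adjacent(tokens, embeddings, threshold=0.95):
    """Greedily group consecutive tokens whose embeddings match the span's first token."""
    if not tokens:
        return []
    spans = [[tokens[0]]]
    anchor = embeddings[0]  # embedding of the first token in the current span
    for tok, emb in zip(tokens[1:], embeddings[1:]):
        if cosine(anchor, emb) >= threshold:
            spans[-1].append(tok)  # semantically redundant: extend the span
        else:
            spans.append([tok])    # new content: start a fresh span
            anchor = emb
    return spans


toy_tokens = ["very", "very", "very", "good"]
toy_embeddings = [[1.0, 0.0], [0.99, 0.05], [1.0, 0.02], [0.0, 1.0]]  # made-up vectors
spans = merge_adjacent(toy_tokens, toy_embeddings)
```

Here four input tokens collapse into two spans, which is the kind of sequence-length reduction the abstract reports in aggregate.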

[13] Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

Yuanchen Zhou,Shuo Jiang,Jie Zhu,Junhui Li,Lifan Guo,Feng Chen,Chi Zhang

Main category: cs.CL

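TL;DR: This paper proposes Fin-PRM, a process reward model designed specifically for the financial domain, which significantly improves performance on financial reasoning tasks through fine-grained evaluation of reasoning trajectories.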

Details Motivation: Existing general-purpose process reward models (PRMs) fall short in specialized domains such as finance, where reasoning is more structured and symbolic and is sensitive to factual and regulatory correctness. A domain-specialized PRM is therefore needed to improve reasoning quality on financial tasks. Method: Fin-PRM combines step-level and trajectory-level reward supervision and is applied in both offline and online reward learning settings to support reasoning trajectory selection, dense process rewards for reinforcement learning, and reward-guided Best-of-N inference at test time. Result: Fin-PRM performs strongly on financial reasoning benchmarks such as CFLUE and FinQA, with gains of 12.9%, 5.2%, and 5.1% in supervised learning, reinforcement learning, and test-time performance, respectively. Conclusion: Fin-PRM is a domain-specialized, trajectory-aware process reward model for financial reasoning tasks. It integrates step-level and trajectory-level reward supervision to enable fine-grained evaluation of financial reasoning traces, and it outperforms general-purpose PRMs and strong domain baselines on financial reasoning benchmarks. Abstract: Process Reward Models (PRMs) have emerged as a promising framework for supervising intermediate reasoning in large language models (LLMs), yet existing PRMs are primarily trained on general or Science, Technology, Engineering, and Mathematics (STEM) domains and fall short in domain-specific contexts such as finance, where reasoning is more structured, symbolic, and sensitive to factual and regulatory correctness. We introduce Fin-PRM, a domain-specialized, trajectory-aware PRM tailored to evaluate intermediate reasoning steps in financial tasks. Fin-PRM integrates step-level and trajectory-level reward supervision, enabling fine-grained evaluation of reasoning traces aligned with financial logic. We apply Fin-PRM in both offline and online reward learning settings, supporting three key applications: (i) selecting high-quality reasoning trajectories for distillation-based supervised fine-tuning, (ii) providing dense process-level rewards for reinforcement learning, and (iii) guiding reward-informed Best-of-N inference at test time. Experimental results on financial reasoning benchmarks, including CFLUE and FinQA, demonstrate that Fin-PRM consistently outperforms general-purpose PRMs and strong domain baselines in trajectory selection quality. Downstream models trained with Fin-PRM yield substantial improvements over baselines, with gains of 12.9% in supervised learning, 5.2% in reinforcement learning, and 5.1% in test-time performance.
These findings highlight the value of domain-specialized reward modeling for aligning LLMs with expert-level financial reasoning. Our project resources will be available at https://github.com/aliyun/qwen-dianjin.
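
Reward-informed Best-of-N inference, the third application listed in this abstract, amounts to scoring N sampled reasoning trajectories and keeping the highest-scoring one. The sketch below assumes a simple linear blend of the mean step-level score and a trajectory-level score. That blend, the `alpha` weight, the function names, and the scores are illustrative assumptions rather than Fin-PRM's actual aggregation.

```python
def trajectory_reward(step_scores, trajectory_score, alpha=0.5):
    """Blend the mean step-level score with a trajectory-level score.

    alpha weights the step-level part; this linear blend is an assumption.
    """
    step_mean = sum(step_scores) / len(step_scores)
    return alpha * step_mean + (1 - alpha) * trajectory_score


def best_of_n(candidates, alpha=0.5):
    """Pick the candidate whose blended reward is highest."""
    return max(
        candidates,
        key=lambda c: trajectory_reward(c["step_scores"], c["trajectory_score"], alpha),
    )


toy_candidates = [  # made-up reward-model outputs for two sampled trajectories
    {"answer": "A", "step_scores": [0.9, 0.8, 0.7], "trajectory_score": 0.6},
    {"answer": "B", "step_scores": [0.6, 0.9], "trajectory_score": 0.9},
]
best = best_of_n(toy_candidates)
```

In this toy case trajectory A has the better individual steps but trajectory B wins overall, which shows why combining the two levels of supervision can change the selection.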

[14] SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning

Huanxuan Liao,Yixing Xu,Shizhu He,Guanchen Li,Xuanwu Yin,Dong Li,Emad Barsoum,Jun Zhao,Kang Liu

Main category: cs.CL

TL;DR: SPARK addresses the memory bottleneck in long-context inference through channel-level KV cache pruning while preserving or improving model accuracy.

Details Motivation: Existing KV cache compression methods typically overlook fine-grained importance variations along the feature dimension (i.e., the channel axis), which limits their ability to balance efficiency and model accuracy. SPARK aims to address this by focusing on channel-level saliency variations. Method: SPARK applies unstructured sparsity, pruning the KV cache at the channel level and dynamically restoring the pruned entries during attention score computation. Result: SPARK can process longer sequences within the same memory budget and reduces KV cache storage by over 30% compared with eviction-based methods. Even at an aggressive pruning ratio of 80%, its performance degradation is below 5%, demonstrating robustness and effectiveness. Conclusion: SPARK is a training-free, plug-and-play method that addresses the KV cache bottleneck in long-context inference by applying unstructured sparsity to the KV cache at the channel level; it is orthogonal to existing KV compression and quantization techniques and can be combined with them for further gains. Abstract: Long-context inference in large language models (LLMs) is increasingly constrained by the KV cache bottleneck: memory usage grows linearly with sequence length, while attention computation scales quadratically. Existing approaches address this issue by compressing the KV cache along the temporal axis through strategies such as token eviction or merging to reduce memory and computational overhead. However, these methods often neglect fine-grained importance variations across feature dimensions (i.e., the channel axis), thereby limiting their ability to effectively balance efficiency and model accuracy. In reality, we observe that channel saliency varies dramatically across both queries and positions: certain feature channels carry near-zero information for a given query, while others spike in relevance. To address this oversight, we propose SPARK, a training-free plug-and-play method that applies unstructured sparsity by pruning KV at the channel level, while dynamically restoring the pruned entries during attention score computation. Notably, our approach is orthogonal to existing KV compression and quantization techniques, making it compatible for integration with them to achieve further acceleration. By reducing channel-level redundancy, SPARK enables processing of longer sequences within the same memory budget. For sequences of equal length, SPARK not only preserves or improves model accuracy but also reduces KV cache storage by over 30% compared to eviction-based methods.
Furthermore, even with an aggressive pruning ratio of 80%, SPARK maintains performance with less than 5% degradation compared to the baseline eviction method, demonstrating its robustness and effectiveness. Our code will be available at https://github.com/Xnhyacinth/SparK.
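
To make the idea of channel-level pruning with restoration at score time concrete, here is a minimal sketch in plain Python. It is not the SPARK implementation: the abstract does not specify the saliency criterion or the recovery function, so this sketch assumes magnitude-based channel selection and restores pruned channels with the mean of the pruned values. The names `prune_channels`, `restore`, and `attention_score` are hypothetical.

```python
def prune_channels(key, keep_ratio):
    """Keep the largest-magnitude channels of one key vector.

    Assumption: saliency = absolute value of the key entry.
    Pruned channels are summarized by one fill value (their mean).
    """
    dim = len(key)
    n_keep = max(1, round(dim * keep_ratio))
    by_saliency = sorted(range(dim), key=lambda c: abs(key[c]), reverse=True)
    kept = set(by_saliency[:n_keep])
    pruned = [key[c] for c in range(dim) if c not in kept]
    fill = sum(pruned) / len(pruned) if pruned else 0.0
    return {"dim": dim, "values": {c: key[c] for c in kept}, "fill": fill}


def restore(sparse_key):
    """Rebuild a dense key vector, filling pruned channels with the stored fill value."""
    return [sparse_key["values"].get(c, sparse_key["fill"])
            for c in range(sparse_key["dim"])]


def attention_score(query, sparse_key):
    """Dot-product score computed against the restored key."""
    return sum(q * k for q, k in zip(query, restore(sparse_key)))


# Usage: prune half of the channels of a 4-dimensional key.
sparse = prune_channels([4.0, 0.1, -3.0, 0.3], keep_ratio=0.5)
score = attention_score([1.0, 1.0, 1.0, 1.0], sparse)
```

Only the kept channel values and one scalar per key are stored, which is where the memory saving in this toy version comes from.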

[15] Select to Know: An Internal-External Knowledge Self-Selection Framework for Domain-Specific Question Answering

Bolei He,Xinran He,Run Shao,Shanfu Shu,Xianwei Xue,Mingquan Cheng,Haifeng Li,Zhenhua Ling

Main category: cs.CL

TL;DR: Selct2Know (S2K) is a cost-effective framework for enhancing domain-specific knowledge in large language models by combining internal-external knowledge self-selection and selective supervised fine-tuning, outperforming existing methods in domain-specific QA benchmarks.

Details Motivation: Domain-specific scenarios pose challenges for large language models due to long-tail knowledge distributions, and current methods like RAG and continued pretraining have limitations such as hallucinations, latency, cost, and lack of cross-domain flexibility. Method: Selct2Know (S2K) uses an internal-external knowledge self-selection strategy and selective supervised fine-tuning, along with a structured reasoning data generation pipeline and GRPO integration to enhance reasoning. Result: Experiments on medical, legal, and financial QA benchmarks show that S2K consistently outperforms existing methods and matches domain-pretrained LLMs at a significantly lower cost. Conclusion: S2K provides an effective and cost-efficient solution for enhancing domain-specific knowledge in LLMs, addressing the limitations of previous approaches by leveraging a progressive knowledge acquisition strategy. Abstract: Large Language Models (LLMs) perform well in general QA but often struggle in domain-specific scenarios. Retrieval-Augmented Generation (RAG) introduces external knowledge but suffers from hallucinations and latency due to noisy retrievals. Continued pretraining internalizes domain knowledge but is costly and lacks cross-domain flexibility. We attribute this challenge to the long-tail distribution of domain knowledge, which leaves partial yet useful internal knowledge underutilized. We further argue that knowledge acquisition should be progressive, mirroring human learning: first understanding concepts, then applying them to complex reasoning. To address this, we propose Selct2Know (S2K), a cost-effective framework that internalizes domain knowledge through an internal-external knowledge self-selection strategy and selective supervised fine-tuning. We also introduce a structured reasoning data generation pipeline and integrate GRPO to enhance reasoning ability. 
Experiments on medical, legal, and financial QA benchmarks show that S2K consistently outperforms existing methods and matches domain-pretrained LLMs with significantly lower cost.

[16] Self-Guided Function Calling in Large Language Models via Stepwise Experience Recall

Sijia Cui,Aiyao He,Shuai Xu,Hongming Zhang,Yanna Wang,Qingyang Zhang,Yajing Wang,Bo Xu

Main category: cs.CL

TL;DR: The paper introduces SEER, a self-guided method for improving multi-step tool usage in LLMs by leveraging a continuously updated experience pool, resulting in improved performance on benchmarks and real-world domains.

Details Motivation: LLMs struggle with multi-step tool usage, and existing methods require significant expert effort and complex prompt engineering. There is a need for a more efficient and scalable approach. Method: The proposed method, Stepwise Experience Recall (SEER), performs fine-grained, stepwise retrieval from a continually updated experience pool, incrementally augmenting it with past successful trajectories. Result: On the ToolQA benchmark, SEER achieves an average improvement of 6.1% on easy and 4.7% on hard questions. On τ-bench, it demonstrates accuracy gains of 7.44% and 23.38% using Qwen2.5-7B and Qwen2.5-72B models, respectively. Conclusion: SEER demonstrates substantial accuracy gains on the ToolQA benchmark and real-world domains, indicating its effectiveness in multi-step tool usage for LLMs. Abstract: Function calling enables large language models (LLMs) to interact with external systems by leveraging tools and APIs. When faced with multi-step tool usage, LLMs still struggle with tool selection, parameter generation, and tool-chain planning. Existing methods typically rely on manually designing task-specific demonstrations, or retrieving from a curated library. These approaches demand substantial expert effort, and prompt engineering becomes increasingly complex and inefficient as tool diversity and task difficulty scale. To address these challenges, we propose a self-guided method, Stepwise Experience Recall (SEER), which performs fine-grained, stepwise retrieval from a continually updated experience pool. Instead of relying on a static or manually curated library, SEER incrementally augments the experience pool with past successful trajectories, enabling continuous expansion of the pool and improved model performance over time. Evaluated on the ToolQA benchmark, SEER achieves an average improvement of 6.1% on easy and 4.7% on hard questions. We further test SEER on τ-bench, which includes two real-world domains. 
Powered by Qwen2.5-7B and Qwen2.5-72B models, SEER demonstrates substantial accuracy gains of 7.44% and 23.38%, respectively.
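
The stepwise retrieval from a growing experience pool can be sketched as follows. This is an illustrative toy, not the SEER code: the class `ExperiencePool`, the `(state, tool_call)` step format, and the word-overlap (Jaccard) similarity are all assumptions made for the example, since the abstract does not describe the retrieval scoring.

```python
def jaccard(a, b):
    """Word-overlap similarity between two step descriptions (assumed metric)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


class ExperiencePool:
    """Stores individual steps taken from successful trajectories only."""

    def __init__(self):
        self.steps = []  # list of (state_text, tool_call) pairs

    def add_trajectory(self, trajectory, success):
        # Only successful trajectories extend the pool.
        if success:
            self.steps.extend(trajectory)

    def recall(self, state, k=2):
        # Retrieve the k stored steps most similar to the current step.
        ranked = sorted(self.steps, key=lambda s: jaccard(state, s[0]), reverse=True)
        return ranked[:k]


# Usage: the pool grows after each solved task and is queried at every step.
pool = ExperiencePool()
pool.add_trajectory(
    [("find flight to Paris", "search_flights(dest='Paris')"),
     ("book the cheapest flight", "book_flight(id=1)")],
    success=True,
)
pool.add_trajectory([("find hotel in Paris", "bad_call()")], success=False)
examples = pool.recall("find flight to Rome", k=1)
```

Retrieving per step rather than per task lets the demonstrations change as the tool chain progresses, which is the property the abstract emphasizes.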

[17] Are Checklists Really Useful for Automatic Evaluation of Generative Tasks?

Momoka Furuhashi,Kouta Nakayama,Takashi Kodama,Saku Sugawara

Main category: cs.CL

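TL;DR: This study examines how effective checklists are for evaluating generative tasks. It finds that selective checklist use works better in pairwise comparison and reveals possible inconsistencies in human evaluation.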

Details Motivation: Automatic evaluation of generative tasks with large language models is difficult because the criteria are ambiguous, and the usefulness of automatic checklist generation, a potentially promising approach, has not been fully explored. Method: Checklists are generated with six methods, their effectiveness is evaluated across eight model sizes, and correlation analysis identifies the checklist items that correlate with human evaluations. Result: Experiments show that selective checklist use improves evaluation performance in pairwise comparison, while its effect in direct scoring is inconsistent. Even checklist items with low correlation to human scores reflect human-written criteria, suggesting that human evaluation may be inconsistent. Conclusion: Selective checklist use generally improves evaluation performance in pairwise comparison settings but is less consistent in direct scoring. The study stresses the need to define objective evaluation criteria more clearly to guide both human and automatic evaluation. Abstract: Automatic evaluation of generative tasks using large language models faces challenges due to ambiguous criteria. Although automatic checklist generation is a potentially promising approach, its usefulness remains underexplored. We investigate whether checklists should be used for all questions or selectively, generate them using six methods, evaluate their effectiveness across eight model sizes, and identify checklist items that correlate with human evaluations. Through experiments on pairwise comparison and direct scoring tasks, we find that selective checklist use tends to improve evaluation performance in pairwise settings, while its benefits are less consistent in direct scoring. Our analysis also shows that even checklist items with low correlation to human scores often reflect human-written criteria, indicating potential inconsistencies in human evaluation. These findings highlight the need to more clearly define objective evaluation criteria to guide both human and automatic evaluations. Our code is available at https://github.com/momo0817/checklist-effectiveness-study

[18] VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models

Hanling Zhang,Yayu Zhou,Tongcheng Fang,Zhihang Yuan,Guohao Dai,Yu Wang

Main category: cs.CL

TL;DR: VocabTailor improves memory efficiency in small language models by dynamically managing vocabulary components, allowing deployment on edge devices without performance loss.

Details Motivation: SLMs face memory bottlenecks due to vocabulary-related components, and existing static pruning methods are inflexible and cause information loss. Method: VocabTailor uses a hybrid static-dynamic vocabulary selection strategy and offloads embeddings to reduce memory usage. Result: VocabTailor achieves up to a 99% reduction in memory usage for vocabulary-related components with minimal impact on performance. Conclusion: VocabTailor effectively addresses memory constraints in SLMs by dynamically selecting vocabulary components, significantly outperforming static pruning methods. Abstract: Small Language Models (SLMs) provide computational advantages in resource-constrained environments, yet memory limitations remain a critical bottleneck for edge device deployment. A substantial portion of SLMs' memory footprint stems from vocabulary-related components, particularly embeddings and language modeling (LM) heads, due to large vocabulary sizes. Existing static vocabulary pruning, while reducing memory usage, suffers from rigid, one-size-fits-all designs that cause information loss from the prefill stage and a lack of flexibility. In this work, we identify two key principles underlying the vocabulary reduction challenge: the lexical locality principle, the observation that only a small subset of tokens is required during any single inference, and the asymmetry in computational characteristics between vocabulary-related components of SLM. Based on these insights, we introduce VocabTailor, a novel decoupled dynamic vocabulary selection framework that addresses memory constraints through offloading embedding and implements a hybrid static-dynamic vocabulary selection strategy for LM Head, enabling on-demand loading of vocabulary components. 
Comprehensive experiments across diverse downstream tasks demonstrate that VocabTailor achieves a reduction of up to 99% in the memory usage of vocabulary-related components with minimal or no degradation in task performance, substantially outperforming existing static vocabulary pruning.
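
The hybrid static-dynamic selection for the LM head can be illustrated with a small sketch. It rests on assumptions not stated in the abstract: the static part is taken to be a fixed list of frequent token ids, the dynamic part is taken to be the token ids that appear in the prompt, and the LM head is modeled as a dictionary from token id to weight row standing in for rows loaded on demand. The function names are hypothetical.

```python
def select_vocab(static_ids, prompt_ids):
    """Union of an always-resident static set and the tokens seen in the prompt."""
    return sorted(set(static_ids) | set(prompt_ids))


def reduced_logits(hidden, lm_head, active_ids):
    """Compute logits only for the active token ids.

    lm_head maps token id -> weight row; rows outside active_ids are never touched,
    which models not loading them into memory.
    """
    return {t: sum(h * w for h, w in zip(hidden, lm_head[t])) for t in active_ids}


# Usage with a toy 4-token vocabulary and 2-dimensional hidden state.
lm_head = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [-1.0, 0.0]}
active = select_vocab(static_ids=[0], prompt_ids=[2, 2, 3])
logits = reduced_logits([0.5, 2.0], lm_head, active)
next_token = max(logits, key=logits.get)
```

In this toy setup token 1 is never scored, which mirrors the lexical locality observation that a single inference needs only a small subset of the vocabulary.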

[19] WangchanThaiInstruct: An instruction-following Dataset for Culture-Aware, Multitask, and Multi-domain Evaluation in Thai

Peerat Limkonchotiwat,Pume Tuchinda,Lalita Lowphansirikul,Surapon Nonesung,Panuthep Tasawong,Alham Fikri Aji,Can Udomcharoenchaikit,Sarana Nutanong

Main category: cs.CL

TL;DR: WangchanThaiInstruct is an evaluation and instruction-tuning dataset for Thai that improves LLM alignment through culturally and professionally grounded instruction data.

Details Motivation: Existing benchmarks often rely on translations and miss the cultural and domain-specific nuances needed for real-world use. Method: WangchanThaiInstruct was created through a multi-stage quality control process involving annotators, domain experts, and AI researchers, and was used for a zero-shot evaluation and an instruction-tuning study. Result: Models fine-tuned on WangchanThaiInstruct outperform those trained on translated data on both in-domain and out-of-domain benchmarks. Conclusion: WangchanThaiInstruct underscores the need for culturally and professionally grounded instruction data to improve LLM alignment in low-resource, linguistically diverse settings. Abstract: Large language models excel at instruction-following in English, but their performance in low-resource languages like Thai remains underexplored. Existing benchmarks often rely on translations, missing cultural and domain-specific nuances needed for real-world use. We present WangchanThaiInstruct, a human-authored Thai dataset for evaluation and instruction tuning, covering four professional domains and seven task types. Created through a multi-stage quality control process with annotators, domain experts, and AI researchers, WangchanThaiInstruct supports two studies: (1) a zero-shot evaluation showing performance gaps on culturally and professionally specific tasks, and (2) an instruction tuning study with ablations isolating the effect of native supervision. Models fine-tuned on WangchanThaiInstruct outperform those using translated data in both in-domain and out-of-domain benchmarks. These findings underscore the need for culturally and professionally grounded instruction data to improve LLM alignment in low-resource, linguistically diverse settings.

[20] UniCoM: A Universal Code-Switching Speech Generator

Sangmin Lee,Woojin Chung,Seyun Um,Hong-Goo Kang

Main category: cs.CL

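TL;DR: This paper proposes UniCoM, a new pipeline for generating code-switching speech samples, together with the SWORDS algorithm, and builds the CS-FLEURS corpus to advance multilingual speech technology.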

Details Motivation: Code-switching is common in real-world conversations, but systems that can handle it are limited by the scarcity of suitable datasets, so high-quality code-switching data needs to be generated. Method: The paper proposes UniCoM, a new pipeline for generating code-switching samples. It includes the SWORDS algorithm, which generates code-switching speech by replacing selected words with their translations. Result: CS-FLEURS, a multilingual code-switching corpus built with UniCoM for automatic speech recognition and speech-to-text translation, shows high intelligibility and naturalness. Conclusion: UniCoM addresses a challenge in multilingual speech technology by generating high-quality, natural code-switching samples. CS-FLEURS performs well on both objective and subjective metrics and is expected to advance code-switching speech technology. Abstract: Code-switching (CS), the alternation between two or more languages within a single speaker's utterances, is common in real-world conversations and poses significant challenges for multilingual speech technology. However, systems capable of handling this phenomenon remain underexplored, primarily due to the scarcity of suitable datasets. To resolve this issue, we propose Universal Code-Mixer (UniCoM), a novel pipeline for generating high-quality, natural CS samples without altering sentence semantics. Our approach utilizes an algorithm we call Substituting WORDs with Synonyms (SWORDS), which generates CS speech by replacing selected words with their translations while considering their parts of speech. Using UniCoM, we construct Code-Switching FLEURS (CS-FLEURS), a multilingual CS corpus designed for automatic speech recognition (ASR) and speech-to-text translation (S2TT). Experimental results show that CS-FLEURS achieves high intelligibility and naturalness, performing comparably to existing datasets on both objective and subjective metrics. We expect our approach to advance CS speech technology and enable more inclusive multilingual systems.

[21] EMNLP: Educator-role Moral and Normative Large Language Models Profiling

Yilin Jiang,Mingzi Zhang,Sheng Jin,Zengyi Yu,Xiangjie Kong,Binghao Tu

Main category: cs.CL

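TL;DR: The paper develops the EMNLP framework for assessing the moral and psychological traits of LLMs in teacher roles, finding strengths in abstract moral reasoning alongside weaknesses in emotionally complex situations and exposure to prompt injection risk.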

Details Motivation: Although large language models can simulate professional roles, their psychological and ethical evaluation remains insufficient, so an evaluation framework specific to educator roles is needed. Method: The EMNLP framework extends existing scales, constructs 88 teacher-specific moral dilemmas, and evaluates model compliance and vulnerability with soft prompt injection tests. Result: Teacher-role LLMs show more idealized and polarized personality traits than human teachers. They excel at abstract moral reasoning but perform poorly in emotionally complex situations, and stronger reasoning ability correlates with higher prompt injection risk. Conclusion: The paper proposes the EMNLP framework for assessing the moral and normative traits of large language models (LLMs) in teacher roles. It reveals strengths in abstract moral reasoning, weaknesses in emotionally complex situations, and a paradox between model capability and safety. Abstract: Simulating Professions (SP) enables Large Language Models (LLMs) to emulate professional roles. However, comprehensive psychological and ethical evaluation in these contexts remains lacking. This paper introduces EMNLP, an Educator-role Moral and Normative LLMs Profiling framework for personality profiling, moral development stage measurement, and ethical risk under soft prompt injection. EMNLP extends existing scales and constructs 88 teacher-specific moral dilemmas, enabling profession-oriented comparison with human teachers. A targeted soft prompt injection set evaluates compliance and vulnerability in teacher SP. Experiments on 12 LLMs show teacher-role LLMs exhibit more idealized and polarized personalities than human teachers, excel in abstract moral reasoning, but struggle with emotionally complex situations. Models with stronger reasoning are more vulnerable to harmful prompt injection, revealing a paradox between capability and safety. The model temperature and other hyperparameters have limited influence except in some risk behaviors. This paper presents the first benchmark to assess ethical and psychological alignment of teacher-role LLMs for educational AI. Resources are available at https://e-m-n-l-p.github.io/.

[22] Conflict-Aware Soft Prompting for Retrieval-Augmented Generation

Eunseong Choi,June Park,Hyeri Lee,Jongwuk Lee

Main category: cs.CL

TL;DR: CARE improves RAG systems by resolving conflicts between external context and LLM knowledge, resulting in better performance on QA and fact-checking tasks.

Details Motivation: RAG systems often fail to resolve conflicts between incorrect external context and correct parametric knowledge of LLMs, known as context-memory conflict. This work aims to address this issue. Method: Conflict-Aware REtrieval-Augmented Generation (CARE) uses a context assessor and a base LLM to discern unreliable context and guide reasoning toward reliable knowledge sources. Result: CARE leads to an average performance gain of 5.0% on QA and fact-checking benchmarks. Conclusion: CARE effectively mitigates context-memory conflicts in RAG systems, leading to improved performance on QA and fact-checking benchmarks. Abstract: Retrieval-augmented generation (RAG) enhances the capabilities of large language models (LLMs) by incorporating external knowledge into their input prompts. However, when the retrieved context contradicts the LLM's parametric knowledge, it often fails to resolve the conflict between incorrect external context and correct parametric knowledge, known as context-memory conflict. To tackle this problem, we introduce Conflict-Aware REtrieval-Augmented Generation (CARE), consisting of a context assessor and a base LLM. The context assessor encodes compact memory token embeddings from raw context tokens. Through grounded/adversarial soft prompting, the context assessor is trained to discern unreliable context and capture a guidance signal that directs reasoning toward the more reliable knowledge source. Extensive experiments show that CARE effectively mitigates context-memory conflicts, leading to an average performance gain of 5.0% on QA and fact-checking benchmarks, establishing a promising direction for trustworthy and adaptive RAG systems.

[23] TComQA: Extracting Temporal Commonsense from Text

Lekshmi R Nair,Arun Sankar,Koninika Pal

Main category: cs.CL

TL;DR: This paper proposes a pipeline to mine temporal commonsense using LLMs, resulting in the TComQA dataset, which improves model performance on temporal reasoning tasks.

Details Motivation: Temporal context understanding is crucial but challenging for machines, as it is often not explicitly stated in natural language, and current LLMs struggle with temporal reasoning. Method: A pipeline leveraging large language models (LLMs) to mine temporal commonsense and construct the TComQA dataset, validated through crowdsourcing and evaluated using experimental setups. Result: The TComQA dataset achieved over 80% precision in extracting temporal commonsense, and models trained on TComQA outperformed those fine-tuned on existing datasets for temporal question answering. Conclusion: The proposed temporal commonsense extraction pipeline effectively improves the performance of models on temporal question answering tasks. Abstract: Understanding events necessitates grasping their temporal context, which is often not explicitly stated in natural language. For example, it is not a trivial task for a machine to infer that a museum tour may last for a few hours, but cannot take months. Recent studies indicate that even advanced large language models (LLMs) struggle to generate text that requires reasoning with temporal commonsense, because such knowledge is rarely stated explicitly in text. Therefore, automatically mining temporal commonsense for events enables the creation of robust language models. In this work, we investigate the capacity of LLMs to extract temporal commonsense from text and evaluate multiple experimental setups to assess their effectiveness. Here, we propose a temporal commonsense extraction pipeline that leverages LLMs to automatically mine temporal commonsense and use it to construct TComQA, a dataset derived from SAMSum and RealNews corpora. TComQA has been validated through crowdsourcing and achieves over 80% precision in extracting temporal commonsense. The model trained with TComQA also outperforms an LLM fine-tuned on an existing dataset for the temporal question answering task.

[24] CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing

Abdul Rehman,Jian-Jun Zhang,Xiaosong Yang

Main category: cs.CL

TL;DR: CUPE is a compact, efficient model for universal phoneme recognition that achieves strong cross-lingual performance by focusing on short, phoneme-length acoustic patterns.

Details Motivation: The need for pure phoneme representations free from contextual influence across multiple languages motivated the development of a more efficient and lightweight model. Method: CUPE processes short, fixed-width speech windows independently, modeling basic acoustic patterns within phoneme-length segments to achieve universal phoneme recognition. Result: CUPE achieves competitive cross-lingual performance with fewer parameters than existing approaches, demonstrating strong generalization on diverse languages, including zero-shot tests on the UCLA Phonetic Corpus. Conclusion: CUPE is a lightweight model that effectively captures key phoneme features within a short time frame, demonstrating strong cross-lingual generalization by focusing on fundamental acoustic patterns. Abstract: Universal phoneme recognition typically requires analyzing long speech segments and language-specific patterns. Many speech processing tasks require pure phoneme representations free from contextual influence, which motivated our development of CUPE - a lightweight model that captures key phoneme features in just 120 milliseconds, about one phoneme's length. CUPE processes short, fixed-width windows independently and, despite fewer parameters than current approaches, achieves competitive cross-lingual performance by learning fundamental acoustic patterns common to all languages. Our extensive evaluation through supervised and self-supervised training on diverse languages, including zero-shot tests on the UCLA Phonetic Corpus, demonstrates strong cross-lingual generalization and reveals that effective universal speech processing is possible through modeling basic acoustic patterns within phoneme-length windows.

[25] KG-EDAS: A Meta-Metric Framework for Evaluating Knowledge Graph Completion Models

Haji Gul,Abul Ghani Naim,Ajaz Ahmad Bhat

Main category: cs.CL

TL;DR: This paper proposes a new meta-metric called KG Evaluation based on Distance from Average Solution (EDAS) to provide a unified, reliable, and interpretable evaluation framework for Knowledge Graph Completion models by synthesizing performance across multiple datasets and metrics into a single score.

Details Motivation: The motivation lies in the challenge of comparing Knowledge Graph Completion (KGC) models across multiple datasets and metrics due to inconsistencies in performance rankings. There is a need for a unified meta-metric that enables reliable and interpretable evaluation. Method: KG Evaluation based on Distance from Average Solution (EDAS) was proposed to integrate performance across all metrics and datasets into a single normalized score. Experimental results on benchmark datasets such as FB15k-237 and WN18RR were used to validate the effectiveness of EDAS. Result: Experimental results showed that EDAS effectively integrates multi-metric, multi-dataset performance into a unified ranking, offering a consistent, robust, and generalizable framework for evaluating KGC models. Conclusion: KG Evaluation based on Distance from Average Solution (EDAS) is a robust and interpretable meta-metric that synthesizes model performance across multiple datasets and diverse evaluation criteria into a single normalized score. It offers a consistent, robust, and generalizable framework for evaluating KGC models. Abstract: Knowledge Graphs (KGs) enable applications in various domains such as semantic search, recommendation systems, and natural language processing. KGs are often incomplete, missing entities and relations, an issue addressed by Knowledge Graph Completion (KGC) methods that predict missing elements. Different evaluation metrics, such as Mean Reciprocal Rank (MRR), Mean Rank (MR), and Hit@k, are commonly used to assess the performance of such KGC models. A major challenge in evaluating KGC models, however, lies in comparing their performance across multiple datasets and metrics. A model may outperform others on one dataset but underperform on another, making it difficult to determine overall superiority. 
Moreover, even within a single dataset, different metrics such as MRR and Hit@1 can yield conflicting rankings, where one model excels in MRR while another performs better in Hit@1, further complicating model selection for downstream tasks. These inconsistencies hinder holistic comparisons and highlight the need for a unified meta-metric that integrates performance across all metrics and datasets to enable a more reliable and interpretable evaluation framework. To address this need, we propose KG Evaluation based on Distance from Average Solution (EDAS), a robust and interpretable meta-metric that synthesizes model performance across multiple datasets and diverse evaluation criteria into a single normalized score ($M_i \in [0,1]$). Unlike traditional metrics that focus on isolated aspects of performance, EDAS offers a global perspective that supports more informed model selection and promotes fairness in cross-dataset evaluation. Experimental results on benchmark datasets such as FB15k-237 and WN18RR demonstrate that EDAS effectively integrates multi-metric, multi-dataset performance into a unified ranking, offering a consistent, robust, and generalizable framework for evaluating KGC models.
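
For readers unfamiliar with EDAS, the sketch below follows the classical Evaluation based on Distance from Average Solution procedure from multi-criteria decision-making, applied to a models-by-criteria table. It is an assumption that KG-EDAS follows these exact steps; the paper's weighting and normalization may differ. The function name `edas_scores`, the equal default weights, and the sample numbers are illustrative only, and every column average is assumed to be positive.

```python
def edas_scores(matrix, beneficial, weights=None):
    """Classical EDAS appraisal scores in [0, 1], one per row (model).

    matrix[i][j]  : value of criterion j for model i (e.g. MRR on one dataset)
    beneficial[j] : True if higher is better (MRR, Hit@k), False if lower is better (MR)
    """
    n_models, n_crit = len(matrix), len(matrix[0])
    if weights is None:
        weights = [1.0 / n_crit] * n_crit
    # Average solution per criterion (assumed > 0).
    avg = [sum(row[j] for row in matrix) / n_models for j in range(n_crit)]

    sp, sn = [], []  # weighted positive / negative distances from average
    for row in matrix:
        pos = neg = 0.0
        for j in range(n_crit):
            diff = (row[j] - avg[j]) if beneficial[j] else (avg[j] - row[j])
            pos += weights[j] * max(0.0, diff) / avg[j]
            neg += weights[j] * max(0.0, -diff) / avg[j]
        sp.append(pos)
        sn.append(neg)

    max_sp = max(sp) or 1.0  # avoid division by zero when all models tie
    max_sn = max(sn) or 1.0
    return [0.5 * (p / max_sp + (1.0 - q / max_sn)) for p, q in zip(sp, sn)]


# Usage: columns are MRR, Hit@1 (higher is better) and MR (lower is better).
table = [[0.40, 0.30, 100.0],
         [0.30, 0.20, 200.0]]
scores = edas_scores(table, beneficial=[True, True, False])
```

Each model receives one score between 0 and 1, so a model that is above average on every criterion ranks above one that is below average on every criterion.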

[26] A Survey on Large Language Model Benchmarks

Shiwen Ni,Guhong Chen,Shuaimin Li,Xuanang Chen,Siyi Li,Bingli Wang,Qiyao Wang,Xingjian Wang,Yifan Zhang,Liyang Fan,Chengming Li,Ruifeng Xu,Le Sun,Min Yang

Main category: cs.CL

TL;DR: This paper reviews and classifies existing benchmarks for large language models, highlights their limitations, and proposes a framework for future improvements.

Details Motivation: With the rapid development of large language models, there is a growing need for effective evaluation benchmarks to quantitatively assess model performance and guide future development. Method: The paper systematically reviews and categorizes 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific. Result: The paper identifies problems in current benchmarks, such as inflated scores, unfair evaluations, and insufficient focus on dynamic environments, and provides a framework for the future design of benchmarks. Conclusion: The paper concludes that current large language model benchmarks face issues like data contamination, cultural and linguistic biases, and lack of evaluation on process credibility and dynamic environments. It suggests a design paradigm for future benchmark innovation. Abstract: In recent years, with the rapid development of the depth and breadth of large language models' capabilities, various corresponding evaluation benchmarks have been emerging in increasing numbers. As a quantitative assessment tool for model performance, benchmarks are not only a core means to measure model capabilities but also a key element in guiding the direction of model development and promoting technological innovation. We systematically review the current status and development of large language model benchmarks for the first time, categorizing 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific. General capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning; domain-specific benchmarks focus on fields like natural sciences, humanities and social sciences, and engineering technology; target-specific benchmarks pay attention to risks, reliability, agents, etc. 
We point out that current benchmarks have problems such as inflated scores caused by data contamination, unfair evaluation due to cultural and linguistic biases, and lack of evaluation on process credibility and dynamic environments, and provide a referable design paradigm for future benchmark innovation.

[27] Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation

Yichi Zhang,Yao Huang,Yifan Wang,Yitong Sun,Chang Liu,Zhe Zhao,Zhengwei Fang,Huanran Chen,Xiao Yang,Xingxing Wei,Hang Su,Yinpeng Dong,Jun Zhu

Main category: cs.CL

TL;DR: This paper introduces MultiTrust-X, a benchmark for evaluating the trustworthiness of MLLMs across multiple dimensions. It identifies key vulnerabilities in current models and proposes RESA, a reasoning-based approach that significantly improves trustworthiness.

Details Motivation: Despite progress in MLLM capabilities, their trustworthiness remains a major concern. Existing approaches focus on narrow aspects and overlook multimodal risks, necessitating a more holistic evaluation framework like MultiTrust-X. Method: The authors propose MultiTrust-X, a comprehensive benchmark with a three-dimensional framework that evaluates MLLMs across five trustworthiness aspects, two novel risk types, and multiple mitigation strategies. They conduct extensive experiments on over 30 models and analyze 8 mitigation methods. Result: Experiments revealed significant vulnerabilities in current MLLMs, including risk amplification during multimodal training and inference. Existing mitigation strategies show limited effectiveness and often introduce trade-offs. The proposed RESA method, which incorporates reasoning, achieves state-of-the-art results in improving trustworthiness. Conclusion: The study concludes that current MLLMs have significant trustworthiness vulnerabilities, with a gap between trustworthiness and general capabilities. It highlights the need for better mitigation strategies, especially those that incorporate reasoning, such as the proposed RESA approach. Abstract: The trustworthiness of Multimodal Large Language Models (MLLMs) remains an intense concern despite the significant progress in their capabilities. Existing evaluation and mitigation approaches often focus on narrow aspects and overlook risks introduced by the multimodality. To tackle these challenges, we propose MultiTrust-X, a comprehensive benchmark for evaluating, analyzing, and mitigating the trustworthiness issues of MLLMs. 
We define a three-dimensional framework, encompassing five trustworthiness aspects which include truthfulness, robustness, safety, fairness, and privacy; two novel risk types covering multimodal risks and cross-modal impacts; and various mitigation strategies from the perspectives of data, model architecture, training, and inference algorithms. Based on the taxonomy, MultiTrust-X includes 32 tasks and 28 curated datasets, enabling holistic evaluations over 30 open-source and proprietary MLLMs and in-depth analysis with 8 representative mitigation methods. Our extensive experiments reveal significant vulnerabilities in current models, including a gap between trustworthiness and general capabilities, as well as the amplification of potential risks in base LLMs by both multimodal training and inference. Moreover, our controlled analysis uncovers key limitations in existing mitigation strategies that, while some methods yield improvements in specific aspects, few effectively address overall trustworthiness, and many introduce unexpected trade-offs that compromise model utility. These findings also provide practical insights for future improvements, such as the benefits of reasoning to better balance safety and performance. Based on these insights, we introduce a Reasoning-Enhanced Safety Alignment (RESA) approach that equips the model with chain-of-thought reasoning ability to discover the underlying risks, achieving state-of-the-art results.

[28] Confidence-Modulated Speculative Decoding for Large Language Models

Jaydip Sen,Subhasis Dasgupta,Hetvi Waghela

Main category: cs.CL

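TL;DR: This paper introduces a new speculative decoding method that dynamically adjusts the number of speculatively generated tokens and modulates the verification process, improving the efficiency and adaptability of autoregressive inference.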

Details Motivation: Existing speculative decoding methods rely on fixed drafting lengths and rigid verification criteria, which limits their adaptability across varying model uncertainty and input complexity. Method: Entropy and margin-based uncertainty measures are used to dynamically adjust the number of speculatively generated tokens at each iteration, and the same confidence signals modulate the verification process. Result: Experiments on machine translation and summarization show significant speedups over standard speculative decoding while preserving or improving BLEU and ROUGE scores. Conclusion: The paper proposes an information-theoretic speculative decoding framework whose confidence-modulated drafting mechanism improves the efficiency and adaptability of autoregressive inference. Abstract: Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid verification criteria, limiting their adaptability across varying model uncertainties and input complexities. This paper proposes an information-theoretic framework for speculative decoding based on confidence-modulated drafting. By leveraging entropy and margin-based uncertainty measures over the drafter's output distribution, the proposed method dynamically adjusts the number of speculatively generated tokens at each iteration. This adaptive mechanism reduces rollback frequency, improves resource utilization, and maintains output fidelity. Additionally, the verification process is modulated using the same confidence signals, enabling more flexible acceptance of drafted tokens without sacrificing generation quality. Experiments on machine translation and summarization tasks demonstrate significant speedups over standard speculative decoding while preserving or improving BLEU and ROUGE scores. The proposed approach offers a principled, plug-in method for efficient and robust decoding in large language models under varying conditions of uncertainty.
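
One way to picture confidence-modulated drafting is a function that maps the drafter's next-token distribution to a draft length. The sketch below is a guess at the general shape, not the paper's formula: combining normalized entropy and top-two margin by a simple average, the bounds `min_len` and `max_len`, and the function names are all assumptions for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def margin(probs):
    """Gap between the two most likely tokens."""
    top = sorted(probs, reverse=True)
    return top[0] - (top[1] if len(top) > 1 else 0.0)


def draft_length(probs, min_len=1, max_len=8):
    """Draft more tokens when the drafter is confident, fewer when it is uncertain.

    Assumed confidence: average of (1 - normalized entropy) and the top-two margin.
    """
    max_entropy = math.log(len(probs)) if len(probs) > 1 else 1.0
    confidence = 0.5 * ((1.0 - entropy(probs) / max_entropy) + margin(probs))
    confidence = min(1.0, max(0.0, confidence))  # guard against rounding error
    return min_len + round(confidence * (max_len - min_len))


# Usage: a peaked distribution yields a longer draft than a uniform one.
long_draft = draft_length([0.97, 0.01, 0.01, 0.01])
short_draft = draft_length([0.25, 0.25, 0.25, 0.25])
```

The same confidence value could also loosen or tighten the acceptance threshold during verification, which is the second use of the signal described in the abstract.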

[29] Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training

Woojin Chung,Jeonghoon Kim

Main category: cs.CL

TL;DR: Enlarging the vocabulary lowers the complexity of tokenized text, which helps improve large language model performance.

Details Motivation: Recent practice favors ever-larger vocabularies, but the source of the benefit is unclear. Method: A controlled study scales the language model's vocabulary from 24K to 196K while holding data, compute, and optimization fixed. Result: Larger vocabularies reduce cross-entropy mainly by lowering the complexity of tokenized text, and this training advantage transfers to downstream tasks. Conclusion: Lowering the complexity of tokenized text helps improve large language model performance. Abstract: Large language models are trained with tokenizers, and the resulting token distribution is highly imbalanced: a few words dominate the stream while most occur rarely. Recent practice favors ever-larger vocabularies, but the source of the benefit is unclear. We conduct a controlled study that scales the language model's vocabulary from 24K to 196K while holding data, compute, and optimization fixed. We first quantify the complexity of tokenized text, formalized via Kolmogorov complexity, and show that larger vocabularies reduce this complexity. Above 24K, every common word is already a single token, so further growth mainly deepens the relative token-frequency imbalance. A word-level loss decomposition shows that larger vocabularies reduce cross-entropy almost exclusively by lowering uncertainty on the 2,500 most frequent words, even though loss on the rare tail rises. Constraining input and output embedding norms to attenuate the effect of token-frequency imbalance reverses the gain, directly showing that the model exploits rather than suffers from imbalance. Because the same frequent words cover roughly 77% of tokens in downstream benchmarks, this training advantage transfers intact. We also show that enlarging model parameters with a fixed vocabulary yields the same frequent-word benefit. Our results reframe "bigger vocabularies help" as "lowering the complexity of tokenized text helps," providing a simple, principled lever for tokenizer-model co-design and clarifying the loss dynamics that govern language-model scaling in pre-training.

[30] Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with Large Language Models

Tobias Schreieder,Tim Schopf,Michael Färber

Main category: cs.CL

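TL;DR: This survey systematically analyzes 134 papers on evidence-based text generation with LLMs, introduces a unified taxonomy, and investigates 300 evaluation metrics across seven key dimensions.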

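Details Motivation: Research on evidence-based text generation with LLMs is fragmented due to inconsistent terminology, isolated evaluation practices, and a lack of unified benchmarks. Method: The authors systematically analyze 134 papers, focusing on approaches that use citations, attribution, or quotations for evidence-based text generation. Result: The survey introduces a unified taxonomy of evidence-based text generation with LLMs and investigates 300 evaluation metrics across seven key dimensions. Conclusion: The survey examines the distinctive characteristics and representative methods in the field, highlights open challenges, and outlines promising directions for future work. Abstract: The increasing adoption of large language models (LLMs) has been accompanied by growing concerns regarding their reliability and trustworthiness. As a result, a growing body of research focuses on evidence-based text generation with LLMs, aiming to link model outputs to supporting evidence to ensure traceability and verifiability. However, the field is fragmented due to inconsistent terminology, isolated evaluation practices, and a lack of unified benchmarks. To bridge this gap, we systematically analyze 134 papers, introduce a unified taxonomy of evidence-based text generation with LLMs, and investigate 300 evaluation metrics across seven key dimensions. Thereby, we focus on approaches that use citations, attribution, or quotations for evidence-based text generation. Building on this, we examine the distinctive characteristics and representative methods in the field. Finally, we highlight open challenges and outline promising directions for future work.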

[31] When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models

Cheng Wang,Gelei Deng,Xianglin Yang,Han Qiu,Tianwei Zhang

Main category: cs.CL

TL;DR: This paper introduces MCR-BENCH, a benchmark for evaluating how Large Audio-Language Models (LALMs) handle conflicting audio-text inputs. The study finds that LALMs exhibit a strong bias toward text over audio, leading to performance issues and reliability concerns in real-world scenarios. Mitigation strategies and improvements in modality fusion are recommended.

Details Motivation: The motivation for this research stems from the lack of understanding regarding how Large Audio-Language Models (LALMs) handle conflicting information between audio and text modalities, despite their enhanced audio perception capabilities. This gap raises concerns about their reliability in real-world applications where input inconsistencies are common. Method: The study introduces MCR-BENCH, a benchmark for evaluating how LALMs prioritize information in inconsistent audio-text pairs. It conducts extensive evaluations across diverse audio understanding tasks, investigates factors influencing text bias, explores mitigation strategies through supervised fine-tuning, and analyzes model confidence patterns. Result: The evaluation reveals that LALMs show a strong bias towards textual input when inconsistencies exist between audio and text modalities. This results in significant performance degradation in audio-centric tasks and persistent overconfidence in the models even with contradictory inputs. Conclusion: The study concludes that Large Audio-Language Models (LALMs) exhibit a significant bias towards textual input over audio evidence when faced with inconsistencies, leading to performance degradation in audio-centric tasks and reliability concerns for real-world applications. Improvements in modality balance and fusion mechanisms are necessary to handle conflicting multimodal inputs effectively. Abstract: Large Audio-Language Models (LALMs) are enhanced with audio perception capabilities, enabling them to effectively process and understand multimodal inputs that combine audio and text. However, their performance in handling conflicting information between audio and text modalities remains largely unexamined. This paper introduces MCR-BENCH, the first comprehensive benchmark specifically designed to evaluate how LALMs prioritize information when presented with inconsistent audio-text pairs. 
Through extensive evaluation across diverse audio understanding tasks, we reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input, frequently disregarding audio evidence. This tendency leads to substantial performance degradation in audio-centric tasks and raises important reliability concerns for real-world applications. We further investigate the influencing factors of text bias, and explore mitigation strategies through supervised finetuning, and analyze model confidence patterns that reveal persistent overconfidence even with contradictory inputs. These findings underscore the need for improved modality balance during training and more sophisticated fusion mechanisms to enhance the robustness when handling conflicting multi-modal inputs. The project is available at https://github.com/WangCheng0116/MCR-BENCH.

[32] LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model

Yirong Sun,Yizhong Geng,Peidong Wei,Yanjun Chen,Jinghan Yang,Rongfei Chen,Wei Zhang,Xiaoyu Shen

Main category: cs.CL

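TL;DR: LLaSO is the first fully open, end-to-end framework for large-scale speech-language modeling, providing datasets, a benchmark, and a reference model to promote transparency and reproducibility in the LSLM field.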

Details Motivation: LSLM research has been held back by fragmented architectures and a lack of transparency: model weights are often released without the corresponding training data and configurations. LLaSO aims to address these gaps. Method: LLaSO provides three core resources: LLaSO-Align (a speech-text alignment corpus), LLaSO-Instruct (a multi-task instruction-tuning dataset), and LLaSO-Eval (a reproducible benchmark). In addition, LLaSO-Base is a reference model trained on public data to validate the framework. Result: LLaSO-Base achieves a normalized score of 0.72 in standardized evaluation, establishing a strong, reproducible baseline that surpasses comparable models. The study also finds that although broader training coverage improves performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. Conclusion: LLaSO provides a complete open-source stack of data, benchmarks, and models, laying a foundation for unifying research efforts and accelerating community-driven progress in LSLMs. Abstract: The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, hindering the systematic comparison and reproducibility of research. Unlike in the vision-language domain, the LSLM field suffers from the common practice of releasing model weights without their corresponding training data and configurations. To address these critical gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate our framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis reveals that while broader training coverage enhances performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, dataset, pretrained models, and results in https://github.com/EIT-NLP/LLaSO.

[33] A Study of Privacy-preserving Language Modeling Approaches

Pritilata Saha,Abhirup Sinha

Main category: cs.CL

TL;DR: This research offers a comprehensive analysis of privacy-preserving approaches in language modeling, highlighting their strengths and limitations, contributing insights, and suggesting future research directions.

Details Motivation: Language models often trained on sensitive data can pose privacy risks by memorizing and disclosing information during privacy attacks. Understanding and mitigating these risks is crucial as privacy is a fundamental human right. Method: This study conducts a comprehensive and in-depth review of existing privacy-preserving approaches in language modeling, highlighting their strengths and investigating their limitations. Result: The study provides an overview of privacy-preserving language modeling approaches, offering insights into their effectiveness and limitations. Conclusion: The research contributes to the ongoing efforts in privacy-preserving language modeling by offering valuable insights and identifying future research directions. Abstract: Recent developments in language modeling have increased their use in various applications and domains. Language models, often trained on sensitive data, can memorize and disclose this information during privacy attacks, raising concerns about protecting individuals' privacy rights. Preserving privacy in language models has become a crucial area of research, as privacy is one of the fundamental human rights. Despite its significance, understanding of how much privacy risk these language models possess and how it can be mitigated is still limited. This research addresses this by providing a comprehensive study of the privacy-preserving language modeling approaches. This study gives an in-depth overview of these approaches, highlights their strengths, and investigates their limitations. The outcomes of this study contribute to the ongoing research on privacy-preserving language modeling, providing valuable insights and outlining future research directions.

[34] M-HELP: Using Social Media Data to Detect Mental Health Help-Seeking Signals

MSVPJ Sathvik,Zuhair Hasan Shaik,Vivek Gupta

Main category: cs.CL

TL;DR: This paper presents M-Help, a new mental health dataset for identifying help-seeking behavior on social media.

Details Motivation: Mental health disorders are a global crisis, yet datasets for identifying individuals who are actively seeking help are still lacking. Method: The authors develop a new dataset, M-Help, and train AI models on it to identify help-seekers, diagnose mental health conditions, and uncover root causes. Result: AI models trained on M-Help can address three tasks: identifying help-seekers, diagnosing mental health conditions, and uncovering underlying causes. Conclusion: M-Help is a new dataset that helps identify help-seeking behavior on social media and can be used to train AI models that diagnose mental health conditions and find the root causes of issues. Abstract: Mental health disorders are a global crisis. While various datasets exist for detecting such disorders, there remains a critical gap in identifying individuals actively seeking help. This paper introduces a novel dataset, M-Help, specifically designed to detect help-seeking behavior on social media. The dataset goes beyond traditional labels by identifying not only help-seeking activity but also specific mental health disorders and their underlying causes, such as relationship challenges or financial stressors. AI models trained on M-Help can address three key tasks: identifying help-seekers, diagnosing mental health conditions, and uncovering the root causes of issues.

[35] Principle Methods of Rendering Non-equivalent Words from Uzbek and Dari to Russian and English

Mohammad Ibrahim Qani

Main category: cs.CL

TL;DR: This research explores methods for translating non-equivalent words, such as those related to culture and tradition, to reduce misunderstandings between languages, with 25 words successfully rendered as examples.

Details Motivation: The motivation behind this research is to address misunderstandings caused by non-equivalent words in translation, aiming to develop methods for accurately rendering such words into the target language. Method: The research was conducted using library-based research methods to explore the rendering of non-equivalent words from the source language to the target language. Result: The study identified various methods and rules for rendering non-equivalent words and successfully rendered 25 non-equivalent words from Dari and Uzbek into English and Russian. Conclusion: The research concludes that rendering non-equivalent words requires specific methods and rules to ensure accurate translation, highlighting the importance of professional translation in bridging language gaps. Abstract: These pure languages understanding directly relates to translation knowledge where linguists and translators need to work and research to eradicate misunderstanding. Misunderstandings mostly appear in non-equivalent words because there are different local and internal words like food, garment, cultural and traditional words and others in every notion. Truly, most of these words do not have equivalent in the target language and these words need to be worked and find their equivalent in the target language to fully understand the both languages. The purpose of this research is to introduce the methods of rendering non-equivalent words professionally from the source language to the target language and this research has been completed using library-based research. However, some of these non-equivalent words are already professionally rendered to the target language but still there are many other words to be rendered. As a result, this research paper includes different ways and rules of rendering non-equivalent words from source language to the target language and 25 non-equivalent words have been rendered from Dari & Uzbek into English and Russian languages.

[36] PyTOD: Programmable Task-Oriented Dialogue with Execution Feedback

Alexandru Coca,Bo-Hsiang Tseng,Pete Boothroyd,Jianpeng Cheng,Mark Gaynor,Zhenxing Zhang,Joe Stacey,Tristan Guigue,Héctor Martinez Alonso,Diarmuid Ó Séaghdha,Anders Johannsen

Main category: cs.CL

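TL;DR: PyTOD improves the state tracking performance of task-oriented dialogue agents by generating executable code and using policy and execution feedback.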

Details Motivation: The effectiveness of programmable task-oriented dialogue (TOD) agents depends on accurate state tracking, and existing approaches struggle with accuracy. Method: PyTOD employs a simple constrained decoding approach that uses a language model instead of grammar rules to follow API schemata. Result: PyTOD achieves state-of-the-art state tracking performance on the challenging SGD benchmark and surpasses strong baselines in both accuracy and robustness. Conclusion: PyTOD is a new TOD agent that tracks dialogue state by generating executable code and uses policy and execution feedback for efficient error correction, achieving state-of-the-art state tracking performance. Abstract: Programmable task-oriented dialogue (TOD) agents enable language models to follow structured dialogue policies, but their effectiveness hinges on accurate state tracking. We present PyTOD, an agent that generates executable code to track dialogue state and uses policy and execution feedback for efficient error correction. To this end, PyTOD employs a simple constrained decoding approach, using a language model instead of grammar rules to follow API schemata. This leads to state-of-the-art state tracking performance on the challenging SGD benchmark. Our experiments show that PyTOD surpasses strong baselines in both accuracy and robust user goal estimation as the dialogue progresses, demonstrating the effectiveness of execution-aware state tracking.

[37] RadReason: Radiology Report Evaluation Metric with Reasons and Sub-Scores

Yingshu Li,Yunyi Liu,Lingqiao Liu,Lei Wang,Luping Zhou

Main category: cs.CL

TL;DR: RadReason is a new evaluation framework for radiology reports that provides fine-grained scores and human-readable explanations.

Details Motivation: Existing automatic evaluation methods for radiology reports either produce coarse overall scores or rely on opaque black-box models, which limits their usefulness in real-world clinical workflows. Method: RadReason builds on Group Relative Policy Optimization and introduces two innovations: Sub-score Dynamic Weighting and Majority-Guided Advantage Scaling. Result: RadReason not only outputs fine-grained sub-scores across six clinically defined error types, but also produces human-readable justifications that explain the rationale behind each score. Conclusion: RadReason is an explainable, cost-efficient evaluation framework for radiology reports that surpasses all prior offline metrics and achieves parity with GPT-4-based evaluations. Abstract: Evaluating automatically generated radiology reports remains a fundamental challenge due to the lack of clinically grounded, interpretable, and fine-grained metrics. Existing methods either produce coarse overall scores or rely on opaque black-box models, limiting their usefulness in real-world clinical workflows. We introduce RadReason, a novel evaluation framework for radiology reports that not only outputs fine-grained sub-scores across six clinically defined error types, but also produces human-readable justifications that explain the rationale behind each score. Our method builds on Group Relative Policy Optimization and incorporates two key innovations: (1) Sub-score Dynamic Weighting, which adaptively prioritizes clinically challenging error types based on live F1 statistics; and (2) Majority-Guided Advantage Scaling, which adjusts policy gradient updates based on prompt difficulty derived from sub-score agreement. Together, these components enable more stable optimization and better alignment with expert clinical judgment. Experiments on the ReXVal benchmark show that RadReason surpasses all prior offline metrics and achieves parity with GPT-4-based evaluations, while remaining explainable, cost-efficient, and suitable for clinical deployment. Code will be released upon publication.

[38] SLM4Offer: Personalized Marketing Offer Generation Using Contrastive Learning Based Fine-Tuning

Vedasamhitha Challapalli,Konduru Venkat Sai,Piyush Pratap Singh,Rupesh Prasad,Arvind Maurya,Atul Singh

Main category: cs.CL

TL;DR: This paper introduces SLM4Offer, a generative AI model for personalized offer generation that uses contrastive learning to improve offer acceptance rates by 17 percent over traditional methods.

Details Motivation: Personalized marketing has emerged as a pivotal strategy for enhancing customer engagement and driving business growth, with prior studies suggesting that effective personalization strategies can boost revenue by up to 40 percent. Method: The study introduces SLM4Offer, which is developed by fine-tuning a pre-trained encoder-decoder language model (Google's T5-Small) using a contrastive learning approach with InfoNCE loss to align customer personas with relevant offers in a shared embedding space. Result: Experimental results show a 17 percent improvement in offer acceptance rate over a supervised fine-tuning baseline, highlighting the effectiveness of contrastive objectives in advancing personalized marketing. Conclusion: The study concludes that SLM4Offer, a generative AI model for personalized offer generation using contrastive learning, significantly improves offer acceptance rates compared to a supervised fine-tuning baseline, demonstrating the effectiveness of contrastive objectives in personalized marketing. Abstract: Personalized marketing has emerged as a pivotal strategy for enhancing customer engagement and driving business growth. Academic and industry efforts have predominantly focused on recommendation systems and personalized advertisements. Nonetheless, this facet of personalization holds significant potential for increasing conversion rates and improving customer satisfaction. Prior studies suggest that well-executed personalization strategies can boost revenue by up to 40 percent, underscoring the strategic importance of developing intelligent, data-driven approaches for offer generation. This work introduces SLM4Offer, a generative AI model for personalized offer generation, developed by fine-tuning a pre-trained encoder-decoder language model, specifically Google's Text-to-Text Transfer Transformer (T5-Small 60M) using a contrastive learning approach. 
SLM4Offer employs InfoNCE (Information Noise-Contrastive Estimation) loss to align customer personas with relevant offers in a shared embedding space. A key innovation in SLM4Offer lies in the adaptive learning behaviour introduced by contrastive loss, which reshapes the latent space during training and enhances the model's generalizability. The model is fine-tuned and evaluated on a synthetic dataset designed to simulate customer behaviour and offer acceptance patterns. Experimental results demonstrate a 17 percent improvement in offer acceptance rate over a supervised fine-tuning baseline, highlighting the effectiveness of contrastive objectives in advancing personalized marketing.
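
The InfoNCE objective mentioned above can be sketched in a few lines of plain Python. This is a generic illustration of the loss rather than the authors' training code: the function names (`cosine`, `info_nce`), the use of cosine similarity, the in-batch negatives, and the temperature of 0.1 are assumptions, and the toy two-dimensional vectors stand in for persona and offer embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(personas, offers, temperature=0.1):
    """Mean InfoNCE loss where offers[i] is the positive for personas[i].

    All other offers in the batch act as negatives (an assumption; the
    paper's negative sampling scheme is not described in the abstract).
    """
    losses = []
    for i, persona in enumerate(personas):
        logits = [cosine(persona, offer) / temperature for offer in offers]
        # Numerically stable log-sum-exp over all candidate offers.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


personas = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(personas, [[1.0, 0.0], [0.0, 1.0]])     # positives match
mismatched = info_nce(personas, [[0.0, 1.0], [1.0, 0.0]])  # positives swapped
print(aligned < mismatched)
```

The loss is small when each persona is closest to its own offer and large when the pairing is wrong, which is the pressure that pulls matching persona and offer embeddings together in the shared space.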

[39] Subjective Behaviors and Preferences in LLM: Language of Browsing

Sai Sundaresan,Harshita Chopra,Atanu R. Sinha,Koustava Goswami,Nagasai Saketh Naidu,Raghav Karan,N Anushka

Main category: cs.CL

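TL;DR: This paper proposes HeTLM, a heterogeneity-aware training method for user browsing behavior, and shows that small models may outperform large models on personalization tasks.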

Details Motivation: The paper questions whether current large language models can effectively handle users' subjective, heterogeneous behaviors and preferences, especially the personalized "language" expressed in browsing behavior. Method: The paper introduces a page-level tokenizer for user browsing behavior and a heterogeneity-aware language model training method, HeTLM, and compares them against conventional large models. Result: A small model trained with the page-level tokenizer outperforms large pretrained or finetuned models; HeTLM outperforms a single model when controlling for the number of parameters; and generation shows a higher mean and lower variance, indicating better alignment. Conclusion: Small language models may perform better than large models on browsing behavior data, and the heterogeneity-aware training method HeTLM better captures users' heterogeneous preferences and behaviors. Abstract: A Large Language Model (LLM) offers versatility across domains and tasks, purportedly benefiting users with a wide variety of behaviors and preferences. We question this perception about an LLM when users have inherently subjective behaviors and preferences, as seen in their ubiquitous and idiosyncratic browsing of websites or apps. The sequential behavior logs of pages, thus generated, form something akin to each user's self-constructed "language", albeit without the structure and grammar imbued in natural languages. We ask: (i) Can a small LM represent the "language of browsing" better than a large LM? (ii) Can an LM with a single set of parameters (or, single LM) adequately capture myriad users' heterogeneous, subjective behaviors and preferences? (iii) Can a single LM with high average performance, yield low variance in performance to make alignment good at user level? We introduce clusterwise LM training, HeTLM (Heterogeneity aware Training of Language Model), appropriate for subjective behaviors. We find that (i) a small LM trained using a page-level tokenizer outperforms large pretrained or finetuned LMs; (ii) HeTLM with heterogeneous cluster specific set of parameters outperforms a single LM of the same family, controlling for the number of parameters; and (iii) a higher mean and a lower variance in generation ensues, implying improved alignment.

[40] Influence-driven Curriculum Learning for Pre-training on Limited Data

Loris Schoenegger,Lukas Thoma,Terra Blevins,Benjamin Roth

Main category: cs.CL

TL;DR: Curriculum learning becomes effective for language model pre-training when using a model-centric difficulty metric like training data influence.

Details Motivation: Curriculum learning has shown limited success in pre-training language models, which motivated the exploration of a more model-centric difficulty metric. Method: The authors experimented with sorting training examples by their training data influence, which estimates the effect of individual examples on the model's output. Result: Models trained on the proposed curricula outperformed those trained in random order by over 10 percentage points in benchmarks. Conclusion: Curriculum learning can be made effective for language model pre-training by using a model-centric difficulty metric such as training data influence. Abstract: Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Specifically, we experiment with sorting training examples by their training data influence, a score which estimates the effect of individual training examples on the model's output. Models trained on our curricula are able to outperform ones trained in random order by over 10 percentage points in benchmarks, confirming that curriculum learning is beneficial for language model pre-training, as long as a more model-centric notion of difficulty is adopted.

[41] SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts -- Extended Version

Nghiem Thanh Pham,Tung Kieu,Duc-Manh Nguyen,Son Ha Xuan,Nghia Duong-Trung,Danh Le-Phuoc

Main category: cs.CL

TL;DR: The paper introduces SLM-Bench, a comprehensive benchmark for evaluating small language models on multiple dimensions like accuracy, efficiency, and sustainability, highlighting diverse trade-offs among models.

Details Motivation: The motivation is to address the lack of systematic evaluation of the performance and environmental impact of small language models, providing a comprehensive benchmark for fair comparison and reproducibility. Method: The authors introduced SLM-Bench, a benchmark for evaluating small language models across multiple dimensions including accuracy, computational efficiency, and sustainability. They evaluated 15 models on 9 NLP tasks using 23 datasets and 4 hardware configurations. Result: The result is the development of SLM-Bench, which evaluates small language models across 11 metrics, revealing diverse trade-offs among models in terms of accuracy and energy efficiency. Conclusion: SLM-Bench sets a new standard for evaluating small language models, highlighting the importance of balancing accuracy and energy efficiency while enhancing reproducibility and fair comparison. Abstract: Small Language Models (SLMs) offer computational efficiency and accessibility, yet a systematic evaluation of their performance and environmental impact remains lacking. We introduce SLM-Bench, the first benchmark specifically designed to assess SLMs across multiple dimensions, including accuracy, computational efficiency, and sustainability metrics. SLM-Bench evaluates 15 SLMs on 9 NLP tasks using 23 datasets spanning 14 domains. The evaluation is conducted on 4 hardware configurations, providing a rigorous comparison of their effectiveness. Unlike prior benchmarks, SLM-Bench quantifies 11 metrics across correctness, computation, and consumption, enabling a holistic assessment of efficiency trade-offs. Our evaluation considers controlled hardware conditions, ensuring fair comparisons across models. We develop an open-source benchmarking pipeline with standardized evaluation protocols to facilitate reproducibility and further research. 
Our findings highlight the diverse trade-offs among SLMs, where some models excel in accuracy while others achieve superior energy efficiency. SLM-Bench sets a new standard for SLM evaluation, bridging the gap between resource efficiency and real-world applicability.

[42] HebID: Detecting Social Identities in Hebrew-language Political Text

Guy Mor-Lan,Naama Rivlin-Angert,Yael R. Kaplan,Tamir Sheafer,Shaul R. Shenhav

Main category: cs.CL

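TL;DR: This paper introduces HebID, a multilabel Hebrew corpus for social identity detection, and evaluates how well different models identify identity expression in Israeli politicians' social media posts and in public survey data.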

Details Motivation: Existing datasets for group and identity detection are predominantly English-centric, single-label, and focused on coarse identity categories. Method: The authors introduce HebID, a multilabel Hebrew corpus, and benchmark multilabel and single-label encoders alongside generative LLMs. Result: Hebrew-tuned LLMs provide the best results, and the study observes differences in identity expression between politicians' posts and the public survey. Conclusion: HebID provides a comprehensive foundation for studying social identities in Hebrew and can serve as a model for similar research in other non-English political contexts. Abstract: Political language is deeply intertwined with social identities. While social identities are often shaped by specific cultural contexts and expressed through particular uses of language, existing datasets for group and identity detection are predominantly English-centric, single-label and focus on coarse identity categories. We introduce HebID, the first multilabel Hebrew corpus for social identity detection: 5,536 sentences from Israeli politicians' Facebook posts (Dec 2018-Apr 2021), manually annotated for twelve nuanced social identities (e.g. Rightist, Ultra-Orthodox, Socially-oriented) grounded by survey data. We benchmark multilabel and single-label encoders alongside 2B-9B-parameter generative LLMs, finding that Hebrew-tuned LLMs provide the best results (macro-$F_1$ = 0.74). We apply our classifier to politicians' Facebook posts and parliamentary speeches, evaluating differences in popularity, temporal trends, clustering patterns, and gender-related variations in identity expression. We utilize identity choices from a national public survey, enabling a comparison between identities portrayed in elite discourse and the public's identity priorities. HebID provides a comprehensive foundation for studying social identities in Hebrew and can serve as a model for similar research in other non-English political contexts.
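
For readers unfamiliar with the macro-$F_1$ figure reported above, the following Python sketch shows how the metric is typically computed in a multilabel setting: one $F_1$ score per label, averaged with equal weight. The function name `macro_f1`, the toy annotations, and the convention of scoring a label with no true or predicted instances as 0 are assumptions for illustration; the identity names are borrowed from the abstract's examples and the data is invented, not taken from HebID.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1 for multilabel predictions.

    y_true and y_pred are lists of label sets, one set per sentence.
    Assumption: a label with no true and no predicted instances scores 0.
    """
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


labels = ["Rightist", "Ultra-Orthodox", "Socially-oriented"]
y_true = [{"Rightist"}, {"Ultra-Orthodox", "Rightist"}, {"Socially-oriented"}]
y_pred = [{"Rightist"}, {"Ultra-Orthodox"}, {"Rightist"}]
score = macro_f1(y_true, y_pred, labels)
print(score)
```

Because every label contributes equally, a rare identity that the classifier misses entirely drags the macro average down as much as a frequent one, which is why the metric suits fine-grained, imbalanced label sets.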

[43] Dream 7B: Diffusion Large Language Models

Jiacheng Ye,Zhihui Xie,Lin Zheng,Jiahui Gao,Zirui Wu,Xin Jiang,Zhenguo Li,Lingpeng Kong

Main category: cs.CL

TL;DR: Dream 7B is an open diffusion large language model that outperforms existing diffusion language models by using discrete diffusion modeling to refine sequences in parallel.

Details Motivation: To overcome the limitations of autoregressive models that generate tokens sequentially, thus introducing Dream 7B which refines sequences in parallel through iterative denoising. Method: The model uses discrete diffusion modeling to refine sequences in parallel through iterative denoising, using training techniques including AR-based LLM initialization and context-adaptive token-level noise rescheduling. Result: Dream 7B consistently outperforms existing diffusion language models on general, mathematical, and coding tasks while demonstrating superior planning abilities and inference flexibility. Conclusion: Dream 7B is the most powerful open diffusion large language model to date and shows superior performance compared to existing diffusion language models. Abstract: We introduce Dream 7B, the most powerful open diffusion large language model to date. Unlike autoregressive (AR) models that generate tokens sequentially, Dream 7B employs discrete diffusion modeling to refine sequences in parallel through iterative denoising. Our model consistently outperforms existing diffusion language models on general, mathematical, and coding tasks. Dream 7B demonstrates superior planning abilities and inference flexibility, including arbitrary-order generation, infilling capabilities, and tunable quality-speed trade-offs. These results are achieved through simple yet effective training techniques, including AR-based LLM initialization and context-adaptive token-level noise rescheduling. We release both Dream-Base and Dream-Instruct to facilitate further research in diffusion-based language modeling.

[44] The Enemy from Within: A Study of Political Delegitimization Discourse in Israeli Political Speech

Naama Rivlin-Angert,Guy Mor-Lan

Main category: cs.CL

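TL;DR: This study presents the first large-scale computational analysis of political delegitimization discourse (PDD), develops models that effectively detect and classify PDD, and characterizes how PDD is distributed across platforms and political groups.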

Details Motivation: The authors conduct the first large-scale computational study of political delegitimization discourse (PDD), aiming to understand its role in democratic discourse. Method: The paper proposes a two-stage classification pipeline combining finetuned encoder models and decoder LLMs, with DictaLM 2.0 as the best-performing model. Result: The best model, DictaLM 2.0, attains an F1 of 0.74 for binary PDD detection and a macro-F1 of 0.67 for classification of PDD characteristics. The study also finds that PDD is more prevalent on social media, is used more by male politicians, is more pronounced among right-leaning actors, and rises markedly during election campaigns and major political events. Conclusion: Automated analysis of political delegitimization discourse (PDD) is feasible and valuable for understanding democratic discourse. Abstract: We present the first large-scale computational study of political delegitimization discourse (PDD), defined as symbolic attacks on the normative validity of political entities. We curate and manually annotate a novel Hebrew-language corpus of 10,410 sentences drawn from Knesset speeches (1993-2023), Facebook posts (2018-2021), and leading news outlets, of which 1,812 instances (17.4%) exhibit PDD and 642 carry additional annotations for intensity, incivility, target type, and affective framing. We introduce a two-stage classification pipeline combining finetuned encoder models and decoder LLMs. Our best model (DictaLM 2.0) attains an F$_1$ of 0.74 for binary PDD detection and a macro-F$_1$ of 0.67 for classification of delegitimization characteristics. Applying this classifier to longitudinal and cross-platform data, we see a marked rise in PDD over three decades, higher prevalence on social media versus parliamentary debate, greater use by male than female politicians, and stronger tendencies among right-leaning actors - with pronounced spikes during election campaigns and major political events. Our findings demonstrate the feasibility and value of automated PDD analysis for understanding democratic discourse.

[45] SafetyFlow: An Agent-Flow System for Automated LLM Safety Benchmarking

Xiangyang Zhu,Yuan Tian,Chunyi Li,Kaiwei Zhang,Wei Sun,Guangtao Zhai

Main category: cs.CL

TL;DR: The paper proposes SafetyFlow, a fully automated system for building LLM safety benchmarks, significantly reducing time and resource costs while generating a comprehensive and effective dataset.

Details Motivation: Existing LLM safety evaluation benchmarks are labor-intensive, time-consuming, and exhibit redundancy and limited difficulty, necessitating an automated solution. Method: The paper introduces SafetyFlow, an agent-flow system that orchestrates seven specialized agents to automate the benchmark construction process, integrating human expertise without manual curation. Result: SafetyFlowBench, the automatically generated dataset, contains 23,446 queries and enables the efficient evaluation of 49 advanced LLMs, demonstrating efficacy and efficiency. Conclusion: SafetyFlow provides a fully automated pipeline for constructing LLM safety benchmarks, significantly reducing time and resource costs while ensuring low redundancy and strong discriminative power in the dataset. Abstract: The rapid proliferation of large language models (LLMs) has intensified the requirement for reliable safety evaluation to uncover model vulnerabilities. To this end, numerous LLM safety evaluation benchmarks are proposed. However, existing benchmarks generally rely on labor-intensive manual curation, which causes excessive time and resource consumption. They also exhibit significant redundancy and limited difficulty. To alleviate these problems, we introduce SafetyFlow, the first agent-flow system designed to automate the construction of LLM safety benchmarks. SafetyFlow can automatically build a comprehensive safety benchmark in only four days without any human intervention by orchestrating seven specialized agents, significantly reducing time and resource cost. Equipped with versatile tools, the agents of SafetyFlow ensure process and cost controllability while integrating human expertise into the automatic pipeline. The final constructed dataset, SafetyFlowBench, contains 23,446 queries with low redundancy and strong discriminative power. Our contribution includes the first fully automated benchmarking pipeline and a comprehensive safety benchmark. 
We evaluate the safety of 49 advanced LLMs on our dataset and conduct extensive experiments to validate our efficacy and efficiency.

[46] Trained Miniatures: Low cost, High Efficacy SLMs for Sales & Marketing

Ishaan Bhola,Mukunda NS,Sravanth Kurmala,Harsh Nandwani,Arihant Jain

Main category: cs.CL

TL;DR: This paper proposes 'Trained Miniatures' or Small Language Models as a cost-effective alternative to Large Language Models for specific high-value applications like sales and marketing outreach.

Details Motivation: Large Language Models (LLMs) carry heavy computational requirements and high costs, which are often infeasible for targeted applications such as sales and marketing outreach. Method: Small Language Models (SLMs) are fine-tuned for specific, high-value applications such as sales and marketing outreach. Result: The paper introduces 'Trained Miniatures', SLMs that generate domain-specific responses similar to those of LLMs at a significantly reduced cost. Conclusion: SLMs fine-tuned for specific applications can generate domain-specific responses at a fraction of the cost of LLMs. Abstract: Large language models (LLMs) excel in text generation; however, these creative elements require heavy computation and are accompanied by a steep cost. Especially for targeted applications such as sales and marketing outreach, these costs are far from feasible. This paper introduces the concept of "Trained Miniatures" - Small Language Models (SLMs) fine-tuned for specific, high-value applications, generating similar domain-specific responses for a fraction of the cost.

[47] SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models

Peng Ding,Wen Sun,Dailin Li,Wei Zou,Jiaming Wang,Jiajun Chen,Shujian Huang

Main category: cs.CL

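TL;DR: SDGO is a reinforcement learning method that uses the model's own discrimination capability to improve the safety of generated content; it is effective and practical, requiring no additional data or external models.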

Details Motivation: Large language models excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content. The authors identify a critical safety inconsistency: LLMs identify harmful requests more effectively as discriminators than they defend against them as generators. Method: SDGO, a reinforcement learning framework that aligns the model's discrimination and generation capabilities through iterative self-improvement. Result: Experiments show that SDGO significantly improves model safety compared with prompt-based and training-based baselines while maintaining helpfulness on general benchmarks. The method is also robust against out-of-distribution (OOD) jailbreaking attacks. Conclusion: By using the model's own discrimination capability as a reward signal, SDGO significantly improves model safety while maintaining helpfulness, without requiring additional annotated data or external models. Abstract: Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content generation. In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators. This insight inspires us to explore aligning the model's inherent discrimination and generation capabilities. To this end, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model's own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement. Our method does not require any additional annotated data or external models during the training phase. Extensive experiments demonstrate that SDGO significantly improves model safety compared to both prompt-based and training-based baselines while maintaining helpfulness on general benchmarks. By aligning LLMs' discrimination and generation capabilities, SDGO brings robust performance against out-of-distribution (OOD) jailbreaking attacks. This alignment achieves tighter coupling between these two capabilities, enabling the model's generation capability to be further enhanced with only a small amount of discriminative samples. Our code and datasets are available at https://github.com/NJUNLP/SDGO.

[48] Benchmarking Computer Science Survey Generation

Weihang Su,Anzhe Xie,Qingyao Ai,Jianming Long,Jiaxin Mao,Ziyi Ye,Yiqun Liu

Main category: cs.CL

TL;DR: The paper introduces SurGE, a benchmark for evaluating automated scientific survey generation, highlighting the ongoing challenges in this field.

Details Motivation: The motivation is the increasing infeasibility of manually creating scientific survey articles due to the rapid growth of academic literature and the lack of standardized benchmarks for automated survey generation. Method: The paper introduces SurGE, a benchmark that includes test instances and a large-scale academic corpus, along with an automated evaluation framework assessing four dimensions of generated surveys. Result: The result shows that even advanced LLM-based approaches find survey generation challenging, emphasizing the complexity of the task. Conclusion: The paper concludes that the SurGE benchmark provides a comprehensive framework for evaluating scientific survey generation and highlights the need for further research in this area. Abstract: Scientific survey articles play a vital role in summarizing research progress, yet their manual creation is becoming increasingly infeasible due to the rapid growth of academic literature. While large language models (LLMs) offer promising capabilities for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To address this gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for evaluating scientific survey generation in the computer science domain. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers that serves as the retrieval pool. In addition, we propose an automated evaluation framework that measures generated surveys across four dimensions: information coverage, referencing accuracy, structural organization, and content quality. Our evaluation of diverse LLM-based approaches shows that survey generation remains highly challenging, even for advanced self-reflection frameworks. 
These findings highlight the complexity of the task and the necessity for continued research. We have open-sourced all the code, data, and models at: https://github.com/oneal2000/SurGE

[49] Position Bias Mitigates Position Bias: Mitigate Position Bias Through Inter-Position Knowledge Distillation

Yifei Wang,Feng Xiong,Yong Wang,Linjing Li,Xiangxiang Chu,Daniel Dajun Zeng

Main category: cs.CL

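TL;DR: This paper proposes Pos2Distill, a framework that mitigates position bias through knowledge distillation, improving overall performance and generalization on long-context tasks.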

Details Motivation: Position bias severely impairs long-context comprehension and processing. Although prior work tries to mitigate it by modifying architectures, significant position bias persists. Method: Pos2Distill, a position-to-position knowledge distillation framework that transfers capabilities from advantageous positions to less favorable ones, narrowing the performance gap. Result: Pos2Distill yields markedly more uniform performance across context positions and significant gains on long-context retrieval and reasoning tasks. Conclusion: Pos2Distill effectively reduces position bias, improves overall performance on long-context retrieval and reasoning tasks, and shows strong cross-task generalization. Abstract: Positional bias (PB), manifesting as non-uniform sensitivity across different contextual locations, significantly impairs long-context comprehension and processing capabilities. While prior work seeks to mitigate PB through modifying the architectures causing its emergence, significant PB still persists. To address PB effectively, we introduce Pos2Distill, a position to position knowledge distillation framework. Pos2Distill transfers the superior capabilities from advantageous positions to less favorable ones, thereby reducing the huge performance gaps. The conceptual principle is to leverage the inherent, position-induced disparity to counteract the PB itself. We identify distinct manifestations of PB under retrieval and reasoning paradigms, thereby designing two specialized instantiations: Pos2Distill-R¹ and Pos2Distill-R² respectively, both grounded in this core principle. By employing the Pos2Distill approach, we achieve enhanced uniformity and significant performance gains across all contextual positions in long-context retrieval and reasoning tasks. Crucially, both specialized systems exhibit strong cross-task generalization mutually, while achieving superior performance on their respective tasks.

[50] Stemming -- The Evolution and Current State with a Focus on Bangla

Abhijit Paul,Mashiat Amin Farin,Sharif Md. Abdullah,Ahmedul Kabir,Zarif Masud,Shebuti Rayana

Main category: cs.CL

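TL;DR: This paper examines the challenges and importance of stemming for Bangla, identifies shortcomings in current research, and suggests directions for future work.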

Details Motivation: Bangla is one of the most widely spoken languages in the world, yet it is digitally under-represented because of limited resources and a lack of annotated datasets. For low-resource, highly inflectional languages like Bangla, stemming is a critical preprocessing step because it reduces the complexity of algorithms and models. Method: A comprehensive survey of stemming approaches, with a focus on Bangla, that identifies significant gaps in the existing literature. Result: The paper finds a significant gap in existing research, highlighting the discontinuity from previous work and the scarcity of accessible implementations for replication. Conclusion: The paper stresses the importance of Bangla stemming and advocates developing robust Bangla stemmers to enhance language analysis and processing. Abstract: Bangla, the seventh most widely spoken language worldwide with 300 million native speakers, faces digital under-representation due to limited resources and lack of annotated datasets. Stemming, a critical preprocessing step in language analysis, is essential for low-resource, highly-inflectional languages like Bangla, because it can reduce the complexity of algorithms and models by significantly reducing the number of words the algorithm needs to consider. This paper conducts a comprehensive survey of stemming approaches, emphasizing the importance of handling morphological variants effectively. While exploring the landscape of Bangla stemming, it becomes evident that there is a significant gap in the existing literature. The paper highlights the discontinuity from previous research and the scarcity of accessible implementations for replication. Furthermore, it critiques the evaluation methodologies, stressing the need for more relevant metrics. In the context of Bangla's rich morphology and diverse dialects, the paper acknowledges the challenges it poses. To address these challenges, the paper suggests directions for Bangla stemmer development. It concludes by advocating for robust Bangla stemmers and continued research in the field to enhance language analysis and processing.

[51] EcomMMMU: Strategic Utilization of Visuals for Robust Multimodal E-Commerce Models

Xinyi Ling,Hanwen Du,Zhihui Zhu,Xia Ning

Main category: cs.CL

TL;DR: This paper introduces EcomMMMU, a large e-commerce multimodal dataset, and proposes SUMEI, a method to strategically utilize multiple images by predicting visual utility, showing that product images do not always improve performance and can sometimes be detrimental.

Details Motivation: The motivation is to investigate whether product images in e-commerce platforms always enhance product understanding or can introduce redundancy or degrade performance, which existing datasets are limited in addressing. Method: The paper introduces EcomMMMU, a large-scale dataset for e-commerce multimodal multitask understanding, and proposes SUMEI, a data-driven approach to predict visual utilities and strategically use multiple images for downstream tasks. Result: The analysis on the EcomMMMU dataset reveals that product images can degrade performance in some cases, indicating that multimodal large language models (MLLMs) may struggle with effectively utilizing visual content. SUMEI demonstrates effectiveness and robustness in experiments. Conclusion: The paper concludes that while e-commerce platforms have rich multimodal data, product images do not consistently enhance performance and can sometimes degrade it. The proposed SUMEI method effectively addresses this issue by strategically utilizing multiple images based on predicted visual utilities. Abstract: E-commerce platforms are rich in multimodal data, featuring a variety of images that depict product details. However, this raises an important question: do these images always enhance product understanding, or can they sometimes introduce redundancy or degrade performance? Existing datasets are limited in both scale and design, making it difficult to systematically examine this question. To this end, we introduce EcomMMMU, an e-commerce multimodal multitask understanding dataset with 406,190 samples and 8,989,510 images. EcomMMMU is comprised of multi-image visual-language data designed with 8 essential tasks and a specialized VSS subset to benchmark the capability of multimodal large language models (MLLMs) to effectively utilize visual content. Analysis on EcomMMMU reveals that product images do not consistently improve performance and can, in some cases, degrade it. 
This indicates that MLLMs may struggle to effectively leverage rich visual content for e-commerce tasks. Building on these insights, we propose SUMEI, a data-driven method that strategically utilizes multiple images via predicting visual utilities before using them for downstream tasks. Comprehensive experiments demonstrate the effectiveness and robustness of SUMEI. The data and code are available through https://anonymous.4open.science/r/submission25.

[52] End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning

Qiaoyu Zheng,Yuze Sun,Chaoyi Wu,Weike Zhao,Pengcheng Qiu,Yongguo Yu,Kun Sun,Yanfeng Wang,Ya Zhang,Weidi Xie

Main category: cs.CL

TL;DR: Deep-DxSearch is an end-to-end agentic RAG system trained with reinforcement learning to improve the accuracy of medical diagnosis.

Details Motivation: Existing retrieval- and tool-augmented methods are limited by weak use of external knowledge and poor feedback-reasoning traceability. Method: The authors construct a large-scale medical retrieval corpus and treat the LLM as the core agent, training it end-to-end with reinforcement learning under tailored rewards. Result: Deep-DxSearch outperforms prompt-engineering and training-free RAG approaches across multiple data centers and achieves substantial accuracy gains for both common and rare disease diagnosis. Conclusion: Deep-DxSearch achieves higher diagnostic accuracy, and ablation studies on reward design and retrieval corpus components underscore its uniqueness and effectiveness. Abstract: Accurate diagnosis with medical large language models is hindered by knowledge gaps and hallucinations. Retrieval and tool-augmented methods help, but their impact is limited by weak use of external knowledge and poor feedback-reasoning traceability. To address these challenges, we introduce Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement learning (RL) that enables steerable, traceable retrieval-augmented reasoning for medical diagnosis. In Deep-DxSearch, we first construct a large-scale medical retrieval corpus comprising patient records and reliable medical knowledge sources to support retrieval-aware reasoning across diagnostic scenarios. More crucially, we frame the LLM as the core agent and the retrieval corpus as its environment, using tailored rewards on format, retrieval, reasoning structure, and diagnostic accuracy, thereby evolving the agentic RAG policy from large-scale data through RL. Experiments demonstrate that our end-to-end agentic RL training framework consistently outperforms prompt-engineering and training-free RAG approaches across multiple data centers. After training, Deep-DxSearch achieves substantial gains in diagnostic accuracy, surpassing strong diagnostic baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks for both common and rare disease diagnosis under in-distribution and out-of-distribution settings. Moreover, ablation studies on reward design and retrieval corpus components confirm their critical roles, underscoring the uniqueness and effectiveness of our approach compared with traditional implementations. Finally, case studies and interpretability analyses highlight improvements in Deep-DxSearch's diagnostic policy, providing deeper insight into its performance gains and supporting clinicians in delivering more reliable and precise preliminary diagnoses. See https://github.com/MAGIC-AI4Med/Deep-DxSearch.

[53] Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis

Yufeng Zhao,Junnan Liu,Hongwei Liu,Dongsheng Zhu,Yuan Shen,Songyang Zhang,Kai Chen

Main category: cs.CL

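TL;DR: The paper introduces ReasonZoo to evaluate the effectiveness of Tool-Integrated Reasoning (TIR) across diverse reasoning tasks and proposes two new metrics for reasoning efficiency. It finds that TIR not only improves models' reasoning ability but also reduces overthinking, making reasoning more efficient.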

Details Motivation: Although Large Language Models (LLMs) have made progress on reasoning tasks, they still fall short on tasks requiring precise computation. Tool-Integrated Reasoning (TIR) is seen as a potential solution, but its generalization and its actual effect on the model's reasoning behavior remain unclear. Method: The paper introduces ReasonZoo, a comprehensive benchmark covering nine reasoning categories, and proposes two new metrics, Performance-Aware Cost (PAC) and Area Under the Performance-Cost Curve (AUC-PCC), to assess reasoning efficiency. Result: Experiments show that TIR-enabled models outperform their non-TIR counterparts on both mathematical and non-mathematical tasks. TIR also improves reasoning efficiency: the PAC and AUC-PCC metrics indicate reduced overthinking and more streamlined reasoning. Conclusion: TIR brings broad benefits for improving LLM reasoning and helps models carry out complex reasoning tasks more efficiently. Abstract: Large Language Models (LLMs) have made significant strides in reasoning tasks through methods like chain-of-thought (CoT) reasoning. However, they often fall short in tasks requiring precise computations. Tool-Integrated Reasoning (TIR) has emerged as a solution by incorporating external tools into the reasoning process. Nevertheless, the generalization of TIR in improving the reasoning ability of LLM is still unclear. Additionally, whether TIR has improved the model's reasoning behavior and helped the model think remains to be studied. We introduce ReasonZoo, a comprehensive benchmark encompassing nine diverse reasoning categories, to evaluate the effectiveness of TIR across various domains. Additionally, we propose two novel metrics, Performance-Aware Cost (PAC) and Area Under the Performance-Cost Curve (AUC-PCC), to assess reasoning efficiency. Our empirical evaluation demonstrates that TIR-enabled models consistently outperform their non-TIR counterparts in both mathematical and non-mathematical tasks. Furthermore, TIR enhances reasoning efficiency, as evidenced by improved PAC and AUC-PCC, indicating reduced overthinking and more streamlined reasoning. These findings underscore the domain-general benefits of TIR and its potential to advance LLM capabilities in complex reasoning tasks.

[54] LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Ming Yin,Dinghan Shen,Silei Xu,Jianbing Han,Sixun Dong,Mian Zhang,Yebowen Hu,Shujian Liu,Simin Ma,Song Wang,Sathish Reddy Indurthi,Xun Wang,Yiran Chen,Kaiqiang Song

Main category: cs.CL

TL;DR: LiveMCP-101 is a new benchmark for evaluating how well AI agents coordinate multiple MCP tools to solve complex tasks in realistic environments.

Details Motivation: Although the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents use diverse MCP tools to solve multi-step tasks in realistic, dynamic scenarios. Method: A benchmark of 101 carefully curated real-world queries, paired with a new evaluation approach that uses ground-truth execution plans rather than raw API outputs. Result: Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Conclusion: LiveMCP-101 is a benchmark for evaluating the real-world tool-use capabilities of AI agents, intended to advance autonomous AI systems that reliably execute complex tasks through tool use. Abstract: Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.

cs.CV [Back]

[55] Heatmap Regression without Soft-Argmax for Facial Landmark Detection

Chiao-An Yang,Raymond A. Yeh

Main category: cs.CV

TL;DR: This paper proposes a new facial landmark detection method that does not use Soft-argmax, trains faster, and achieves strong performance.

Details Motivation: The paper revisits the long-standing use of Soft-argmax and asks whether other approaches can achieve strong performance. Method: An alternative training objective based on the classic structured prediction framework, replacing the conventional Soft-argmax. Result: The method achieves state-of-the-art performance on three facial landmark benchmarks (WFLW, COFW, and 300W), converging 2.2x faster during training while maintaining better or competitive accuracy. Conclusion: The paper proposes a facial landmark detection method that does not rely on Soft-argmax and achieves state-of-the-art performance on multiple benchmarks while training faster. Abstract: Facial landmark detection is an important task in computer vision with numerous applications, such as head pose estimation, expression analysis, face swapping, etc. Heatmap regression-based methods have been widely used to achieve state-of-the-art results in this task. These methods involve computing the argmax over the heatmaps to predict a landmark. Since argmax is not differentiable, these methods use a differentiable approximation, Soft-argmax, to enable end-to-end training on deep-nets. In this work, we revisit this long-standing choice of using Soft-argmax and demonstrate that it is not the only way to achieve strong performance. Instead, we propose an alternative training objective based on the classic structured prediction framework. Empirically, our method achieves state-of-the-art performance on three facial landmark benchmarks (WFLW, COFW, and 300W), converging 2.2x faster during training while maintaining better/competitive accuracy. Our code is available here: https://github.com/ca-joe-yang/regression-without-softarg.
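For readers unfamiliar with the baseline this entry moves away from, the following is a minimal sketch of argmax versus Soft-argmax on a 2D heatmap. It illustrates only the standard technique the abstract describes, not the paper's proposed structured-prediction objective; the function names and the temperature parameter `beta` are assumptions made for this example.

```python
import math


def hard_argmax_2d(heatmap):
    """Return (x, y) of the largest heatmap value; this operation is not differentiable."""
    _, y, x = max((v, i, j) for i, row in enumerate(heatmap) for j, v in enumerate(row))
    return x, y


def soft_argmax_2d(heatmap, beta=1.0):
    """Softmax-weighted expected (x, y) coordinate; smooth in the heatmap values.

    beta is a hypothetical temperature: larger values sharpen the softmax
    and bring the result closer to the hard argmax.
    """
    peak = max(max(row) for row in heatmap)  # subtract the peak for numerical stability
    weights = [[math.exp(beta * (v - peak)) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    x = sum(w * j for row in weights for j, w in enumerate(row)) / total
    y = sum(w * i for i, row in enumerate(weights) for w in row) / total
    return x, y


heatmap = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 10.0],
    [0.0, 0.0, 0.0],
]
print(hard_argmax_2d(heatmap))            # integer pixel location of the peak
print(soft_argmax_2d(heatmap, beta=5.0))  # continuous estimate near the peak
```

Because the soft version is a weighted average, gradients flow to every heatmap cell, which is what makes end-to-end training possible in the heatmap-regression methods the abstract mentions.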

[56] Fast Graph Neural Network for Image Classification

Mustafa Mohammadi Gharasuie,Luis Rueda

Main category: cs.CV

TL;DR: This study introduces a novel image classification technique combining Graph Convolutional Networks (GCNs) with Voronoi diagrams, achieving improved accuracy and efficiency over existing methods, especially for complex and fine-grained images.

Details Motivation: The motivation is to improve image classification by overcoming the limitations of conventional convolutional neural networks (CNNs) through the use of GCNs and Voronoi diagrams, particularly for complex and fine-grained data. Method: The study introduces an approach that represents images as graphs, with pixels or regions as vertices, which are then refined using Delaunay triangulations. This method leverages the strengths of GCNs and Voronoi diagrams to enhance classification accuracy. Result: The proposed model achieves significant improvements in preprocessing efficiency and classification accuracy across benchmark datasets, outperforming state-of-the-art approaches in challenging scenarios. Conclusion: This research presents a novel method for image classification by integrating Graph Convolutional Networks (GCNs) with Voronoi diagrams, offering a new perspective and expanding the potential of graph-based learning in computer vision and unstructured data analysis. Abstract: The rapid progress in image classification has been largely driven by the adoption of Graph Convolutional Networks (GCNs), which offer a robust framework for handling complex data structures. This study introduces a novel approach that integrates GCNs with Voronoi diagrams to enhance image classification by leveraging their ability to effectively model relational data. Unlike conventional convolutional neural networks (CNNs), our method represents images as graphs, where pixels or regions function as vertices. These graphs are then refined using corresponding Delaunay triangulations, optimizing their representation. The proposed model achieves significant improvements in both preprocessing efficiency and classification accuracy across various benchmark datasets, surpassing state-of-the-art approaches, particularly in challenging scenarios involving intricate scenes and fine-grained categories. 
Experimental results, validated through cross-validation, underscore the effectiveness of combining GCNs with Voronoi diagrams for advancing image classification. This research not only presents a novel perspective on image classification but also expands the potential applications of graph-based learning paradigms in computer vision and unstructured data analysis.

[57] You Only Pose Once: A Minimalist's Detection Transformer for Monocular RGB Category-level 9D Multi-Object Pose Estimation

Hakjin Lee,Junghoon Seo,Jaehoon Sim

Main category: cs.CV

TL;DR: YOPO is a new single-stage method that accurately recovers the full 9-DoF pose of unseen instances from a single RGB image without any additional data.

Details Motivation: There is a need for a simpler, RGB-only alternative that learns directly at the category level. Method: YOPO, a single-stage, query-based framework that treats category-level 9-DoF estimation as a natural extension of 2D detection. Result: On the REAL275 dataset, YOPO achieves 79.6% IoU50 and 54.1% under the 10°10cm metric. Conclusion: YOPO unifies object detection and 9-DoF pose estimation without additional data and sets a new state of the art on three benchmarks. Abstract: Accurately recovering the full 9-DoF pose of unseen instances within specific categories from a single RGB image remains a core challenge for robotics and automation. Most existing solutions still rely on pseudo-depth, CAD models, or multi-stage cascades that separate 2D detection from pose estimation. Motivated by the need for a simpler, RGB-only alternative that learns directly at the category level, we revisit a longstanding question: Can object detection and 9-DoF pose estimation be unified with high performance, without any additional data? We show that they can with our method, YOPO, a single-stage, query-based framework that treats category-level 9-DoF estimation as a natural extension of 2D detection. YOPO augments a transformer detector with a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost. The model is trained end-to-end only with RGB images and category-level pose labels. Despite its minimalist design, YOPO sets a new state of the art on three benchmarks. On the REAL275 dataset, it achieves 79.6% $\rm{IoU}_{50}$ and 54.1% under the $10^\circ$$10{\rm{cm}}$ metric, surpassing prior RGB-only methods and closing much of the gap to RGB-D systems. The code, models, and additional qualitative results can be found on our project.

[58] Paired-Sampling Contrastive Framework for Joint Physical-Digital Face Attack Detection

Andrei Balykin,Anvar Ganiev,Denis Kondranin,Kirill Polevoda,Nikolai Liudkevich,Artem Petrov

Main category: cs.CV

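TL;DR: This paper proposes a unified face anti-spoofing training framework that applies contrastive learning to automatically matched pairs of genuine and attack selfies, lowering the average classification error rate while remaining lightweight and fast to train.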

Details Motivation: Traditional approaches handle physical presentation attacks and digital forgeries separately, which increases system complexity and leaves systems exposed to combined attacks. Method: A paired-sampling contrastive framework that uses automatically matched pairs of genuine and attack selfies to learn modality-agnostic liveness cues. Result: On the 6th Face Anti-Spoofing Challenge Unified Physical-Digital Attack Detection benchmark, the method achieves an average classification error rate (ACER) of 2.10%, outperforming prior solutions. Conclusion: The paper presents a unified face anti-spoofing method that handles both physical and digital attack vectors well while being lightweight and efficient to train. Abstract: Modern face recognition systems remain vulnerable to spoofing attempts, including both physical presentation attacks and digital forgeries. Traditionally, these two attack vectors have been handled by separate models, each targeting its own artifacts and modalities. However, maintaining distinct detectors increases system complexity and inference latency and leaves systems exposed to combined attack vectors. We propose the Paired-Sampling Contrastive Framework, a unified training approach that leverages automatically matched pairs of genuine and attack selfies to learn modality-agnostic liveness cues. Evaluated on the 6th Face Anti-Spoofing Challenge Unified Physical-Digital Attack Detection benchmark, our method achieves an average classification error rate (ACER) of 2.10 percent, outperforming prior solutions. The framework is lightweight (4.46 GFLOPs) and trains in under one hour, making it practical for real-world deployment. Code and pretrained models are available at https://github.com/xPONYx/iccv2025_deepfake_challenge.
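The headline number in this entry is ACER. A minimal sketch of how it is commonly computed follows, assuming the widely used definition ACER = (APCER + BPCER) / 2 with a single pooled attack class; the challenge's exact protocol (for example thresholds or per-attack-type handling) is not stated in the abstract, and the function name and label convention are hypothetical.

```python
def acer(labels, predictions):
    """Average classification error rate under the assumed pooled definition.

    Convention assumed here: 1 = attack, 0 = genuine (bona fide).
    APCER: fraction of attacks wrongly accepted as genuine.
    BPCER: fraction of genuine samples wrongly rejected as attacks.
    """
    attack_preds = [p for l, p in zip(labels, predictions) if l == 1]
    genuine_preds = [p for l, p in zip(labels, predictions) if l == 0]
    apcer = sum(p == 0 for p in attack_preds) / len(attack_preds)
    bpcer = sum(p == 1 for p in genuine_preds) / len(genuine_preds)
    return (apcer + bpcer) / 2


# Toy data (hypothetical): four attacks, one of which slips through, and four genuine selfies.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 0, 0, 0]
print(acer(labels, predictions))  # APCER = 0.25, BPCER = 0.0
```

Averaging the two error rates keeps a detector from scoring well by simply rejecting everything or accepting everything.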

[59] TAIGen: Training-Free Adversarial Image Generation via Diffusion Models

Susim Roy,Anubhooti Jain,Mayank Vatsa,Richa Singh

Main category: cs.CV

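TL;DR: This paper proposes TAIGen, a new adversarial image generation method that uses diffusion models and injects perturbations during the mixing step interval to produce adversarial examples efficiently and with high quality.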

Details Motivation: Existing adversarial attacks typically require substantial computational resources and produce low-quality images, while diffusion models can generate high-quality images but usually need hundreds of sampling steps. Method: TAIGen uses an unconditional diffusion model, injects perturbations during the mixing step interval, and applies a selective RGB channel strategy. Result: On ImageNet with VGGNet as source, TAIGen achieves attack success rates of 70.6% against ResNet, 80.8% against MNASNet, and 97.8% against ShuffleNet, while generating adversarial examples 10x faster than existing diffusion-based attacks. Conclusion: TAIGen is an efficient adversarial image generation method that significantly raises attack success rates while preserving image quality, and it is 10x faster than existing diffusion-based attacks. Abstract: Adversarial attacks from generative models often produce low-quality images and require substantial computational resources. Diffusion models, though capable of high-quality generation, typically need hundreds of sampling steps for adversarial generation. This paper introduces TAIGen, a training-free black-box method for efficient adversarial image generation. TAIGen produces adversarial examples using only 3-20 sampling steps from unconditional diffusion models. Our key finding is that perturbations injected during the mixing step interval achieve comparable attack effectiveness without processing all timesteps. We develop a selective RGB channel strategy that applies attention maps to the red channel while using GradCAM-guided perturbations on green and blue channels. This design preserves image structure while maximizing misclassification in target models. TAIGen maintains visual quality with PSNR above 30 dB across all tested datasets. On ImageNet with VGGNet as source, TAIGen achieves 70.6% success against ResNet, 80.8% against MNASNet, and 97.8% against ShuffleNet. The method generates adversarial examples 10x faster than existing diffusion-based attacks. Our method achieves the lowest robust accuracy, indicating it is the most impactful attack as the defense mechanism is least successful in purifying the images generated by TAIGen.
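The abstract uses PSNR above 30 dB as its visual-quality criterion. As a small worked example, the sketch below computes PSNR from the standard formula 10 · log10(MAX² / MSE) on flat lists of 8-bit pixel values; the function name and the toy pixel values are hypothetical and unrelated to the paper's data.

```python
import math


def psnr(original, perturbed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(original, perturbed)) / len(original)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy data (hypothetical): two of four pixels are shifted by 10 intensity levels.
clean = [100, 100, 100, 100]
adversarial = [110, 90, 100, 100]
print(psnr(clean, adversarial))  # MSE = 50, which gives a value a little above 31 dB
```

PSNR is logarithmic in the mean squared error, so halving the perturbation amplitude raises it by roughly 6 dB.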

[60] Reversible Unfolding Network for Concealed Visual Perception with Generative Refinement

Chunming He,Fengyang Xiao,Rihan Zhang,Chengyu Fang,Deng-Ping Fan,Sina Farsiu

Main category: cs.CV

TL;DR: RUN++ is a novel approach to concealed visual perception that combines reversible modeling in both mask and RGB domains with diffusion refinement, enhancing accuracy and robustness in challenging scenarios.

Details Motivation: Existing CVP methods primarily focus on the mask domain using reversible strategies, leaving the RGB domain underexplored. This work aims to bridge that gap and improve performance by combining both domains and incorporating diffusion-based refinement. Method: The method formulates the CVP task as a mathematical optimization problem, unfolding its iterative solution into a multi-stage deep network. It integrates three modules: CORE for mask domain modeling, CARE for RGB domain enhancement, and FINE with a Bernoulli diffusion model for refinement. Result: RUN++ achieves improved foreground-background separation and detail restoration in uncertain regions, significantly reducing false positives and negatives while maintaining computational efficiency. Conclusion: RUN++ introduces a novel reversible unfolding network with generative refinement that effectively enhances concealed visual perception by synergizing mask and RGB domain strategies and leveraging a diffusion model for precise detail restoration. Abstract: Existing methods for concealed visual perception (CVP) often leverage reversible strategies to decrease uncertainty, yet these are typically confined to the mask domain, leaving the potential of the RGB domain underexplored. To address this, we propose a reversible unfolding network with generative refinement, termed RUN++. Specifically, RUN++ first formulates the CVP task as a mathematical optimization problem and unfolds the iterative solution into a multi-stage deep network. This approach provides a principled way to apply reversible modeling across both mask and RGB domains while leveraging a diffusion model to resolve the resulting uncertainty. 
Each stage of the network integrates three purpose-driven modules: a Concealed Object Region Extraction (CORE) module applies reversible modeling to the mask domain to identify core object regions; a Context-Aware Region Enhancement (CARE) module extends this principle to the RGB domain to foster better foreground-background separation; and a Finetuning Iteration via Noise-based Enhancement (FINE) module provides a final refinement. The FINE module introduces a targeted Bernoulli diffusion model that refines only the uncertain regions of the segmentation mask, harnessing the generative power of diffusion for fine-detail restoration without the prohibitive computational cost of a full-image process. This unique synergy, where the unfolding network provides a strong uncertainty prior for the diffusion model, allows RUN++ to efficiently direct its focus toward ambiguous areas, significantly mitigating false positives and negatives. Furthermore, we introduce a new paradigm for building robust CVP systems that remain effective under real-world degradations and extend this concept into a broader bi-level optimization framework.

[61] GasTwinFormer: A Hybrid Vision Transformer for Livestock Methane Emission Segmentation and Dietary Classification in Optical Gas Imaging

Toqi Tahamid Sarker,Mohamed Embaby,Taminul Islam,Amer AbuGhazaleh,Khaled R Ahmed

Main category: cs.CV

TL;DR: GasTwinFormer is an efficient deep learning model for real-time monitoring of livestock methane emissions and dietary classification.

Details Motivation: Livestock methane emissions account for 32% of human-caused methane emissions, so automated monitoring is needed to inform climate mitigation strategies. Method: The paper introduces GasTwinFormer, a hybrid vision transformer that combines spatially-reduced global attention with locally-grouped attention and uses a lightweight LR-ASPP decoder for multi-scale feature aggregation. Result: GasTwinFormer reaches 74.47% mIoU and 83.63% mF1 on segmentation and perfect accuracy (100%) on dietary classification, with only 3.348M parameters, 3.428G FLOPs, and an inference speed of 114.9 FPS. Conclusion: GasTwinFormer is an efficient hybrid vision transformer for real-time methane emission segmentation and dietary classification, offering a practical solution for livestock emission monitoring. Abstract: Livestock methane emissions represent 32% of human-caused methane production, making automated monitoring critical for climate mitigation strategies. We introduce GasTwinFormer, a hybrid vision transformer for real-time methane emission segmentation and dietary classification in optical gas imaging through a novel Mix Twin encoder alternating between spatially-reduced global attention and locally-grouped attention mechanisms. Our architecture incorporates a lightweight LR-ASPP decoder for multi-scale feature aggregation and enables simultaneous methane segmentation and dietary classification in a unified framework. We contribute the first comprehensive beef cattle methane emission dataset using OGI, containing 11,694 annotated frames across three dietary treatments. GasTwinFormer achieves 74.47% mIoU and 83.63% mF1 for segmentation while maintaining exceptional efficiency with only 3.348M parameters, 3.428G FLOPs, and 114.9 FPS inference speed. Additionally, our method achieves perfect dietary classification accuracy (100%), demonstrating the effectiveness of leveraging diet-emission correlations. Extensive ablation studies validate each architectural component, establishing GasTwinFormer as a practical solution for real-time livestock emission monitoring. Please see our project page at gastwinformer.github.io.
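For readers unfamiliar with the segmentation metrics quoted above, the following is a minimal sketch of how mIoU and mean F1 are commonly computed from a confusion matrix. It is not the authors' evaluation code: the function names, the toy masks, the two-class labelling, and the choice to average over all classes (scoring an absent class as 0) are assumptions made for illustration.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels; rows are ground-truth classes, columns are predictions."""
    idx = num_classes * y_true.ravel() + y_pred.ravel()
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou_and_f1(cm):
    """Per-class IoU and F1 from a confusion matrix, averaged over classes."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp  # predicted as the class, but wrong
    fn = cm.sum(axis=1) - tp  # belongs to the class, but missed
    # Assumption: a class with no pixels at all scores 0 instead of being skipped.
    iou = tp / np.maximum(tp + fp + fn, 1)
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    return iou.mean(), f1.mean()

# Toy 2x3 masks; class 0 = background, class 1 = gas plume (hypothetical labels).
y_true = np.array([[0, 0, 1], [0, 1, 1]])
y_pred = np.array([[0, 1, 1], [0, 1, 0]])
miou, mf1 = mean_iou_and_f1(confusion_matrix(y_true, y_pred, 2))
print(f"mIoU={miou:.3f}, mF1={mf1:.3f}")
```

Because F1 counts true positives twice in both numerator and denominator, a class's F1 is never lower than its IoU, which is consistent with the reported mF1 exceeding the reported mIoU.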

[62] CurveFlow: Curvature-Guided Flow Matching for Image Generation

Yan Luo,Drake Du,Hao Huang,Yi Fang,Mengyu Wang

Main category: cs.CV

TL;DR: CurveFlow improves semantic consistency and image quality in text-to-image generation through non-linear trajectories and curvature guidance.

Details Motivation: Existing rectified flow models are built on linear trajectories, which can force the image generation process through low-probability regions of the data manifold and hurt semantic alignment. Method: CurveFlow is a new flow matching framework that learns smooth, non-linear trajectories by directly incorporating curvature guidance and a curvature regularization technique. Result: CurveFlow achieves state-of-the-art text-to-image generation performance on MS COCO 2014 and 2017, significantly outperforming existing methods on semantic consistency metrics such as BLEU, METEOR, ROUGE, and CLAIR. Conclusion: By introducing curvature-guided non-linear trajectories, CurveFlow significantly improves the semantic consistency and image quality of text-to-image generation. Abstract: Existing rectified flow models are based on linear trajectories between data and noise distributions. This linearity enforces zero curvature, which can inadvertently force the image generation process through low-probability regions of the data manifold. A key question remains underexplored: how does the curvature of these trajectories correlate with the semantic alignment between generated images and their corresponding captions, i.e., instructional compliance? To address this, we introduce CurveFlow, a novel flow matching framework designed to learn smooth, non-linear trajectories by directly incorporating curvature guidance into the flow path. Our method features a robust curvature regularization technique that penalizes abrupt changes in the trajectory's intrinsic dynamics. Extensive experiments on MS COCO 2014 and 2017 demonstrate that CurveFlow achieves state-of-the-art performance in text-to-image generation, significantly outperforming both standard rectified flow variants and other non-linear baselines like Rectified Diffusion. The improvements are especially evident in semantic consistency metrics such as BLEU, METEOR, ROUGE, and CLAIR. This confirms that our curvature-aware modeling substantially enhances the model's ability to faithfully follow complex instructions while simultaneously maintaining high image quality. The code is made publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/CurveFlow.
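The claim that linear trajectories have zero curvature can be checked numerically. The sketch below uses the mean squared second finite difference of a sampled path as a simple acceleration-based proxy for curvature. This proxy, the function name `discrete_curvature_penalty`, and the toy endpoints are assumptions chosen for illustration; the abstract does not specify CurveFlow's actual regularizer.

```python
import numpy as np

def discrete_curvature_penalty(path, dt):
    """Mean squared second finite difference of a path with shape (num_steps, dim),
    sampled at uniform time steps dt."""
    accel = (path[2:] - 2.0 * path[1:-1] + path[:-2]) / dt ** 2
    return float(np.mean(np.sum(accel ** 2, axis=1)))

rng = np.random.default_rng(0)
x0 = rng.normal(size=4)  # toy stand-in for a noise sample
x1 = rng.normal(size=4)  # toy stand-in for a data sample
t = np.linspace(0.0, 1.0, 11)[:, None]

# Linear interpolation between the endpoints, as in rectified flow.
straight = (1.0 - t) * x0 + t * x1
# Same endpoints, but the path bends in between (hypothetical non-linear path).
bent = straight + 0.5 * np.sin(np.pi * t)

straight_penalty = discrete_curvature_penalty(straight, dt=0.1)
bent_penalty = discrete_curvature_penalty(bent, dt=0.1)
print(straight_penalty, bent_penalty)
```

The straight path yields a penalty that is zero up to floating-point error, while the bent path with the same endpoints does not, which is the degree of freedom a non-linear flow model can exploit.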

[63] HiRQA: Hierarchical Ranking and Quality Alignment for Opinion-Unaware Image Quality Assessment

Vaishnav Ramesh,Haining Wang,Md Jahidul Islam

Main category: cs.CV

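TL;DR: HiRQA is a new image quality assessment framework that needs neither reference images nor subjective labels. It improves performance through ranking and contrastive learning, performs well on both synthetic and authentic data, and supports real-time deployment.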

Details Motivation: To address the dataset bias and reliance on subjective labels that affect existing no-reference image quality assessment methods. Method: HiRQA combines ranking and contrastive learning, introducing a higher-order ranking loss, an embedding distance loss, and a training-time contrastive alignment loss. Result: HiRQA shows state-of-the-art performance, strong generalization, and scalability across a variety of distortions, and its lightweight variant HiRQA-S is capable of real-time deployment. Conclusion: HiRQA is an image quality assessment framework that requires no reference images or subjective labels, performs well on both synthetic and authentic degradations, and supports real-time deployment. Abstract: Despite significant progress in no-reference image quality assessment (NR-IQA), dataset biases and reliance on subjective labels continue to hinder generalization performance. We propose HiRQA (Hierarchical Ranking and Quality Alignment), a self-supervised, opinion-unaware framework that offers a hierarchical, quality-aware embedding through a combination of ranking and contrastive learning. Unlike prior approaches that depend on pristine references or auxiliary modalities at inference time, HiRQA predicts quality scores using only the input image. We introduce a novel higher-order ranking loss that supervises quality predictions through relational ordering across distortion pairs, along with an embedding distance loss that enforces consistency between feature distances and perceptual differences. A training-time contrastive alignment loss, guided by structured textual prompts, further enhances the learned representation. Trained only on synthetic distortions, HiRQA generalizes effectively to authentic degradations, as demonstrated through evaluation on various distortions such as lens flare, haze, motion blur, and low-light conditions. For real-time deployment, we introduce HiRQA-S, a lightweight variant with an inference time of only 3.5 ms per image. Extensive experiments across synthetic and authentic benchmarks validate HiRQA's state-of-the-art (SOTA) performance, strong generalization ability, and scalability.

[64] Reliable Multi-view 3D Reconstruction for 'Just-in-time' Edge Environments

Md. Nurul Absur,Abhinav Kumar,Swastik Brahma,Saptarshi Debroy

Main category: cs.CV

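TL;DR: This paper proposes a new edge resource management method that uses portfolio-theoretic optimization and a genetic algorithm to improve the reliability of multi-view 3D reconstruction in dynamic edge environments.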

Details Motivation: Multi-view 3D reconstruction applications are transforming critical use cases that need rapid situational awareness, such as emergency response, tactical scenarios, and public safety, but their reliability in edge environments remains an open problem. Method: A genetic algorithm solves the portfolio-theoretic optimization problem, and the benefits of the proposed camera selection strategy are demonstrated on publicly available and customized 3D datasets. Result: Experiments show that the method keeps 3D reconstruction quality above that of traditional baseline strategies even when cameras suffer spatiotemporally correlated disruptions. Conclusion: The paper proposes a portfolio-theory-inspired edge resource management strategy for reliable multi-view 3D reconstruction under possible system disruptions. Abstract: Multi-view 3D reconstruction applications are revolutionizing critical use cases that require rapid situational awareness, such as emergency response, tactical scenarios, and public safety. In many cases, their near-real-time latency requirements and ad-hoc needs for compute resources necessitate adoption of 'Just-in-time' edge environments where the system is set up on the fly to support the applications during the mission lifetime. However, reliability issues can arise from the inherent dynamism and operational adversities of such edge environments, resulting in spatiotemporally correlated disruptions that impact the camera operations, which can lead to sustained degradation of reconstruction quality. In this paper, we propose a novel portfolio-theory-inspired edge resource management strategy for reliable multi-view 3D reconstruction against possible system disruptions. Our proposed methodology can guarantee reconstruction quality satisfaction even when the cameras are prone to spatiotemporally correlated disruptions. The portfolio-theoretic optimization problem is solved using a genetic algorithm that converges quickly for realistic system settings. Using publicly available and customized 3D datasets, we demonstrate the proposed camera selection strategy's benefits in guaranteeing reliable 3D reconstruction against traditional baseline strategies, under spatiotemporal disruptions.

[65] XDR-LVLM: An Explainable Vision-Language Large Model for Diabetic Retinopathy Diagnosis

Masato Ito,Kaito Tanaka,Keisuke Matsuda,Aya Nakayama

Main category: cs.CV

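TL;DR: This paper proposes XDR-LVLM, a new diagnostic framework for diabetic retinopathy that combines a vision-language large model with natural language explanations, improving diagnostic accuracy and clinical applicability.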

Details Motivation: The black-box nature of deep learning models for diabetic retinopathy detection often hinders clinical adoption, so greater transparency and interpretability are needed. Method: The XDR-LVLM framework combines a medical vision encoder, an LVLM core, multi-task prompt engineering, and multi-stage fine-tuning to understand pathological features in fundus images and generate comprehensive diagnostic reports. Result: On the DDR dataset, XDR-LVLM achieves a balanced accuracy of 84.55% and an F1 score of 79.92%, along with strong concept detection results; human evaluation rates the generated explanations highly for fluency, accuracy, and clinical utility. Conclusion: XDR-LVLM uses natural language explanations to improve the transparency and interpretability of diabetic retinopathy diagnosis, bridging the gap between automated diagnosis and clinical needs. Abstract: Diabetic Retinopathy (DR) is a major cause of global blindness, necessitating early and accurate diagnosis. While deep learning models have shown promise in DR detection, their black-box nature often hinders clinical adoption due to a lack of transparency and interpretability. To address this, we propose XDR-LVLM (eXplainable Diabetic Retinopathy Diagnosis with LVLM), a novel framework that leverages Vision-Language Large Models (LVLMs) for high-precision DR diagnosis coupled with natural language-based explanations. XDR-LVLM integrates a specialized Medical Vision Encoder, an LVLM Core, and employs Multi-task Prompt Engineering and Multi-stage Fine-tuning to deeply understand pathological features within fundus images and generate comprehensive diagnostic reports. These reports explicitly include DR severity grading, identification of key pathological concepts (e.g., hemorrhages, exudates, microaneurysms), and detailed explanations linking observed features to the diagnosis. Extensive experiments on the Diabetic Retinopathy (DDR) dataset demonstrate that XDR-LVLM achieves state-of-the-art performance, with a Balanced Accuracy of 84.55% and an F1 Score of 79.92% for disease diagnosis, and superior results for concept detection (77.95% BACC, 66.88% F1). Furthermore, human evaluations confirm the high fluency, accuracy, and clinical utility of the generated explanations, showcasing XDR-LVLM's ability to bridge the gap between automated diagnosis and clinical needs by providing robust and interpretable insights.

[66] MeSS: City Mesh-Guided Outdoor Scene Generation with Cross-View Consistent Diffusion

Xuyang Chen,Zhijun Zhai,Kaixuan Zhou,Zengmao Wang,Jianan He,Dong Wang,Yanfeng Zhang,mingwei Sun,Rüdiger Westermann,Konrad Schindler,Liqiu Meng

Main category: cs.CV

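TL;DR: The paper proposes MeSS, a mesh-based scene synthesis method that improves the cross-view consistency of image diffusion models to generate high-quality, style-consistent city scenes, and pairs it with 3D Gaussian Splatting for diverse rendering.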

Details Motivation: City mesh models are now widely available, but the lack of realistic textures limits their use in virtual urban navigation and autonomous driving. Method: The paper proposes MeSS, which combines Cascaded Outpainting ControlNets, AGInpaint, and the GCAlign module to improve the cross-view consistency of image diffusion models, and reconstructs the scene with 3D Gaussian Splatting. Result: The method outperforms existing approaches in geometric alignment and generation quality, and supports diverse rendering through relighting and style transfer. Conclusion: MeSS offers an effective approach to mesh-based city scene generation, addressing the lack of textures and broadening the potential of city meshes in virtual navigation and autonomous driving. Abstract: Mesh models have become increasingly accessible for numerous cities; however, the lack of realistic textures restricts their application in virtual urban navigation and autonomous driving. To address this, this paper proposes MeSS (Mesh-based Scene Synthesis) for generating high-quality, style-consistent outdoor scenes with city mesh models serving as the geometric prior. While image and video diffusion models can leverage spatial layouts (such as depth maps or HD maps) as control conditions to generate street-level perspective views, they are not directly applicable to 3D scene generation. Video diffusion models excel at synthesizing consistent view sequences that depict scenes but often struggle to adhere to predefined camera paths or align accurately with rendered control videos. In contrast, image diffusion models, though unable to guarantee cross-view visual consistency, can produce more geometry-aligned results when combined with ControlNet. Building on this insight, our approach enhances image diffusion models by improving cross-view consistency. The pipeline comprises three key stages: first, we generate geometrically consistent sparse views using Cascaded Outpainting ControlNets; second, we propagate denser intermediate views via a component dubbed AGInpaint; and third, we globally eliminate visual inconsistencies (e.g., varying exposure) using the GCAlign module. Concurrently with generation, a 3D Gaussian Splatting (3DGS) scene is reconstructed by initializing Gaussian balls on the mesh surface. Our method outperforms existing approaches in both geometric alignment and generation quality.
Once synthesized, the scene can be rendered in diverse styles through relighting and style transfer techniques.

[67] Adversarial Agent Behavior Learning in Autonomous Driving Using Deep Reinforcement Learning

Arjun Srinivasan,Anubhav Paras,Aniket Bera

Main category: cs.CV

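TL;DR: This paper proposes a method for generating adversarial behavior for rule-based agents in order to trigger failure scenarios, and shows its negative effect on cumulative reward.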

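Details Motivation: In safety-critical applications such as autonomous driving, it is crucial to model rule-based agents properly. Several behavior modelling strategies and IDM models are currently used to model surrounding agents. Method: A learning-based method is proposed to derive adversarial behavior for rule-based agents. Result: The adversarial agent is evaluated against all the rule-based agents, showing a decrease in cumulative reward. Conclusion: The paper proposes a learning-based method to derive adversarial behavior for rule-based agents in order to cause failure scenarios, and its evaluation against all the rule-based agents shows a decrease in cumulative reward. Abstract: Existing approaches in reinforcement learning train an agent to learn desired optimal behavior in an environment with rule-based surrounding agents. In safety-critical applications such as autonomous driving, it is crucial that the rule-based agents are modelled properly. Several behavior modelling strategies and IDM models are currently used to model the surrounding agents. We present a learning-based method to derive the adversarial behavior for the rule-based agents to cause failure scenarios. We evaluate our adversarial agent against all the rule-based agents and show the decrease in cumulative reward.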

[68] DyMorph-B2I: Dynamic and Morphology-Guided Binary-to-Instance Segmentation for Renal Pathology

Leiyue Zhao,Yuechen Yang,Yanfan Zhu,Haichun Yang,Yuankai Huo,Paul D. Simonson,Kenji Ikemura,Mert R. Sabuncu,Yihe Yang,Ruining Deng

Main category: cs.CV

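TL;DR: DyMorph-B2I is a dynamic, morphology-guided binary-to-instance segmentation method for renal pathology that combines and optimizes several classical algorithms, significantly improving segmentation accuracy.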

Details Motivation: Existing renal pathology datasets and automated methods usually provide only binary (semantic) masks, which limits the precision of instance-level segmentation and downstream analysis. Classical post-processing techniques have limited effectiveness on the complex morphologies and connectivity of renal tissue. Method: DyMorph-B2I combines watershed, skeletonization, and morphological operations, and converts binary masks into instance segmentations through adaptive geometric refinement and customizable hyperparameter tuning. Result: Experiments show that DyMorph-B2I outperforms traditional methods at robustly separating adherent and heterogeneous structures, yielding better instance segmentation. Conclusion: DyMorph-B2I is an effective binary-to-instance segmentation solution for renal pathology that outperforms individual classical methods and naïve combinations, and improves the accuracy of morphometric analysis. Abstract: Accurate morphological quantification of renal pathology functional units relies on instance-level segmentation, yet most existing datasets and automated methods provide only binary (semantic) masks, limiting the precision of downstream analyses. Although classical post-processing techniques such as watershed, morphological operations, and skeletonization are often used to separate semantic masks into instances, their individual effectiveness is constrained by the diverse morphologies and complex connectivity found in renal tissue. In this study, we present DyMorph-B2I, a dynamic, morphology-guided binary-to-instance segmentation pipeline tailored for renal pathology. Our approach integrates watershed, skeletonization, and morphological operations within a unified framework, complemented by adaptive geometric refinement and customizable hyperparameter tuning for each class of functional unit. Through systematic parameter optimization, DyMorph-B2I robustly separates adherent and heterogeneous structures present in binary masks. Experimental results demonstrate that our method outperforms individual classical approaches and naïve combinations, enabling superior instance separation and facilitating more accurate morphometric analysis in renal pathology workflows. The pipeline is publicly available at: https://github.com/ddrrnn123/DyMorph-B2I.

[69] STAGNet: A Spatio-Temporal Graph and LSTM Framework for Accident Anticipation

Vipooshan Vipulananthan,Kumudu Mohottala,Kavindu Chinthana,Nimsara Paramulla,Charith D Chitraranjan

Main category: cs.CV

TL;DR: This paper proposes STAGNet, a model that improves how spatio-temporal features are aggregated to predict accidents from dash-cam video more effectively.

Details Motivation: To improve road safety and reduce accident risk, accident prediction with more economical and easily deployable solutions, such as relying on dash-cam video alone, is becoming increasingly important. Method: STAGNet incorporates spatio-temporal features and aggregates them through a recurrent network to improve accident prediction from dash-cam video. Result: Experiments on three public datasets show that STAGNet outperforms existing methods in both average precision and mean time-to-collision. Conclusion: STAGNet outperforms existing methods at accident prediction and generalizes across datasets. Abstract: Accident prediction and timely warnings play a key role in improving road safety by reducing the risk of injury to road users and minimizing property damage. Advanced Driver Assistance Systems (ADAS) are designed to support human drivers and are especially useful when they can anticipate potential accidents before they happen. While many existing systems depend on a range of sensors such as LiDAR, radar, and GPS, relying solely on dash-cam video input presents a more challenging but more cost-effective and easily deployable solution. In this work, we incorporate better spatio-temporal features and aggregate them through a recurrent network to improve upon state-of-the-art graph neural networks for predicting accidents from dash-cam videos. Experiments using three publicly available datasets show that our proposed STAGNet model achieves higher average precision and mean time-to-collision values than previous methods, both when cross-validated on a given dataset and when trained and tested on different datasets.

[70] Collaborative Multi-Modal Coding for High-Quality 3D Generation

Ziang Cao,Zhaoxi Chen,Liang Pan,Ziwei Liu

Main category: cs.CV

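TL;DR: TriMM is a new 3D generative model that uses multi-modal data (such as RGB, RGBD, and point clouds) to improve 3D asset quality, combining the strengths of different modalities with a triplane latent diffusion model for efficient generation.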

Details Motivation: 3D content is inherently multi-modal, yet most existing models are limited to a single modality or to 3D structures and overlook the complementary benefits of multi-modal data. TriMM therefore aims to improve 3D modeling by integrating multi-modal data. Method: TriMM introduces collaborative multi-modal coding that integrates modality-specific features, improves the coding with auxiliary 2D and 3D supervision, and finally uses a triplane latent diffusion model to generate high-quality 3D assets. Result: TriMM performs well on multiple well-known datasets, remaining competitive with models trained at large scale despite using a small amount of training data, and experiments verify that it can incorporate new modalities such as RGB-D. Conclusion: By integrating multi-modal data, TriMM improves the quality and robustness of 3D generation and offers a viable approach for future multi-modal 3D modeling. Abstract: 3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modality data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision are introduced to raise the robustness and performance of multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both the texture and the geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves competitive performance with models trained on large-scale datasets, despite utilizing a small amount of training data.
Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.

[71] Center-Oriented Prototype Contrastive Clustering

Shihao Dong,Xiaotong Zhou,Yuhui Zheng,Huiying Xu,Xinzhong Zhu

Main category: cs.CV

TL;DR: This paper proposes a novel contrastive clustering framework that improves prototype computation and reduces class conflicts, achieving better performance than existing methods.

Details Motivation: Contrastive learning is widely used in clustering, but class conflicts and deviations in prototype calculation remain challenging. Existing methods suffer from inaccuracies in hard prototype computation, which can lead to inter-class conflicts and prototype drift. Method: The framework includes a soft prototype contrastive module and a dual consistency learning module. The soft prototype module calculates prototypes using weighted probabilities of samples belonging to cluster centers, while the dual consistency module aligns transformations of samples and neighborhoods to ensure feature consistency. Result: Extensive experiments on five datasets demonstrate the effectiveness of the proposed method in comparison to state-of-the-art approaches. Conclusion: The proposed center-oriented prototype contrastive clustering framework effectively addresses inter-class conflicts and prototype drift, showing superior performance compared to existing methods. Abstract: Contrastive learning is widely used in clustering tasks due to its discriminative representation. However, the conflict problem between classes is difficult to solve effectively. Existing methods try to solve this problem through prototype contrast, but there is a deviation between the calculation of hard prototypes and the true cluster center. To address this problem, we propose a center-oriented prototype contrastive clustering framework, which consists of a soft prototype contrastive module and a dual consistency learning module. In short, the soft prototype contrastive module uses the probability that the sample belongs to the cluster center as a weight to calculate the prototype of each category, while avoiding inter-class conflicts and reducing prototype drift. 
The dual consistency learning module aligns different transformations of the same sample and the neighborhoods of different samples respectively, ensuring that the features have transformation-invariant semantic information and compact intra-cluster distribution, while providing reliable guarantees for the calculation of prototypes. Extensive experiments on five datasets show that the proposed method is effective compared to the SOTA. Our code is published on https://github.com/LouisDong95/CPCC.
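The difference between hard and soft prototypes described above can be illustrated with a small sketch. It assumes the soft prototype of a cluster is the average of sample features weighted by each sample's (column-normalized) assignment probability. The function names, the normalization, and the toy data are assumptions for illustration, since the abstract does not give the exact formula.

```python
import numpy as np

def soft_prototypes(features, probs):
    """features: (n, d); probs: (n, k) soft cluster assignments.
    Returns (k, d) prototypes as probability-weighted means of the features."""
    weights = probs / probs.sum(axis=0, keepdims=True)  # normalize per cluster
    return weights.T @ features

def hard_prototypes(features, probs):
    """Baseline: average only the samples whose argmax label is the cluster.
    Assumes every cluster receives at least one sample."""
    labels = probs.argmax(axis=1)
    return np.stack([features[labels == c].mean(axis=0) for c in range(probs.shape[1])])

# Four 2-D samples on a line, with soft assignments to two clusters (toy values).
features = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]])
probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.4, 0.6], [0.1, 0.9]])

soft = soft_prototypes(features, probs)
hard = hard_prototypes(features, probs)
print("soft:", soft[:, 0], "hard:", hard[:, 0])
```

In this toy case, the uncertain samples near the boundary pull each soft prototype toward the other cluster, whereas the hard prototypes depend entirely on which side of the argmax each sample falls.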

[72] AeroDuo: Aerial Duo for UAV-based Vision and Language Navigation

Ruipu Wu,Yige Zhang,Jinyu Chen,Linjiang Huang,Shifeng Zhang,Xu Zhou,Liang Wang,Si Liu

Main category: cs.CV

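TL;DR: The paper proposes a new dual-altitude UAV collaborative vision-and-language navigation task and framework, and builds the large-scale HaL-13k dataset to improve UAV navigation in complex environments.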

Details Motivation: Conventional UAV vision-and-language navigation struggles with long trajectories and complex maneuverability, and often needs human intervention or overly detailed instructions, which motivates a new task and framework. Method: The paper introduces the Dual-Altitude UAV Collaborative VLN (DuAl-VLN) task, builds the HaL-13k dataset, and proposes the AeroDuo framework, which exploits the complementary strengths of a high-altitude and a low-altitude UAV. Result: The HaL-13k dataset contains 13,838 collaborative high-low UAV trajectories, and in the proposed AeroDuo framework the high-altitude UAV handles environmental reasoning while the low-altitude UAV handles precise navigation. Conclusion: The paper proposes AeroDuo, a new dual-altitude UAV collaborative vision-and-language navigation framework, and builds the HaL-13k dataset for training and evaluation, to address the challenges of conventional UAV vision-and-language navigation. Abstract: Aerial Vision-and-Language Navigation (VLN) is an emerging task that enables Unmanned Aerial Vehicles (UAVs) to navigate outdoor environments using natural language instructions and visual cues. However, due to the extended trajectories and complex maneuverability of UAVs, achieving reliable UAV-VLN performance is challenging and often requires human intervention or overly detailed instructions. To harness the advantages of UAVs' high mobility, which could provide multi-grained perspectives, while maintaining a manageable motion space for learning, we introduce a novel task called Dual-Altitude UAV Collaborative VLN (DuAl-VLN). In this task, two UAVs operate at distinct altitudes: a high-altitude UAV responsible for broad environmental reasoning, and a low-altitude UAV tasked with precise navigation. To support the training and evaluation of the DuAl-VLN, we construct the HaL-13k, a dataset comprising 13,838 collaborative high-low UAV demonstration trajectories, each paired with target-oriented language instructions. This dataset includes both unseen maps and an unseen object validation set to systematically evaluate the model's generalization capabilities across novel environments and unfamiliar targets. To consolidate their complementary strengths, we propose a dual-UAV collaborative VLN framework, AeroDuo, where the high-altitude UAV integrates a multimodal large language model (Pilot-LLM) for target reasoning, while the low-altitude UAV employs a lightweight multi-stage policy for navigation and target grounding.
The two UAVs work collaboratively and only exchange minimal coordinate information to ensure efficiency.

[73] Pretrained Diffusion Models Are Inherently Skipped-Step Samplers

Wenju Xu

Main category: cs.CV

TL;DR: This paper introduces skipped-step sampling for diffusion models, enabling faster generation without sacrificing quality, and demonstrates its effectiveness on multiple pretrained models.

Details Motivation: Diffusion models require long, step-by-step sequential generation. Existing methods such as DDIM reduce sampling steps with non-Markovian processes, but it is unclear whether the original diffusion process can achieve the same efficiency. Method: The paper introduces skipped-step sampling, which bypasses multiple intermediate denoising steps, and integrates it with DDIM. Result: Experiments show that the proposed method achieves high-quality generation with significantly fewer sampling steps on various pretrained diffusion models. Conclusion: The skipped-step sampling mechanism is derived from the same training objective as the standard diffusion model, and integrating it with DDIM achieves high-quality generation with fewer sampling steps. Abstract: Diffusion models have been achieving state-of-the-art results across various generation tasks. However, a notable drawback is their sequential generation process, requiring long-sequence step-by-step generation. Existing methods, such as DDIM, attempt to reduce sampling steps by constructing a class of non-Markovian diffusion processes that maintain the same training objective. However, there remains a gap in understanding whether the original diffusion process can achieve the same efficiency without resorting to non-Markovian processes. In this paper, we provide an affirmative answer and introduce skipped-step sampling, a mechanism that bypasses multiple intermediate denoising steps in the iterative generation process, in contrast with the traditional step-by-step refinement of standard diffusion inference. Crucially, we demonstrate that this skipped-step sampling mechanism is derived from the same training objective as the standard diffusion model, indicating that accelerated sampling via skipped steps in a Markovian manner is an intrinsic property of pretrained diffusion models.
Additionally, we propose an enhanced generation method by integrating our accelerated sampling technique with DDIM. Extensive experiments on popular pretrained diffusion models, including the OpenAI ADM, Stable Diffusion, and Open Sora models, show that our method achieves high-quality generation with significantly reduced sampling steps.

[74] Comp-X: On Defining an Interactive Learned Image Compression Paradigm With Expert-driven LLM Agent

Yixin Gao,Xin Li,Xiaohan Pan,Runsen Feng,Bingchen Li,Yunpeng Qi,Yiting Lu,Zhengxue Cheng,Zhibo Chen,Jörn Ostermann

Main category: cs.CV

TL;DR: The paper presents Comp-X, an intelligently interactive image compression paradigm driven by an LLM agent, which unifies different coding modes in one framework and understands coding requests efficiently.

Details Motivation: Commonly used image codecs suffer from limited coding modes and rely on manual mode selection by engineers, making them hard to use for non-expert users. The paper aims to evolve the image coding paradigm to overcome these limitations. Method: The paper introduces three key innovations: a multi-functional coding framework that unifies different coding modes, an interactive coding agent using augmented in-context learning with expert feedback, and the IIC-bench benchmark for evaluation. Result: Extensive experimental results demonstrate that Comp-X can efficiently understand coding requests and achieve impressive textual interaction capability. Conclusion: Comp-X provides a promising avenue for AGI in image compression by maintaining comparable performance with a single coding framework while offering efficient understanding of coding requests and impressive textual interaction capability. Abstract: We present Comp-X, the first intelligently interactive image compression paradigm empowered by the impressive reasoning capability of a large language model (LLM) agent. Notably, commonly used image codecs usually suffer from limited coding modes and rely on manual mode selection by engineers, making them unfriendly to non-expert users. To overcome this, we advance the evolution of the image coding paradigm by introducing three key innovations: (i) a multi-functional coding framework, which unifies different coding modes for various objectives and requirements, including human-machine perception, variable coding, and spatial bit allocation, into one framework. (ii) an interactive coding agent, where we propose an augmented in-context learning method with coding expert feedback to teach the LLM agent how to understand the coding request, select the mode, and use the coding tools.
(iii) IIC-bench, the first dedicated benchmark comprising diverse user requests and the corresponding annotations from coding experts, which is systematically designed for intelligently interactive image compression evaluation. Extensive experimental results demonstrate that our proposed Comp-X can understand the coding requests efficiently and achieve impressive textual interaction capability. Meanwhile, it can maintain comparable compression performance even with a single coding framework, providing a promising avenue for artificial general intelligence (AGI) in image compression.

[75] Normal and Abnormal Pathology Knowledge-Augmented Vision-Language Model for Anomaly Detection in Pathology Images

Jinsol Song,Jiamu Wang,Anh Tien Nguyen,Keunho Byeon,Sangjeong Ahn,Sung Hak Lee,Jin Tae Kwak

Main category: cs.CV

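TL;DR: This paper proposes Ano-NAViLa, a vision-language model for anomaly detection in pathology images that incorporates normal and abnormal pathology knowledge and achieves state-of-the-art anomaly detection and localization on two lymph node datasets.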

Details Motivation: Existing anomaly detection methods are designed mainly for industrial settings and face limitations in pathology due to computational constraints, diverse tissue structures, and a lack of interpretability. Method: The paper proposes Ano-NAViLa, a normal and abnormal pathology knowledge-augmented vision-language model built on a pre-trained vision-language model with a lightweight trainable MLP, to improve the accuracy and robustness of anomaly detection in pathology images. Result: Evaluated on two lymph node datasets from different organs, Ano-NAViLa outperforms existing models in both anomaly detection and localization, reaching state-of-the-art performance. Conclusion: By integrating normal and abnormal pathology knowledge, Ano-NAViLa improves the accuracy, robustness, and interpretability of anomaly detection in pathology images and points to new directions for future research in this area. Abstract: Anomaly detection in computational pathology aims to identify rare and scarce anomalies where disease-related data are often limited or missing. Existing anomaly detection methods, primarily designed for industrial settings, face limitations in pathology due to computational constraints, diverse tissue structures, and lack of interpretability. To address these challenges, we propose Ano-NAViLa, a Normal and Abnormal pathology knowledge-augmented Vision-Language model for Anomaly detection in pathology images. Ano-NAViLa is built on a pre-trained vision-language model with a lightweight trainable MLP. By incorporating both normal and abnormal pathology knowledge, Ano-NAViLa enhances accuracy and robustness to variability in pathology images and provides interpretability through image-text associations. Evaluated on two lymph node datasets from different organs, Ano-NAViLa achieves state-of-the-art performance in anomaly detection and localization, outperforming competing models.

[76] RATopo: Improving Lane Topology Reasoning via Redundancy Assignment

Han Li,Shaofei Huang,Longfei Xu,Yulu Gao,Beipeng Mu,Si Liu

Main category: cs.CV

TL;DR: This paper proposes RATopo, which strengthens lane topology reasoning by restructuring the Transformer decoder and introducing a redundancy assignment strategy.

Details Motivation: Existing methods follow a first-detect-then-reason paradigm in which topology supervision is restricted to the one-to-one assignment results of the detection stage, so the range of valid supervision is limited and topology reasoning performance suffers. Method: RATopo swaps the cross-attention and self-attention layers in the Transformer decoder so that redundant lane predictions are retained, and uses multiple parallel cross-attention blocks with independent parameters to increase the diversity of detected lanes. Result: Extensive experiments on the OpenLane-V2 dataset show that RATopo effectively improves topology reasoning performance. Conclusion: The RATopo strategy is model-agnostic and can be integrated seamlessly into existing topology reasoning frameworks, consistently improving both lane-lane and lane-traffic topology performance. Abstract: Lane topology reasoning plays a critical role in autonomous driving by modeling the connections among lanes and the topological relationships between lanes and traffic elements. Most existing methods adopt a first-detect-then-reason paradigm, where topological relationships are supervised based on the one-to-one assignment results obtained during the detection stage. This supervision strategy results in suboptimal topology reasoning performance due to the limited range of valid supervision. In this paper, we propose RATopo, a Redundancy Assignment strategy for lane Topology reasoning that enables quantity-rich and geometry-diverse topology supervision. Specifically, we restructure the Transformer decoder by swapping the cross-attention and self-attention layers. This allows redundant lane predictions to be retained before suppression, enabling effective one-to-many assignment. We also instantiate multiple parallel cross-attention blocks with independent parameters, which further enhances the diversity of detected lanes. Extensive experiments on OpenLane-V2 demonstrate that our RATopo strategy is model-agnostic and can be seamlessly integrated into existing topology reasoning frameworks, consistently improving both lane-lane and lane-traffic topology performance.

[77] DesignCLIP: Multimodal Learning with CLIP for Design Patent Understanding

Zhu Wang,Homaira Huda Shomee,Sathya N. Ravi,Sourav Medya

Main category: cs.CV

TL;DR: This paper proposes DesignCLIP, a unified CLIP-based framework for design patent applications that combines class-aware classification and contrastive learning with generated detailed captions and multi-view image learning, and outperforms baseline and SOTA models.

Details Motivation: Patent images often fail to convey comprehensive visual context and semantic information, which can lead to ambiguities in evaluation during prior art searches. Method: The authors use CLIP models to build a unified framework, DesignCLIP, that combines class-aware classification and contrastive learning and uses generated detailed captions and multi-view image learning. Result: DesignCLIP performs well across downstream tasks, including patent classification and patent retrieval, and the work also explores the potential of multimodal patent retrieval. Conclusion: DesignCLIP is a promising multimodal approach for more reliable and accurate patent analysis, and it outperforms baseline and SOTA models on all tasks. Abstract: In the field of design patent analysis, traditional tasks such as patent classification and patent image retrieval heavily depend on the image data. However, patent images -- typically consisting of sketches with abstract and structural elements of an invention -- often fall short in conveying comprehensive visual context and semantic information. This inadequacy can lead to ambiguities in evaluation during prior art searches. Recent advancements in vision-language models, such as CLIP, offer promising opportunities for more reliable and accurate AI-driven patent analysis. In this work, we leverage CLIP models to develop a unified framework DesignCLIP for design patent applications with a large-scale dataset of U.S. design patents. To address the unique characteristics of patent data, DesignCLIP incorporates class-aware classification and contrastive learning, utilizing generated detailed captions for patent images and multi-views image learning. We validate the effectiveness of DesignCLIP across various downstream tasks, including patent classification and patent retrieval. Additionally, we explore multimodal patent retrieval, which provides the potential to enhance creativity and innovation in design by offering more diverse sources of inspiration. Our experiments show that DesignCLIP consistently outperforms baseline and SOTA models in the patent domain on all tasks. Our findings underscore the promise of multimodal approaches in advancing patent analysis. The codebase is available here: https://anonymous.4open.science/r/PATENTCLIP-4661/README.md.

[78] TPA: Temporal Prompt Alignment for Fetal Congenital Heart Defect Classification

Darya Taratynova,Alya Almsouti,Beknur Kalmakhanbet,Numan Saeed,Mohammad Yaqub

Main category: cs.CV

TL;DR: Temporal Prompt Alignment (TPA) is a novel framework for fetal CHD classification in ultrasound videos that improves classification accuracy and prediction calibration by integrating temporal modeling, contrastive learning with text prompts, and uncertainty quantification.

Details Motivation: Congenital heart defect (CHD) detection in ultrasound videos is limited by image noise, probe variability, and inadequate use of temporal information in existing machine learning approaches. Automated, temporally-aware, and well-calibrated methods are needed to improve diagnostic accuracy and clinical reliability. Method: Temporal Prompt Alignment (TPA) uses an image encoder to extract features from ultrasound video frames, aggregates these features using a trainable temporal extractor, aligns video representations with text prompts using a contrastive loss, and enhances calibration with a Conditional Variational Autoencoder Style Modulation (CVAESM) module. Result: TPA achieved state-of-the-art macro F1 scores of 85.40% for CHD diagnosis, reduced expected calibration error by 5.38% and adaptive ECE by 6.8%. On EchoNet-Dynamic's three-class task, TPA improved macro F1 by 4.73% (from 53.89% to 58.62%). Conclusion: Temporal Prompt Alignment (TPA) is an effective framework for fetal congenital heart defect (CHD) classification that integrates temporal modeling, prompt-aware contrastive learning, and uncertainty quantification, showing significant improvements in performance and calibration. Abstract: Congenital heart defect (CHD) detection in ultrasound videos is hindered by image noise and probe positioning variability. While automated methods can reduce operator dependence, current machine learning approaches often neglect temporal information, limit themselves to binary classification, and do not account for prediction calibration. We propose Temporal Prompt Alignment (TPA), a method leveraging foundation image-text model and prompt-aware contrastive learning to classify fetal CHD on cardiac ultrasound videos. 
TPA extracts features from each frame of video subclips using an image encoder, aggregates them with a trainable temporal extractor to capture heart motion, and aligns the video representation with class-specific text prompts via a margin-hinge contrastive loss. To enhance calibration for clinical reliability, we introduce a Conditional Variational Autoencoder Style Modulation (CVAESM) module, which learns a latent style vector to modulate embeddings and quantifies classification uncertainty. Evaluated on a private dataset for CHD detection and on a large public dataset, EchoNet-Dynamic, for systolic dysfunction, TPA achieves state-of-the-art macro F1 scores of 85.40% for CHD diagnosis, while also reducing expected calibration error by 5.38% and adaptive ECE by 6.8%. On EchoNet-Dynamic's three-class task, it boosts macro F1 by 4.73% (from 53.89% to 58.62%). Temporal Prompt Alignment (TPA) is a framework for fetal congenital heart defect (CHD) classification in ultrasound videos that integrates temporal modeling, prompt-aware contrastive learning, and uncertainty quantification.
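
The calibration figures above refer to expected calibration error (ECE). As a rough illustration of the metric only (not TPA's evaluation code), the standard equal-width-bin ECE can be sketched as follows. The function name and the default of 10 bins are assumptions, and the adaptive ECE variant mentioned in the abstract is not covered.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct:     truthy where the prediction was right, falsy otherwise.
    n_bins:      number of equal-width bins (10 is an assumed default).
    """
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy usage: four predictions at 90% confidence, three of them correct.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

A lower value means that stated confidence tracks observed accuracy more closely, which is the sense in which the abstract reports a reduction.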

[79] BasketLiDAR: The First LiDAR-Camera Multimodal Dataset for Professional Basketball MOT

Ryunosuke Hayashi,Kohei Torimi,Rokuto Nagata,Kazuma Ikeda,Ozora Sako,Taichi Nakamura,Masaki Tani,Yoshimitsu Aoki,Kentaro Yoshioka

Main category: cs.CV

TL;DR: This paper presents BasketLiDAR, a multimodal dataset that combines LiDAR point clouds with synchronized multi-view camera footage in a professional basketball environment, and proposes a new MOT framework that runs in real time with superior tracking performance.

Details Motivation: Traditional multi-camera systems are constrained by the two-dimensional nature of video data and by complex 3D reconstruction processing, which makes real-time analysis challenging; basketball is one of the most difficult scenarios in the MOT field. Method: The authors construct BasketLiDAR, a multimodal dataset combining LiDAR point clouds with synchronized multi-view camera footage in a professional basketball environment, and propose a new MOT framework consisting of a real-time LiDAR-only tracking pipeline and a multimodal tracking pipeline that fuses LiDAR and camera data. Result: The BasketLiDAR dataset contains 4,445 frames and 3,105 player IDs. Experiments show that the proposed method achieves real-time operation, which is difficult for conventional camera-only methods, while delivering superior tracking performance even under occlusion. Conclusion: BasketLiDAR achieves real-time operation and superior tracking performance under occlusion, together with a multimodal tracking framework that combines LiDAR and camera data. Abstract: Real-time 3D trajectory player tracking in sports plays a crucial role in tactical analysis, performance evaluation, and enhancing spectator experience. Traditional systems rely on multi-camera setups, but are constrained by the inherently two-dimensional nature of video data and the need for complex 3D reconstruction processing, making real-time analysis challenging. Basketball, in particular, represents one of the most difficult scenarios in the MOT field, as ten players move rapidly and complexly within a confined court space, with frequent occlusions caused by intense physical contact. To address these challenges, this paper constructs BasketLiDAR, the first multimodal dataset in the sports MOT field that combines LiDAR point clouds with synchronized multi-view camera footage in a professional basketball environment, and proposes a novel MOT framework that simultaneously achieves improved tracking accuracy and reduced computational cost. The BasketLiDAR dataset contains a total of 4,445 frames and 3,105 player IDs, with fully synchronized IDs between three LiDAR sensors and three multi-view cameras. We recorded 5-on-5 and 3-on-3 game data from actual professional basketball players, providing complete 3D positional information and ID annotations for each player. Based on this dataset, we developed a novel MOT algorithm that leverages LiDAR's high-precision 3D spatial information. The proposed method consists of a real-time tracking pipeline using LiDAR alone and a multimodal tracking pipeline that fuses LiDAR and camera data.
Experimental results demonstrate that our approach achieves real-time operation, which was difficult with conventional camera-only methods, while achieving superior tracking performance even under occlusion conditions. The dataset is available upon request at: https://sites.google.com/keio.jp/keio-csg/projects/basket-lidar

[80] First RAG, Second SEG: A Training-Free Paradigm for Camouflaged Object Detection

Wutao Liu,YiDan Wang,Pan Gao

Main category: cs.CV

TL;DR: This paper proposes RAG-SEG, a new training-free method for camouflaged object detection that maintains strong performance while substantially reducing computational and resource requirements.

Details Motivation: Camouflaged object detection (COD) is a significant challenge in computer vision because of the high similarity between objects and their backgrounds. Existing methods usually rely on heavy training and large computational resources, while foundation models such as SAM struggle with COD without fine-tuning and need high-quality prompts to perform well. Method: The paper proposes RAG-SEG, a training-free COD paradigm that splits the task into two stages: Retrieval-Augmented Generation (RAG) to generate coarse masks as prompts, followed by SAM-based segmentation (SEG) for refinement. Result: Extensive experiments show that RAG-SEG performs on par with or surpasses state-of-the-art methods. Conclusion: RAG-SEG needs no conventional training yet remains competitive, and all experiments were run on a personal laptop, highlighting the method's computational efficiency and practicality. Abstract: Camouflaged object detection (COD) poses a significant challenge in computer vision due to the high similarity between objects and their backgrounds. Existing approaches often rely on heavy training and large computational resources. While foundation models such as the Segment Anything Model (SAM) offer strong generalization, they still struggle to handle COD tasks without fine-tuning and require high-quality prompts to yield good performance. However, generating such prompts manually is costly and inefficient. To address these challenges, we propose First RAG, Second SEG (RAG-SEG), a training-free paradigm that decouples COD into two stages: Retrieval-Augmented Generation (RAG) for generating coarse masks as prompts, followed by SAM-based segmentation (SEG) for refinement. RAG-SEG constructs a compact retrieval database via unsupervised clustering, enabling fast and effective feature retrieval. During inference, the retrieved features produce pseudo-labels that guide precise mask generation using SAM2. Our method eliminates the need for conventional training while maintaining competitive performance. Extensive experiments on benchmark COD datasets demonstrate that RAG-SEG performs on par with or surpasses state-of-the-art methods. Notably, all experiments are conducted on a personal laptop, highlighting the computational efficiency and practicality of our approach. We present further analysis in the Appendix, covering limitations, salient object detection extension, and possible improvements.

[81] VideoEraser: Concept Erasure in Text-to-Video Diffusion Models

Naen Xu,Jinghuai Zhang,Changjiang Li,Zhi Chen,Chunyi Zhou,Qingming Li,Tianyu Du,Shouling Ji

Main category: cs.CV

TL;DR: To address the risk that T2V diffusion models generate harmful content, this paper proposes VideoEraser, a training-free framework whose two-stage approach performs well across multiple tasks.

Details Motivation: T2V diffusion models can be misused to generate harmful or misleading content, so a method is needed to prevent this. Method: The paper proposes VideoEraser, a training-free framework with two stages: Selective Prompt Embedding Adjustment (SPEA) and Adversarial-Resilient Noise Guidance (ARNG). Result: Experiments on four tasks show that VideoEraser outperforms prior methods in efficacy, integrity, fidelity, robustness, and generalizability, reducing undesirable content by 46% on average compared to baselines. Conclusion: VideoEraser effectively addresses the privacy, copyright, and safety concerns of T2V diffusion models and achieves state-of-the-art performance in suppressing undesirable content. Abstract: The rapid growth of text-to-video (T2V) diffusion models has raised concerns about privacy, copyright, and safety due to their potential misuse in generating harmful or misleading content. These models are often trained on numerous datasets, including unauthorized personal identities, artistic creations, and harmful materials, which can lead to uncontrolled production and distribution of such content. To address this, we propose VideoEraser, a training-free framework that prevents T2V diffusion models from generating videos with undesirable concepts, even when explicitly prompted with those concepts. Designed as a plug-and-play module, VideoEraser can seamlessly integrate with representative T2V diffusion models via a two-stage process: Selective Prompt Embedding Adjustment (SPEA) and Adversarial-Resilient Noise Guidance (ARNG). We conduct extensive evaluations across four tasks, including object erasure, artistic style erasure, celebrity erasure, and explicit content erasure. Experimental results show that VideoEraser consistently outperforms prior methods regarding efficacy, integrity, fidelity, robustness, and generalizability. Notably, VideoEraser achieves state-of-the-art performance in suppressing undesirable content during T2V generation, reducing it by 46% on average across four tasks compared to baselines.

[82] Predicting Road Crossing Behaviour using Pose Detection and Sequence Modelling

Subhasis Dasgupta,Preetam Saha,Agniva Roy,Jaydip Sen

Main category: cs.CV

TL;DR: This study proposes an end-to-end deep learning framework for predicting pedestrians' intent to cross the road and compares the sequence models GRU, LSTM, and 1D CNN.

Details Motivation: As AI systems become widespread, autonomous vehicles need to predict from a distance whether a pedestrian intends to cross the road in order to improve safety. Method: The study uses deep learning models for pose detection combined with sequence modelling for temporal prediction, and analyses three models: GRU, LSTM, and 1D CNN. Result: GRU predicted pedestrian intent better than LSTM, while 1D CNN was the fastest model. Conclusion: An end-to-end deep learning framework combining pose detection and sequence modelling can effectively predict road-crossing intent and support autonomous driving systems. Abstract: The world is constantly moving towards AI based systems and autonomous vehicles are now reality in different parts of the world. These vehicles require sensors and cameras to detect objects and maneuver according to that. It becomes important for such vehicles to also predict from a distance if a person is about to cross a road or not. The current study focused on predicting the intent of crossing the road by pedestrians in an experimental setup. The study involved working with deep learning models to predict poses and sequence modelling for temporal predictions. The study analysed three different sequence modelling to understand the prediction behaviour and it was found out that GRU was better in predicting the intent compared to LSTM model but 1D CNN was the best model in terms of speed. The study involved video analysis, and the output of pose detection model was integrated later on to sequence modelling techniques for an end-to-end deep learning framework for predicting road crossing intents.

[83] RCDINO: Enhancing Radar-Camera 3D Object Detection with DINOv2 Semantic Features

Olga Matykina,Dmitry Yudin

Main category: cs.CV

TL;DR: RCDINO is a new multimodal 3D object detection model that fuses camera and radar data and performs strongly on the nuScenes dataset.

Details Motivation: 3D object detection is essential for autonomous driving and robotics and relies on effective fusion of multimodal camera and radar data. Method: The paper proposes RCDINO, a multimodal transformer-based model that enhances visual representations by fusing visual backbone features with semantically rich representations from the pretrained DINOv2 foundation model. Result: Experiments show that RCDINO improves detection performance while preserving compatibility with the baseline architecture. Conclusion: RCDINO achieves state-of-the-art performance among radar-camera models, with 56.4 NDS and 48.1 mAP on the nuScenes dataset. Abstract: Three-dimensional object detection is essential for autonomous driving and robotics, relying on effective fusion of multimodal data from cameras and radar. This work proposes RCDINO, a multimodal transformer-based model that enhances visual backbone features by fusing them with semantically rich representations from the pretrained DINOv2 foundation model. This approach enriches visual representations and improves the model's detection performance while preserving compatibility with the baseline architecture. Experiments on the nuScenes dataset demonstrate that RCDINO achieves state-of-the-art performance among radar-camera models, with 56.4 NDS and 48.1 mAP. Our implementation is available at https://github.com/OlgaMatykina/RCDINO.

[84] An Empirical Study on How Video-LLMs Answer Video Questions

Chenhui Gou,Ziyu Ma,Zicheng Duan,Haoyu He,Feng Chen,Akide Liu,Bohan Zhuang,Jianfei Cai,Hamid Rezatofighi

Main category: cs.CV

TL;DR: This paper systematically analyzes how Video-LLMs process video content internally, using attention knockouts to reveal that early layers encode video information, later layers handle reasoning, and language-guided retrieval plays a key role in spatial-temporal modeling, offering insights for more efficient models.

Details Motivation: Despite the strong capabilities of Video-LLMs in answering video questions, most existing research focuses on performance improvement without much understanding of their internal mechanisms. This paper aims to bridge this gap through a systematic empirical study. Method: The paper uses attention knockouts as the primary analytical tool, designing three variants: Video Temporal Knockout, Video Spatial Knockout, and Language-to-Video Knockout. These are applied across different layers in two settings (global and fine-grained) to interpret the internal mechanisms of Video-LLMs. Result: The study reveals three key findings: 1) Video information extraction mainly occurs in early layers, dividing processing into perceptual encoding and abstract reasoning stages; 2) Certain intermediate layers have a significant impact on video question answering; and 3) Spatial-temporal modeling relies more on language-guided retrieval rather than self-attention among video tokens. The study also demonstrates how these insights can reduce attention computation. Conclusion: This paper concludes that Video-LLMs process video content in a structured manner, with early layers focusing on perceptual encoding and later layers on abstract reasoning, and that spatial-temporal modeling relies more on language-guided retrieval than on self-attention among video tokens. Additionally, the research offers insights into reducing attention computation for efficiency. Abstract: Taking advantage of large-scale data and pretrained language models, Video Large Language Models (Video-LLMs) have shown strong capabilities in answering video questions. However, most existing efforts focus on improving performance, with limited attention to understanding their internal mechanisms. This paper aims to bridge this gap through a systematic empirical study. 
To interpret existing VideoLLMs, we adopt attention knockouts as our primary analytical tool and design three variants: Video Temporal Knockout, Video Spatial Knockout, and Language-to-Video Knockout. Then, we apply these three knockouts on different numbers of layers (window of layers). By carefully controlling the window of layers and types of knockouts, we provide two settings: a global setting and a fine-grained setting. Our study reveals three key findings: (1) Global setting indicates Video information extraction primarily occurs in early layers, forming a clear two-stage process -- lower layers focus on perceptual encoding, while higher layers handle abstract reasoning; (2) In the fine-grained setting, certain intermediate layers exert an outsized impact on video question answering, acting as critical outliers, whereas most other layers contribute minimally; (3) In both settings, we observe that spatial-temporal modeling relies more on language-guided retrieval than on intra- and inter-frame self-attention among video tokens, despite the latter's high computational cost. Finally, we demonstrate that these insights can be leveraged to reduce attention computation in Video-LLMs. To our knowledge, this is the first work to systematically uncover how Video-LLMs internally process and understand video content, offering interpretability and efficiency perspectives for future research.
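
The knockouts described above amount to removing selected attention edges before the softmax. The following is a minimal sketch of that idea for a single attention head, not the authors' implementation: the function names, the token layout (three video tokens followed by two language tokens), and the omission of causal masking are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Softmax over one row of scores; entries equal to -inf get weight 0."""
    finite = [v for v in row if v != -math.inf]
    m = max(finite)  # assumes at least one key remains visible to the query
    exps = [math.exp(v - m) if v != -math.inf else 0.0 for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def knockout_attention(scores, query_idx, key_idx):
    """Attention weights with every edge (query in query_idx -> key in key_idx) removed.

    scores[q][k] is the pre-softmax attention score of query token q on key token k.
    """
    blocked_q, blocked_k = set(query_idx), set(key_idx)
    masked = [
        [-math.inf if (q in blocked_q and k in blocked_k) else s
         for k, s in enumerate(row)]
        for q, row in enumerate(scores)
    ]
    return [softmax(row) for row in masked]


# Hypothetical layout: tokens 0-2 are video tokens, tokens 3-4 are language tokens.
video_tokens, language_tokens = [0, 1, 2], [3, 4]
scores = [[0.0] * 5 for _ in range(5)]

# Language-to-Video knockout: language queries can no longer read video keys.
weights = knockout_attention(scores, language_tokens, video_tokens)
```

Under the same assumptions, a temporal or spatial knockout would pass video-token indices as both queries and keys, restricted to the frame pairs or positions being ablated; applying the mask only within a window of layers gives the layer-wise settings the abstract describes.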

[85] Transfer learning optimization based on evolutionary selective fine tuning

Jacinto Colan,Ana Davila,Yasuhisa Hasegawa

Main category: cs.CV

TL;DR: BioTune is a method that uses an evolutionary algorithm to selectively fine-tune model layers, improving transfer learning efficiency and reducing computational cost.

Details Motivation: Traditional fine-tuning usually updates all model parameters, which can lead to overfitting and higher computational cost. Method: BioTune uses an evolutionary algorithm to identify a set of layers to fine-tune in order to optimize model performance on a given target task. Result: Evaluation on nine image classification datasets from different domains shows that BioTune achieves competitive or improved accuracy and efficiency compared to existing fine-tuning methods such as AutoRGN and LoRA. Conclusion: BioTune is an evolutionary adaptive fine-tuning technique that improves transfer learning efficiency and reduces computational cost. Abstract: Deep learning has shown substantial progress in image analysis. However, the computational demands of large, fully trained models remain a consideration. Transfer learning offers a strategy for adapting pre-trained models to new tasks. Traditional fine-tuning often involves updating all model parameters, which can potentially lead to overfitting and higher computational costs. This paper introduces BioTune, an evolutionary adaptive fine-tuning technique that selectively fine-tunes layers to enhance transfer learning efficiency. BioTune employs an evolutionary algorithm to identify a focused set of layers for fine-tuning, aiming to optimize model performance on a given target task. Evaluation across nine image classification datasets from various domains indicates that BioTune achieves competitive or improved accuracy and efficiency compared to existing fine-tuning methods such as AutoRGN and LoRA. By concentrating the fine-tuning process on a subset of relevant layers, BioTune reduces the number of trainable parameters, potentially leading to decreased computational cost and facilitating more efficient transfer learning across diverse data characteristics and distributions.
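
The abstract does not spell out BioTune's encoding or genetic operators, so the following is only a generic sketch of evolutionary layer selection under stated assumptions: each candidate is a boolean mask over layers (True means "fine-tune this layer"), selection keeps the better half of the population, and children come from one-point crossover plus bit-flip mutation. The function name, the hyperparameters, and the toy fitness function are all hypothetical; in practice the fitness would be an expensive measurement such as validation accuracy after briefly fine-tuning the selected layers.

```python
import random


def evolve_layer_mask(n_layers, fitness, generations=20, pop_size=12,
                      mutation_rate=0.1, seed=0):
    """Search for a boolean layer mask that maximizes `fitness`.

    Returns (best_mask, history), where history[g] is the best fitness in
    generation g. Requires n_layers >= 2 and pop_size >= 4.
    """
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_layers)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # elitism: the better half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_layers)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation: toggle each gene with probability mutation_rate.
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness), history


# Toy stand-in for "accuracy after fine-tuning": agreement with a made-up ideal mask.
ideal = [True, False, True, True, False, False]


def toy_fitness(mask):
    return sum(m == t for m, t in zip(mask, ideal))


best_mask, best_history = evolve_layer_mask(len(ideal), toy_fitness)
```

Because the better half of each generation is carried over unchanged, the best fitness never decreases from one generation to the next, which matters when every fitness evaluation is a costly fine-tuning run.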

[86] Image-Conditioned 3D Gaussian Splat Quantization

Xinshuang Liu,Runfa Blark Li,Keito Suzuki,Truong Nguyen

Main category: cs.CV

TL;DR: This paper proposes ICGS-Quantizer, a new compression method for 3D Gaussian splats that improves compression efficiency by exploiting inter-Gaussian and inter-attribute correlations and shared codebooks, and adapts to scene changes by conditioning decoding on images captured at decoding time.

Details Motivation: Existing 3DGS compression methods fall short for large-scale scenes and for use cases that require long-term archival and adaptation to scene changes. Method: ICGS-Quantizer exploits inter-Gaussian and inter-attribute correlations, uses codebooks shared across training scenes, and trains the encoding, quantization, and decoding processes jointly. Result: ICGS-Quantizer reduces 3DGS storage requirements to the kilobyte range and outperforms existing methods in compression efficiency and adaptability to scene changes. Conclusion: ICGS-Quantizer effectively improves the compression efficiency of 3D Gaussian splats and adapts to scene changes through image conditioning at decoding time. Abstract: 3D Gaussian Splatting (3DGS) has attracted considerable attention for enabling high-quality real-time rendering. Although 3DGS compression methods have been proposed for deployment on storage-constrained devices, two limitations hinder archival use: (1) they compress medium-scale scenes only to the megabyte range, which remains impractical for large-scale scenes or extensive scene collections; and (2) they lack mechanisms to accommodate scene changes after long-term archival. To address these limitations, we propose an Image-Conditioned Gaussian Splat Quantizer (ICGS-Quantizer) that substantially enhances compression efficiency and provides adaptability to scene changes after archiving. ICGS-Quantizer improves quantization efficiency by jointly exploiting inter-Gaussian and inter-attribute correlations and by using shared codebooks across all training scenes, which are then fixed and applied to previously unseen test scenes, eliminating the overhead of per-scene codebooks. This approach effectively reduces the storage requirements for 3DGS to the kilobyte range while preserving visual fidelity. To enable adaptability to post-archival scene changes, ICGS-Quantizer conditions scene decoding on images captured at decoding time. The encoding, quantization, and decoding processes are trained jointly, ensuring that the codes, which are quantized representations of the scene, are effective for conditional decoding. We evaluate ICGS-Quantizer on 3D scene compression and 3D scene updating. Experimental results show that ICGS-Quantizer consistently outperforms state-of-the-art methods in compression efficiency and adaptability to scene changes.
Our code, model, and data will be publicly available on GitHub.

[87] DriveSplat: Decoupled Driving Scene Reconstruction with Geometry-enhanced Partitioned Neural Gaussians

Cong Wang,Xianda Guo,Wenbo Xu,Wei Tian,Ruiqi Song,Chenming Zhang,Lingxi Li,Long Chen

Main category: cs.CV

TL;DR: DriveSplat improves 3D scene reconstruction for driving scenarios by better handling dynamic and static elements with region-wise voxel initialization and deformable Gaussians.

Details Motivation: Current 3D Gaussian Splatting methods struggle with motion blur, background optimization, and geometric accuracy in driving scenarios. Method: DriveSplat uses region-wise voxel initialization, deformable neural Gaussians, and a deformation network, supervised by depth and normal priors. Result: DriveSplat achieves state-of-the-art performance in novel-view synthesis on the Waymo and KITTI datasets. Conclusion: DriveSplat provides high-quality 3D scene reconstruction for driving scenarios by improving dynamic-static decoupling and geometric representation. Abstract: In the realm of driving scenarios, the presence of rapidly moving vehicles, pedestrians in motion, and large-scale static backgrounds poses significant challenges for 3D scene reconstruction. Recent methods based on 3D Gaussian Splatting address the motion blur problem by decoupling dynamic and static components within the scene. However, these decoupling strategies overlook background optimization with adequate geometry relationships and rely solely on fitting each training view by adding Gaussians. Therefore, these models exhibit limited robustness in rendering novel views and lack an accurate geometric representation. To address the above issues, we introduce DriveSplat, a high-quality reconstruction method for driving scenarios based on neural Gaussian representations with dynamic-static decoupling. To better accommodate the predominantly linear motion patterns of driving viewpoints, a region-wise voxel initialization scheme is employed, which partitions the scene into near, middle, and far regions to enhance close-range detail representation. Deformable neural Gaussians are introduced to model non-rigid dynamic actors, whose parameters are temporally adjusted by a learnable deformation network. The entire framework is further supervised by depth and normal priors from pre-trained models, improving the accuracy of geometric structures. 
Our method has been rigorously evaluated on the Waymo and KITTI datasets, demonstrating state-of-the-art performance in novel-view synthesis for driving scenarios.

[88] DIO: Refining Mutual Information and Causal Chain to Enhance Machine Abstract Reasoning Ability

Ruizhuo Song,Beiming Yuan

Main category: cs.CV

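TL;DR: This paper studies the limitations of deep learning models on abstract reasoning tasks (RPM problems) and proposes improvements based on causal chain modeling to strengthen their reasoning ability.

Details Motivation: Although current deep learning models perform well across many domains, they still face a fundamental bottleneck in abstract reasoning. To address this, the academic community introduced Raven's Progressive Matrices (RPM) problems as an authoritative benchmark for evaluating the abstract reasoning capabilities of deep learning algorithms. The paper aims to improve the abstract reasoning ability of machine intelligence by solving RPM problems. Method: The paper adopts a "causal chain modeling" perspective to systematically analyze the complete causal chain in RPM tasks and designs the network architecture of the baseline model DIO accordingly. It then analyzes the limitations of DIO's optimization objective experimentally and proposes three improvement methods to overcome them. Result: Experiments show that the optimization objective of the baseline DIO does not enable the model to genuinely acquire the predefined human reasoning logic, mainly because of the tightness of the mutual information lower bound and because mutual information, as a statistical measure, cannot capture causal relationships. Conclusion: Optimizing the model by maximizing the variational lower bound of mutual information between the context and the correct option does not enable it to genuinely acquire the predefined human reasoning logic, mainly because of the tightness of the lower bound and because a statistical measure cannot capture the causal relationship between subjects and objects. To address these issues, the paper progressively proposes three improvement methods. Abstract: Despite the outstanding performance of current deep learning models across various domains, their fundamental bottleneck in abstract reasoning remains unresolved. To address this challenge, the academic community has introduced Raven's Progressive Matrices (RPM) problems as an authoritative benchmark for evaluating the abstract reasoning capabilities of deep learning algorithms, with a focus on core intelligence dimensions such as abstract reasoning, pattern recognition, and complex problem-solving. Therefore, this paper centers on solving RPM problems, aiming to contribute to enhancing the abstract reasoning abilities of machine intelligence. Firstly, this paper adopts a "causal chain modeling" perspective to systematically analyze the complete causal chain in RPM tasks: image → abstract attributes → progressive attribute patterns → pattern consistency → correct answer. Based on this analysis, the network architecture of the baseline model DIO is designed. However, experiments reveal that the optimization objective formulated for DIO, namely maximizing the variational lower bound of mutual information between the context and the correct option, fails to enable the model to genuinely acquire the predefined human reasoning logic.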
This is attributed to two main reasons: the tightness of the lower bound significantly impacts the effectiveness of mutual information maximization, and mutual information, as a statistical measure, does not capture the causal relationship between subjects and objects. To overcome these limitations, this paper progressively proposes three improvement methods:

[89] Spiking Variational Graph Representation Inference for Video Summarization

Wenrui Li,Wei Han,Liang-Jian Deng,Ruiqin Xiong,Xiaopeng Fan

Main category: cs.CV

TL;DR: This paper proposes SpiVG, a novel video summarization framework using SNN-based keyframe extraction, dynamic graph reasoning, and variational inference to improve performance and reduce noise and computational complexity.

Details Motivation: Existing video summarization methods struggle with capturing global temporal dependencies, maintaining semantic coherence, and handling noise during multi-channel feature fusion, especially with the rise of short video content. Method: The proposed Spiking Variational Graph (SpiVG) Network includes a keyframe extractor based on Spiking Neural Networks (SNN), a Dynamic Aggregation Graph Reasoner for fine-grained reasoning, and a Variational Inference Reconstruction Module using ELBO optimization and posterior distribution regularization. Result: Experimental results show that SpiVG outperforms existing methods on datasets like SumMe, TVSum, VideoXum, and QFVS. Conclusion: SpiVG surpasses existing methods in video summarization across multiple datasets, while reducing computational complexity and addressing noise in multi-channel feature fusion. Abstract: With the rise of short video content, efficient video summarization techniques for extracting key information have become crucial. However, existing methods struggle to capture the global temporal dependencies and maintain the semantic coherence of video content. Additionally, these methods are also influenced by noise during multi-channel feature fusion. We propose a Spiking Variational Graph (SpiVG) Network, which enhances information density and reduces computational complexity. First, we design a keyframe extractor based on Spiking Neural Networks (SNN), leveraging the event-driven computation mechanism of SNNs to learn keyframe features autonomously. To enable fine-grained and adaptable reasoning across video frames, we introduce a Dynamic Aggregation Graph Reasoner, which decouples contextual object consistency from semantic perspective coherence. We present a Variational Inference Reconstruction Module to address uncertainty and noise arising during multi-channel feature fusion. 
In this module, we employ Evidence Lower Bound Optimization (ELBO) to capture the latent structure of multi-channel feature distributions, using posterior distribution regularization to reduce overfitting. Experimental results show that SpiVG surpasses existing methods across multiple datasets such as SumMe, TVSum, VideoXum, and QFVS. Our codes and pre-trained models are available at https://github.com/liwrui/SpiVG.
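
For reference, the evidence lower bound that the abstract's ELBO optimization refers to has the following generic form. This is the textbook bound for a latent-variable model, with $x$ the observed (here, fused multi-channel) features, $z$ the latent variable, $q_\phi$ the approximate posterior, and $p(z)$ the prior; these symbols are assumptions, and how SpiVG instantiates each term or regularizes the posterior is not specified in the abstract.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\;-\; \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```

The first term rewards reconstructions of the features from the latent code, and the KL term keeps the approximate posterior close to the prior, which is the usual route by which such a module limits overfitting.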

[90] From Linearity to Non-Linearity: How Masked Autoencoders Capture Spatial Correlations

Anthony Bisulco,Rahul Ramesh,Randall Balestriero,Pratik Chaudhari

Main category: cs.CV

TL;DR: This paper explores how Masked Autoencoders (MAEs) learn spatial correlations in images and how hyperparameters like masking ratio and patch size can be used to control the types of features captured, offering insights into selecting these parameters for practical applications.

Details Motivation: Despite the effectiveness of MAEs as a pretraining technique for vision foundation models, they require extensive hyperparameter tuning when applied to new datasets, and the connection between MAE hyperparameters and performance on downstream tasks is not well understood. Method: The authors analytically derive the features learned by a linear MAE and extend this analysis to non-linear MAEs to understand how MAEs learn spatial correlations in the input image. Result: The study shows that masking ratio and patch size can be used to select for features that capture short- and long-range spatial correlations, and MAE representations adapt to spatial correlations in the dataset beyond second-order statistics. Conclusion: MAE representations adapt to spatial correlations in the dataset beyond second-order statistics, and this work provides insights on how to select MAE hyperparameters in practice. Abstract: Masked Autoencoders (MAEs) have emerged as a powerful pretraining technique for vision foundation models. Despite their effectiveness, they require extensive hyperparameter tuning (masking ratio, patch size, encoder/decoder layers) when applied to novel datasets. While prior theoretical works have analyzed MAEs in terms of their attention patterns and hierarchical latent variable models, the connection between MAE hyperparameters and performance on downstream tasks is relatively unexplored. This work investigates how MAEs learn spatial correlations in the input image. We analytically derive the features learned by a linear MAE and show that masking ratio and patch size can be used to select for features that capture short- and long-range spatial correlations. We extend this analysis to non-linear MAEs to show that MAE representations adapt to spatial correlations in the dataset, beyond second-order statistics. Finally, we discuss some insights on how to select MAE hyper-parameters in practice.

[91] Bidirectional Temporal Information Propagation for Moving Infrared Small Target Detection

Dengyan Luo,Yanping Xiang,Hu Wang,Luping Ji,Shuai Li,Mao Ye

Main category: cs.CV

TL;DR: This paper introduces BIRD, a bidirectional temporal information propagation method for moving infrared small target detection, which effectively utilizes both local and global temporal information and achieves excellent performance with fast inference speed.

Details Motivation: Existing sliding-window-based multi-frame methods do not consider joint optimization of the entire video clip and ignore global temporal information, leading to redundant computation and sub-optimal performance. This paper aims to address these limitations. Method: The paper proposes a Bidirectional temporal information propagation method (BIRD) that utilizes both local and global temporal information from adjacent, past, and future frames through Local Temporal Motion Fusion (LTMF) and Global Temporal Motion Fusion (GTMF) modules. Features are fused bidirectionally and optimized using detection loss and Spatio-Temporal Fusion (STF) loss. Result: Extensive experiments show that the proposed BIRD method achieves state-of-the-art performance and demonstrates fast inference speed. Conclusion: BIRD is an effective method for moving infrared small target detection that achieves state-of-the-art performance and fast inference speed. Abstract: Moving infrared small target detection is broadly adopted in infrared search and track systems, and has attracted considerable research focus in recent years. The existing learning-based multi-frame methods mainly aggregate the information of adjacent frames in a sliding window fashion to assist the detection of the current frame. However, the sliding-window-based methods do not consider joint optimization of the entire video clip and ignore the global temporal information outside the sliding window, resulting in redundant computation and sub-optimal performance. In this paper, we propose a Bidirectional temporal information propagation method for moving InfraRed small target Detection, dubbed BIRD. The bidirectional propagation strategy simultaneously utilizes local temporal information of adjacent frames and global temporal information of past and future frames in a recursive fashion. 
Specifically, in the forward and backward propagation branches, we first design a Local Temporal Motion Fusion (LTMF) module to model local spatio-temporal dependency between a target frame and its two adjacent frames. Then, a Global Temporal Motion Fusion (GTMF) module is developed to further aggregate the global propagation feature with the local fusion feature. Finally, the bidirectional aggregated features are fused and input into the detection head for detection. In addition, the entire video clip is jointly optimized by the traditional detection loss and the additional Spatio-Temporal Fusion (STF) loss. Extensive experiments demonstrate that the proposed BIRD method not only achieves the state-of-the-art performance but also shows a fast inference speed.

[92] A Curated Dataset and Deep Learning Approach for Minor Dent Detection in Vehicles

Danish Zia Baig,Mohsin Kamal

Main category: cs.CV

TL;DR: This paper proposes a deep learning-based method using the YOLOv8 framework to automatically detect microscopic surface flaws like tiny dents on car exteriors, achieving high accuracy with the YOLOv8m-t42 model.

Details Motivation: Conventional car damage inspection techniques are labor-intensive, manual, and often miss small surface imperfections like microscopic dents. There is a need for faster and more precise inspection methods. Method: The paper employs the YOLOv8 object recognition framework, training models YOLOv8m, YOLOv8m-t4, and YOLOv8m-t42 on a bespoke dataset of annotated car surface images with real-time data augmentation to detect microscopic surface flaws like tiny dents. Result: The YOLOv8m-t42 model achieved a precision of 0.86, recall of 0.84, F1-score of 0.85, and a PR curve area of 0.88, outperforming YOLOv8m-t4. It demonstrated excellent detection accuracy and low inference latency, making it suitable for real-time applications, although with a slightly reduced mAP@0.5:0.95 of 0.20 and mAP@0.5 of 0.60. Conclusion: YOLOv8m-t42 is more accurate and suitable for practical dent detection applications despite slower convergence compared to YOLOv8m-t4. Abstract: Conventional car damage inspection techniques are labor-intensive, manual, and frequently overlook tiny surface imperfections like microscopic dents. Machine learning provides an innovative solution to the increasing demand for quicker and more precise inspection methods. The paper uses the YOLOv8 object recognition framework to provide a deep learning-based solution for automatically detecting microscopic surface flaws, notably tiny dents, on car exteriors. Traditional automotive damage inspection procedures are manual, time-consuming, and frequently unreliable at detecting tiny flaws. To solve this, a bespoke dataset containing annotated photos of car surfaces under various lighting circumstances, angles, and textures was created. To improve robustness, the YOLOv8m model and its customized variants, YOLOv8m-t4 and YOLOv8m-t42, were trained employing real-time data augmentation approaches. 
Experimental results show that the technique has excellent detection accuracy and low inference latency, making it suitable for real-time applications such as automated insurance evaluations and automobile inspections. Evaluation metrics such as mean Average Precision (mAP), precision, recall, and F1-score verified the model's efficacy. With a precision of 0.86, recall of 0.84, and F1-score of 0.85, the YOLOv8m-t42 model outperformed the YOLOv8m-t4 model (precision: 0.81, recall: 0.79, F1-score: 0.80) in identifying microscopic surface defects. The mAP@0.5 for YOLOv8m-t42 stabilized at 0.60, with a slightly lower mAP@0.5:0.95 of 0.20. Furthermore, YOLOv8m-t42's PR curve area was 0.88, suggesting more consistent performance than YOLOv8m-t4 (0.82). YOLOv8m-t42 has greater accuracy and is more appropriate for practical dent detection applications, even though its convergence is slower.
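
As a quick consistency check on the figures quoted above, the F1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall values; the function name `f1_score` and the dictionary layout are illustrative choices, not anything taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as quoted in the abstract.
reported = {
    "YOLOv8m-t42": (0.86, 0.84, 0.85),
    "YOLOv8m-t4": (0.81, 0.79, 0.80),
}

for name, (p, r, f1) in reported.items():
    # Compare the recomputed value, rounded to two decimals, with the reported one.
    print(name, round(f1_score(p, r), 2), f1)
```

Rounded to two decimals, both recomputed values agree with the reported F1-scores.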

[93] Aligning Moments in Time using Video Queries

Yogesh Kumar,Uday Agarwal,Manish Gupta,Anand Mishra

Main category: cs.CV

TL;DR: This paper introduces MATR, a transformer-based model with dual-stage alignment and self-supervised pre-training, which achieves significant improvements in video-to-video moment retrieval performance.

Details Motivation: The task of video-to-video moment retrieval poses challenges like semantic frame-level alignment and modeling dependencies between videos, which motivated the development of a more effective solution. Method: MATR uses a transformer-based architecture with dual-stage sequence alignment to model query-target video dependencies and includes a self-supervised pre-training technique for task-specific initialization. Result: MATR achieves a 13.1% improvement in R@1 and 8.1% in mIoU on the ActivityNet-VRL dataset, and a 14.7% gain in R@1 and 14.4% in mIoU on the SportsMoments dataset. Conclusion: The proposed MATR model achieves significant performance improvements in video-to-video moment retrieval tasks compared to state-of-the-art methods, as demonstrated on two datasets. Abstract: Video-to-video moment retrieval (Vid2VidMR) is the task of localizing unseen events or moments in a target video using a query video. This task poses several challenges, such as the need for semantic frame-level alignment and modeling complex dependencies between query and target videos. To tackle this challenging problem, we introduce MATR (Moment Alignment TRansformer), a transformer-based model designed to capture semantic context as well as the temporal details necessary for precise moment localization. MATR conditions target video representations on query video features using dual-stage sequence alignment that encodes the required correlations and dependencies. These representations are then used to guide foreground/background classification and boundary prediction heads, enabling the model to accurately identify moments in the target video that semantically match with the query video. Additionally, to provide a strong task-specific initialization for MATR, we propose a self-supervised pre-training technique that involves training the model to localize random clips within videos. 
Extensive experiments demonstrate that MATR achieves notable performance improvements of 13.1% in R@1 and 8.1% in mIoU on an absolute scale compared to state-of-the-art methods on the popular ActivityNet-VRL dataset. Additionally, on our newly proposed dataset, SportsMoments, MATR shows a 14.7% gain in R@1 and a 14.4% gain in mIoU on an absolute scale over strong baselines.
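
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how R@1 and mIoU are commonly computed for moment retrieval. The IoU threshold of 0.5 and the `(start, end)` tuple format are assumptions made for illustration; the abstract does not state the evaluation protocol.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, threshold=0.5):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return hits / len(gts)


def mean_iou(top1_preds, gts):
    """Average temporal IoU of the top-1 predictions."""
    return sum(temporal_iou(p, g) for p, g in zip(top1_preds, gts)) / len(gts)


# Two hypothetical queries: one partial overlap, one complete miss.
preds = [(2.0, 8.0), (0.0, 3.0)]
truth = [(4.0, 10.0), (5.0, 9.0)]
print(recall_at_1(preds, truth), mean_iou(preds, truth))
```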

[94] Enhancing Novel View Synthesis from extremely sparse views with SfM-free 3D Gaussian Splatting Framework

Zongqi He,Hanmin Li,Kin-Chung Chan,Yushen Zuo,Hao Xie,Zhe Xiao,Jun Xiao,Kin-Man Lam

Main category: cs.CV

TL;DR: This paper proposes an SfM-free 3DGS method that jointly estimates camera poses and reconstructs 3D scenes from extremely sparse-view inputs, significantly improving rendering quality and geometric accuracy.

Details Motivation: 3D Gaussian Splatting relies on precise camera poses and dense multi-view inputs, conditions that are hard to meet in real-world scenarios, so a more robust method for 3D reconstruction and view synthesis from sparse views is needed. Method: A dense stereo module estimates camera poses and reconstructs a global dense point cloud, a coherent view interpolation module supplies additional supervision, and multi-scale Laplacian consistent regularization together with adaptive spatial-aware multi-scale geometry regularization improves the quality of geometric structures and rendered content. Result: Experiments show that under extremely sparse-view conditions (only 2 training views) the method improves PSNR by 2.75dB over existing 3DGS methods, and the generated images exhibit minimal distortion while preserving rich high-frequency details, yielding superior visual quality. Conclusion: The paper proposes an SfM-free 3DGS method that jointly estimates camera poses and reconstructs 3D scenes from extremely sparse-view inputs, significantly improving rendering quality and geometric accuracy under extremely sparse-view conditions. Abstract: 3D Gaussian Splatting (3DGS) has demonstrated remarkable real-time performance in novel view synthesis, yet its effectiveness relies heavily on dense multi-view inputs with precisely known camera poses, which are rarely available in real-world scenarios. When input views become extremely sparse, the Structure-from-Motion (SfM) method that 3DGS depends on for initialization fails to accurately reconstruct the 3D geometric structures of scenes, resulting in degraded rendering quality. In this paper, we propose a novel SfM-free 3DGS-based method that jointly estimates camera poses and reconstructs 3D scenes from extremely sparse-view inputs. Specifically, instead of SfM, we propose a dense stereo module that progressively estimates camera pose information and reconstructs a global dense point cloud for initialization. To address the inherent problem of information scarcity in extremely sparse-view settings, we propose a coherent view interpolation module that interpolates camera poses based on training view pairs and generates viewpoint-consistent content as additional supervision signals for training. Furthermore, we introduce multi-scale Laplacian consistent regularization and adaptive spatial-aware multi-scale geometry regularization to enhance the quality of geometrical structures and rendered content. Experiments show that our method significantly outperforms other state-of-the-art 3DGS-based approaches, achieving a remarkable 2.75dB improvement in PSNR under extremely sparse-view conditions (using only 2 training views). 
The images synthesized by our method exhibit minimal distortion while preserving rich high-frequency details, resulting in superior visual quality compared to existing techniques.

[95] LGMSNet: Thinning a medical image segmentation model via dual-level multiscale fusion

Chengqi Dong,Fenghe Tang,Rongge Mao,Xinpei Gao,S. Kevin Zhou

Main category: cs.CV

TL;DR: LGMSNet is an efficient medical image segmentation framework that addresses channel redundancy and improves global contextual perception, performing well across multiple datasets and suiting resource-constrained clinical settings.

Details Motivation: Resource-constrained clinical settings need lightweight, generalizable models for medical image segmentation, but existing models often sacrifice performance for efficiency and overlook channel redundancy. Method: The paper proposes LGMSNet, a novel lightweight framework based on local and global dual multiscale, which uses heterogeneous intra-layer kernels and sparse transformer-convolutional hybrid branches. Result: Extensive experiments on six public datasets show that LGMSNet outperforms existing state-of-the-art methods and maintains exceptional performance in zero-shot generalization tests. Conclusion: LGMSNet shows potential for deployment in resource-limited medical scenarios, performing especially well in zero-shot generalization tests. Abstract: Medical image segmentation plays a pivotal role in disease diagnosis and treatment planning, particularly in resource-constrained clinical settings where lightweight and generalizable models are urgently needed. However, existing lightweight models often compromise performance for efficiency and rarely adopt computationally expensive attention mechanisms, severely restricting their global contextual perception capabilities. Additionally, current architectures neglect the channel redundancy issue under the same convolutional kernels in medical imaging, which hinders effective feature extraction. To address these challenges, we propose LGMSNet, a novel lightweight framework based on local and global dual multiscale that achieves state-of-the-art performance with minimal computational overhead. LGMSNet employs heterogeneous intra-layer kernels to extract local high-frequency information while mitigating channel redundancy. In addition, the model integrates sparse transformer-convolutional hybrid branches to capture low-frequency global information. Extensive experiments across six public datasets demonstrate LGMSNet's superiority over existing state-of-the-art methods. In particular, LGMSNet maintains exceptional performance in zero-shot generalization tests on four unseen datasets, underscoring its potential for real-world deployment in resource-limited medical scenarios. The full project code is available at https://github.com/cq-dong/LGMSNet.

[96] MExECON: Multi-view Extended Explicit Clothed humans Optimized via Normal integration

Fulden Ece Uğur,Rafael Redondo,Albert Barreiro,Stefan Hristov,Roger Marí

Main category: cs.CV

TL;DR: MExECON is a novel pipeline for 3D reconstruction of clothed human avatars from multi-view RGB images, combining the JMBO algorithm and normal map integration to improve reconstruction accuracy without re-training.

Details Motivation: To improve the geometry and body pose estimation of clothed human avatars by leveraging multiple viewpoints, building on the capabilities of the single-view method ECON. Method: MExECON uses the Joint Multi-view Body Optimization (JMBO) algorithm to fit a single SMPL-X body model across all input views and integrates normal maps from multiple views for detailed surface reconstruction. Result: MExECON consistently improves reconstruction fidelity over the single-view baseline and performs competitively against modern few-shot 3D reconstruction methods. Conclusion: MExECON is an effective method for 3D reconstruction of clothed human avatars from sparse multi-view RGB images, achieving improved fidelity without requiring network re-training. Abstract: This work presents MExECON, a novel pipeline for 3D reconstruction of clothed human avatars from sparse multi-view RGB images. Building on the single-view method ECON, MExECON extends its capabilities to leverage multiple viewpoints, improving geometry and body pose estimation. At the core of the pipeline is the proposed Joint Multi-view Body Optimization (JMBO) algorithm, which fits a single SMPL-X body model jointly across all input views, enforcing multi-view consistency. The optimized body model serves as a low-frequency prior that guides the subsequent surface reconstruction, where geometric details are added via normal map integration. MExECON integrates normal maps from both front and back views to accurately capture fine-grained surface details such as clothing folds and hairstyles. All multi-view gains are achieved without requiring any network re-training. Experimental results show that MExECON consistently improves fidelity over the single-view baseline and achieves competitive performance compared to modern few-shot 3D reconstruction methods.

[97] Task-Generalized Adaptive Cross-Domain Learning for Multimodal Image Fusion

Mengyu Wang,Zhenyu Liu,Kun Li,Yu Wang,Yuwei Wang,Yanyan Wei,Fei Wang

Main category: cs.CV

TL;DR: This paper proposes AdaSFFuse, a novel framework for task-generalized Multimodal Image Fusion (MMIF), which provides superior fusion performance with low computational cost and a compact network.

Details Motivation: Current Multimodal Image Fusion (MMIF) methods face challenges such as modality misalignment, high-frequency detail destruction, and task-specific limitations. This work aims to address these issues with a novel framework for task-generalized MMIF. Method: The paper proposes AdaSFFuse, which includes the Adaptive Approximate Wavelet Transform (AdaWAT) for frequency decoupling and Spatial-Frequency Mamba Blocks for efficient multimodal fusion. Result: Extensive experiments on four MMIF tasks demonstrate that AdaSFFuse improves the alignment and integration of multimodal features, reduces frequency loss, and preserves critical details. Conclusion: AdaSFFuse provides superior fusion performance with low computational cost and a compact network, offering a strong balance between performance and efficiency. Abstract: Multimodal Image Fusion (MMIF) aims to integrate complementary information from different imaging modalities to overcome the limitations of individual sensors. It enhances image quality and facilitates downstream applications such as remote sensing, medical diagnostics, and robotics. Despite significant advancements, current MMIF methods still face challenges such as modality misalignment, high-frequency detail destruction, and task-specific limitations. To address these challenges, we propose AdaSFFuse, a novel framework for task-generalized MMIF through adaptive cross-domain co-fusion learning. AdaSFFuse introduces two key innovations: the Adaptive Approximate Wavelet Transform (AdaWAT) for frequency decoupling, and the Spatial-Frequency Mamba Blocks for efficient multimodal fusion. AdaWAT adaptively separates the high- and low-frequency components of multimodal images from different scenes, enabling fine-grained extraction and alignment of distinct frequency characteristics for each modality. The Spatial-Frequency Mamba Blocks facilitate cross-domain fusion in both spatial and frequency domains, enhancing this process. 
These blocks dynamically adjust through learnable mappings to ensure robust fusion across diverse modalities. By combining these components, AdaSFFuse improves the alignment and integration of multimodal features, reduces frequency loss, and preserves critical details. Extensive experiments on four MMIF tasks -- Infrared-Visible Image Fusion (IVF), Multi-Focus Image Fusion (MFF), Multi-Exposure Image Fusion (MEF), and Medical Image Fusion (MIF) -- demonstrate AdaSFFuse's superior fusion performance, ensuring both low computational cost and a compact network, offering a strong balance between performance and efficiency. The code will be publicly available at https://github.com/Zhen-yu-Liu/AdaSFFuse.

[98] ExtraGS: Geometric-Aware Trajectory Extrapolation with Uncertainty-Guided Generative Priors

Kaiyuan Tan,Yingying Shen,Haohui Zhu,Zhiwei Zhan,Shan Zhao,Mingfei Tu,Hongcheng Luo,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye

Main category: cs.CV

TL;DR: ExtraGS is a novel framework for trajectory extrapolation in autonomous driving simulations that combines geometric and generative priors, resulting in more realistic and geometrically consistent views.

Details Motivation: Synthesizing extrapolated views from driving logs is essential for autonomous driving simulations, but existing methods suffer from poor geometric consistency and over-smoothed renderings. Method: ExtraGS uses a hybrid Gaussian-Signed Distance Function (SDF) design called Road Surface Gaussian (RSG), Far Field Gaussians (FFG) with learnable scaling factors, and a self-supervised uncertainty estimation framework based on spherical harmonics to selectively integrate generative priors. Result: Extensive experiments show that ExtraGS significantly improves the realism and geometric consistency of extrapolated views across multiple datasets, multi-camera setups, and various generative priors, while maintaining high fidelity along the original trajectory. Conclusion: ExtraGS provides a holistic framework for trajectory extrapolation that combines geometric and generative priors, enhancing realism and geometric consistency in extrapolated views for autonomous driving simulations. Abstract: Synthesizing extrapolated views from recorded driving logs is critical for simulating driving scenes for autonomous driving vehicles, yet it remains a challenging task. Recent methods leverage generative priors as pseudo ground truth, but often lead to poor geometric consistency and over-smoothed renderings. To address these limitations, we propose ExtraGS, a holistic framework for trajectory extrapolation that integrates both geometric and generative priors. At the core of ExtraGS is a novel Road Surface Gaussian (RSG) representation based on a hybrid Gaussian-Signed Distance Function (SDF) design, and Far Field Gaussians (FFG) that use learnable scaling factors to efficiently handle distant objects. Furthermore, we develop a self-supervised uncertainty estimation framework based on spherical harmonics that enables selective integration of generative priors only where extrapolation artifacts occur. 
Extensive experiments on multiple datasets, diverse multi-camera setups, and various generative priors demonstrate that ExtraGS significantly enhances the realism and geometric consistency of extrapolated views, while preserving high fidelity along the original trajectory.

[99] Multi-Object Sketch Animation with Grouping and Motion Trajectory Priors

Guotao Liang,Juncheng Hu,Ximing Xing,Jing Zhang,Qian Yu

Main category: cs.CV

TL;DR: GroupSketch improves vector sketch animation by addressing multi-object interactions and complex motions through a two-stage pipeline, outperforming current approaches.

Details Motivation: Existing methods struggle with multi-object interactions, temporal inconsistency, and poor generalization, limiting their effectiveness in complex scenarios. Method: GroupSketch uses a two-stage pipeline: Motion Initialization for coarse animation through interpolation and Motion Refinement using a Group-based Displacement Network (GDN) to enhance the animation with group-specific displacement fields. Result: GroupSketch generates high-quality, temporally consistent animations for complex, multi-object sketches, expanding the practical applications of sketch animation. Conclusion: GroupSketch effectively handles multi-object interactions and complex motions in vector sketch animation, outperforming existing methods in generating high-quality, temporally consistent animations. Abstract: We introduce GroupSketch, a novel method for vector sketch animation that effectively handles multi-object interactions and complex motions. Existing approaches struggle with these scenarios, either being limited to single-object cases or suffering from temporal inconsistency and poor generalization. To address these limitations, our method adopts a two-stage pipeline comprising Motion Initialization and Motion Refinement. In the first stage, the input sketch is interactively divided into semantic groups and key frames are defined, enabling the generation of a coarse animation via interpolation. In the second stage, we propose a Group-based Displacement Network (GDN), which refines the coarse animation by predicting group-specific displacement fields, leveraging priors from a text-to-video model. GDN further incorporates specialized modules, such as Context-conditioned Feature Enhancement (CCFE), to improve temporal consistency. 
Extensive experiments demonstrate that our approach significantly outperforms existing methods in generating high-quality, temporally consistent animations for complex, multi-object sketches, thus expanding the practical applications of sketch animation.

[100] D3FNet: A Differential Attention Fusion Network for Fine-Grained Road Structure Extraction in Remote Perception Systems

Chang Liu,Yang Xu,Tamas Sziranyi

Main category: cs.CV

TL;DR: D3FNet is designed to overcome challenges in extracting narrow roads from high-resolution remote sensing imagery by using a unique network structure with three key innovations, leading to improved performance on road region detection.

Details Motivation: Extracting narrow roads from high-resolution remote sensing imagery is challenging due to their limited width, fragmented topology, and frequent occlusions. Method: D3FNet, a Dilated Dual-Stream Differential Attention Fusion Network, is proposed for fine-grained road structure segmentation. It introduces three key innovations: a Differential Attention Dilation Extraction module, a Dual-stream Decoding Fusion Mechanism, and a multi-scale dilation strategy. Result: Extensive experiments on the DeepGlobe and CHN6-CUG benchmarks show that D3FNet achieves superior IoU and recall on challenging road regions, outperforming state-of-the-art baselines. Conclusion: D3FNet is a robust solution for fine-grained narrow road extraction in complex remote and cooperative perception scenarios. Abstract: Extracting narrow roads from high-resolution remote sensing imagery remains a significant challenge due to their limited width, fragmented topology, and frequent occlusions. To address these issues, we propose D3FNet, a Dilated Dual-Stream Differential Attention Fusion Network designed for fine-grained road structure segmentation in remote perception systems. Built upon the encoder-decoder backbone of D-LinkNet, D3FNet introduces three key innovations: (1) a Differential Attention Dilation Extraction (DADE) module that enhances subtle road features while suppressing background noise at the bottleneck; (2) a Dual-stream Decoding Fusion Mechanism (DDFM) that integrates original and attention-modulated features to balance spatial precision with semantic context; and (3) a multi-scale dilation strategy (rates 1, 3, 5, 9) that mitigates gridding artifacts and improves continuity in narrow road prediction. Unlike conventional models that overfit to generic road widths, D3FNet specifically targets fine-grained, occluded, and low-contrast road segments. 
Extensive experiments on the DeepGlobe and CHN6-CUG benchmarks show that D3FNet achieves superior IoU and recall on challenging road regions, outperforming state-of-the-art baselines. Ablation studies further verify the complementary synergy of attention-guided encoding and dual-path decoding. These results confirm D3FNet as a robust solution for fine-grained narrow road extraction in complex remote and cooperative perception scenarios.

[101] Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment

Youjia Zhang,Youngeun Kim,Young-Geun Choi,Hongyeob Kim,Huiling Liu,Sungeun Hong

Main category: cs.CV

TL;DR: This paper proposes ADAPT, a test-time adaptation method that achieves training-free inference through Gaussian probabilistic inference and addresses the limitations of existing methods.

Details Motivation: Existing test-time adaptation methods rely on backpropagation or iterative optimization and lack explicit modeling of class-conditional feature distributions, which limits their scalability and the reliability of their decision boundaries. Method: ADAPT reframes TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods with gradually updated class means and a shared covariance matrix. Result: ADAPT achieves state-of-the-art performance under a wide range of distribution shifts, with superior scalability and robustness. Conclusion: ADAPT is a state-of-the-art method that requires no source data, no gradient updates, and no full access to target data, and it supports both online and transductive settings. Abstract: Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
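
The closed-form inference described in the abstract can be illustrated with textbook Gaussian discriminant analysis: when every class shares one covariance matrix, the class posterior is a softmax over linear scores and needs no gradient steps. The sketch below is a generic 2-D illustration under that assumption, not the authors' ADAPT implementation; it omits the CLIP-guided regularization and the knowledge bank, and every function name and the posterior-weighted mean update are hypothetical.

```python
import math


def inv2x2(m):
    """Invert a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("covariance is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def gaussian_posteriors(x, means, cov, priors):
    """Closed-form class posteriors for Gaussians sharing one covariance."""
    prec = inv2x2(cov)
    logits = []
    for mu, p in zip(means, priors):
        w = matvec(prec, mu)  # Sigma^{-1} mu_k
        logits.append(dot(w, x) - 0.5 * dot(w, mu) + math.log(p))
    top = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(l - top) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def update_mean(mean, count, x, weight):
    """Running-mean update that folds in x with a confidence weight."""
    new_count = count + weight
    new_mean = [(m * count + weight * xi) / new_count for m, xi in zip(mean, x)]
    return new_mean, new_count


# Hypothetical two-class state: initial means, pseudo-counts, shared covariance.
means = [[0.0, 0.0], [2.0, 2.0]]
counts = [1.0, 1.0]
cov = [[1.0, 0.0], [0.0, 1.0]]
priors = [0.5, 0.5]

x = [1.8, 2.1]  # one incoming unlabeled test feature
post = gaussian_posteriors(x, means, cov, priors)
pred = max(range(len(post)), key=post.__getitem__)
# Gradually update only the predicted class mean, weighted by its posterior.
means[pred], counts[pred] = update_mean(means[pred], counts[pred], x, post[pred])
```

Because each prediction only needs a matrix-vector product and a running average, the per-sample cost stays constant, which is the property the abstract highlights for real-time deployment.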

[102] High-Frequency First: A Two-Stage Approach for Improving Image INR

Sumit Kumar Dam,Mrityunjoy Gain,Eui-Nam Huh,Choong Seon Hong

Main category: cs.CV

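TL;DR: This paper proposes a new two-stage training strategy that adaptively emphasizes high-frequency details, mitigating spectral bias in implicit neural representations and improving image reconstruction quality.

Details Motivation: Neural networks exhibit spectral bias, favoring low-frequency components while struggling to capture high-frequency details such as sharp edges and fine textures. Existing solutions focus on architectural modifications or specialized activation functions, whereas this paper explores an orthogonal direction: directly guiding the training process. Method: A neighbor-aware soft mask adaptively assigns higher weights to pixels with strong local variations, within a two-stage training strategy that first focuses on fine details and then transitions to full-image training. Result: Experiments show that the method consistently improves image reconstruction quality and complements existing INR methods. Conclusion: The paper proposes a new two-stage training strategy that adaptively emphasizes pixels with strong local variations to mitigate spectral bias in neural networks, improving the image reconstruction quality of implicit neural representations (INRs). Abstract: Implicit Neural Representations (INRs) have emerged as a powerful alternative to traditional pixel-based formats by modeling images as continuous functions over spatial coordinates. A key challenge, however, lies in the spectral bias of neural networks, which tend to favor low-frequency components while struggling to capture high-frequency (HF) details such as sharp edges and fine textures. While prior approaches have addressed this limitation through architectural modifications or specialized activation functions, we propose an orthogonal direction by directly guiding the training process. Specifically, we introduce a two-stage training strategy where a neighbor-aware soft mask adaptively assigns higher weights to pixels with strong local variations, encouraging early focus on fine details. The model then transitions to full-image training. Experimental results show that our approach consistently improves reconstruction quality and complements existing INR methods. As a pioneering attempt to assign frequency-aware importance to pixels in image INR, our work offers a new avenue for mitigating the spectral bias problem.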
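
One way to picture the two-stage idea is a per-pixel loss weight that grows with local intensity variation during the first stage and reverts to uniform weighting afterwards. The abstract does not give the mask formula, so everything below is an assumption for illustration: the 8-neighbour mean absolute difference, the `alpha` scaling, and the hard switch at `switch_step` are hypothetical choices rather than the authors' design.

```python
def local_variation(img):
    """Mean absolute difference between each pixel and its 8-neighbourhood."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            diffs = [
                abs(img[y][x] - img[ny][nx])
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            ]
            out[y][x] = sum(diffs) / len(diffs) if diffs else 0.0
    return out


def soft_mask(img, alpha=4.0):
    """Weights in [1, 1 + alpha]; larger where local variation is stronger."""
    var = local_variation(img)
    peak = max(max(row) for row in var)
    if peak == 0:
        return [[1.0] * len(row) for row in var]
    return [[1.0 + alpha * v / peak for v in row] for row in var]


def weighted_mse(pred, target, mask):
    """Mask-weighted mean squared error, normalised by the total weight."""
    num = den = 0.0
    for prow, trow, mrow in zip(pred, target, mask):
        for p, t, m in zip(prow, trow, mrow):
            num += m * (p - t) ** 2
            den += m
    return num / den


def stage_loss(pred, target, mask, step, switch_step):
    """Stage 1 (step < switch_step): masked loss. Stage 2: uniform loss."""
    if step < switch_step:
        return weighted_mse(pred, target, mask)
    uniform = [[1.0] * len(row) for row in target]
    return weighted_mse(pred, target, uniform)


# Toy 4x4 image with a vertical edge, and a prediction that blurs the edge.
target = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
blurred = [[0.0, 0.5, 0.5, 1.0] for _ in range(4)]
mask = soft_mask(target)
early = stage_loss(blurred, target, mask, step=0, switch_step=100)
late = stage_loss(blurred, target, mask, step=100, switch_step=100)
```

In this toy case the blurred edge is penalised more heavily in the first stage than in the second, which is the intended effect of emphasising high-frequency pixels early in training.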

[103] Fast globally optimal Truncated Least Squares point cloud registration with fixed rotation axis

Ivo Ivanov,Carsten Markgraf

Main category: cs.CV

TL;DR: This paper introduces a fast linear-time method for globally optimal point cloud registration under rotation, significantly outperforming existing approaches in speed, though limited to rotation-only problems.

Details Motivation: Existing globally optimal approaches, such as those based on semidefinite programming (SDP), are too slow for practical use, often requiring hundreds of seconds to solve for just 100 points. This motivates the need for a faster method. Method: A novel linear time convex relaxation and a contractor method to accelerate Branch and Bound (BnB) are proposed, focusing on the rotation-only truncated least squares (TLS) problem. Result: The proposed solver achieves provable global optimality in less than half a second for 3D point cloud registration with 100 points when the rotation axis is provided, making it two orders of magnitude faster than the state-of-the-art SDP solver STRIDE. Conclusion: The proposed method significantly improves the speed of solving the point cloud registration problem with a TLS formulation, achieving global optimality in linear time for rotation-only problems, although it cannot yet solve the full 6DoF problem. Abstract: Recent results showed that point cloud registration with given correspondences can be made robust to outlier rates of up to 95\% using the truncated least squares (TLS) formulation. However, solving this combinatorial optimization problem to global optimality is challenging. Provably globally optimal approaches using semidefinite programming (SDP) relaxations take hundreds of seconds for 100 points. In this paper, we propose a novel linear time convex relaxation as well as a contractor method to speed up Branch and Bound (BnB). Our solver can register two 3D point clouds with 100 points to provable global optimality in less than half a second when the axis of rotation is provided. Although it currently cannot solve the full 6DoF problem, it is two orders of magnitude faster than the state-of-the-art SDP solver STRIDE when solving the rotation-only TLS problem. 
In addition to providing a formal proof for global optimality, we present empirical evidence of global optimality using adversarial instances with local minima close to the global minimum.
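
To make the rotation-only TLS objective concrete, here is a minimal sketch that evaluates the truncated least squares cost for a rotation about a known axis (taken to be the z-axis) and minimizes it by exhaustive grid search over the angle. The grid search is only a stand-in for illustration; it is not the paper's convex relaxation or Branch and Bound solver, and the names `rotate_z`, `tls_cost`, `grid_search`, and the threshold `eps` are assumptions.

```python
import math

def rotate_z(p, theta):
    """Rotate a 3D point p about the z-axis by theta radians."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y, z)

def tls_cost(theta, src, dst, eps):
    """Truncated least squares: each squared residual is capped at eps**2,
    so an outlier correspondence adds at most a constant to the cost."""
    total = 0.0
    for p, q in zip(src, dst):
        r = rotate_z(p, theta)
        sq = sum((a - b) ** 2 for a, b in zip(r, q))
        total += min(sq, eps ** 2)
    return total

def grid_search(src, dst, eps, steps=3600):
    """Brute-force minimizer over a uniform angle grid (illustrative only)."""
    best = min(range(steps),
               key=lambda k: tls_cost(2 * math.pi * k / steps, src, dst, eps))
    return 2 * math.pi * best / steps
```

Because each residual is capped, correspondences that never align contribute the same constant at every angle and therefore cannot pull the minimizer away from the inlier consensus.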

[104] Multi-perspective monitoring of wildlife and human activities from camera traps and drones with deep learning models

Hao Chen,Fang Qiu,Li An,Douglas Stow,Eve Bohnett,Haitao Lyu,Shuang Tian

Main category: cs.CV

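TL;DR: This study combines camera traps and drone thermal imaging with deep learning models to monitor wildlife and human activities from multiple perspectives and to identify potential human-wildlife conflict zones, supporting landscape management.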

Details Motivation: Understanding the spatial distribution of wildlife and human activities is essential for assessing human-wildlife conflict and for effective conservation planning. Method: Camera traps and drone thermal imaging are combined with deep learning models (YOLOv11s and an enhanced Faster RCNN) to monitor wildlife and human activities from multiple perspectives, and spatial pattern analysis is used to identify hotspots and potential conflict zones. Result: YOLOv11s achieved the highest performance on camera trap images, with a precision of 96.2%, a recall of 92.3%, and mAP50 values of 96.7% and 81.3%. Drone thermal imagery combined with the enhanced Faster RCNN model provided a complementary aerial perspective, and spatial pattern analysis identified hotspots of wildlife and human activities and their overlap. Conclusion: The study shows the value of integrating multiperspective monitoring with automated object detection for wildlife monitoring and landscape management, particularly its potential for identifying and assessing human-wildlife conflict. Abstract: Wildlife and human activities are key components of landscape systems. Understanding their spatial distribution is essential for evaluating human-wildlife interactions and informing effective conservation planning. This study performs multiperspective monitoring of wildlife and human activities by combining camera traps and drone imagery, capturing the spatial patterns of their distributions, which allows the identification of the overlap of their activity zones and the assessment of the degree of human-wildlife conflict. The study was conducted in Chitwan National Park (CNP), Nepal, and adjacent regions. Images collected by visible and near-infrared camera traps and thermal infrared drones from February to July 2022 were processed to create training and testing datasets, which were used to build deep learning models to automatically identify wildlife and human activities. Drone-collected thermal imagery was used for detecting targets to provide a multiple monitoring perspective. Spatial pattern analysis was performed to identify animal and resident activity hotspots and delineate potential human-wildlife conflict zones. Among the deep learning models tested, YOLOv11s achieved the highest performance with a precision of 96.2%, recall of 92.3%, mAP50 of 96.7%, and mAP50 of 81.3%, making it the most effective for detecting objects in camera trap imagery. Drone-based thermal imagery, analyzed with an enhanced Faster RCNN model, added a complementary aerial viewpoint for camera trap detections. 
Spatial pattern analysis identified clear hotspots for both wildlife and human activities and their overlapping patterns within certain areas in the CNP and buffer zones, indicating potential conflict. This study reveals human-wildlife conflicts within the conserved landscape. Integrating multiperspective monitoring with automated object detection enhances wildlife surveillance and landscape management.

[105] When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding

Pengcheng Fang,Yuxia Chen,Rui Guo

Main category: cs.CV

TL;DR: This paper introduces Grounded VideoDiT, a Video LLM that improves temporal perception and language vision alignment through three novel techniques, achieving strong performance on video understanding tasks.

Details Motivation: Existing Video LLMs struggle with temporal perception, including implicit timestamp encoding, weak frame-level features, and drifting language-vision alignment, which limits their ability to understand videos in a fine-grained manner. Method: The paper introduces three key innovations: a Diffusion Temporal Latent (DTL) encoder for better temporal consistency, object-grounded representations for improved alignment with visual evidence, and a mixed token scheme with discrete temporal tokens for explicit timestamp modeling. Result: Grounded VideoDiT achieves state-of-the-art results on Charades STA, NExT GQA, and multiple VideoQA benchmarks, demonstrating its effectiveness in fine-grained temporal reasoning and robust grounding capabilities. Conclusion: Grounded VideoDiT achieves state-of-the-art results on multiple video understanding benchmarks by addressing limitations in temporal perception and language-vision alignment in existing Video LLMs. Abstract: Understanding videos requires more than answering open-ended questions; it demands the ability to pinpoint when events occur and how entities interact across time. While recent Video LLMs have achieved remarkable progress in holistic reasoning, they remain coarse in temporal perception: timestamps are encoded only implicitly, frame-level features are weak in capturing continuity, and language-vision alignment often drifts from the entities of interest. In this paper, we present Grounded VideoDiT, a Video LLM designed to overcome these limitations by introducing three key innovations. First, a Diffusion Temporal Latent (DTL) encoder enhances boundary sensitivity and maintains temporal consistency. Second, object-grounded representations explicitly bind query entities to localized visual evidence, strengthening alignment. Third, a mixed token scheme with discrete temporal tokens provides explicit timestamp modeling, enabling fine-grained temporal reasoning. 
Together, these designs equip Grounded VideoDiT with robust grounding capabilities, as validated by state-of-the-art results on Charades STA, NExT GQA, and multiple VideoQA benchmarks.

[106] Weakly-Supervised Learning for Tree Instances Segmentation in Airborne Lidar Point Clouds

Swann Emilien Céleste Destouches,Jesse Lahaye,Laurent Valentin Jospin,Jan Skaloud

Main category: cs.CV

TL;DR: This paper proposes a weakly supervised approach to improve tree instance segmentation in airborne laser scanning data, achieving a 34% improvement while reducing dependency on precisely labeled data.

Details Motivation: Tree instance segmentation of ALS data is crucial for forest monitoring but is challenging due to data variability and the high cost of obtaining precisely labeled datasets for supervised methods. Method: A weakly supervised method was proposed, using initial segmentation results from a non-finetuned model or closed-form algorithm, human-assisted quality ratings, and iterative model finetuning based on rating model feedback. Result: The proposed method improved the segmentation model by 34% in correctly identifying tree instances and significantly reduced the number of non-tree instances predicted. Conclusion: The proposed weakly supervised approach improves tree instance segmentation in ALS data by 34% while reducing reliance on large precisely labeled datasets, though it still faces challenges with small trees and complex surroundings. Abstract: Tree instance segmentation of airborne laser scanning (ALS) data is of utmost importance for forest monitoring, but remains challenging due to variations in the data caused by factors such as sensor resolution, vegetation state at acquisition time, terrain characteristics, etc. Moreover, obtaining a sufficient amount of precisely labeled data to train fully supervised instance segmentation methods is expensive. To address these challenges, we propose a weakly supervised approach where labels of an initial segmentation result obtained either by a non-finetuned model or a closed form algorithm are provided as a quality rating by a human operator. The labels produced during the quality assessment are then used to train a rating model, whose task is to classify a segmentation output into the same classes as specified by the human operator. Finally, the segmentation model is finetuned using feedback from the rating model. This in turn improves the original segmentation model by 34\% in terms of correctly identified tree instances while considerably reducing the number of non-tree instances predicted. 
Challenges still remain in data over sparsely forested regions characterized by small trees (less than two meters in height) or within complex surroundings containing shrubs, boulders, etc. which can be confused as trees where the performance of the proposed method is reduced.

[107] Towards a 3D Transfer-based Black-box Attack via Critical Feature Guidance

Shuchao Pang,Zhenghan Chen,Shen Zhang,Liming Lu,Siyuan Liang,Anan Du,Yongbin Zhou

Main category: cs.CV

TL;DR: CFG is a transfer-based black-box attack method that improves adversarial point cloud transferability without requiring target model information.

Details Motivation: Deep neural networks for 3D point clouds are vulnerable to adversarial examples, but obtaining information about target models is challenging in realistic scenarios. Method: CFG uses Critical Feature Guidance to prioritize corruption of critical features across diverse DNN architectures and constrains deviation in the loss function for imperceptibility. Result: Experiments on ModelNet40 and ScanObjectNN datasets show CFG outperforms existing attack methods significantly. Conclusion: CFG improves the transferability of adversarial point clouds without requiring information about target models, outperforming state-of-the-art methods. Abstract: Deep neural networks for 3D point clouds have been demonstrated to be vulnerable to adversarial examples. Previous 3D adversarial attack methods often exploit certain information about the target models, such as model parameters or outputs, to generate adversarial point clouds. However, in realistic scenarios, it is challenging to obtain any information about the target models under conditions of absolute security. Therefore, we focus on transfer-based attacks, where generating adversarial point clouds does not require any information about the target models. Based on our observation that the critical features used for point cloud classification are consistent across different DNN architectures, we propose CFG, a novel transfer-based black-box attack method that improves the transferability of adversarial point clouds via the proposed Critical Feature Guidance. Specifically, our method regularizes the search of adversarial point clouds by computing the importance of the extracted features, prioritizing the corruption of critical features that are likely to be adopted by diverse architectures. Further, we explicitly constrain the maximum deviation extent of the generated adversarial point clouds in the loss function to ensure their imperceptibility. 
Extensive experiments conducted on the ModelNet40 and ScanObjectNN benchmark datasets demonstrate that the proposed CFG outperforms the state-of-the-art attack methods by a large margin.

[108] MapKD: Unlocking Prior Knowledge with Cross-Modal Distillation for Efficient Online HD Map Construction

Ziyang Yan,Ruikai Li,Zhiyong Cui,Bohan Li,Han Jiang,Yilong Ren,Aoyong Li,Zhenning Li,Sijia Wen,Haiyang Yu

Main category: cs.CV

TL;DR: MapKD is a novel knowledge distillation framework that enhances online HD map construction by efficiently transferring knowledge from multimodal models to lightweight vision-based models, improving accuracy and speed.

Details Motivation: Existing methods for online HD map construction rely on outdated offline maps and multi-modal sensors, which incur unnecessary computational costs. MapKD addresses these limitations by enabling efficient, vision-centric inference. Method: MapKD utilizes a Teacher-Coach-Student paradigm with Token-Guided 2D Patch Distillation and Masked Semantic Response Distillation strategies for efficient cross-modal knowledge transfer. Result: Experiments on the nuScenes dataset show that MapKD improves student model performance by +6.68 mIoU and +10.94 mAP while accelerating inference speed. Conclusion: The proposed MapKD framework effectively transfers knowledge from a multimodal teacher model to a lightweight vision-based student model, enhancing performance while reducing computational overhead. Abstract: Online HD map construction is a fundamental task in autonomous driving systems, aiming to acquire semantic information of map elements around the ego vehicle based on real-time sensor inputs. Recently, several approaches have achieved promising results by incorporating offline priors such as SD maps and HD maps or by fusing multi-modal data. However, these methods depend on stale offline maps and multi-modal sensor suites, resulting in avoidable computational overhead at inference. To address these limitations, we employ a knowledge distillation strategy to transfer knowledge from multimodal models with prior knowledge to an efficient, low-cost, and vision-centric student model. Specifically, we propose MapKD, a novel multi-level cross-modal knowledge distillation framework with an innovative Teacher-Coach-Student (TCS) paradigm. This framework consists of: (1) a camera-LiDAR fusion model with SD/HD map priors serving as the teacher; (2) a vision-centric coach model with prior knowledge and simulated LiDAR to bridge the cross-modal knowledge transfer gap; and (3) a lightweight vision-based student model. 
Additionally, we introduce two targeted knowledge distillation strategies: Token-Guided 2D Patch Distillation (TGPD) for bird's eye view feature alignment and Masked Semantic Response Distillation (MSRD) for semantic learning guidance. Extensive experiments on the challenging nuScenes dataset demonstrate that MapKD improves the student model by +6.68 mIoU and +10.94 mAP while simultaneously accelerating inference speed. The code is available at: https://github.com/2004yan/MapKD2026.

[109] CM2LoD3: Reconstructing LoD3 Building Models Using Semantic Conflict Maps

Franz Hanke,Antonia Bieringer,Olaf Wysocki,Boris Jutzi

Main category: cs.CV

TL;DR: This paper proposes CM2LoD3, an automated method for generating detailed LoD3 building models using Conflict Maps and semantic segmentation, enabling scalable and efficient 3D urban modeling with 61% performance in reconstructing building features.

Details Motivation: LoD3 building models are essential for advanced urban analysis but have traditionally required manual generation, limiting their large-scale adoption. This research aims to automate LoD3 reconstruction for scalable and efficient 3D city modeling. Method: CM2LoD3 uses Conflict Maps from ray-to-model-prior analysis and semantically segments them using synthetically generated data from a Semantic Conflict Map Generator (SCMG). It also integrates confidence-score-based fusion of textured model segmentation to enhance 3D reconstruction accuracy. Result: Experimental results show that the CM2LoD3 method achieves 61% performance in segmenting and reconstructing building openings, enhanced by uncertainty-aware fusion of segmented building textures. Conclusion: The research successfully introduces CM2LoD3, a novel method for reconstructing detailed LoD3 building models using Conflict Maps and semantic segmentation, advancing automated and scalable 3D city modeling. Abstract: Detailed 3D building models are crucial for urban planning, digital twins, and disaster management applications. While Level of Detail (LoD) 1 and LoD2 building models are widely available, they lack detailed facade elements essential for advanced urban analysis. In contrast, LoD3 models address this limitation by incorporating facade elements such as windows, doors, and underpasses. However, their generation has traditionally required manual modeling, making large-scale adoption challenging. In this contribution, CM2LoD3, we present a novel method for reconstructing LoD3 building models leveraging Conflict Maps (CMs) obtained from ray-to-model-prior analysis. Unlike previous works, we concentrate on semantically segmenting real-world CMs with synthetically generated CMs from our developed Semantic Conflict Map Generator (SCMG). 
We also observe that additional segmentation of textured models can be fused with CMs using confidence scores to further increase segmentation performance and thus increase 3D reconstruction accuracy. Experimental results demonstrate the effectiveness of our CM2LoD3 method in segmenting and reconstructing building openings, achieving 61% performance with uncertainty-aware fusion of segmented building textures. This research contributes to the advancement of automated LoD3 model reconstruction, paving the way for scalable and efficient 3D city modeling. Our project is available at: https://github.com/InFraHank/CM2LoD3

[110] LLM-empowered Dynamic Prompt Routing for Vision-Language Models Tuning under Long-Tailed Distributions

Yongju Jia,Jiarui Ma,Xiangxian Li,Baiqiao Zhang,Xianhui Cao,Juan Liu,Yulong Bian

Main category: cs.CV

TL;DR: This paper proposes a Multi-dimensional Dynamic Prompt Routing (MDPR) framework to address bias in class-imbalanced scenes during the fine-tuning of pre-trained vision-language models (VLMs). The framework constructs a knowledge base across five visual-semantic dimensions and uses dynamic routing to align global visual classes, retrieve optimal prompts, and balance fine-grained semantics. Experiments show that MDPR achieves comparable results with current SOTA methods while incurring minimal computational overhead.

Details Motivation: The motivation stems from the bias in class-imbalanced scenes during the fine-tuning of pre-trained vision-language models (VLMs), and the oversight of inherent class imbalance in VLMs' pre-training when using large language models (LLMs) to enhance VLM fine-tuning. This bias can accumulate and affect downstream tasks. Method: The paper proposes a Multi-dimensional Dynamic Prompt Routing (MDPR) framework that constructs a comprehensive knowledge base for classes across five visual-semantic dimensions. During fine-tuning, the dynamic routing mechanism aligns global visual classes, retrieves optimal prompts, and balances fine-grained semantics through logits fusion. Result: Extensive experiments on long-tailed benchmarks such as CIFAR-LT, ImageNet-LT, and Places-LT demonstrate that MDPR achieves comparable results with current state-of-the-art (SOTA) methods. Ablation studies confirm the effectiveness of the semantic library for tail classes and the minimal computational overhead of the dynamic routing mechanism. Conclusion: The paper concludes that MDPR is a flexible and efficient enhancement for VLM fine-tuning under data imbalance, achieving comparable results with current SOTA methods while incurring minimal computational overhead. Abstract: Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive capability in visual tasks, but their fine-tuning often suffers from bias in class-imbalanced scenes. Recent works have introduced large language models (LLMs) to enhance VLM fine-tuning with supplementary semantic information. However, they often overlook inherent class imbalance in VLMs' pre-training, which may lead to bias accumulation in downstream tasks. To address this problem, this paper proposes a Multi-dimensional Dynamic Prompt Routing (MDPR) framework. MDPR constructs a comprehensive knowledge base for classes, spanning five visual-semantic dimensions. 
During fine-tuning, the dynamic routing mechanism aligns global visual classes, retrieves optimal prompts, and balances fine-grained semantics, yielding stable predictions through logits fusion. Extensive experiments on long-tailed benchmarks, including CIFAR-LT, ImageNet-LT, and Places-LT, demonstrate that MDPR achieves comparable results with current SOTA methods. Ablation studies further confirm the effectiveness of our semantic library for tail classes, and show that our dynamic routing incurs minimal computational overhead, making MDPR a flexible and efficient enhancement for VLM fine-tuning under data imbalance.

[111] StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding

Yanlai Yang,Zhuokai Zhao,Satya Narayan Shukla,Aashu Singh,Shlok Kumar Mishra,Lizhu Zhang,Mengye Ren

Main category: cs.CV

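TL;DR: StreamMem is an efficient query-agnostic KV cache compression mechanism for processing long videos that significantly improves the video understanding capability of multimodal large language models.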

Details Motivation: Existing multimodal large language models incur substantial memory and computational overhead when processing long videos, and existing compression methods are ill-suited to streaming video understanding and multi-turn conversational settings. Method: StreamMem encodes video frames in a streaming manner and compresses the KV cache using attention scores between visual tokens and generic query tokens, while maintaining a fixed-size KV memory. Result: StreamMem achieves state-of-the-art query-agnostic KV cache compression performance on three long video understanding benchmarks and two streaming video question answering benchmarks, and is competitive with query-aware compression methods. Conclusion: StreamMem is an effective query-agnostic KV cache compression mechanism for long video understanding that performs well across multiple benchmarks. Abstract: Multimodal large language models (MLLMs) have made significant progress in visual-language reasoning, but their ability to efficiently handle long videos remains limited. Despite recent advances in long-context MLLMs, storing and attending to the key-value (KV) cache for long visual contexts incurs substantial memory and computational overhead. Existing visual compression methods require either encoding the entire visual context before compression or having access to the questions in advance, which is impractical for long video understanding and multi-turn conversational settings. In this work, we propose StreamMem, a query-agnostic KV cache memory mechanism for streaming video understanding. Specifically, StreamMem encodes new video frames in a streaming manner, compressing the KV cache using attention scores between visual tokens and generic query tokens, while maintaining a fixed-size KV memory to enable efficient question answering (QA) in memory-constrained, long-video scenarios. Evaluation on three long video understanding and two streaming video question answering benchmarks shows that StreamMem achieves state-of-the-art performance in query-agnostic KV cache compression and is competitive with query-aware compression approaches.
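
The streaming compression loop described in the StreamMem abstract can be sketched as follows. This is a simplified illustration, not the authors' method: it scores cached keys by their average softmax attention from a set of generic query vectors and keeps only the top-scoring entries, whereas the actual system may merge or otherwise process tokens. The names `attention_scores`, `compress_kv`, `stream`, and `budget` are hypothetical.

```python
import math

def attention_scores(queries, keys):
    """Average softmax attention each key receives from the generic queries."""
    d = len(keys[0])
    totals = [0.0] * len(keys)
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        s = sum(exps)
        for i, e in enumerate(exps):
            totals[i] += e / s
    return [t / len(queries) for t in totals]

def compress_kv(keys, values, queries, budget):
    """Keep at most `budget` key/value pairs, ranked by attention score,
    preserving their original temporal order."""
    if len(keys) <= budget:
        return keys, values
    scores = attention_scores(queries, keys)
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:budget]
    keep = sorted(top)
    return [keys[i] for i in keep], [values[i] for i in keep]

def stream(frames, queries, budget):
    """Process frames one at a time; memory never exceeds `budget` entries."""
    mem_k, mem_v = [], []
    for k_new, v_new in frames:
        mem_k, mem_v = compress_kv(mem_k + k_new, mem_v + v_new, queries, budget)
    return mem_k, mem_v
```

Because the queries are fixed and generic, compression does not depend on the user's eventual question, which is what makes the memory query-agnostic and reusable across conversation turns.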

[112] WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception

Zhiheng Liu,Xueqing Deng,Shoufa Chen,Angtian Wang,Qiushan Guo,Mingfei Han,Zeyue Xue,Mengzhao Chen,Ping Luo,Linjie Yang

Main category: cs.CV

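TL;DR: WorldWeaver jointly models RGB frames and perceptual conditions, using depth cues and segmented noise scheduling to effectively address temporal drift in long video generation.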

Details Motivation: Maintaining structural and temporal consistency over long sequences remains challenging in generative video modeling; current methods rely mainly on RGB signals, which leads to accumulated errors in object structure and motion. Method: WorldWeaver jointly models RGB frames and perceptual conditions, and introduces depth cues and segmented noise scheduling to improve the stability of long video generation. Result: Experiments show that WorldWeaver reduces temporal drift and improves the quality of generated videos on both diffusion-based and rectified-flow-based models. Conclusion: WorldWeaver effectively reduces temporal drift and improves the fidelity of generated videos, demonstrating its effectiveness for long video generation. Abstract: Generative video modeling has made significant strides, yet ensuring structural and temporal consistency over long sequences remains a challenge. Current methods predominantly rely on RGB signals, leading to accumulated errors in object structure and motion over extended durations. To address these issues, we introduce WorldWeaver, a robust framework for long video generation that jointly models RGB frames and perceptual conditions within a unified long-horizon modeling scheme. Our training framework offers three key advantages. First, by jointly predicting perceptual conditions and color information from a unified representation, it significantly enhances temporal consistency and motion dynamics. Second, by leveraging depth cues, which we observe to be more resistant to drift than RGB, we construct a memory bank that preserves clearer contextual information, improving quality in long-horizon video generation. Third, we employ segmented noise scheduling for training prediction groups, which further mitigates drift and reduces computational cost. Extensive experiments on both diffusion- and rectified flow-based models demonstrate the effectiveness of WorldWeaver in reducing temporal drift and improving the fidelity of generated videos.

[113] Fine-grained Multi-class Nuclei Segmentation with Molecular-empowered All-in-SAM Model

Xueyuan Li,Can Cui,Ruining Deng,Yucheng Tang,Quan Liu,Tianyuan Yao,Shunxing Bao,Naweed Chowdhury,Haichun Yang,Yuankai Huo

Main category: cs.CV

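TL;DR: This paper proposes a new molecular-empowered All-in-SAM model that aims to improve cell classification performance in computational pathology while reducing reliance on expert annotation, advancing the automation of medical diagnostics.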

Details Motivation: Existing vision foundation models struggle with fine-grained semantic segmentation, such as identifying specific nuclei subtypes or particular cells, so a new approach is needed to advance computational pathology. Method: The method takes a full-stack approach: (1) engaging lay annotators through molecular-empowered learning to reduce the need for detailed pixel-level annotations; (2) adapting the SAM model with a SAM adapter to emphasize specific semantics; and (3) improving segmentation accuracy through Molecular-Oriented Corrective Learning (MOCL). Result: Experiments on in-house and public datasets show that the All-in-SAM model significantly improves cell classification performance, even with varying annotation quality. Conclusion: The study proposes a molecular-empowered All-in-SAM model for computational pathology that not only reduces annotators' workload but also extends the accessibility of precise biomedical image analysis to resource-limited settings, thereby advancing medical diagnostics and the automation of pathology image analysis. Abstract: Purpose: Recent developments in computational pathology have been driven by advances in Vision Foundation Models, particularly the Segment Anything Model (SAM). This model facilitates nuclei segmentation through two primary methods: prompt-based zero-shot segmentation and the use of cell-specific SAM models for direct segmentation. These approaches enable effective segmentation across a range of nuclei and cells. However, general vision foundation models often face challenges with fine-grained semantic segmentation, such as identifying specific nuclei subtypes or particular cells. Approach: In this paper, we propose the molecular-empowered All-in-SAM Model to advance computational pathology by leveraging the capabilities of vision foundation models. This model incorporates a full-stack approach, focusing on: (1) annotation-engaging lay annotators through molecular-empowered learning to reduce the need for detailed pixel-level annotations, (2) learning-adapting the SAM model to emphasize specific semantics, which utilizes its strong generalizability with SAM adapter, and (3) refinement-enhancing segmentation accuracy by integrating Molecular-Oriented Corrective Learning (MOCL). Results: Experimental results from both in-house and public datasets show that the All-in-SAM model significantly improves cell classification performance, even when faced with varying annotation quality. 
Conclusions: Our approach not only reduces the workload for annotators but also extends the accessibility of precise biomedical image analysis to resource-limited settings, thereby advancing medical diagnostics and automating pathology image analysis.

[114] Waver: Wave Your Way to Lifelike Video Generation

Yifu Zhang,Hao Yang,Yuqi Zhang,Yifei Hu,Fengda Zhu,Chuang Lin,Xiaofeng Mei,Yi Jiang,Zehuan Yuan,Bingyue Peng

Main category: cs.CV

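TL;DR: Waver is a unified, high-performance image and video generation model that supports text-to-video, image-to-video, and text-to-image generation and performs strongly on multiple leaderboards.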

Details Motivation: To advance video generation technology by providing an efficient, unified model that addresses the shortcomings of existing models in complex motion capture, quality, and open-source availability. Method: A Hybrid Stream DiT architecture is introduced to improve modality alignment and speed up training convergence; a comprehensive data curation pipeline is established, and an MLLM-based video quality model is used to filter for high-quality samples. Result: Waver directly generates 5- to 10-second videos at 720p, which can be upscaled to 1080p; it ranks in the top three on both the T2V and I2V leaderboards, outperforming existing open-source models and rivaling state-of-the-art commercial solutions. Conclusion: Waver offers a high-quality, efficient solution for video generation, helping the community train video generation models more effectively and accelerating progress in video generation technology. Abstract: We present Waver, a high-performance foundation model for unified image and video generation. Waver can directly generate videos with durations ranging from 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and manually annotate and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community more efficiently train high-quality video generation models and accelerate progress in video generation technologies. Official page: https://github.com/FoundationVision/Waver.

[115] ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling

Jinhyung Park,Javier Romero,Shunsuke Saito,Fabian Prada,Takaaki Shiratori,Yichen Xu,Federica Bogo,Shoou-I Yu,Kris Kitani,Rawal Khirodkar

Main category: cs.CV

TL;DR: ATLAS is a high-fidelity 3D human body model that decouples shape and skeleton for better accuracy and customization, outperforming traditional linear approaches in capturing complex poses and body variations.

Details Motivation: Existing body models struggle with capturing detailed variations across diverse poses and shapes due to limited data and restrictive modeling assumptions. ATLAS aims to improve fidelity, control, and customization in 3D human modeling. Method: ATLAS is learned from 600k high-resolution scans using 240 synchronized cameras. It decouples shape and skeleton bases, grounding the mesh representation in the human skeleton to overcome limitations of traditional linear basis approaches. Result: ATLAS outperforms existing methods in fitting unseen subjects across diverse poses. Quantitative evaluations demonstrate that its non-linear pose correctives better capture complex poses than linear models. Conclusion: ATLAS provides a more accurate and customizable 3D human body model by decoupling shape and skeleton bases, enabling enhanced expressivity and fine-grained control over body attributes. Abstract: Parametric body models offer expressive 3D representation of humans across a wide range of poses, shapes, and facial expressions, typically derived by learning a basis over registered 3D meshes. However, existing human mesh modeling approaches struggle to capture detailed variations across diverse body poses and shapes, largely due to limited training data diversity and restrictive modeling assumptions. Moreover, the common paradigm first optimizes the external body surface using a linear basis, then regresses internal skeletal joints from surface vertices. This approach introduces problematic dependencies between internal skeleton and outer soft tissue, limiting direct control over body height and bone lengths. To address these issues, we present ATLAS, a high-fidelity body model learned from 600k high-resolution scans captured using 240 synchronized cameras. Unlike previous methods, we explicitly decouple the shape and skeleton bases by grounding our mesh representation in the human skeleton. 
This decoupling enables enhanced shape expressivity, fine-grained customization of body attributes, and keypoint fitting independent of external soft-tissue characteristics. ATLAS outperforms existing methods by fitting unseen subjects in diverse poses more accurately, and quantitative evaluations show that our non-linear pose correctives more effectively capture complex poses compared to linear models.

[116] SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass

Yanxu Meng,Haoning Wu,Ya Zhang,Weidi Xie

Main category: cs.CV

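TL;DR: SceneGen is a new framework that generates multiple 3D assets from a single scene image with efficient and robust generation, and it may advance the practical use of 3D content generation in downstream tasks.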

Details Motivation: 3D content generation has attracted significant research interest in recent years due to its applications in VR/AR and embodied AI. Method: SceneGen uses a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders and, coupled with a position head, generates 3D assets and their relative spatial positions in a single feedforward pass. Result: SceneGen simultaneously generates multiple 3D assets from a single scene image, with efficient and robust generation. Conclusion: SceneGen is a new framework for generating multiple 3D assets from a single scene image; it delivers high-quality 3D content generation and may advance its practical application in downstream tasks. Abstract: 3D content generation has recently attracted significant research interest due to its applications in VR/AR and embodied AI. In this work, we address the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architectural design enables improved generation performance with multi-image inputs; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robust generation abilities of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.

[117] Visual Autoregressive Modeling for Instruction-Guided Image Editing

Qingyang Mao,Qi Cai,Yehao Li,Yingwei Pan,Mingyue Cheng,Ting Yao,Qi Liu,Tao Mei

Main category: cs.CV

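TL;DR: This paper proposes VAREdit, an image editing framework based on autoregressive models, which addresses the editing problems of existing diffusion models through a novel SAR module and significantly improves both editing accuracy and efficiency.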

Details Motivation: The global denoising process of diffusion models entangles the edited region with the entire image context, causing unintended spurious modifications and compromising adherence to editing instructions. Autoregressive models offer a different paradigm that naturally avoids these problems through a sequential process over discrete visual tokens. Method: VAREdit adopts the visual autoregressive (VAR) framework, reframing image editing as a next-scale prediction problem, and injects scale-matched conditioning information via the SAR module to address the problem of effectively conditioning on source image tokens. Result: On standard benchmarks VAREdit outperforms diffusion models, with a GPT-Balance score more than 30% higher, and it edits a 512×512 image in only 1.2 seconds. Conclusion: VAREdit marks a significant advance in image editing: by introducing the SAR module, it achieves a GPT-Balance score more than 30% higher than diffusion-based methods on standard benchmarks and is 2.2× faster than the similarly sized UltraEdit. Abstract: Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently entangles the edited region with the entire image context, leading to unintended spurious modifications and compromised adherence to editing instructions. In contrast, autoregressive models offer a distinct paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally circumvents the adherence challenges of diffusion-based methods. In this paper, we present VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to effectively condition the source image tokens. We observe that finest-scale source features cannot effectively guide the prediction of coarser target features. To bridge this gap, we introduce a Scale-Aligned Reference (SAR) module, which injects scale-matched conditioning information into the first self-attention layer. VAREdit demonstrates significant advancements in both editing adherence and efficiency. On standard benchmarks, it outperforms leading diffusion-based methods by 30\%+ higher GPT-Balance score. Moreover, it completes a $512\times512$ editing in 1.2 seconds, making it 2.2$\times$ faster than the similarly sized UltraEdit. 
The models are available at https://github.com/HiDream-ai/VAREdit.

[118] Scaling Group Inference for Diverse and High-Quality Generation

Gaurav Parmar,Or Patashnik,Daniil Ostashev,Kuan-Chieh Wang,Kfir Aberman,Srinivasa Narasimhan,Jun-Yan Zhu

Main category: cs.CV

TL;DR: This paper introduces a novel group inference approach for generative models, enhancing the diversity and quality of multiple outputs, making them more useful and cohesive for users.

Details Motivation: The motivation stems from the redundancy in results caused by independent sampling in real-world applications where users are presented with multiple images per prompt, limiting choices and hindering creativity. Method: The method formulates group inference as a quadratic integer assignment problem, modeling candidate outputs as graph nodes and selecting subsets to optimize quality and diversity. Runtime efficiency is improved through progressive pruning of the candidate set using intermediate predictions. Result: Experiments show that the method significantly improves group diversity and quality compared to independent sampling baselines and recent inference algorithms. It also generalizes across various tasks like text-to-image, image-to-image, image prompting, and video generation. Conclusion: The introduced scalable group inference method enhances both the diversity and quality of sample groups in generative models, allowing for more cohesive output groups rather than independent samples. Abstract: Generative models typically sample outputs independently, and recent inference-time guidance and scaling algorithms focus on improving the quality of individual samples. However, in real-world applications, users are often presented with a set of multiple images (e.g., 4-8) for each prompt, where independent sampling tends to lead to redundant results, limiting user choices and hindering idea exploration. In this work, we introduce a scalable group inference method that improves both the diversity and quality of a group of samples. We formulate group inference as a quadratic integer assignment problem: candidate outputs are modeled as graph nodes, and a subset is selected to optimize sample quality (unary term) while maximizing group diversity (binary term). To substantially improve runtime efficiency, we progressively prune the candidate set using intermediate predictions, allowing our method to scale up to large candidate sets. 
Extensive experiments show that our method significantly improves group diversity and quality compared to independent sampling baselines and recent inference algorithms. Our framework generalizes across a wide range of tasks, including text-to-image, image-to-image, image prompting, and video generation, enabling generative models to treat multiple outputs as cohesive groups rather than independent samples.
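
As an illustration of the quadratic selection objective described in this abstract, the following is a minimal sketch, not the authors' implementation. It scores a subset as the sum of per-sample quality (unary term) plus a weighted sum of pairwise distances (binary term) and finds the best subset by brute force. The names `group_score`, `select_group`, and `diversity_weight`, the toy scores, and the exhaustive search are all assumptions made for this example; the abstract does not specify the solver, and the progressive pruning step is omitted here.

```python
from itertools import combinations


def group_score(subset, quality, distance, diversity_weight=1.0):
    """Score a subset: unary quality terms plus weighted pairwise diversity terms."""
    unary = sum(quality[i] for i in subset)
    binary = sum(distance[i][j] for i, j in combinations(subset, 2))
    return unary + diversity_weight * binary


def select_group(quality, distance, k, diversity_weight=1.0):
    """Brute-force search over all size-k subsets (only feasible for small candidate sets)."""
    candidates = range(len(quality))
    return max(
        combinations(candidates, k),
        key=lambda s: group_score(s, quality, distance, diversity_weight),
    )


# Hypothetical toy data: candidates 0 and 1 are high quality but near-duplicates.
quality = [0.9, 0.85, 0.5, 0.55]
distance = [
    [0.0, 0.05, 0.8, 0.7],
    [0.05, 0.0, 0.8, 0.7],
    [0.8, 0.8, 0.0, 0.3],
    [0.7, 0.7, 0.3, 0.0],
]

print(select_group(quality, distance, k=2))                        # diversity-aware choice
print(select_group(quality, distance, k=2, diversity_weight=0.0))  # quality-only choice
```

With the diversity term switched off, the two near-duplicate candidates are chosen together; with it on, one of them is replaced by a more distinct candidate, which is the redundancy problem the abstract describes. Exhaustive search grows combinatorially with the candidate count, so a practical system would need a dedicated solver or an approximation.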

[119] CineScale: Free Lunch in High-Resolution Cinematic Visual Generation

Haonan Qiu,Ning Yu,Ziqi Huang,Paul Debevec,Ziwei Liu

Main category: cs.CV

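TL;DR: CineScale is a new inference method that enables high-resolution image and video generation without extensive fine-tuning, addressing the low-quality content and repetitive patterns that existing methods produce at high resolutions.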

Details Motivation: Visual diffusion models are typically limited by the resolution of their training data and by computational resources, making it difficult to generate high-fidelity images or videos at high resolutions. Existing methods tend to produce low-quality content and repetitive patterns at high resolutions, so a more effective approach is needed. Method: CineScale is proposed, with dedicated variants for each of the two types of video generation architectures; high-resolution I2V and V2V synthesis is implemented on top of state-of-the-art open-source video generation frameworks. Result: Experiments show that CineScale excels at extending high-resolution visual generation capabilities, enabling 8K image generation without fine-tuning and 4K video generation with minimal LoRA fine-tuning. Conclusion: CineScale is a new inference paradigm that enables higher-resolution visual generation without extensive fine-tuning. Abstract: Visual diffusion models achieve remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to exhibit the untapped potential higher-resolution visual generation of pre-trained models. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns deriving from the accumulated errors. In this work, we propose CineScale, a novel inference paradigm to enable higher-resolution visual generation. To tackle the various issues introduced by the two types of video generation architectures, we propose dedicated variants tailored to each. Unlike existing baseline methods that are confined to high-resolution T2I and T2V generation, CineScale broadens the scope by enabling high-resolution I2V and V2V synthesis, built atop state-of-the-art open-source video generation frameworks. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Remarkably, our approach enables 8k image generation without any fine-tuning, and achieves 4k video generation with only minimal LoRA fine-tuning. 
Generated video samples are available at our website: https://eyeline-labs.github.io/CineScale/.

cs.AI [Back]

[120] SurgWound-Bench: A Benchmark for Surgical Wound Diagnosis

Jiahao Xu,Changchang Yin,Odysseas Chatzipanagiotou,Diamantis Tsilimigras,Kevin Clear,Bingsheng Yao,Dakuo Wang,Timothy Pawlik,Ping Zhang

Main category: cs.AI

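TL;DR: This paper introduces SurgWound, the first publicly available surgical wound dataset, and WoundQwen, a three-stage diagnostic framework built on it, aiming to advance personalized wound care and improve patient outcomes.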

Details Motivation: Data privacy concerns and the high cost of expert annotation have hindered progress in applying deep learning to preliminary surgical wound screening, and no public dataset or benchmark currently covers the various types of surgical wounds. Method: A three-stage learning framework, WoundQwen, is proposed for surgical wound diagnosis, and the first surgical wound diagnosis benchmark is established on the SurgWound dataset. Result: The paper presents SurgWound, the first open-source surgical wound dataset, containing 697 surgical wound images, and introduces the first surgical wound diagnosis benchmark, comprising visual question answering and report generation tasks. Conclusion: Together, SurgWound and WoundQwen pave the way for personalized wound care, timely intervention, and improved patient outcomes. Abstract: Surgical site infection (SSI) is one of the most common and costly healthcare-associated infections, and surgical wound care remains a significant clinical challenge in preventing SSIs and improving patient outcomes. While recent studies have explored the use of deep learning for preliminary surgical wound screening, progress has been hindered by concerns over data privacy and the high costs associated with expert annotation. Currently, no publicly available dataset or benchmark encompasses various types of surgical wounds, resulting in the absence of an open-source Surgical-Wound screening tool. To address this gap: (1) we present SurgWound, the first open-source dataset featuring a diverse array of surgical wound types. It contains 697 surgical wound images annotated by 3 professional surgeons with eight fine-grained clinical attributes. (2) Based on SurgWound, we introduce the first benchmark for surgical wound diagnosis, which includes visual question answering (VQA) and report generation tasks to comprehensively evaluate model performance. (3) Furthermore, we propose a three-stage learning framework, WoundQwen, for surgical wound diagnosis. In the first stage, we employ five independent MLLMs to accurately predict specific surgical wound characteristics. In the second stage, these predictions serve as additional knowledge inputs to two MLLMs responsible for diagnosing outcomes, which assess infection risk and guide subsequent interventions. In the third stage, we train a MLLM that integrates the diagnostic results from the previous two stages to produce a comprehensive report. 
This three-stage framework can analyze detailed surgical wound characteristics and provide subsequent instructions to patients based on surgical images, paving the way for personalized wound care, timely intervention, and improved patient outcomes.