cs.CL

[1] Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training

Jianfeng Si,Lin Sun,Zhewen Tan,Xiangzheng Zhang

Main category: cs.CL

TL;DR: A new co-training framework enables efficient and controllable content safety in LLMs, offering multiple safety behaviors through dynamic activation while reducing training complexity.

Details Motivation: Current methods like SFT and RLHF rely on multi-stage pipelines and lack post-deployment controllability; the proposed method aims to address these limitations. Method: A unified co-training framework that integrates multiple safety behaviors within a single SFT stage using system-level instructions or magic tokens for dynamic activation. Result: The co-training strategy creates a Safety Alignment Margin, allowing for fine-grained control and empirical evidence of safety robustness. The method matches the safety alignment quality of SFT+DPO, with reduced training and deployment costs. Conclusion: This work presents a scalable, efficient, and highly controllable solution for LLM content safety. Abstract: Current methods for content safety in Large Language Models (LLMs), such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), often rely on multi-stage training pipelines and lack fine-grained, post-deployment controllability. To address these limitations, we propose a unified co-training framework that efficiently integrates multiple safety behaviors: positive (lawful/prosocial), negative (unfiltered/risk-prone) and rejective (refusal-oriented/conservative) within a single SFT stage. Notably, each behavior is dynamically activated via a simple system-level instruction, or magic token, enabling stealthy and efficient behavioral switching at inference time. This flexibility supports diverse deployment scenarios, such as positive for safe user interaction, negative for internal red-teaming, and rejective for context-aware refusals triggered by upstream moderation signals. This co-training strategy induces a distinct Safety Alignment Margin in the output space, characterized by well-separated response distributions corresponding to each safety mode. The existence of this margin provides empirical evidence for the model's safety robustness and enables unprecedented fine-grained control. 
Experiments show that our method matches the safety alignment quality of SFT+DPO, with our 8B model notably surpassing DeepSeek-R1 (671B) in safety performance, while significantly reducing both training complexity and deployment costs. This work presents a scalable, efficient, and highly controllable solution for LLM content safety.

[2] Preliminary Ranking of WMT25 General Machine Translation Systems

Tom Kocmi,Eleftherios Avramidis,Rachel Bawden,Ondřej Bojar,Konstantin Dranch,Anton Dvorkovich,Sergey Dukanov,Natalia Fedorova,Mark Fishel,Markus Freitag,Thamme Gowda,Roman Grundkiewicz,Barry Haddow,Marzena Karpinska,Philipp Koehn,Howard Lakougna,Jessica Lundin,Kenton Murray,Masaaki Nagata,Stefano Perrella,Lorenzo Proietti,Martin Popel,Maja Popović,Parker Riley,Mariya Shmatova,Steinþór Steingrímsson,Lisa Yankovskaya,Vilém Zouhar

Main category: cs.CL

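TL;DR: This paper shares the preliminary ranking of the WMT25 General Machine Translation task and notes that automatic evaluation may favor systems that use re-ranking techniques; the official ranking will be based on more reliable human evaluation.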

Details Motivation: The paper aims to share preliminary results to help participants prepare their system submission papers. Method: The paper presents the preliminary ranking of the WMT25 General Machine Translation Shared Task and points out its possible biases. Result: The results indicate that automatic evaluation may favor certain systems, so the official ranking will be based on human evaluation. Conclusion: The paper concludes that the automatic ranking may be biased toward systems that use re-ranking techniques, and that human evaluation will provide a more reliable ranking. Abstract: We present the preliminary ranking of the WMT25 General Machine Translation Shared Task, in which MT systems have been evaluated using automatic metrics. As this ranking is based on automatic evaluations, it may be biased in favor of systems that employ re-ranking techniques, such as Quality Estimation re-ranking or Minimum Bayes Risk decoding. The official WMT25 ranking will be based on human evaluation, which is more reliable and will supersede the automatic ranking. The purpose of this report is not to present the final findings of the General MT task, but rather to share preliminary results with task participants, which may be useful when preparing their system submission papers.
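
To make the re-ranking bias concrete, here is a minimal sketch of Minimum Bayes Risk (MBR) decoding, one of the techniques the abstract names. It picks the candidate that agrees most with the other candidates under a utility function. The unigram-F1 utility and the example sentences are assumptions for illustration only; real systems typically use a learned metric as the utility, which is why metric-based rankings may favor them.

```python
from collections import Counter


def unigram_f1(hyp: str, ref: str) -> float:
    """Toy utility: F1 over whitespace tokens (stand-in for a learned metric)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates: list[str]) -> str:
    """Return the candidate with the highest average utility against the others."""
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(unigram_f1(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


# Hypothetical candidate translations sampled from one system.
candidates = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog ran away",
]
print(mbr_select(candidates))
```

Because the selected output is the one that maximizes the chosen utility, evaluating with the same or a closely related metric can overstate quality, which is consistent with the caveat in the abstract.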

[3] Bridging the Culture Gap: A Framework for LLM-Driven Socio-Cultural Localization of Math Word Problems in Low-Resource Languages

Israel Abebe Azime,Tadesse Destaw Belay,Dietrich Klakow,Philipp Slusallek,Anshuman Chhabra

Main category: cs.CL

TL;DR: This paper proposes an LLM-based framework to generate culturally localized math datasets for low-resource languages, addressing English-centric bias and improving the evaluation and performance of multilingual models.

Details Motivation: The motivation stems from the lack of culturally relevant datasets in low-resource languages, which hampers the development and evaluation of multilingual mathematical reasoning. Existing datasets, created via translation, often retain English-centric entities, limiting the ability to assess true multilingual reasoning capabilities. Method: The paper introduces an LLM-driven framework for cultural localization that automatically generates math word problems with native entities (such as names, organizations, and currencies) in low-resource languages. It compares the performance of models on traditionally translated datasets versus localized datasets through extensive experiments. Result: Experiments show that translated benchmarks may misrepresent multilingual math abilities when socio-cultural contexts are considered. The proposed framework successfully reduces English-centric entity bias and improves model robustness when native entities are used across multiple languages. Conclusion: The paper concludes that the proposed LLM-driven framework effectively addresses the issue of English-centric bias in multilingual math benchmarks by enabling the creation of culturally localized datasets, thereby improving the robustness of models in multilingual settings. Abstract: Large language models (LLMs) have demonstrated significant capabilities in solving mathematical problems expressed in natural language. However, multilingual and culturally-grounded mathematical reasoning in low-resource languages lags behind English due to the scarcity of socio-cultural task datasets that reflect accurate native entities such as person names, organization names, and currencies. Existing multilingual benchmarks are predominantly produced via translation and typically retain English-centric entities, owing to the high cost associated with human annotator-based localization. 
Moreover, automated localization tools are limited, and hence, truly localized datasets remain scarce. To bridge this gap, we introduce a framework for LLM-driven cultural localization of math word problems that automatically constructs datasets with native names, organizations, and currencies from existing sources. We find that translated benchmarks can obscure true multilingual math ability under appropriate socio-cultural contexts. Through extensive experiments, we also show that our framework can help mitigate English-centric entity bias and improve robustness when native entities are introduced across various languages.

[4] Improving LLMs for Machine Translation Using Synthetic Preference Data

Dario Vajda,Domen Vreš,Marko Robnik-Šikonja

Main category: cs.CL

TL;DR: This paper improves the GaMS-9B-Instruct model for machine translation using DPO training with a limited dataset. The enhanced model outperforms baseline models in translation quality and avoids language and formatting errors more consistently.

Details Motivation: The research aims to improve the performance of general instruction-tuned large language models in machine translation with limited, easily produced data resources. Method: Direct Preference Optimization (DPO) training was used on a curated subset of a public dataset. The training dataset was generated by translating English Wikipedia articles using two LLMs, GaMS-9B-Instruct and EuroLLM-9B-Instruct, and ranking the translations using heuristics and automatic evaluation metrics like COMET. Result: The fine-tuned model outperformed both models used for dataset generation, achieving a COMET score gain of approximately 0.04 and 0.02, respectively. Conclusion: The fine-tuned model consistently avoids language and formatting errors and performs better than the baseline models. Abstract: Large language models have emerged as effective machine translation systems. In this paper, we explore how a general instruction-tuned large language model can be improved for machine translation using relatively few easily produced data resources. Using Slovene as a use case, we improve the GaMS-9B-Instruct model using Direct Preference Optimization (DPO) training on a programmatically curated and enhanced subset of a public dataset. As DPO requires pairs of quality-ranked instances, we generated its training dataset by translating English Wikipedia articles using two LLMs, GaMS-9B-Instruct and EuroLLM-9B-Instruct. We ranked the resulting translations based on heuristics coupled with automatic evaluation metrics such as COMET. The evaluation shows that our fine-tuned model outperforms both models involved in the dataset generation. In comparison to the baseline models, the fine-tuned model achieved a COMET score gain of around 0.04 and 0.02, respectively, on translating Wikipedia articles. It also more consistently avoids language and formatting errors.
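
The data construction and training objective described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the scalar quality scores, and the `beta` default are assumptions. Only the loss itself follows the standard DPO formulation, which compares policy and reference log-probabilities of the chosen and rejected outputs.

```python
import math


def build_preference_pair(source, translation_a, translation_b, score_a, score_b):
    """Turn two scored translations of one source into a DPO training pair.

    The scores are assumed to come from some quality signal (for example a
    heuristic combined with an automatic metric); ties yield no pair.
    """
    if score_a == score_b:
        return None
    if score_a > score_b:
        chosen, rejected = translation_a, translation_b
    else:
        chosen, rejected = translation_b, translation_a
    return {"prompt": source, "chosen": chosen, "rejected": rejected}


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) == log(1 + exp(-m))
    return math.log1p(math.exp(-margin))


# Hypothetical example: system B's translation scored higher than system A's.
pair = build_preference_pair("Hello.", "Zdravo .", "Pozdravljeni.", 0.71, 0.83)
print(pair["chosen"], dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss equals log 2 when the policy and the reference agree, and it falls as the policy assigns relatively more probability to the chosen translation than the reference does.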

[5] Multilingual Datasets for Custom Input Extraction and Explanation Requests Parsing in Conversational XAI Systems

Qianli Wang,Tatiana Anikina,Nils Feldhus,Simon Ostermann,Fedor Splitt,Jiaao Li,Yoana Tsoneva,Sebastian Möller,Vera Schmitt

Main category: cs.CL

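TL;DR: This paper describes the challenges ConvXAI systems face in multilingual generalization and support for free-form inputs, and proposes two datasets, MultiCoXQL and Compass, along with a new parsing approach to address them.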

Details Motivation: Existing ConvXAI systems recognize user intent well in English but are limited in multilingual generalization and support for free-form inputs. The authors propose new solutions to overcome these challenges. Method: The authors first introduce MultiCoXQL, a multilingual dataset covering five different languages. They also propose a new parsing approach and evaluate three large language models (LLMs) on MultiCoXQL using different parsing strategies. In addition, they develop Compass, a new multilingual dataset for custom input extraction in ConvXAI systems, and conduct monolingual, cross-lingual, and multilingual evaluations of several models. Result: The paper presents the MultiCoXQL and Compass datasets and, by evaluating the performance of different models, validates the effectiveness of the new parsing approach in multilingual ConvXAI systems. Conclusion: The paper successfully addresses the challenges of multilingual generalization and free-form input support in ConvXAI systems and advances the field through new datasets and a new parsing approach. Abstract: Conversational explainable artificial intelligence (ConvXAI) systems based on large language models (LLMs) have garnered considerable attention for their ability to enhance user comprehension through dialogue-based explanations. Current ConvXAI systems are often based on intent recognition to accurately identify the user's desired intention and map it to an explainability method. While such methods offer great precision and reliability in discerning users' underlying intentions for English, the scarcity of training data remains a significant challenge and impedes multilingual generalization. Besides, the support for free-form custom inputs, which are user-defined data distinct from pre-configured dataset instances, remains largely limited. To bridge these gaps, we first introduce MultiCoXQL, a multilingual extension of the CoXQL dataset spanning five typologically diverse languages, including one low-resource language. Subsequently, we propose a new parsing approach aimed at enhancing multilingual parsing performance, and evaluate three LLMs on MultiCoXQL using various parsing strategies. Furthermore, we present Compass, a new multilingual dataset designed for custom input extraction in ConvXAI systems, encompassing 11 intents across the same five languages as MultiCoXQL. We conduct monolingual, cross-lingual, and multilingual evaluations on Compass, employing three LLMs of varying sizes alongside BERT-type models.

[6] Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner

Bolian Li,Yanran Wu,Xinyu Luo,Ruqi Zhang

Main category: cs.CL

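TL;DR: The paper proposes a new test-time alignment method that uses a reward-shifted speculative sampling algorithm to improve model performance while significantly reducing inference cost.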

Details Motivation: The paper is motivated by the substantial inference cost that test-time alignment techniques usually incur, which limits their practical application. Method: It introduces the reward-Shifted Speculative Sampling (SSS) algorithm, which modifies the acceptance criterion and the bonus token distribution to exploit the distributional shift between an aligned draft model and an unaligned target model and thereby recover the RLHF optimal solution. Result: The algorithm achieves superior gold reward scores in test-time weak-to-strong alignment experiments at significantly reduced inference cost, validating its effectiveness and efficiency. Conclusion: The paper concludes that the reward-shifted speculative sampling algorithm enables efficient test-time alignment without actually obtaining the RLHF optimal solution, improving model performance while significantly reducing inference cost. Abstract: Aligning large language models (LLMs) with human preferences has become a critical step in their development. Recent research has increasingly focused on test-time alignment, where additional compute is allocated during inference to enhance LLM safety and reasoning capabilities. However, these test-time alignment techniques often incur substantial inference costs, limiting their practical application. Inspired by speculative sampling acceleration, which leverages a small draft model to efficiently predict future tokens, we address the efficiency bottleneck of test-time alignment. We introduce the reward-Shifted Speculative Sampling (SSS) algorithm, in which the draft model is aligned with human preferences, while the target model remains unchanged. We theoretically demonstrate that the distributional shift between the aligned draft model and the unaligned target model can be exploited to recover the RLHF optimal solution without actually obtaining it, by modifying the acceptance criterion and bonus token distribution. Our algorithm achieves superior gold reward scores at a significantly reduced inference cost in test-time weak-to-strong alignment experiments, thereby validating both its effectiveness and efficiency.
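
For orientation, the sketch below shows one step of standard (unshifted) speculative sampling, the baseline the abstract builds on. It is not the SSS algorithm: the abstract states that SSS modifies the acceptance criterion and the bonus token distribution but does not give the modified formulas, so they are not reproduced here. The toy distributions and the function name are assumptions.

```python
import random


def speculative_step(draft_probs: dict[str, float],
                     target_probs: dict[str, float],
                     rng: random.Random) -> tuple[str, bool]:
    """One token of standard speculative sampling.

    Sample x from the draft distribution q, accept it with probability
    min(1, p(x) / q(x)) under the target distribution p, and on rejection
    resample from the residual distribution proportional to max(0, p - q).
    Returns the emitted token and whether the draft token was accepted.
    """
    tokens = list(draft_probs)
    x = rng.choices(tokens, weights=[draft_probs[t] for t in tokens])[0]
    accept_prob = min(1.0, target_probs[x] / draft_probs[x])
    if rng.random() < accept_prob:
        return x, True
    residual = {t: max(0.0, target_probs[t] - draft_probs[t]) for t in tokens}
    total = sum(residual.values())
    x = rng.choices(tokens, weights=[residual[t] / total for t in tokens])[0]
    return x, False


# Toy next-token distributions (illustrative values only).
draft = {"yes": 0.7, "no": 0.2, "maybe": 0.1}
target = {"yes": 0.4, "no": 0.4, "maybe": 0.2}
print(speculative_step(draft, target, random.Random(0)))
```

With this rule the emitted token is distributed according to the target model, which is why the technique speeds up decoding without changing the output distribution. The abstract's contribution is to change the two rules above so that the output instead follows the RLHF-optimal distribution.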

[7] LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text

MohamamdJavad Ardestani,Ehsan Kamalloo,Davood Rafiei

Main category: cs.CL

TL;DR: LongRecall is a novel three-stage recall evaluation framework designed to improve the accuracy of assessing completeness in machine-generated text, particularly for critical domains and tasks like long-form QA.

Details Motivation: Ensuring the completeness of machine-generated text is crucial in critical domains like medicine and law, and existing recall metrics have limitations due to lexical overlap errors and LLM hallucinations, necessitating a more robust evaluation framework. Method: LongRecall introduces a three-stage recall evaluation framework that decomposes answers into self-contained facts, narrows plausible candidate matches through lexical and semantic filtering, and verifies alignment through structured entailment checks. Result: LongRecall achieves significant improvements in recall accuracy on three challenging long-form QA benchmarks using both human annotations and LLM-based judges. Conclusion: LongRecall serves as a foundational building block for systematic recall assessment, demonstrating substantial improvements in recall accuracy over strong lexical and LLM-as-a-Judge baselines on challenging long-form QA benchmarks. Abstract: The completeness of machine-generated text, ensuring that it captures all relevant information, is crucial in domains such as medicine and law and in tasks like list-based question answering (QA), where omissions can have serious consequences. However, existing recall metrics often depend on lexical overlap, leading to errors with unsubstantiated entities and paraphrased answers, while LLM-as-a-Judge methods with long holistic prompts capture broader semantics but remain prone to misalignment and hallucinations without structured verification. We introduce LongRecall, a general three-stage recall evaluation framework that decomposes answers into self-contained facts, successively narrows plausible candidate matches through lexical and semantic filtering, and verifies their alignment through structured entailment checks. This design reduces false positives and false negatives while accommodating diverse phrasings and contextual variations, serving as a foundational building block for systematic recall assessment. 
We evaluate LongRecall on three challenging long-form QA benchmarks using both human annotations and LLM-based judges, demonstrating substantial improvements in recall accuracy over strong lexical and LLM-as-a-Judge baselines.
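
The three stages (decompose, filter, verify) can be sketched as a toy pipeline. Everything below is a simplified stand-in and not the LongRecall implementation: sentence splitting replaces fact decomposition, Jaccard overlap replaces the combined lexical and semantic filter, and a token-containment check replaces the entailment model. All function names and the threshold are hypothetical.

```python
def split_facts(text: str) -> list[str]:
    """Stage 1 (stand-in): treat each sentence as one self-contained fact."""
    return [s.strip() for s in text.split(".") if s.strip()]


def lexical_overlap(a: str, b: str) -> float:
    """Stage 2 (stand-in): Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def entails(candidate: str, fact: str) -> bool:
    """Stage 3 (stand-in): every token of the fact appears in the candidate.

    A real system would call an entailment model or an LLM judge here.
    """
    return set(fact.lower().split()) <= set(candidate.lower().split())


def recall(reference: str, answer: str, min_overlap: float = 0.3) -> float:
    """Fraction of reference facts supported by some shortlisted answer fact."""
    ref_facts = split_facts(reference)
    ans_facts = split_facts(answer)
    if not ref_facts:
        return 0.0
    supported = 0
    for fact in ref_facts:
        shortlist = [c for c in ans_facts if lexical_overlap(fact, c) >= min_overlap]
        if any(entails(c, fact) for c in shortlist):
            supported += 1
    return supported / len(ref_facts)


reference = "Aspirin reduces fever. Aspirin thins blood."
answer = "Aspirin reduces fever quickly. It is cheap."
print(recall(reference, answer))
```

The filtering stage exists to keep the expensive verification step cheap: only shortlisted candidates reach the entailment check, which is where a stronger model would be used in practice.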

[8] Mapping the Course for Prompt-based Structured Prediction

Matt Pauk,Maria Leonor Pacheco

Main category: cs.CL

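TL;DR: This paper explores how combining large language models (LLMs) with combinatorial inference can address the limitations of LLMs on complex reasoning tasks, studies different prompting strategies, and shows that symbolic inference improves the accuracy and consistency of predictions.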

Details Motivation: LLMs perform well on many language tasks, but their autoregressive nature makes them prone to hallucinations and errors on complex reasoning problems. The paper therefore explores how structured inference can address these issues. Method: The researchers combine LLMs with combinatorial inference, use different prompting strategies to estimate LLM confidence values, and feed these values into symbolic inference to improve prediction accuracy. They also calibrate and fine-tune the models to improve performance on structured prediction tasks. Result: Experiments show that, regardless of the prompting strategy, adding symbolic inference on top of LLMs improves the consistency and accuracy of predictions. Calibration and fine-tuning with structured objectives further improve performance on challenging tasks. Conclusion: Although LLMs excel at many tasks, structured prediction tasks still benefit from combining them with traditional inference methods to improve reliability and performance. Abstract: LLMs have been shown to be useful for a variety of language tasks, without requiring task-specific fine-tuning. However, these models often struggle with hallucinations and complex reasoning problems due to their autoregressive nature. We propose to address some of these issues, specifically in the area of structured prediction, by combining LLMs with combinatorial inference in an attempt to marry the predictive power of LLMs with the structural consistency provided by inference methods. We perform exhaustive experiments in an effort to understand which prompting strategies can effectively estimate LLM confidence values for use with symbolic inference, and show that, regardless of the prompting strategy, the addition of symbolic inference on top of prompting alone leads to more consistent and accurate predictions. Additionally, we show that calibration and fine-tuning using structured prediction objectives lead to increased performance for challenging tasks, showing that structured learning is still valuable in the era of LLMs.

[9] Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

Rabeeh Karimi Mahabadi,Sanjeev Satheesh,Shrimai Prabhumoye,Mostofa Patwary,Mohammad Shoeybi,Bryan Catanzaro

Main category: cs.CL

TL;DR: This paper presents a novel pipeline for creating high-quality math datasets from Common Crawl, resulting in improved performance in language model reasoning tasks.

Details Motivation: The motivation is to overcome the limitations of existing math-focused datasets, which suffer from poor quality due to inadequate extraction methods and data conversion issues. Method: The method involves constructing a high-quality mathematical corpus from Common Crawl using a domain-agnostic pipeline that recovers math in various formats through layout-aware rendering and LLM-based cleaning. Result: The result is the creation of Nemotron-CC-Math-3+ and Nemotron-CC-Math-4+, large-scale, high-quality datasets that outperform all prior open math datasets, leading to significant performance gains in math and code reasoning tasks. Conclusion: The study concludes that the novel pipeline for extracting scientific content from web-scale data significantly enhances the quality of pretraining corpora for LLMs, leading to improved performance in math, code, and general reasoning tasks. Abstract: Pretraining large language models (LLMs) on high-quality, structured data such as mathematics and code substantially enhances reasoning capabilities. However, existing math-focused datasets built from Common Crawl suffer from degraded quality due to brittle extraction heuristics, lossy HTML-to-text conversion, and the failure to reliably preserve mathematical structure. In this work, we introduce Nemotron-CC-Math, a large-scale, high-quality mathematical corpus constructed from Common Crawl using a novel, domain-agnostic pipeline specifically designed for robust scientific text extraction. Unlike previous efforts, our pipeline recovers math across various formats (e.g., MathJax, KaTeX, MathML) by leveraging layout-aware rendering with lynx and a targeted LLM-based cleaning stage. This approach preserves the structural integrity of equations and code blocks while removing boilerplate, standardizing notation into LaTeX representation, and correcting inconsistencies. 
We collected a large, high-quality math corpus, namely Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens). Notably, Nemotron-CC-Math-4+ not only surpasses all prior open math datasets, including MegaMath, FineMath, and OpenWebMath, but also contains 5.5 times more tokens than FineMath-4+, which was previously the highest-quality math pretraining dataset. When used to pretrain a Nemotron-T 8B model, our corpus yields +4.8 to +12.6 gains on MATH and +4.6 to +14.3 gains on MBPP+ over strong baselines, while also improving general-domain performance on MMLU and MMLU-Stem. We present the first pipeline to reliably extract scientific content, including math, from noisy web-scale data, yielding measurable gains in math, code, and general reasoning, and setting a new state of the art among open math pretraining corpora. To support open-source efforts, we release our code and datasets.

[10] Identifying and Answering Questions with False Assumptions: An Interpretable Approach

Zijie Wang,Eduardo Blanco

Main category: cs.CL

TL;DR: This paper presents an approach to identify and answer questions with false assumptions using external evidence and assumption validation, showing improved performance and interpretable results.

Details Motivation: The motivation is to address the issue of misleading answers generated by LLMs due to hallucinations when answering questions with false assumptions. Method: The method involves reducing the problem to fact verification and leveraging external evidence to mitigate hallucinations in LLMs. This includes generating and validating atomic assumptions. Result: Experiments with five LLMs showed that incorporating retrieved evidence is beneficial, and generating and validating atomic assumptions leads to more improvements and interpretable answers by identifying false assumptions. Conclusion: The study concludes that by using external evidence and validating atomic assumptions, the identification and handling of questions with false assumptions can be significantly improved, providing interpretable answers. Abstract: People often ask questions with false assumptions, a type of question that does not have regular answers. Answering such questions requires first identifying the false assumptions. Large Language Models (LLMs) often generate misleading answers because of hallucinations. In this paper, we focus on identifying and answering questions with false assumptions in several domains. We first investigate reducing the problem to fact verification. Then, we present an approach leveraging external evidence to mitigate hallucinations. Experiments with five LLMs demonstrate that (1) incorporating retrieved evidence is beneficial and (2) generating and validating atomic assumptions yields more improvements and provides an interpretable answer by specifying the false assumptions.

[11] ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following

Seungmin Han,Haeun Kwon,Ji-jun Park,Taeyang Yoon

Main category: cs.CL

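TL;DR: This paper proposes MMDR-Bench, a new benchmark for multi-modal dialogue reasoning, and CoLVLM Agent, a framework that improves LVLM performance without extensive re-training.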

Details Motivation: Current LLMs and LVLMs struggle with complex multi-modal interaction tasks, and existing benchmarks do not adequately reflect the dynamism and complexity of real-world multi-modal interactions. Method: The paper proposes the CoLVLM Agent framework, which enhances the reasoning and instruction-following capabilities of existing LVLMs through an iterative "memory-perception-planning-execution" cycle, and introduces the MMDR-Bench dataset for evaluation. Result: Experiments show that CoLVLM Agent attains an average human evaluation score of 4.03 on MMDR-Bench, clearly surpassing GPT-4o (3.92) and Gemini 1.5 Pro (3.85). Conclusion: CoLVLM Agent outperforms state-of-the-art commercial models on MMDR-Bench, shows significant advantages in reasoning depth, instruction adherence, and error suppression, and maintains robust performance over extended dialogue turns. Abstract: Despite significant advancements in Large Language Models (LLMs) and Large Vision-Language Models (LVLMs), current models still face substantial challenges in handling complex, multi-turn, and visually-grounded tasks that demand deep reasoning, sustained contextual understanding, entity tracking, and multi-step instruction following. Existing benchmarks often fall short in capturing the dynamism and intricacies of real-world multi-modal interactions, leading to issues such as context loss and visual hallucinations. To address these limitations, we introduce MMDR-Bench (Multi-Modal Dialogue Reasoning Benchmark), a novel dataset comprising 300 meticulously designed complex multi-turn dialogue scenarios, each averaging 5-7 turns and evaluated across six core dimensions including visual entity tracking and reasoning depth. Furthermore, we propose CoLVLM Agent (Contextual LVLM Agent), a holistic framework that enhances existing LVLMs with advanced reasoning and instruction following capabilities through an iterative "memory-perception-planning-execution" cycle, requiring no extensive re-training of the underlying models. Our extensive experiments on MMDR-Bench demonstrate that CoLVLM Agent consistently achieves superior performance, attaining an average human evaluation score of 4.03, notably surpassing state-of-the-art commercial models like GPT-4o (3.92) and Gemini 1.5 Pro (3.85). 
The framework exhibits significant advantages in reasoning depth, instruction adherence, and error suppression, and maintains robust performance over extended dialogue turns, validating the effectiveness of its modular design and iterative approach for complex multi-modal interactions.

[12] SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling

Dong Liu,Yanxuan Yu

Main category: cs.CL

TL;DR: SemToken is a semantic-aware tokenization framework that improves computation efficiency and reduces token redundancy in long-context language modeling by leveraging semantic structure.

Details Motivation: Existing tokenization methods like BPE or WordPiece rely solely on frequency statistics, leading to over-tokenization of semantically redundant spans and underutilization of contextual coherence, especially in long-context scenarios. Method: SemToken uses lightweight encoders to extract semantic embeddings, performs local semantic clustering to merge equivalent tokens, and allocates token granularity based on semantic density. Result: Experiments show that SemToken achieves up to 2.4× reduction in token count and 1.9× speedup without significant degradation in perplexity or downstream accuracy. Conclusion: SemToken provides an effective way to reduce token redundancy and improve computation efficiency in language modeling by leveraging semantic structure. Abstract: Tokenization plays a critical role in language modeling, yet existing approaches such as Byte-Pair Encoding (BPE) or WordPiece operate purely on frequency statistics, ignoring the underlying semantic structure of text. This leads to over-tokenization of semantically redundant spans and underutilization of contextual coherence, particularly in long-context scenarios. In this work, we propose SemToken, a semantic-aware tokenization framework that jointly reduces token redundancy and improves computation efficiency. SemToken first extracts contextual semantic embeddings via lightweight encoders and performs local semantic clustering to merge semantically equivalent tokens. Then, it allocates heterogeneous token granularity based on semantic density, allowing finer-grained tokenization in content-rich regions and coarser compression in repetitive or low-entropy spans. SemToken can be seamlessly integrated with modern language models and attention acceleration methods. 
Experiments on long-context language modeling benchmarks such as WikiText-103 and LongBench show that SemToken achieves up to 2.4× reduction in token count and 1.9× speedup, with negligible or no degradation in perplexity and downstream accuracy. Our findings suggest that semantic structure offers a promising new axis for optimizing tokenization and computation in large language models.
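
As a rough illustration of the local semantic clustering step, the sketch below merges a run of adjacent tokens whenever each token's embedding stays close (by cosine similarity) to the first token of the run. This is not the SemToken implementation: the hand-written 2-D embeddings, the threshold, and the merge rule are assumptions, and the density-based granularity allocation is not shown.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_similar_spans(tokens: list[str],
                        embeddings: list[list[float]],
                        threshold: float = 0.9) -> list[list[str]]:
    """Group adjacent tokens whose embeddings stay similar to the span's first token."""
    if not tokens:
        return []
    spans = [[tokens[0]]]
    anchor = embeddings[0]
    for tok, emb in zip(tokens[1:], embeddings[1:]):
        if cosine(anchor, emb) >= threshold:
            spans[-1].append(tok)   # semantically redundant: extend current span
        else:
            spans.append([tok])     # new content: start a fresh span
            anchor = emb
    return spans


# Hypothetical tokens with toy embeddings standing in for a lightweight encoder.
tokens = ["very", "very", "very", "good"]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
spans = merge_similar_spans(tokens, embeddings)
print(spans, len(tokens) / len(spans))
```

In this toy case four tokens collapse into two spans, a 2× reduction; the reported figure of up to 2.4× comes from the paper's experiments, not from this sketch.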

[13] Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

Yuanchen Zhou,Shuo Jiang,Jie Zhu,Junhui Li,Lifan Guo,Feng Chen,Chi Zhang

Main category: cs.CL

TL;DR: Fin-PRM is a domain-specialized model for evaluating financial reasoning tasks, outperforming existing methods and showing significant improvements in various learning settings.

Details Motivation: Existing PRMs are primarily trained on general or STEM domains and fall short in domain-specific contexts such as finance. Method: Fin-PRM integrates step-level and trajectory-level reward supervision to evaluate intermediate reasoning steps in financial tasks. Result: Experimental results show that Fin-PRM outperforms general-purpose PRMs and strong domain baselines in trajectory selection quality, with significant improvements in supervised learning, reinforcement learning, and test-time performance. Conclusion: Fin-PRM demonstrates substantial improvements in financial reasoning tasks, highlighting the value of domain-specialized reward modeling for aligning LLMs with expert-level financial reasoning. Abstract: Process Reward Models (PRMs) have emerged as a promising framework for supervising intermediate reasoning in large language models (LLMs), yet existing PRMs are primarily trained on general or Science, Technology, Engineering, and Mathematics (STEM) domains and fall short in domain-specific contexts such as finance, where reasoning is more structured, symbolic, and sensitive to factual and regulatory correctness. We introduce Fin-PRM, a domain-specialized, trajectory-aware PRM tailored to evaluate intermediate reasoning steps in financial tasks. Fin-PRM integrates step-level and trajectory-level reward supervision, enabling fine-grained evaluation of reasoning traces aligned with financial logic. We apply Fin-PRM in both offline and online reward learning settings, supporting three key applications: (i) selecting high-quality reasoning trajectories for distillation-based supervised fine-tuning, (ii) providing dense process-level rewards for reinforcement learning, and (iii) guiding reward-informed Best-of-N inference at test time. 
Experimental results on financial reasoning benchmarks, including CFLUE and FinQA, demonstrate that Fin-PRM consistently outperforms general-purpose PRMs and strong domain baselines in trajectory selection quality. Downstream models trained with Fin-PRM yield substantial improvements over baselines, with gains of 12.9% in supervised learning, 5.2% in reinforcement learning, and 5.1% in test-time performance. These findings highlight the value of domain-specialized reward modeling for aligning LLMs with expert-level financial reasoning. Our project resources will be available at https://github.com/aliyun/qwen-dianjin.
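
Reward-informed Best-of-N inference, the third application listed above, can be sketched in a few lines. The way step-level and trajectory-level rewards are combined here (a weighted sum of the mean step reward and the trajectory reward) is an assumption for illustration; the abstract does not specify Fin-PRM's aggregation, and the reward values are invented.

```python
def best_of_n(candidates: list[dict], step_weight: float = 0.5) -> dict:
    """Pick the candidate with the highest combined reward.

    Each candidate is expected to carry `step_rewards` (one score per
    reasoning step) and `trajectory_reward` (one score for the whole trace),
    both assumed to come from a process reward model.
    """
    def score(c: dict) -> float:
        step_mean = sum(c["step_rewards"]) / len(c["step_rewards"])
        return step_weight * step_mean + (1 - step_weight) * c["trajectory_reward"]

    return max(candidates, key=score)


# Hypothetical N=2 sampled reasoning traces with made-up reward scores.
candidates = [
    {"answer": "A", "step_rewards": [0.9, 0.2], "trajectory_reward": 0.4},
    {"answer": "B", "step_rewards": [0.7, 0.8], "trajectory_reward": 0.9},
]
print(best_of_n(candidates)["answer"])
```

Using both reward levels lets a trace with one strong step but a weak overall argument (candidate A) lose to a trace that is consistently sound (candidate B).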

[14] SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning

Huanxuan Liao,Yixing Xu,Shizhu He,Guanchen Li,Xuanwu Yin,Dong Li,Emad Barsoum,Jun Zhao,Kang Liu

Main category: cs.CL

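TL;DR: SPARK applies channel-level sparsity to the KV cache, improving the efficiency and accuracy of long-context inference in large language models.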

Details Motivation: Existing KV cache compression methods mainly compress along the temporal axis and neglect importance variations across the channel dimension, which limits the balance they can strike between efficiency and accuracy. Method: The paper proposes SPARK, which applies unstructured sparsity to the KV cache by pruning channel-level redundancy and dynamically restoring the pruned entries during attention score computation. Result: SPARK can process longer sequences within the same memory budget; for sequences of equal length, it reduces KV cache storage by over 30% compared with eviction-based methods, and even at an aggressive pruning ratio of 80% the performance degradation is less than 5%. Conclusion: SPARK is a training-free, plug-and-play method that applies unstructured sparsity to the KV cache at the channel level, effectively addressing the KV cache bottleneck in long-context inference while preserving model accuracy and efficiency. Abstract: Long-context inference in large language models (LLMs) is increasingly constrained by the KV cache bottleneck: memory usage grows linearly with sequence length, while attention computation scales quadratically. Existing approaches address this issue by compressing the KV cache along the temporal axis through strategies such as token eviction or merging to reduce memory and computational overhead. However, these methods often neglect fine-grained importance variations across feature dimensions (i.e., the channel axis), thereby limiting their ability to effectively balance efficiency and model accuracy. In reality, we observe that channel saliency varies dramatically across both queries and positions: certain feature channels carry near-zero information for a given query, while others spike in relevance. To address this oversight, we propose SPARK, a training-free plug-and-play method that applies unstructured sparsity by pruning KV at the channel level, while dynamically restoring the pruned entries during attention score computation. Notably, our approach is orthogonal to existing KV compression and quantization techniques, making it compatible for integration with them to achieve further acceleration. By reducing channel-level redundancy, SPARK enables processing of longer sequences within the same memory budget. For sequences of equal length, SPARK not only preserves or improves model accuracy but also reduces KV cache storage by over 30% compared to eviction-based methods. 
Furthermore, even with an aggressive pruning ratio of 80%, SPARK maintains performance with less degradation than 5% compared to the baseline eviction method, demonstrating its robustness and effectiveness. Our code will be available at https://github.com/Xnhyacinth/SparK.
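
As a rough illustration of the channel-level idea described in this abstract, the sketch below scores each channel of one key vector by the magnitude of its contribution to the query-key dot product and zeroes the least salient fraction. The saliency rule, the function names, and the zero fill are assumptions made for illustration; the abstract does not specify how SPARK selects channels or restores pruned entries, so this is not the authors' procedure.

```python
def prune_key_channels(query, key, prune_ratio):
    """Zero out the least salient channels of one key vector.

    Saliency here is |q_c * k_c|, a hypothetical choice: the size of
    each channel's contribution to the attention logit for this query.
    """
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    saliency = [abs(q * k) for q, k in zip(query, key)]
    n_prune = int(len(key) * prune_ratio)
    # Channel indices sorted from least to most salient.
    order = sorted(range(len(key)), key=lambda c: saliency[c])
    pruned = set(order[:n_prune])
    return [0.0 if c in pruned else k for c, k in enumerate(key)]


def attention_score(query, key):
    """Unscaled dot-product attention logit for one query-key pair."""
    return sum(q * k for q, k in zip(query, key))


# Toy vectors (made up): channel 1 is large in the key but the query
# ignores it, so it carries no information for this query.
query = [0.9, 0.0, -1.2, 0.1]
key = [1.0, 5.0, 0.8, 0.2]
sparse_key = prune_key_channels(query, key, prune_ratio=0.5)
full = attention_score(query, key)
approx = attention_score(query, sparse_key)
print(sparse_key, full, approx)
```

In this toy case half of the channels are dropped while the attention logit changes only slightly, which is the kind of redundancy the abstract refers to.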

[15] Select to Know: An Internal-External Knowledge Self-Selection Framework for Domain-Specific Question Answering

Bolei He,Xinran He,Run Shao,Shanfu Shu,Xianwei Xue,Mingquan Cheng,Haifeng Li,Zhenhua Ling

Main category: cs.CL

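TL;DR: Proposes Selct2Know (S2K), a framework that integrates domain knowledge through an internal-external knowledge self-selection strategy and selective supervised fine-tuning, improving large language model performance on domain-specific question answering.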

Details Motivation: Large language models perform well in general QA but often struggle in domain-specific scenarios. Existing approaches such as retrieval-augmented generation and continued pretraining suffer from hallucinations, latency, and high cost. Method: Proposes the Selct2Know (S2K) framework, which uses an internal-external knowledge self-selection strategy and selective supervised fine-tuning, combined with a structured reasoning data generation pipeline and GRPO to strengthen reasoning ability. Result: Experiments show that S2K consistently outperforms existing methods on medical, legal, and financial QA benchmarks and matches domain-pretrained LLMs at significantly lower cost. Conclusion: Selct2Know (S2K) offers a cost-effective way to address the challenges LLMs face in acquiring and applying domain-specific knowledge. Abstract: Large Language Models (LLMs) perform well in general QA but often struggle in domain-specific scenarios. Retrieval-Augmented Generation (RAG) introduces external knowledge but suffers from hallucinations and latency due to noisy retrievals. Continued pretraining internalizes domain knowledge but is costly and lacks cross-domain flexibility. We attribute this challenge to the long-tail distribution of domain knowledge, which leaves partial yet useful internal knowledge underutilized. We further argue that knowledge acquisition should be progressive, mirroring human learning: first understanding concepts, then applying them to complex reasoning. To address this, we propose Selct2Know (S2K), a cost-effective framework that internalizes domain knowledge through an internal-external knowledge self-selection strategy and selective supervised fine-tuning. We also introduce a structured reasoning data generation pipeline and integrate GRPO to enhance reasoning ability. Experiments on medical, legal, and financial QA benchmarks show that S2K consistently outperforms existing methods and matches domain-pretrained LLMs with significantly lower cost.

[16] Self-Guided Function Calling in Large Language Models via Stepwise Experience Recall

Sijia Cui,Aiyao He,Shuai Xu,Hongming Zhang,Yanna Wang,Qingyang Zhang,Yajing Wang,Bo Xu

Main category: cs.CL

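TL;DR: Proposes SEER, a method that improves multi-step tool use by retrieving stepwise from an experience pool, and that is more efficient and performs better than existing approaches.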

Details Motivation: Existing multi-step tool-use methods rely on manually designed task-specific demonstrations or on retrieval from a curated library, which demands substantial expert effort; prompt engineering also becomes increasingly complex and inefficient as tool diversity and task difficulty grow. Method: Proposes Stepwise Experience Recall (SEER), which performs fine-grained, stepwise retrieval from a continually updated experience pool to improve LLM performance on multi-step tool use. Result: On the ToolQA benchmark, SEER achieves average improvements of 6.1% on easy and 4.7% on hard questions; on τ-bench, with Qwen2.5-7B and Qwen2.5-72B, SEER shows accuracy gains of 7.44% and 23.38%, respectively. Conclusion: SEER is a self-guided method that improves multi-step tool use through stepwise retrieval from an experience pool and shows clear gains across multiple benchmarks. Abstract: Function calling enables large language models (LLMs) to interact with external systems by leveraging tools and APIs. When faced with multi-step tool usage, LLMs still struggle with tool selection, parameter generation, and tool-chain planning. Existing methods typically rely on manually designing task-specific demonstrations, or retrieving from a curated library. These approaches demand substantial expert effort, and prompt engineering becomes increasingly complex and inefficient as tool diversity and task difficulty scale. To address these challenges, we propose a self-guided method, Stepwise Experience Recall (SEER), which performs fine-grained, stepwise retrieval from a continually updated experience pool. Instead of relying on a static or manually curated library, SEER incrementally augments the experience pool with past successful trajectories, enabling continuous expansion of the pool and improved model performance over time. Evaluated on the ToolQA benchmark, SEER achieves an average improvement of 6.1% on easy and 4.7% on hard questions. We further test SEER on τ-bench, which includes two real-world domains. Powered by Qwen2.5-7B and Qwen2.5-72B models, SEER demonstrates substantial accuracy gains of 7.44% and 23.38%, respectively.

[17] Are Checklists Really Useful for Automatic Evaluation of Generative Tasks?

Momoka Furuhashi,Kouta Nakayama,Takashi Kodama,Saku Sugawara

Main category: cs.CL

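TL;DR: Selective use of checklists can improve automatic evaluation of generative tasks, but clearer objective criteria are needed to reduce inconsistencies between human and automatic evaluation.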

Details Motivation: Automatic evaluation of generative tasks is hampered by ambiguous criteria; checklist generation is a potentially promising approach, but its usefulness has not been studied thoroughly. Method: Generates checklists with six methods, evaluates their effectiveness across eight model sizes, and analyzes how checklist items correlate with human evaluations. Result: Selective checklist use generally improves evaluation performance in pairwise comparison, but its effect is inconsistent in direct scoring; even checklist items with low correlation to human scores often reflect human-written criteria, suggesting possible inconsistencies in human evaluation. Conclusion: The study highlights the need for more clearly defined objective evaluation criteria to guide both human and automatic evaluation of generative tasks. Abstract: Automatic evaluation of generative tasks using large language models faces challenges due to ambiguous criteria. Although automatic checklist generation is a potentially promising approach, its usefulness remains underexplored. We investigate whether checklists should be used for all questions or selectively, generate them using six methods, evaluate their effectiveness across eight model sizes, and identify checklist items that correlate with human evaluations. Through experiments on pairwise comparison and direct scoring tasks, we find that selective checklist use tends to improve evaluation performance in pairwise settings, while its benefits are less consistent in direct scoring. Our analysis also shows that even checklist items with low correlation to human scores often reflect human-written criteria, indicating potential inconsistencies in human evaluation. These findings highlight the need to more clearly define objective evaluation criteria to guide both human and automatic evaluations. Our code is available at https://github.com/momo0817/checklist-effectiveness-study

[18] VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models

Hanling Zhang,Yayu Zhou,Tongcheng Fang,Zhihang Yuan,Guohao Dai,Yu Wang

Main category: cs.CL

TL;DR: VocabTailor is a novel framework that dynamically manages vocabulary components in Small Language Models to drastically reduce memory usage while maintaining performance, enabling efficient deployment on edge devices.

Details Motivation: Memory constraints in Small Language Models (SLMs), especially from vocabulary-related components, hinder deployment on edge devices. Existing static pruning methods are rigid and cause information loss, necessitating a more flexible and efficient solution. Method: The study introduces VocabTailor, a dynamic vocabulary selection framework that decouples memory handling for embeddings and LM heads, employing offloading and a hybrid static-dynamic strategy for on-demand loading. Result: VocabTailor reduces memory usage of vocabulary-related components by up to 99% across diverse tasks, with little to no decline in performance. Conclusion: VocabTailor substantially outperforms existing static vocabulary pruning methods by achieving significant memory reduction with minimal impact on task performance. Abstract: Small Language Models (SLMs) provide computational advantages in resource-constrained environments, yet memory limitations remain a critical bottleneck for edge device deployment. A substantial portion of SLMs' memory footprint stems from vocabulary-related components, particularly embeddings and language modeling (LM) heads, due to large vocabulary sizes. Existing static vocabulary pruning, while reducing memory usage, suffers from rigid, one-size-fits-all designs that cause information loss from the prefill stage and a lack of flexibility. In this work, we identify two key principles underlying the vocabulary reduction challenge: the lexical locality principle, the observation that only a small subset of tokens is required during any single inference, and the asymmetry in computational characteristics between vocabulary-related components of SLM. 
Based on these insights, we introduce VocabTailor, a novel decoupled dynamic vocabulary selection framework that addresses memory constraints by offloading the embedding and by implementing a hybrid static-dynamic vocabulary selection strategy for the LM head, enabling on-demand loading of vocabulary components. Comprehensive experiments across diverse downstream tasks demonstrate that VocabTailor achieves a reduction of up to 99% in the memory usage of vocabulary-related components with minimal or no degradation in task performance, substantially outperforming existing static vocabulary pruning.

[19] WangchanThaiInstruct: An instruction-following Dataset for Culture-Aware, Multitask, and Multi-domain Evaluation in Thai

Peerat Limkonchotiwat,Pume Tuchinda,Lalita Lowphansirikul,Surapon Nonesung,Panuthep Tasawong,Alham Fikri Aji,Can Udomcharoenchaikit,Sarana Nutanong

Main category: cs.CL

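TL;DR: Presents WangchanThaiInstruct, a human-authored Thai dataset for evaluation and instruction tuning, and underscores the need for culturally and professionally grounded instruction data to improve LLM alignment in low-resource, linguistically diverse settings.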

Details Motivation: Large language models excel at instruction following in English, but their performance in low-resource languages such as Thai remains underexplored. Existing benchmarks often rely on translations and miss the cultural and domain-specific nuances needed for real-world use. Method: WangchanThaiInstruct, a human-authored Thai dataset for evaluation and instruction tuning, was created through a multi-stage quality control process involving annotators, domain experts, and AI researchers. Result: Models fine-tuned on WangchanThaiInstruct outperform those using translated data on both in-domain and out-of-domain benchmarks. Zero-shot evaluation reveals performance gaps on culturally and professionally specific tasks. Conclusion: WangchanThaiInstruct underscores the need for culturally and professionally grounded instruction data to improve LLM alignment in low-resource, linguistically diverse settings. Abstract: Large language models excel at instruction-following in English, but their performance in low-resource languages like Thai remains underexplored. Existing benchmarks often rely on translations, missing cultural and domain-specific nuances needed for real-world use. We present WangchanThaiInstruct, a human-authored Thai dataset for evaluation and instruction tuning, covering four professional domains and seven task types. Created through a multi-stage quality control process with annotators, domain experts, and AI researchers, WangchanThaiInstruct supports two studies: (1) a zero-shot evaluation showing performance gaps on culturally and professionally specific tasks, and (2) an instruction tuning study with ablations isolating the effect of native supervision. Models fine-tuned on WangchanThaiInstruct outperform those using translated data in both in-domain and out-of-domain benchmarks. These findings underscore the need for culturally and professionally grounded instruction data to improve LLM alignment in low-resource, linguistically diverse settings.

[20] UniCoM: A Universal Code-Switching Speech Generator

Sangmin Lee,Woojin Chung,Seyun Um,Hong-Goo Kang

Main category: cs.CL

TL;DR: This paper proposes UniCoM, a novel pipeline using the SWORDS algorithm to generate high-quality code-switching speech data while preserving semantics, leading to the creation of the CS-FLEURS corpus for speech recognition and translation tasks.

Details Motivation: Code-switching is common in multilingual conversations but poses challenges for speech technology due to the lack of suitable datasets. The study aims to address this gap by proposing a method for generating natural CS samples. Method: The paper introduces UniCoM, a novel pipeline for generating code-switching (CS) samples. This includes the SWORDS algorithm, which replaces words with translations based on their parts of speech to maintain sentence semantics. Result: Using UniCoM, the authors constructed the CS-FLEURS dataset for ASR and S2TT tasks. Experimental results show that the dataset has high intelligibility and naturalness, performing comparably to existing datasets on both objective and subjective metrics. Conclusion: The proposed UniCoM approach effectively generates high-quality, natural CS samples, and the constructed CS-FLEURS corpus performs comparably to existing datasets on ASR and S2TT tasks, paving the way for more inclusive multilingual systems. Abstract: Code-switching (CS), the alternation between two or more languages within a single speaker's utterances, is common in real-world conversations and poses significant challenges for multilingual speech technology. However, systems capable of handling this phenomenon remain underexplored, primarily due to the scarcity of suitable datasets. To resolve this issue, we propose Universal Code-Mixer (UniCoM), a novel pipeline for generating high-quality, natural CS samples without altering sentence semantics. Our approach utilizes an algorithm we call Substituting WORDs with Synonyms (SWORDS), which generates CS speech by replacing selected words with their translations while considering their parts of speech. Using UniCoM, we construct Code-Switching FLEURS (CS-FLEURS), a multilingual CS corpus designed for automatic speech recognition (ASR) and speech-to-text translation (S2TT). 
Experimental results show that CS-FLEURS achieves high intelligibility and naturalness, performing comparably to existing datasets on both objective and subjective metrics. We expect our approach to advance CS speech technology and enable more inclusive multilingual systems.

[21] EMNLP: Educator-role Moral and Normative Large Language Models Profiling

Yilin Jiang,Mingzi Zhang,Sheng Jin,Zengyi Yu,Xiangjie Kong,Binghao Tu

Main category: cs.CL

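TL;DR: Proposes the EMNLP framework for evaluating the ethical and psychological alignment of teacher-role large language models, reveals a paradox between capability and safety, and builds the first benchmark of this kind.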

Details Motivation: Although Simulating Professions (SP) lets large language models emulate professional roles, comprehensive psychological and ethical evaluation in these contexts is still lacking. Method: The EMNLP framework extends existing scales, constructs 88 teacher-specific moral dilemmas, and uses a targeted soft prompt injection set to evaluate compliance and vulnerability. Result: Experiments show that teacher-role LLMs exhibit more idealized and polarized personalities than human teachers, excel at abstract moral reasoning, but struggle with emotionally complex situations; models with stronger reasoning are more vulnerable to harmful prompt injection. Conclusion: The EMNLP framework reveals a paradox between capability and safety in the ethical and psychological alignment of teacher-role LLMs and introduces the first benchmark for assessing that alignment in educational AI. Abstract: Simulating Professions (SP) enables Large Language Models (LLMs) to emulate professional roles. However, comprehensive psychological and ethical evaluation in these contexts remains lacking. This paper introduces EMNLP, an Educator-role Moral and Normative LLMs Profiling framework for personality profiling, moral development stage measurement, and ethical risk under soft prompt injection. EMNLP extends existing scales and constructs 88 teacher-specific moral dilemmas, enabling profession-oriented comparison with human teachers. A targeted soft prompt injection set evaluates compliance and vulnerability in teacher SP. Experiments on 12 LLMs show teacher-role LLMs exhibit more idealized and polarized personalities than human teachers, excel in abstract moral reasoning, but struggle with emotionally complex situations. Models with stronger reasoning are more vulnerable to harmful prompt injection, revealing a paradox between capability and safety. The model temperature and other hyperparameters have limited influence except in some risk behaviors. This paper presents the first benchmark to assess ethical and psychological alignment of teacher-role LLMs for educational AI. Resources are available at https://e-m-n-l-p.github.io/.

[22] Conflict-Aware Soft Prompting for Retrieval-Augmented Generation

Eunseong Choi,June Park,Hyeri Lee,Jongwuk Lee

Main category: cs.CL

TL;DR: Conflict-Aware Retrieval-Augmented Generation (CARE) resolves conflicts between external context and internal LLM knowledge, improving reliability in QA and fact-checking tasks.

Details Motivation: RAG systems face context-memory conflicts when external context contradicts the LLM's internal knowledge, limiting their reliability. Method: CARE includes a context assessor and a base LLM, where the context assessor uses grounded/adversarial soft prompting to discern unreliable context and guide reasoning. Result: CARE achieves an average performance gain of 5.0% on QA and fact-checking benchmarks. Conclusion: CARE provides a promising direction for trustworthy and adaptive RAG systems by resolving context-memory conflicts. Abstract: Retrieval-augmented generation (RAG) enhances the capabilities of large language models (LLMs) by incorporating external knowledge into their input prompts. However, when the retrieved context contradicts the LLM's parametric knowledge, it often fails to resolve the conflict between incorrect external context and correct parametric knowledge, known as context-memory conflict. To tackle this problem, we introduce Conflict-Aware REtrieval-Augmented Generation (CARE), consisting of a context assessor and a base LLM. The context assessor encodes compact memory token embeddings from raw context tokens. Through grounded/adversarial soft prompting, the context assessor is trained to discern unreliable context and capture a guidance signal that directs reasoning toward the more reliable knowledge source. Extensive experiments show that CARE effectively mitigates context-memory conflicts, leading to an average performance gain of 5.0% on QA and fact-checking benchmarks, establishing a promising direction for trustworthy and adaptive RAG systems.

[23] TComQA: Extracting Temporal Commonsense from Text

Lekshmi R Nair,Arun Sankar,Koninika Pal

Main category: cs.CL

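TL;DR: Proposes a method that uses large language models (LLMs) to mine temporal commonsense automatically and builds TComQA, a high-quality dataset that substantially improves model performance on temporal commonsense question answering.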

Details Motivation: Temporal commonsense is essential for understanding events, but because it is rarely stated explicitly in text, existing LLMs perform poorly on tasks that involve temporal reasoning. Mining temporal commonsense automatically is therefore needed to make language models more robust. Method: Proposes a temporal commonsense extraction pipeline that uses LLMs to mine temporal commonsense automatically and to construct the TComQA dataset, which is derived from the SAMSum and RealNews corpora and validated through crowdsourcing. Result: TComQA achieves over 80% precision in extracting temporal commonsense, and a model trained with TComQA outperforms an LLM fine-tuned on an existing temporal question answering dataset. Conclusion: Building and using TComQA improves model performance on temporal commonsense question answering, indicating that mining temporal commonsense with LLMs is effective. Abstract: Understanding events necessitates grasping their temporal context, which is often not explicitly stated in natural language. For example, it is not a trivial task for a machine to infer that a museum tour may last for a few hours, but cannot take months. Recent studies indicate that even advanced large language models (LLMs) struggle in generating text that requires reasoning with temporal commonsense due to its infrequent explicit mention in text. Therefore, automatically mining temporal commonsense for events enables the creation of robust language models. In this work, we investigate the capacity of LLMs to extract temporal commonsense from text and evaluate multiple experimental setups to assess their effectiveness. Here, we propose a temporal commonsense extraction pipeline that leverages LLMs to automatically mine temporal commonsense and use it to construct TComQA, a dataset derived from SAMSum and RealNews corpora. TComQA has been validated through crowdsourcing and achieves over 80% precision in extracting temporal commonsense. The model trained with TComQA also outperforms an LLM fine-tuned on an existing dataset for the temporal question answering task.

[24] CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing

Abdul Rehman,Jian-Jun Zhang,Xiaosong Yang

Main category: cs.CL

TL;DR: CUPE is a lightweight model that captures key phoneme features within 120 milliseconds, enabling efficient universal phoneme recognition.

Details Motivation: Many speech processing tasks require pure phoneme representations, whereas universal phoneme recognition typically requires analyzing long speech segments and language-specific patterns. Method: CUPE processes short, fixed-width windows independently and is evaluated through supervised and self-supervised training on diverse languages. Result: CUPE generalizes strongly across languages and achieves competitive cross-lingual performance with fewer parameters than existing approaches. Conclusion: CUPE is a lightweight model that achieves effective universal speech processing by modeling basic acoustic patterns within phoneme-length windows. Abstract: Universal phoneme recognition typically requires analyzing long speech segments and language-specific patterns. Many speech processing tasks require pure phoneme representations free from contextual influence, which motivated our development of CUPE - a lightweight model that captures key phoneme features in just 120 milliseconds, about one phoneme's length. CUPE processes short, fixed-width windows independently and, despite fewer parameters than current approaches, achieves competitive cross-lingual performance by learning fundamental acoustic patterns common to all languages. Our extensive evaluation through supervised and self-supervised training on diverse languages, including zero-shot tests on the UCLA Phonetic Corpus, demonstrates strong cross-lingual generalization and reveals that effective universal speech processing is possible through modeling basic acoustic patterns within phoneme-length windows.

[25] KG-EDAS: A Meta-Metric Framework for Evaluating Knowledge Graph Completion Models

Haji Gul,Abul Ghani Naim,Ajaz Ahmad Bhat

Main category: cs.CL

TL;DR: The paper proposes a new meta-metric called KG Evaluation based on Distance from Average Solution (EDAS) for evaluating Knowledge Graph Completion models, which provides a unified, robust, and generalizable evaluation framework across multiple datasets and metrics.

Details Motivation: The major challenge in evaluating KGC models lies in comparing their performance across multiple datasets and metrics. Different metrics can yield conflicting rankings, making it difficult to determine overall superiority of a model and hindering holistic comparisons. Method: KG Evaluation based on Distance from Average Solution (EDAS) is proposed as a unified meta-metric that integrates performance across all metrics and datasets to enable a more reliable and interpretable evaluation framework. Result: Experimental results on benchmark datasets such as FB15k-237 and WN18RR demonstrate that EDAS effectively integrates multi-metric, multi-dataset performance into a unified ranking. Conclusion: KG Evaluation based on Distance from Average Solution (EDAS) is a robust and interpretable meta-metric that synthesizes model performance across multiple datasets and diverse evaluation criteria into a single normalized score, offering a consistent, robust, and generalizable framework for evaluating KGC models. Abstract: Knowledge Graphs (KGs) enable applications in various domains such as semantic search, recommendation systems, and natural language processing. KGs are often incomplete, missing entities and relations, an issue addressed by Knowledge Graph Completion (KGC) methods that predict missing elements. Different evaluation metrics, such as Mean Reciprocal Rank (MRR), Mean Rank (MR), and Hit@k, are commonly used to assess the performance of such KGC models. A major challenge in evaluating KGC models, however, lies in comparing their performance across multiple datasets and metrics. A model may outperform others on one dataset but underperform on another, making it difficult to determine overall superiority. Moreover, even within a single dataset, different metrics such as MRR and Hit@1 can yield conflicting rankings, where one model excels in MRR while another performs better in Hit@1, further complicating model selection for downstream tasks. 
These inconsistencies hinder holistic comparisons and highlight the need for a unified meta-metric that integrates performance across all metrics and datasets to enable a more reliable and interpretable evaluation framework. To address this need, we propose KG Evaluation based on Distance from Average Solution (EDAS), a robust and interpretable meta-metric that synthesizes model performance across multiple datasets and diverse evaluation criteria into a single normalized score ($M_i \in [0,1]$). Unlike traditional metrics that focus on isolated aspects of performance, EDAS offers a global perspective that supports more informed model selection and promotes fairness in cross-dataset evaluation. Experimental results on benchmark datasets such as FB15k-237 and WN18RR demonstrate that EDAS effectively integrates multi-metric, multi-dataset performance into a unified ranking, offering a consistent, robust, and generalizable framework for evaluating KGC models.
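
For readers unfamiliar with EDAS, the sketch below implements the standard Evaluation based on Distance from Average Solution procedure from the multi-criteria decision-making literature: each model's positive and negative distances from the per-criterion average are weighted, normalized, and averaged into one score in [0, 1]. Whether KG-EDAS uses exactly these steps, weights, or normalization is an assumption; the function name and all numbers are made up for illustration and are not results from the paper.

```python
def edas_scores(matrix, weights, beneficial):
    """Return one EDAS appraisal score in [0, 1] per row (model).

    matrix[i][j] is the value of criterion j for model i.
    beneficial[j] is True when higher is better (e.g. MRR, Hit@k)
    and False when lower is better (e.g. Mean Rank).
    Assumes every criterion has a strictly positive average.
    """
    n_models, n_crit = len(matrix), len(weights)
    avg = [sum(row[j] for row in matrix) / n_models for j in range(n_crit)]
    sp, sn = [], []  # weighted positive / negative distances per model
    for row in matrix:
        pos = neg = 0.0
        for j in range(n_crit):
            diff = (row[j] - avg[j]) / avg[j]
            if not beneficial[j]:
                diff = -diff  # for cost criteria, below average is good
            pos += weights[j] * max(0.0, diff)
            neg += weights[j] * max(0.0, -diff)
        sp.append(pos)
        sn.append(neg)
    max_sp, max_sn = max(sp), max(sn)
    scores = []
    for p, n in zip(sp, sn):
        nsp = p / max_sp if max_sp > 0 else 0.0
        nsn = 1.0 - n / max_sn if max_sn > 0 else 1.0
        scores.append((nsp + nsn) / 2.0)
    return scores


# Hypothetical models scored on MRR, Hit@1 (higher is better)
# and Mean Rank (lower is better).
models = [
    [0.36, 0.27, 150.0],
    [0.33, 0.24, 180.0],
    [0.30, 0.21, 210.0],
]
scores = edas_scores(models, weights=[1 / 3, 1 / 3, 1 / 3],
                     beneficial=[True, True, False])
print(scores)
```

Stacking the columns of several datasets side by side in `matrix` would give a single cross-dataset ranking, which is the kind of use the abstract describes.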

[26] A Survey on Large Language Model Benchmarks

Shiwen Ni,Guhong Chen,Shuaimin Li,Xuanang Chen,Siyi Li,Bingli Wang,Qiyao Wang,Xingjian Wang,Yifan Zhang,Liyang Fan,Chengming Li,Ruifeng Xu,Le Sun,Min Yang

Main category: cs.CL

TL;DR: The paper provides a systematic review of large language model benchmarks, categorizing them into three types and identifying current issues such as data contamination, cultural and linguistic biases, and lack of evaluation on process credibility and dynamic environments.

Details Motivation: The motivation for the paper stems from the rapid development of large language models and the increasing number of corresponding evaluation benchmarks. The authors aim to assess the current state of benchmarks, identify their shortcomings, and propose directions for future innovation. Method: The method involves a systematic review of the current status and development of large language model benchmarks, categorizing 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific. Result: The result of the paper is a categorization of 283 benchmarks into three types: general capabilities, domain-specific, and target-specific. The authors identify key issues in current benchmarks and provide a design paradigm to address these problems. Conclusion: The paper concludes that while benchmarks are essential for evaluating and guiding the development of large language models, they currently face issues such as data contamination, cultural and linguistic biases, and a lack of evaluation on process credibility and dynamic environments. The authors provide a design paradigm for future benchmark innovation. Abstract: In recent years, with the rapid development of the depth and breadth of large language models' capabilities, various corresponding evaluation benchmarks have been emerging in increasing numbers. As a quantitative assessment tool for model performance, benchmarks are not only a core means to measure model capabilities but also a key element in guiding the direction of model development and promoting technological innovation. We systematically review the current status and development of large language model benchmarks for the first time, categorizing 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific. 
General capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning; domain-specific benchmarks focus on fields like natural sciences, humanities and social sciences, and engineering technology; target-specific benchmarks pay attention to risks, reliability, agents, etc. We point out that current benchmarks have problems such as inflated scores caused by data contamination, unfair evaluation due to cultural and linguistic biases, and lack of evaluation on process credibility and dynamic environments, and provide a referable design paradigm for future benchmark innovation.

[27] Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation

Yichi Zhang,Yao Huang,Yifan Wang,Yitong Sun,Chang Liu,Zhe Zhao,Zhengwei Fang,Huanran Chen,Xiao Yang,Xingxing Wei,Hang Su,Yinpeng Dong,Jun Zhu

Main category: cs.CL

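TL;DR: Proposes MultiTrust-X, a comprehensive benchmark for evaluating and mitigating trustworthiness issues in multimodal large language models, and introduces the RESA approach to improve a model's ability to discover risks while reasoning.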

Details Motivation: Despite significant progress in the capabilities of MLLMs, their trustworthiness remains a serious concern. Existing evaluation and mitigation approaches often focus on narrow aspects and overlook risks introduced by multimodality. Method: MultiTrust-X is built on a three-dimensional framework covering five trustworthiness aspects (truthfulness, robustness, safety, fairness, and privacy), two novel risk types (multimodal risks and cross-modal impacts), and various mitigation strategies spanning data, model architecture, training, and inference algorithms. Result: Experiments reveal significant vulnerabilities in current models, including a gap between trustworthiness and general capabilities and the amplification of potential risks in base LLMs by multimodal training and inference. Existing mitigation strategies also have key limitations: some methods improve specific aspects, but few effectively address overall trustworthiness. Conclusion: MultiTrust-X is a comprehensive benchmark for evaluating and mitigating trustworthiness issues in MLLMs, and the proposed RESA approach uses reasoning ability to discover underlying risks. Abstract: The trustworthiness of Multimodal Large Language Models (MLLMs) remains an intense concern despite the significant progress in their capabilities. Existing evaluation and mitigation approaches often focus on narrow aspects and overlook risks introduced by the multimodality. To tackle these challenges, we propose MultiTrust-X, a comprehensive benchmark for evaluating, analyzing, and mitigating the trustworthiness issues of MLLMs. We define a three-dimensional framework, encompassing five trustworthiness aspects which include truthfulness, robustness, safety, fairness, and privacy; two novel risk types covering multimodal risks and cross-modal impacts; and various mitigation strategies from the perspectives of data, model architecture, training, and inference algorithms. Based on the taxonomy, MultiTrust-X includes 32 tasks and 28 curated datasets, enabling holistic evaluations over 30 open-source and proprietary MLLMs and in-depth analysis with 8 representative mitigation methods. Our extensive experiments reveal significant vulnerabilities in current models, including a gap between trustworthiness and general capabilities, as well as the amplification of potential risks in base LLMs by both multimodal training and inference. Moreover, our controlled analysis uncovers key limitations in existing mitigation strategies that, while some methods yield improvements in specific aspects, few effectively address overall trustworthiness, and many introduce unexpected trade-offs that compromise model utility. 
These findings also provide practical insights for future improvements, such as the benefits of reasoning to better balance safety and performance. Based on these insights, we introduce a Reasoning-Enhanced Safety Alignment (RESA) approach that equips the model with chain-of-thought reasoning ability to discover the underlying risks, achieving state-of-the-art results.

[28] Confidence-Modulated Speculative Decoding for Large Language Models

Jaydip Sen,Subhasis Dasgupta,Hetvi Waghela

Main category: cs.CL

TL;DR: This paper introduces an adaptive speculative decoding framework for large language models that dynamically adjusts token generation and verification based on confidence measures, resulting in faster inference without compromising quality.

Details Motivation: The motivation is to address the limitations of existing speculative decoding methods, which rely on static drafting lengths and rigid verification criteria, thereby limiting their adaptability. Method: The method involves using entropy and margin-based uncertainty measures from the drafter's output distribution to adaptively adjust the number of speculatively generated tokens and modulate the verification process. Result: The experiments showed significant speed improvements over standard speculative decoding while maintaining or enhancing generation quality as measured by BLEU and ROUGE scores. Conclusion: The paper concludes that the proposed information-theoretic framework for speculative decoding improves efficiency and robustness in large language models by dynamically adjusting token generation and verification based on confidence measures. Abstract: Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid verification criteria, limiting their adaptability across varying model uncertainties and input complexities. This paper proposes an information-theoretic framework for speculative decoding based on confidence-modulated drafting. By leveraging entropy and margin-based uncertainty measures over the drafter's output distribution, the proposed method dynamically adjusts the number of speculatively generated tokens at each iteration. This adaptive mechanism reduces rollback frequency, improves resource utilization, and maintains output fidelity. Additionally, the verification process is modulated using the same confidence signals, enabling more flexible acceptance of drafted tokens without sacrificing generation quality. 
Experiments on machine translation and summarization tasks demonstrate significant speedups over standard speculative decoding while preserving or improving BLEU and ROUGE scores. The proposed approach offers a principled, plug-in method for efficient and robust decoding in large language models under varying conditions of uncertainty.
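
To make the idea of confidence-modulated drafting concrete, the sketch below computes the entropy and the top-two probability margin of a drafter's next-token distribution and maps them to a draft length. The equal-weight combination, the linear mapping, the bounds, and the function names are all assumptions for illustration; the abstract does not give the paper's actual rule.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def margin(probs):
    """Gap between the two most likely tokens."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def draft_length(probs, min_len=1, max_len=8):
    """Map drafter confidence to a number of tokens to draft.

    Hypothetical rule: average the normalized certainty
    (1 - entropy / max entropy) and the margin, then scale linearly
    between min_len and max_len. Requires at least two tokens.
    """
    max_entropy = math.log(len(probs))
    certainty = 1.0 - entropy(probs) / max_entropy  # 1 peaked, 0 uniform
    confidence = 0.5 * (certainty + margin(probs))
    return min_len + round(confidence * (max_len - min_len))


uniform = [0.25, 0.25, 0.25, 0.25]   # drafter is maximally unsure
peaked = [0.97, 0.01, 0.01, 0.01]    # drafter is nearly certain
short_draft = draft_length(uniform)
long_draft = draft_length(peaked)
print(short_draft, long_draft)
```

A confident drafter is allowed to speculate further ahead, while an uncertain one drafts few tokens, which is how an adaptive scheme of this kind could reduce rollbacks.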

[29] Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training

Woojin Chung,Jeonghoon Kim

Main category: cs.CL

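TL;DR: Studies how vocabulary size affects language models and finds that larger vocabularies lower the complexity of tokenized text, which improves model performance, particularly on predicting frequent words.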

Details Motivation: Current practice favors ever-larger vocabularies, but the source of the benefit is unclear. This paper uses a controlled study to examine the effect of scaling a language model's vocabulary. Method: Scales the vocabulary from 24K to 196K while holding data, compute, and optimization fixed. The study also quantifies the complexity of tokenized text and analyzes the effect of vocabulary size on cross-entropy. Result: Larger vocabularies reduce cross-entropy mainly by lowering uncertainty on frequent words, even though loss on rare words rises. Enlarging model parameters with a fixed vocabulary yields a similar benefit. Conclusion: Enlarging the vocabulary lowers the complexity of tokenized text and thereby improves model performance. This finding provides a principled basis for tokenizer-model co-design and clarifies the loss dynamics of language-model scaling in pre-training. Abstract: Large language models are trained with tokenizers, and the resulting token distribution is highly imbalanced: a few words dominate the stream while most occur rarely. Recent practice favors ever-larger vocabularies, but the source of the benefit is unclear. We conduct a controlled study that scales the language model's vocabulary from 24K to 196K while holding data, compute, and optimization fixed. We first quantify the complexity of tokenized text, formalized via Kolmogorov complexity, and show that larger vocabularies reduce this complexity. Above 24K, every common word is already a single token, so further growth mainly deepens the relative token-frequency imbalance. A word-level loss decomposition shows that larger vocabularies reduce cross-entropy almost exclusively by lowering uncertainty on the 2,500 most frequent words, even though loss on the rare tail rises. Constraining input and output embedding norms to attenuate the effect of token-frequency imbalance reverses the gain, directly showing that the model exploits rather than suffers from imbalance. Because the same frequent words cover roughly 77% of tokens in downstream benchmarks, this training advantage transfers intact. We also show that enlarging model parameters with a fixed vocabulary yields the same frequent-word benefit. Our results reframe "bigger vocabularies help" as "lowering the complexity of tokenized text helps," providing a simple, principled lever for tokenizer-model co-design and clarifying the loss dynamics that govern language-model scaling in pre-training.
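
The word-level loss decomposition mentioned in this abstract can be pictured with a small sketch: split per-word losses into a frequent bucket (the top-k most common words) and a rare bucket, then report each bucket's mean loss and the fraction of the stream the frequent bucket covers. The function name, the bucketing by in-sample counts, and the toy numbers are assumptions for illustration, not the paper's protocol or data.

```python
from collections import Counter


def decompose_loss(words, losses, top_k):
    """Split mean loss into frequent-word and rare-word parts.

    words[i] is the word at position i and losses[i] its loss.
    The top_k most common words in `words` form the frequent bucket.
    """
    counts = Counter(words)
    frequent = {w for w, _ in counts.most_common(top_k)}
    freq_losses = [l for w, l in zip(words, losses) if w in frequent]
    rare_losses = [l for w, l in zip(words, losses) if w not in frequent]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return {
        "frequent_mean": mean(freq_losses),
        "rare_mean": mean(rare_losses),
        "frequent_coverage": len(freq_losses) / len(words),
    }


# Toy stream (made up): one dominant word with low loss, a rare tail
# with high loss.
words = ["the", "cat", "the", "dog", "the", "sat"]
losses = [1.0, 4.0, 1.0, 5.0, 1.0, 6.0]
report = decompose_loss(words, losses, top_k=1)
print(report)
```

The overall mean loss equals the coverage-weighted sum of the two bucket means, so a drop in the frequent bucket can outweigh a rise in the rare tail when coverage is high.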

[30] Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with Large Language Models

Tobias Schreieder,Tim Schopf,Michael Färber

Main category: cs.CL

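TL;DR: This survey addresses the fragmentation of evidence-based text generation with large language models, which stems from inconsistent terminology, isolated evaluation practices, and a lack of unified benchmarks. It systematically analyzes 134 papers, introduces a unified taxonomy, and investigates 300 evaluation metrics across seven key dimensions.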

Details Motivation: The work is motivated by concerns about the reliability and trustworthiness of large language models, and by the fragmentation of the field caused by inconsistent terminology, isolated evaluation practices, and a lack of unified benchmarks. Method: The authors systematically analyze 134 papers, introduce a unified taxonomy of evidence-based text generation, and investigate 300 evaluation metrics across seven key dimensions. Result: The outcomes are a unified taxonomy of evidence-based text generation, a survey of 300 evaluation metrics, and an examination of the field's distinctive characteristics and representative methods. Conclusion: Evidence-based text generation needs a unified taxonomy and shared evaluation standards to improve the reliability and trustworthiness of large language models. Abstract: The increasing adoption of large language models (LLMs) has been accompanied by growing concerns regarding their reliability and trustworthiness. As a result, a growing body of research focuses on evidence-based text generation with LLMs, aiming to link model outputs to supporting evidence to ensure traceability and verifiability. However, the field is fragmented due to inconsistent terminology, isolated evaluation practices, and a lack of unified benchmarks. To bridge this gap, we systematically analyze 134 papers, introduce a unified taxonomy of evidence-based text generation with LLMs, and investigate 300 evaluation metrics across seven key dimensions. Thereby, we focus on approaches that use citations, attribution, or quotations for evidence-based text generation. Building on this, we examine the distinctive characteristics and representative methods in the field. Finally, we highlight open challenges and outline promising directions for future work.

[31] When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models

Cheng Wang,Gelei Deng,Xianglin Yang,Han Qiu,Tianwei Zhang

Main category: cs.CL

TL;DR: This paper introduces MCR-BENCH, a benchmark for evaluating how Large Audio-Language Models (LALMs) handle conflicting audio-text inputs, revealing a significant text bias that impacts performance and reliability in real-world applications.

Details Motivation: The motivation is to examine the largely unexplored issue of how Large Audio-Language Models (LALMs) handle conflicting information between audio and text modalities, aiming to improve their reliability and performance in real-world applications. Method: The authors introduce MCR-BENCH, a comprehensive benchmark for evaluating how LALMs prioritize information in inconsistent audio-text pairs. They conduct extensive evaluations across diverse audio understanding tasks, investigate factors influencing text bias, explore mitigation strategies through supervised finetuning, and analyze model confidence patterns. Result: The evaluation reveals that LALMs show a strong bias towards textual input when inconsistencies exist, often disregarding audio evidence. This leads to significant performance degradation in audio-centric tasks and persistent overconfidence in the face of contradictory inputs. Conclusion: The paper concludes that Large Audio-Language Models (LALMs) exhibit a significant bias toward textual input when inconsistencies exist between audio and text modalities, leading to performance degradation in audio-centric tasks and reliability concerns for real-world applications. Abstract: Large Audio-Language Models (LALMs) are enhanced with audio perception capabilities, enabling them to effectively process and understand multimodal inputs that combine audio and text. However, their performance in handling conflicting information between audio and text modalities remains largely unexamined. This paper introduces MCR-BENCH, the first comprehensive benchmark specifically designed to evaluate how LALMs prioritize information when presented with inconsistent audio-text pairs. Through extensive evaluation across diverse audio understanding tasks, we reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input, frequently disregarding audio evidence. 
This tendency leads to substantial performance degradation in audio-centric tasks and raises important reliability concerns for real-world applications. We further investigate the influencing factors of text bias, and explore mitigation strategies through supervised finetuning, and analyze model confidence patterns that reveal persistent overconfidence even with contradictory inputs. These findings underscore the need for improved modality balance during training and more sophisticated fusion mechanisms to enhance the robustness when handling conflicting multi-modal inputs. The project is available at https://github.com/WangCheng0116/MCR-BENCH.

[32] LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model

Yirong Sun,Yizhong Geng,Peidong Wei,Yanjun Chen,Jinghan Yang,Rongfei Chen,Wei Zhang,Xiaoyu Shen

Main category: cs.CL

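TL;DR: LLaSO is an open speech-language modeling framework that comprises large-scale datasets, multi-task training data, and a standardized evaluation benchmark, supporting reproducible research and community collaboration on LSLMs.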

Details Motivation: Progress on LSLMs has been held back by fragmented architectures and a lack of transparency, which make systematic comparison and reproduction difficult. LLaSO aims to close this gap and promote openness and standardization in speech-language modeling. Method: LLaSO provides three core resources: LLaSO-Align (a large-scale speech-text alignment corpus), LLaSO-Instruct (a multi-task instruction-tuning dataset), and LLaSO-Eval (a reproducible evaluation benchmark). A 3.8B-parameter reference model, LLaSO-Base, is trained on this data. Result: LLaSO-Base achieves a normalized score of 0.72 in standardized evaluation, surpassing comparable models, but generalization gaps remain on unseen tasks and in pure audio scenarios. Conclusion: LLaSO offers a fully open framework that advances LSLM research; by releasing data, benchmarks, and models, it provides a foundational standard for future work. Abstract: The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, hindering the systematic comparison and reproducibility of research. Unlike in the vision-language domain, the LSLM field suffers from the common practice of releasing model weights without their corresponding training data and configurations. To address these critical gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate our framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis reveals that while broader training coverage enhances performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, dataset, pretrained models, and results in https://github.com/EIT-NLP/LLaSO.

[33] A Study of Privacy-preserving Language Modeling Approaches

Pritilata Saha,Abhirup Sinha

Main category: cs.CL

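TL;DR: This paper surveys privacy-preserving approaches to language modeling, discusses their strengths and limitations, and outlines directions for future research.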

Details Motivation: Language models can disclose sensitive information under privacy attacks, so privacy preservation has become a crucial research area. Method: The authors conduct a comprehensive study of privacy-preserving language modeling approaches, giving an overview of them, highlighting their strengths, and investigating their limitations. Result: The study contributes to ongoing research on privacy-preserving language modeling and provides valuable insights. Conclusion: The study concludes that privacy-preserving language modeling is feasible and offers insights and directions for future research. Abstract: Recent developments in language modeling have increased their use in various applications and domains. Language models, often trained on sensitive data, can memorize and disclose this information during privacy attacks, raising concerns about protecting individuals' privacy rights. Preserving privacy in language models has become a crucial area of research, as privacy is one of the fundamental human rights. Despite its significance, understanding of how much privacy risk these language models possess and how it can be mitigated is still limited. This research addresses this by providing a comprehensive study of the privacy-preserving language modeling approaches. This study gives an in-depth overview of these approaches, highlights their strengths, and investigates their limitations. The outcomes of this study contribute to the ongoing research on privacy-preserving language modeling, providing valuable insights and outlining future research directions.

[34] M-HELP: Using Social Media Data to Detect Mental Health Help-Seeking Signals

MSVPJ Sathvik,Zuhair Hasan Shaik,Vivek Gupta

Main category: cs.CL

TL;DR: This paper presents M-Help, a novel dataset for detecting help-seeking behavior related to mental health disorders and their causes on social media.

Details Motivation: There is a critical gap in identifying individuals actively seeking help for mental health disorders, despite the existence of various datasets for detecting such disorders. Method: The paper introduces a novel dataset, M-Help, which identifies help-seeking behavior on social media, specific mental health disorders, and their underlying causes. Result: The M-Help dataset enables AI models to address three key tasks: identifying help-seekers, diagnosing mental health conditions, and uncovering the root causes of issues. Conclusion: AI models trained on the M-Help dataset can effectively identify help-seekers, diagnose mental health conditions, and uncover root causes of mental health issues. Abstract: Mental health disorders are a global crisis. While various datasets exist for detecting such disorders, there remains a critical gap in identifying individuals actively seeking help. This paper introduces a novel dataset, M-Help, specifically designed to detect help-seeking behavior on social media. The dataset goes beyond traditional labels by identifying not only help-seeking activity but also specific mental health disorders and their underlying causes, such as relationship challenges or financial stressors. AI models trained on M-Help can address three key tasks: identifying help-seekers, diagnosing mental health conditions, and uncovering the root causes of issues.

[35] Principle Methods of Rendering Non-equivalent Words from Uzbek and Dari to Russian and English

Mohammad Ibrahim Qani

Main category: cs.CL

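TL;DR: This study sets out to show how non-equivalent words can be rendered professionally from a source language into a target language. Using library-based research, the author presents several rendering methods and renders 25 non-equivalent words from Dari and Uzbek into English and Russian.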

Details Motivation: Misunderstandings arise mostly with non-equivalent words, because every language has its own local and internal vocabulary (for example, words for food, garments, culture, and traditions) that often has no counterpart in the target language; research is needed to remove these misunderstandings. Method: The study uses library-based research. Result: The results comprise different ways and rules of rendering non-equivalent words from the source language into the target language, together with 25 rendered examples. Conclusion: The study arrives at several methods for professionally rendering non-equivalent words from the source language into the target language, and renders 25 non-equivalent words from Dari and Uzbek into English and Russian. Abstract: These pure languages understanding directly relates to translation knowledge where linguists and translators need to work and research to eradicate misunderstanding. Misunderstandings mostly appear in non-equivalent words because there are different local and internal words like food, garment, cultural and traditional words and others in every notion. Truly, most of these words do not have equivalent in the target language and these words need to be worked and find their equivalent in the target language to fully understand both languages. The purpose of this research is to introduce the methods of rendering non-equivalent words professionally from the source language to the target language and this research has been completed using library-based research. However, some of these non-equivalent words are already professionally rendered to the target language but there are still many other words to be rendered. As a result, this research paper includes different ways and rules of rendering non-equivalent words from source language to the target language and 25 non-equivalent words have been rendered from Dari & Uzbek into English and Russian languages.

[36] PyTOD: Programmable Task-Oriented Dialogue with Execution Feedback

Alexandru Coca,Bo-Hsiang Tseng,Pete Boothroyd,Jianpeng Cheng,Mark Gaynor,Zhenxing Zhang,Joe Stacey,Tristan Guigue,Héctor Martinez Alonso,Diarmuid Ó Séaghdha,Anders Johannsen

Main category: cs.CL

TL;DR: PyTOD improves state tracking in task-oriented dialogue systems by generating executable code and using feedback mechanisms.

Details Motivation: The effectiveness of task-oriented dialogue agents hinges on accurate state tracking, yet existing methods may fail to estimate user goals reliably and stay accurate as the dialogue progresses. Method: PyTOD employs a simple constrained decoding approach that uses a language model instead of grammar rules to follow API schemata, combined with policy and execution feedback for efficient error correction. Result: PyTOD achieves state-of-the-art state tracking performance on the challenging SGD benchmark and surpasses strong baselines in both accuracy and robust user goal estimation. Conclusion: With execution-aware state tracking, PyTOD reaches state-of-the-art state tracking performance in task-oriented dialogue, demonstrating its accuracy and robustness as the dialogue progresses. Abstract: Programmable task-oriented dialogue (TOD) agents enable language models to follow structured dialogue policies, but their effectiveness hinges on accurate state tracking. We present PyTOD, an agent that generates executable code to track dialogue state and uses policy and execution feedback for efficient error correction. To this end, PyTOD employs a simple constrained decoding approach, using a language model instead of grammar rules to follow API schemata. This leads to state-of-the-art state tracking performance on the challenging SGD benchmark. Our experiments show that PyTOD surpasses strong baselines in both accuracy and robust user goal estimation as the dialogue progresses, demonstrating the effectiveness of execution-aware state tracking.

[37] RadReason: Radiology Report Evaluation Metric with Reasons and Sub-Scores

Yingshu Li,Yunyi Liu,Lingqiao Liu,Lei Wang,Luping Zhou

Main category: cs.CL

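TL;DR: RadReason is an interpretable evaluation framework for radiology reports that produces fine-grained sub-scores and human-readable justifications, surpasses existing methods, and achieves parity with GPT-4-based evaluations.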

Details Motivation: Existing methods for evaluating radiology reports either produce coarse overall scores or rely on opaque black-box models, which limits their usefulness in real-world clinical settings. Method: RadReason builds on Group Relative Policy Optimization and introduces two key techniques: Sub-score Dynamic Weighting and Majority-Guided Advantage Scaling. Result: RadReason outputs fine-grained sub-scores across six clinically defined error types and generates human-readable justifications explaining the rationale behind each score. It performs strongly on the ReXVal benchmark. Conclusion: RadReason is a novel, clinically applicable evaluation framework for radiology reports that surpasses all prior offline metrics and achieves parity with GPT-4-based evaluations while remaining explainable and cost-efficient. Abstract: Evaluating automatically generated radiology reports remains a fundamental challenge due to the lack of clinically grounded, interpretable, and fine-grained metrics. Existing methods either produce coarse overall scores or rely on opaque black-box models, limiting their usefulness in real-world clinical workflows. We introduce RadReason, a novel evaluation framework for radiology reports that not only outputs fine-grained sub-scores across six clinically defined error types, but also produces human-readable justifications that explain the rationale behind each score. Our method builds on Group Relative Policy Optimization and incorporates two key innovations: (1) Sub-score Dynamic Weighting, which adaptively prioritizes clinically challenging error types based on live F1 statistics; and (2) Majority-Guided Advantage Scaling, which adjusts policy gradient updates based on prompt difficulty derived from sub-score agreement. Together, these components enable more stable optimization and better alignment with expert clinical judgment. Experiments on the ReXVal benchmark show that RadReason surpasses all prior offline metrics and achieves parity with GPT-4-based evaluations, while remaining explainable, cost-efficient, and suitable for clinical deployment. Code will be released upon publication.
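For readers unfamiliar with Group Relative Policy Optimization, the following sketch shows the group-relative advantage it is commonly built on: each sampled response's reward is standardized against the other responses to the same prompt. This covers only the generic baseline step, not RadReason's Sub-score Dynamic Weighting or Majority-Guided Advantage Scaling. The function name, the use of the population standard deviation, and the `eps` constant are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    rewards: list of scalar rewards, one per sampled response.
    eps:     small constant (assumed) that avoids division by zero
             when all rewards in the group are identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations differ
    return [(r - mean) / (std + eps) for r in rewards]

# Two good and two bad responses to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses scoring above the group mean receive positive advantages and the rest negative ones, so no separate value network is needed to form a baseline.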

[38] SLM4Offer: Personalized Marketing Offer Generation Using Contrastive Learning Based Fine-Tuning

Vedasamhitha Challapalli,Konduru Venkat Sai,Piyush Pratap Singh,Rupesh Prasad,Arvind Maurya,Atul Singh

Main category: cs.CL

TL;DR: SLM4Offer is a generative AI model for personalized offer generation that uses contrastive learning to significantly improve offer acceptance rates.

Details Motivation: Personalized marketing is crucial for enhancing customer engagement and business growth, but there is untapped potential in creating more intelligent, data-driven approaches for offer generation. Method: SLM4Offer was developed by fine-tuning a pre-trained encoder-decoder language model (T5-Small 60M) using a contrastive learning approach and InfoNCE loss to align customer personas with relevant offers. Result: The experimental results showed a 17 percent improvement in offer acceptance rate using SLM4Offer compared to a supervised fine-tuning baseline. Conclusion: The study concludes that the SLM4Offer model, based on contrastive learning, significantly improves offer acceptance rates compared to traditional methods. Abstract: Personalized marketing has emerged as a pivotal strategy for enhancing customer engagement and driving business growth. Academic and industry efforts have predominantly focused on recommendation systems and personalized advertisements. Nonetheless, this facet of personalization holds significant potential for increasing conversion rates and improving customer satisfaction. Prior studies suggest that well-executed personalization strategies can boost revenue by up to 40 percent, underscoring the strategic importance of developing intelligent, data-driven approaches for offer generation. This work introduces SLM4Offer, a generative AI model for personalized offer generation, developed by fine-tuning a pre-trained encoder-decoder language model, specifically Google's Text-to-Text Transfer Transformer (T5-Small 60M) using a contrastive learning approach. SLM4Offer employs InfoNCE (Information Noise-Contrastive Estimation) loss to align customer personas with relevant offers in a shared embedding space. A key innovation in SLM4Offer lies in the adaptive learning behaviour introduced by contrastive loss, which reshapes the latent space during training and enhances the model's generalizability. 
The model is fine-tuned and evaluated on a synthetic dataset designed to simulate customer behaviour and offer acceptance patterns. Experimental results demonstrate a 17 percent improvement in offer acceptance rate over a supervised fine-tuning baseline, highlighting the effectiveness of contrastive objectives in advancing personalized marketing.
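As a rough illustration of the InfoNCE objective named in this abstract, the sketch below computes the loss for a batch in which persona i is paired with offer i and every other offer in the batch acts as a negative. It is not the SLM4Offer implementation: the use of in-batch negatives, cosine similarity, the temperature value, and the function name `info_nce` are all assumptions.

```python
import math

def info_nce(persona_vecs, offer_vecs, temperature=0.1):
    """Mean InfoNCE loss over a batch of (persona, offer) embedding pairs.

    persona_vecs[i] and offer_vecs[i] form the positive pair; all other
    offers in the batch are treated as negatives (an assumption).
    Vectors must be non-zero lists of floats of equal length.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    losses = []
    for i, persona in enumerate(persona_vecs):
        logits = [cosine(persona, offer) / temperature for offer in offer_vecs]
        # Numerically stable log-sum-exp over all candidate offers.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the matching offer.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each persona embedding lies closest to its own offer (`aligned`) and large when the pairing is wrong (`swapped`), which is the pressure that pulls matching pairs together in the shared embedding space.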

[39] Subjective Behaviors and Preferences in LLM: Language of Browsing

Sai Sundaresan,Harshita Chopra,Atanu R. Sinha,Koustava Goswami,Nagasai Saketh Naidu,Raghav Karan,N Anushka

Main category: cs.CL

TL;DR: A smaller, specialized language model with clusterwise training better captures individual browsing behavior than a large, one-size-fits-all model.

Details Motivation: The motivation is to question the assumption that large language models can effectively represent subjective user behaviors and preferences in web browsing, which lack the structured grammar of natural languages. Method: The researchers introduced a clusterwise language model training method called HeTLM, which accounts for user heterogeneity by using cluster-specific parameters instead of a single model for all users. Result: Results show that a smaller model with a specialized tokenizer outperforms larger models, HeTLM surpasses single models in performance, and there is an improvement in both average performance and reduced performance variance across users. Conclusion: The study concludes that a small language model trained with a page-level tokenizer and using clusterwise training can better represent user browsing behaviors than large language models. This approach leads to better performance alignment at the user level. Abstract: A Large Language Model (LLM) offers versatility across domains and tasks, purportedly benefiting users with a wide variety of behaviors and preferences. We question this perception about an LLM when users have inherently subjective behaviors and preferences, as seen in their ubiquitous and idiosyncratic browsing of websites or apps. The sequential behavior logs of pages, thus generated, form something akin to each user's self-constructed "language", albeit without the structure and grammar imbued in natural languages. We ask: (i) Can a small LM represent the "language of browsing" better than a large LM? (ii) Can an LM with a single set of parameters (or, single LM) adequately capture myriad users' heterogeneous, subjective behaviors and preferences? (iii) Can a single LM with high average performance, yield low variance in performance to make alignment good at user level? We introduce clusterwise LM training, HeTLM (Heterogeneity aware Training of Language Model), appropriate for subjective behaviors. 
We find that (i) a small LM trained using a page-level tokenizer outperforms large pretrained or finetuned LMs; (ii) HeTLM with heterogeneous cluster specific set of parameters outperforms a single LM of the same family, controlling for the number of parameters; and (iii) a higher mean and a lower variance in generation ensues, implying improved alignment.

[40] Influence-driven Curriculum Learning for Pre-training on Limited Data

Loris Schoenegger,Lukas Thoma,Terra Blevins,Benjamin Roth

Main category: cs.CL

TL;DR: This study demonstrates that curriculum learning improves language model pre-training when example difficulty is determined by training data influence, a model-centric metric.

Details Motivation: The authors aim to explore whether curriculum learning can become competitive in pre-training language models by using a model-centric difficulty metric instead of human-centered ones. Method: The authors sorted training examples by their training data influence and conducted experiments to compare the performance of models trained on these curricula with those trained in random order. Result: Models trained on curricula based on training data influence outperformed randomly ordered training by over 10 percentage points in benchmarks. Conclusion: Curriculum learning can be beneficial for language model pre-training when using a model-centric difficulty metric such as training data influence. Abstract: Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Specifically, we experiment with sorting training examples by their training data influence, a score which estimates the effect of individual training examples on the model's output. Models trained on our curricula are able to outperform ones trained in random order by over 10 percentage points in benchmarks, confirming that curriculum learning is beneficial for language model pre-training, as long as a more model-centric notion of difficulty is adopted.
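The ordering step described in this abstract can be sketched in a few lines, assuming the influence scores have already been computed by some attribution method. The function name `build_curriculum` and the `ascending` flag are hypothetical, and the abstract does not say which sort direction the authors use, so both are supported here.

```python
def build_curriculum(examples, influence_scores, ascending=True):
    """Order training examples by a precomputed influence score.

    examples:         list of training examples (any type).
    influence_scores: one score per example, assumed to come from a
                      separate training-data-influence estimator.
    ascending:        sort direction; which direction works better is
                      an empirical question not settled by this sketch.
    """
    if len(examples) != len(influence_scores):
        raise ValueError("need exactly one influence score per example")
    order = sorted(
        range(len(examples)),
        key=lambda i: influence_scores[i],
        reverse=not ascending,
    )
    return [examples[i] for i in order]

curriculum = build_curriculum(["doc_a", "doc_b", "doc_c"], [0.3, 0.1, 0.2])
```

The resulting list would replace the shuffled data order in an otherwise unchanged pre-training loop, which is what makes the comparison against random ordering a controlled one.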

[41] SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts -- Extended Version

Nghiem Thanh Pham,Tung Kieu,Duc-Manh Nguyen,Son Ha Xuan,Nghia Duong-Trung,Danh Le-Phuoc

Main category: cs.CL

TL;DR: SLM-Bench is a new benchmark that comprehensively measures small language models on accuracy, computational efficiency, and sustainability.

Details Motivation: The work fills the gap left by the lack of a systematic evaluation of Small Language Models (SLMs), particularly of their performance and environmental impact. Method: The authors introduce SLM-Bench, a new benchmark that evaluates SLMs across multiple dimensions, including accuracy, computational efficiency, and sustainability metrics. Result: The benchmark evaluates 15 SLMs on 9 NLP tasks and quantifies 11 metrics, revealing trade-offs between accuracy and energy efficiency across models. Conclusion: SLM-Bench provides a framework for comprehensively evaluating SLMs and bridges the evaluation gap between resource efficiency and real-world applicability. Abstract: Small Language Models (SLMs) offer computational efficiency and accessibility, yet a systematic evaluation of their performance and environmental impact remains lacking. We introduce SLM-Bench, the first benchmark specifically designed to assess SLMs across multiple dimensions, including accuracy, computational efficiency, and sustainability metrics. SLM-Bench evaluates 15 SLMs on 9 NLP tasks using 23 datasets spanning 14 domains. The evaluation is conducted on 4 hardware configurations, providing a rigorous comparison of their effectiveness. Unlike prior benchmarks, SLM-Bench quantifies 11 metrics across correctness, computation, and consumption, enabling a holistic assessment of efficiency trade-offs. Our evaluation considers controlled hardware conditions, ensuring fair comparisons across models. We develop an open-source benchmarking pipeline with standardized evaluation protocols to facilitate reproducibility and further research. Our findings highlight the diverse trade-offs among SLMs, where some models excel in accuracy while others achieve superior energy efficiency. SLM-Bench sets a new standard for SLM evaluation, bridging the gap between resource efficiency and real-world applicability.

[42] HebID: Detecting Social Identities in Hebrew-language Political Text

Guy Mor-Lan,Naama Rivlin-Angert,Yael R. Kaplan,Tamir Sheafer,Shaul R. Shenhav

Main category: cs.CL

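TL;DR: HebID is a multilabel Hebrew corpus for social identity detection, containing 5,536 manually annotated sentences from Israeli politicians' Facebook posts across twelve social identity categories. It is used to benchmark multilabel and single-label encoders and large language models, and to analyze identity expression in political discourse.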

Details Motivation: Existing datasets for social identity detection are predominantly English-centric, single-label, and focused on coarse identity categories, leaving non-English contexts and finer-grained identities uncovered. Method: The authors build HebID, a new multilabel Hebrew corpus of 5,536 sentences from Israeli politicians' Facebook posts, manually annotated for twelve social identity categories, and benchmark multilabel and single-label encoders alongside large language models. Result: Hebrew-tuned LLMs give the best results (macro-$F_1$ = 0.74). The classifier is applied to politicians' Facebook posts and parliamentary speeches, revealing differences in popularity, temporal trends, clustering patterns, and gender-related variation in identity expression. Conclusion: HebID provides a comprehensive foundation for studying social identities in Hebrew and can serve as a model for research in other non-English political contexts. Abstract: Political language is deeply intertwined with social identities. While social identities are often shaped by specific cultural contexts and expressed through particular uses of language, existing datasets for group and identity detection are predominantly English-centric, single-label and focus on coarse identity categories. We introduce HebID, the first multilabel Hebrew corpus for social identity detection: 5,536 sentences from Israeli politicians' Facebook posts (Dec 2018-Apr 2021), manually annotated for twelve nuanced social identities (e.g. Rightist, Ultra-Orthodox, Socially-oriented) grounded by survey data. We benchmark multilabel and single-label encoders alongside 2B-9B-parameter generative LLMs, finding that Hebrew-tuned LLMs provide the best results (macro-$F_1$ = 0.74). We apply our classifier to politicians' Facebook posts and parliamentary speeches, evaluating differences in popularity, temporal trends, clustering patterns, and gender-related variations in identity expression. We utilize identity choices from a national public survey, enabling a comparison between identities portrayed in elite discourse and the public's identity priorities. HebID provides a comprehensive foundation for studying social identities in Hebrew and can serve as a model for similar research in other non-English political contexts.
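Since the headline number here is a macro-$F_1$ over multilabel predictions, a small sketch of how that metric is usually computed may help. It is a generic illustration, not the HebID evaluation script: the set-per-sentence input format and the convention of scoring a label with no true or predicted instances as 0.0 are assumptions.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1 scores for multilabel data.

    y_true, y_pred: lists of sets, one set of labels per sentence.
    labels:         the full label inventory to average over.
    """
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denominator = 2 * tp + fp + fn
        # Assumed convention: a label that never occurs scores 0.0.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)

gold = [{"Rightist"}, {"Rightist", "Ultra-Orthodox"}, set()]
predicted = [{"Rightist"}, {"Ultra-Orthodox"}, {"Ultra-Orthodox"}]
score = macro_f1(gold, predicted, ["Rightist", "Ultra-Orthodox"])
```

Because every label contributes equally to the average, rare identity categories weigh as much as common ones, which is why macro-$F_1$ is a natural choice for an imbalanced twelve-label inventory.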

[43] Dream 7B: Diffusion Large Language Models

Jiacheng Ye,Zhihui Xie,Lin Zheng,Jiahui Gao,Zirui Wu,Xin Jiang,Zhenguo Li,Lingpeng Kong

Main category: cs.CL

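TL;DR: Dream 7B is a diffusion language model that, through parallel generation and efficient training techniques, achieves leading performance and flexibility across multiple tasks.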

Details Motivation: Existing autoregressive language models generate tokens sequentially, which limits generation efficiency and flexibility, so a new approach is needed to improve the performance and applicability of language models. Method: Dream 7B uses discrete diffusion modeling to refine sequences in parallel through iterative denoising, and is trained with AR-based LLM initialization and context-adaptive token-level noise rescheduling. Result: Dream 7B outperforms existing diffusion language models on general, mathematical, and coding tasks, and shows strong planning abilities and inference flexibility, including arbitrary-order generation, infilling, and tunable quality-speed trade-offs. Conclusion: Dream 7B is the most powerful open diffusion large language model to date, achieving its performance and flexibility through simple training techniques. Abstract: We introduce Dream 7B, the most powerful open diffusion large language model to date. Unlike autoregressive (AR) models that generate tokens sequentially, Dream 7B employs discrete diffusion modeling to refine sequences in parallel through iterative denoising. Our model consistently outperforms existing diffusion language models on general, mathematical, and coding tasks. Dream 7B demonstrates superior planning abilities and inference flexibility, including arbitrary-order generation, infilling capabilities, and tunable quality-speed trade-offs. These results are achieved through simple yet effective training techniques, including AR-based LLM initialization and context-adaptive token-level noise rescheduling. We release both Dream-Base and Dream-Instruct to facilitate further research in diffusion-based language modeling.

[44] The Enemy from Within: A Study of Political Delegitimization Discourse in Israeli Political Speech

Naama Rivlin-Angert,Guy Mor-Lan

Main category: cs.CL

TL;DR: This study introduces a computational method to analyze political delegitimization discourse (PDD), revealing its increasing prevalence and demographic and political patterns over three decades.

Details Motivation: The authors aimed to address the lack of large-scale computational studies on political delegitimization discourse (PDD) and to better understand its patterns and implications in democratic societies. Method: A two-stage classification pipeline combining fine-tuned encoder models and decoder LLMs was introduced to analyze a Hebrew-language corpus of 10,410 sentences. The data was annotated for PDD and its characteristics, such as intensity and affective framing. Result: The best model, DictaLM 2.0, achieved an F$_1$ of 0.74 for binary PDD detection and a macro-F$_1$ of 0.67 for classifying PDD characteristics. Analysis revealed increasing PDD over three decades, higher prevalence on social media, and stronger tendencies among right-leaning actors. Conclusion: The study concludes that automated analysis of political delegitimization discourse (PDD) is feasible and valuable for understanding democratic discourse, with clear trends observed in its prevalence across platforms, demographics, and political events. Abstract: We present the first large-scale computational study of political delegitimization discourse (PDD), defined as symbolic attacks on the normative validity of political entities. We curate and manually annotate a novel Hebrew-language corpus of 10,410 sentences drawn from Knesset speeches (1993-2023), Facebook posts (2018-2021), and leading news outlets, of which 1,812 instances (17.4%) exhibit PDD and 642 carry additional annotations for intensity, incivility, target type, and affective framing. We introduce a two-stage classification pipeline combining finetuned encoder models and decoder LLMs. Our best model (DictaLM 2.0) attains an F$_1$ of 0.74 for binary PDD detection and a macro-F$_1$ of 0.67 for classification of delegitimization characteristics. 
Applying this classifier to longitudinal and cross-platform data, we see a marked rise in PDD over three decades, higher prevalence on social media versus parliamentary debate, greater use by male than female politicians, and stronger tendencies among right-leaning actors - with pronounced spikes during election campaigns and major political events. Our findings demonstrate the feasibility and value of automated PDD analysis for understanding democratic discourse.

[45] SafetyFlow: An Agent-Flow System for Automated LLM Safety Benchmarking

Xiangyang Zhu,Yuan Tian,Chunyi Li,Kaiwei Zhang,Wei Sun,Guangtao Zhai

Main category: cs.CL

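TL;DR: This paper proposes SafetyFlow, a system that automates the construction of LLM safety benchmarks, significantly reducing human involvement and resource consumption. It produces the high-quality dataset SafetyFlowBench, which is used to evaluate the safety of 49 LLMs.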

Details Motivation: Existing LLM safety benchmarks rely on time-consuming, labor-intensive manual curation and exhibit significant redundancy and limited difficulty, so an automated method is needed to improve efficiency and quality. Method: The authors propose SafetyFlow, an agent-flow system that automates the construction of LLM safety benchmark datasets, orchestrating seven specialized agents to build a benchmark in four days without human intervention. Result: The SafetyFlowBench dataset contains 23,446 queries with low redundancy and strong discriminative power, and is used to evaluate the safety of 49 advanced LLMs, validating the method's efficacy and efficiency. Conclusion: SafetyFlow offers an efficient, low-redundancy, and discriminative automated benchmarking solution for LLM safety evaluation, addressing the inefficiency and high resource cost of manual benchmark construction. Abstract: The rapid proliferation of large language models (LLMs) has intensified the requirement for reliable safety evaluation to uncover model vulnerabilities. To this end, numerous LLM safety evaluation benchmarks are proposed. However, existing benchmarks generally rely on labor-intensive manual curation, which causes excessive time and resource consumption. They also exhibit significant redundancy and limited difficulty. To alleviate these problems, we introduce SafetyFlow, the first agent-flow system designed to automate the construction of LLM safety benchmarks. SafetyFlow can automatically build a comprehensive safety benchmark in only four days without any human intervention by orchestrating seven specialized agents, significantly reducing time and resource cost. Equipped with versatile tools, the agents of SafetyFlow ensure process and cost controllability while integrating human expertise into the automatic pipeline. The final constructed dataset, SafetyFlowBench, contains 23,446 queries with low redundancy and strong discriminative power. Our contribution includes the first fully automated benchmarking pipeline and a comprehensive safety benchmark. We evaluate the safety of 49 advanced LLMs on our dataset and conduct extensive experiments to validate our efficacy and efficiency.

[46] Trained Miniatures: Low cost, High Efficacy SLMs for Sales & Marketing

Ishaan Bhola,Mukunda NS,Sravanth Kurmala,Harsh Nandwani,Arihant Jain

Main category: cs.CL

TL;DR: This paper proposes 'Trained Miniatures', which are Small Language Models fine-tuned for specific high-value applications, as a cost-effective alternative to Large Language Models in targeted areas like sales and marketing outreach.

Details Motivation: Large language models require heavy computation and are costly, which makes them infeasible for targeted applications such as sales and marketing outreach. Method: The paper introduces the concept of 'Trained Miniatures' - Small Language Models (SLMs) that are fine-tuned for specific, high-value applications. Result: The introduced 'Trained Miniatures' generate domain-specific responses similar to those of large language models at a fraction of the cost. Conclusion: Trained Miniatures can generate domain-specific responses at a fraction of the cost, making them a feasible solution for targeted applications like sales and marketing outreach. Abstract: Large language models (LLMs) excel in text generation; however, these creative elements require heavy computation and are accompanied by a steep cost. Especially for targeted applications such as sales and marketing outreach, these costs are far from feasible. This paper introduces the concept of "Trained Miniatures" - Small Language Models (SLMs) fine-tuned for specific, high-value applications, generating similar domain-specific responses for a fraction of the cost.

[47] SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models

Peng Ding,Wen Sun,Dailin Li,Wei Zou,Jiaming Wang,Jiajun Chen,Shujian Huang

Main category: cs.CL

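TL;DR: This paper proposes SDGO, a reinforcement learning framework that uses the model's own discrimination ability to improve generation safety, defending effectively against jailbreak attacks without additional data or external models.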

Details Motivation: Although large language models perform well across natural language processing tasks, they remain vulnerable to jailbreak attacks that elicit harmful content. The paper reveals a key safety inconsistency: LLMs identify harmful requests effectively when acting as discriminators but struggle to resist them when acting as generators. Method: The paper proposes SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that uses the model's own discrimination capability as a reward signal to enhance generation safety through iterative self-improvement. Result: Experiments show that SDGO significantly improves model safety over both prompt-based and training-based baselines while maintaining helpfulness on general benchmarks. In addition, a small number of discriminative samples suffices to further enhance the model's generation capability. Conclusion: SDGO aligns the discrimination and generation capabilities of LLMs, improving safety against out-of-distribution jailbreak attacks without requiring additional annotated data or external models. Abstract: Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content generation. In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators. This insight inspires us to explore aligning the model's inherent discrimination and generation capabilities. To this end, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model's own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement. Our method does not require any additional annotated data or external models during the training phase. Extensive experiments demonstrate that SDGO significantly improves model safety compared to both prompt-based and training-based baselines while maintaining helpfulness on general benchmarks. By aligning LLMs' discrimination and generation capabilities, SDGO brings robust performance against out-of-distribution (OOD) jailbreaking attacks. This alignment achieves tighter coupling between these two capabilities, enabling the model's generation capability to be further enhanced with only a small amount of discriminative samples. Our code and datasets are available at https://github.com/NJUNLP/SDGO.

[48] Benchmarking Computer Science Survey Generation

Weihang Su,Anzhe Xie,Qingyao Ai,Jianming Long,Jiaxin Mao,Ziyi Ye,Yiqun Liu

Main category: cs.CL

TL;DR: The paper introduces SurGE, a new benchmark for evaluating scientific survey generation in computer science, highlighting the challenges still present in automating this process with LLMs.

Details Motivation: The motivation is the increasing difficulty of manually creating scientific survey articles due to the rapid growth of academic literature and the lack of standardized benchmarks and evaluation protocols for automating this process using LLMs. Method: The authors introduced SurGE, a benchmark for evaluating scientific survey generation, consisting of test instances and a large academic corpus. They also proposed an automated evaluation framework assessing four dimensions: information coverage, referencing accuracy, structural organization, and content quality. Result: The result is the creation of the SurGE benchmark and corpus, along with an evaluation framework. The evaluation of various LLM-based approaches shows that survey generation remains a challenging task even for advanced frameworks. Conclusion: The paper concludes that generating scientific surveys is a complex task that requires further research despite the potential offered by large language models (LLMs). Abstract: Scientific survey articles play a vital role in summarizing research progress, yet their manual creation is becoming increasingly infeasible due to the rapid growth of academic literature. While large language models (LLMs) offer promising capabilities for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To address this gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for evaluating scientific survey generation in the computer science domain. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers that serves as the retrieval pool. 
In addition, we propose an automated evaluation framework that measures generated surveys across four dimensions: information coverage, referencing accuracy, structural organization, and content quality. Our evaluation of diverse LLM-based approaches shows that survey generation remains highly challenging, even for advanced self-reflection frameworks. These findings highlight the complexity of the task and the necessity for continued research. We have open-sourced all the code, data, and models at: https://github.com/oneal2000/SurGE

[49] Position Bias Mitigates Position Bias: Mitigate Position Bias Through Inter-Position Knowledge Distillation

Yifei Wang,Feng Xiong,Yong Wang,Linjing Li,Xiangxiang Chu,Daniel Dajun Zeng

Main category: cs.CL

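TL;DR: This paper proposes Pos2Distill, a framework that mitigates position bias through inter-position knowledge distillation, improving both performance and uniformity across positions on long-context tasks.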

Details Motivation: Position bias severely impairs long-context comprehension and processing, and significant position bias persists under existing mitigation methods. Method: The paper proposes Pos2Distill, a position-to-position knowledge distillation framework that narrows performance gaps by transferring capabilities from advantageous positions to less favorable ones. Result: On long-context retrieval and reasoning tasks, Pos2Distill significantly improves performance uniformity across positions as well as overall performance; two specialized instantiations are designed for the retrieval and reasoning paradigms. Conclusion: Pos2Distill effectively reduces the impact of position bias on long-context comprehension and processing tasks, improves uniformity and performance across positions, and demonstrates cross-task generalization. Abstract: Positional bias (PB), manifesting as non-uniform sensitivity across different contextual locations, significantly impairs long-context comprehension and processing capabilities. While prior work seeks to mitigate PB through modifying the architectures causing its emergence, significant PB still persists. To address PB effectively, we introduce Pos2Distill, a position to position knowledge distillation framework. Pos2Distill transfers the superior capabilities from advantageous positions to less favorable ones, thereby reducing the huge performance gaps. The conceptual principle is to leverage the inherent, position-induced disparity to counteract the PB itself. We identify distinct manifestations of PB under retrieval and reasoning paradigms, thereby designing two specialized instantiations: Pos2Distill-R¹ and Pos2Distill-R² respectively, both grounded in this core principle. By employing the Pos2Distill approach, we achieve enhanced uniformity and significant performance gains across all contextual positions in long-context retrieval and reasoning tasks. Crucially, both specialized systems exhibit strong cross-task generalization mutually, while achieving superior performance on their respective tasks.

[50] Stemming -- The Evolution and Current State with a Focus on Bangla

Abhijit Paul,Mashiat Amin Farin,Sharif Md. Abdullah,Ahmedul Kabir,Zarif Masud,Shebuti Rayana

Main category: cs.CL

TL;DR: This paper surveys stemming approaches for Bangla, a language facing digital under-representation, highlighting challenges and advocating for improved stemmers and ongoing research to enhance language processing.

Details Motivation: Bangla, a widely spoken language, faces digital under-representation due to limited resources and annotated datasets. Stemming is crucial for reducing algorithmic complexity for low-resource, highly-inflectional languages like Bangla. Method: The paper conducts a comprehensive survey of stemming approaches for Bangla, exploring the landscape of existing research and highlighting gaps and discontinuities. Result: The paper identifies a significant gap in the existing literature on Bangla stemming, noting the discontinuity from previous research, scarcity of accessible implementations, and critiques on evaluation methodologies. Conclusion: The paper concludes by advocating for robust Bangla stemmers and continued research in the field to enhance language analysis and processing. Abstract: Bangla, the seventh most widely spoken language worldwide with 300 million native speakers, faces digital under-representation due to limited resources and lack of annotated datasets. Stemming, a critical preprocessing step in language analysis, is essential for low-resource, highly-inflectional languages like Bangla, because it can reduce the complexity of algorithms and models by significantly reducing the number of words the algorithm needs to consider. This paper conducts a comprehensive survey of stemming approaches, emphasizing the importance of handling morphological variants effectively. While exploring the landscape of Bangla stemming, it becomes evident that there is a significant gap in the existing literature. The paper highlights the discontinuity from previous research and the scarcity of accessible implementations for replication. Furthermore, it critiques the evaluation methodologies, stressing the need for more relevant metrics. In the context of Bangla's rich morphology and diverse dialects, the paper acknowledges the challenges it poses. To address these challenges, the paper suggests directions for Bangla stemmer development. 
It concludes by advocating for robust Bangla stemmers and continued research in the field to enhance language analysis and processing.
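
The abstract's claim that stemming shrinks the set of word forms a model must handle can be illustrated with a toy suffix-stripping stemmer. This is only a sketch: the suffix list is English and chosen for illustration, and the function name `stem` and the `min_stem` parameter are hypothetical. It is not one of the Bangla stemmers the survey covers.

```python
# Toy suffix-stripping stemmer (illustrative English suffixes, not Bangla rules).
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order, longest first


def stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


words = ["walk", "walks", "walked", "walking", "talk", "talks", "talking"]
stems = {stem(w) for w in words}

# Seven surface forms collapse to two stems.
print(len(set(words)), "forms ->", len(stems), "stems:", sorted(stems))
```

Rule-based strippers like this over- and under-stem easily, which is one reason the evaluation metrics the survey discusses matter.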

[51] EcomMMMU: Strategic Utilization of Visuals for Robust Multimodal E-Commerce Models

Xinyi Ling,Hanwen Du,Zhihui Zhu,Xia Ning

Main category: cs.CL

TL;DR: This paper introduces EcomMMMU, a large dataset for evaluating how well multimodal models handle e-commerce data, showing that images don't always help and can hurt performance. A new method called SUMEI is proposed to better utilize images by predicting their usefulness.

Details Motivation: The motivation stems from the question of whether product images in e-commerce truly enhance understanding or can sometimes introduce redundancy and degrade performance. Existing datasets are limited, making systematic analysis difficult. The authors aim to address this by creating a comprehensive dataset and developing a method to better utilize visual content. Method: The authors introduced EcomMMMU, a large-scale dataset for evaluating multimodal large language models (MLLMs) in e-commerce tasks. They analyzed the impact of images on performance and proposed SUMEI, a method that predicts visual utilities to improve the use of images in downstream tasks. Result: Experiments on the EcomMMMU dataset showed that product images do not consistently improve performance and can, in some cases, degrade it. The proposed SUMEI method demonstrated effectiveness and robustness in leveraging multiple images for e-commerce tasks, indicating a promising approach to handle visual content. Conclusion: The paper concludes that while e-commerce platforms have abundant multimodal data, product images do not always enhance understanding and can sometimes degrade performance. MLLMs may struggle to effectively utilize visual content, but the proposed SUMEI method can strategically leverage multiple images for better performance. Abstract: E-commerce platforms are rich in multimodal data, featuring a variety of images that depict product details. However, this raises an important question: do these images always enhance product understanding, or can they sometimes introduce redundancy or degrade performance? Existing datasets are limited in both scale and design, making it difficult to systematically examine this question. To this end, we introduce EcomMMMU, an e-commerce multimodal multitask understanding dataset with 406,190 samples and 8,989,510 images. 
EcomMMMU is comprised of multi-image visual-language data designed with 8 essential tasks and a specialized VSS subset to benchmark the capability of multimodal large language models (MLLMs) to effectively utilize visual content. Analysis on EcomMMMU reveals that product images do not consistently improve performance and can, in some cases, degrade it. This indicates that MLLMs may struggle to effectively leverage rich visual content for e-commerce tasks. Building on these insights, we propose SUMEI, a data-driven method that strategically utilizes multiple images via predicting visual utilities before using them for downstream tasks. Comprehensive experiments demonstrate the effectiveness and robustness of SUMEI. The data and code are available through https://anonymous.4open.science/r/submission25.

[52] End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning

Qiaoyu Zheng,Yuze Sun,Chaoyi Wu,Weike Zhao,Pengcheng Qiu,Yongguo Yu,Kun Sun,Yanfeng Wang,Ya Zhang,Weidi Xie

Main category: cs.CL

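TL;DR: Deep-DxSearch is a new medical diagnosis system, an agentic RAG approach trained with reinforcement learning, that substantially improves diagnostic accuracy and outperforms existing medical diagnostic frameworks.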

Details Motivation: Existing retrieval- and tool-augmented methods make weak use of external knowledge and offer poor feedback-reasoning traceability, which limits their effectiveness in medical diagnosis; a new approach is needed to address these challenges. Method: Deep-DxSearch is a reinforcement-learning-based RAG system that frames the LLM as the core agent and the retrieval corpus as its environment, trained with tailored rewards on format, retrieval, reasoning structure, and diagnostic accuracy. Result: Deep-DxSearch achieves substantial accuracy gains in diagnosing both common and rare diseases, and ablation studies on reward design and retrieval corpus components confirm its uniqueness and effectiveness. Conclusion: With its end-to-end agentic reinforcement learning framework, Deep-DxSearch consistently outperforms existing diagnostic methods across multiple data centers, including GPT-4o, DeepSeek-R1, and other medical-specific frameworks. Abstract: Accurate diagnosis with medical large language models is hindered by knowledge gaps and hallucinations. Retrieval and tool-augmented methods help, but their impact is limited by weak use of external knowledge and poor feedback-reasoning traceability. To address these challenges, we introduce Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement learning (RL) that enables steerable, traceable retrieval-augmented reasoning for medical diagnosis. In Deep-DxSearch, we first construct a large-scale medical retrieval corpus comprising patient records and reliable medical knowledge sources to support retrieval-aware reasoning across diagnostic scenarios. More crucially, we frame the LLM as the core agent and the retrieval corpus as its environment, using tailored rewards on format, retrieval, reasoning structure, and diagnostic accuracy, thereby evolving the agentic RAG policy from large-scale data through RL. Experiments demonstrate that our end-to-end agentic RL training framework consistently outperforms prompt-engineering and training-free RAG approaches across multiple data centers. After training, Deep-DxSearch achieves substantial gains in diagnostic accuracy, surpassing strong diagnostic baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks for both common and rare disease diagnosis under in-distribution and out-of-distribution settings. Moreover, ablation studies on reward design and retrieval corpus components confirm their critical roles, underscoring the uniqueness and effectiveness of our approach compared with traditional implementations. 
Finally, case studies and interpretability analyses highlight improvements in Deep-DxSearch's diagnostic policy, providing deeper insight into its performance gains and supporting clinicians in delivering more reliable and precise preliminary diagnoses. See https://github.com/MAGIC-AI4Med/Deep-DxSearch.

[53] Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis

Yufeng Zhao,Junnan Liu,Hongwei Liu,Dongsheng Zhu,Yuan Shen,Songyang Zhang,Kai Chen

Main category: cs.CL

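TL;DR: This paper introduces the ReasonZoo benchmark and two new metrics for evaluating the effectiveness of Tool-Integrated Reasoning (TIR); results show that TIR markedly improves LLM reasoning ability and efficiency.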

Details Motivation: To evaluate the effectiveness of TIR across domains and to examine whether it improves the model's reasoning behavior and ability to think. Method: The paper introduces the ReasonZoo benchmark and two new metrics, Performance-Aware Cost (PAC) and Area Under the Performance-Cost Curve (AUC-PCC). Result: TIR-enabled models outperform non-TIR models across tasks and improve on PAC and AUC-PCC, indicating enhanced reasoning efficiency. Conclusion: By incorporating external tools, TIR strengthens LLM reasoning: it performs better on both mathematical and non-mathematical tasks and also improves reasoning efficiency, reducing overthinking. Abstract: Large Language Models (LLMs) have made significant strides in reasoning tasks through methods like chain-of-thought (CoT) reasoning. However, they often fall short in tasks requiring precise computations. Tool-Integrated Reasoning (TIR) has emerged as a solution by incorporating external tools into the reasoning process. Nevertheless, the generalization of TIR in improving the reasoning ability of LLM is still unclear. Additionally, whether TIR has improved the model's reasoning behavior and helped the model think remains to be studied. We introduce ReasonZoo, a comprehensive benchmark encompassing nine diverse reasoning categories, to evaluate the effectiveness of TIR across various domains. Additionally, we propose two novel metrics, Performance-Aware Cost (PAC) and Area Under the Performance-Cost Curve (AUC-PCC), to assess reasoning efficiency. Our empirical evaluation demonstrates that TIR-enabled models consistently outperform their non-TIR counterparts in both mathematical and non-mathematical tasks. Furthermore, TIR enhances reasoning efficiency, as evidenced by improved PAC and AUC-PCC, indicating reduced overthinking and more streamlined reasoning. These findings underscore the domain-general benefits of TIR and its potential to advance LLM capabilities in complex reasoning tasks.

[54] LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Ming Yin,Dinghan Shen,Silei Xu,Jianbing Han,Sixun Dong,Mian Zhang,Yebowen Hu,Shujian Liu,Simin Ma,Song Wang,Sathish Reddy Indurthi,Xun Wang,Yiran Chen,Kaiqiang Song

Main category: cs.CL

TL;DR: This paper introduces LiveMCP-101, a new benchmark for evaluating how well AI agents solve multi-step tasks in real-world settings.

Details Motivation: Although the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, a significant gap remains in benchmarking how effectively AI agents solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. Method: The paper introduces LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools. Result: Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Conclusion: LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing the development of autonomous AI systems that reliably execute complex tasks through tool use. Abstract: Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.

cs.CV [Back]

[55] Heatmap Regression without Soft-Argmax for Facial Landmark Detection

Chiao-An Yang,Raymond A. Yeh

Main category: cs.CV

TL;DR: This paper challenges the use of Soft-argmax in facial landmark detection and proposes an alternative training objective based on structured prediction, achieving faster convergence and strong performance on benchmark datasets.

Details Motivation: The motivation of the paper is to challenge the long-standing choice of using Soft-argmax in heatmap regression-based facial landmark detection methods and explore alternative approaches that could lead to better performance and faster convergence. Method: The paper proposes a new training objective based on the classic structured prediction framework, which provides an alternative to the commonly used Soft-argmax approach for facial landmark detection. Result: The proposed method achieves state-of-the-art performance on three facial landmark benchmarks (WFLW, COFW, and 300W), with 2.2x faster convergence during training while maintaining better or competitive accuracy compared to existing methods. Conclusion: The paper concludes that Soft-argmax is not the only way to achieve strong performance in facial landmark detection and introduces a new training objective based on structured prediction, which achieves state-of-the-art results while improving training efficiency. Abstract: Facial landmark detection is an important task in computer vision with numerous applications, such as head pose estimation, expression analysis, face swapping, etc. Heatmap regression-based methods have been widely used to achieve state-of-the-art results in this task. These methods involve computing the argmax over the heatmaps to predict a landmark. Since argmax is not differentiable, these methods use a differentiable approximation, Soft-argmax, to enable end-to-end training on deep-nets. In this work, we revisit this long-standing choice of using Soft-argmax and demonstrate that it is not the only way to achieve strong performance. Instead, we propose an alternative training objective based on the classic structured prediction framework. Empirically, our method achieves state-of-the-art performance on three facial landmark benchmarks (WFLW, COFW, and 300W), converging 2.2x faster during training while maintaining better/competitive accuracy. 
Our code is available here: https://github.com/ca-joe-yang/regression-without-softarg.
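
For readers unfamiliar with the baseline this paper revisits, the sketch below contrasts hard argmax with Soft-argmax on a tiny heatmap. It shows the commonly used formulation (the expected coordinate under a softmax over heatmap values), not the paper's structured-prediction objective. The function names and the `beta` temperature parameter are assumptions made for illustration.

```python
import math


def hard_argmax_2d(heatmap):
    """Return (x, y) of the largest heatmap value; not differentiable."""
    _, row, col = max(
        (value, r, c)
        for r, line in enumerate(heatmap)
        for c, value in enumerate(line)
    )
    return col, row


def soft_argmax_2d(heatmap, beta=1.0):
    """Return the expected (x, y) under softmax(beta * heatmap).

    The result is a smooth function of the heatmap values, which is why it
    is used as a differentiable stand-in for argmax during training.
    """
    peak = max(v for line in heatmap for v in line)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [[math.exp(beta * (v - peak)) for v in line] for line in heatmap]
    total = sum(sum(line) for line in weights)
    x = sum(w * c for line in weights for c, w in enumerate(line)) / total
    y = sum(w * r for r, line in enumerate(weights) for w in line) / total
    return x, y


heatmap = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 5.0],
    [0.0, 0.0, 0.0],
]
print(hard_argmax_2d(heatmap))             # integer pixel location
print(soft_argmax_2d(heatmap, beta=10.0))  # close to the peak
print(soft_argmax_2d(heatmap, beta=0.1))   # pulled toward the map center
```

A low `beta` biases the estimate toward the center of the map, while a high `beta` approaches the hard argmax.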

[56] Fast Graph Neural Network for Image Classification

Mustafa Mohammadi Gharasuie,Luis Rueda

Main category: cs.CV

TL;DR: This paper proposes a novel image classification method combining Graph Convolutional Networks (GCNs) with Voronoi diagrams, achieving superior performance over existing approaches.

Details Motivation: The motivation is to overcome the limitations of traditional Convolutional Neural Networks (CNNs) by leveraging the relational data modeling capabilities of GCNs combined with the spatial partitioning strengths of Voronoi diagrams. Method: The study represents images as graphs using Voronoi diagrams and refines them with Delaunay triangulations. These graphs are processed using Graph Convolutional Networks (GCNs) for improved classification. Result: The proposed method achieves significant improvements in preprocessing efficiency and classification accuracy across benchmark datasets, particularly in complex scenarios. Conclusion: The research successfully integrates GCNs with Voronoi diagrams, enhancing image classification performance and providing a new perspective in computer vision and unstructured data analysis. Abstract: The rapid progress in image classification has been largely driven by the adoption of Graph Convolutional Networks (GCNs), which offer a robust framework for handling complex data structures. This study introduces a novel approach that integrates GCNs with Voronoi diagrams to enhance image classification by leveraging their ability to effectively model relational data. Unlike conventional convolutional neural networks (CNNs), our method represents images as graphs, where pixels or regions function as vertices. These graphs are then refined using corresponding Delaunay triangulations, optimizing their representation. The proposed model achieves significant improvements in both preprocessing efficiency and classification accuracy across various benchmark datasets, surpassing state-of-the-art approaches, particularly in challenging scenarios involving intricate scenes and fine-grained categories. Experimental results, validated through cross-validation, underscore the effectiveness of combining GCNs with Voronoi diagrams for advancing image classification. 
This research not only presents a novel perspective on image classification but also expands the potential applications of graph-based learning paradigms in computer vision and unstructured data analysis.

[57] You Only Pose Once: A Minimalist's Detection Transformer for Monocular RGB Category-level 9D Multi-Object Pose Estimation

Hakjin Lee,Junghoon Seo,Jaehoon Sim

Main category: cs.CV

TL;DR: YOPO is a new, efficient method for unified 2D detection and 9-DoF pose estimation using only RGB images, achieving impressive results without additional data.

Details Motivation: The need for a simpler, RGB-only alternative for accurate 9-DoF pose estimation without pseudo-depth, CAD models, or multi-stage processes. Method: YOPO uses a single-stage, query-based framework with a transformer detector, a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost. Result: YOPO achieves 79.6% IoU50 and 54.1% under the 10°10cm metric on the REAL275 dataset, outperforming previous RGB-only methods. Conclusion: YOPO offers a unified and straightforward approach for 2D detection and 9-DoF pose estimation, achieving state-of-the-art results on three benchmarks. Abstract: Accurately recovering the full 9-DoF pose of unseen instances within specific categories from a single RGB image remains a core challenge for robotics and automation. Most existing solutions still rely on pseudo-depth, CAD models, or multi-stage cascades that separate 2D detection from pose estimation. Motivated by the need for a simpler, RGB-only alternative that learns directly at the category level, we revisit a longstanding question: Can object detection and 9-DoF pose estimation be unified with high performance, without any additional data? We show that they can with our method, YOPO, a single-stage, query-based framework that treats category-level 9-DoF estimation as a natural extension of 2D detection. YOPO augments a transformer detector with a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost. The model is trained end-to-end only with RGB images and category-level pose labels. Despite its minimalist design, YOPO sets a new state of the art on three benchmarks. On the REAL275 dataset, it achieves 79.6% IoU50 and 54.1% under the 10°10cm metric, surpassing prior RGB-only methods and closing much of the gap to RGB-D systems. The code, models, and additional qualitative results can be found on our project.
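
The 10°10cm figure reported above counts a prediction as correct when its rotation error is below 10 degrees and its translation error is below 10 cm. The sketch below shows one way such a check is commonly computed. It assumes rotations are given as 3x3 matrices and translations in meters, and it ignores the symmetry handling that category-level benchmarks typically apply. All function names are hypothetical and not taken from the YOPO code.

```python
import math


def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    # trace(r_pred^T @ r_gt) equals the sum of elementwise products.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_angle))


def translation_error_cm(t_pred, t_gt):
    """Euclidean distance between translations given in meters, in cm."""
    return 100.0 * math.dist(t_pred, t_gt)


def within_10deg_10cm(r_pred, t_pred, r_gt, t_gt):
    """True when both error thresholds are met."""
    return (rotation_error_deg(r_pred, r_gt) < 10.0
            and translation_error_cm(t_pred, t_gt) < 10.0)


def rot_z(deg):
    """Rotation about the z axis, used here only to build test poses."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


IDENTITY = rot_z(0.0)
print(within_10deg_10cm(rot_z(5.0), (0.0, 0.0, 0.05), IDENTITY, (0.0, 0.0, 0.0)))
print(within_10deg_10cm(rot_z(15.0), (0.0, 0.0, 0.05), IDENTITY, (0.0, 0.0, 0.0)))
```

The benchmark score is then the fraction (or an average-precision variant) of predictions passing this check.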

[58] Paired-Sampling Contrastive Framework for Joint Physical-Digital Face Attack Detection

Andrei Balykin,Anvar Ganiev,Denis Kondranin,Kirill Polevoda,Nikolai Liudkevich,Artem Petrov

Main category: cs.CV

TL;DR: This paper proposes a unified framework for face anti-spoofing that effectively handles both physical and digital attack vectors, achieving superior performance with low computational cost.

Details Motivation: Modern face recognition systems are vulnerable to both physical presentation attacks and digital forgeries. Traditionally, these attack vectors are handled by separate models, increasing system complexity and inference latency while leaving systems exposed to combined attacks. Method: The authors proposed the Paired-Sampling Contrastive Framework, which uses automatically matched pairs of genuine and attack selfies to learn modality-agnostic liveness cues in a unified training approach. Result: The method achieved an average classification error rate (ACER) of 2.10 percent on the 6th Face Anti-Spoofing Challenge Unified Physical-Digital Attack Detection benchmark, outperforming prior solutions. The framework is lightweight (4.46 GFLOPs) and trains in under one hour. Conclusion: The Paired-Sampling Contrastive Framework is a lightweight and efficient solution for face anti-spoofing that handles both physical and digital attack vectors, achieving state-of-the-art performance on the benchmark. Abstract: Modern face recognition systems remain vulnerable to spoofing attempts, including both physical presentation attacks and digital forgeries. Traditionally, these two attack vectors have been handled by separate models, each targeting its own artifacts and modalities. However, maintaining distinct detectors increases system complexity and inference latency and leaves systems exposed to combined attack vectors. We propose the Paired-Sampling Contrastive Framework, a unified training approach that leverages automatically matched pairs of genuine and attack selfies to learn modality-agnostic liveness cues. Evaluated on the 6th Face Anti-Spoofing Challenge Unified Physical-Digital Attack Detection benchmark, our method achieves an average classification error rate (ACER) of 2.10 percent, outperforming prior solutions. The framework is lightweight (4.46 GFLOPs) and trains in under one hour, making it practical for real-world deployment. 
Code and pretrained models are available at https://github.com/xPONYx/iccv2025_deepfake_challenge.
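
The ACER reported above is conventionally the mean of two error rates: APCER (attacks accepted as genuine) and BPCER (genuine samples rejected as attacks). The sketch below computes it for binary decisions. It assumes all attack types are pooled into one class, which may differ from the challenge's exact protocol, and the function name and label encoding are illustrative.

```python
def acer(labels, preds):
    """Return (APCER, BPCER, ACER) for binary liveness decisions.

    Encoding assumed here: 1 = genuine (bona fide), 0 = attack.
    """
    attack_preds = [p for l, p in zip(labels, preds) if l == 0]
    genuine_preds = [p for l, p in zip(labels, preds) if l == 1]
    # APCER: share of attack samples wrongly accepted as genuine.
    apcer = sum(1 for p in attack_preds if p == 1) / len(attack_preds)
    # BPCER: share of genuine samples wrongly rejected as attacks.
    bpcer = sum(1 for p in genuine_preds if p == 0) / len(genuine_preds)
    return apcer, bpcer, (apcer + bpcer) / 2.0


labels = [0, 0, 0, 0, 1, 1, 1, 1]
preds = [0, 0, 0, 1, 1, 1, 0, 0]
print(acer(labels, preds))  # one of four attacks accepted, two of four genuine rejected
```

Because ACER averages the two rates, it is not skewed by an imbalance between attack and genuine samples the way plain accuracy is.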

[59] TAIGen: Training-Free Adversarial Image Generation via Diffusion Models

Susim Roy,Anubhooti Jain,Mayank Vatsa,Richa Singh

Main category: cs.CV

TL;DR: This paper proposes TAIGen, a fast and efficient method for generating adversarial images using diffusion models, requiring fewer sampling steps and offering high attack success rates while preserving image quality.

Details Motivation: Adversarial attacks from generative models often produce low-quality images and require substantial computational resources. Diffusion models typically need hundreds of sampling steps for adversarial generation, which is computationally expensive. Method: TAIGen uses a selective RGB channel strategy where attention maps are applied to the red channel and GradCAM-guided perturbations are used on the green and blue channels. It injects perturbations during the mixing step interval rather than processing all timesteps. Result: TAIGen achieves 70.6% success against ResNet, 80.8% against MNASNet, and 97.8% against ShuffleNet on ImageNet with VGGNet as the source. It generates adversarial examples 10x faster than existing diffusion-based attacks and maintains visual quality with PSNR above 30 dB across all tested datasets. Conclusion: TAIGen is a training-free black-box method that efficiently generates adversarial images using only 3-20 sampling steps from unconditional diffusion models, achieving high attack success rates while maintaining visual quality and computational efficiency. Abstract: Adversarial attacks from generative models often produce low-quality images and require substantial computational resources. Diffusion models, though capable of high-quality generation, typically need hundreds of sampling steps for adversarial generation. This paper introduces TAIGen, a training-free black-box method for efficient adversarial image generation. TAIGen produces adversarial examples using only 3-20 sampling steps from unconditional diffusion models. Our key finding is that perturbations injected during the mixing step interval achieve comparable attack effectiveness without processing all timesteps. We develop a selective RGB channel strategy that applies attention maps to the red channel while using GradCAM-guided perturbations on green and blue channels. This design preserves image structure while maximizing misclassification in target models. 
TAIGen maintains visual quality with PSNR above 30 dB across all tested datasets. On ImageNet with VGGNet as source, TAIGen achieves 70.6% success against ResNet, 80.8% against MNASNet, and 97.8% against ShuffleNet. The method generates adversarial examples 10x faster than existing diffusion-based attacks. Our method achieves the lowest robust accuracy, indicating it is the most impactful attack as the defense mechanism is least successful in purifying the images generated by TAIGen.
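The "PSNR above 30 dB" figure quoted above can be made concrete with a minimal sketch of the standard PSNR definition. The function name `psnr` and the flat-list pixel representation are illustrative assumptions, not taken from the TAIGen paper, which may compute PSNR per channel or per image.

```python
import math

def psnr(original, perturbed, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equal-length pixel sequences.

    Hypothetical helper for illustration; pixels are assumed to lie in
    [0, max_value] and to be flattened into plain Python sequences.
    """
    if len(original) != len(perturbed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the clean and the perturbed image.
    mse = sum((a - b) ** 2 for a, b in zip(original, perturbed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# A uniform perturbation of 8 grey levels on 8-bit pixels sits just above 30 dB.
value = psnr([100, 100, 100, 100], [108, 108, 108, 108])
```

Under this definition, a 30 dB floor on 8-bit images corresponds to a root-mean-square perturbation of roughly 8 grey levels.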

[60] Reversible Unfolding Network for Concealed Visual Perception with Generative Refinement

Chunming He,Fengyang Xiao,Rihan Zhang,Chengyu Fang,Deng-Ping Fan,Sina Farsiu

Main category: cs.CV

TL;DR: RUN++ is a reversible unfolding network with generative refinement for concealed visual perception, leveraging both mask and RGB domains to reduce uncertainty and improve segmentation accuracy.

Details Motivation: Existing CVP methods mainly focus on the mask domain, leaving the potential of the RGB domain underexplored. The authors aim to address this limitation by proposing a more comprehensive approach. Method: RUN++ formulates the CVP task as a mathematical optimization problem and unfolds the iterative solution into a multi-stage deep network with three modules: CORE, CARE, and FINE. It leverages reversible modeling in both mask and RGB domains and uses a targeted Bernoulli diffusion model for refinement. Result: The proposed RUN++ efficiently directs focus toward ambiguous areas, significantly mitigating false positives and negatives, and achieves efficient fine-detail restoration without prohibitive computational costs. Conclusion: RUN++ provides a new paradigm for building robust CVP systems that remain effective under real-world degradations and extends into a broader bi-level optimization framework. Abstract: Existing methods for concealed visual perception (CVP) often leverage reversible strategies to decrease uncertainty, yet these are typically confined to the mask domain, leaving the potential of the RGB domain underexplored. To address this, we propose a reversible unfolding network with generative refinement, termed RUN++. Specifically, RUN++ first formulates the CVP task as a mathematical optimization problem and unfolds the iterative solution into a multi-stage deep network. This approach provides a principled way to apply reversible modeling across both mask and RGB domains while leveraging a diffusion model to resolve the resulting uncertainty. 
Each stage of the network integrates three purpose-driven modules: a Concealed Object Region Extraction (CORE) module applies reversible modeling to the mask domain to identify core object regions; a Context-Aware Region Enhancement (CARE) module extends this principle to the RGB domain to foster better foreground-background separation; and a Finetuning Iteration via Noise-based Enhancement (FINE) module provides a final refinement. The FINE module introduces a targeted Bernoulli diffusion model that refines only the uncertain regions of the segmentation mask, harnessing the generative power of diffusion for fine-detail restoration without the prohibitive computational cost of a full-image process. This unique synergy, where the unfolding network provides a strong uncertainty prior for the diffusion model, allows RUN++ to efficiently direct its focus toward ambiguous areas, significantly mitigating false positives and negatives. Furthermore, we introduce a new paradigm for building robust CVP systems that remain effective under real-world degradations and extend this concept into a broader bi-level optimization framework.

[61] GasTwinFormer: A Hybrid Vision Transformer for Livestock Methane Emission Segmentation and Dietary Classification in Optical Gas Imaging

Toqi Tahamid Sarker,Mohamed Embaby,Taminul Islam,Amer AbuGhazaleh,Khaled R Ahmed

Main category: cs.CV

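TL;DR: This paper proposes GasTwinFormer, a hybrid vision transformer for real-time methane emission segmentation and dietary classification, and releases the first comprehensive beef cattle methane emission dataset.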

Details Motivation: Livestock methane emissions account for 32% of human-caused methane emissions, so automated monitoring is needed to support climate mitigation strategies. Method: Introduces GasTwinFormer, a hybrid vision transformer architecture that uses a Mix Twin encoder and a lightweight LR-ASPP decoder for multi-scale feature aggregation. Result: GasTwinFormer achieves 74.47% mIoU and 83.63% mF1 on segmentation and 100% accuracy on dietary classification while remaining efficient. Conclusion: GasTwinFormer is a practical solution for real-time livestock emission monitoring, with its architecture validated through extensive ablation studies. Abstract: Livestock methane emissions represent 32% of human-caused methane production, making automated monitoring critical for climate mitigation strategies. We introduce GasTwinFormer, a hybrid vision transformer for real-time methane emission segmentation and dietary classification in optical gas imaging through a novel Mix Twin encoder alternating between spatially-reduced global attention and locally-grouped attention mechanisms. Our architecture incorporates a lightweight LR-ASPP decoder for multi-scale feature aggregation and enables simultaneous methane segmentation and dietary classification in a unified framework. We contribute the first comprehensive beef cattle methane emission dataset using OGI, containing 11,694 annotated frames across three dietary treatments. GasTwinFormer achieves 74.47% mIoU and 83.63% mF1 for segmentation while maintaining exceptional efficiency with only 3.348M parameters, 3.428G FLOPs, and 114.9 FPS inference speed. Additionally, our method achieves perfect dietary classification accuracy (100%), demonstrating the effectiveness of leveraging diet-emission correlations. Extensive ablation studies validate each architectural component, establishing GasTwinFormer as a practical solution for real-time livestock emission monitoring. Please see our project page at gastwinformer.github.io.
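For readers unfamiliar with the mIoU metric reported above, the following is a minimal sketch of mean intersection-over-union on flattened label maps. The function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions; the paper does not state its exact averaging convention.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes for flattened label sequences.

    Hypothetical helper: classes that appear in neither `pred` nor `target`
    are skipped rather than counted as 0 or 1 (one common convention).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both maps, and in at least one map.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Class 0: IoU 1/2, class 1: IoU 2/3, so the mean is 7/12.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```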

[62] CurveFlow: Curvature-Guided Flow Matching for Image Generation

Yan Luo,Drake Du,Hao Huang,Yi Fang,Mengyu Wang

Main category: cs.CV

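TL;DR: CurveFlow is a new flow matching framework that introduces curvature guidance to improve semantic alignment in text-to-image generation, yielding better generation results.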

Details Motivation: Existing linear-trajectory models can force the image generation process through low-probability regions, harming the semantic alignment between generated images and text. The relationship between trajectory curvature and instruction-following ability therefore needs to be explored. Method: CurveFlow uses curvature regularization to learn smooth, non-linear trajectories that avoid passing through low-probability regions of the data manifold. Result: Experiments on MS COCO 2014 and 2017 show that CurveFlow significantly outperforms other methods on text-to-image generation, especially on semantic consistency metrics such as BLEU, METEOR, ROUGE, and CLAIR. Conclusion: By introducing curvature guidance, CurveFlow improves semantic alignment in text-to-image generation and achieves state-of-the-art performance. Abstract: Existing rectified flow models are based on linear trajectories between data and noise distributions. This linearity enforces zero curvature, which can inadvertently force the image generation process through low-probability regions of the data manifold. A key question remains underexplored: how does the curvature of these trajectories correlate with the semantic alignment between generated images and their corresponding captions, i.e., instructional compliance? To address this, we introduce CurveFlow, a novel flow matching framework designed to learn smooth, non-linear trajectories by directly incorporating curvature guidance into the flow path. Our method features a robust curvature regularization technique that penalizes abrupt changes in the trajectory's intrinsic dynamics. Extensive experiments on MS COCO 2014 and 2017 demonstrate that CurveFlow achieves state-of-the-art performance in text-to-image generation, significantly outperforming both standard rectified flow variants and other non-linear baselines like Rectified Diffusion. The improvements are especially evident in semantic consistency metrics such as BLEU, METEOR, ROUGE, and CLAIR. This confirms that our curvature-aware modeling substantially enhances the model's ability to faithfully follow complex instructions while simultaneously maintaining high image quality. The code is made publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/CurveFlow.
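The claim that linear trajectories have zero curvature can be illustrated with a small sketch. The code below uses the mean squared second finite difference of a sampled path as a stand-in for a curvature penalty. This is an assumption made for illustration only: the abstract does not specify CurveFlow's regularizer, and `second_difference_penalty` is a hypothetical name.

```python
def second_difference_penalty(points):
    """Mean squared second finite difference of a path sampled at equal time steps.

    Illustrative proxy for trajectory curvature (it measures discrete
    acceleration); it is not the regularizer used by CurveFlow.
    """
    total = 0.0
    count = 0
    for prev, cur, nxt in zip(points, points[1:], points[2:]):
        # x_{t-1} - 2 x_t + x_{t+1}, squared and summed over coordinates.
        total += sum((a - 2 * b + c) ** 2 for a, b, c in zip(prev, cur, nxt))
        count += 1
    return total / count if count else 0.0

# A straight, evenly sampled path incurs no penalty; a path with a corner does.
straight = second_difference_penalty([(0, 0), (1, 1), (2, 2)])
bent = second_difference_penalty([(0, 0), (1, 0), (1, 1)])
```

A rectified-flow path of the form (1 - t) x0 + t x1 always scores zero under this proxy, which is the sense in which linearity "enforces zero curvature".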

[63] HiRQA: Hierarchical Ranking and Quality Alignment for Opinion-Unaware Image Quality Assessment

Vaishnav Ramesh,Haining Wang,Md Jahidul Islam

Main category: cs.CV

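TL;DR: HiRQA is a new no-reference image quality assessment method that uses self-supervised and contrastive learning to provide a hierarchical, quality-aware embedding, improving assessment performance and generalization.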

Details Motivation: Despite significant progress in no-reference image quality assessment, dataset biases and reliance on subjective labels still hinder generalization performance. Method: Proposes a novel higher-order ranking loss and an embedding distance loss, and introduces a training-time contrastive alignment loss. Result: HiRQA predicts quality scores using only the input image and generalizes effectively to authentic degradations across a variety of distortions. Conclusion: HiRQA is a self-supervised, opinion-unaware framework that provides a hierarchical, quality-aware embedding through ranking and contrastive learning, demonstrating state-of-the-art performance, strong generalization, and scalability. Abstract: Despite significant progress in no-reference image quality assessment (NR-IQA), dataset biases and reliance on subjective labels continue to hinder their generalization performance. We propose HiRQA (Hierarchical Ranking and Quality Alignment), a self-supervised, opinion-unaware framework that offers a hierarchical, quality-aware embedding through a combination of ranking and contrastive learning. Unlike prior approaches that depend on pristine references or auxiliary modalities at inference time, HiRQA predicts quality scores using only the input image. We introduce a novel higher-order ranking loss that supervises quality predictions through relational ordering across distortion pairs, along with an embedding distance loss that enforces consistency between feature distances and perceptual differences. A training-time contrastive alignment loss, guided by structured textual prompts, further enhances the learned representation. Trained only on synthetic distortions, HiRQA generalizes effectively to authentic degradations, as demonstrated through evaluation on various distortions such as lens flare, haze, motion blur, and low-light conditions. For real-time deployment, we introduce HiRQA-S, a lightweight variant with an inference time of only 3.5 ms per image. Extensive experiments across synthetic and authentic benchmarks validate HiRQA's state-of-the-art (SOTA) performance, strong generalization ability, and scalability.

[64] Reliable Multi-view 3D Reconstruction for 'Just-in-time' Edge Environments

Md. Nurul Absur,Abhinav Kumar,Swastik Brahma,Saptarshi Debroy

Main category: cs.CV

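TL;DR: This paper proposes a new edge resource management strategy to improve the reliability of multi-view 3D reconstruction in dynamic environments.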

Details Motivation: To address the reliability problems that multi-view 3D reconstruction applications can face in dynamic, operationally adverse edge environments. Method: The portfolio-theoretic optimization problem is solved with a genetic algorithm, and the approach is validated on publicly available and customized 3D datasets. Result: The proposed camera selection strategy guarantees reliable 3D reconstruction under spatiotemporal disruptions, and the genetic algorithm converges quickly for realistic system settings. Conclusion: The proposed portfolio-theory-inspired edge resource management strategy can guarantee reliable multi-view 3D reconstruction under possible system disruptions. Abstract: Multi-view 3D reconstruction applications are revolutionizing critical use cases that require rapid situational-awareness, such as emergency response, tactical scenarios, and public safety. In many cases, their near-real-time latency requirements and ad-hoc needs for compute resources necessitate adoption of 'Just-in-time' edge environments where the system is set up on the fly to support the applications during the mission lifetime. However, reliability issues can arise from the inherent dynamism and operational adversities of such edge environments, resulting in spatiotemporally correlated disruptions that impact the camera operations, which can lead to sustained degradation of reconstruction quality. In this paper, we propose a novel portfolio theory inspired edge resource management strategy for reliable multi-view 3D reconstruction against possible system disruptions. Our proposed methodology can guarantee reconstruction quality satisfaction even when the cameras are prone to spatiotemporally correlated disruptions. The portfolio theoretic optimization problem is solved using a genetic algorithm that converges quickly for realistic system settings. Using publicly available and customized 3D datasets, we demonstrate the proposed camera selection strategy's benefits in guaranteeing reliable 3D reconstruction against traditional baseline strategies, under spatiotemporal disruptions.

[65] XDR-LVLM: An Explainable Vision-Language Large Model for Diabetic Retinopathy Diagnosis

Masato Ito,Kaito Tanaka,Keisuke Matsuda,Aya Nakayama

Main category: cs.CV

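TL;DR: This paper proposes XDR-LVLM, a diabetic retinopathy diagnosis framework built on vision-language large models that offers high diagnostic accuracy and natural-language explanations, improving the model's clinical applicability.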

Details Motivation: Diabetic retinopathy is a leading cause of blindness. Deep learning models perform well in diagnosis but lack interpretability, which hinders clinical adoption. A diagnostic framework that is both highly accurate and able to explain itself in natural language is therefore needed. Method: Proposes the XDR-LVLM framework, comprising a medical vision encoder and an LVLM core module, with multi-task prompt engineering and multi-stage fine-tuning, to analyze pathological features in fundus images and generate diagnostic reports. Result: Experiments on the DDR dataset show that XDR-LVLM achieves 84.55% balanced accuracy and a 79.92% F1 score for disease diagnosis, and 77.95% balanced accuracy and a 66.88% F1 score for concept detection; the generated explanations also show high fluency, accuracy, and clinical utility in human evaluation. Conclusion: XDR-LVLM effectively combines vision-language large models with medical image analysis, achieving high-performance diabetic retinopathy diagnosis while its natural-language explanations improve interpretability and clinical utility. Abstract: Diabetic Retinopathy (DR) is a major cause of global blindness, necessitating early and accurate diagnosis. While deep learning models have shown promise in DR detection, their black-box nature often hinders clinical adoption due to a lack of transparency and interpretability. To address this, we propose XDR-LVLM (eXplainable Diabetic Retinopathy Diagnosis with LVLM), a novel framework that leverages Vision-Language Large Models (LVLMs) for high-precision DR diagnosis coupled with natural language-based explanations. XDR-LVLM integrates a specialized Medical Vision Encoder, an LVLM Core, and employs Multi-task Prompt Engineering and Multi-stage Fine-tuning to deeply understand pathological features within fundus images and generate comprehensive diagnostic reports. These reports explicitly include DR severity grading, identification of key pathological concepts (e.g., hemorrhages, exudates, microaneurysms), and detailed explanations linking observed features to the diagnosis. Extensive experiments on the Diabetic Retinopathy (DDR) dataset demonstrate that XDR-LVLM achieves state-of-the-art performance, with a Balanced Accuracy of 84.55% and an F1 Score of 79.92% for disease diagnosis, and superior results for concept detection (77.95% BACC, 66.88% F1). Furthermore, human evaluations confirm the high fluency, accuracy, and clinical utility of the generated explanations, showcasing XDR-LVLM's ability to bridge the gap between automated diagnosis and clinical needs by providing robust and interpretable insights.
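Balanced accuracy (BACC), reported above, differs from plain accuracy on imbalanced grading data. The sketch below uses the common definition, the unweighted mean of per-class recall; the function name `balanced_accuracy` is illustrative, and the paper's exact evaluation code is not described in the abstract.

```python
def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true.

    Hypothetical helper for illustration of the standard definition.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    recalls = []
    for c in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)

# A classifier that always predicts the majority grade scores 80% plain
# accuracy on this toy set but only 50% balanced accuracy.
bacc = balanced_accuracy([0, 0, 0, 0, 1], [0, 0, 0, 0, 0])
```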

[66] MeSS: City Mesh-Guided Outdoor Scene Generation with Cross-View Consistent Diffusion

Xuyang Chen,Zhijun Zhai,Kaixuan Zhou,Zengmao Wang,Jianan He,Dong Wang,Yanfeng Zhang,mingwei Sun,Rüdiger Westermann,Konrad Schindler,Liqiu Meng

Main category: cs.CV

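TL;DR: This paper proposes MeSS, a mesh-based scene synthesis method that improves the cross-view consistency of image diffusion models to generate high-quality, style-consistent urban outdoor scenes, with support for rendering in multiple styles.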

Details Motivation: Mesh models lack realistic textures, which limits their use in virtual urban navigation and autonomous driving. Method: Proposes MeSS, a three-stage method: generate geometrically consistent sparse views using Cascaded Outpainting ControlNets; propagate denser intermediate views via the AGInpaint module; and globally eliminate visual inconsistencies with the GCAlign module. A 3D Gaussian Splatting scene is reconstructed concurrently. Result: The method outperforms existing approaches in geometric alignment and generation quality, and the generated scenes can be rendered in diverse styles through relighting and style transfer. Conclusion: MeSS effectively addresses the problem of generating realistic textures from mesh models and has broad application prospects. Abstract: Mesh models have become increasingly accessible for numerous cities; however, the lack of realistic textures restricts their application in virtual urban navigation and autonomous driving. To address this, this paper proposes MeSS (Mesh-based Scene Synthesis) for generating high-quality, style-consistent outdoor scenes with city mesh models serving as the geometric prior. While image and video diffusion models can leverage spatial layouts (such as depth maps or HD maps) as control conditions to generate street-level perspective views, they are not directly applicable to 3D scene generation. Video diffusion models excel at synthesizing consistent view sequences that depict scenes but often struggle to adhere to predefined camera paths or align accurately with rendered control videos. In contrast, image diffusion models, though unable to guarantee cross-view visual consistency, can produce more geometry-aligned results when combined with ControlNet. Building on this insight, our approach enhances image diffusion models by improving cross-view consistency. The pipeline comprises three key stages: first, we generate geometrically consistent sparse views using Cascaded Outpainting ControlNets; second, we propagate denser intermediate views via a component dubbed AGInpaint; and third, we globally eliminate visual inconsistencies (e.g., varying exposure) using the GCAlign module. Concurrently with generation, a 3D Gaussian Splatting (3DGS) scene is reconstructed by initializing Gaussian balls on the mesh surface. Our method outperforms existing approaches in both geometric alignment and generation quality. 
Once synthesized, the scene can be rendered in diverse styles through relighting and style transfer techniques.

[67] SurgWound-Bench: A Benchmark for Surgical Wound Diagnosis

Jiahao Xu,Changchang Yin,Odysseas Chatzipanagiotou,Diamantis Tsilimigras,Kevin Clear,Bingsheng Yao,Dakuo Wang,Timothy Pawlik,Ping Zhang

Main category: cs.CV

TL;DR: This study presents SurgWound, the first open-source dataset for surgical wound screening, and introduces WoundQwen, a three-stage learning framework for surgical wound diagnosis, aiming to improve patient outcomes through personalized care and timely interventions.

Details Motivation: The motivation of the study is to address the lack of an open-source dataset and benchmark for surgical wound screening, which has hindered progress in preventing surgical site infections and improving patient outcomes due to concerns over data privacy and high costs of expert annotation. Method: A three-stage learning framework called WoundQwen was proposed, which uses five independent MLLMs to predict specific surgical wound characteristics, two MLLMs to diagnose outcomes, and a final MLLM to produce a comprehensive report based on the previous stages' diagnostic results. Result: The study resulted in the creation of SurgWound, the first open-source dataset featuring a diverse array of surgical wound types, and the first benchmark for surgical wound diagnosis, which includes visual question answering and report generation tasks. Additionally, the proposed WoundQwen framework can analyze surgical wound characteristics and provide instructions for patient care. Conclusion: The study concludes that SurgWound and WoundQwen provide a valuable foundation for the development of open-source tools for surgical wound screening and diagnosis, paving the way for personalized wound care, timely intervention, and improved patient outcomes. Abstract: Surgical site infection (SSI) is one of the most common and costly healthcare-associated infections, and surgical wound care remains a significant clinical challenge in preventing SSIs and improving patient outcomes. While recent studies have explored the use of deep learning for preliminary surgical wound screening, progress has been hindered by concerns over data privacy and the high costs associated with expert annotation. Currently, no publicly available dataset or benchmark encompasses various types of surgical wounds, resulting in the absence of an open-source Surgical-Wound screening tool. 
To address this gap: (1) we present SurgWound, the first open-source dataset featuring a diverse array of surgical wound types. It contains 697 surgical wound images annotated by 3 professional surgeons with eight fine-grained clinical attributes. (2) Based on SurgWound, we introduce the first benchmark for surgical wound diagnosis, which includes visual question answering (VQA) and report generation tasks to comprehensively evaluate model performance. (3) Furthermore, we propose a three-stage learning framework, WoundQwen, for surgical wound diagnosis. In the first stage, we employ five independent MLLMs to accurately predict specific surgical wound characteristics. In the second stage, these predictions serve as additional knowledge inputs to two MLLMs responsible for diagnosing outcomes, which assess infection risk and guide subsequent interventions. In the third stage, we train an MLLM that integrates the diagnostic results from the previous two stages to produce a comprehensive report. This three-stage framework can analyze detailed surgical wound characteristics and provide subsequent instructions to patients based on surgical images, paving the way for personalized wound care, timely intervention, and improved patient outcomes.

[68] Adversarial Agent Behavior Learning in Autonomous Driving Using Deep Reinforcement Learning

Arjun Srinivasan,Anubhav Paras,Aniket Bera

Main category: cs.CV

TL;DR: This paper introduces a learning-based adversarial method to test and expose failure scenarios in rule-based reinforcement learning agents, particularly relevant for safety-critical systems like autonomous driving.

Details Motivation: In safety-critical applications like autonomous driving, accurately modeling the behavior of surrounding agents is crucial. Current strategies use rule-based agents and IDM models, but there is a need to test these models under adversarial conditions to identify potential failure scenarios. Method: The authors present a learning-based method to generate adversarial behavior for rule-based agents and evaluate its impact on cumulative reward. Result: The adversarial agent successfully caused a decrease in the cumulative reward of the rule-based agents, demonstrating its effectiveness in creating failure scenarios. Conclusion: The paper concludes that a learning-based method can effectively derive adversarial behavior for rule-based agents, resulting in decreased cumulative reward when evaluated against these agents. Abstract: Existing approaches in reinforcement learning train an agent to learn desired optimal behavior in an environment with rule-based surrounding agents. In safety-critical applications such as autonomous driving, it is crucial that the rule-based agents are modelled properly. Several behavior modelling strategies and IDM models are currently used to model the surrounding agents. We present a learning-based method to derive the adversarial behavior for the rule-based agents to cause failure scenarios. We evaluate our adversarial agent against all the rule-based agents and show the decrease in cumulative reward.

[69] DyMorph-B2I: Dynamic and Morphology-Guided Binary-to-Instance Segmentation for Renal Pathology

Leiyue Zhao,Yuechen Yang,Yanfan Zhu,Haichun Yang,Yuankai Huo,Paul D. Simonson,Kenji Ikemura,Mert R. Sabuncu,Yihe Yang,Ruining Deng

Main category: cs.CV

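TL;DR: Proposes DyMorph-B2I, a dynamic, morphology-guided binary-to-instance segmentation pipeline for renal pathology that significantly improves instance segmentation accuracy.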

Details Motivation: Existing renal pathology datasets and automated methods usually provide only binary (semantic) masks, limiting the precision of downstream analyses. Classical post-processing techniques such as watershed, morphological operations, and skeletonization are limited by the diverse morphologies and complex connectivity of renal tissue when separating semantic masks into instances. Method: Integrates watershed, skeletonization, and morphological operations, combined with adaptive geometric refinement and customizable hyperparameter tuning, for binary-to-instance segmentation. Result: Experimental results show that the method outperforms individual classical approaches and naive combinations on instance segmentation, achieving better instance separation and more accurate morphometric analysis. Conclusion: DyMorph-B2I is a dynamic, morphology-guided binary-to-instance segmentation pipeline that robustly separates adherent and heterogeneous structures in binary masks, outperforming individual classical approaches and naive combinations and enabling more accurate morphometric analysis in renal pathology. Abstract: Accurate morphological quantification of renal pathology functional units relies on instance-level segmentation, yet most existing datasets and automated methods provide only binary (semantic) masks, limiting the precision of downstream analyses. Although classical post-processing techniques such as watershed, morphological operations, and skeletonization are often used to separate semantic masks into instances, their individual effectiveness is constrained by the diverse morphologies and complex connectivity found in renal tissue. In this study, we present DyMorph-B2I, a dynamic, morphology-guided binary-to-instance segmentation pipeline tailored for renal pathology. Our approach integrates watershed, skeletonization, and morphological operations within a unified framework, complemented by adaptive geometric refinement and customizable hyperparameter tuning for each class of functional unit. Through systematic parameter optimization, DyMorph-B2I robustly separates adherent and heterogeneous structures present in binary masks. Experimental results demonstrate that our method outperforms individual classical approaches and naive combinations, enabling superior instance separation and facilitating more accurate morphometric analysis in renal pathology workflows. The pipeline is publicly available at: https://github.com/ddrrnn123/DyMorph-B2I.
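To make the binary-to-instance problem concrete, the sketch below implements the simplest possible baseline: 4-connected component labeling with breadth-first search. This is deliberately not the DyMorph-B2I pipeline; it is the naive starting point whose weakness (touching objects receive a single label) is what watershed, skeletonization, and morphological refinement are meant to address. The function name `label_components` is hypothetical.

```python
from collections import deque

def label_components(mask):
    """Label 4-connected foreground regions of a binary grid.

    Returns (labels, count), where labels[r][c] is 0 for background and
    1..count for each connected foreground region. Baseline illustration only.
    """
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and labels[r][c] == 0:
                count += 1  # start a new instance and flood-fill it
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count

mask = [
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
labels, count = label_components(mask)  # two separate regions
```

Two adherent structures joined by even a one-pixel bridge would come back as a single instance here, which is the failure case motivating morphology-guided separation.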

[70] STAGNet: A Spatio-Temporal Graph and LSTM Framework for Accident Anticipation

Vipooshan Vipulananthan,Kumudu Mohottala,Kavindu Chinthana,Nimsara Paramulla,Charith D Chitraranjan

Main category: cs.CV

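TL;DR: This paper proposes STAGNet, an accident anticipation model based on dash-cam video that extracts and aggregates spatio-temporal features and outperforms existing methods on multiple datasets.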

Details Motivation: Existing accident anticipation systems that depend on sensors such as LiDAR, radar, and GPS are costly and hard to deploy, whereas relying solely on dash-cam video input offers a more cost-effective and easily deployable solution. Method: Proposes STAGNet, a model that combines spatio-temporal features and aggregates them through a recurrent network to improve accident prediction from dash-cam videos. Result: Experiments show that STAGNet achieves higher average precision and mean time-to-collision values than existing methods on three public datasets. Conclusion: By incorporating better spatio-temporal features and aggregating them through a recurrent network, STAGNet outperforms existing graph neural network methods for accident anticipation. Abstract: Accident prediction and timely warnings play a key role in improving road safety by reducing the risk of injury to road users and minimizing property damage. Advanced Driver Assistance Systems (ADAS) are designed to support human drivers and are especially useful when they can anticipate potential accidents before they happen. While many existing systems depend on a range of sensors such as LiDAR, radar, and GPS, relying solely on dash-cam video input presents a more challenging but more cost-effective and easily deployable solution. In this work, we incorporate better spatio-temporal features and aggregate them through a recurrent network to improve upon state-of-the-art graph neural networks for predicting accidents from dash-cam videos. Experiments using three publicly available datasets show that our proposed STAGNet model achieves higher average precision and mean time-to-collision values than previous methods, both when cross-validated on a given dataset and when trained and tested on different datasets.

[71] Collaborative Multi-Modal Coding for High-Quality 3D Generation

Ziang Cao,Zhaoxi Chen,Liang Pan,Ziwei Liu

Main category: cs.CV

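TL;DR: This paper proposes TriMM, the first feed-forward 3D-native generative model that learns from multi-modal data (e.g., RGB, RGBD, and point clouds). Through collaborative multi-modal coding, auxiliary 2D/3D supervision, and a triplane latent diffusion model, TriMM generates high-quality 3D assets from a small amount of data and validates the effectiveness of multi-modal fusion.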

Details Motivation: 3D content is inherently multi-modal, and each modality offers complementary strengths for 3D modeling. However, most existing methods are confined to a single modality or to 3D structures, overlooking the synergy of multi-modal data and limiting the scale of available training data. Method: 1) Proposes collaborative multi-modal coding, which integrates modality-specific features while preserving their unique strengths; 2) introduces auxiliary 2D and 3D supervision to improve the robustness of multi-modal coding; 3) uses a triplane latent diffusion model on top of the multi-modal code to generate high-quality 3D assets. Result: TriMM performs well on multiple well-known datasets and is competitive with models trained on large-scale data despite using a small amount of training data. Experiments on RGB-D datasets further verify the feasibility of multi-modal fusion in 3D generation. Conclusion: By effectively fusing multi-modal data, TriMM overcomes existing 3D generative models' dependence on a single modality or on large-scale data, offering a new approach to high-quality 3D asset generation. Abstract: 3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modality data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision are introduced to raise the robustness and performance of multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both the texture and the geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves competitive performance with models trained on large-scale datasets, despite utilizing a small amount of training data. 
Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.

[72] Center-Oriented Prototype Contrastive Clustering

Shihao Dong,Xiaotong Zhou,Yuhui Zheng,Huiying Xu,Xinzhong Zhu

Main category: cs.CV

TL;DR: This paper proposes a novel contrastive clustering framework that improves clustering accuracy by addressing inter-class conflicts and prototype drift through a soft prototype module and dual consistency learning.

Details Motivation: The motivation stems from the limitations of existing contrastive clustering methods, which struggle with inter-class conflicts and inaccuracies in prototype calculation. This work aims to develop a more robust and accurate clustering approach. Method: The method introduces a soft prototype contrastive module and a dual consistency learning module. The soft prototype module calculates prototypes using probability weights to reduce inter-class conflict, while the dual consistency module aligns sample transformations and neighborhoods to ensure semantic consistency and compact clustering. Result: Experiments on five datasets demonstrate the effectiveness of the proposed method in comparison to state-of-the-art techniques, validating the advantages of the center-oriented framework. Conclusion: The proposed center-oriented prototype contrastive clustering framework effectively addresses inter-class conflicts and prototype drift, demonstrating superior performance compared to existing methods. Abstract: Contrastive learning is widely used in clustering tasks due to its discriminative representation. However, the conflict problem between classes is difficult to solve effectively. Existing methods try to solve this problem through prototype contrast, but there is a deviation between the calculation of hard prototypes and the true cluster center. To address this problem, we propose a center-oriented prototype contrastive clustering framework, which consists of a soft prototype contrastive module and a dual consistency learning module. In short, the soft prototype contrastive module uses the probability that the sample belongs to the cluster center as a weight to calculate the prototype of each category, while avoiding inter-class conflicts and reducing prototype drift. 
The dual consistency learning module aligns different transformations of the same sample and the neighborhoods of different samples respectively, ensuring that the features have transformation-invariant semantic information and compact intra-cluster distribution, while providing reliable guarantees for the calculation of prototypes. Extensive experiments on five datasets show that the proposed method is effective compared to the SOTA. Our code is published on https://github.com/LouisDong95/CPCC.
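The soft prototype idea described above, weighting each sample by its probability of belonging to a cluster, can be sketched as a probability-weighted mean of features. This is a minimal illustration under assumptions: the function name `soft_prototypes`, the plain-list feature format, and the simple normalization by total weight are not taken from the paper, whose exact formulation is not given in the abstract.

```python
def soft_prototypes(features, probs):
    """Probability-weighted cluster prototypes.

    features: list of N feature vectors (lists of floats).
    probs:    list of N rows, probs[i][k] = probability that sample i
              belongs to cluster k.
    Hypothetical sketch: prototype k is sum_i probs[i][k] * features[i]
    divided by sum_i probs[i][k].
    """
    num_clusters = len(probs[0])
    dim = len(features[0])
    prototypes = []
    for k in range(num_clusters):
        weight = sum(p[k] for p in probs)
        if weight == 0:
            prototypes.append([0.0] * dim)  # empty cluster: no evidence
            continue
        prototypes.append([
            sum(p[k] * f[d] for f, p in zip(features, probs)) / weight
            for d in range(dim)
        ])
    return prototypes

features = [[0.0, 0.0], [2.0, 2.0]]
probs = [[1.0, 0.0], [0.5, 0.5]]
protos = soft_prototypes(features, probs)
```

With one-hot probabilities this reduces to the usual hard prototype (the mean of assigned samples); soft weights let an ambiguous sample contribute partially to several prototypes instead of being forced into one.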

[73] AeroDuo: Aerial Duo for UAV-based Vision and Language Navigation

Ruipu Wu,Yige Zhang,Jinyu Chen,Linjiang Huang,Shifeng Zhang,Xu Zhou,Liang Wang,Si Liu

Main category: cs.CV

TL;DR: This paper introduces a new UAV navigation task (DuAl-VLN) and framework (AeroDuo), enabling efficient dual-UAV collaboration with minimal communication for improved navigation performance.

Details Motivation: The motivation is to enhance UAV navigation performance by leveraging high mobility and multi-grained perspectives while maintaining manageable motion complexity. Method: The researchers introduced the DuAl-VLN task involving two UAVs operating at different altitudes and developed the AeroDuo framework. They also constructed the HaL-13k dataset for training and evaluation. Result: A new dataset (HaL-13k) and a dual-UAV collaborative framework (AeroDuo) were developed to improve UAV navigation using minimal coordinate information exchange. Conclusion: The study concludes that the proposed DuAl-VLN task and AeroDuo framework enable efficient collaboration between two UAVs at different altitudes, improving navigation performance with minimal communication. Abstract: Aerial Vision-and-Language Navigation (VLN) is an emerging task that enables Unmanned Aerial Vehicles (UAVs) to navigate outdoor environments using natural language instructions and visual cues. However, due to the extended trajectories and complex maneuverability of UAVs, achieving reliable UAV-VLN performance is challenging and often requires human intervention or overly detailed instructions. To harness the advantages of UAVs' high mobility, which could provide multi-grained perspectives, while maintaining a manageable motion space for learning, we introduce a novel task called Dual-Altitude UAV Collaborative VLN (DuAl-VLN). In this task, two UAVs operate at distinct altitudes: a high-altitude UAV responsible for broad environmental reasoning, and a low-altitude UAV tasked with precise navigation. To support the training and evaluation of the DuAl-VLN, we construct the HaL-13k, a dataset comprising 13,838 collaborative high-low UAV demonstration trajectories, each paired with target-oriented language instructions. 
This dataset includes both unseen maps and an unseen object validation set to systematically evaluate the model's generalization capabilities across novel environments and unfamiliar targets. To consolidate their complementary strengths, we propose a dual-UAV collaborative VLN framework, AeroDuo, where the high-altitude UAV integrates a multimodal large language model (Pilot-LLM) for target reasoning, while the low-altitude UAV employs a lightweight multi-stage policy for navigation and target grounding. The two UAVs work collaboratively and only exchange minimal coordinate information to ensure efficiency.

[74] Pretrained Diffusion Models Are Inherently Skipped-Step Samplers

Wenju Xu

Main category: cs.CV

TL;DR: This paper proposes skipped-step sampling, which substantially accelerates diffusion model sampling without sacrificing generation quality.

Details Motivation: Diffusion models have a long sequential generation process. Existing methods such as DDIM reduce sampling steps through non-Markovian diffusion processes, but it remains unclear whether the original diffusion process can achieve the same efficiency. Method: A mechanism that skips intermediate denoising steps, further combined with DDIM for enhanced generation. Result: Experiments show that the proposed method significantly reduces sampling steps on multiple pretrained diffusion models while maintaining high-quality generation. Conclusion: Skipped-step sampling is an intrinsic property of pretrained diffusion models, enabling accelerated sampling while maintaining high-quality generation. Abstract: Diffusion models have been achieving state-of-the-art results across various generation tasks. However, a notable drawback is their sequential generation process, requiring long-sequence step-by-step generation. Existing methods, such as DDIM, attempt to reduce sampling steps by constructing a class of non-Markovian diffusion processes that maintain the same training objective. However, there remains a gap in understanding whether the original diffusion process can achieve the same efficiency without resorting to non-Markovian processes. In this paper, we provide an affirmative answer and introduce skipped-step sampling, a mechanism that bypasses multiple intermediate denoising steps in the iterative generation process, in contrast with the traditional step-by-step refinement of standard diffusion inference. Crucially, we demonstrate that this skipped-step sampling mechanism is derived from the same training objective as the standard diffusion model, indicating that accelerated sampling via skipped-step sampling in a Markovian manner is an intrinsic property of pretrained diffusion models. Additionally, we propose an enhanced generation method by integrating our accelerated sampling technique with DDIM. Extensive experiments on popular pretrained diffusion models, including the OpenAI ADM, Stable Diffusion, and Open Sora models, show that our method achieves high-quality generation with significantly reduced sampling steps.
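As a rough illustration of why a Markovian Gaussian forward process admits jumps over several steps, the sketch below computes the standard closed-form posterior q(x_s | x_t, x_0) for any s < t. This is the textbook DDPM posterior generalized from s = t - 1 to an arbitrary earlier step; it is not taken from the paper and may differ from the authors' sampler. The linear beta schedule and the function names `make_alpha_bar`, `skipped_posterior`, and `sample_skipped` are assumptions made for the example.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal coefficients; index 0 is the clean image (alpha_bar = 1).

    The linear beta schedule is an assumption for illustration only.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def skipped_posterior(alpha_bar, t, s):
    """Coefficients of the Gaussian q(x_s | x_t, x_0) for any step s < t.

    mean = coef_x0 * x_0 + coef_xt * x_t, covariance = var * I.
    With s = t - 1 this reduces to the usual one-step DDPM posterior.
    """
    assert 0 <= s < t < len(alpha_bar)
    a_t, a_s = alpha_bar[t], alpha_bar[s]
    alpha_ts = a_t / a_s          # signal kept when going from step s to step t
    beta_ts = 1.0 - alpha_ts      # noise added when going from step s to step t
    coef_x0 = np.sqrt(a_s) * beta_ts / (1.0 - a_t)
    coef_xt = np.sqrt(alpha_ts) * (1.0 - a_s) / (1.0 - a_t)
    var = beta_ts * (1.0 - a_s) / (1.0 - a_t)
    return coef_x0, coef_xt, var

def sample_skipped(x_t, x0_pred, alpha_bar, t, s, rng):
    """Draw x_s from x_t in one jump, given a (hypothetical) model estimate of x_0."""
    coef_x0, coef_xt, var = skipped_posterior(alpha_bar, t, s)
    noise = rng.standard_normal(x_t.shape)
    return coef_x0 * x0_pred + coef_xt * x_t + np.sqrt(var) * noise

# Usage: jump from step 500 directly to step 400 on dummy data.
rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x_t = rng.standard_normal((4, 4))
x0_pred = np.zeros((4, 4))        # stand-in for a denoiser's prediction
x_s = sample_skipped(x_t, x0_pred, alpha_bar, t=500, s=400, rng=rng)
```

In a real sampler `x0_pred` would come from the pretrained denoiser; the sketch only shows that the jump itself needs no retraining, since its coefficients follow from the noise schedule.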

[75] Comp-X: On Defining an Interactive Learned Image Compression Paradigm With Expert-driven LLM Agent

Yixin Gao,Xin Li,Xiaohan Pan,Runsen Feng,Bingchen Li,Yunpeng Qi,Yiting Lu,Zhengxue Cheng,Zhibo Chen,Jörn Ostermann

Main category: cs.CV

TL;DR: Comp-X introduces an intelligently interactive image compression paradigm using an LLM agent, providing efficient request understanding and textual interaction while maintaining compression performance.

Details Motivation: The motivation is to overcome the limitations of traditional image codecs, which have limited coding modes and require manual mode selection, making them unfriendly for non-expert users. Method: The method involves three key innovations: a multi-functional coding framework, an interactive coding agent powered by an LLM with augmented in-context learning, and the creation of the IIC-bench benchmark for evaluation. Result: Extensive experimental results demonstrate that Comp-X can efficiently understand coding requests and achieve impressive textual interaction capability while maintaining comparable compression performance. Conclusion: Comp-X provides a promising avenue for AGI in image compression by maintaining comparable performance with a single coding framework while efficiently understanding coding requests through textual interaction. Abstract: We present Comp-X, the first intelligently interactive image compression paradigm empowered by the impressive reasoning capability of a large language model (LLM) agent. Notably, commonly used image codecs usually suffer from limited coding modes and rely on manual mode selection by engineers, making them unfriendly for non-expert users. To overcome this, we advance the evolution of the image coding paradigm by introducing three key innovations: (i) multi-functional coding framework, which unifies different coding modes of various objectives/requirements, including human-machine perception, variable coding, and spatial bit allocation, into one framework. (ii) interactive coding agent, where we propose an augmented in-context learning method with coding expert feedback to teach the LLM agent how to understand the coding request, mode selection, and the use of the coding tools. 
(iii) IIC-bench, the first dedicated benchmark comprising diverse user requests and the corresponding annotations from coding experts, which is systematically designed for intelligently interactive image compression evaluation. Extensive experimental results demonstrate that our proposed Comp-X can understand the coding requests efficiently and achieve impressive textual interaction capability. Meanwhile, it can maintain comparable compression performance even with a single coding framework, providing a promising avenue for artificial general intelligence (AGI) in image compression.

[76] Normal and Abnormal Pathology Knowledge-Augmented Vision-Language Model for Anomaly Detection in Pathology Images

Jinsol Song,Jiamu Wang,Anh Tien Nguyen,Keunho Byeon,Sangjeong Ahn,Sung Hak Lee,Jin Tae Kwak

Main category: cs.CV

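TL;DR: Ano-NAViLa is a lightweight vision-language model augmented with pathology knowledge for anomaly detection and localization in pathology images, offering higher accuracy and interpretability.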

Details Motivation: Existing anomaly detection methods designed for industrial settings face challenges in pathology, including computational constraints, diverse tissue structures, and lack of interpretability. Method: Ano-NAViLa builds on a pre-trained vision-language model with a lightweight trainable MLP and uses both normal and abnormal pathology knowledge for anomaly detection. Result: Evaluated on two lymph node datasets from different organs, Ano-NAViLa achieves state-of-the-art performance in anomaly detection and localization, outperforming competing models. Conclusion: By incorporating normal and abnormal pathology knowledge, Ano-NAViLa improves the accuracy and robustness of anomaly detection and localization in pathology images and provides interpretability. Abstract: Anomaly detection in computational pathology aims to identify rare and scarce anomalies where disease-related data are often limited or missing. Existing anomaly detection methods, primarily designed for industrial settings, face limitations in pathology due to computational constraints, diverse tissue structures, and lack of interpretability. To address these challenges, we propose Ano-NAViLa, a Normal and Abnormal pathology knowledge-augmented Vision-Language model for Anomaly detection in pathology images. Ano-NAViLa is built on a pre-trained vision-language model with a lightweight trainable MLP. By incorporating both normal and abnormal pathology knowledge, Ano-NAViLa enhances accuracy and robustness to variability in pathology images and provides interpretability through image-text associations. Evaluated on two lymph node datasets from different organs, Ano-NAViLa achieves state-of-the-art performance in anomaly detection and localization, outperforming competing models.

[77] RATopo: Improving Lane Topology Reasoning via Redundancy Assignment

Han Li,Shaofei Huang,Longfei Xu,Yulu Gao,Beipeng Mu,Si Liu

Main category: cs.CV

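TL;DR: This paper proposes RATopo, a redundancy assignment strategy for lane topology reasoning that restructures the Transformer decoder and uses multiple parallel cross-attention blocks to provide richer and more diverse topology supervision.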

Details Motivation: The existing first-detect-then-reason paradigm supervises topological relationships over a limited range of valid supervision, leading to suboptimal topology reasoning performance. Method: The Transformer decoder is restructured by swapping the cross-attention and self-attention layers, and multiple parallel cross-attention blocks with independent parameters are instantiated, retaining redundant lane predictions and increasing the diversity of detected lanes. Result: Extensive experiments on OpenLane-V2 show that RATopo is model-agnostic and integrates seamlessly into existing topology reasoning frameworks, consistently improving lane-lane and lane-traffic topology performance. Conclusion: RATopo offers a more effective approach to lane topology reasoning with broad application potential. Abstract: Lane topology reasoning plays a critical role in autonomous driving by modeling the connections among lanes and the topological relationships between lanes and traffic elements. Most existing methods adopt a first-detect-then-reason paradigm, where topological relationships are supervised based on the one-to-one assignment results obtained during the detection stage. This supervision strategy results in suboptimal topology reasoning performance due to the limited range of valid supervision. In this paper, we propose RATopo, a Redundancy Assignment strategy for lane Topology reasoning that enables quantity-rich and geometry-diverse topology supervision. Specifically, we restructure the Transformer decoder by swapping the cross-attention and self-attention layers. This allows redundant lane predictions to be retained before suppression, enabling effective one-to-many assignment. We also instantiate multiple parallel cross-attention blocks with independent parameters, which further enhances the diversity of detected lanes. Extensive experiments on OpenLane-V2 demonstrate that our RATopo strategy is model-agnostic and can be seamlessly integrated into existing topology reasoning frameworks, consistently improving both lane-lane and lane-traffic topology performance.

[78] DesignCLIP: Multimodal Learning with CLIP for Design Patent Understanding

Zhu Wang,Homaira Huda Shomee,Sathya N. Ravi,Sourav Medya

Main category: cs.CV

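TL;DR: This paper proposes DesignCLIP, a unified CLIP-based framework for patent analysis that performs strongly on patent classification and retrieval and demonstrates the potential of multimodal approaches in patent analysis.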

Details Motivation: Traditional design patent analysis tasks depend on image data, but patent images often fail to convey comprehensive visual context and semantic information, which can lead to ambiguities in evaluation during prior art searches. Method: CLIP models are combined with class-aware classification and contrastive learning, using generated detailed captions for patent images and multi-view image learning. Result: DesignCLIP outperforms baseline and state-of-the-art models on various downstream tasks, including patent classification and patent retrieval. Multimodal patent retrieval is also explored. Conclusion: DesignCLIP is a unified CLIP-based framework for patent analysis that performs strongly on patent classification and retrieval and demonstrates the potential of multimodal approaches in patent analysis. Abstract: In the field of design patent analysis, traditional tasks such as patent classification and patent image retrieval heavily depend on the image data. However, patent images -- typically consisting of sketches with abstract and structural elements of an invention -- often fall short in conveying comprehensive visual context and semantic information. This inadequacy can lead to ambiguities in evaluation during prior art searches. Recent advancements in vision-language models, such as CLIP, offer promising opportunities for more reliable and accurate AI-driven patent analysis. In this work, we leverage CLIP models to develop a unified framework DesignCLIP for design patent applications with a large-scale dataset of U.S. design patents. To address the unique characteristics of patent data, DesignCLIP incorporates class-aware classification and contrastive learning, utilizing generated detailed captions for patent images and multi-view image learning. We validate the effectiveness of DesignCLIP across various downstream tasks, including patent classification and patent retrieval. Additionally, we explore multimodal patent retrieval, which provides the potential to enhance creativity and innovation in design by offering more diverse sources of inspiration. Our experiments show that DesignCLIP consistently outperforms baseline and SOTA models in the patent domain on all tasks. Our findings underscore the promise of multimodal approaches in advancing patent analysis. The codebase is available here: https://anonymous.4open.science/r/PATENTCLIP-4661/README.md.

[79] TPA: Temporal Prompt Alignment for Fetal Congenital Heart Defect Classification

Darya Taratynova,Alya Almsouti,Beknur Kalmakhanbet,Numan Saeed,Mohammad Yaqub

Main category: cs.CV

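TL;DR: Temporal Prompt Alignment (TPA) is a framework for fetal congenital heart defect (CHD) classification that combines temporal modeling, prompt-aware contrastive learning, and uncertainty quantification, achieving strong performance and calibration.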

Details Motivation: Congenital heart defect (CHD) detection in ultrasound videos is affected by image noise and probe positioning variability. Although automated methods can reduce operator dependence, current machine learning approaches often neglect temporal information, limit themselves to binary classification, and do not account for prediction calibration. Method: TPA extracts features from video subclips with an image encoder, aggregates them with a trainable temporal extractor to capture heart motion, and aligns the video representation with class-specific text prompts via a margin-hinge contrastive loss. A Conditional Variational Autoencoder Style Modulation (CVAESM) module learns a latent style vector to modulate embeddings and quantify classification uncertainty. Result: Evaluated on a private dataset for CHD diagnosis and on the large public dataset EchoNet-Dynamic, TPA achieves a state-of-the-art macro F1 score of 85.40% for CHD diagnosis while reducing expected calibration error by 5.38% and adaptive ECE by 6.8%. On EchoNet-Dynamic's three-class task, macro F1 improves by 4.73% (from 53.89% to 58.62%). Conclusion: Temporal Prompt Alignment (TPA) is a framework for fetal congenital heart defect (CHD) classification in ultrasound videos that integrates temporal modeling, prompt-aware contrastive learning, and uncertainty quantification. Abstract: Congenital heart defect (CHD) detection in ultrasound videos is hindered by image noise and probe positioning variability. While automated methods can reduce operator dependence, current machine learning approaches often neglect temporal information, limit themselves to binary classification, and do not account for prediction calibration. We propose Temporal Prompt Alignment (TPA), a method leveraging a foundation image-text model and prompt-aware contrastive learning to classify fetal CHD on cardiac ultrasound videos. TPA extracts features from each frame of video subclips using an image encoder, aggregates them with a trainable temporal extractor to capture heart motion, and aligns the video representation with class-specific text prompts via a margin-hinge contrastive loss. To enhance calibration for clinical reliability, we introduce a Conditional Variational Autoencoder Style Modulation (CVAESM) module, which learns a latent style vector to modulate embeddings and quantifies classification uncertainty. Evaluated on a private dataset for CHD detection and on a large public dataset, EchoNet-Dynamic, for systolic dysfunction, TPA achieves state-of-the-art macro F1 scores of 85.40% for CHD diagnosis, while also reducing expected calibration error by 5.38% and adaptive ECE by 6.8%. On EchoNet-Dynamic's three-class task, it boosts macro F1 by 4.73% (from 53.89% to 58.62%).
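The abstract does not spell out the margin-hinge contrastive loss, so the following is only a generic sketch of one common form: a multi-class hinge on cosine similarities between video embeddings and class text-prompt embeddings. The function name, the margin value, and the exact formulation are assumptions and may differ from what TPA uses.

```python
import numpy as np

def margin_hinge_contrastive_loss(video_emb, text_emb, labels, margin=0.2):
    """Generic margin-hinge loss between video and class-prompt embeddings.

    video_emb: (batch, dim) video representations.
    text_emb:  (num_classes, dim) one embedding per class prompt.
    labels:    (batch,) integer class index for each video.
    The similarity to the correct prompt should exceed every other
    similarity by at least `margin`; violations are penalized linearly.
    """
    labels = np.asarray(labels)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = v @ t.T                                   # cosine similarities, (batch, classes)
    rows = np.arange(len(labels))
    pos = sim[rows, labels][:, None]                # similarity to the true class prompt
    violations = np.maximum(0.0, margin - pos + sim)
    violations[rows, labels] = 0.0                  # the true class is not its own negative
    return violations.sum(axis=1).mean()

# Usage with toy embeddings: two classes, two videos.
prompts = np.array([[1.0, 0.0], [0.0, 1.0]])
videos = np.array([[2.0, 0.0], [0.0, 3.0]])
loss_aligned = margin_hinge_contrastive_loss(videos, prompts, [0, 1])
loss_swapped = margin_hinge_contrastive_loss(videos, prompts, [1, 0])
```

A hinge of this kind stops pushing once the margin is satisfied, which is one reason such losses are sometimes preferred over softmax-based objectives when calibration matters; whether that motivates the choice in TPA is not stated in the abstract.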

[80] BasketLiDAR: The First LiDAR-Camera Multimodal Dataset for Professional Basketball MOT

Ryunosuke Hayashi,Kohei Torimi,Rokuto Nagata,Kazuma Ikeda,Ozora Sako,Taichi Nakamura,Masaki Tani,Yoshimitsu Aoki,Kentaro Yoshioka

Main category: cs.CV

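TL;DR: This work constructs BasketLiDAR, the first multimodal dataset in the sports MOT field, and proposes a novel MOT framework that uses LiDAR and camera data to achieve both improved tracking accuracy and reduced computational cost.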

Details Motivation: Traditional multi-camera systems are constrained by the two-dimensional nature of video data and by complex 3D reconstruction processing, which makes real-time analysis challenging, especially in basketball, one of the most difficult scenarios in the MOT field. Method: A new MOT algorithm leverages LiDAR's high-precision 3D spatial information and comprises a real-time LiDAR-only tracking pipeline and a multimodal tracking pipeline that fuses LiDAR and camera data. Result: The BasketLiDAR dataset contains 4,445 frames and 3,105 player IDs, with fully synchronized IDs between three LiDAR sensors and three multi-view cameras. Experiments show that the proposed method achieves real-time operation and superior tracking performance even under occlusion. Conclusion: The approach built on BasketLiDAR achieves real-time operation, overcoming the limitations of conventional camera-only methods, while delivering superior tracking performance under occlusion. Abstract: Real-time 3D trajectory player tracking in sports plays a crucial role in tactical analysis, performance evaluation, and enhancing spectator experience. Traditional systems rely on multi-camera setups, but are constrained by the inherently two-dimensional nature of video data and the need for complex 3D reconstruction processing, making real-time analysis challenging. Basketball, in particular, represents one of the most difficult scenarios in the MOT field, as ten players move rapidly and complexly within a confined court space, with frequent occlusions caused by intense physical contact. To address these challenges, this paper constructs BasketLiDAR, the first multimodal dataset in the sports MOT field that combines LiDAR point clouds with synchronized multi-view camera footage in a professional basketball environment, and proposes a novel MOT framework that simultaneously achieves improved tracking accuracy and reduced computational cost. The BasketLiDAR dataset contains a total of 4,445 frames and 3,105 player IDs, with fully synchronized IDs between three LiDAR sensors and three multi-view cameras. We recorded 5-on-5 and 3-on-3 game data from actual professional basketball players, providing complete 3D positional information and ID annotations for each player. Based on this dataset, we developed a novel MOT algorithm that leverages LiDAR's high-precision 3D spatial information. The proposed method consists of a real-time tracking pipeline using LiDAR alone and a multimodal tracking pipeline that fuses LiDAR and camera data. 
Experimental results demonstrate that our approach achieves real-time operation, which was difficult with conventional camera-only methods, while achieving superior tracking performance even under occlusion conditions. The dataset is available upon request at: https://sites.google.com/keio.jp/keio-csg/projects/basket-lidar

[81] First RAG, Second SEG: A Training-Free Paradigm for Camouflaged Object Detection

Wutao Liu,YiDan Wang,Pan Gao

Main category: cs.CV

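TL;DR: This paper proposes RAG-SEG, a training-free COD method that combines coarse mask generation with SAM2 refinement, achieving high performance with reduced resource consumption.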

Details Motivation: Existing COD methods typically rely on heavy training and large computational resources, while foundation models such as SAM struggle with COD tasks without fine-tuning and require high-quality prompts, which are costly and inefficient to generate manually. Method: RAG-SEG builds a compact retrieval database via unsupervised clustering and, during inference, uses the retrieved features to produce pseudo-labels that guide precise mask generation with SAM2. Result: Experiments on benchmark COD datasets show that RAG-SEG performs on par with or better than state-of-the-art methods, and all experiments were run on a personal laptop, highlighting its computational efficiency and practicality. Conclusion: By decoupling COD into a RAG stage and a SEG stage, RAG-SEG offers a training-free approach that maintains competitive performance while substantially reducing computational requirements and running efficiently on a personal laptop. Abstract: Camouflaged object detection (COD) poses a significant challenge in computer vision due to the high similarity between objects and their backgrounds. Existing approaches often rely on heavy training and large computational resources. While foundation models such as the Segment Anything Model (SAM) offer strong generalization, they still struggle to handle COD tasks without fine-tuning and require high-quality prompts to yield good performance. However, generating such prompts manually is costly and inefficient. To address these challenges, we propose First RAG, Second SEG (RAG-SEG), a training-free paradigm that decouples COD into two stages: Retrieval-Augmented Generation (RAG) for generating coarse masks as prompts, followed by SAM-based segmentation (SEG) for refinement. RAG-SEG constructs a compact retrieval database via unsupervised clustering, enabling fast and effective feature retrieval. During inference, the retrieved features produce pseudo-labels that guide precise mask generation using SAM2. Our method eliminates the need for conventional training while maintaining competitive performance. Extensive experiments on benchmark COD datasets demonstrate that RAG-SEG performs on par with or surpasses state-of-the-art methods. Notably, all experiments are conducted on a personal laptop, highlighting the computational efficiency and practicality of our approach. We present further analysis in the Appendix, covering limitations, salient object detection extension, and possible improvements.

[82] VideoEraser: Concept Erasure in Text-to-Video Diffusion Models

Naen Xu,Jinghuai Zhang,Changjiang Li,Zhi Chen,Chunyi Zhou,Qingming Li,Tianyu Du,Shouling Ji

Main category: cs.CV

TL;DR: VideoEraser is a training-free framework designed to prevent text-to-video diffusion models from generating harmful content, offering significant improvements in performance over existing methods.

Details Motivation: The rapid growth of text-to-video (T2V) diffusion models raises concerns about privacy, copyright, and safety due to potential misuse in generating harmful or misleading content. Method: VideoEraser uses a two-stage process: Selective Prompt Embedding Adjustment (SPEA) and Adversarial-Resilient Noise Guidance (ARNG), and operates as a plug-and-play module with existing T2V diffusion models. Result: VideoEraser consistently outperforms prior methods in efficacy, integrity, fidelity, robustness, and generalizability, achieving a 46% reduction on average across four tasks in suppressing undesirable content during T2V generation. Conclusion: VideoEraser is a training-free framework that effectively prevents T2V diffusion models from generating videos with undesirable content, demonstrating superior performance over previous methods. Abstract: The rapid growth of text-to-video (T2V) diffusion models has raised concerns about privacy, copyright, and safety due to their potential misuse in generating harmful or misleading content. These models are often trained on numerous datasets, including unauthorized personal identities, artistic creations, and harmful materials, which can lead to uncontrolled production and distribution of such content. To address this, we propose VideoEraser, a training-free framework that prevents T2V diffusion models from generating videos with undesirable concepts, even when explicitly prompted with those concepts. Designed as a plug-and-play module, VideoEraser can seamlessly integrate with representative T2V diffusion models via a two-stage process: Selective Prompt Embedding Adjustment (SPEA) and Adversarial-Resilient Noise Guidance (ARNG). We conduct extensive evaluations across four tasks, including object erasure, artistic style erasure, celebrity erasure, and explicit content erasure. 
Experimental results show that VideoEraser consistently outperforms prior methods regarding efficacy, integrity, fidelity, robustness, and generalizability. Notably, VideoEraser achieves state-of-the-art performance in suppressing undesirable content during T2V generation, reducing it by 46% on average across four tasks compared to baselines.

[83] Predicting Road Crossing Behaviour using Pose Detection and Sequence Modelling

Subhasis Dasgupta,Preetam Saha,Agniva Roy,Jaydip Sen

Main category: cs.CV

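TL;DR: This paper presents a deep learning framework for predicting pedestrians' intent to cross the road. Comparing sequence modelling techniques (GRU, LSTM, and 1D CNN), it finds that GRU predicts intent better than LSTM, while 1D CNN is the fastest.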

Details Motivation: As AI technology advances, autonomous vehicles need to predict pedestrian behaviour accurately, in particular judging from a distance whether a pedestrian is about to cross the road, to improve traffic safety and efficiency. Method: The study uses deep learning models for pose detection and sequence modelling, analyses three sequence modelling techniques (GRU, LSTM, and 1D CNN), and integrates video analysis with the pose detection model to build an end-to-end deep learning framework for predicting road crossing intent. Result: GRU outperforms LSTM in predicting the intent to cross the road, while 1D CNN performs best in processing speed. Conclusion: The study develops an end-to-end deep learning framework for predicting pedestrians' road crossing intent and shows the performance differences among sequence modelling techniques, offering a practical solution for pedestrian behaviour prediction in autonomous vehicles. Abstract: The world is constantly moving towards AI-based systems, and autonomous vehicles are now a reality in different parts of the world. These vehicles require sensors and cameras to detect objects and maneuver accordingly. It becomes important for such vehicles to also predict from a distance whether a person is about to cross a road or not. The current study focused on predicting the intent of crossing the road by pedestrians in an experimental setup. The study involved working with deep learning models to predict poses and sequence modelling for temporal predictions. The study analysed three different sequence modelling techniques to understand the prediction behaviour, and it was found that GRU was better at predicting the intent than the LSTM model, but 1D CNN was the best model in terms of speed. The study involved video analysis, and the output of the pose detection model was later integrated with sequence modelling techniques for an end-to-end deep learning framework for predicting road crossing intents.

[84] RCDINO: Enhancing Radar-Camera 3D Object Detection with DINOv2 Semantic Features

Olga Matykina,Dmitry Yudin

Main category: cs.CV

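TL;DR: This work proposes RCDINO, a new multimodal 3D object detection method that fuses visual backbone features with features from the pretrained DINOv2 model, achieving strong performance on the nuScenes dataset.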

Details Motivation: Three-dimensional object detection is essential for autonomous driving and robotics and relies on effective fusion of multimodal data from cameras and radar. Method: RCDINO is a multimodal transformer-based model that enhances visual representations by fusing visual backbone features with semantically rich representations from the pretrained DINOv2 foundation model. Result: Experiments show that RCDINO achieves state-of-the-art performance among radar-camera models, reaching 56.4 NDS and 48.1 mAP on the nuScenes dataset. Conclusion: RCDINO achieves state-of-the-art performance among radar-camera models, reaching 56.4 NDS and 48.1 mAP on the nuScenes dataset. Abstract: Three-dimensional object detection is essential for autonomous driving and robotics, relying on effective fusion of multimodal data from cameras and radar. This work proposes RCDINO, a multimodal transformer-based model that enhances visual backbone features by fusing them with semantically rich representations from the pretrained DINOv2 foundation model. This approach enriches visual representations and improves the model's detection performance while preserving compatibility with the baseline architecture. Experiments on the nuScenes dataset demonstrate that RCDINO achieves state-of-the-art performance among radar-camera models, with 56.4 NDS and 48.1 mAP. Our implementation is available at https://github.com/OlgaMatykina/RCDINO.

[85] An Empirical Study on How Video-LLMs Answer Video Questions

Chenhui Gou,Ziyu Ma,Zicheng Duan,Haoyu He,Feng Chen,Akide Liu,Bohan Zhuang,Jianfei Cai,Hamid Rezatofighi

Main category: cs.CV

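TL;DR: This paper analyzes the internal workings of Video-LLMs through attention mechanisms, revealing the staged nature of their information processing, the influence of critical layers, and the dominant role of language guidance in spatial-temporal modeling, offering new directions for interpretability and efficiency optimization.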

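Details Motivation: Although Video-LLMs answer video questions well, existing research focuses mainly on improving performance, and understanding of their internal mechanisms remains limited. The paper aims to fill this gap through a systematic empirical study. Method: The paper uses attention knockouts as its primary analytical tool and designs three variants: Video Temporal Knockout, Video Spatial Knockout, and Language-to-Video Knockout. These knockouts are applied to different numbers of layers (windows of layers) and analyzed in a global setting and a fine-grained setting. Result: The study reports three key findings: (1) the global setting indicates that video information extraction occurs primarily in early layers, forming a two-stage process in which lower layers handle perceptual encoding and higher layers handle abstract reasoning; (2) in the fine-grained setting, certain intermediate layers have an outsized impact on video question answering, acting as critical outliers, whereas most other layers contribute little; (3) in both settings, spatial-temporal modeling relies more on language-guided retrieval than on self-attention among video tokens, despite the latter's high computational cost. These findings can also be used to reduce attention computation in Video-LLMs. Conclusion: To the authors' knowledge, this is the first work to systematically reveal how Video-LLMs internally process and understand video content, offering new perspectives on interpretability and efficiency optimization through attention analysis. Abstract: Taking advantage of large-scale data and pretrained language models, Video Large Language Models (Video-LLMs) have shown strong capabilities in answering video questions. However, most existing efforts focus on improving performance, with limited attention to understanding their internal mechanisms. This paper aims to bridge this gap through a systematic empirical study. To interpret existing Video-LLMs, we adopt attention knockouts as our primary analytical tool and design three variants: Video Temporal Knockout, Video Spatial Knockout, and Language-to-Video Knockout. Then, we apply these three knockouts on different numbers of layers (window of layers). By carefully controlling the window of layers and types of knockouts, we provide two settings: a global setting and a fine-grained setting. Our study reveals three key findings: (1) Global setting indicates Video information extraction primarily occurs in early layers, forming a clear two-stage process -- lower layers focus on perceptual encoding, while higher layers handle abstract reasoning; (2) In the fine-grained setting, certain intermediate layers exert an outsized impact on video question answering, acting as critical outliers, whereas most other layers contribute minimally; (3) In both settings, we observe that spatial-temporal modeling relies more on language-guided retrieval than on intra- and inter-frame self-attention among video tokens, despite the latter's high computational cost. 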
Finally, we demonstrate that these insights can be leveraged to reduce attention computation in Video-LLMs. To our knowledge, this is the first work to systematically uncover how Video-LLMs internally process and understand video content, offering interpretability and efficiency perspectives for future research.
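To make the idea of an attention knockout concrete, here is a minimal single-head sketch: selected query-to-key connections are masked out before the softmax, so those queries can no longer read from those keys. The helper names and the token layout (video tokens first, then text tokens) are assumptions for illustration; the paper's knockout variants and layer windows are not reproduced here.

```python
import numpy as np

def language_to_video_mask(is_video):
    """Block every text query from attending to any video key.

    is_video: (seq,) boolean array marking video tokens.
    Returns a (seq, seq) boolean array; True means "knocked out".
    """
    is_video = np.asarray(is_video, dtype=bool)
    return (~is_video)[:, None] & is_video[None, :]

def attention_with_knockout(q, k, v, blocked):
    """Scaled dot-product attention with the blocked connections removed.

    Assumes every query keeps at least one unblocked key, otherwise the
    softmax for that row would be undefined.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(blocked, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Usage: 4 video tokens followed by 2 text tokens, random features.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 8)) for _ in range(3))
is_video = np.array([True] * 4 + [False] * 2)
out, weights = attention_with_knockout(q, k, v, language_to_video_mask(is_video))
```

Comparing a model's answers with and without such a mask in a chosen window of layers is the basic experiment; a large accuracy drop suggests the blocked pathway carried information needed for the task.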

[86] Transfer learning optimization based on evolutionary selective fine tuning

Jacinto Colan,Ana Davila,Yasuhisa Hasegawa

Main category: cs.CV

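TL;DR: This paper proposes BioTune, an evolutionary adaptive fine-tuning technique that improves the efficiency and performance of transfer learning by selectively fine-tuning key layers of a neural network.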

Details Motivation: Traditional fine-tuning usually updates all model parameters, which can lead to overfitting and higher computational cost. The paper therefore proposes an evolutionary adaptive fine-tuning technique to make transfer learning more efficient. Method: BioTune uses an evolutionary algorithm to identify the layers to fine-tune, optimizing model performance on a given target task. Result: Evaluation on nine image classification datasets from various domains shows that BioTune achieves competitive or improved accuracy and efficiency compared to existing fine-tuning methods such as AutoRGN and LoRA. Conclusion: By selectively fine-tuning relevant layers, BioTune reduces the number of trainable parameters, which lowers computational cost and facilitates efficient transfer learning across diverse data characteristics and distributions. Abstract: Deep learning has shown substantial progress in image analysis. However, the computational demands of large, fully trained models remain a consideration. Transfer learning offers a strategy for adapting pre-trained models to new tasks. Traditional fine-tuning often involves updating all model parameters, which can potentially lead to overfitting and higher computational costs. This paper introduces BioTune, an evolutionary adaptive fine-tuning technique that selectively fine-tunes layers to enhance transfer learning efficiency. BioTune employs an evolutionary algorithm to identify a focused set of layers for fine-tuning, aiming to optimize model performance on a given target task. Evaluation across nine image classification datasets from various domains indicates that BioTune achieves competitive or improved accuracy and efficiency compared to existing fine-tuning methods such as AutoRGN and LoRA. By concentrating the fine-tuning process on a subset of relevant layers, BioTune reduces the number of trainable parameters, potentially leading to decreased computational cost and facilitating more efficient transfer learning across diverse data characteristics and distributions.
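The abstract gives no algorithmic details, so the sketch below only illustrates the general idea of evolutionary layer selection: candidate solutions are boolean masks over layers, and a fitness function scores each mask. The simple elitist scheme, the mutation rate, and the toy fitness function are all assumptions; in practice the fitness would be validation accuracy after briefly fine-tuning the selected layers, and BioTune's actual operators may differ.

```python
import random

def evolve_layer_mask(num_layers, fitness, generations=30, pop_size=12,
                      mutation_rate=0.1, seed=0):
    """Search for a boolean mask of layers to fine-tune.

    Keeps the better half of the population each generation (elitism)
    and refills it with mutated copies of the survivors.
    """
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(num_layers)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            child = list(rng.choice(parents))        # copy a surviving mask
            for i in range(num_layers):
                if rng.random() < mutation_rate:     # flip a layer on or off
                    child[i] = not child[i]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

def toy_fitness(mask):
    """Hypothetical stand-in for validation accuracy.

    Rewards unfreezing layers 5 to 7 and charges a small cost per
    unfrozen layer, mimicking an accuracy versus parameter-count trade-off.
    """
    useful = {5, 6, 7}
    gain = sum(1.0 for i, on in enumerate(mask) if on and i in useful)
    cost = 0.2 * sum(mask)
    return gain - cost

best_mask = evolve_layer_mask(num_layers=10, fitness=toy_fitness)
layers_to_finetune = [i for i, on in enumerate(best_mask) if on]
```

Because each fitness evaluation in a real setting involves training, population size and generation count dominate the search cost, which is why such methods typically use short fine-tuning runs as a proxy.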

[87] Image-Conditioned 3D Gaussian Splat Quantization

Xinshuang Liu,Runfa Blark Li,Keito Suzuki,Truong Nguyen

Main category: cs.CV

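TL;DR: This paper proposes a new 3D Gaussian splat compression method that substantially improves compression efficiency while preserving visual quality and supports scene updates after long-term archival.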

Details Motivation: Existing 3DGS compression methods are limited for large-scale scenes and long-term archival, and they need to accommodate scene changes. Method: The approach exploits inter-Gaussian and inter-attribute correlations together with shared codebooks, and trains the encoding, quantization, and decoding processes jointly for efficient compression. Result: It reduces 3DGS storage requirements to the kilobyte range and can adapt the scene based on images captured at decoding time. Conclusion: ICGS-Quantizer achieves efficient 3D Gaussian splat compression while supporting scene updates, outperforming existing methods. Abstract: 3D Gaussian Splatting (3DGS) has attracted considerable attention for enabling high-quality real-time rendering. Although 3DGS compression methods have been proposed for deployment on storage-constrained devices, two limitations hinder archival use: (1) they compress medium-scale scenes only to the megabyte range, which remains impractical for large-scale scenes or extensive scene collections; and (2) they lack mechanisms to accommodate scene changes after long-term archival. To address these limitations, we propose an Image-Conditioned Gaussian Splat Quantizer (ICGS-Quantizer) that substantially enhances compression efficiency and provides adaptability to scene changes after archiving. ICGS-Quantizer improves quantization efficiency by jointly exploiting inter-Gaussian and inter-attribute correlations and by using shared codebooks across all training scenes, which are then fixed and applied to previously unseen test scenes, eliminating the overhead of per-scene codebooks. This approach effectively reduces the storage requirements for 3DGS to the kilobyte range while preserving visual fidelity. To enable adaptability to post-archival scene changes, ICGS-Quantizer conditions scene decoding on images captured at decoding time. The encoding, quantization, and decoding processes are trained jointly, ensuring that the codes, which are quantized representations of the scene, are effective for conditional decoding. We evaluate ICGS-Quantizer on 3D scene compression and 3D scene updating. Experimental results show that ICGS-Quantizer consistently outperforms state-of-the-art methods in compression efficiency and adaptability to scene changes. Our code, model, and data will be publicly available on GitHub.
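The storage saving from a shared, fixed codebook can be illustrated with plain nearest-neighbour vector quantization: each feature vector is replaced by the index of its closest codebook entry, and only the indices need to be stored per scene. This is a generic sketch under that assumption; the function name and toy data are hypothetical, and ICGS-Quantizer's learned encoder, joint training, and image-conditioned decoder are not modeled.

```python
import numpy as np

def quantize_with_shared_codebook(features, codebook):
    """Map each feature vector to its nearest codebook entry.

    features: (n, dim) per-Gaussian feature vectors for one scene.
    codebook: (k, dim) entries shared across scenes and kept fixed.
    Returns the integer codes (what would be stored) and the
    reconstructed vectors (what a decoder would start from).
    """
    diffs = features[:, None, :] - codebook[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=-1)      # (n, k) squared distances
    codes = sq_dist.argmin(axis=1)
    return codes, codebook[codes]

# Usage: three codebook entries, three feature vectors.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
features = np.array([[0.9, 1.2], [4.0, 6.0], [0.1, -0.1]])
codes, reconstructed = quantize_with_shared_codebook(features, codebook)
```

Since the codebook is shared and fixed, its size is amortized over all scenes, and the per-scene cost is roughly the number of codes times the bits per index.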

[88] DriveSplat: Decoupled Driving Scene Reconstruction with Geometry-enhanced Partitioned Neural Gaussians

Cong Wang,Xianda Guo,Wenbo Xu,Wei Tian,Ruiqi Song,Chenming Zhang,Lingxi Li,Long Chen

Main category: cs.CV

TL;DR: DriveSplat improves 3D scene reconstruction for driving scenarios by combining dynamic-static decoupling, region-wise initialization, and deformable Gaussians, achieving high-quality results on major datasets.

Details Motivation: Existing 3D Gaussian Splatting methods have limited robustness and geometric accuracy in driving scenarios due to inadequate background optimization and reliance on view-specific Gaussian fitting. Method: DriveSplat uses neural Gaussian representations with dynamic-static decoupling, a region-wise voxel initialization scheme, and deformable neural Gaussians adjusted by a deformation network, supervised by depth and normal priors. Result: DriveSplat achieves state-of-the-art performance in novel-view synthesis on the Waymo and KITTI datasets, demonstrating improved geometric representation and rendering quality. Conclusion: DriveSplat provides a more robust and accurate method for 3D scene reconstruction in driving scenarios by addressing the limitations of existing 3D Gaussian Splatting techniques. Abstract: In the realm of driving scenarios, the presence of rapidly moving vehicles, pedestrians in motion, and large-scale static backgrounds poses significant challenges for 3D scene reconstruction. Recent methods based on 3D Gaussian Splatting address the motion blur problem by decoupling dynamic and static components within the scene. However, these decoupling strategies overlook background optimization with adequate geometry relationships and rely solely on fitting each training view by adding Gaussians. Therefore, these models exhibit limited robustness in rendering novel views and lack an accurate geometric representation. To address the above issues, we introduce DriveSplat, a high-quality reconstruction method for driving scenarios based on neural Gaussian representations with dynamic-static decoupling. To better accommodate the predominantly linear motion patterns of driving viewpoints, a region-wise voxel initialization scheme is employed, which partitions the scene into near, middle, and far regions to enhance close-range detail representation. 
Deformable neural Gaussians are introduced to model non-rigid dynamic actors, whose parameters are temporally adjusted by a learnable deformation network. The entire framework is further supervised by depth and normal priors from pre-trained models, improving the accuracy of geometric structures. Our method has been rigorously evaluated on the Waymo and KITTI datasets, demonstrating state-of-the-art performance in novel-view synthesis for driving scenarios.

[89] DIO: Refining Mutual Information and Causal Chain to Enhance Machine Abstract Reasoning Ability

Ruizhuo Song,Beiming Yuan

Main category: cs.CV

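TL;DR: By analyzing the causal chain in Raven's Progressive Matrices tasks, this paper identifies a bottleneck of deep learning models in abstract reasoning, and designs a baseline model, DIO, together with three improvements to strengthen the abstract reasoning ability of machine intelligence.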

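Details Motivation: Current deep learning models perform well across many domains but have a fundamental bottleneck in abstract reasoning. To address this, the academic community introduced Raven's Progressive Matrices (RPM) problems as an authoritative benchmark for evaluating the abstract reasoning ability of deep learning algorithms. This paper aims to solve RPM problems to help improve the abstract reasoning ability of machine intelligence. Method: The paper adopts a "causal chain modeling" perspective to analyze the complete causal chain in RPM tasks and designs the network architecture of the baseline model DIO on that basis. It then analyzes the limitations of DIO's optimization objective and proposes three improvements to strengthen the model's abstract reasoning. Result: Experiments show that DIO's optimization objective (maximizing the variational lower bound of the mutual information between the context and the correct option) does not enable the model to genuinely acquire the predefined human reasoning logic. This is mainly because the tightness of the lower bound affects how effective the maximization is, and because mutual information, as a statistical measure, cannot capture the causal relationship between subjects and objects. Conclusion: By proposing three improvements, the paper progressively addresses the abstract reasoning shortcomings of deep learning models on RPM tasks, advancing the abstract reasoning ability of machine intelligence. Abstract: Despite the outstanding performance of current deep learning models across various domains, their fundamental bottleneck in abstract reasoning remains unresolved. To address this challenge, the academic community has introduced Raven's Progressive Matrices (RPM) problems as an authoritative benchmark for evaluating the abstract reasoning capabilities of deep learning algorithms, with a focus on core intelligence dimensions such as abstract reasoning, pattern recognition, and complex problem-solving. Therefore, this paper centers on solving RPM problems, aiming to contribute to enhancing the abstract reasoning abilities of machine intelligence. First, this paper adopts a "causal chain modeling" perspective to systematically analyze the complete causal chain in RPM tasks: image $\rightarrow$ abstract attributes $\rightarrow$ progressive attribute patterns $\rightarrow$ pattern consistency $\rightarrow$ correct answer. Based on this analysis, the network architecture of the baseline model DIO is designed. However, experiments reveal that the optimization objective formulated for DIO, namely maximizing the variational lower bound of mutual information between the context and the correct option, fails to enable the model to genuinely acquire the predefined human reasoning logic. 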
This is attributed to two main reasons: the tightness of the lower bound significantly impacts the effectiveness of mutual information maximization, and mutual information, as a statistical measure, does not capture the causal relationship between subjects and objects. To overcome these limitations, this paper progressively proposes three improvement methods:

[90] Spiking Variational Graph Representation Inference for Video Summarization

Wenrui Li,Wei Han,Liang-Jian Deng,Ruiqin Xiong,Xiaopeng Fan

Main category: cs.CV

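TL;DR: This paper proposes SpiVG, an efficient video summarization method whose network design addresses the weaknesses of existing methods in temporal dependency modeling, semantic coherence, and noise during feature fusion, and validates its superior performance on multiple datasets.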

Details Motivation: Existing video summarization methods struggle to capture global temporal dependencies and to maintain the semantic coherence of video content, and they are susceptible to noise during multi-channel feature fusion. Method: The paper proposes a network named SpiVG, comprising a keyframe extractor based on Spiking Neural Networks, a dynamic aggregation graph reasoning module, and a variational inference reconstruction module. Result: Experiments show that SpiVG outperforms existing methods on multiple datasets, including SumMe, TVSum, VideoXum, and QFVS. Conclusion: The SpiVG Network not only surpasses existing methods on multiple datasets, but also addresses global temporal dependencies, semantic coherence, and noise in multi-channel feature fusion through autonomous keyframe feature learning, dynamic aggregation graph reasoning, and variational inference reconstruction. Abstract: With the rise of short video content, efficient video summarization techniques for extracting key information have become crucial. However, existing methods struggle to capture the global temporal dependencies and maintain the semantic coherence of video content. Additionally, these methods are also influenced by noise during multi-channel feature fusion. We propose a Spiking Variational Graph (SpiVG) Network, which enhances information density and reduces computational complexity. First, we design a keyframe extractor based on Spiking Neural Networks (SNN), leveraging the event-driven computation mechanism of SNNs to learn keyframe features autonomously. To enable fine-grained and adaptable reasoning across video frames, we introduce a Dynamic Aggregation Graph Reasoner, which decouples contextual object consistency from semantic perspective coherence. We present a Variational Inference Reconstruction Module to address uncertainty and noise arising during multi-channel feature fusion. In this module, we employ Evidence Lower Bound Optimization (ELBO) to capture the latent structure of multi-channel feature distributions, using posterior distribution regularization to reduce overfitting. Experimental results show that SpiVG surpasses existing methods across multiple datasets such as SumMe, TVSum, VideoXum, and QFVS. Our code and pre-trained models are available at https://github.com/liwrui/SpiVG.

[91] From Linearity to Non-Linearity: How Masked Autoencoders Capture Spatial Correlations

Anthony Bisulco,Rahul Ramesh,Randall Balestriero,Pratik Chaudhari

Main category: cs.CV

TL;DR: This study explores how Masked Autoencoders (MAEs) learn spatial correlations in images, showing that hyperparameters like masking ratio and patch size can influence feature selection, providing insights for their practical application.

Details Motivation: Despite the effectiveness of MAEs, they require extensive hyperparameter tuning when applied to novel datasets. The connection between MAE hyperparameters and performance on downstream tasks is relatively unexplored. Method: Analytical derivation of the features learned by a linear MAE extended to non-linear MAEs. Result: The study shows that masking ratio and patch size can be used to select for features that capture short- and long-range spatial correlations. MAE representations adapt to spatial correlations in the dataset, beyond second-order statistics. Conclusion: MAE representations adapt to spatial correlations in the dataset, beyond second-order statistics. Abstract: Masked Autoencoders (MAEs) have emerged as a powerful pretraining technique for vision foundation models. Despite their effectiveness, they require extensive hyperparameter tuning (masking ratio, patch size, encoder/decoder layers) when applied to novel datasets. While prior theoretical works have analyzed MAEs in terms of their attention patterns and hierarchical latent variable models, the connection between MAE hyperparameters and performance on downstream tasks is relatively unexplored. This work investigates how MAEs learn spatial correlations in the input image. We analytically derive the features learned by a linear MAE and show that masking ratio and patch size can be used to select for features that capture short- and long-range spatial correlations. We extend this analysis to non-linear MAEs to show that MAE representations adapt to spatial correlations in the dataset, beyond second-order statistics. Finally, we discuss some insights on how to select MAE hyper-parameters in practice.
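
To make the two hyperparameters discussed above concrete, here is a minimal sketch of random patch masking. It is not the authors' code: the function name `random_patch_mask`, the 224-pixel image, the 16-pixel patch, and the 0.75 masking ratio are illustrative assumptions, chosen only to show how patch size sets the number of patches and how masking ratio sets how many of them the encoder sees.

```python
import random


def random_patch_mask(image_size, patch_size, mask_ratio, seed=0):
    """Return (visible, masked) patch indices for a square image.

    Hypothetical helper for illustration; real MAE code operates on tensors.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    num_masked = round(num_patches * mask_ratio)
    rng = random.Random(seed)
    # Sample masked patch indices uniformly without replacement.
    masked = sorted(rng.sample(range(num_patches), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_patches) if i not in masked_set]
    return visible, masked


# Assumed settings: 224x224 image, 16x16 patches, 75% of patches masked.
visible, masked = random_patch_mask(image_size=224, patch_size=16, mask_ratio=0.75)
print(len(visible), len(masked))  # 49 147
```

Larger patches reduce the patch count quadratically, and a higher masking ratio leaves fewer visible patches to reconstruct from, which is the lever the abstract ties to short- versus long-range correlations.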

[92] Bidirectional Temporal Information Propagation for Moving Infrared Small Target Detection

Dengyan Luo,Yanping Xiang,Hu Wang,Luping Ji,Shuai Li,Mao Ye

Main category: cs.CV

TL;DR: BIRD improves moving infrared small target detection by leveraging bidirectional temporal information for better performance and efficiency.

Details Motivation: Existing sliding-window-based methods for infrared small target detection are sub-optimal due to redundant computation and neglect of global temporal information outside the window. BIRD is proposed to address these limitations through joint optimization and bidirectional temporal aggregation. Method: BIRD employs a bidirectional temporal information propagation strategy with Local Temporal Motion Fusion (LTMF) and Global Temporal Motion Fusion (GTMF) modules to aggregate local and global temporal information across video frames. Joint optimization is achieved through detection loss and Spatio-Temporal Fusion (STF) loss. Result: The experiments show that BIRD outperforms existing methods in performance and offers faster inference speed. Conclusion: The BIRD method achieves state-of-the-art performance in moving infrared small target detection with fast inference speed. Abstract: Moving infrared small target detection is broadly adopted in infrared search and track systems, and has attracted considerable research focus in recent years. The existing learning-based multi-frame methods mainly aggregate the information of adjacent frames in a sliding window fashion to assist the detection of the current frame. However, the sliding-window-based methods do not consider joint optimization of the entire video clip and ignore the global temporal information outside the sliding window, resulting in redundant computation and sub-optimal performance. In this paper, we propose a Bidirectional temporal information propagation method for moving InfraRed small target Detection, dubbed BIRD. The bidirectional propagation strategy simultaneously utilizes local temporal information of adjacent frames and global temporal information of past and future frames in a recursive fashion. 
Specifically, in the forward and backward propagation branches, we first design a Local Temporal Motion Fusion (LTMF) module to model local spatio-temporal dependency between a target frame and its two adjacent frames. Then, a Global Temporal Motion Fusion (GTMF) module is developed to further aggregate the global propagation feature with the local fusion feature. Finally, the bidirectional aggregated features are fused and input into the detection head for detection. In addition, the entire video clip is jointly optimized by the traditional detection loss and the additional Spatio-Temporal Fusion (STF) loss. Extensive experiments demonstrate that the proposed BIRD method not only achieves the state-of-the-art performance but also shows a fast inference speed.

[93] A Curated Dataset and Deep Learning Approach for Minor Dent Detection in Vehicles

Danish Zia Baig,Mohsin Kamal

Main category: cs.CV

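TL;DR: This paper proposes a deep learning-based solution that uses the YOLOv8 framework to automatically detect microscopic surface flaws on car exteriors, particularly tiny dents.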

Details Motivation: Traditional car damage inspection is manual, time-consuming, and often unreliable at detecting tiny flaws, so faster and more precise inspection methods are needed. Method: The YOLOv8m model and its customized variants, YOLOv8m-t4 and YOLOv8m-t42, were trained using the YOLOv8 object recognition framework with real-time data augmentation, on a custom dataset of annotated photos of car surfaces under various lighting conditions, angles, and textures. Result: Experiments show that the method has excellent detection accuracy and low inference latency, making it suitable for real-time applications. The YOLOv8m-t42 model reached a precision of 0.86, recall of 0.84, F1-score of 0.85, and PR curve area of 0.88. Conclusion: YOLOv8m-t42 outperforms YOLOv8m-t4 at identifying microscopic surface defects, with higher precision and better suitability for practical dent detection, although it converges more slowly. Abstract: Conventional car damage inspection techniques are labor-intensive, manual, and frequently overlook tiny surface imperfections like microscopic dents. Machine learning provides an innovative solution to the increasing demand for quicker and more precise inspection methods. The paper uses the YOLOv8 object recognition framework to provide a deep learning-based solution for automatically detecting microscopic surface flaws, notably tiny dents, on car exteriors. Traditional automotive damage inspection procedures are manual, time-consuming, and frequently unreliable at detecting tiny flaws. To solve this, a bespoke dataset containing annotated photos of car surfaces under various lighting conditions, angles, and textures was created. To improve robustness, the YOLOv8m model and its customized variants, YOLOv8m-t4 and YOLOv8m-t42, were trained employing real-time data augmentation approaches. Experimental results show that the technique has excellent detection accuracy and low inference latency, making it suited for real-time applications such as automated insurance evaluations and automobile inspections. Evaluation parameters such as mean Average Precision (mAP), precision, recall, and F1-score verified the model's efficacy. With a precision of 0.86, recall of 0.84, and F1-score of 0.85, the YOLOv8m-t42 model outperformed the YOLOv8m-t4 model (precision: 0.81, recall: 0.79, F1-score: 0.80) in identifying microscopic surface defects. The mAP@0.5 for YOLOv8m-t42 stabilized at 0.60, with a slightly reduced mAP@0.5:0.95 of 0.20. 
Furthermore, YOLOv8m-t42's PR curve area was 0.88, suggesting more consistent performance than YOLOv8m-t4 (0.82). YOLOv8m-t42 has greater accuracy and is more appropriate for practical dent detection applications, even though its convergence is slower.
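
The reported F1-scores can be checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below does that arithmetic; the function name `f1_score` is an illustrative choice, and only the precision and recall values come from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f1_t42 = f1_score(0.86, 0.84)  # YOLOv8m-t42
f1_t4 = f1_score(0.81, 0.79)   # YOLOv8m-t4

print(f"{f1_t42:.2f} {f1_t4:.2f}")  # 0.85 0.80
```

Both values round to the F1-scores stated in the abstract, so the three reported metrics are mutually consistent.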

[94] Aligning Moments in Time using Video Queries

Yogesh Kumar,Uday Agarwal,Manish Gupta,Anand Mishra

Main category: cs.CV

TL;DR: The paper proposes MATR, a transformer-based model for video-to-video moment retrieval, which significantly outperforms existing methods on two datasets.

Details Motivation: The motivation is to address the challenges in video-to-video moment retrieval, such as semantic frame-level alignment and modeling complex dependencies between query and target videos. Method: The paper introduces MATR, which uses a dual-stage sequence alignment to condition target video representations on query video features, guiding foreground/background classification and boundary prediction heads. It also employs a self-supervised pre-training technique. Result: Experiments show that MATR improves performance by 13.1% in R@1 and 8.1% in mIoU on the ActivityNet-VRL dataset, and by 14.7% in R@1 and 14.4% in mIoU on the newly proposed SportsMoments dataset. Conclusion: The paper concludes that MATR, a transformer-based model for video-to-video moment retrieval, achieves significant performance improvements on two datasets compared to state-of-the-art methods. Abstract: Video-to-video moment retrieval (Vid2VidMR) is the task of localizing unseen events or moments in a target video using a query video. This task poses several challenges, such as the need for semantic frame-level alignment and modeling complex dependencies between query and target videos. To tackle this challenging problem, we introduce MATR (Moment Alignment TRansformer), a transformer-based model designed to capture semantic context as well as the temporal details necessary for precise moment localization. MATR conditions target video representations on query video features using dual-stage sequence alignment that encodes the required correlations and dependencies. These representations are then used to guide foreground/background classification and boundary prediction heads, enabling the model to accurately identify moments in the target video that semantically match with the query video. 
Additionally, to provide a strong task-specific initialization for MATR, we propose a self-supervised pre-training technique that involves training the model to localize random clips within videos. Extensive experiments demonstrate that MATR achieves notable performance improvements of 13.1% in R@1 and 8.1% in mIoU on an absolute scale compared to state-of-the-art methods on the popular ActivityNet-VRL dataset. Additionally, on our newly proposed dataset, SportsMoments, MATR shows a 14.7% gain in R@1 and a 14.4% gain in mIoU on an absolute scale over strong baselines.
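
For readers unfamiliar with the two metrics quoted above, the following sketch shows how mIoU and R@1 are commonly computed for temporal localization. It is an illustration under assumptions, not the paper's evaluation code: the 0.5 IoU threshold for R@1, the function names, and the toy segments are all hypothetical.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments, in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def evaluate(top1_preds, gts, iou_threshold=0.5):
    """Return (R@1, mIoU) given one top-ranked prediction per query."""
    ious = [temporal_iou(p, g) for p, g in zip(top1_preds, gts)]
    miou = sum(ious) / len(ious)
    # R@1: fraction of queries whose top prediction clears the threshold.
    r_at_1 = sum(iou >= iou_threshold for iou in ious) / len(ious)
    return r_at_1, miou


# Toy example with made-up segments.
preds = [(2.0, 8.0), (0.0, 4.0)]
gts = [(4.0, 10.0), (0.0, 5.0)]
r_at_1, miou = evaluate(preds, gts)
print(r_at_1, round(miou, 2))  # 1.0 0.65
```

Because both metrics are fractions, a gain "on an absolute scale" means percentage points added to the score, not a relative improvement.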

[95] Enhancing Novel View Synthesis from extremely sparse views with SfM-free 3D Gaussian Splatting Framework

Zongqi He,Hanmin Li,Kin-Chung Chan,Yushen Zuo,Hao Xie,Zhe Xiao,Jun Xiao,Kin-Man Lam

Main category: cs.CV

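TL;DR: This study proposes a new 3DGS method that reconstructs high-quality 3D scenes from extremely sparse-view inputs and significantly outperforms existing methods.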

Details Motivation: 3D Gaussian Splatting (3DGS) relies on dense multi-view inputs and known camera poses, which are often unavailable in real-world scenes, so an SfM-free method is needed to handle sparse-view inputs. Method: A dense stereo module estimates camera poses and reconstructs a dense point cloud for initialization, a coherent view interpolation module supplies additional supervision, and multi-scale Laplacian consistent regularization and adaptive spatial-aware multi-scale geometry regularization are introduced. Result: Experiments show a 2.75 dB PSNR improvement under extremely sparse-view conditions (using only 2 training views), and the synthesized images show less distortion and rich high-frequency details. Conclusion: The paper proposes an SfM-free 3DGS method that jointly estimates camera poses and reconstructs 3D scenes from extremely sparse-view inputs, significantly outperforming existing 3DGS methods. Abstract: 3D Gaussian Splatting (3DGS) has demonstrated remarkable real-time performance in novel view synthesis, yet its effectiveness relies heavily on dense multi-view inputs with precisely known camera poses, which are rarely available in real-world scenarios. When input views become extremely sparse, the Structure-from-Motion (SfM) method that 3DGS depends on for initialization fails to accurately reconstruct the 3D geometric structures of scenes, resulting in degraded rendering quality. In this paper, we propose a novel SfM-free 3DGS-based method that jointly estimates camera poses and reconstructs 3D scenes from extremely sparse-view inputs. Specifically, instead of SfM, we propose a dense stereo module that progressively estimates camera pose information and reconstructs a global dense point cloud for initialization. To address the inherent problem of information scarcity in extremely sparse-view settings, we propose a coherent view interpolation module that interpolates camera poses based on training view pairs and generates viewpoint-consistent content as additional supervision signals for training. Furthermore, we introduce multi-scale Laplacian consistent regularization and adaptive spatial-aware multi-scale geometry regularization to enhance the quality of geometrical structures and rendered content. Experiments show that our method significantly outperforms other state-of-the-art 3DGS-based approaches, achieving a remarkable 2.75 dB improvement in PSNR under extremely sparse-view conditions (using only 2 training views). 
The images synthesized by our method exhibit minimal distortion while preserving rich high-frequency details, resulting in superior visual quality compared to existing techniques.
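
To give a sense of scale for a PSNR gain quoted in decibels, the sketch below converts a dB improvement into the corresponding reduction in mean squared error. It assumes the standard PSNR definition with pixel values normalized to [0, 1]; the function names and the example MSE of 0.01 are illustrative and not taken from the paper.

```python
import math


def psnr(mse: float, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE must shrink to gain `gain_db` dB of PSNR."""
    return 10.0 ** (gain_db / 10.0)


ratio = mse_ratio_for_gain(2.75)
print(f"{ratio:.2f}")  # 1.88

# Sanity check with an assumed baseline MSE of 0.01 (PSNR of 20 dB).
baseline = psnr(0.01)
improved = psnr(0.01 / ratio)
print(f"{improved - baseline:.2f}")  # 2.75
```

Under these assumptions, a 2.75 dB gain corresponds to cutting the mean squared error by a factor of about 1.88.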

[96] LGMSNet: Thinning a medical image segmentation model via dual-level multiscale fusion

Chengqi Dong,Fenghe Tang,Rongge Mao,Xinpei Gao,S. Kevin Zhou

Main category: cs.CV

TL;DR: LGMSNet is a novel lightweight framework for medical image segmentation that achieves high performance and strong generalization while maintaining low computational cost.

Details Motivation: Existing lightweight models for medical image segmentation often sacrifice performance for efficiency and lack global contextual perception due to the absence of attention mechanisms. Additionally, channel redundancy under identical convolutional kernels hinders effective feature extraction, necessitating a more efficient and generalizable solution. Method: LGMSNet uses a dual multiscale architecture with heterogeneous intra-layer kernels for local feature extraction and sparse transformer-convolutional hybrid branches for capturing global context. It addresses channel redundancy and computational inefficiency in existing lightweight models. Result: Extensive experiments on six public datasets show that LGMSNet outperforms state-of-the-art methods. It achieves exceptional zero-shot generalization on four unseen datasets, proving its effectiveness and potential for real-world deployment. Conclusion: LGMSNet is a lightweight and efficient framework for medical image segmentation that demonstrates superior performance and generalization capabilities, making it suitable for resource-limited clinical environments. Abstract: Medical image segmentation plays a pivotal role in disease diagnosis and treatment planning, particularly in resource-constrained clinical settings where lightweight and generalizable models are urgently needed. However, existing lightweight models often compromise performance for efficiency and rarely adopt computationally expensive attention mechanisms, severely restricting their global contextual perception capabilities. Additionally, current architectures neglect the channel redundancy issue under the same convolutional kernels in medical imaging, which hinders effective feature extraction. To address these challenges, we propose LGMSNet, a novel lightweight framework based on local and global dual multiscale that achieves state-of-the-art performance with minimal computational overhead. 
LGMSNet employs heterogeneous intra-layer kernels to extract local high-frequency information while mitigating channel redundancy. In addition, the model integrates sparse transformer-convolutional hybrid branches to capture low-frequency global information. Extensive experiments across six public datasets demonstrate LGMSNet's superiority over existing state-of-the-art methods. In particular, LGMSNet maintains exceptional performance in zero-shot generalization tests on four unseen datasets, underscoring its potential for real-world deployment in resource-limited medical scenarios. The full project code is available at https://github.com/cq-dong/LGMSNet.

[97] MExECON: Multi-view Extended Explicit Clothed humans Optimized via Normal integration

Fulden Ece Uğur,Rafael Redondo,Albert Barreiro,Stefan Hristov,Roger Marí

Main category: cs.CV

TL;DR: MExECON is a new pipeline for 3D reconstruction of clothed humans using multiple RGB images, combining multi-view consistency and normal map integration to improve accuracy without re-training.

Details Motivation: The motivation is to enhance the geometry and body pose estimation of clothed humans by leveraging multiple viewpoints, which is not possible with single-view methods. Method: MExECON builds upon the single-view method ECON and introduces the Joint Multi-view Body Optimization (JMBO) algorithm to fit a SMPL-X body model across all input views, ensuring multi-view consistency. It integrates normal maps from multiple views for detailed surface reconstruction. Result: Experimental results show that MExECON improves reconstruction fidelity over the single-view baseline and performs competitively compared to modern few-shot 3D reconstruction methods. Conclusion: MExECON is a new and effective pipeline for 3D reconstruction of clothed human avatars from sparse multi-view RGB images, achieving improved fidelity without requiring network re-training. Abstract: This work presents MExECON, a novel pipeline for 3D reconstruction of clothed human avatars from sparse multi-view RGB images. Building on the single-view method ECON, MExECON extends its capabilities to leverage multiple viewpoints, improving geometry and body pose estimation. At the core of the pipeline is the proposed Joint Multi-view Body Optimization (JMBO) algorithm, which fits a single SMPL-X body model jointly across all input views, enforcing multi-view consistency. The optimized body model serves as a low-frequency prior that guides the subsequent surface reconstruction, where geometric details are added via normal map integration. MExECON integrates normal maps from both front and back views to accurately capture fine-grained surface details such as clothing folds and hairstyles. All multi-view gains are achieved without requiring any network re-training. Experimental results show that MExECON consistently improves fidelity over the single-view baseline and achieves competitive performance compared to modern few-shot 3D reconstruction methods.

[98] Task-Generalized Adaptive Cross-Domain Learning for Multimodal Image Fusion

Mengyu Wang,Zhenyu Liu,Kun Li,Yu Wang,Yuwei Wang,Yanyan Wei,Fei Wang

Main category: cs.CV

TL;DR: AdaSFFuse is a novel framework for task-generalized Multimodal Image Fusion that overcomes modality misalignment and frequency detail loss through Adaptive Approximate Wavelet Transform and Spatial-Frequency Mamba Blocks, ensuring efficient performance.

Details Motivation: Multimodal Image Fusion (MMIF) aims to integrate complementary information from different imaging modalities to enhance image quality and support downstream applications, but current methods face challenges like modality misalignment, high-frequency detail destruction, and task-specific limitations. Method: AdaSFFuse incorporates two innovations: Adaptive Approximate Wavelet Transform (AdaWAT) for frequency decoupling and Spatial-Frequency Mamba Blocks for efficient cross-domain fusion in both spatial and frequency domains. Result: AdaSFFuse addresses challenges in MMIF by improving alignment and integration of multimodal features, reducing frequency loss, and preserving critical details, as demonstrated across four MMIF tasks: IVF, MFF, MEF, and MIF. Conclusion: AdaSFFuse provides superior fusion performance with low computational cost and a compact network, achieving a strong balance between performance and efficiency. Abstract: Multimodal Image Fusion (MMIF) aims to integrate complementary information from different imaging modalities to overcome the limitations of individual sensors. It enhances image quality and facilitates downstream applications such as remote sensing, medical diagnostics, and robotics. Despite significant advancements, current MMIF methods still face challenges such as modality misalignment, high-frequency detail destruction, and task-specific limitations. To address these challenges, we propose AdaSFFuse, a novel framework for task-generalized MMIF through adaptive cross-domain co-fusion learning. AdaSFFuse introduces two key innovations: the Adaptive Approximate Wavelet Transform (AdaWAT) for frequency decoupling, and the Spatial-Frequency Mamba Blocks for efficient multimodal fusion. AdaWAT adaptively separates the high- and low-frequency components of multimodal images from different scenes, enabling fine-grained extraction and alignment of distinct frequency characteristics for each modality. 
The Spatial-Frequency Mamba Blocks facilitate cross-domain fusion in both spatial and frequency domains, enhancing this process. These blocks dynamically adjust through learnable mappings to ensure robust fusion across diverse modalities. By combining these components, AdaSFFuse improves the alignment and integration of multimodal features, reduces frequency loss, and preserves critical details. Extensive experiments on four MMIF tasks -- Infrared-Visible Image Fusion (IVF), Multi-Focus Image Fusion (MFF), Multi-Exposure Image Fusion (MEF), and Medical Image Fusion (MIF) -- demonstrate AdaSFFuse's superior fusion performance, ensuring both low computational cost and a compact network, offering a strong balance between performance and efficiency. The code will be publicly available at https://github.com/Zhen-yu-Liu/AdaSFFuse.

[99] ExtraGS: Geometric-Aware Trajectory Extrapolation with Uncertainty-Guided Generative Priors

Kaiyuan Tan,Yingying Shen,Haohui Zhu,Zhiwei Zhan,Shan Zhao,Mingfei Tu,Hongcheng Luo,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye

Main category: cs.CV

TL;DR: This paper proposes ExtraGS, a new framework for trajectory extrapolation in autonomous driving simulations, combining geometric and generative priors to enhance realism and consistency.

Details Motivation: Synthesizing extrapolated views from driving logs is crucial for autonomous driving simulations, but existing methods suffer from poor geometric consistency and over-smoothed renderings. Method: The paper proposes ExtraGS, which uses a Road Surface Gaussian (RSG) representation based on a hybrid Gaussian-SDF design and Far Field Gaussians (FFG) with learnable scaling factors to handle distant objects. It also introduces a self-supervised uncertainty estimation framework using spherical harmonics to selectively integrate generative priors. Result: Extensive experiments show that ExtraGS significantly improves realism and geometric consistency of extrapolated views across multiple datasets, multi-camera setups, and various generative priors, while maintaining high fidelity along the original trajectory. Conclusion: ExtraGS provides a holistic framework for trajectory extrapolation that effectively combines geometric and generative priors, significantly enhancing the realism and geometric consistency of extrapolated views while preserving high fidelity. Abstract: Synthesizing extrapolated views from recorded driving logs is critical for simulating driving scenes for autonomous driving vehicles, yet it remains a challenging task. Recent methods leverage generative priors as pseudo ground truth, but often lead to poor geometric consistency and over-smoothed renderings. To address these limitations, we propose ExtraGS, a holistic framework for trajectory extrapolation that integrates both geometric and generative priors. At the core of ExtraGS is a novel Road Surface Gaussian (RSG) representation based on a hybrid Gaussian-Signed Distance Function (SDF) design, and Far Field Gaussians (FFG) that use learnable scaling factors to efficiently handle distant objects. 
Furthermore, we develop a self-supervised uncertainty estimation framework based on spherical harmonics that enables selective integration of generative priors only where extrapolation artifacts occur. Extensive experiments on multiple datasets, diverse multi-camera setups, and various generative priors demonstrate that ExtraGS significantly enhances the realism and geometric consistency of extrapolated views, while preserving high fidelity along the original trajectory.

[100] Multi-Object Sketch Animation with Grouping and Motion Trajectory Priors

Guotao Liang,Juncheng Hu,Ximing Xing,Jing Zhang,Qian Yu

Main category: cs.CV

TL;DR: GroupSketch introduces a two-stage pipeline for vector sketch animation that effectively handles multi-object interactions and complex motions, leading to high-quality, temporally consistent animations and expanding the practical applications of sketch animation.

Details Motivation: Existing approaches struggle with multi-object interactions and complex motions, either being limited to single-object cases or suffering from temporal inconsistency and poor generalization. GroupSketch was developed to overcome these limitations and improve the quality and applicability of sketch animation. Method: GroupSketch adopts a two-stage pipeline: Motion Initialization, where the input sketch is divided into semantic groups and key frames are defined to generate a coarse animation via interpolation, and Motion Refinement, where a Group-based Displacement Network (GDN) refines the animation by predicting group-specific displacement fields using priors from a text-to-video model. GDN also includes specialized modules like Context-conditioned Feature Enhancement (CCFE) to improve temporal consistency. Result: Extensive experiments demonstrate that GroupSketch significantly outperforms existing methods in generating high-quality, temporally consistent animations for complex, multi-object sketches. Conclusion: GroupSketch is a novel method for vector sketch animation that effectively handles multi-object interactions and complex motions, significantly outperforming existing methods in generating high-quality, temporally consistent animations for complex, multi-object sketches. Abstract: We introduce GroupSketch, a novel method for vector sketch animation that effectively handles multi-object interactions and complex motions. Existing approaches struggle with these scenarios, either being limited to single-object cases or suffering from temporal inconsistency and poor generalization. To address these limitations, our method adopts a two-stage pipeline comprising Motion Initialization and Motion Refinement. In the first stage, the input sketch is interactively divided into semantic groups and key frames are defined, enabling the generation of a coarse animation via interpolation. 
In the second stage, we propose a Group-based Displacement Network (GDN), which refines the coarse animation by predicting group-specific displacement fields, leveraging priors from a text-to-video model. GDN further incorporates specialized modules, such as Context-conditioned Feature Enhancement (CCFE), to improve temporal consistency. Extensive experiments demonstrate that our approach significantly outperforms existing methods in generating high-quality, temporally consistent animations for complex, multi-object sketches, thus expanding the practical applications of sketch animation.

[101] D3FNet: A Differential Attention Fusion Network for Fine-Grained Road Structure Extraction in Remote Perception Systems

Chang Liu,Yang Xu,Tamas Sziranyi

Main category: cs.CV

TL;DR: D3FNet addresses the challenge of extracting narrow roads from high-resolution remote sensing imagery by using a novel network structure that enhances road features and suppresses background noise, leading to improved performance on challenging road regions.

Details Motivation: Extracting narrow roads from high-resolution remote sensing imagery is challenging due to their limited width, fragmented topology, and frequent occlusions. Method: D3FNet, a Dilated Dual-Stream Differential Attention Fusion Network, was developed with three key innovations: a Differential Attention Dilation Extraction module, a Dual-stream Decoding Fusion Mechanism, and a multi-scale dilation strategy. Result: Extensive experiments on the DeepGlobe and CHN6-CUG benchmarks showed that D3FNet achieves superior IoU and recall on challenging road regions, outperforming state-of-the-art baselines. Conclusion: D3FNet is a robust solution for fine-grained narrow road extraction in complex remote and cooperative perception scenarios. Abstract: Extracting narrow roads from high-resolution remote sensing imagery remains a significant challenge due to their limited width, fragmented topology, and frequent occlusions. To address these issues, we propose D3FNet, a Dilated Dual-Stream Differential Attention Fusion Network designed for fine-grained road structure segmentation in remote perception systems. Built upon the encoder-decoder backbone of D-LinkNet, D3FNet introduces three key innovations:(1) a Differential Attention Dilation Extraction (DADE) module that enhances subtle road features while suppressing background noise at the bottleneck; (2) a Dual-stream Decoding Fusion Mechanism (DDFM) that integrates original and attention-modulated features to balance spatial precision with semantic context; and (3) a multi-scale dilation strategy (rates 1, 3, 5, 9) that mitigates gridding artifacts and improves continuity in narrow road prediction. Unlike conventional models that overfit to generic road widths, D3FNet specifically targets fine-grained, occluded, and low-contrast road segments. 
Extensive experiments on the DeepGlobe and CHN6-CUG benchmarks show that D3FNet achieves superior IoU and recall on challenging road regions, outperforming state-of-the-art baselines. Ablation studies further verify the complementary synergy of attention-guided encoding and dual-path decoding. These results confirm D3FNet as a robust solution for fine-grained narrow road extraction in complex remote and cooperative perception scenarios.
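As a rough illustration of why mixed dilation rates such as (1, 3, 5, 9) can help against gridding, the sketch below assumes 3-tap kernels applied one after another along a single axis. The abstract does not say how the dilated convolutions are arranged, so the cascade, the kernel size, and the function names are all assumptions; the rates [2, 4, 8] are a hypothetical counterexample, not taken from the paper.

```python
def stacked_receptive_field(kernel_size, rates):
    """Receptive field of stride-1 dilated convolutions applied in sequence."""
    field = 1
    for rate in rates:
        field += (kernel_size - 1) * rate
    return field


def reachable_offsets(rates):
    """Input offsets (1D) that can influence one output position.

    Assumes a 3-tap kernel per layer, i.e. taps at -rate, 0 and +rate.
    """
    offsets = {0}
    for rate in rates:
        offsets = {o + step * rate for o in offsets for step in (-1, 0, 1)}
    return offsets


paper_rates = [1, 3, 5, 9]   # rates quoted in the abstract
gridded_rates = [2, 4, 8]    # hypothetical rates sharing a common factor

print(stacked_receptive_field(3, paper_rates))   # 1 + 2 * (1 + 3 + 5 + 9) = 37
print(len(reachable_offsets(paper_rates)))       # 37: every offset in [-18, 18]
print(len(reachable_offsets(gridded_rates)))     # 15: only even offsets in [-14, 14]
```

Under these assumptions the rates (1, 3, 5, 9) cover the receptive field without holes, whereas rates with a common factor skip every other input position, which is the checkerboard pattern usually called gridding.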

[102] Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment

Youjia Zhang,Youngeun Kim,Young-Geun Choi,Hongyeob Kim,Huiling Liu,Sungeun Hong

Main category: cs.CV

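TL;DR: This paper proposes ADAPT, a new test-time adaptation method that uses Gaussian probabilistic inference and a training-free inference strategy to address the limitations of existing methods, achieving state-of-the-art performance under a wide range of distribution shifts.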

Details Motivation: Existing test-time adaptation methods rely on backpropagation or iterative optimization and lack explicit modeling of class-conditional feature distributions. Method: TTA is reframed as a Gaussian probabilistic inference task: class-conditional likelihoods are modeled with gradually updated class means and a shared covariance matrix, which enables closed-form, training-free inference. Result: ADAPT performs strongly across multiple benchmarks, showing good adaptability and improved performance. Conclusion: ADAPT achieves state-of-the-art performance under a wide range of distribution shifts, with strong scalability and robustness. Abstract: Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
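The closed-form inference described above can be sketched with the textbook result for Gaussians that share one covariance matrix: the class posterior is a softmax over linear scores. The code below is a minimal pure-Python illustration of that general idea, not the ADAPT implementation; the running-average mean update, the function names, and the toy numbers are assumptions, and the CLIP-guided regularization and knowledge bank mentioned in the abstract are not modeled.

```python
import math


def solve(matrix, vector):
    """Solve matrix @ x = vector by Gauss-Jordan elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def class_posteriors(x, means, shared_cov, priors):
    """Posterior p(k | x) for Gaussians N(mu_k, Sigma) with a shared Sigma.

    Score of class k: mu_k^T Sigma^-1 x - 0.5 * mu_k^T Sigma^-1 mu_k + log(prior_k).
    """
    cov_inv_x = solve(shared_cov, x)
    logits = []
    for mu, prior in zip(means, priors):
        cov_inv_mu = solve(shared_cov, mu)
        linear = sum(m * v for m, v in zip(mu, cov_inv_x))
        bias = -0.5 * sum(m * v for m, v in zip(mu, cov_inv_mu)) + math.log(prior)
        logits.append(linear + bias)
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def update_mean(mean, count, x):
    """Running-average update of one class mean (an assumed update rule)."""
    new_count = count + 1
    return [m + (xi - m) / new_count for m, xi in zip(mean, x)], new_count


means = [[0.0, 0.0], [2.0, 0.0]]        # toy class means
shared_cov = [[1.0, 0.0], [0.0, 1.0]]   # toy shared covariance
print(class_posteriors([1.0, 0.0], means, shared_cov, [0.5, 0.5]))  # midpoint: equal posteriors
```

No gradient step appears anywhere: prediction is a linear solve plus a softmax, and adaptation is a mean update, which is what makes this family of methods backpropagation-free.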

[103] High-Frequency First: A Two-Stage Approach for Improving Image INR

Sumit Kumar Dam,Mrityunjoy Gain,Eui-Nam Huh,Choong Seon Hong

Main category: cs.CV

TL;DR: This paper proposes a two-stage training strategy for INRs that adaptively prioritizes high-frequency pixels, improving image reconstruction quality and offering a new solution to the spectral bias problem.

Details Motivation: To address the spectral bias of neural networks in Implicit Neural Representations (INRs), which struggle to capture high-frequency details like sharp edges and fine textures. Method: A two-stage training strategy using a neighbor-aware soft mask to prioritize pixels with strong local variations during training. Result: Experimental results show consistent improvements in reconstruction quality, demonstrating the effectiveness of the proposed method in mitigating spectral bias. Conclusion: The proposed two-stage training strategy effectively mitigates the spectral bias problem in INRs, improving image reconstruction quality and complementing existing methods. Abstract: Implicit Neural Representations (INRs) have emerged as a powerful alternative to traditional pixel-based formats by modeling images as continuous functions over spatial coordinates. A key challenge, however, lies in the spectral bias of neural networks, which tend to favor low-frequency components while struggling to capture high-frequency (HF) details such as sharp edges and fine textures. While prior approaches have addressed this limitation through architectural modifications or specialized activation functions, we propose an orthogonal direction by directly guiding the training process. Specifically, we introduce a two-stage training strategy where a neighbor-aware soft mask adaptively assigns higher weights to pixels with strong local variations, encouraging early focus on fine details. The model then transitions to full-image training. Experimental results show that our approach consistently improves reconstruction quality and complements existing INR methods. As a pioneering attempt to assign frequency-aware importance to pixels in image INR, our work offers a new avenue for mitigating the spectral bias problem.
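To make the two-stage idea concrete, the sketch below builds a soft per-pixel weight map from local intensity variation and uses it only during an initial stage before switching to uniform weights. The 4-neighbour variation measure, the `boost` factor, and the `switch_step` schedule are illustrative assumptions; the abstract does not specify the paper's mask definition or transition rule.

```python
def local_variation(image):
    """Mean absolute difference between each pixel and its 4-connected neighbours."""
    height, width = len(image), len(image[0])
    variation = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            diffs = [
                abs(image[r][c] - image[nr][nc])
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= nr < height and 0 <= nc < width
            ]
            variation[r][c] = sum(diffs) / len(diffs) if diffs else 0.0
    return variation


def soft_mask(image, boost=4.0):
    """Weights in [1, 1 + boost]; larger where local variation is larger."""
    variation = local_variation(image)
    peak = max(max(row) for row in variation)
    if peak == 0:
        return [[1.0] * len(row) for row in variation]
    return [[1.0 + boost * v / peak for v in row] for row in variation]


def stage_weights(image, step, switch_step, boost=4.0):
    """Stage 1 (step < switch_step): masked weights. Stage 2: uniform weights."""
    if step < switch_step:
        return soft_mask(image, boost)
    return [[1.0] * len(row) for row in image]


def weighted_mse(prediction, target, weights):
    """Weighted mean squared error, normalised by the total weight."""
    numerator = sum(
        w * (p - t) ** 2
        for p_row, t_row, w_row in zip(prediction, target, weights)
        for p, t, w in zip(p_row, t_row, w_row)
    )
    return numerator / sum(sum(row) for row in weights)


edge_image = [[0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]]   # vertical step edge between columns 1 and 2
print(soft_mask(edge_image)[0])        # flat pixels get weight 1.0, edge pixels 5.0
```

The mask is soft rather than binary, so smooth regions still contribute to the loss during the first stage; they are only down-weighted relative to edges and textures.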

[104] Fast globally optimal Truncated Least Squares point cloud registration with fixed rotation axis

Ivo Ivanov,Carsten Markgraf

Main category: cs.CV

TL;DR: A fast linear-time method for globally optimal point cloud registration in rotation-only cases, significantly outperforming current SDP solvers in speed.

Details Motivation: Current globally optimal approaches for point cloud registration, such as those based on semidefinite programming (SDP), are computationally expensive, taking hundreds of seconds for just 100 points. A faster solution is needed for practical applications. Method: A novel linear time convex relaxation and a contractor method to accelerate Branch and Bound (BnB) are proposed, focusing on the rotation-only truncated least squares (TLS) problem. Result: The proposed solver achieves provable global optimality for 3D point cloud registration with 100 points in less than half a second when the rotation axis is known, making it two orders of magnitude faster than STRIDE, the state-of-the-art SDP solver. Conclusion: The proposed method offers a significantly faster approach to achieving global optimality for the rotation-only TLS problem in point cloud registration compared to existing SDP solvers, although it cannot yet handle the full 6DoF problem. Abstract: Recent results showed that point cloud registration with given correspondences can be made robust to outlier rates of up to 95% using the truncated least squares (TLS) formulation. However, solving this combinatorial optimization problem to global optimality is challenging. Provably globally optimal approaches using semidefinite programming (SDP) relaxations take hundreds of seconds for 100 points. In this paper, we propose a novel linear time convex relaxation as well as a contractor method to speed up Branch and Bound (BnB). Our solver can register two 3D point clouds with 100 points to provable global optimality in less than half a second when the axis of rotation is provided. Although it currently cannot solve the full 6DoF problem, it is two orders of magnitude faster than the state-of-the-art SDP solver STRIDE when solving the rotation-only TLS problem. 
In addition to providing a formal proof for global optimality, we present empirical evidence of global optimality using adversarial instances with local minima close to the global minimum.
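For readers unfamiliar with the TLS objective, the sketch below evaluates it for a rotation about a known axis, where the unknown reduces to a single angle. It assumes the axis is z and replaces the paper's Branch and Bound with a plain grid search, which gives no optimality certificate; the function names, threshold, and toy points are illustrative.

```python
import math


def rotate_z(point, angle):
    """Rotate a 3D point about the z axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def tls_cost(source, target, angle, threshold):
    """Truncated least squares: each squared residual is capped at threshold**2."""
    cap = threshold ** 2
    total = 0.0
    for p, q in zip(source, target):
        rx, ry, rz = rotate_z(p, angle)
        residual_sq = (rx - q[0]) ** 2 + (ry - q[1]) ** 2 + (rz - q[2]) ** 2
        total += min(residual_sq, cap)
    return total


def grid_search(source, target, threshold, steps=3600):
    """Exhaustive search over the angle; a stand-in for BnB without any proof."""
    angles = [2 * math.pi * i / steps for i in range(steps)]
    return min(angles, key=lambda a: tls_cost(source, target, a, threshold))


source = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0), (2.0, 0.0, 1.0)]
true_angle = math.pi / 6
target = [rotate_z(p, true_angle) for p in source]
target[3] = (10.0, 10.0, 10.0)          # one gross outlier correspondence

best = grid_search(source, target, threshold=0.5)
print(round(math.degrees(best), 3))      # 30.0
```

The cap is what makes the problem robust and also what makes it combinatorial: the outlier contributes a constant 0.5² = 0.25 at every angle, so it cannot pull the estimate away from the inliers, but deciding which correspondences are capped is a discrete choice.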

[105] Multi-perspective monitoring of wildlife and human activities from camera traps and drones with deep learning models

Hao Chen,Fang Qiu,Li An,Douglas Stow,Eve Bohnett,Haitao Lyu,Shuang Tian

Main category: cs.CV

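TL;DR: By combining camera-trap and drone imagery with deep learning models, this study reveals potential conflict zones between wildlife and human activities in and around Chitwan National Park, Nepal.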

Details Motivation: Understanding the spatial distribution of wildlife and human activities is essential for evaluating human-wildlife interactions and for effective conservation planning. Method: Camera-trap and drone imagery are combined for multi-perspective monitoring, and deep learning models (such as YOLOv11s and Faster RCNN) are used for automated object detection. Result: YOLOv11s achieved the highest performance on camera-trap imagery, with a precision of 96.2%, a recall of 92.3%, and mAP50 values reported as 96.7% and 81.3%. Conclusion: Integrating multi-perspective monitoring with automated object detection enhances wildlife monitoring and landscape management. Abstract: Wildlife and human activities are key components of landscape systems. Understanding their spatial distribution is essential for evaluating human-wildlife interactions and informing effective conservation planning. This study performs multi-perspective monitoring of wildlife and human activities by combining camera traps and drone imagery, capturing the spatial patterns of their distributions; this allows the identification of overlap between their activity zones and the assessment of the degree of human-wildlife conflict. The study was conducted in Chitwan National Park (CNP), Nepal, and adjacent regions. Images collected by visible and near-infrared camera traps and thermal infrared drones from February to July 2022 were processed to create training and testing datasets, which were used to build deep learning models to automatically identify wildlife and human activities. Drone-collected thermal imagery was used for detecting targets to provide multiple monitoring perspectives. Spatial pattern analysis was performed to identify animal and resident activity hotspots and delineate potential human-wildlife conflict zones. Among the deep learning models tested, YOLOv11s achieved the highest performance with a precision of 96.2%, recall of 92.3%, mAP50 of 96.7%, and mAP50 of 81.3%, making it the most effective for detecting objects in camera trap imagery. Drone-based thermal imagery, analyzed with an enhanced Faster RCNN model, added a complementary aerial viewpoint to camera trap detections. Spatial pattern analysis identified clear hotspots for both wildlife and human activities and their overlapping patterns within certain areas in the CNP and buffer zones, indicating potential conflict. 
This study reveals human-wildlife conflicts within the conserved landscape. Integrating multi-perspective monitoring with automated object detection enhances wildlife surveillance and landscape management.

[106] When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding

Pengcheng Fang,Yuxia Chen,Rui Guo

Main category: cs.CV

TL;DR: This paper proposes Grounded VideoDiT, a Video LLM that enhances temporal perception and alignment through novel architectural innovations, leading to superior performance on video understanding tasks.

Details Motivation: Current Video LLMs struggle with temporal perception, weak frame-level features, and misaligned language-vision representations, prompting the need for a more precise and fine-grained approach. Method: Grounded VideoDiT introduces three innovations: a Diffusion Temporal Latent encoder for temporal consistency, object-grounded representations for better alignment, and a mixed-token scheme for explicit timestamp modeling. Result: Grounded VideoDiT achieves state-of-the-art results on Charades-STA, NExT-GQA, and several VideoQA benchmarks. Conclusion: Grounded VideoDiT improves video understanding by overcoming limitations in temporal perception and alignment, achieving state-of-the-art results on multiple benchmarks. Abstract: Understanding videos requires more than answering open-ended questions; it demands the ability to pinpoint when events occur and how entities interact across time. While recent Video LLMs have achieved remarkable progress in holistic reasoning, they remain coarse in temporal perception: timestamps are encoded only implicitly, frame-level features are weak in capturing continuity, and language-vision alignment often drifts from the entities of interest. In this paper, we present Grounded VideoDiT, a Video LLM designed to overcome these limitations by introducing three key innovations. First, a Diffusion Temporal Latent (DTL) encoder enhances boundary sensitivity and maintains temporal consistency. Second, object-grounded representations explicitly bind query entities to localized visual evidence, strengthening alignment. Third, a mixed-token scheme with discrete temporal tokens provides explicit timestamp modeling, enabling fine-grained temporal reasoning. Together, these designs equip Grounded VideoDiT with robust grounding capabilities, as validated by state-of-the-art results on Charades-STA, NExT-GQA, and multiple VideoQA benchmarks.
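One common way to realise discrete temporal tokens is to quantise normalised timestamps into a fixed number of bins and interleave the resulting tokens with visual tokens. The sketch below shows that generic scheme only; the `<t_k>` token names, the `<frame>` placeholder, and the bin count of 100 are hypothetical and are not taken from the paper.

```python
def time_to_token(seconds, duration, num_bins=100):
    """Map a timestamp to one of `num_bins` discrete tokens (hypothetical names)."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    fraction = min(max(seconds / duration, 0.0), 1.0)
    index = min(int(fraction * num_bins), num_bins - 1)
    return f"<t_{index}>"


def token_to_time(token, duration, num_bins=100):
    """Return the centre of the bin named by `token`, in seconds."""
    index = int(token[3:-1])          # strip the "<t_" prefix and ">" suffix
    return (index + 0.5) * duration / num_bins


def mixed_sequence(frame_times, duration, num_bins=100):
    """Interleave one time token with one visual placeholder per sampled frame."""
    sequence = []
    for seconds in frame_times:
        sequence.append(time_to_token(seconds, duration, num_bins))
        sequence.append("<frame>")    # stands in for that frame's visual tokens
    return sequence


print(time_to_token(30, 120))         # <t_25>
print(mixed_sequence([0, 60], 120))   # ['<t_0>', '<frame>', '<t_50>', '<frame>']
```

With such tokens in the vocabulary, a temporal grounding answer can be emitted as a pair of tokens for the start and end of an event, and the bin width (here 1.2 s for a 120 s video) bounds the achievable temporal precision.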

[107] Weakly-Supervised Learning for Tree Instances Segmentation in Airborne Lidar Point Clouds

Swann Emilien Céleste Destouches,Jesse Lahaye,Laurent Valentin Jospin,Jan Skaloud

Main category: cs.CV

TL;DR: This paper proposes a weakly supervised method for tree instance segmentation in ALS data, achieving a 34% improvement in accuracy and reducing false positives, though it struggles in areas with small trees or complex environments.

Details Motivation: Tree instance segmentation from ALS data is crucial for forest monitoring but challenging due to variations in data characteristics and the high cost of obtaining precisely labeled datasets for supervised methods. Method: A weakly supervised approach is proposed where an initial segmentation result is labeled by a human operator as a quality rating. These labels train a rating model to classify segmentation outputs, and its feedback is used to finetune the segmentation model. Result: The method improves the correct identification of tree instances by 34% while significantly reducing false positives (non-tree instances predicted as trees). However, performance decreases in areas with small trees or complex surroundings like shrubs and boulders. Conclusion: The proposed weakly supervised approach significantly improves the performance of the segmentation model for airborne laser scanning data by reducing non-tree instances and leveraging feedback from a rating model, although it still faces challenges in certain complex environments. Abstract: Tree instance segmentation of airborne laser scanning (ALS) data is of utmost importance for forest monitoring, but remains challenging due to variations in the data caused by factors such as sensor resolution, vegetation state at acquisition time, terrain characteristics, etc. Moreover, obtaining a sufficient amount of precisely labeled data to train fully supervised instance segmentation methods is expensive. To address these challenges, we propose a weakly supervised approach where labels of an initial segmentation result obtained either by a non-finetuned model or a closed form algorithm are provided as a quality rating by a human operator. The labels produced during the quality assessment are then used to train a rating model, whose task is to classify a segmentation output into the same classes as specified by the human operator. 
Finally, the segmentation model is finetuned using feedback from the rating model. This in turn improves the original segmentation model by 34% in terms of correctly identified tree instances while considerably reducing the number of non-tree instances predicted. Challenges remain for data over sparsely forested regions characterized by small trees (less than two meters in height) and for complex surroundings containing shrubs, boulders, etc., which can be confused with trees; in both cases the performance of the proposed method is reduced.

[108] Towards a 3D Transfer-based Black-box Attack via Critical Feature Guidance

Shuchao Pang,Zhenghan Chen,Shen Zhang,Liming Lu,Siyuan Liang,Anan Du,Yongbin Zhou

Main category: cs.CV

TL;DR: This paper proposes CFG, a transfer-based black-box attack method that generates adversarial point clouds without requiring knowledge of the target model, achieving superior performance on benchmark datasets.

Details Motivation: Deep neural networks for 3D point clouds are vulnerable to adversarial examples, but existing attack methods require information about the target model, which is often unavailable in real-world scenarios. This necessitates the development of a transfer-based black-box attack method. Method: CFG computes feature importance and prioritizes the corruption of critical features that are likely to be used across different DNN architectures. It also constrains the deviation in the loss function to maintain imperceptibility. Result: Experiments on ModelNet40 and ScanObjectNN datasets showed that CFG outperforms existing attack methods by a large margin in generating transferable adversarial point clouds. Conclusion: The proposed CFG method significantly outperforms state-of-the-art attack methods in generating transferable adversarial point clouds without requiring any information about the target models. Abstract: Deep neural networks for 3D point clouds have been demonstrated to be vulnerable to adversarial examples. Previous 3D adversarial attack methods often exploit certain information about the target models, such as model parameters or outputs, to generate adversarial point clouds. However, in realistic scenarios, it is challenging to obtain any information about the target models under conditions of absolute security. Therefore, we focus on transfer-based attacks, where generating adversarial point clouds does not require any information about the target models. Based on our observation that the critical features used for point cloud classification are consistent across different DNN architectures, we propose CFG, a novel transfer-based black-box attack method that improves the transferability of adversarial point clouds via the proposed Critical Feature Guidance. 
Specifically, our method regularizes the search of adversarial point clouds by computing the importance of the extracted features, prioritizing the corruption of critical features that are likely to be adopted by diverse architectures. Further, we explicitly constrain the maximum deviation extent of the generated adversarial point clouds in the loss function to ensure their imperceptibility. Extensive experiments conducted on the ModelNet40 and ScanObjectNN benchmark datasets demonstrate that the proposed CFG outperforms the state-of-the-art attack methods by a large margin.

[109] MapKD: Unlocking Prior Knowledge with Cross-Modal Distillation for Efficient Online HD Map Construction

Ziyang Yan,Ruikai Li,Zhiyong Cui,Bohan Li,Han Jiang,Yilong Ren,Aoyong Li,Zhenning Li,Sijia Wen,Haiyang Yu

Main category: cs.CV

TL;DR: MapKD is a novel knowledge distillation framework for autonomous driving that transfers knowledge from complex multimodal models to an efficient vision-based student model, enhancing performance while reducing computational costs.

Details Motivation: Existing methods for online HD map construction rely on outdated offline maps and expensive multi-modal sensors, causing unnecessary computational overhead. MapKD aims to transfer knowledge efficiently to a lightweight vision-based model. Method: MapKD proposes a multi-level cross-modal knowledge distillation framework with a Teacher-Coach-Student paradigm, incorporating Token-Guided 2D Patch Distillation and Masked Semantic Response Distillation strategies. Result: Experiments on nuScenes dataset show improvements of +6.68 mIoU and +10.94 mAP for the student model, with faster inference speed. Conclusion: MapKD improves student model performance while accelerating inference speed, showing the effectiveness of the proposed knowledge distillation framework. Abstract: Online HD map construction is a fundamental task in autonomous driving systems, aiming to acquire semantic information of map elements around the ego vehicle based on real-time sensor inputs. Recently, several approaches have achieved promising results by incorporating offline priors such as SD maps and HD maps or by fusing multi-modal data. However, these methods depend on stale offline maps and multi-modal sensor suites, resulting in avoidable computational overhead at inference. To address these limitations, we employ a knowledge distillation strategy to transfer knowledge from multimodal models with prior knowledge to an efficient, low-cost, and vision-centric student model. Specifically, we propose MapKD, a novel multi-level cross-modal knowledge distillation framework with an innovative Teacher-Coach-Student (TCS) paradigm. This framework consists of: (1) a camera-LiDAR fusion model with SD/HD map priors serving as the teacher; (2) a vision-centric coach model with prior knowledge and simulated LiDAR to bridge the cross-modal knowledge transfer gap; and (3) a lightweight vision-based student model. 
Additionally, we introduce two targeted knowledge distillation strategies: Token-Guided 2D Patch Distillation (TGPD) for bird's eye view feature alignment and Masked Semantic Response Distillation (MSRD) for semantic learning guidance. Extensive experiments on the challenging nuScenes dataset demonstrate that MapKD improves the student model by +6.68 mIoU and +10.94 mAP while simultaneously accelerating inference speed. The code is available at: https://github.com/2004yan/MapKD2026.
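As background for the response-level distillation mentioned above, the sketch below shows a generic masked knowledge distillation loss: the KL divergence between temperature-softened teacher and student class distributions, averaged over the cells selected by a mask. It illustrates the general technique only; the exact MSRD formulation, its masking rule, and the temperature value are not given in the abstract and are assumptions here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [value / temperature for value in logits]
    top = max(scaled)
    exps = [math.exp(value - top) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def masked_distillation_loss(teacher_logits, student_logits, mask, temperature=2.0):
    """Average KL(teacher || student) over the cells where mask is truthy."""
    selected = [
        (t, s) for t, s, m in zip(teacher_logits, student_logits, mask) if m
    ]
    if not selected:
        return 0.0
    losses = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in selected
    ]
    return sum(losses) / len(losses)


# Three BEV cells with two classes each; only the first two cells are supervised.
teacher = [[4.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
student = [[4.0, 0.0], [2.0, 1.0], [9.0, -9.0]]
print(masked_distillation_loss(teacher, student, mask=[1, 1, 0]))
```

Masking keeps the student from being penalised on cells that are excluded from supervision (the third cell above), so its large disagreement there does not affect the loss.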

[110] CM2LoD3: Reconstructing LoD3 Building Models Using Semantic Conflict Maps

Franz Hanke,Antonia Bieringer,Olaf Wysocki,Boris Jutzi

Main category: cs.CV

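TL;DR: This study proposes CM2LoD3, a new method that reconstructs LoD3 building models using conflict maps and synthetic semantic conflict maps, enabling more efficient 3D city modeling.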

Details Motivation: Existing LoD1 and LoD2 building models lack the detailed facade elements required for advanced urban analysis. LoD3 models fill this gap, but generating them usually depends on manual modeling, which makes large-scale adoption difficult. Method: An LoD3 building model reconstruction method based on Conflict Maps (CMs) is proposed, in which synthetic CMs produced by a Semantic Conflict Map Generator (SCMG) are used for the semantic segmentation of real CMs. Result: Experiments show that CM2LoD3 is effective at segmenting and reconstructing building openings, reaching 61% performance with uncertainty-aware fusion of segmented building textures. Conclusion: The study offers a new approach to automated LoD3 model reconstruction, advancing scalable and efficient 3D city modeling. Abstract: Detailed 3D building models are crucial for urban planning, digital twins, and disaster management applications. While Level of Detail 1 (LoD1) and LoD2 building models are widely available, they lack detailed facade elements essential for advanced urban analysis. In contrast, LoD3 models address this limitation by incorporating facade elements such as windows, doors, and underpasses. However, their generation has traditionally required manual modeling, making large-scale adoption challenging. In this contribution, CM2LoD3, we present a novel method for reconstructing LoD3 building models leveraging Conflict Maps (CMs) obtained from ray-to-model-prior analysis. Unlike previous works, we concentrate on semantically segmenting real-world CMs with synthetically generated CMs from our developed Semantic Conflict Map Generator (SCMG). We also observe that additional segmentation of textured models can be fused with CMs using confidence scores to further increase segmentation performance and thus increase 3D reconstruction accuracy. Experimental results demonstrate the effectiveness of our CM2LoD3 method in segmenting and reconstructing building openings, reaching 61% performance with uncertainty-aware fusion of segmented building textures. This research contributes to the advancement of automated LoD3 model reconstruction, paving the way for scalable and efficient 3D city modeling. Our project is available at: https://github.com/InFraHank/CM2LoD3

[111] LLM-empowered Dynamic Prompt Routing for Vision-Language Models Tuning under Long-Tailed Distributions

Yongju Jia,Jiarui Ma,Xiangxian Li,Baiqiao Zhang,Xianhui Cao,Juan Liu,Yulong Bian

Main category: cs.CV

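TL;DR: This paper proposes a multi-dimensional dynamic prompt routing framework for fine-tuning pre-trained vision-language models, aimed at the class imbalance problem.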

Details Motivation: To address the bias accumulation that arises when pre-trained vision-language models are fine-tuned in class-imbalanced scenarios. Method: A comprehensive knowledge base spanning five visual-semantic dimensions is constructed. During fine-tuning, a dynamic routing mechanism aligns global visual classes, retrieves optimal prompts, and balances fine-grained semantics, producing stable predictions through logits fusion. Result: Experiments on long-tailed benchmarks show that MDPR achieves results comparable to current state-of-the-art methods. Ablation studies further confirm the effectiveness of the semantic library for tail classes and show that the dynamic routing adds only minimal computational overhead. Conclusion: MDPR is a flexible and efficient enhancement for VLM fine-tuning under data imbalance. Abstract: Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive capability in visual tasks, but their fine-tuning often suffers from bias in class-imbalanced scenarios. Recent works have introduced large language models (LLMs) to enhance VLM fine-tuning by supplementing semantic information. However, they often overlook inherent class imbalance in VLMs' pre-training, which may lead to bias accumulation in downstream tasks. To address this problem, this paper proposes a Multi-dimensional Dynamic Prompt Routing (MDPR) framework. MDPR constructs a comprehensive knowledge base for classes, spanning five visual-semantic dimensions. During fine-tuning, the dynamic routing mechanism aligns global visual classes, retrieves optimal prompts, and balances fine-grained semantics, yielding stable predictions through logits fusion. Extensive experiments on long-tailed benchmarks, including CIFAR-LT, ImageNet-LT, and Places-LT, demonstrate that MDPR achieves results comparable to current SOTA methods. Ablation studies further confirm the effectiveness of our semantic library for tail classes, and show that our dynamic routing incurs minimal computational overhead, making MDPR a flexible and efficient enhancement for VLM fine-tuning under data imbalance.

[112] StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding

Yanlai Yang,Zhuokai Zhao,Satya Narayan Shukla,Aashu Singh,Shlok Kumar Mishra,Lizhu Zhang,Mengye Ren

Main category: cs.CV

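TL;DR: StreamMem is a query-agnostic KV cache memory mechanism designed for streaming video understanding; it compresses the key-value cache efficiently when processing long videos while preserving question-answering accuracy.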

Details Motivation: Multimodal large language models (MLLMs) have made significant progress in visual-language reasoning, but they remain limited in handling long videos efficiently. Existing visual compression methods either encode the entire visual context before compression or need access to the questions in advance, which is impractical for long video understanding and multi-turn conversational settings. Method: StreamMem encodes new video frames in a streaming manner and compresses the KV cache using attention scores between visual tokens and generic query tokens, while maintaining a fixed-size KV memory to enable efficient question answering in memory-constrained, long-video scenarios. Result: Evaluation on three long video understanding benchmarks and two streaming video question answering benchmarks shows that StreamMem achieves state-of-the-art performance in query-agnostic KV cache compression and is competitive with query-aware compression methods. Conclusion: StreamMem offers an effective way to reduce the memory and computational overhead of the key-value cache in long video understanding and streaming video question answering. Abstract: Multimodal large language models (MLLMs) have made significant progress in visual-language reasoning, but their ability to efficiently handle long videos remains limited. Despite recent advances in long-context MLLMs, storing and attending to the key-value (KV) cache for long visual contexts incurs substantial memory and computational overhead. Existing visual compression methods require either encoding the entire visual context before compression or having access to the questions in advance, which is impractical for long video understanding and multi-turn conversational settings. In this work, we propose StreamMem, a query-agnostic KV cache memory mechanism for streaming video understanding. Specifically, StreamMem encodes new video frames in a streaming manner, compressing the KV cache using attention scores between visual tokens and generic query tokens, while maintaining a fixed-size KV memory to enable efficient question answering (QA) in memory-constrained, long-video scenarios. Evaluation on three long video understanding and two streaming video question answering benchmarks shows that StreamMem achieves state-of-the-art performance in query-agnostic KV cache compression and is competitive with query-aware compression approaches.
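The streaming compression loop described above can be sketched as follows: after each new frame, every cached key is scored by the attention it receives from a small set of generic query vectors, and only the top-scoring entries are kept so that the memory never exceeds a fixed budget. This is a simplified single-head illustration in pure Python; the hard top-k selection rule, the function names, and the toy vectors are assumptions, and details of the actual method (per-layer handling, any merging of tokens) are not modeled.

```python
import math


def attention_received(query_tokens, keys):
    """Average softmax attention each key receives from the generic queries."""
    dim = len(keys[0])
    received = [0.0] * len(keys)
    for query in query_tokens:
        logits = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in keys
        ]
        top = max(logits)
        exps = [math.exp(value - top) for value in logits]
        total = sum(exps)
        for i, e in enumerate(exps):
            received[i] += e / total / len(query_tokens)
    return received


def compress(keys, values, query_tokens, budget):
    """Keep the `budget` highest-scoring entries, preserving temporal order."""
    if len(keys) <= budget:
        return keys, values
    scores = attention_received(query_tokens, keys)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [keys[i] for i in keep], [values[i] for i in keep]


def stream(frames, query_tokens, budget):
    """Append each frame's (keys, values) to memory, then compress to `budget`."""
    memory_keys, memory_values = [], []
    for frame_keys, frame_values in frames:
        memory_keys, memory_values = compress(
            memory_keys + frame_keys, memory_values + frame_values,
            query_tokens, budget,
        )
    return memory_keys, memory_values


generic_queries = [[1.0, 0.0]]                       # stand-in for generic query tokens
frames = [
    ([[5.0, 0.0], [0.0, 5.0]], ["a", "b"]),          # frame 1: keys, values
    ([[4.0, 0.0], [-3.0, 0.0]], ["c", "d"]),         # frame 2: keys, values
]
keys, values = stream(frames, generic_queries, budget=2)
print(values)                                         # ['a', 'c']
```

Because the queries are fixed in advance rather than derived from a user question, the cache can be compressed as frames arrive, before any question is known.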

[113] WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception

Zhiheng Liu,Xueqing Deng,Shoufa Chen,Angtian Wang,Qiushan Guo,Mingfei Han,Zeyue Xue,Mengzhao Chen,Ping Luo,Linjie Yang

Main category: cs.CV

TL;DR: WorldWeaver improves long video generation by combining RGB and perceptual data, using depth cues and noise scheduling to reduce errors and enhance video quality.

Details Motivation: Current generative video models struggle with structural and temporal consistency over long sequences due to reliance on RGB signals, which accumulate errors over time. Method: WorldWeaver jointly models RGB frames and perceptual conditions within a unified long-horizon scheme, utilizes depth cues for a memory bank, and employs segmented noise scheduling to reduce drift and computational cost. Result: Experiments show that WorldWeaver effectively reduces temporal drift and improves the quality and fidelity of long-horizon video generation across diffusion- and rectified flow-based models. Conclusion: WorldWeaver is an effective framework for long video generation, successfully reducing temporal drift and enhancing video fidelity by integrating RGB frames with perceptual conditions, depth cues, and segmented noise scheduling. Abstract: Generative video modeling has made significant strides, yet ensuring structural and temporal consistency over long sequences remains a challenge. Current methods predominantly rely on RGB signals, leading to accumulated errors in object structure and motion over extended durations. To address these issues, we introduce WorldWeaver, a robust framework for long video generation that jointly models RGB frames and perceptual conditions within a unified long-horizon modeling scheme. Our training framework offers three key advantages. First, by jointly predicting perceptual conditions and color information from a unified representation, it significantly enhances temporal consistency and motion dynamics. Second, by leveraging depth cues, which we observe to be more resistant to drift than RGB, we construct a memory bank that preserves clearer contextual information, improving quality in long-horizon video generation. Third, we employ segmented noise scheduling for training prediction groups, which further mitigates drift and reduces computational cost. 
Extensive experiments on both diffusion- and rectified flow-based models demonstrate the effectiveness of WorldWeaver in reducing temporal drift and improving the fidelity of generated videos.

[114] Fine-grained Multi-class Nuclei Segmentation with Molecular-empowered All-in-SAM Model

Xueyuan Li,Can Cui,Ruining Deng,Yucheng Tang,Quan Liu,Tianyuan Yao,Shunxing Bao,Naweed Chowdhury,Haichun Yang,Yuankai Huo

Main category: cs.CV

TL;DR: This paper introduces the molecular-empowered All-in-SAM Model to improve computational pathology by reducing reliance on detailed annotations, enhancing segmentation accuracy, and enabling better cell classification performance.

Details Motivation: General vision foundation models face challenges in fine-grained semantic segmentation, such as identifying specific nuclei subtypes or cells, which limits their effectiveness in computational pathology. Method: The paper proposes the molecular-empowered All-in-SAM Model, which integrates molecular-empowered learning, SAM adapter for specific semantic adaptation, and Molecular-Oriented Corrective Learning (MOCL) for segmentation refinement. Result: The All-in-SAM model significantly improves cell classification performance across in-house and public datasets, even with varying annotation quality. Conclusion: The All-in-SAM model reduces annotator workload, enhances biomedical image analysis accessibility, and advances medical diagnostics by automating pathology image analysis. Abstract: Purpose: Recent developments in computational pathology have been driven by advances in Vision Foundation Models, particularly the Segment Anything Model (SAM). This model facilitates nuclei segmentation through two primary methods: prompt-based zero-shot segmentation and the use of cell-specific SAM models for direct segmentation. These approaches enable effective segmentation across a range of nuclei and cells. However, general vision foundation models often face challenges with fine-grained semantic segmentation, such as identifying specific nuclei subtypes or particular cells. Approach: In this paper, we propose the molecular-empowered All-in-SAM Model to advance computational pathology by leveraging the capabilities of vision foundation models. 
This model incorporates a full-stack approach, focusing on: (1) annotation: engaging lay annotators through molecular-empowered learning to reduce the need for detailed pixel-level annotations; (2) learning: adapting the SAM model to emphasize specific semantics, using its strong generalizability through a SAM adapter; and (3) refinement: enhancing segmentation accuracy by integrating Molecular-Oriented Corrective Learning (MOCL). Results: Experimental results from both in-house and public datasets show that the All-in-SAM model significantly improves cell classification performance, even when faced with varying annotation quality. Conclusions: Our approach not only reduces the workload for annotators but also extends the accessibility of precise biomedical image analysis to resource-limited settings, thereby advancing medical diagnostics and automating pathology image analysis.

[115] Waver: Wave Your Way to Lifelike Video Generation

Yifu Zhang,Hao Yang,Yuqi Zhang,Yifei Hu,Fengda Zhu,Chuang Lin,Xiaofeng Mei,Yi Jiang,Zehuan Yuan,Bingyue Peng

Main category: cs.CV

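TL;DR: Waver is a high-performance unified image and video generation model with strong video generation capabilities and a high-quality data processing pipeline.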

Details Motivation: To advance video generation technology by providing a high-performance video generation model and helping the community train high-quality video generation models more efficiently. Method: Waver adopts a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. A comprehensive data curation pipeline is also established, and an MLLM-based video quality model is manually annotated and trained to filter for the highest-quality samples. Result: Waver ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis, consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. Conclusion: Waver is a high-performance foundation model for unified image and video generation. It directly generates 5- to 10-second videos at 720p, which are then upscaled to 1080p, and it excels at text-to-video, image-to-video, and text-to-image generation, with high motion amplitude and temporal consistency. Abstract: We present Waver, a high-performance foundation model for unified image and video generation. Waver can directly generate videos with durations ranging from 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and manually annotate and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community more efficiently train high-quality video generation models and accelerate progress in video generation technologies. Official page: https://github.com/FoundationVision/Waver.

[116] ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling

Jinhyung Park,Javier Romero,Shunsuke Saito,Fabian Prada,Takaaki Shiratori,Yichen Xu,Federica Bogo,Shoou-I Yu,Kris Kitani,Rawal Khirodkar

Main category: cs.CV

TL;DR: ATLAS is a high-fidelity body model that decouples shape and skeleton bases, offering enhanced shape expressivity and better performance in fitting unseen subjects in diverse poses.

Details Motivation: Existing human mesh modeling approaches struggle to capture detailed variations across diverse body poses and shapes due to limited training data diversity and restrictive modeling assumptions. Also, the common paradigm introduces dependencies between internal skeletons and outer soft tissue, limiting direct control over body height and bone lengths. Method: ATLAS is learned from 600k high-resolution scans captured using 240 synchronized cameras. It explicitly decouples the shape and skeleton bases by grounding the mesh representation in the human skeleton. Result: ATLAS outperforms existing methods by fitting unseen subjects in diverse poses more accurately. Quantitative evaluations show that ATLAS's non-linear pose correctives more effectively capture complex poses compared to linear models. Conclusion: ATLAS offers a high-fidelity body model that decouples shape and skeleton bases, enabling enhanced shape expressivity, fine-grained customization of body attributes, and keypoint fitting independent of external soft-tissue characteristics. Abstract: Parametric body models offer expressive 3D representation of humans across a wide range of poses, shapes, and facial expressions, typically derived by learning a basis over registered 3D meshes. However, existing human mesh modeling approaches struggle to capture detailed variations across diverse body poses and shapes, largely due to limited training data diversity and restrictive modeling assumptions. Moreover, the common paradigm first optimizes the external body surface using a linear basis, then regresses internal skeletal joints from surface vertices. This approach introduces problematic dependencies between internal skeleton and outer soft tissue, limiting direct control over body height and bone lengths. To address these issues, we present ATLAS, a high-fidelity body model learned from 600k high-resolution scans captured using 240 synchronized cameras. 
Unlike previous methods, we explicitly decouple the shape and skeleton bases by grounding our mesh representation in the human skeleton. This decoupling enables enhanced shape expressivity, fine-grained customization of body attributes, and keypoint fitting independent of external soft-tissue characteristics. ATLAS outperforms existing methods by fitting unseen subjects in diverse poses more accurately, and quantitative evaluations show that our non-linear pose correctives more effectively capture complex poses compared to linear models.

[117] SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass

Yanxu Meng,Haoning Wu,Ya Zhang,Weidi Xie

Main category: cs.CV

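TL;DR: SceneGen is a new framework that simultaneously generates multiple 3D assets from a single scene image and object masks, with no need for optimization or asset retrieval.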

Details Motivation: 3D content generation has recently attracted significant research interest due to its applications in VR/AR and embodied AI. This paper addresses the challenging task of synthesizing multiple 3D assets within a single scene image. Method: The proposed SceneGen framework integrates local and global scene information from visual and geometric encoders and, via a position head, generates 3D assets together with their relative spatial positions. Result: SceneGen shows efficient and robust generation in both quantitative and qualitative evaluations, and it extends directly to multi-image input scenarios. Conclusion: SceneGen offers a novel solution for high-quality 3D content generation and may advance its practical application in downstream tasks. The code and model will be made publicly available. Abstract: 3D content generation has recently attracted significant research interest due to its applications in VR/AR and embodied AI. In this work, we address the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architectural design enables improved generation performance with multi-image inputs; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robust generation abilities of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.

[118] Visual Autoregressive Modeling for Instruction-Guided Image Editing

Qingyang Mao,Qi Cai,Yehao Li,Yingwei Pan,Mingyue Cheng,Ting Yao,Qi Liu,Tao Mei

Main category: cs.CV

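TL;DR: VAREdit is an autoregressive image editing framework that improves editing accuracy and efficiency through multi-scale prediction and a SAR module, addressing the unintended modifications seen with diffusion models.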

Details Motivation: The global denoising process of diffusion models causes unintended modifications during image editing, whereas autoregressive models, with their causal and compositional mechanism, can follow editing instructions more faithfully. VAREdit is proposed to address these problems of diffusion models. Method: VAREdit uses a visual autoregressive (VAR) framework that generates multi-scale target features conditioned on source image features and text instructions. A Scale-Aligned Reference (SAR) module injects scale-matched conditioning information to address the problem of effectively conditioning on source image tokens. Result: On standard benchmarks, VAREdit achieves a GPT-Balance score more than 30% higher than leading diffusion-based methods, and it edits a $512\times512$ image in just 1.2 seconds, 2.2$\times$ faster than UltraEdit. Conclusion: By reframing image editing as a next-scale prediction problem, VAREdit effectively addresses the unintended modifications caused by the global denoising process of diffusion models and shows significant gains in editing adherence and efficiency. Abstract: Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently entangles the edited region with the entire image context, leading to unintended spurious modifications and compromised adherence to editing instructions. In contrast, autoregressive models offer a distinct paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally circumvents the adherence challenges of diffusion-based methods. In this paper, we present VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to effectively condition the source image tokens. We observe that finest-scale source features cannot effectively guide the prediction of coarser target features. To bridge this gap, we introduce a Scale-Aligned Reference (SAR) module, which injects scale-matched conditioning information into the first self-attention layer. VAREdit demonstrates significant advancements in both editing adherence and efficiency. On standard benchmarks, it outperforms leading diffusion-based methods by 30\%+ higher GPT-Balance score. Moreover, it completes a $512\times512$ editing in 1.2 seconds, making it 2.2$\times$ faster than the similarly sized UltraEdit. 
The models are available at https://github.com/HiDream-ai/VAREdit.

[119] Scaling Group Inference for Diverse and High-Quality Generation

Gaurav Parmar,Or Patashnik,Daniil Ostashev,Kuan-Chieh Wang,Kfir Aberman,Srinivasa Narasimhan,Jun-Yan Zhu

Main category: cs.CV

TL;DR: This paper proposes a scalable group inference method for generative models that enhances both the diversity and quality of multiple outputs, treating them as cohesive groups rather than independent samples.

Details Motivation: The motivation stems from the limitation of current generative models that sample outputs independently, leading to redundant results when users are presented with multiple images. This redundancy limits user choices and hinders idea exploration. Method: The method involves formulating group inference as a quadratic integer assignment problem, where candidate outputs are modeled as graph nodes, and a subset is selected to optimize quality and diversity. Runtime efficiency is improved through progressive pruning of the candidate set using intermediate predictions. Result: The experiments demonstrate that the proposed method significantly improves group diversity and output quality across various tasks, including text-to-image, image-to-image, image prompting, and video generation. Conclusion: The paper concludes that the introduced scalable group inference method effectively enhances both the diversity and quality of generative model outputs when presented as a group, outperforming independent sampling and recent inference algorithms. Abstract: Generative models typically sample outputs independently, and recent inference-time guidance and scaling algorithms focus on improving the quality of individual samples. However, in real-world applications, users are often presented with a set of multiple images (e.g., 4-8) for each prompt, where independent sampling tends to lead to redundant results, limiting user choices and hindering idea exploration. In this work, we introduce a scalable group inference method that improves both the diversity and quality of a group of samples. We formulate group inference as a quadratic integer assignment problem: candidate outputs are modeled as graph nodes, and a subset is selected to optimize sample quality (unary term) while maximizing group diversity (binary term). 
To substantially improve runtime efficiency, we progressively prune the candidate set using intermediate predictions, allowing our method to scale up to large candidate sets. Extensive experiments show that our method significantly improves group diversity and quality compared to independent sampling baselines and recent inference algorithms. Our framework generalizes across a wide range of tasks, including text-to-image, image-to-image, image prompting, and video generation, enabling generative models to treat multiple outputs as cohesive groups rather than independent samples.
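
The selection objective described in this abstract (a unary quality term plus a pairwise diversity term over a chosen subset) can be illustrated with a small sketch. This is not the authors' solver and it omits their progressive pruning: it swaps in exhaustive search over subsets, which is only feasible for small candidate sets. The function name `select_group`, the weight `lam`, and the score and distance inputs are all hypothetical names introduced for illustration.

```python
from itertools import combinations


def select_group(quality, distance, k, lam=1.0):
    """Pick k candidate indices maximizing quality plus weighted pairwise distance.

    quality:  list of per-candidate quality scores (unary term).
    distance: symmetric matrix of pairwise dissimilarities (binary term).
    lam:      assumed trade-off weight between the two terms.
    Exhaustive search, used here only as a stand-in for a real solver.
    """
    best_score, best_subset = float("-inf"), None
    for subset in combinations(range(len(quality)), k):
        unary = sum(quality[i] for i in subset)
        binary = sum(distance[i][j] for i, j in combinations(subset, 2))
        score = unary + lam * binary
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset, best_score
```

With `lam=0` the sketch reduces to picking the top-k candidates by quality, which is where near-duplicate outputs tend to be selected together; raising `lam` trades some individual quality for a more varied group.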

[120] CineScale: Free Lunch in High-Resolution Cinematic Visual Generation

Haonan Qiu,Ning Yu,Ziqi Huang,Paul Debevec,Ziwei Liu

Main category: cs.CV

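TL;DR: CineScale is a new inference paradigm that effectively addresses repetitive patterns in high-resolution visual generation, enabling high-quality 8K image and 4K video generation.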

Details Motivation: Existing high-resolution visual generation methods tend to produce low-quality content with repetitive patterns when generating beyond the training resolution, so a new approach is needed to extend the capabilities of pre-trained models. Method: CineScale is a new inference paradigm with dedicated variants designed for different types of video generation architectures. It targets the repetitive patterns caused by the increase in high-frequency information when generating high-resolution visual content. Result: Experiments show that CineScale is superior in extending higher-resolution visual generation, enabling 8K image generation without any fine-tuning and 4K video generation with minimal LoRA fine-tuning. Conclusion: By using dedicated variants to handle the issues introduced by different video generation architectures, CineScale achieves high-resolution image and video generation with no fine-tuning or only minimal LoRA fine-tuning, significantly extending the capabilities of pre-trained models. Abstract: Visual diffusion models achieve remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to exhibit the untapped potential higher-resolution visual generation of pre-trained models. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns deriving from the accumulated errors. In this work, we propose CineScale, a novel inference paradigm to enable higher-resolution visual generation. To tackle the various issues introduced by the two types of video generation architectures, we propose dedicated variants tailored to each. Unlike existing baseline methods that are confined to high-resolution T2I and T2V generation, CineScale broadens the scope by enabling high-resolution I2V and V2V synthesis, built atop state-of-the-art open-source video generation frameworks. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Remarkably, our approach enables 8k image generation without any fine-tuning, and achieves 4k video generation with only minimal LoRA fine-tuning. 
Generated video samples are available at our website: https://eyeline-labs.github.io/CineScale/.