cs.CL
[1] Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training
Jianfeng Si,Lin Sun,Zhewen Tan,Xiangzheng Zhang
Main category: cs.CL
TL;DR: This paper proposes a unified co-training framework for LLM safety, enabling multiple safety behaviors in a single training stage, offering better efficiency, control, and safety performance than existing methods.
Details
Motivation: Current methods like SFT and RLHF rely on multi-stage pipelines and lack post-deployment controllability; the proposed method aims to simplify training and enhance flexibility in managing safety behaviors. Method: A unified co-training framework that integrates multiple safety behaviors (positive, negative, and rejective) within a single Supervised Fine-Tuning (SFT) stage, using system-level instructions or magic tokens for dynamic behavioral activation. Result: The framework achieves safety alignment quality comparable to SFT+DPO, with the 8B model outperforming DeepSeek-R1 (671B) in safety performance while significantly reducing training and deployment costs. Conclusion: The proposed co-training framework offers a scalable, efficient, and highly controllable solution for content safety in LLMs, enabling fine-grained control and robust safety performance. Abstract: Current methods for content safety in Large Language Models (LLMs), such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), often rely on multi-stage training pipelines and lack fine-grained, post-deployment controllability. To address these limitations, we propose a unified co-training framework that efficiently integrates multiple safety behaviors: positive (lawful/prosocial), negative (unfiltered/risk-prone) and rejective (refusal-oriented/conservative) within a single SFT stage. Notably, each behavior is dynamically activated via a simple system-level instruction, or magic token, enabling stealthy and efficient behavioral switching at inference time. This flexibility supports diverse deployment scenarios, such as positive for safe user interaction, negative for internal red-teaming, and rejective for context-aware refusals triggered by upstream moderation signals. This co-training strategy induces a distinct Safety Alignment Margin in the output space, characterized by well-separated response distributions corresponding to each safety mode. 
The existence of this margin provides empirical evidence for the model's safety robustness and enables unprecedented fine-grained control. Experiments show that our method matches the safety alignment quality of SFT+DPO, with our 8B model notably surpassing DeepSeek-R1 (671B) in safety performance, while significantly reducing both training complexity and deployment costs. This work presents a scalable, efficient, and highly controllable solution for LLM content safety.
[2] Preliminary Ranking of WMT25 General Machine Translation Systems
Tom Kocmi,Eleftherios Avramidis,Rachel Bawden,Ondřej Bojar,Konstantin Dranch,Anton Dvorkovich,Sergey Dukanov,Natalia Fedorova,Mark Fishel,Markus Freitag,Thamme Gowda,Roman Grundkiewicz,Barry Haddow,Marzena Karpinska,Philipp Koehn,Howard Lakougna,Jessica Lundin,Kenton Murray,Masaaki Nagata,Stefano Perrella,Lorenzo Proietti,Martin Popel,Maja Popović,Parker Riley,Mariya Shmatova,Steinþór Steingrímsson,Lisa Yankovskaya,Vilém Zouhar
Main category: cs.CL
TL;DR: The WMT25 preliminary ranking, based on automatic metrics, suggests potential bias towards systems with re-ranking techniques; the official human evaluation-based ranking will be more reliable.
Details
Motivation: To share preliminary results with task participants for preparing their system submission papers. Method: Automatic metrics were used to evaluate MT systems for the preliminary ranking. Result: The preliminary ranking may favor systems employing re-ranking techniques like Quality Estimation re-ranking or Minimum Bayes Risk decoding. Conclusion: The preliminary ranking of the WMT25 General Machine Translation Shared Task may be biased towards systems using re-ranking techniques, and the official ranking will be based on more reliable human evaluations. Abstract: We present the preliminary ranking of the WMT25 General Machine Translation Shared Task, in which MT systems have been evaluated using automatic metrics. As this ranking is based on automatic evaluations, it may be biased in favor of systems that employ re-ranking techniques, such as Quality Estimation re-ranking or Minimum Bayes Risk decoding. The official WMT25 ranking will be based on human evaluation, which is more reliable and will supersede the automatic ranking. The purpose of this report is not to present the final findings of the General MT task, but rather to share preliminary results with task participants, which may be useful when preparing their system submission papers.
[3] Bridging the Culture Gap: A Framework for LLM-Driven Socio-Cultural Localization of Math Word Problems in Low-Resource Languages
Israel Abebe Azime,Tadesse Destaw Belay,Dietrich Klakow,Philipp Slusallek,Anshuman Chhabra
Main category: cs.CL
TL;DR: This paper proposes an LLM-driven framework for creating culturally localized math word problem datasets, reducing reliance on English-centric data and improving multilingual reasoning performance.
Details
Motivation: Multilingual and culturally-grounded mathematical reasoning lags behind English due to a lack of localized datasets. Existing benchmarks are translation-based and retain English-centric entities due to high human annotation costs and limited automated tools. Method: The paper introduces a framework for LLM-driven cultural localization of math word problems, which automatically generates datasets incorporating native entities such as names, organizations, and currencies. Result: Experiments demonstrate that the framework mitigates English-centric bias and improves robustness when native entities are used. It also reveals that translated benchmarks may not accurately reflect true multilingual math abilities in appropriate socio-cultural contexts. Conclusion: The proposed LLM-driven framework effectively addresses the lack of culturally-grounded datasets for multilingual mathematical reasoning in low-resource languages, reducing English-centric entity bias and enhancing robustness in localized contexts. Abstract: Large language models (LLMs) have demonstrated significant capabilities in solving mathematical problems expressed in natural language. However, multilingual and culturally-grounded mathematical reasoning in low-resource languages lags behind English due to the scarcity of socio-cultural task datasets that reflect accurate native entities such as person names, organization names, and currencies. Existing multilingual benchmarks are predominantly produced via translation and typically retain English-centric entities, owing to the high cost associated with human annotator-based localization. Moreover, automated localization tools are limited, and hence, truly localized datasets remain scarce. To bridge this gap, we introduce a framework for LLM-driven cultural localization of math word problems that automatically constructs datasets with native names, organizations, and currencies from existing sources. 
We find that translated benchmarks can obscure true multilingual math ability under appropriate socio-cultural contexts. Through extensive experiments, we also show that our framework can help mitigate English-centric entity bias and improve robustness when native entities are introduced across various languages.
[4] Improving LLMs for Machine Translation Using Synthetic Preference Data
Dario Vajda,Domen Vreš,Marko Robnik-Šikonja
Main category: cs.CL
TL;DR: This paper improves machine translation for Slovene using DPO training on a curated dataset, achieving better performance and fewer errors compared to baseline models.
Details
Motivation: The research aims to enhance the machine translation capabilities of a general instruction-tuned large language model using limited and easily produced data resources, focusing on Slovene. Method: The study uses Direct Preference Optimization (DPO) training on a programmatically curated and enhanced subset of a public dataset, with translations ranked using heuristics and automatic evaluation metrics such as COMET. Result: The fine-tuned model outperformed both GaMS-9B-Instruct and EuroLLM-9B-Instruct models, achieving a COMET score gain of about 0.04 and 0.02 respectively. Conclusion: The fine-tuned model consistently avoids language and formatting errors and performs better than the baseline models in translating Wikipedia articles. Abstract: Large language models have emerged as effective machine translation systems. In this paper, we explore how a general instruction-tuned large language model can be improved for machine translation using relatively few easily produced data resources. Using Slovene as a use case, we improve the GaMS-9B-Instruct model using Direct Preference Optimization (DPO) training on a programmatically curated and enhanced subset of a public dataset. As DPO requires pairs of quality-ranked instances, we generated its training dataset by translating English Wikipedia articles using two LLMs, GaMS-9B-Instruct and EuroLLM-9B-Instruct. We ranked the resulting translations based on heuristics coupled with automatic evaluation metrics such as COMET. The evaluation shows that our fine-tuned model outperforms both models involved in the dataset generation. In comparison to the baseline models, the fine-tuned model achieved a COMET score gain of around 0.04 and 0.02, respectively, on translating Wikipedia articles. It also more consistently avoids language and formatting errors.
[5] Multilingual Datasets for Custom Input Extraction and Explanation Requests Parsing in Conversational XAI Systems
Qianli Wang,Tatiana Anikina,Nils Feldhus,Simon Ostermann,Fedor Splitt,Jiaao Li,Yoana Tsoneva,Sebastian Möller,Vera Schmitt
Main category: cs.CL
TL;DR: This paper introduces new datasets and parsing methods to enhance multilingual support and custom input handling in ConvXAI systems.
Details
Motivation: The motivation was to overcome the limitations of current ConvXAI systems, particularly the scarcity of training data for multilingual applications and the limited support for free-form custom inputs. Method: The researchers introduced MultiCoXQL, a multilingual dataset, and Compass, a dataset for custom input extraction. They proposed a new parsing approach and conducted monolingual, cross-lingual, and multilingual evaluations using LLMs and BERT-type models. Result: The results showed improved multilingual parsing performance with the proposed parsing approach and demonstrated the effectiveness of the datasets in supporting custom input extraction across multiple languages. Conclusion: The study successfully addresses the challenges of multilingual generalization and custom input support in ConvXAI systems by introducing MultiCoXQL and Compass datasets, and by proposing an effective parsing approach. Abstract: Conversational explainable artificial intelligence (ConvXAI) systems based on large language models (LLMs) have garnered considerable attention for their ability to enhance user comprehension through dialogue-based explanations. Current ConvXAI systems often are based on intent recognition to accurately identify the user's desired intention and map it to an explainability method. While such methods offer great precision and reliability in discerning users' underlying intentions for English, a significant challenge in the scarcity of training data persists, which impedes multilingual generalization. Besides, the support for free-form custom inputs, which are user-defined data distinct from pre-configured dataset instances, remains largely limited. To bridge these gaps, we first introduce MultiCoXQL, a multilingual extension of the CoXQL dataset spanning five typologically diverse languages, including one low-resource language. 
Subsequently, we propose a new parsing approach aimed at enhancing multilingual parsing performance, and evaluate three LLMs on MultiCoXQL using various parsing strategies. Furthermore, we present Compass, a new multilingual dataset designed for custom input extraction in ConvXAI systems, encompassing 11 intents across the same five languages as MultiCoXQL. We conduct monolingual, cross-lingual, and multilingual evaluations on Compass, employing three LLMs of varying sizes alongside BERT-type models.
[6] Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner
Bolian Li,Yanran Wu,Xinyu Luo,Ruqi Zhang
Main category: cs.CL
TL;DR: This paper introduces the reward-Shifted Speculative Sampling (SSS) algorithm, which efficiently aligns large language models with human preferences by using a smaller draft model, achieving high performance at lower computational costs.
Details
Motivation: Test-time alignment techniques for large language models often incur high inference costs, limiting their practicality; this research aims to address that efficiency bottleneck. Method: The SSS algorithm modifies the acceptance criterion and bonus token distribution to exploit the distributional shift between the aligned draft model and the unaligned target model, theoretically demonstrating the recovery of the RLHF optimal solution. Result: The SSS algorithm achieves superior gold reward scores at a significantly reduced inference cost in test-time weak-to-strong alignment experiments. Conclusion: The reward-Shifted Speculative Sampling (SSS) algorithm effectively and efficiently aligns large language models with human preferences by leveraging a smaller, aligned draft model during inference. Abstract: Aligning large language models (LLMs) with human preferences has become a critical step in their development. Recent research has increasingly focused on test-time alignment, where additional compute is allocated during inference to enhance LLM safety and reasoning capabilities. However, these test-time alignment techniques often incur substantial inference costs, limiting their practical application. We are inspired by the speculative sampling acceleration, which leverages a small draft model to efficiently predict future tokens, to address the efficiency bottleneck of test-time alignment. We introduce the reward-Shifted Speculative Sampling (SSS) algorithm, in which the draft model is aligned with human preferences, while the target model remains unchanged. We theoretically demonstrate that the distributional shift between the aligned draft model and the unaligned target model can be exploited to recover the RLHF optimal solution without actually obtaining it, by modifying the acceptance criterion and bonus token distribution. 
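For orientation, the standard speculative sampling step that SSS starts from can be sketched as follows. This is the textbook accept/reject rule, not the modified acceptance criterion or bonus token distribution proposed in the paper; the function name `speculative_step` and the toy distributions are illustrative assumptions.

```python
import random


def speculative_step(p_target, q_draft, rng):
    """One standard speculative sampling step over a small vocabulary.

    p_target, q_draft: lists of token probabilities (same length, each sums to 1).
    Returns (token, accepted), where `accepted` says whether the draft
    proposal was kept.
    """
    vocab = range(len(q_draft))
    # The draft model proposes a token.
    x = rng.choices(vocab, weights=q_draft, k=1)[0]
    # Accept the proposal with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True
    # On rejection, resample from the residual max(0, p - q);
    # rng.choices normalizes the weights itself.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    x = rng.choices(vocab, weights=residual, k=1)[0]
    return x, False


if __name__ == "__main__":
    # Toy distributions (assumed for illustration only).
    rng = random.Random(0)
    p, q = [0.5, 0.3, 0.2], [0.2, 0.2, 0.6]
    counts = [0, 0, 0]
    for _ in range(20000):
        token, _ = speculative_step(p, q, rng)
        counts[token] += 1
    # The empirical frequencies should be close to the target distribution p.
    print([c / 20000 for c in counts])
```

With the unmodified rule, the emitted tokens follow the target distribution regardless of the draft; the abstract's point is that changing the acceptance rule and the resampling distribution lets the output follow a different, reward-shifted distribution instead.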
Our algorithm achieves superior gold reward scores at a significantly reduced inference cost in test-time weak-to-strong alignment experiments, thereby validating both its effectiveness and efficiency.
[7] LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text
MohamamdJavad Ardestani,Ehsan Kamalloo,Davood Rafiei
Main category: cs.CL
TL;DR: LongRecall is a novel framework for evaluating recall in machine-generated text, combining lexical, semantic, and structured entailment checks to improve accuracy and reduce errors.
Details
Motivation: Ensuring the completeness of machine-generated text is crucial in critical domains like medicine and law, and existing recall metrics suffer from inaccuracies due to reliance on lexical overlap or semantic misalignment. Method: LongRecall is a general three-stage recall evaluation framework that decomposes answers into self-contained facts, filters plausible candidate matches through lexical and semantic methods, and verifies alignment via structured entailment checks. Result: LongRecall showed substantial improvements in recall accuracy on three challenging long-form QA benchmarks using both human annotations and LLM-based judges. Conclusion: LongRecall serves as a foundational building block for systematic recall assessment, demonstrating substantial improvements in recall accuracy over strong baselines. Abstract: The completeness of machine-generated text, ensuring that it captures all relevant information, is crucial in domains such as medicine and law and in tasks like list-based question answering (QA), where omissions can have serious consequences. However, existing recall metrics often depend on lexical overlap, leading to errors with unsubstantiated entities and paraphrased answers, while LLM-as-a-Judge methods with long holistic prompts capture broader semantics but remain prone to misalignment and hallucinations without structured verification. We introduce LongRecall, a general three-stage recall evaluation framework that decomposes answers into self-contained facts, successively narrows plausible candidate matches through lexical and semantic filtering, and verifies their alignment through structured entailment checks. This design reduces false positives and false negatives while accommodating diverse phrasings and contextual variations, serving as a foundational building block for systematic recall assessment. 
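The three-stage structure described above can be sketched as a toy pipeline. Everything here is a simplified stand-in chosen for illustration: sentence splitting replaces fact decomposition, token-overlap (Jaccard) similarity replaces the lexical and semantic filters, and a token-containment check replaces the structured entailment step, which in practice would call an entailment model or LLM. The function names and the 0.2 threshold are assumptions, not taken from LongRecall.

```python
import re


def tokens(text):
    # Lowercased alphanumeric tokens as a set.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def decompose(answer):
    # Stage 1 stand-in: treat each sentence as one self-contained fact.
    return [s.strip() for s in re.split(r"[.!?]", answer) if s.strip()]


def jaccard(a, b):
    # Stage 2 stand-in: lexical overlap between two strings.
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def entails(candidate, reference_fact):
    # Stage 3 stand-in: every reference token appears in the candidate.
    return tokens(reference_fact) <= tokens(candidate)


def recall(reference_facts, generated_answer, threshold=0.2):
    """Fraction of reference facts covered by the generated answer."""
    candidates = decompose(generated_answer)
    covered = 0
    for fact in reference_facts:
        # Narrow the candidates first, then verify only the shortlist.
        shortlist = [c for c in candidates if jaccard(c, fact) >= threshold]
        if any(entails(c, fact) for c in shortlist):
            covered += 1
    return covered / len(reference_facts) if reference_facts else 0.0
```

The point of the staging is cost: the cheap filter prunes most candidate pairs so that the expensive verification step runs only on plausible matches.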
We evaluate LongRecall on three challenging long-form QA benchmarks using both human annotations and LLM-based judges, demonstrating substantial improvements in recall accuracy over strong lexical and LLM-as-a-Judge baselines.
[8] Mapping the Course for Prompt-based Structured Prediction
Matt Pauk,Maria Leonor Pacheco
Main category: cs.CL
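TL;DR: This paper proposes combining large language models with combinatorial inference to improve performance on structured prediction tasks; experiments show that the approach yields more accurate and more consistent predictions.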
Details
Motivation: LLMs suffer from hallucinations and struggle with complex reasoning and structured prediction tasks, so inference methods are needed to improve performance. Method: The paper combines LLMs with symbolic inference for structured prediction and runs experiments to evaluate the effect of different prompting strategies. Result: Experiments show that, regardless of the prompting strategy, adding symbolic inference improves the consistency and accuracy of predictions; calibration and fine-tuning further improve performance. Conclusion: Combining LLMs with combinatorial inference improves the accuracy and consistency of structured prediction, and structured learning remains valuable in the era of LLMs. Abstract: LLMs have been shown to be useful for a variety of language tasks, without requiring task-specific fine-tuning. However, these models often struggle with hallucinations and complex reasoning problems due to their autoregressive nature. We propose to address some of these issues, specifically in the area of structured prediction, by combining LLMs with combinatorial inference in an attempt to marry the predictive power of LLMs with the structural consistency provided by inference methods. We perform exhaustive experiments in an effort to understand which prompting strategies can effectively estimate LLM confidence values for use with symbolic inference, and show that, regardless of the prompting strategy, the addition of symbolic inference on top of prompting alone leads to more consistent and accurate predictions. Additionally, we show that calibration and fine-tuning using structured prediction objectives leads to increased performance for challenging tasks, showing that structured learning is still valuable in the era of LLMs.
[9] Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset
Rabeeh Karimi Mahabadi,Sanjeev Satheesh,Shrimai Prabhumoye,Mostofa Patwary,Mohammad Shoeybi,Bryan Catanzaro
Main category: cs.CL
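TL;DR: This work develops a new, domain-agnostic pipeline for building Nemotron-CC-Math from Common Crawl, addressing the quality problems of existing math datasets and yielding notable gains in math, code, and general reasoning.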
Details
Motivation: Existing math-focused datasets suffer from degraded quality due to brittle extraction heuristics, lossy HTML-to-text conversion, and the failure to reliably preserve mathematical structure. Method: A novel, domain-agnostic pipeline builds Nemotron-CC-Math from Common Crawl, using layout-aware rendering and a targeted LLM-based cleaning stage to recover math across various formats. Result: The authors collected a large, high-quality math corpus, Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens), with substantially more tokens than previous datasets. Conclusion: Nemotron-CC-Math is the first pipeline to reliably extract scientific content, including math, from noisy web-scale data, yielding measurable gains in math, code, and general reasoning and setting a new state of the art among open math pretraining corpora. Abstract: Pretraining large language models (LLMs) on high-quality, structured data such as mathematics and code substantially enhances reasoning capabilities. However, existing math-focused datasets built from Common Crawl suffer from degraded quality due to brittle extraction heuristics, lossy HTML-to-text conversion, and the failure to reliably preserve mathematical structure. In this work, we introduce Nemotron-CC-Math, a large-scale, high-quality mathematical corpus constructed from Common Crawl using a novel, domain-agnostic pipeline specifically designed for robust scientific text extraction. Unlike previous efforts, our pipeline recovers math across various formats (e.g., MathJax, KaTeX, MathML) by leveraging layout-aware rendering with lynx and a targeted LLM-based cleaning stage. This approach preserves the structural integrity of equations and code blocks while removing boilerplate, standardizing notation into LaTeX representation, and correcting inconsistencies. We collected a large, high-quality math corpus, namely Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens). Notably, Nemotron-CC-Math-4+ not only surpasses all prior open math datasets, including MegaMath, FineMath, and OpenWebMath, but also contains 5.5 times more tokens than FineMath-4+, which was previously the highest-quality math pretraining dataset. When used to pretrain a Nemotron-T 8B model, our corpus yields +4.8 to +12.6 gains on MATH and +4.6 to +14.3 gains on MBPP+ over strong baselines, while also improving general-domain performance on MMLU and MMLU-Stem. 
We present the first pipeline to reliably extract scientific content, including math, from noisy web-scale data, yielding measurable gains in math, code, and general reasoning, and setting a new state of the art among open math pretraining corpora. To support open-source efforts, we release our code and datasets.
[10] Identifying and Answering Questions with False Assumptions: An Interpretable Approach
Zijie Wang,Eduardo Blanco
Main category: cs.CL
TL;DR: This paper presents an approach to identify and answer questions with false assumptions by leveraging external evidence and validating atomic assumptions, significantly improving the accuracy and interpretability of answers from LLMs.
Details
Motivation: The motivation is to address the issue of misleading answers generated by LLMs due to hallucinations when answering questions with false assumptions. Method: The method involves reducing the problem to fact verification and leveraging external evidence to mitigate hallucinations in LLMs. This includes generating and validating atomic assumptions. Result: Experiments with five LLMs showed that incorporating retrieved evidence is beneficial, and generating and validating atomic assumptions yields more improvements while providing interpretable answers. Conclusion: The paper concludes that by generating and validating atomic assumptions and incorporating retrieved evidence, the identification and answering of questions with false assumptions can be significantly improved, providing interpretable answers. Abstract: People often ask questions with false assumptions, a type of question that does not have regular answers. Answering such questions requires first identifying the false assumptions. Large Language Models (LLMs) often generate misleading answers because of hallucinations. In this paper, we focus on identifying and answering questions with false assumptions in several domains. We first investigate reducing the problem to fact verification. Then, we present an approach leveraging external evidence to mitigate hallucinations. Experiments with five LLMs demonstrate that (1) incorporating retrieved evidence is beneficial and (2) generating and validating atomic assumptions yields more improvements and provides an interpretable answer by specifying the false assumptions.
[11] ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following
Seungmin Han,Haeun Kwon,Ji-jun Park,Taeyang Yoon
Main category: cs.CL
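TL;DR: This paper proposes CoLVLM Agent, a framework that strengthens the reasoning and instruction-following abilities of LVLMs without extensive re-training. Experiments on MMDR-Bench show that it outperforms GPT-4o and Gemini 1.5 Pro on complex multi-modal dialogue tasks, validating its modular design and iterative approach.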
Details
Motivation: Current models still struggle with complex multi-modal interaction tasks that require deep reasoning, sustained contextual understanding, and multi-step instruction following, and existing benchmarks fail to capture the dynamism and complexity of real-world multi-modal interactions. Method: The paper proposes the CoLVLM Agent framework, which enhances the reasoning and instruction-following capabilities of existing LVLMs, and evaluates it on MMDR-Bench across multiple dimensions. Result: CoLVLM Agent attains an average human evaluation score of 4.03 on MMDR-Bench, clearly surpassing GPT-4o (3.92) and Gemini 1.5 Pro (3.85); it shows significant advantages in reasoning depth, instruction adherence, and error suppression, and maintains robust performance over extended dialogue turns. Conclusion: Through an iterative "memory-perception-planning-execution" cycle, and without extensive re-training of the underlying models, CoLVLM Agent proves effective on complex multi-modal interaction tasks, validating its modular design and iterative approach. Abstract: Despite significant advancements in Large Language Models (LLMs) and Large Vision-Language Models (LVLMs), current models still face substantial challenges in handling complex, multi-turn, and visually-grounded tasks that demand deep reasoning, sustained contextual understanding, entity tracking, and multi-step instruction following. Existing benchmarks often fall short in capturing the dynamism and intricacies of real-world multi-modal interactions, leading to issues such as context loss and visual hallucinations. To address these limitations, we introduce MMDR-Bench (Multi-Modal Dialogue Reasoning Benchmark), a novel dataset comprising 300 meticulously designed complex multi-turn dialogue scenarios, each averaging 5-7 turns and evaluated across six core dimensions including visual entity tracking and reasoning depth. Furthermore, we propose CoLVLM Agent (Contextual LVLM Agent), a holistic framework that enhances existing LVLMs with advanced reasoning and instruction following capabilities through an iterative "memory-perception-planning-execution" cycle, requiring no extensive re-training of the underlying models. Our extensive experiments on MMDR-Bench demonstrate that CoLVLM Agent consistently achieves superior performance, attaining an average human evaluation score of 4.03, notably surpassing state-of-the-art commercial models like GPT-4o (3.92) and Gemini 1.5 Pro (3.85). 
The framework exhibits significant advantages in reasoning depth, instruction adherence, and error suppression, and maintains robust performance over extended dialogue turns, validating the effectiveness of its modular design and iterative approach for complex multi-modal interactions.
[12] SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling
Dong Liu,Yanxuan Yu
Main category: cs.CL
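TL;DR: SemToken is a new semantic-aware tokenization framework that reduces redundant tokens and improves computational efficiency, performing well in long-context language modeling.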
Details
Motivation: Existing tokenization methods such as BPE and WordPiece rely purely on frequency statistics and ignore the underlying semantic structure of text, leading to over-tokenization and underuse of contextual coherence in long-context scenarios. Method: SemToken first extracts contextual semantic embeddings with a lightweight encoder and performs local semantic clustering to merge semantically equivalent tokens. It then allocates heterogeneous token granularity based on semantic density, tokenizing content-rich regions more finely and compressing repetitive or low-entropy regions more coarsely. Result: On long-context language modeling benchmarks such as WikiText-103 and LongBench, SemToken achieves up to a 2.4× reduction in token count and a 1.9× speedup, with negligible or no degradation in perplexity and downstream accuracy. Conclusion: Semantic structure offers a promising new direction for optimizing tokenization and computation in large language models. Abstract: Tokenization plays a critical role in language modeling, yet existing approaches such as Byte-Pair Encoding (BPE) or WordPiece operate purely on frequency statistics, ignoring the underlying semantic structure of text. This leads to over-tokenization of semantically redundant spans and underutilization of contextual coherence, particularly in long-context scenarios. In this work, we propose SemToken, a semantic-aware tokenization framework that jointly reduces token redundancy and improves computation efficiency. SemToken first extracts contextual semantic embeddings via lightweight encoders and performs local semantic clustering to merge semantically equivalent tokens. Then, it allocates heterogeneous token granularity based on semantic density, allowing finer-grained tokenization in content-rich regions and coarser compression in repetitive or low-entropy spans. SemToken can be seamlessly integrated with modern language models and attention acceleration methods. Experiments on long-context language modeling benchmarks such as WikiText-103 and LongBench show that SemToken achieves up to 2.4× reduction in token count and 1.9× speedup, with negligible or no degradation in perplexity and downstream accuracy. Our findings suggest that semantic structure offers a promising new axis for optimizing tokenization and computation in large language models.
[13] Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models
Yuanchen Zhou,Shuo Jiang,Jie Zhu,Junhui Li,Lifan Guo,Feng Chen,Chi Zhang
Main category: cs.CL
TL;DR: This paper introduces Fin-PRM, a domain-specialized model for evaluating financial reasoning in large language models, which outperforms general models in trajectory selection and improves performance in supervised learning, reinforcement learning, and test-time applications.
Details
Motivation: Existing Process Reward Models (PRMs) are primarily trained on general or STEM domains and perform poorly in domain-specific contexts like finance, where reasoning is structured, symbolic, and sensitive to factual and regulatory correctness. Method: The paper introduces Fin-PRM, a domain-specialized, trajectory-aware Process Reward Model tailored to evaluate intermediate reasoning steps in financial tasks. It integrates step-level and trajectory-level reward supervision and is applied in offline and online reward learning settings. Result: Experimental results on financial reasoning benchmarks, including CFLUE and FinQA, show that Fin-PRM consistently outperforms general-purpose PRMs and strong domain baselines in trajectory selection quality. Downstream models trained with Fin-PRM yield significant improvements, with gains of 12.9% in supervised learning, 5.2% in reinforcement learning, and 5.1% in test-time performance. Conclusion: Fin-PRM demonstrates substantial improvements in financial reasoning tasks, highlighting the value of domain-specialized reward modeling for aligning large language models with expert-level financial reasoning. Abstract: Process Reward Models (PRMs) have emerged as a promising framework for supervising intermediate reasoning in large language models (LLMs), yet existing PRMs are primarily trained on general or Science, Technology, Engineering, and Mathematics (STEM) domains and fall short in domain-specific contexts such as finance, where reasoning is more structured, symbolic, and sensitive to factual and regulatory correctness. We introduce Fin-PRM, a domain-specialized, trajectory-aware PRM tailored to evaluate intermediate reasoning steps in financial tasks. Fin-PRM integrates step-level and trajectory-level reward supervision, enabling fine-grained evaluation of reasoning traces aligned with financial logic. 
We apply Fin-PRM in both offline and online reward learning settings, supporting three key applications: (i) selecting high-quality reasoning trajectories for distillation-based supervised fine-tuning, (ii) providing dense process-level rewards for reinforcement learning, and (iii) guiding reward-informed Best-of-N inference at test time. Experimental results on financial reasoning benchmarks, including CFLUE and FinQA, demonstrate that Fin-PRM consistently outperforms general-purpose PRMs and strong domain baselines in trajectory selection quality. Downstream models trained with Fin-PRM yield substantial improvements over baselines, with gains of 12.9% in supervised learning, 5.2% in reinforcement learning, and 5.1% in test-time performance. These findings highlight the value of domain-specialized reward modeling for aligning LLMs with expert-level financial reasoning. Our project resources will be available at https://github.com/aliyun/qwen-dianjin.
[14] SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning
Huanxuan Liao,Yixing Xu,Shizhu He,Guanchen Li,Xuanwu Yin,Dong Li,Emad Barsoum,Jun Zhao,Kang Liu
Main category: cs.CL
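TL;DR: SPARK applies unstructured sparsity to the KV cache at the channel level, effectively addressing the memory and compute bottlenecks of long-context inference while preserving or improving model accuracy.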
Details
Motivation: Existing KV cache compression methods mainly compress along the temporal axis and neglect fine-grained importance variations along the channel axis, which limits the balance between efficiency and model accuracy. SPARK aims to address this by accounting for channel-level saliency variations. Method: SPARK applies unstructured sparsity by pruning the KV cache at the channel level and dynamically restoring the pruned entries during attention score computation, reducing memory usage and computational overhead. Result: For sequences of equal length, SPARK preserves or improves model accuracy while reducing KV cache storage by over 30% compared to eviction-based methods. Even at an aggressive pruning ratio of 80%, its performance degradation is below 5%. Conclusion: SPARK is a training-free, plug-and-play method that addresses the KV cache bottleneck in long-context inference through channel-level unstructured sparsity; it is compatible with existing KV compression and quantization techniques, enabling further acceleration. Abstract: Long-context inference in large language models (LLMs) is increasingly constrained by the KV cache bottleneck: memory usage grows linearly with sequence length, while attention computation scales quadratically. Existing approaches address this issue by compressing the KV cache along the temporal axis through strategies such as token eviction or merging to reduce memory and computational overhead. However, these methods often neglect fine-grained importance variations across feature dimensions (i.e., the channel axis), thereby limiting their ability to effectively balance efficiency and model accuracy. In reality, we observe that channel saliency varies dramatically across both queries and positions: certain feature channels carry near-zero information for a given query, while others spike in relevance. To address this oversight, we propose SPARK, a training-free plug-and-play method that applies unstructured sparsity by pruning KV at the channel level, while dynamically restoring the pruned entries during attention score computation. Notably, our approach is orthogonal to existing KV compression and quantization techniques, making it compatible for integration with them to achieve further acceleration. By reducing channel-level redundancy, SPARK enables processing of longer sequences within the same memory budget. For sequences of equal length, SPARK not only preserves or improves model accuracy but also reduces KV cache storage by over 30% compared to eviction-based methods. 
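To make "unstructured sparsity at the channel level" concrete, here is a minimal sketch of pruning individual channels of cached key vectors. It is not SPARK's algorithm: it uses key magnitude as a stand-in saliency score (which is not query-aware) and simply treats pruned channels as zero instead of restoring them during attention score computation. All function names are hypothetical.

```python
def prune_channels(key, keep):
    """Keep the `keep` largest-magnitude channels of one key vector.

    Returns a sparse {channel_index: value} mapping. Which channels survive
    can differ from token to token, which is what makes the sparsity
    unstructured.
    """
    ranked = sorted(range(len(key)), key=lambda c: abs(key[c]), reverse=True)
    return {c: key[c] for c in ranked[:keep]}


def sparse_score(query, sparse_key):
    # Attention logit using only the stored channels (pruned ones count as 0).
    return sum(query[c] * value for c, value in sparse_key.items())


def dense_score(query, key):
    # Reference attention logit over all channels.
    return sum(q * k for q, k in zip(query, key))


if __name__ == "__main__":
    # Toy 4-channel key and query (assumed values for illustration).
    key = [0.1, -2.0, 0.05, 1.5]
    query = [1.0, 1.0, 1.0, 1.0]
    kept = prune_channels(key, keep=2)
    print(kept, sparse_score(query, kept), dense_score(query, key))
```

The gap between the sparse and dense logits is the approximation error that a recovery step, as described in the abstract, would aim to close.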
Furthermore, even with an aggressive pruning ratio of 80%, SPARK maintains performance with less than 5% degradation compared to the baseline eviction method, demonstrating its robustness and effectiveness. Our code will be available at https://github.com/Xnhyacinth/SparK.
[15] Select to Know: An Internal-External Knowledge Self-Selection Framework for Domain-Specific Question Answering
Bolei He,Xinran He,Run Shao,Shanfu Shu,Xianwei Xue,Mingquan Cheng,Haifeng Li,Zhenhua Ling
Main category: cs.CL
TL;DR: Selct2Know (S2K) is a cost-effective framework that improves domain-specific QA performance by efficiently internalizing domain knowledge through self-selection, fine-tuning, and enhanced reasoning.
Details
Motivation: Existing methods like RAG and continued pretraining have limitations such as hallucinations, latency, high cost, and lack of cross-domain flexibility, especially due to the long-tail distribution of domain knowledge. Method: S2K employs an internal-external knowledge self-selection strategy, selective supervised fine-tuning, a structured reasoning data generation pipeline, and integrates GRPO to enhance reasoning. Result: S2K consistently outperforms existing methods on medical, legal, and financial QA benchmarks and matches the performance of domain-pretrained LLMs at a significantly lower cost. Conclusion: Selct2Know (S2K) effectively internalizes domain knowledge in a cost-efficient manner, outperforming existing methods in domain-specific QA tasks. Abstract: Large Language Models (LLMs) perform well in general QA but often struggle in domain-specific scenarios. Retrieval-Augmented Generation (RAG) introduces external knowledge but suffers from hallucinations and latency due to noisy retrievals. Continued pretraining internalizes domain knowledge but is costly and lacks cross-domain flexibility. We attribute this challenge to the long-tail distribution of domain knowledge, which leaves partial yet useful internal knowledge underutilized. We further argue that knowledge acquisition should be progressive, mirroring human learning: first understanding concepts, then applying them to complex reasoning. To address this, we propose Selct2Know (S2K), a cost-effective framework that internalizes domain knowledge through an internal-external knowledge self-selection strategy and selective supervised fine-tuning. We also introduce a structured reasoning data generation pipeline and integrate GRPO to enhance reasoning ability. 
Experiments on medical, legal, and financial QA benchmarks show that S2K consistently outperforms existing methods and matches domain-pretrained LLMs with significantly lower cost.
[16] Self-Guided Function Calling in Large Language Models via Stepwise Experience Recall
Sijia Cui,Aiyao He,Shuai Xu,Hongming Zhang,Yanna Wang,Qingyang Zhang,Yajing Wang,Bo Xu
Main category: cs.CL
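TL;DR: This paper proposes SEER, a new method that performs fine-grained, stepwise retrieval from a continually updated experience pool to improve the performance of large language models on multi-step tool-use tasks.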
Details
Motivation: Existing methods rely on manually designed task-specific demonstrations or on retrieval from a curated library, which becomes increasingly complex and inefficient as tool diversity and task difficulty grow. A more automated and efficient approach is needed. Method: The paper proposes SEER, a self-guided method that performs fine-grained, stepwise retrieval from a continually updated experience pool, replacing traditional static or manually curated libraries. Result: On the ToolQA benchmark, SEER achieves an average improvement of 6.1% on easy questions and 4.7% on hard questions. On τ-bench, it achieves accuracy gains of 7.44% and 23.38% with Qwen2.5-7B and Qwen2.5-72B, respectively. Conclusion: By retrieving successful trajectories from the experience pool step by step, SEER significantly improves LLM performance on multi-step tool-use tasks, with strong results across difficulty levels and real-world domains. Abstract: Function calling enables large language models (LLMs) to interact with external systems by leveraging tools and APIs. When faced with multi-step tool usage, LLMs still struggle with tool selection, parameter generation, and tool-chain planning. Existing methods typically rely on manually designing task-specific demonstrations, or retrieving from a curated library. These approaches demand substantial expert effort, and prompt engineering becomes increasingly complex and inefficient as tool diversity and task difficulty scale. To address these challenges, we propose a self-guided method, Stepwise Experience Recall (SEER), which performs fine-grained, stepwise retrieval from a continually updated experience pool. Instead of relying on a static or manually curated library, SEER incrementally augments the experience pool with past successful trajectories, enabling continuous expansion of the pool and improved model performance over time. Evaluated on the ToolQA benchmark, SEER achieves an average improvement of 6.1% on easy and 4.7% on hard questions. We further test SEER on τ-bench, which includes two real-world domains. Powered by Qwen2.5-7B and Qwen2.5-72B models, SEER demonstrates substantial accuracy gains of 7.44% and 23.38%, respectively.
[17] Are Checklists Really Useful for Automatic Evaluation of Generative Tasks?
Momoka Furuhashi,Kouta Nakayama,Takashi Kodama,Saku Sugawara
Main category: cs.CL
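TL;DR: This paper studies the effectiveness of checklists for automatic evaluation of generative tasks, finding that selective checklist use improves pairwise comparison while its effect on direct scoring is inconsistent, and revealing potential inconsistencies in human evaluation.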
Details
Motivation: Automatic evaluation of generative tasks is challenging because of ambiguous criteria, and the potential of automatic checklist generation remains underexplored. Method: Checklists are generated with six methods, their effectiveness is evaluated across eight model sizes, and the correlation between checklist items and human evaluations is analyzed. Result: Selective checklist use improves evaluation performance in pairwise comparison, but its effect in direct scoring is inconsistent; some checklist items with low correlation to human scores still reflect human-written criteria, suggesting that human evaluation may be inconsistent. Conclusion: The study highlights the need to clearly define objective evaluation criteria to guide both human and automatic evaluation. Abstract: Automatic evaluation of generative tasks using large language models faces challenges due to ambiguous criteria. Although automatic checklist generation is a potentially promising approach, its usefulness remains underexplored. We investigate whether checklists should be used for all questions or selectively, generate them using six methods, evaluate their effectiveness across eight model sizes, and identify checklist items that correlate with human evaluations. Through experiments on pairwise comparison and direct scoring tasks, we find that selective checklist use tends to improve evaluation performance in pairwise settings, while its benefits are less consistent in direct scoring. Our analysis also shows that even checklist items with low correlation to human scores often reflect human-written criteria, indicating potential inconsistencies in human evaluation. These findings highlight the need to more clearly define objective evaluation criteria to guide both human and automatic evaluations. Our code is available at https://github.com/momo0817/checklist-effectiveness-study
[18] VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models
Hanling Zhang,Yayu Zhou,Tongcheng Fang,Zhihang Yuan,Guohao Dai,Yu Wang
Main category: cs.CL
TL;DR: VocabTailor is a dynamic vocabulary selection framework that reduces memory usage in Small Language Models without significantly affecting performance.
Details
Motivation: The memory footprint of vocabulary-related components, such as embeddings and LM heads, is a major bottleneck for deploying Small Language Models (SLMs) on edge devices. Existing static pruning methods are rigid and cause information loss. Method: VocabTailor uses a decoupled dynamic vocabulary selection framework, implementing offloading embeddings and a hybrid static-dynamic vocabulary selection strategy for the language modeling (LM) head. Result: VocabTailor achieves up to a 99% reduction in memory usage for vocabulary-related components while maintaining task performance across diverse downstream tasks. Conclusion: VocabTailor significantly reduces memory usage of vocabulary-related components in Small Language Models (SLMs) with minimal or no degradation in task performance, outperforming existing static vocabulary pruning methods. Abstract: Small Language Models (SLMs) provide computational advantages in resource-constrained environments, yet memory limitations remain a critical bottleneck for edge device deployment. A substantial portion of SLMs' memory footprint stems from vocabulary-related components, particularly embeddings and language modeling (LM) heads, due to large vocabulary sizes. Existing static vocabulary pruning, while reducing memory usage, suffers from rigid, one-size-fits-all designs that cause information loss from the prefill stage and a lack of flexibility. In this work, we identify two key principles underlying the vocabulary reduction challenge: the lexical locality principle, the observation that only a small subset of tokens is required during any single inference, and the asymmetry in computational characteristics between vocabulary-related components of SLM. 
Based on these insights, we introduce VocabTailor, a novel decoupled dynamic vocabulary selection framework that addresses memory constraints through offloading embedding and implements a hybrid static-dynamic vocabulary selection strategy for LM Head, enabling on-demand loading of vocabulary components. Comprehensive experiments across diverse downstream tasks demonstrate that VocabTailor achieves a reduction of up to 99% in the memory usage of vocabulary-related components with minimal or no degradation in task performance, substantially outperforming existing static vocabulary pruning.
[19] WangchanThaiInstruct: An instruction-following Dataset for Culture-Aware, Multitask, and Multi-domain Evaluation in Thai
Peerat Limkonchotiwat,Pume Tuchinda,Lalita Lowphansirikul,Surapon Nonesung,Panuthep Tasawong,Alham Fikri Aji,Can Udomcharoenchaikit,Sarana Nutanong
Main category: cs.CL
TL;DR: WangchanThaiInstruct is a new, human-authored dataset for Thai language instruction tuning and evaluation. It demonstrates that models trained on native, culturally grounded data perform better than those using translated data, emphasizing the importance of culturally relevant training in low-resource languages.
Details
Motivation: Large language models perform well in English but underperform in low-resource languages like Thai. Existing benchmarks, often based on translations, fail to capture cultural and domain-specific nuances necessary for real-world applications. Method: The researchers created WangchanThaiInstruct, a human-authored Thai dataset for evaluation and instruction tuning. It was developed through a multi-stage quality control process involving annotators, domain experts, and AI researchers. Result: Models fine-tuned on WangchanThaiInstruct outperformed those using translated data on both in-domain and out-of-domain benchmarks. Zero-shot evaluations revealed performance gaps in culturally and professionally specific tasks. Conclusion: The study highlights the importance of culturally and professionally grounded instruction data in improving the performance of large language models (LLMs) in low-resource languages like Thai. Abstract: Large language models excel at instruction-following in English, but their performance in low-resource languages like Thai remains underexplored. Existing benchmarks often rely on translations, missing cultural and domain-specific nuances needed for real-world use. We present WangchanThaiInstruct, a human-authored Thai dataset for evaluation and instruction tuning, covering four professional domains and seven task types. Created through a multi-stage quality control process with annotators, domain experts, and AI researchers, WangchanThaiInstruct supports two studies: (1) a zero-shot evaluation showing performance gaps on culturally and professionally specific tasks, and (2) an instruction tuning study with ablations isolating the effect of native supervision. Models fine-tuned on WangchanThaiInstruct outperform those using translated data in both in-domain and out-of-domain benchmarks. 
These findings underscore the need for culturally and professionally grounded instruction data to improve LLM alignment in low-resource, linguistically diverse settings.
[20] UniCoM: A Universal Code-Switching Speech Generator
Sangmin Lee,Woojin Chung,Seyun Um,Hong-Goo Kang
Main category: cs.CL
TL;DR: The paper introduces UniCoM, a new method for generating natural code-switching speech data, resulting in the CS-FLEURS dataset which improves multilingual speech technology.
Details
Motivation: Code-switching is common in multilingual conversations but poses challenges for speech technology due to the scarcity of suitable datasets. Method: UniCoM uses the SWORDS algorithm to replace words with their translations while preserving parts of speech and sentence semantics. Using this method, the CS-FLEURS corpus was constructed for ASR and S2TT tasks. Result: CS-FLEURS demonstrates high intelligibility and naturalness, performing comparably to existing datasets on objective and subjective metrics. Conclusion: The proposed UniCoM approach effectively generates high-quality, natural CS samples, advancing CS speech technology and enabling more inclusive multilingual systems. Abstract: Code-switching (CS), the alternation between two or more languages within a single speaker's utterances, is common in real-world conversations and poses significant challenges for multilingual speech technology. However, systems capable of handling this phenomenon remain underexplored, primarily due to the scarcity of suitable datasets. To resolve this issue, we propose Universal Code-Mixer (UniCoM), a novel pipeline for generating high-quality, natural CS samples without altering sentence semantics. Our approach utilizes an algorithm we call Substituting WORDs with Synonyms (SWORDS), which generates CS speech by replacing selected words with their translations while considering their parts of speech. Using UniCoM, we construct Code-Switching FLEURS (CS-FLEURS), a multilingual CS corpus designed for automatic speech recognition (ASR) and speech-to-text translation (S2TT). Experimental results show that CS-FLEURS achieves high intelligibility and naturalness, performing comparably to existing datasets on both objective and subjective metrics. We expect our approach to advance CS speech technology and enable more inclusive multilingual systems.
[21] EMNLP: Educator-role Moral and Normative Large Language Models Profiling
Yilin Jiang,Mingzi Zhang,Sheng Jin,Zengyi Yu,Xiangjie Kong,Binghao Tu
Main category: cs.CL
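TL;DR: This paper proposes the EMNLP framework for assessing the psychological and ethical characteristics of large language models in teacher roles, finding that models excel at abstract moral reasoning but fall short in emotionally complex situations, and revealing a paradox between model capability and safety.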
Details
Motivation: Although Simulating Professions (SP) enables LLMs to emulate professional roles, comprehensive psychological and ethical evaluation in these contexts is still lacking, so an evaluation framework specifically targeting LLMs in education is needed. Method: The paper extends existing scales by constructing 88 teacher-specific moral dilemmas, uses a soft prompt injection set to evaluate model compliance and vulnerability, and runs experiments on 12 LLMs. Result: Experiments show that teacher-role LLMs exhibit more idealized and polarized personalities than human teachers and outperform humans in abstract moral reasoning, but struggle with emotionally complex situations; models with stronger reasoning are more vulnerable to harmful prompt injection, and hyperparameters such as model temperature have limited influence. Conclusion: The paper proposes the EMNLP framework for evaluating the moral and normative characteristics of large language models (LLMs) in teacher roles, revealing their strength in abstract moral reasoning and their weakness in emotionally complex situations, and pointing to a paradox between model capability and safety. Abstract: Simulating Professions (SP) enables Large Language Models (LLMs) to emulate professional roles. However, comprehensive psychological and ethical evaluation in these contexts remains lacking. This paper introduces EMNLP, an Educator-role Moral and Normative LLMs Profiling framework for personality profiling, moral development stage measurement, and ethical risk under soft prompt injection. EMNLP extends existing scales and constructs 88 teacher-specific moral dilemmas, enabling profession-oriented comparison with human teachers. A targeted soft prompt injection set evaluates compliance and vulnerability in teacher SP. Experiments on 12 LLMs show teacher-role LLMs exhibit more idealized and polarized personalities than human teachers, excel in abstract moral reasoning, but struggle with emotionally complex situations. Models with stronger reasoning are more vulnerable to harmful prompt injection, revealing a paradox between capability and safety. The model temperature and other hyperparameters have limited influence except in some risk behaviors. This paper presents the first benchmark to assess ethical and psychological alignment of teacher-role LLMs for educational AI. Resources are available at https://e-m-n-l-p.github.io/.
[22] Conflict-Aware Soft Prompting for Retrieval-Augmented Generation
Eunseong Choi,June Park,Hyeri Lee,Jongwuk Lee
Main category: cs.CL
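TL;DR: CARE uses a context assessor and a base LLM to resolve context-memory conflicts in retrieval-augmented generation, improving performance on question answering and fact-checking tasks.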
Details
Motivation: To resolve conflicts between external context and the model's parametric knowledge in retrieval-augmented generation, improving model accuracy and reliability. Method: The paper introduces CARE, which consists of a context assessor and a base LLM; the context assessor encodes compact memory token embeddings and is trained through adversarial soft prompting to identify unreliable context. Result: Experiments show that CARE effectively mitigates context-memory conflicts, improving average performance by 5.0% on QA and fact-checking benchmarks. Conclusion: CARE offers a promising direction for building trustworthy and adaptive RAG systems. Abstract: Retrieval-augmented generation (RAG) enhances the capabilities of large language models (LLMs) by incorporating external knowledge into their input prompts. However, when the retrieved context contradicts the LLM's parametric knowledge, it often fails to resolve the conflict between incorrect external context and correct parametric knowledge, known as context-memory conflict. To tackle this problem, we introduce Conflict-Aware REtrieval-Augmented Generation (CARE), consisting of a context assessor and a base LLM. The context assessor encodes compact memory token embeddings from raw context tokens. Through grounded/adversarial soft prompting, the context assessor is trained to discern unreliable context and capture a guidance signal that directs reasoning toward the more reliable knowledge source. Extensive experiments show that CARE effectively mitigates context-memory conflicts, leading to an average performance gain of 5.0% on QA and fact-checking benchmarks, establishing a promising direction for trustworthy and adaptive RAG systems.
[23] TComQA: Extracting Temporal Commonsense from Text
Lekshmi R Nair,Arun Sankar,Koninika Pal
Main category: cs.CL
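TL;DR: This study proposes a method that uses large language models to automatically extract temporal commonsense and builds TComQA, a high-quality temporal commonsense question answering dataset that effectively improves performance on related tasks.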
Details
Motivation: Understanding events requires grasping their temporal context, yet existing LLMs struggle to generate text that requires temporal commonsense reasoning, so temporal commonsense needs to be mined automatically to build more robust language models. Method: The paper proposes an LLM-based temporal commonsense extraction pipeline, constructs the TComQA dataset from the SAMSum and RealNews corpora, and validates it through crowdsourcing. Result: TComQA achieves over 80% precision in temporal commonsense extraction, and a model trained with TComQA outperforms an LLM fine-tuned on an existing dataset on temporal question answering. Conclusion: Automatically mining temporal commonsense with LLMs to build TComQA improves the precision of temporal commonsense extraction and outperforms an LLM fine-tuned on an existing dataset on temporal question answering. Abstract: Understanding events necessitates grasping their temporal context, which is often not explicitly stated in natural language. For example, it is not a trivial task for a machine to infer that a museum tour may last for a few hours, but cannot take months. Recent studies indicate that even advanced large language models (LLMs) struggle in generating text that requires reasoning with temporal commonsense due to its infrequent explicit mention in text. Therefore, automatically mining temporal commonsense for events enables the creation of robust language models. In this work, we investigate the capacity of LLMs to extract temporal commonsense from text and evaluate multiple experimental setups to assess their effectiveness. Here, we propose a temporal commonsense extraction pipeline that leverages LLMs to automatically mine temporal commonsense and use it to construct TComQA, a dataset derived from SAMSum and RealNews corpora. TComQA has been validated through crowdsourcing and achieves over 80% precision in extracting temporal commonsense. The model trained with TComQA also outperforms an LLM fine-tuned on an existing temporal question answering dataset.
[24] CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing
Abdul Rehman,Jian-Jun Zhang,Xiaosong Yang
Main category: cs.CL
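TL;DR: This paper proposes CUPE, a lightweight model that captures phoneme features within a short time window, enabling cross-lingual speech processing.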
Details
Motivation: Many speech processing tasks require pure phoneme representations free from contextual influence, which motivated the development of CUPE. Method: CUPE processes short, fixed-width speech windows independently and is evaluated on diverse languages through supervised and self-supervised training. Result: Despite having fewer parameters, CUPE performs strongly on cross-lingual tasks, demonstrating the effectiveness of modeling basic acoustic patterns within phoneme-length windows. Conclusion: CUPE is a lightweight model that achieves effective cross-lingual speech processing by modeling basic acoustic patterns within phoneme-length windows. Abstract: Universal phoneme recognition typically requires analyzing long speech segments and language-specific patterns. Many speech processing tasks require pure phoneme representations free from contextual influence, which motivated our development of CUPE - a lightweight model that captures key phoneme features in just 120 milliseconds, about one phoneme's length. CUPE processes short, fixed-width windows independently and, despite fewer parameters than current approaches, achieves competitive cross-lingual performance by learning fundamental acoustic patterns common to all languages. Our extensive evaluation through supervised and self-supervised training on diverse languages, including zero-shot tests on the UCLA Phonetic Corpus, demonstrates strong cross-lingual generalization and reveals that effective universal speech processing is possible through modeling basic acoustic patterns within phoneme-length windows.
[25] KG-EDAS: A Meta-Metric Framework for Evaluating Knowledge Graph Completion Models
Haji Gul,Abul Ghani Naim,Ajaz Ahmad Bhat
Main category: cs.CL
TL;DR: This paper proposes EDAS, a new meta-metric for evaluating Knowledge Graph Completion models by synthesizing performance across multiple datasets and metrics into a single score, providing a more reliable and interpretable evaluation framework.
Details
Motivation: The motivation is to address the challenge of comparing KGC models across multiple datasets and metrics, where different models may excel in different aspects, complicating model selection for downstream tasks. Method: EDAS integrates model performance across various metrics and datasets, providing a unified meta-metric that offers a global perspective on model performance. Result: Experimental results on benchmark datasets demonstrate that EDAS effectively integrates multi-metric, multi-dataset performance into a unified ranking, supporting more informed model selection and promoting fairness in cross-dataset evaluation. Conclusion: KG Evaluation based on Distance from Average Solution (EDAS) is a robust and interpretable meta-metric that synthesizes model performance across multiple datasets and evaluation criteria into a single normalized score, offering a consistent and generalizable framework for evaluating KGC models. Abstract: Knowledge Graphs (KGs) enable applications in various domains such as semantic search, recommendation systems, and natural language processing. KGs are often incomplete, missing entities and relations, an issue addressed by Knowledge Graph Completion (KGC) methods that predict missing elements. Different evaluation metrics, such as Mean Reciprocal Rank (MRR), Mean Rank (MR), and Hit@k, are commonly used to assess the performance of such KGC models. A major challenge in evaluating KGC models, however, lies in comparing their performance across multiple datasets and metrics. A model may outperform others on one dataset but underperform on another, making it difficult to determine overall superiority. Moreover, even within a single dataset, different metrics such as MRR and Hit@1 can yield conflicting rankings, where one model excels in MRR while another performs better in Hit@1, further complicating model selection for downstream tasks. 
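The conflict between metrics described above can be made concrete with a small sketch. `mrr` and `hits_at_k` follow the standard definitions of Mean Reciprocal Rank and Hit@k; the rank lists for the two models are made-up values chosen for illustration and are not results from the paper.

```python
from typing import List


def mrr(ranks: List[int]) -> float:
    """Mean Reciprocal Rank: the average of 1/rank of the correct entity."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: List[int], k: int) -> float:
    """Fraction of test triples whose correct entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the correct entity on four test triples.
model_a = [1, 100, 100, 100]  # occasionally exactly right, otherwise far off
model_b = [2, 2, 2, 2]        # never first, but always close

for name, ranks in (("A", model_a), ("B", model_b)):
    print(name, "MRR:", mrr(ranks), "Hit@1:", hits_at_k(ranks, 1))
```

Here model A wins on Hit@1 while model B wins on MRR, so the two metrics rank the models in opposite orders even on a single dataset.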
These inconsistencies hinder holistic comparisons and highlight the need for a unified meta-metric that integrates performance across all metrics and datasets to enable a more reliable and interpretable evaluation framework. To address this need, we propose KG Evaluation based on Distance from Average Solution (EDAS), a robust and interpretable meta-metric that synthesizes model performance across multiple datasets and diverse evaluation criteria into a single normalized score ($M_i \in [0,1]$). Unlike traditional metrics that focus on isolated aspects of performance, EDAS offers a global perspective that supports more informed model selection and promotes fairness in cross-dataset evaluation. Experimental results on benchmark datasets such as FB15k-237 and WN18RR demonstrate that EDAS effectively integrates multi-metric, multi-dataset performance into a unified ranking, offering a consistent, robust, and generalizable framework for evaluating KGC models.
[26] A Survey on Large Language Model Benchmarks
Shiwen Ni,Guhong Chen,Shuaimin Li,Xuanang Chen,Siyi Li,Bingli Wang,Qiyao Wang,Xingjian Wang,Yifan Zhang,Liyang Fan,Chengming Li,Ruifeng Xu,Le Sun,Min Yang
Main category: cs.CL
TL;DR: This paper reviews the current benchmarks for large language models, identifies their shortcomings, and proposes a design paradigm for future improvements.
Details
Motivation: The motivation of the paper is to address the rapid development of large language models and the need for effective evaluation benchmarks to guide future model development and innovation. Method: The paper systematically reviews and categorizes 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific. Result: The paper identifies problems in current benchmarks, such as data contamination, cultural and linguistic biases, and a lack of evaluation on process credibility and dynamic environments. Conclusion: The paper concludes that while current benchmarks for large language models have significant issues, they provide a foundation for future innovations in model evaluation. Abstract: In recent years, with the rapid development of the depth and breadth of large language models' capabilities, various corresponding evaluation benchmarks have been emerging in increasing numbers. As a quantitative assessment tool for model performance, benchmarks are not only a core means to measure model capabilities but also a key element in guiding the direction of model development and promoting technological innovation. We systematically review the current status and development of large language model benchmarks for the first time, categorizing 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific. General capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning; domain-specific benchmarks focus on fields like natural sciences, humanities and social sciences, and engineering technology; target-specific benchmarks pay attention to risks, reliability, agents, etc. 
We point out that current benchmarks have problems such as inflated scores caused by data contamination, unfair evaluation due to cultural and linguistic biases, and lack of evaluation on process credibility and dynamic environments, and provide a referable design paradigm for future benchmark innovation.
[27] Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation
Yichi Zhang,Yao Huang,Yifan Wang,Yitong Sun,Chang Liu,Zhe Zhao,Zhengwei Fang,Huanran Chen,Xiao Yang,Xingxing Wei,Hang Su,Yinpeng Dong,Jun Zhu
Main category: cs.CL
TL;DR: The paper introduces MultiTrust-X, a comprehensive benchmark for evaluating and mitigating trustworthiness issues in MLLMs, revealing vulnerabilities in current models and proposing a new method (RESA) to enhance safety and performance.
Details
Motivation: The motivation is to address the trustworthiness concerns of Multimodal Large Language Models (MLLMs), which are often overlooked in existing evaluation and mitigation approaches that focus on narrow aspects. Method: The authors propose MultiTrust-X, a comprehensive benchmark with a three-dimensional framework for evaluating trustworthiness across five aspects (truthfulness, robustness, safety, fairness, privacy), two novel risk types (multimodal risks and cross-modal impacts), and various mitigation strategies. They conduct extensive experiments on 30+ MLLMs and analyze 8 mitigation methods. Result: The experiments reveal significant vulnerabilities in current MLLMs, including a gap between trustworthiness and general capabilities, amplification of potential risks in base LLMs, and limitations in existing mitigation strategies. The proposed RESA approach achieves state-of-the-art results in balancing safety and performance. Conclusion: The paper concludes that while there are significant vulnerabilities in current MLLMs, the proposed MultiTrust-X benchmark provides a comprehensive framework for evaluating and mitigating trustworthiness issues. The RESA approach improves model safety and performance. Abstract: The trustworthiness of Multimodal Large Language Models (MLLMs) remains an intense concern despite the significant progress in their capabilities. Existing evaluation and mitigation approaches often focus on narrow aspects and overlook risks introduced by the multimodality. To tackle these challenges, we propose MultiTrust-X, a comprehensive benchmark for evaluating, analyzing, and mitigating the trustworthiness issues of MLLMs. 
We define a three-dimensional framework, encompassing five trustworthiness aspects which include truthfulness, robustness, safety, fairness, and privacy; two novel risk types covering multimodal risks and cross-modal impacts; and various mitigation strategies from the perspectives of data, model architecture, training, and inference algorithms. Based on the taxonomy, MultiTrust-X includes 32 tasks and 28 curated datasets, enabling holistic evaluations over 30 open-source and proprietary MLLMs and in-depth analysis with 8 representative mitigation methods. Our extensive experiments reveal significant vulnerabilities in current models, including a gap between trustworthiness and general capabilities, as well as the amplification of potential risks in base LLMs by both multimodal training and inference. Moreover, our controlled analysis uncovers key limitations in existing mitigation strategies: while some methods yield improvements in specific aspects, few effectively address overall trustworthiness, and many introduce unexpected trade-offs that compromise model utility. These findings also provide practical insights for future improvements, such as the benefits of reasoning to better balance safety and performance. Based on these insights, we introduce a Reasoning-Enhanced Safety Alignment (RESA) approach that equips the model with chain-of-thought reasoning ability to discover the underlying risks, achieving state-of-the-art results.
[28] Confidence-Modulated Speculative Decoding for Large Language Models
Jaydip Sen,Subhasis Dasgupta,Hetvi Waghela
Main category: cs.CL
TL;DR: This paper introduces a confidence-modulated speculative decoding framework that adaptively adjusts token generation and verification based on model uncertainty, improving decoding efficiency and quality.
Details
Motivation: Existing speculative decoding methods use static drafting lengths and rigid verification criteria, which limit their adaptability. This work addresses these limitations through an information-theoretic, adaptive framework. Method: The method uses entropy and margin-based uncertainty measures from the drafter's output distribution to dynamically adjust the number of speculatively generated tokens and modulate the verification process. Result: Experiments show significant speed improvements over standard speculative decoding while maintaining or improving BLEU and ROUGE scores on machine translation and summarization tasks. Conclusion: The proposed confidence-modulated speculative decoding method enhances decoding efficiency and robustness in large language models by dynamically adjusting drafting length and verification criteria based on uncertainty measures. Abstract: Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid verification criteria, limiting their adaptability across varying model uncertainties and input complexities. This paper proposes an information-theoretic framework for speculative decoding based on confidence-modulated drafting. By leveraging entropy and margin-based uncertainty measures over the drafter's output distribution, the proposed method dynamically adjusts the number of speculatively generated tokens at each iteration. This adaptive mechanism reduces rollback frequency, improves resource utilization, and maintains output fidelity. Additionally, the verification process is modulated using the same confidence signals, enabling more flexible acceptance of drafted tokens without sacrificing generation quality. 
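One way to picture the confidence-modulated drafting described above is the sketch below. Entropy and the top-two margin are standard uncertainty quantities, but the rule that maps them to a draft length (the equal weighting and the `min_len` and `max_len` bounds) is a hypothetical choice made for illustration, not the paper's formula.

```python
import math
from typing import List


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of the drafter's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def margin(probs: List[float]) -> float:
    """Gap between the two most likely tokens; a large gap signals confidence."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def draft_length(probs: List[float], min_len: int = 1, max_len: int = 8) -> int:
    """Map a confidence score in [0, 1] to a number of tokens to draft.

    The 50/50 blend of normalized entropy and margin is an assumed rule.
    """
    max_entropy = math.log(len(probs))  # entropy of the uniform distribution
    confidence = 0.5 * (1.0 - entropy(probs) / max_entropy) + 0.5 * margin(probs)
    return min_len + round(confidence * (max_len - min_len))


confident = [0.97, 0.01, 0.01, 0.01]  # peaked distribution: draft many tokens
uncertain = [0.25, 0.25, 0.25, 0.25]  # flat distribution: draft few tokens

print(draft_length(confident), draft_length(uncertain))
```

Drafting fewer tokens when the drafter is uncertain is what would reduce rollbacks, since low-confidence drafts are the ones most likely to be rejected at verification.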
Experiments on machine translation and summarization tasks demonstrate significant speedups over standard speculative decoding while preserving or improving BLEU and ROUGE scores. The proposed approach offers a principled, plug-in method for efficient and robust decoding in large language models under varying conditions of uncertainty.
[29] Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training
Woojin Chung,Jeonghoon Kim
Main category: cs.CL
TL;DR: This paper explores the effects of increasing vocabulary size in language models. It finds that larger vocabularies help by reducing the complexity of tokenized text, which primarily benefits the model through reduced uncertainty on frequent words. The study provides insights into the dynamics of vocabulary size, token-frequency imbalance, and model performance, suggesting that lowering text complexity is key to enhancing language models.
Details
Motivation: The motivation for this study stems from the observation that recent practices in training language models favor ever-larger vocabularies, yet the source of the benefit from these larger vocabularies is not clear. The research aims to understand the underlying reasons for this phenomenon and how it affects model performance. Method: A controlled study was conducted to scale the vocabulary of a language model from 24K to 196K while keeping data, compute, and optimization constant. The complexity of tokenized text was quantified using Kolmogorov complexity. A word-level loss decomposition was also performed to understand where the gains from larger vocabularies were realized. Additionally, the effect of constraining input and output embedding norms was examined. Result: The study found that larger vocabularies reduce the complexity of tokenized text, which in turn benefits the language model. It was shown that increasing the vocabulary size mainly reduces uncertainty on the most frequent words, even though the loss on rare words increases. Constraining the embedding norms to mitigate token-frequency imbalance reversed the gains, indicating that the model benefits from the imbalance. Expanding model parameters with a fixed vocabulary also resulted in similar benefits. Conclusion: The study concludes that the benefit of larger vocabularies in language models is not due to their size per se, but rather due to the reduction in complexity of the tokenized text they bring. This insight reframes the understanding of why larger vocabularies are beneficial, suggesting that the lowering of text complexity is what enhances model performance. Abstract: Large language models are trained with tokenizers, and the resulting token distribution is highly imbalanced: a few words dominate the stream while most occur rarely. Recent practice favors ever-larger vocabularies, but the source of the benefit is unclear. 
We conduct a controlled study that scales the language model's vocabulary from 24K to 196K while holding data, compute, and optimization fixed. We first quantify the complexity of tokenized text, formalized via Kolmogorov complexity, and show that larger vocabularies reduce this complexity. Above 24K, every common word is already a single token, so further growth mainly deepens the relative token-frequency imbalance. A word-level loss decomposition shows that larger vocabularies reduce cross-entropy almost exclusively by lowering uncertainty on the 2,500 most frequent words, even though loss on the rare tail rises. Constraining input and output embedding norms to attenuate the effect of token-frequency imbalance reverses the gain, directly showing that the model exploits rather than suffers from imbalance. Because the same frequent words cover roughly 77% of tokens in downstream benchmarks, this training advantage transfers intact. We also show that enlarging model parameters with a fixed vocabulary yields the same frequent-word benefit. Our results reframe "bigger vocabularies help" as "lowering the complexity of tokenized text helps," providing a simple, principled lever for tokenizer-model co-design and clarifying the loss dynamics that govern language-model scaling in pre-training.
[30] Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with Large Language Models
Tobias Schreieder,Tim Schopf,Michael Färber
Main category: cs.CL
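TL;DR: This paper systematically analyzes evidence-based text generation with large language models, proposes a unified taxonomy, and surveys 300 evaluation metrics to address the field's inconsistent terminology and lack of unified benchmarks.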
Details
Motivation: The growing adoption of large language models has raised concerns about their reliability and trustworthiness, which has motivated research on evidence-based text generation. However, the field is fragmented due to inconsistent terminology, isolated evaluation practices, and a lack of unified benchmarks. Method: The authors systematically analyze 134 papers and investigate 300 evaluation metrics across seven key dimensions. Result: The paper introduces a unified taxonomy of evidence-based text generation with LLMs, surveys evaluation metrics, and analyzes the field's representative methods and distinctive characteristics. Conclusion: The paper summarizes the state of research on evidence-based text generation with LLMs, proposes a unified taxonomy, and identifies promising directions for future work. Abstract: The increasing adoption of large language models (LLMs) has been accompanied by growing concerns regarding their reliability and trustworthiness. As a result, a growing body of research focuses on evidence-based text generation with LLMs, aiming to link model outputs to supporting evidence to ensure traceability and verifiability. However, the field is fragmented due to inconsistent terminology, isolated evaluation practices, and a lack of unified benchmarks. To bridge this gap, we systematically analyze 134 papers, introduce a unified taxonomy of evidence-based text generation with LLMs, and investigate 300 evaluation metrics across seven key dimensions. Thereby, we focus on approaches that use citations, attribution, or quotations for evidence-based text generation. Building on this, we examine the distinctive characteristics and representative methods in the field. Finally, we highlight open challenges and outline promising directions for future work.
[31] When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models
Cheng Wang,Gelei Deng,Xianglin Yang,Han Qiu,Tianwei Zhang
Main category: cs.CL
TL;DR: This paper introduces MCR-BENCH, a benchmark for evaluating how Large Audio-Language Models (LALMs) handle conflicting audio-text information. It finds that LALMs show a strong bias toward text, often ignoring audio evidence, which degrades performance in audio-centric tasks. The study explores mitigation strategies and emphasizes the need for better modality balance and fusion mechanisms in training LALMs.
Details
Motivation: The motivation behind this study is to examine the largely unexplored area of how Large Audio-Language Models (LALMs) handle conflicting information between audio and text modalities. Understanding this can improve the reliability of such models in real-world applications. Method: The researchers introduced MCR-BENCH, a comprehensive benchmark for evaluating how LALMs prioritize information when faced with inconsistent audio-text pairs. They conducted extensive evaluations across various audio understanding tasks, investigated factors influencing text bias, explored mitigation strategies through supervised finetuning, and analyzed model confidence patterns. Result: The evaluation revealed that LALMs display a significant bias toward textual input when inconsistencies exist between modalities, often disregarding audio evidence. This leads to performance degradation in audio-centric tasks and raises reliability concerns. The study also found that models often remain overconfident even when inputs are contradictory. Conclusion: The study concludes that LALMs exhibit a significant bias toward textual input when inconsistencies exist between audio and text modalities, which affects performance in audio-centric tasks. This highlights the need for improved modality balance and more sophisticated fusion mechanisms in training LALMs. Abstract: Large Audio-Language Models (LALMs) are enhanced with audio perception capabilities, enabling them to effectively process and understand multimodal inputs that combine audio and text. However, their performance in handling conflicting information between audio and text modalities remains largely unexamined. This paper introduces MCR-BENCH, the first comprehensive benchmark specifically designed to evaluate how LALMs prioritize information when presented with inconsistent audio-text pairs. 
Through extensive evaluation across diverse audio understanding tasks, we reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input, frequently disregarding audio evidence. This tendency leads to substantial performance degradation in audio-centric tasks and raises important reliability concerns for real-world applications. We further investigate the factors influencing text bias, explore mitigation strategies through supervised finetuning, and analyze model confidence patterns that reveal persistent overconfidence even with contradictory inputs. These findings underscore the need for improved modality balance during training and more sophisticated fusion mechanisms to enhance robustness when handling conflicting multi-modal inputs. The project is available at https://github.com/WangCheng0116/MCR-BENCH.
[32] LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model
Yirong Sun,Yizhong Geng,Peidong Wei,Yanjun Chen,Jinghan Yang,Rongfei Chen,Wei Zhang,Xiaoyu Shen
Main category: cs.CL
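TL;DR: LLaSO provides a fully open, end-to-end framework, including datasets, a benchmark, and a model, intended to advance research and standardization in large speech-language models.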
Details
Motivation: Progress on Large Speech-Language Models (LSLMs) has been hindered by fragmented architectures and a lack of transparency; model weights are often released without the corresponding training data and configurations, making research hard to compare systematically and to reproduce. Method: The authors build and release LLaSO-Base, a 3.8B-parameter reference model trained on public data, and evaluate it in a standardized way with LLaSO-Eval. Result: LLaSO-Base achieves a normalized score of 0.72 in standardized evaluation, surpassing comparable models, but significant generalization gaps remain on unseen tasks. Conclusion: LLaSO establishes a fully open standard to unify research efforts and accelerate community-driven progress in LSLMs. Abstract: The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, hindering the systematic comparison and reproducibility of research. Unlike in the vision-language domain, the LSLM field suffers from the common practice of releasing model weights without their corresponding training data and configurations. To address these critical gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate our framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis reveals that while broader training coverage enhances performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, dataset, pretrained models, and results in https://github.com/EIT-NLP/LLaSO.
[33] A Study of Privacy-preserving Language Modeling Approaches
Pritilata Saha,Abhirup Sinha
Main category: cs.CL
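TL;DR: This paper surveys research on privacy-preserving language models, analyzes the strengths and limitations of existing approaches, and outlines directions for future research.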
Details
Motivation: Language models are used in a growing range of applications and domains, but they can leak sensitive information under privacy attacks, raising concerns about the protection of privacy rights. Studying how to mitigate these privacy risks has therefore become crucial. Method: An in-depth survey and analysis of privacy-preserving language modeling approaches, including their strengths and limitations. Result: The study provides an in-depth overview of privacy-preserving language modeling approaches, discusses their strengths and limitations, and contributes to ongoing research. Conclusion: Research on privacy-preserving language models is still developing; by comprehensively studying existing privacy-preserving approaches, the paper offers valuable insights and directions for future research. Abstract: Recent developments in language modeling have increased their use in various applications and domains. Language models, often trained on sensitive data, can memorize and disclose this information during privacy attacks, raising concerns about protecting individuals' privacy rights. Preserving privacy in language models has become a crucial area of research, as privacy is one of the fundamental human rights. Despite its significance, understanding of how much privacy risk these language models possess and how it can be mitigated is still limited. This research addresses this by providing a comprehensive study of the privacy-preserving language modeling approaches. This study gives an in-depth overview of these approaches, highlights their strengths, and investigates their limitations. The outcomes of this study contribute to the ongoing research on privacy-preserving language modeling, providing valuable insights and outlining future research directions.
[34] M-HELP: Using Social Media Data to Detect Mental Health Help-Seeking Signals
MSVPJ Sathvik,Zuhair Hasan Shaik,Vivek Gupta
Main category: cs.CL
TL;DR: This paper introduces M-Help, a new dataset for detecting help-seeking behavior on social media, along with mental health disorders and their causes.
Details
Motivation: There is a critical gap in identifying individuals who are actively seeking help, so a purpose-built dataset is needed to fill it. Method: The authors develop M-Help, a new dataset focused on help-seeking behavior on social media, with detailed categorization of mental health disorders and their causes. Result: The M-Help dataset was created and shown to be effective for identifying help-seekers, diagnosing mental health conditions, and uncovering root causes. Conclusion: M-Help can be used to train AI models to better identify and understand mental health issues and help-seeking behavior, enabling more effective support. Abstract: Mental health disorders are a global crisis. While various datasets exist for detecting such disorders, there remains a critical gap in identifying individuals actively seeking help. This paper introduces a novel dataset, M-Help, specifically designed to detect help-seeking behavior on social media. The dataset goes beyond traditional labels by identifying not only help-seeking activity but also specific mental health disorders and their underlying causes, such as relationship challenges or financial stressors. AI models trained on M-Help can address three key tasks: identifying help-seekers, diagnosing mental health conditions, and uncovering the root causes of issues.
[35] Principle Methods of Rendering Non-equivalent Words from Uzbek and Dari to Russian and English
Mohammad Ibrahim Qani
Main category: cs.CL
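TL;DR: This study examines how to professionally render non-equivalent words from a source language into a target language, summarizes translation methods and rules, and illustrates them with 25 words from Dari and Uzbek rendered into English and Russian.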
Details
Motivation: Non-equivalent words (such as terms for food, clothing, culture, and traditions) easily cause misunderstanding in translation because they may have no direct counterpart in the target language, so how to render them professionally needs study. Method: Library-based research. Result: The study summarizes methods and rules for rendering non-equivalent words from the source language into the target language, and renders 25 non-equivalent words from Dari and Uzbek into English and Russian. Conclusion: The study identifies different methods and rules for professionally rendering non-equivalent words from a source language into a target language, and demonstrates them by rendering 25 non-equivalent words from Dari and Uzbek into English and Russian. Abstract: These pure languages understanding directly relates to translation knowledge where linguists and translators need to work and research to eradicate misunderstanding. Misunderstandings mostly appear in non-equivalent words because there are different local and internal words like food, garment, cultural and traditional words and others in every notion. Truly, most of these words do not have an equivalent in the target language and these words need to be worked and find their equivalent in the target language to fully understand both languages. The purpose of this research is to introduce the methods of rendering non-equivalent words professionally from the source language to the target language and this research has been completed using library-based research. However, some of these non-equivalent words are already professionally rendered to the target language but there are still many other words to be rendered. As a result, this research paper includes different ways and rules of rendering non-equivalent words from source language to the target language and 25 non-equivalent words have been rendered from Dari and Uzbek into English and Russian languages.
[36] PyTOD: Programmable Task-Oriented Dialogue with Execution Feedback
Alexandru Coca,Bo-Hsiang Tseng,Pete Boothroyd,Jianpeng Cheng,Mark Gaynor,Zhenxing Zhang,Joe Stacey,Tristan Guigue,Héctor Martinez Alonso,Diarmuid Ó Séaghdha,Anders Johannsen
Main category: cs.CL
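TL;DR: PyTOD is a task-oriented dialogue agent that tracks dialogue state by generating executable code and corrects errors using policy and execution feedback; experiments show it outperforms existing methods in accuracy and robustness.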
Details
Motivation: The effectiveness of programmable task-oriented dialogue (TOD) agents hinges on accurate state tracking, which is the main motivation for PyTOD. Method: PyTOD employs a simple constrained decoding approach that uses a language model instead of grammar rules to follow API schemata. Result: Experiments show that PyTOD surpasses strong baselines in both accuracy and robust user goal estimation, demonstrating the effectiveness of execution-aware state tracking. Conclusion: PyTOD is an effective task-oriented dialogue agent that tracks dialogue state by generating executable code and uses policy and execution feedback for efficient error correction. Abstract: Programmable task-oriented dialogue (TOD) agents enable language models to follow structured dialogue policies, but their effectiveness hinges on accurate state tracking. We present PyTOD, an agent that generates executable code to track dialogue state and uses policy and execution feedback for efficient error correction. To this end, PyTOD employs a simple constrained decoding approach, using a language model instead of grammar rules to follow API schemata. This leads to state-of-the-art state tracking performance on the challenging SGD benchmark. Our experiments show that PyTOD surpasses strong baselines in both accuracy and robust user goal estimation as the dialogue progresses, demonstrating the effectiveness of execution-aware state tracking.
[37] RadReason: Radiology Report Evaluation Metric with Reasons and Sub-Scores
Yingshu Li,Yunyi Liu,Lingqiao Liu,Lei Wang,Luping Zhou
Main category: cs.CL
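TL;DR: This paper proposes RadReason, a new evaluation framework for radiology reports that reaches parity with GPT-4-based evaluation while remaining explainable, cost-efficient, and suitable for clinical deployment.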
Details
Motivation: Evaluating automatically generated radiology reports remains a fundamental challenge because clinically grounded, interpretable, and fine-grained metrics are lacking. Existing methods either produce coarse overall scores or rely on opaque black-box models, limiting their usefulness in real-world clinical workflows. Method: RadReason builds on Group Relative Policy Optimization and introduces two innovations: Sub-score Dynamic Weighting and Majority-Guided Advantage Scaling. Result: Experiments on the ReXVal benchmark show that RadReason surpasses all prior offline metrics and achieves parity with GPT-4-based evaluations. Conclusion: RadReason is a new evaluation framework for radiology reports that achieves performance on par with GPT-4-based evaluation while remaining explainable, cost-efficient, and suitable for clinical deployment. Abstract: Evaluating automatically generated radiology reports remains a fundamental challenge due to the lack of clinically grounded, interpretable, and fine-grained metrics. Existing methods either produce coarse overall scores or rely on opaque black-box models, limiting their usefulness in real-world clinical workflows. We introduce RadReason, a novel evaluation framework for radiology reports that not only outputs fine-grained sub-scores across six clinically defined error types, but also produces human-readable justifications that explain the rationale behind each score. Our method builds on Group Relative Policy Optimization and incorporates two key innovations: (1) Sub-score Dynamic Weighting, which adaptively prioritizes clinically challenging error types based on live F1 statistics; and (2) Majority-Guided Advantage Scaling, which adjusts policy gradient updates based on prompt difficulty derived from sub-score agreement. Together, these components enable more stable optimization and better alignment with expert clinical judgment. Experiments on the ReXVal benchmark show that RadReason surpasses all prior offline metrics and achieves parity with GPT-4-based evaluations, while remaining explainable, cost-efficient, and suitable for clinical deployment. Code will be released upon publication.
[38] SLM4Offer: Personalized Marketing Offer Generation Using Contrastive Learning Based Fine-Tuning
Vedasamhitha Challapalli,Konduru Venkat Sai,Piyush Pratap Singh,Rupesh Prasad,Arvind Maurya,Atul Singh
Main category: cs.CL
TL;DR: This paper presents SLM4Offer, a generative AI model for personalized offer generation using a contrastive learning approach, which improves offer acceptance rates by 17 percent over traditional methods.
Details
Motivation: Personalized marketing is crucial for enhancing customer engagement and driving business growth, with potential for increasing conversion rates and improving customer satisfaction. Prior studies suggest that personalization strategies can boost revenue by up to 40 percent. Method: SLM4Offer employs InfoNCE loss to align customer personas with relevant offers in a shared embedding space, and uses a contrastive learning approach to fine-tune a pre-trained encoder-decoder language model (T5-Small 60M). Result: The experimental results show that the SLM4Offer model, which uses contrastive learning for adaptive learning behavior, significantly improves the offer acceptance rate compared to a supervised fine-tuning baseline. Conclusion: SLM4Offer demonstrates a 17 percent improvement in offer acceptance rate over a supervised fine-tuning baseline, highlighting the effectiveness of contrastive objectives in advancing personalized marketing. Abstract: Personalized marketing has emerged as a pivotal strategy for enhancing customer engagement and driving business growth. Academic and industry efforts have predominantly focused on recommendation systems and personalized advertisements. Nonetheless, this facet of personalization holds significant potential for increasing conversion rates and improving customer satisfaction. Prior studies suggest that well-executed personalization strategies can boost revenue by up to 40 percent, underscoring the strategic importance of developing intelligent, data-driven approaches for offer generation. This work introduces SLM4Offer, a generative AI model for personalized offer generation, developed by fine-tuning a pre-trained encoder-decoder language model, specifically Google's Text-to-Text Transfer Transformer (T5-Small 60M) using a contrastive learning approach. SLM4Offer employs InfoNCE (Information Noise-Contrastive Estimation) loss to align customer personas with relevant offers in a shared embedding space. 
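The InfoNCE objective mentioned above can be sketched in a few lines. This is a generic in-batch formulation, assuming that the persona and offer at the same index form the positive pair and that all other offers in the batch act as negatives; the function name `info_nce`, the temperature value, and the toy embeddings are hypothetical and do not come from the paper.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _normalize(v):
    # Assumes a non-zero vector; unit length makes the dot product a cosine similarity.
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]

def info_nce(personas, offers, temperature=0.1):
    """Mean InfoNCE loss over a batch of (persona, offer) embedding pairs.

    personas[i] and offers[i] are treated as the positive pair (an assumption);
    every other offer in the batch serves as a negative for persona i.
    """
    p = [_normalize(v) for v in personas]
    o = [_normalize(v) for v in offers]
    total = 0.0
    for i, persona in enumerate(p):
        logits = [_dot(persona, offer) / temperature for offer in o]
        # Numerically stable log-sum-exp over all candidate offers.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(p)

if __name__ == "__main__":
    personas = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    shuffled = [[0.0, 1.0], [1.0, 0.0]]
    # The loss is small when each persona sits next to its own offer.
    print(info_nce(personas, aligned))
    print(info_nce(personas, shuffled))
```

Minimizing this loss pulls each persona toward its matching offer and away from the other offers in the batch, which is the sense in which the two are aligned in a shared embedding space.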
A key innovation in SLM4Offer lies in the adaptive learning behaviour introduced by contrastive loss, which reshapes the latent space during training and enhances the model's generalizability. The model is fine-tuned and evaluated on a synthetic dataset designed to simulate customer behaviour and offer acceptance patterns. Experimental results demonstrate a 17 percent improvement in offer acceptance rate over a supervised fine-tuning baseline, highlighting the effectiveness of contrastive objectives in advancing personalized marketing.
[39] Subjective Behaviors and Preferences in LLM: Language of Browsing
Sai Sundaresan,Harshita Chopra,Atanu R. Sinha,Koustava Goswami,Nagasai Saketh Naidu,Raghav Karan,N Anushka
Main category: cs.CL
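TL;DR: This paper proposes HeTLM, a language model training method for users' subjective browsing behavior; results show that a small model is more effective than large models and that HeTLM better captures users' individual preferences.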
Details
Motivation: The paper questions whether large language models can adapt to users' subjective behaviors and preferences, especially the individualized "language" formed by browsing behavior. Method: The authors propose HeTLM (Heterogeneity aware Training of Language Model), a training method suited to subjective behaviors, and compare it experimentally. Result: (1) A small model trained with a page-level tokenizer outperforms large pretrained or finetuned models; (2) HeTLM outperforms a single model when controlling for the number of parameters; (3) generation shows a higher mean and a lower variance in performance, indicating better alignment. Conclusion: With HeTLM, small language models outperform large models at predicting user browsing behavior and better capture user heterogeneity and subjectivity. Abstract: A Large Language Model (LLM) offers versatility across domains and tasks, purportedly benefiting users with a wide variety of behaviors and preferences. We question this perception about an LLM when users have inherently subjective behaviors and preferences, as seen in their ubiquitous and idiosyncratic browsing of websites or apps. The sequential behavior logs of pages, thus generated, form something akin to each user's self-constructed "language", albeit without the structure and grammar imbued in natural languages. We ask: (i) Can a small LM represent the "language of browsing" better than a large LM? (ii) Can an LM with a single set of parameters (or, single LM) adequately capture myriad users' heterogeneous, subjective behaviors and preferences? (iii) Can a single LM with high average performance, yield low variance in performance to make alignment good at user level? We introduce clusterwise LM training, HeTLM (Heterogeneity aware Training of Language Model), appropriate for subjective behaviors. We find that (i) a small LM trained using a page-level tokenizer outperforms large pretrained or finetuned LMs; (ii) HeTLM with heterogeneous cluster specific set of parameters outperforms a single LM of the same family, controlling for the number of parameters; and (iii) a higher mean and a lower variance in generation ensues, implying improved alignment.
[40] Influence-driven Curriculum Learning for Pre-training on Limited Data
Loris Schoenegger,Lukas Thoma,Terra Blevins,Benjamin Roth
Main category: cs.CL
TL;DR: This paper demonstrates that curriculum learning can be effective for language model pre-training when using a model-centric measure of example difficulty called training data influence.
Details
Motivation: Curriculum learning has shown limited success for pre-training language models with human-centered difficulty metrics, prompting the investigation of a model-centric approach. Method: Training examples were sorted by their training data influence, a score estimating the effect of individual examples on the model's output, and models trained on these curricula were compared against those trained in random order. Result: Models trained on curricula based on training data influence outperformed randomly trained models by over 10 percentage points in benchmarks. Conclusion: Curriculum learning is beneficial for language model pre-training if a more model-centric notion of difficulty, such as training data influence, is adopted. Abstract: Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Specifically, we experiment with sorting training examples by their training data influence, a score which estimates the effect of individual training examples on the model's output. Models trained on our curricula are able to outperform ones trained in random order by over 10 percentage points in benchmarks, confirming that curriculum learning is beneficial for language model pre-training, as long as a more model-centric notion of difficulty is adopted.
[41] SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts -- Extended Version
Nghiem Thanh Pham,Tung Kieu,Duc-Manh Nguyen,Son Ha Xuan,Nghia Duong-Trung,Danh Le-Phuoc
Main category: cs.CL
TL;DR: SLM-Bench is a new benchmark for evaluating small language models across multiple dimensions like accuracy, efficiency, and sustainability, highlighting their trade-offs and enabling reproducible research.
Details
Motivation: There is a lack of systematic evaluation of small language models (SLMs) in terms of performance and environmental impact, which SLM-Bench aims to address. Method: The authors designed SLM-Bench, which evaluates 15 SLMs on 9 NLP tasks using 23 datasets across 14 domains, and tested them on 4 hardware configurations while measuring 11 different metrics. Result: The benchmark enables a holistic assessment of SLMs, revealing that some models excel in accuracy while others are more energy-efficient, and provides a reproducible, open-source pipeline for future research. Conclusion: SLM-Bench provides a comprehensive and standardized framework for evaluating SLMs, enabling researchers to assess trade-offs between accuracy, computational efficiency, and sustainability. Abstract: Small Language Models (SLMs) offer computational efficiency and accessibility, yet a systematic evaluation of their performance and environmental impact remains lacking. We introduce SLM-Bench, the first benchmark specifically designed to assess SLMs across multiple dimensions, including accuracy, computational efficiency, and sustainability metrics. SLM-Bench evaluates 15 SLMs on 9 NLP tasks using 23 datasets spanning 14 domains. The evaluation is conducted on 4 hardware configurations, providing a rigorous comparison of their effectiveness. Unlike prior benchmarks, SLM-Bench quantifies 11 metrics across correctness, computation, and consumption, enabling a holistic assessment of efficiency trade-offs. Our evaluation considers controlled hardware conditions, ensuring fair comparisons across models. We develop an open-source benchmarking pipeline with standardized evaluation protocols to facilitate reproducibility and further research. Our findings highlight the diverse trade-offs among SLMs, where some models excel in accuracy while others achieve superior energy efficiency. 
SLM-Bench sets a new standard for SLM evaluation, bridging the gap between resource efficiency and real-world applicability.
[42] HebID: Detecting Social Identities in Hebrew-language Political Text
Guy Mor-Lan,Naama Rivlin-Angert,Yael R. Kaplan,Tamir Sheafer,Shaul R. Shenhav
Main category: cs.CL
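TL;DR: This paper introduces HebID, a new multilabel Hebrew corpus for social identity detection, and demonstrates its use in analyzing Israeli political language.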
Details
Motivation: Existing datasets for group and identity detection are predominantly English-centric, single-label, and focused on coarse identity categories, whereas social identities are often shaped by specific cultural contexts and expressed through particular uses of language. Method: The authors introduce HebID, the first multilabel Hebrew corpus for social identity detection, and benchmark multilabel and single-label encoders alongside generative large language models. Result: Hebrew-tuned LLMs provide the best results, with a macro-F1 of 0.74. Conclusion: HebID provides a comprehensive foundation for studying social identities in Hebrew and can serve as a model for similar research in other non-English political contexts. Abstract: Political language is deeply intertwined with social identities. While social identities are often shaped by specific cultural contexts and expressed through particular uses of language, existing datasets for group and identity detection are predominantly English-centric, single-label and focus on coarse identity categories. We introduce HebID, the first multilabel Hebrew corpus for social identity detection: 5,536 sentences from Israeli politicians' Facebook posts (Dec 2018-Apr 2021), manually annotated for twelve nuanced social identities (e.g. Rightist, Ultra-Orthodox, Socially-oriented) grounded by survey data. We benchmark multilabel and single-label encoders alongside 2B-9B-parameter generative LLMs, finding that Hebrew-tuned LLMs provide the best results (macro-F1 = 0.74). We apply our classifier to politicians' Facebook posts and parliamentary speeches, evaluating differences in popularity, temporal trends, clustering patterns, and gender-related variations in identity expression. We utilize identity choices from a national public survey, enabling a comparison between identities portrayed in elite discourse and the public's identity priorities. HebID provides a comprehensive foundation for studying social identities in Hebrew and can serve as a model for similar research in other non-English political contexts.
[43] Dream 7B: Diffusion Large Language Models
Jiacheng Ye,Zhihui Xie,Lin Zheng,Jiahui Gao,Zirui Wu,Xin Jiang,Zhenguo Li,Lingpeng Kong
Main category: cs.CL
TL;DR: The paper introduces Dream 7B, an open diffusion large language model that outperforms existing models by refining sequences in parallel through iterative denoising.
Details
Motivation: To overcome the limitations of autoregressive models that generate tokens sequentially, the authors introduced Dream 7B, which refines sequences in parallel. Method: Dream 7B employs discrete diffusion modeling to refine sequences in parallel through iterative denoising, using AR-based LLM initialization and context-adaptive token-level noise rescheduling. Result: Dream 7B consistently outperforms existing diffusion language models on general, mathematical, and coding tasks while offering tunable quality-speed trade-offs. Conclusion: Dream 7B is a powerful open diffusion large language model that outperforms existing models and demonstrates superior planning abilities and inference flexibility. Abstract: We introduce Dream 7B, the most powerful open diffusion large language model to date. Unlike autoregressive (AR) models that generate tokens sequentially, Dream 7B employs discrete diffusion modeling to refine sequences in parallel through iterative denoising. Our model consistently outperforms existing diffusion language models on general, mathematical, and coding tasks. Dream 7B demonstrates superior planning abilities and inference flexibility, including arbitrary-order generation, infilling capabilities, and tunable quality-speed trade-offs. These results are achieved through simple yet effective training techniques, including AR-based LLM initialization and context-adaptive token-level noise rescheduling. We release both Dream-Base and Dream-Instruct to facilitate further research in diffusion-based language modeling.
[44] The Enemy from Within: A Study of Political Delegitimization Discourse in Israeli Political Speech
Naama Rivlin-Angert,Guy Mor-Lan
Main category: cs.CL
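TL;DR: This paper presents a new automated approach to analyzing political delegitimization discourse, and the results demonstrate its feasibility and value for understanding democratic discourse.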
Details
Motivation: The study aims to carry out a large-scale computational analysis of political delegitimization discourse (PDD) to understand its role and trends in democratic discourse. Method: The authors build a two-stage classification pipeline combining finetuned encoder models and decoder LLMs. Result: PDD has risen markedly over the past three decades, is more prevalent on social media than in parliamentary debate, is used more by male politicians, shows stronger tendencies among right-leaning actors, and spikes during election campaigns and major political events. Conclusion: Automated analysis of PDD is feasible and valuable for understanding democratic discourse. Abstract: We present the first large-scale computational study of political delegitimization discourse (PDD), defined as symbolic attacks on the normative validity of political entities. We curate and manually annotate a novel Hebrew-language corpus of 10,410 sentences drawn from Knesset speeches (1993-2023), Facebook posts (2018-2021), and leading news outlets, of which 1,812 instances (17.4%) exhibit PDD and 642 carry additional annotations for intensity, incivility, target type, and affective framing. We introduce a two-stage classification pipeline combining finetuned encoder models and decoder LLMs. Our best model (DictaLM 2.0) attains an F1 of 0.74 for binary PDD detection and a macro-F1 of 0.67 for classification of delegitimization characteristics. Applying this classifier to longitudinal and cross-platform data, we see a marked rise in PDD over three decades, higher prevalence on social media versus parliamentary debate, greater use by male than female politicians, and stronger tendencies among right-leaning actors, with pronounced spikes during election campaigns and major political events. Our findings demonstrate the feasibility and value of automated PDD analysis for understanding democratic discourse.
[45] SafetyFlow: An Agent-Flow System for Automated LLM Safety Benchmarking
Xiangyang Zhu,Yuan Tian,Chunyi Li,Kaiwei Zhang,Wei Sun,Guangtao Zhai
Main category: cs.CL
TL;DR: SafetyFlow introduces a fully automated agent-based system for generating LLM safety benchmarks, significantly improving efficiency and benchmark quality.
Details
Motivation: Existing LLM safety evaluation benchmarks are labor-intensive, time-consuming, redundant, and lack difficulty, necessitating an automated solution to improve efficiency and effectiveness. Method: SafetyFlow uses an agent-flow system with seven specialized agents to automate the creation of safety benchmarks for LLMs, incorporating versatile tools to maintain control over the process and integrate human expertise. Result: SafetyFlow constructs a comprehensive safety benchmark (SafetyFlowBench) containing 23,446 queries in just four days without human intervention, demonstrating low redundancy and strong discriminative power. Conclusion: SafetyFlow provides an efficient and automated approach to constructing LLM safety benchmarks, reducing time and resource costs while ensuring quality and discriminative power in the resulting datasets. Abstract: The rapid proliferation of large language models (LLMs) has intensified the requirement for reliable safety evaluation to uncover model vulnerabilities. To this end, numerous LLM safety evaluation benchmarks are proposed. However, existing benchmarks generally rely on labor-intensive manual curation, which causes excessive time and resource consumption. They also exhibit significant redundancy and limited difficulty. To alleviate these problems, we introduce SafetyFlow, the first agent-flow system designed to automate the construction of LLM safety benchmarks. SafetyFlow can automatically build a comprehensive safety benchmark in only four days without any human intervention by orchestrating seven specialized agents, significantly reducing time and resource cost. Equipped with versatile tools, the agents of SafetyFlow ensure process and cost controllability while integrating human expertise into the automatic pipeline. The final constructed dataset, SafetyFlowBench, contains 23,446 queries with low redundancy and strong discriminative power. 
Our contribution includes the first fully automated benchmarking pipeline and a comprehensive safety benchmark. We evaluate the safety of 49 advanced LLMs on our dataset and conduct extensive experiments to validate our efficacy and efficiency.
[46] Trained Miniatures: Low cost, High Efficacy SLMs for Sales & Marketing
Ishaan Bhola,Mukunda NS,Sravanth Kurmala,Harsh Nandwani,Arihant Jain
Main category: cs.CL
TL;DR: This paper proposes the use of Small Language Models (SLMs), referred to as 'Trained Miniatures,' fine-tuned for specific applications to achieve cost-effective and domain-specific text generation, as opposed to using expensive Large Language Models (LLMs).
Details
Motivation: The motivation behind the paper is to address the high computational costs and infeasibility of using Large Language Models for targeted applications like sales and marketing outreach, by proposing a more cost-effective alternative. Method: The method involves creating Small Language Models (SLMs) that are specifically fine-tuned for high-value applications, as opposed to using general-purpose Large Language Models (LLMs). Result: The result is the development of 'Trained Miniatures' which are Small Language Models that can generate similar domain-specific responses as Large Language Models but at a fraction of the cost. Conclusion: The paper concludes that by using 'Trained Miniatures' or Small Language Models fine-tuned for specific applications, it is possible to generate domain-specific responses at a much lower cost compared to using Large Language Models. Abstract: Large language models (LLMs) excel in text generation; however, these creative elements require heavy computation and are accompanied by a steep cost. Especially for targeted applications such as sales and marketing outreach, these costs are far from feasible. This paper introduces the concept of "Trained Miniatures" - Small Language Models (SLMs) fine-tuned for specific, high-value applications, generating similar domain-specific responses for a fraction of the cost.
[47] SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models
Peng Ding,Wen Sun,Dailin Li,Wei Zou,Jiaming Wang,Jiajun Chen,Shujian Huang
Main category: cs.CL
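TL;DR: This paper proposes SDGO, a reinforcement learning framework that uses an LLM's own discrimination ability to improve the safety of its generated content; it needs no additional data or models and holds up well against out-of-distribution jailbreaking attacks.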
Details
Motivation: Large language models (LLMs) perform well across natural language processing tasks but are vulnerable to jailbreaking attacks that lead them to generate harmful content. The paper reveals a key safety inconsistency: LLMs identify harmful requests more effectively as discriminators than they defend against them as generators. Method: SDGO (Self-Discrimination-Guided Optimization) is a reinforcement learning framework that strengthens generation safety through iterative self-improvement; training requires no additional annotated data or external models. Result: Experiments show that SDGO significantly improves model safety over both prompt-based and training-based baselines while preserving performance on general benchmarks. With only a small number of discriminative samples, the model's generation capability can be improved further, and discrimination and generation become tightly coupled. Conclusion: By using the model's own discrimination ability as a reward signal, SDGO improves the safety of generated content, preserves helpfulness on general benchmarks, and is robust to out-of-distribution (OOD) jailbreaking attacks. Abstract: Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content generation. In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators. This insight inspires us to explore aligning the model's inherent discrimination and generation capabilities. To this end, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model's own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement. Our method does not require any additional annotated data or external models during the training phase. Extensive experiments demonstrate that SDGO significantly improves model safety compared to both prompt-based and training-based baselines while maintaining helpfulness on general benchmarks. By aligning LLMs' discrimination and generation capabilities, SDGO brings robust performance against out-of-distribution (OOD) jailbreaking attacks. This alignment achieves tighter coupling between these two capabilities, enabling the model's generation capability to be further enhanced with only a small amount of discriminative samples. Our code and datasets are available at https://github.com/NJUNLP/SDGO.
[48] Benchmarking Computer Science Survey Generation
Weihang Su,Anzhe Xie,Qingyao Ai,Jianming Long,Jiaxin Mao,Ziyi Ye,Yiqun Liu
Main category: cs.CL
TL;DR: This paper introduces SurGE, a benchmark for evaluating automated scientific survey generation using large language models, highlighting the ongoing challenges in this area.
Details
Motivation: The motivation is to address the infeasibility of manually creating scientific survey articles due to the rapid growth of academic literature and the lack of standardized benchmarks for evaluating automated survey generation. Method: The paper introduces SurGE, which includes a collection of test instances and a retrieval pool of over one million academic papers. It also proposes an automated evaluation framework that assesses generated surveys on four dimensions: information coverage, referencing accuracy, structural organization, and content quality. Result: The result is the creation of the SurGE benchmark and the finding that survey generation remains challenging for LLM-based approaches, even with advanced self-reflection frameworks. Conclusion: The paper concludes that generating scientific surveys is a complex task that requires further research despite the potential offered by large language models (LLMs). Abstract: Scientific survey articles play a vital role in summarizing research progress, yet their manual creation is becoming increasingly infeasible due to the rapid growth of academic literature. While large language models (LLMs) offer promising capabilities for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To address this gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for evaluating scientific survey generation in the computer science domain. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers that serves as the retrieval pool. In addition, we propose an automated evaluation framework that measures generated surveys across four dimensions: information coverage, referencing accuracy, structural organization, and content quality. 
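The abstract names referencing accuracy as one evaluation dimension but does not define it. As a minimal sketch of one plausible way to score it, the snippet below compares the references cited by a generated survey with those of the expert-written survey using set overlap. The function name, the paper identifiers, and the choice of precision, recall, and F1 are assumptions for illustration, not SurGE's actual metric.

```python
def reference_overlap(generated_refs, expert_refs):
    """Precision, recall, and F1 of generated citations against an expert reference list.

    Hypothetical helper: SurGE's real scoring may differ.
    """
    generated, expert = set(generated_refs), set(expert_refs)
    hits = generated & expert  # references that both surveys cite
    precision = len(hits) / len(generated) if generated else 0.0
    recall = len(hits) / len(expert) if expert else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up paper identifiers standing in for entries of the retrieval pool.
generated = ["p1", "p2", "p3", "p9"]
expert = ["p1", "p2", "p4", "p5", "p6"]
precision, recall, f1 = reference_overlap(generated, expert)
print(precision, recall, round(f1, 3))
```

Set overlap treats every reference as equally important; a weighted variant would be needed if some citations matter more than others.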
Our evaluation of diverse LLM-based approaches shows that survey generation remains highly challenging, even for advanced self-reflection frameworks. These findings highlight the complexity of the task and the necessity for continued research. We have open-sourced all the code, data, and models at: https://github.com/oneal2000/SurGE
[49] Position Bias Mitigates Position Bias: Mitigate Position Bias Through Inter-Position Knowledge Distillation
Yifei Wang,Feng Xiong,Yong Wang,Linjing Li,Xiangxiang Chu,Daniel Dajun Zeng
Main category: cs.CL
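TL;DR: This paper proposes Pos2Distill, which reduces position bias through position-to-position knowledge distillation and improves performance on long-context tasks.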
Details
Motivation: Position bias severely impairs long-context comprehension and processing; although prior work has tried to mitigate it by modifying architectures, significant position bias remains. Method: The Pos2Distill framework transfers knowledge from advantageous positions to less favorable ones, with specialized instantiations designed for the retrieval and reasoning paradigms. Result: Pos2Distill achieves significant performance gains across all contextual positions and generalizes across tasks. Conclusion: Pos2Distill effectively reduces position bias and improves both uniformity and performance on long-context retrieval and reasoning tasks. Abstract: Positional bias (PB), manifesting as non-uniform sensitivity across different contextual locations, significantly impairs long-context comprehension and processing capabilities. While prior work seeks to mitigate PB through modifying the architectures causing its emergence, significant PB still persists. To address PB effectively, we introduce Pos2Distill, a position to position knowledge distillation framework. Pos2Distill transfers the superior capabilities from advantageous positions to less favorable ones, thereby reducing the huge performance gaps. The conceptual principle is to leverage the inherent, position-induced disparity to counteract the PB itself. We identify distinct manifestations of PB under retrieval and reasoning paradigms, thereby designing two specialized instantiations: Pos2Distill-R¹ and Pos2Distill-R² respectively, both grounded in this core principle. By employing the Pos2Distill approach, we achieve enhanced uniformity and significant performance gains across all contextual positions in long-context retrieval and reasoning tasks. Crucially, both specialized systems exhibit strong cross-task generalization mutually, while achieving superior performance on their respective tasks.
[50] Stemming -- The Evolution and Current State with a Focus on Bangla
Abhijit Paul,Mashiat Amin Farin,Sharif Md. Abdullah,Ahmedul Kabir,Zarif Masud,Shebuti Rayana
Main category: cs.CL
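TL;DR: This paper surveys the current state and challenges of Bangla stemming, identifies shortcomings in existing research, and suggests directions for future work.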
Details
Motivation: Bangla is widely spoken but poor in digital resources, so its natural language processing faces challenges, particularly in stemming. Method: The paper conducts a comprehensive survey of existing Bangla stemming approaches and examines problems with evaluation methodology. Result: The survey finds significant gaps in the existing literature and a lack of accessible implementations for replication and further research. Conclusion: The paper stresses the importance of developing robust Bangla stemmers and calls for continued research in the field to improve language analysis and processing. Abstract: Bangla, the seventh most widely spoken language worldwide with 300 million native speakers, faces digital under-representation due to limited resources and lack of annotated datasets. Stemming, a critical preprocessing step in language analysis, is essential for low-resource, highly-inflectional languages like Bangla, because it can reduce the complexity of algorithms and models by significantly reducing the number of words the algorithm needs to consider. This paper conducts a comprehensive survey of stemming approaches, emphasizing the importance of handling morphological variants effectively. While exploring the landscape of Bangla stemming, it becomes evident that there is a significant gap in the existing literature. The paper highlights the discontinuity from previous research and the scarcity of accessible implementations for replication. Furthermore, it critiques the evaluation methodologies, stressing the need for more relevant metrics. In the context of Bangla's rich morphology and diverse dialects, the paper acknowledges the challenges it poses. To address these challenges, the paper suggests directions for Bangla stemmer development. It concludes by advocating for robust Bangla stemmers and continued research in the field to enhance language analysis and processing.
[51] EcomMMMU: Strategic Utilization of Visuals for Robust Multimodal E-Commerce Models
Xinyi Ling,Hanwen Du,Zhihui Zhu,Xia Ning
Main category: cs.CL
TL;DR: This paper introduces EcomMMMU, a large e-commerce multimodal dataset, and finds that product images do not always improve performance for MLLMs. It proposes SUMEI, a method to better utilize visual content based on predicted visual utility.
Details
Motivation: The motivation stems from the question of whether product images in e-commerce always improve understanding or can introduce redundancy and degrade performance, which is difficult to study due to limitations in existing datasets. Method: The authors introduced EcomMMMU, a large-scale dataset for evaluating multimodal large language models (MLLMs) in e-commerce tasks, and proposed SUMEI, a method that predicts visual utility to strategically use multiple images. Result: Experiments on EcomMMMU show that product images can sometimes degrade performance, and the proposed SUMEI method demonstrates effectiveness and robustness in utilizing visual content. Conclusion: The paper concludes that while e-commerce platforms have abundant multimodal data, product images do not always enhance performance and can sometimes degrade it, indicating that MLLMs may struggle to utilize visual content effectively. Abstract: E-commerce platforms are rich in multimodal data, featuring a variety of images that depict product details. However, this raises an important question: do these images always enhance product understanding, or can they sometimes introduce redundancy or degrade performance? Existing datasets are limited in both scale and design, making it difficult to systematically examine this question. To this end, we introduce EcomMMMU, an e-commerce multimodal multitask understanding dataset with 406,190 samples and 8,989,510 images. EcomMMMU is comprised of multi-image visual-language data designed with 8 essential tasks and a specialized VSS subset to benchmark the capability of multimodal large language models (MLLMs) to effectively utilize visual content. Analysis on EcomMMMU reveals that product images do not consistently improve performance and can, in some cases, degrade it. This indicates that MLLMs may struggle to effectively leverage rich visual content for e-commerce tasks. 
Building on these insights, we propose SUMEI, a data-driven method that strategically utilizes multiple images via predicting visual utilities before using them for downstream tasks. Comprehensive experiments demonstrate the effectiveness and robustness of SUMEI. The data and code are available through https://anonymous.4open.science/r/submission25.
[52] End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning
Qiaoyu Zheng,Yuze Sun,Chaoyi Wu,Weike Zhao,Pengcheng Qiu,Yongguo Yu,Kun Sun,Yanfeng Wang,Ya Zhang,Weidi Xie
Main category: cs.CL
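TL;DR: Deep-DxSearch is an agentic RAG system trained end-to-end with reinforcement learning to improve the accuracy of medical diagnosis, addressing the weak knowledge use and poor reasoning traceability of existing methods.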
Details
Motivation: Accurate medical diagnosis is hindered by knowledge gaps and hallucinations, and existing retrieval- and tool-augmented methods are limited by weak use of external knowledge and poor feedback-reasoning traceability. Method: The paper introduces Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement learning, and builds a large-scale medical retrieval corpus from patient records and reliable medical knowledge sources. Result: Experiments show that Deep-DxSearch achieves substantial accuracy gains for both common and rare disease diagnosis; ablation and case studies support the effectiveness and distinctiveness of the approach. Conclusion: With its end-to-end agentic reinforcement learning framework, Deep-DxSearch consistently outperforms existing prompt-engineering and training-free RAG approaches across multiple data centers. Abstract: Accurate diagnosis with medical large language models is hindered by knowledge gaps and hallucinations. Retrieval and tool-augmented methods help, but their impact is limited by weak use of external knowledge and poor feedback-reasoning traceability. To address these challenges, we introduce Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement learning (RL) that enables steerable, traceable retrieval-augmented reasoning for medical diagnosis. In Deep-DxSearch, we first construct a large-scale medical retrieval corpus comprising patient records and reliable medical knowledge sources to support retrieval-aware reasoning across diagnostic scenarios. More crucially, we frame the LLM as the core agent and the retrieval corpus as its environment, using tailored rewards on format, retrieval, reasoning structure, and diagnostic accuracy, thereby evolving the agentic RAG policy from large-scale data through RL. Experiments demonstrate that our end-to-end agentic RL training framework consistently outperforms prompt-engineering and training-free RAG approaches across multiple data centers. After training, Deep-DxSearch achieves substantial gains in diagnostic accuracy, surpassing strong diagnostic baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks for both common and rare disease diagnosis under in-distribution and out-of-distribution settings. Moreover, ablation studies on reward design and retrieval corpus components confirm their critical roles, underscoring the uniqueness and effectiveness of our approach compared with traditional implementations. 
Finally, case studies and interpretability analyses highlight improvements in Deep-DxSearch's diagnostic policy, providing deeper insight into its performance gains and supporting clinicians in delivering more reliable and precise preliminary diagnoses. See https://github.com/MAGIC-AI4Med/Deep-DxSearch.
[53] Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis
Yufeng Zhao,Junnan Liu,Hongwei Liu,Dongsheng Zhu,Yuan Shen,Songyang Zhang,Kai Chen
Main category: cs.CL
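TL;DR: ReasonZoo provides a comprehensive benchmark and two new metrics for evaluating Tool-Integrated Reasoning (TIR) across domains, and shows that TIR improves both the reasoning ability and the efficiency of large language models (LLMs).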
Details
Motivation: Although large language models have made significant progress on reasoning tasks, they still fall short on tasks requiring precise computation. Tool-Integrated Reasoning (TIR) has been proposed as a solution, but how well it generalizes in improving LLM reasoning remains unclear. Method: The paper introduces ReasonZoo, a comprehensive benchmark covering nine reasoning categories, and proposes two new metrics, Performance-Aware Cost (PAC) and Area Under the Performance-Cost Curve (AUC-PCC), to assess reasoning efficiency. Result: The empirical evaluation finds that TIR-enabled models consistently outperform non-TIR models on both mathematical and non-mathematical tasks. TIR also improves reasoning efficiency, as shown by better PAC and AUC-PCC, indicating less overthinking and more streamlined reasoning. Conclusion: The findings underscore the domain-general benefits of TIR and its potential to improve how LLMs handle complex reasoning tasks. Abstract: Large Language Models (LLMs) have made significant strides in reasoning tasks through methods like chain-of-thought (CoT) reasoning. However, they often fall short in tasks requiring precise computations. Tool-Integrated Reasoning (TIR) has emerged as a solution by incorporating external tools into the reasoning process. Nevertheless, the generalization of TIR in improving the reasoning ability of LLM is still unclear. Additionally, whether TIR has improved the model's reasoning behavior and helped the model think remains to be studied. We introduce ReasonZoo, a comprehensive benchmark encompassing nine diverse reasoning categories, to evaluate the effectiveness of TIR across various domains. Additionally, we propose two novel metrics, Performance-Aware Cost (PAC) and Area Under the Performance-Cost Curve (AUC-PCC), to assess reasoning efficiency. Our empirical evaluation demonstrates that TIR-enabled models consistently outperform their non-TIR counterparts in both mathematical and non-mathematical tasks. Furthermore, TIR enhances reasoning efficiency, as evidenced by improved PAC and AUC-PCC, indicating reduced overthinking and more streamlined reasoning. These findings underscore the domain-general benefits of TIR and its potential to advance LLM capabilities in complex reasoning tasks.
[54] LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Ming Yin,Dinghan Shen,Silei Xu,Jianbing Han,Sixun Dong,Mian Zhang,Yebowen Hu,Shujian Liu,Simin Ma,Song Wang,Sathish Reddy Indurthi,Xun Wang,Yiran Chen,Kaiqiang Song
Main category: cs.CL
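TL;DR: This study presents the LiveMCP-101 benchmark and a new evaluation approach for assessing how well AI agents use tools to solve multi-step tasks in real-world settings.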
Details
Motivation: Although the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in evaluating how well AI agents solve multi-step tasks with diverse MCP tools in realistic, dynamic scenarios. Method: The paper introduces a new evaluation approach that uses ground-truth execution plans rather than raw API outputs, and presents LiveMCP-101, a benchmark of 101 real-world queries. Result: Experiments show that even frontier LLMs achieve success rates below 60%, highlighting major challenges in tool orchestration. Conclusion: LiveMCP-101 sets a rigorous standard for testing real-world agent capabilities, advancing the development of autonomous AI systems that reliably execute complex tasks through tool use. Abstract: Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.
cs.CV [Back]
[55] Heatmap Regression without Soft-Argmax for Facial Landmark Detection
Chiao-An Yang,Raymond A. Yeh
Main category: cs.CV
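TL;DR: This paper proposes a facial landmark detection method that does not rely on Soft-argmax, speeding up training while maintaining or improving accuracy.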
Details
Motivation: To overcome the limitations of Soft-argmax in facial landmark detection, the paper revisits this long-standing design choice. Method: It proposes an alternative training objective for facial landmark detection based on the classic structured prediction framework. Result: The method achieves state-of-the-art performance on three facial landmark benchmarks (WFLW, COFW, and 300W) while converging 2.2x faster during training. Conclusion: An alternative training objective based on the classic structured prediction framework can match or exceed the performance of Soft-argmax methods while training faster. Abstract: Facial landmark detection is an important task in computer vision with numerous applications, such as head pose estimation, expression analysis, face swapping, etc. Heatmap regression-based methods have been widely used to achieve state-of-the-art results in this task. These methods involve computing the argmax over the heatmaps to predict a landmark. Since argmax is not differentiable, these methods use a differentiable approximation, Soft-argmax, to enable end-to-end training on deep-nets. In this work, we revisit this long-standing choice of using Soft-argmax and demonstrate that it is not the only way to achieve strong performance. Instead, we propose an alternative training objective based on the classic structured prediction framework. Empirically, our method achieves state-of-the-art performance on three facial landmark benchmarks (WFLW, COFW, and 300W), converging 2.2x faster during training while maintaining better/competitive accuracy. Our code is available here: https://github.com/ca-joe-yang/regression-without-softarg.
[56] Fast Graph Neural Network for Image Classification
Mustafa Mohammadi Gharasuie,Luis Rueda
Main category: cs.CV
TL;DR: This study proposes a new method for image classification by integrating Graph Convolutional Networks with Voronoi diagrams, enhancing performance and expanding the potential of graph-based learning in computer vision.
Details
Motivation: The rapid progress in image classification has been largely driven by the adoption of Graph Convolutional Networks (GCNs), which offer a robust framework for handling complex data structures. Method: This study introduces a novel approach that integrates GCNs with Voronoi diagrams to enhance image classification by leveraging their ability to effectively model relational data. Unlike conventional convolutional neural networks (CNNs), our method represents images as graphs, where pixels or regions function as vertices. These graphs are then refined using corresponding Delaunay triangulations, optimizing their representation. Result: The proposed model achieves significant improvements in both preprocessing efficiency and classification accuracy across various benchmark datasets, surpassing state-of-the-art approaches, particularly in challenging scenarios involving intricate scenes and fine-grained categories. Experimental results, validated through cross-validation, underscore the effectiveness of combining GCNs with Voronoi diagrams for advancing image classification. Conclusion: This research not only presents a novel perspective on image classification but also expands the potential applications of graph-based learning paradigms in computer vision and unstructured data analysis. Abstract: The rapid progress in image classification has been largely driven by the adoption of Graph Convolutional Networks (GCNs), which offer a robust framework for handling complex data structures. This study introduces a novel approach that integrates GCNs with Voronoi diagrams to enhance image classification by leveraging their ability to effectively model relational data. Unlike conventional convolutional neural networks (CNNs), our method represents images as graphs, where pixels or regions function as vertices. These graphs are then refined using corresponding Delaunay triangulations, optimizing their representation. 
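As a rough illustration of the graph construction described above (not the authors' implementation), the sketch below builds a Delaunay triangulation, the dual of the Voronoi diagram, over a handful of 2D points standing in for pixel or region centroids, and reads off the graph edges a GCN could operate on. The brute-force empty-circumcircle test, the function names, and the sample points are all assumptions chosen for brevity; a real pipeline would use an optimized library routine.

```python
from itertools import combinations


def orientation(a, b, c):
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_circumcircle(a, b, c, p):
    """True if p lies strictly inside the circumcircle of triangle abc."""
    if orientation(a, b, c) < 0:  # reorder so the triangle is counter-clockwise
        b, c = c, b
    ax, ay = a[0] - p[0], a[1] - p[1]
    bx, by = b[0] - p[0], b[1] - p[1]
    cx, cy = c[0] - p[0], c[1] - p[1]
    det = ((ax * ax + ay * ay) * (bx * cy - by * cx)
           - (bx * bx + by * by) * (ax * cy - ay * cx)
           + (cx * cx + cy * cy) * (ax * by - ay * bx))
    return det > 0


def delaunay_triangles(points):
    """Brute force: keep every triangle whose circumcircle contains no other point."""
    triangles = []
    for i, j, k in combinations(range(len(points)), 3):
        a, b, c = points[i], points[j], points[k]
        if orientation(a, b, c) == 0:
            continue  # skip degenerate (collinear) triples
        others = (points[m] for m in range(len(points)) if m not in (i, j, k))
        if not any(in_circumcircle(a, b, c, p) for p in others):
            triangles.append((i, j, k))
    return triangles


def graph_edges(triangles):
    """Undirected edges (as sorted index pairs) induced by the triangles."""
    edges = set()
    for tri in triangles:
        for u, v in combinations(tri, 2):
            edges.add((min(u, v), max(u, v)))
    return edges


# Hypothetical region centroids; in practice these would come from the image.
points = [(0, 0), (4, 0), (5, 3), (0, 2)]
triangles = delaunay_triangles(points)
edges = graph_edges(triangles)
print(triangles)
print(sorted(edges))
```

The brute-force search is quartic in the number of points, so it only suits toy inputs; it is used here because it makes the empty-circumcircle property explicit.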
The proposed model achieves significant improvements in both preprocessing efficiency and classification accuracy across various benchmark datasets, surpassing state-of-the-art approaches, particularly in challenging scenarios involving intricate scenes and fine-grained categories. Experimental results, validated through cross-validation, underscore the effectiveness of combining GCNs with Voronoi diagrams for advancing image classification. This research not only presents a novel perspective on image classification but also expands the potential applications of graph-based learning paradigms in computer vision and unstructured data analysis.
[57] You Only Pose Once: A Minimalist's Detection Transformer for Monocular RGB Category-level 9D Multi-Object Pose Estimation
Hakjin Lee,Junghoon Seo,Jaehoon Sim
Main category: cs.CV
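TL;DR: YOPO is a single-stage method that unifies object detection and 9-DoF pose estimation, achieving efficient category-level pose estimation from RGB images alone.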
Details
Motivation: There is a need for a simple, RGB-only method that performs 9-DoF pose estimation directly at the category level without relying on pseudo-depth, CAD models, or multi-stage pipelines. Method: YOPO uses a query-based framework that augments a transformer detector with a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost. Result: On the REAL275 dataset, YOPO achieves 79.6% IoU50 and 54.1% under the 10°10cm metric, setting a new state of the art. Conclusion: YOPO offers a simpler, more efficient solution that unifies object detection and 9-DoF pose estimation in a single stage and achieves state-of-the-art performance. Abstract: Accurately recovering the full 9-DoF pose of unseen instances within specific categories from a single RGB image remains a core challenge for robotics and automation. Most existing solutions still rely on pseudo-depth, CAD models, or multi-stage cascades that separate 2D detection from pose estimation. Motivated by the need for a simpler, RGB-only alternative that learns directly at the category level, we revisit a longstanding question: Can object detection and 9-DoF pose estimation be unified with high performance, without any additional data? We show that they can with our method, YOPO, a single-stage, query-based framework that treats category-level 9-DoF estimation as a natural extension of 2D detection. YOPO augments a transformer detector with a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost. The model is trained end-to-end only with RGB images and category-level pose labels. Despite its minimalist design, YOPO sets a new state of the art on three benchmarks. On the REAL275 dataset, it achieves 79.6% $\mathrm{IoU}_{50}$ and 54.1% under the $10^\circ\,10\,\mathrm{cm}$ metric, surpassing prior RGB-only methods and closing much of the gap to RGB-D systems. The code, models, and additional qualitative results can be found on our project.
[58] Paired-Sampling Contrastive Framework for Joint Physical-Digital Face Attack Detection
Andrei Balykin,Anvar Ganiev,Denis Kondranin,Kirill Polevoda,Nikolai Liudkevich,Artem Petrov
Main category: cs.CV
TL;DR: The paper introduces a new framework for face anti-spoofing that effectively detects both physical and digital attacks with improved performance and efficiency.
Details
Motivation: Modern face recognition systems are vulnerable to spoofing attempts, and traditional approaches that use separate models for physical and digital attacks increase system complexity and leave systems exposed to combined attack vectors. Method: The method involves a Paired-Sampling Contrastive Framework that uses automatically matched pairs of genuine and attack selfies to learn modality-agnostic liveness cues. Result: The framework achieved an average classification error rate (ACER) of 2.10 percent on the 6th Face Anti-Spoofing Challenge benchmark, outperforming previous solutions. Conclusion: The paper concludes that the proposed Paired-Sampling Contrastive Framework is effective in detecting both physical and digital spoofing attacks in face recognition systems, demonstrating superior performance with a low average classification error rate. Abstract: Modern face recognition systems remain vulnerable to spoofing attempts, including both physical presentation attacks and digital forgeries. Traditionally, these two attack vectors have been handled by separate models, each targeting its own artifacts and modalities. However, maintaining distinct detectors increases system complexity and inference latency and leaves systems exposed to combined attack vectors. We propose the Paired-Sampling Contrastive Framework, a unified training approach that leverages automatically matched pairs of genuine and attack selfies to learn modality-agnostic liveness cues. Evaluated on the 6th Face Anti-Spoofing Challenge Unified Physical-Digital Attack Detection benchmark, our method achieves an average classification error rate (ACER) of 2.10 percent, outperforming prior solutions. The framework is lightweight (4.46 GFLOPs) and trains in under one hour, making it practical for real-world deployment. Code and pretrained models are available at https://github.com/xPONYx/iccv2025_deepfake_challenge.
[59] TAIGen: Training-Free Adversarial Image Generation via Diffusion Models
Susim Roy,Anubhooti Jain,Mayank Vatsa,Richa Singh
Main category: cs.CV
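TL;DR: TAIGen is an efficient adversarial image generation method that uses a small number of diffusion sampling steps and a selective RGB channel perturbation strategy to achieve fast, successful attacks while preserving image quality.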
Details
Motivation: Adversarial attacks based on generative models usually require substantial computational resources and produce low-quality images. Diffusion models can generate high-quality images, but adversarial generation with them often needs hundreds of sampling steps. TAIGen aims to address these problems with a more efficient way to generate adversarial examples. Method: TAIGen uses unconditional diffusion models and needs only 3-20 sampling steps to generate adversarial examples. It injects perturbations during the mixing step interval and adopts a selective RGB channel strategy that applies attention maps to the red channel and GradCAM-guided perturbations to the green and blue channels. Result: On ImageNet with VGGNet as the source model, TAIGen achieves attack success rates of 70.6% against ResNet, 80.8% against MNASNet, and 97.8% against ShuffleNet. It also generates examples 10x faster than existing diffusion-based attacks and maintains visual quality with PSNR above 30 dB across all tested datasets. Conclusion: TAIGen is a training-free black-box method that generates adversarial examples efficiently while preserving image quality, with high attack success rates across multiple target models. Abstract: Adversarial attacks from generative models often produce low-quality images and require substantial computational resources. Diffusion models, though capable of high-quality generation, typically need hundreds of sampling steps for adversarial generation. This paper introduces TAIGen, a training-free black-box method for efficient adversarial image generation. TAIGen produces adversarial examples using only 3-20 sampling steps from unconditional diffusion models. Our key finding is that perturbations injected during the mixing step interval achieve comparable attack effectiveness without processing all timesteps. We develop a selective RGB channel strategy that applies attention maps to the red channel while using GradCAM-guided perturbations on green and blue channels. This design preserves image structure while maximizing misclassification in target models. TAIGen maintains visual quality with PSNR above 30 dB across all tested datasets. On ImageNet with VGGNet as source, TAIGen achieves 70.6% success against ResNet, 80.8% against MNASNet, and 97.8% against ShuffleNet. The method generates adversarial examples 10x faster than existing diffusion-based attacks. Our method achieves the lowest robust accuracy, indicating it is the most impactful attack as the defense mechanism is least successful in purifying the images generated by TAIGen.
[60] Reversible Unfolding Network for Concealed Visual Perception with Generative Refinement
Chunming He,Fengyang Xiao,Rihan Zhang,Chengyu Fang,Deng-Ping Fan,Sina Farsiu
Main category: cs.CV
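TL;DR: RUN++ proposes a multi-stage network that combines reversible modeling with a diffusion model to improve concealed visual perception, achieving more accurate segmentation with higher computational efficiency.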
Details
Motivation: Existing CVP methods usually confine reversible strategies to the mask domain and overlook the potential of the RGB domain. A new approach is therefore needed to exploit the synergy between the two domains and effectively reduce segmentation uncertainty. Method: RUN++ formulates the CVP task as a mathematical optimization problem and unfolds the iterative solution into a multi-stage deep network. It combines reversible modeling with a diffusion model: the CORE, CARE, and FINE modules address the mask and RGB domains, and a Bernoulli diffusion model is introduced for local refinement. Result: RUN++ effectively reduces false positives and false negatives in concealed visual perception tasks, markedly improves robustness under real-world degradations, and lowers computational cost by applying diffusion only locally. Conclusion: RUN++ presents a multi-level optimization framework that combines reversible modeling and diffusion for concealed visual perception, handles ambiguous regions efficiently, and introduces a new paradigm for building robust CVP systems. Abstract: Existing methods for concealed visual perception (CVP) often leverage reversible strategies to decrease uncertainty, yet these are typically confined to the mask domain, leaving the potential of the RGB domain underexplored. To address this, we propose a reversible unfolding network with generative refinement, termed RUN++. Specifically, RUN++ first formulates the CVP task as a mathematical optimization problem and unfolds the iterative solution into a multi-stage deep network. This approach provides a principled way to apply reversible modeling across both mask and RGB domains while leveraging a diffusion model to resolve the resulting uncertainty. Each stage of the network integrates three purpose-driven modules: a Concealed Object Region Extraction (CORE) module applies reversible modeling to the mask domain to identify core object regions; a Context-Aware Region Enhancement (CARE) module extends this principle to the RGB domain to foster better foreground-background separation; and a Finetuning Iteration via Noise-based Enhancement (FINE) module provides a final refinement. The FINE module introduces a targeted Bernoulli diffusion model that refines only the uncertain regions of the segmentation mask, harnessing the generative power of diffusion for fine-detail restoration without the prohibitive computational cost of a full-image process. This unique synergy, where the unfolding network provides a strong uncertainty prior for the diffusion model, allows RUN++ to efficiently direct its focus toward ambiguous areas, significantly mitigating false positives and negatives. 
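The idea of refining only the uncertain regions of a mask can be illustrated with a minimal sketch. The abstract does not state how RUN++ decides which pixels are uncertain, so the threshold rule below (probability within `tau` of 0.5) and the names `uncertain_mask`, `refine_uncertain`, and `refine_fn` are assumptions for illustration only; `refine_fn` stands in for whatever generative refiner is used.

```python
import numpy as np

def uncertain_mask(prob_map, tau=0.1):
    """Flag pixels whose foreground probability is within tau of 0.5 (assumed rule)."""
    return np.abs(prob_map - 0.5) < tau

def refine_uncertain(prob_map, refine_fn, tau=0.1):
    """Apply refine_fn only to the flagged pixels; confident pixels are left untouched."""
    mask = uncertain_mask(prob_map, tau)
    refined = prob_map.copy()
    refined[mask] = refine_fn(prob_map[mask])  # refiner sees only the uncertain values
    return refined, mask

# Toy example: a hard threshold plays the role of the (hypothetical) refiner.
probs = np.array([[0.05, 0.50],
                  [0.45, 0.95]])
refined, mask = refine_uncertain(probs, lambda p: (p >= 0.5).astype(p.dtype))
```

Because the refiner touches only the flagged pixels, its cost scales with the size of the uncertain region rather than with the full image.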
Furthermore, we introduce a new paradigm for building robust CVP systems that remain effective under real-world degradations and extend this concept into a broader bi-level optimization framework.
[61] GasTwinFormer: A Hybrid Vision Transformer for Livestock Methane Emission Segmentation and Dietary Classification in Optical Gas Imaging
Toqi Tahamid Sarker,Mohamed Embaby,Taminul Islam,Amer AbuGhazaleh,Khaled R Ahmed
Main category: cs.CV
TL;DR: GasTwinFormer is an efficient hybrid vision transformer that enables real-time methane emission monitoring and dietary classification.
Details
Motivation: Livestock methane emissions account for 32% of human-caused methane, so automated monitoring is needed to help mitigate climate change. Method: Introduces GasTwinFormer, a hybrid vision transformer that uses a Mix Twin encoder and an LR-ASPP decoder for multi-scale feature aggregation. Result: Achieves 74.47% mIoU and 83.63% mF1 on segmentation, and 100% accuracy on dietary classification. Conclusion: As a hybrid vision transformer, GasTwinFormer achieves real-time methane emission monitoring and dietary classification, offering a practical solution for livestock emission monitoring. Abstract: Livestock methane emissions represent 32% of human-caused methane production, making automated monitoring critical for climate mitigation strategies. We introduce GasTwinFormer, a hybrid vision transformer for real-time methane emission segmentation and dietary classification in optical gas imaging through a novel Mix Twin encoder alternating between spatially-reduced global attention and locally-grouped attention mechanisms. Our architecture incorporates a lightweight LR-ASPP decoder for multi-scale feature aggregation and enables simultaneous methane segmentation and dietary classification in a unified framework. We contribute the first comprehensive beef cattle methane emission dataset using OGI, containing 11,694 annotated frames across three dietary treatments. GasTwinFormer achieves 74.47% mIoU and 83.63% mF1 for segmentation while maintaining exceptional efficiency with only 3.348M parameters, 3.428G FLOPs, and 114.9 FPS inference speed. Additionally, our method achieves perfect dietary classification accuracy (100%), demonstrating the effectiveness of leveraging diet-emission correlations. Extensive ablation studies validate each architectural component, establishing GasTwinFormer as a practical solution for real-time livestock emission monitoring. Please see our project page at gastwinformer.github.io.
[62] CurveFlow: Curvature-Guided Flow Matching for Image Generation
Yan Luo,Drake Du,Hao Huang,Yi Fang,Mengyu Wang
Main category: cs.CV
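TL;DR: CurveFlow is a new flow matching framework that introduces curvature-guided non-linear trajectories to improve semantic consistency and image quality in text-to-image generation.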
Details
Motivation: Existing rectified flow models with linear trajectories may force generation through low-probability regions of the data manifold, which harms the semantic alignment between generated images and text. Method: CurveFlow introduces a curvature regularization technique to learn smooth, non-linear trajectories. Result: Experiments on MS COCO 2014 and 2017 show that CurveFlow performs strongly in text-to-image generation, especially on semantic consistency metrics such as BLEU, METEOR, ROUGE, and CLAIR. Conclusion: By introducing curvature-guided non-linear trajectories, CurveFlow improves semantic consistency and image quality in text-to-image generation. Abstract: Existing rectified flow models are based on linear trajectories between data and noise distributions. This linearity enforces zero curvature, which can inadvertently force the image generation process through low-probability regions of the data manifold. A key question remains underexplored: how does the curvature of these trajectories correlate with the semantic alignment between generated images and their corresponding captions, i.e., instructional compliance? To address this, we introduce CurveFlow, a novel flow matching framework designed to learn smooth, non-linear trajectories by directly incorporating curvature guidance into the flow path. Our method features a robust curvature regularization technique that penalizes abrupt changes in the trajectory's intrinsic dynamics. Extensive experiments on MS COCO 2014 and 2017 demonstrate that CurveFlow achieves state-of-the-art performance in text-to-image generation, significantly outperforming both standard rectified flow variants and other non-linear baselines like Rectified Diffusion. The improvements are especially evident in semantic consistency metrics such as BLEU, METEOR, ROUGE, and CLAIR. This confirms that our curvature-aware modeling substantially enhances the model's ability to faithfully follow complex instructions while simultaneously maintaining high image quality. The code is made publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/CurveFlow.
[63] HiRQA: Hierarchical Ranking and Quality Alignment for Opinion-Unaware Image Quality Assessment
Vaishnav Ramesh,Haining Wang,Md Jahidul Islam
Main category: cs.CV
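TL;DR: This paper proposes HiRQA, a no-reference image quality assessment method built on a ranking and contrastive learning framework, which achieves state-of-the-art performance on synthetic and authentic data and offers a lightweight variant for real-time deployment.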
Details
Motivation: Despite significant progress in no-reference image quality assessment, dataset biases and reliance on subjective labels still hinder generalization. Method: Proposes HiRQA, which combines ranking and contrastive learning and introduces a higher-order ranking loss, an embedding distance loss, and a training-time contrastive alignment loss. Result: Experiments on synthetic and authentic benchmarks validate HiRQA's state-of-the-art performance, strong generalization, and scalability. Conclusion: HiRQA is a self-supervised, opinion-unaware framework that, trained on synthetic distortions, generalizes effectively to authentic degradations, and it offers a lightweight variant, HiRQA-S, for real-time deployment. Abstract: Despite significant progress in no-reference image quality assessment (NR-IQA), dataset biases and reliance on subjective labels continue to hinder their generalization performance. We propose HiRQA (Hierarchical Ranking and Quality Alignment), a self-supervised, opinion-unaware framework that offers a hierarchical, quality-aware embedding through a combination of ranking and contrastive learning. Unlike prior approaches that depend on pristine references or auxiliary modalities at inference time, HiRQA predicts quality scores using only the input image. We introduce a novel higher-order ranking loss that supervises quality predictions through relational ordering across distortion pairs, along with an embedding distance loss that enforces consistency between feature distances and perceptual differences. A training-time contrastive alignment loss, guided by structured textual prompts, further enhances the learned representation. Trained only on synthetic distortions, HiRQA generalizes effectively to authentic degradations, as demonstrated through evaluation on various distortions such as lens flare, haze, motion blur, and low-light conditions. For real-time deployment, we introduce HiRQA-S, a lightweight variant with an inference time of only 3.5 ms per image. Extensive experiments across synthetic and authentic benchmarks validate HiRQA's state-of-the-art (SOTA) performance, strong generalization ability, and scalability.
[64] Reliable Multi-view 3D Reconstruction for `Just-in-time' Edge Environments
Md. Nurul Absur,Abhinav Kumar,Swastik Brahma,Saptarshi Debroy
Main category: cs.CV
TL;DR: This paper proposes a novel strategy for reliable multi-view 3D reconstruction in dynamic edge environments by using a portfolio theory inspired approach and demonstrates its effectiveness through experiments.
Details
Motivation: The motivation stems from the need for reliable multi-view 3D reconstruction applications in critical scenarios like emergency response and public safety, where edge environments are necessary but prone to disruptions. Method: The methodology involves a portfolio theoretic optimization problem solved using a genetic algorithm, which is tested using publicly available and customized 3D datasets. Result: The proposed camera selection strategy demonstrates benefits in guaranteeing reliable 3D reconstruction against traditional baseline strategies under spatiotemporal disruptions. Conclusion: The paper concludes that their proposed portfolio theory inspired edge resource management strategy can guarantee reliable multi-view 3D reconstruction even when cameras are prone to spatiotemporally correlated disruptions. Abstract: Multi-view 3D reconstruction applications are revolutionizing critical use cases that require rapid situational-awareness, such as emergency response, tactical scenarios, and public safety. In many cases, their near-real-time latency requirements and ad-hoc needs for compute resources necessitate adoption of `Just-in-time' edge environments where the system is set up on the fly to support the applications during the mission lifetime. However, reliability issues can arise from the inherent dynamism and operational adversities of such edge environments, resulting in spatiotemporally correlated disruptions that impact the camera operations, which can lead to sustained degradation of reconstruction quality. In this paper, we propose a novel portfolio theory inspired edge resource management strategy for reliable multi-view 3D reconstruction against possible system disruptions. Our proposed methodology can guarantee reconstruction quality satisfaction even when the cameras are prone to spatiotemporally correlated disruptions. 
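To make the portfolio-theory framing concrete, here is a minimal sketch of a mean-variance score for a camera subset. The abstract does not give the paper's objective or its genetic algorithm, so everything here is an assumption for illustration: `mu` is a hypothetical per-camera expected quality contribution, `cov` a hypothetical covariance that encodes correlated disruptions, and an exhaustive search over subsets replaces the genetic algorithm.

```python
from itertools import combinations

import numpy as np

def portfolio_score(subset, mu, cov, risk_aversion):
    """Mean-variance score: expected quality minus a penalty on its variance."""
    idx = np.array(subset)
    expected = mu[idx].sum()
    variance = cov[np.ix_(idx, idx)].sum()  # includes cross-camera covariance terms
    return expected - risk_aversion * variance

def select_cameras(mu, cov, k, risk_aversion=1.0):
    """Pick the k-camera subset with the best score (brute force, small inputs only)."""
    candidates = combinations(range(len(mu)), k)
    return max(candidates, key=lambda s: portfolio_score(s, mu, cov, risk_aversion))

# Cameras 0 and 1 are individually best but fail together; camera 2 is independent.
mu = np.array([1.0, 1.0, 0.9])
cov = np.array([[0.040, 0.038, 0.0],
                [0.038, 0.040, 0.0],
                [0.0,   0.0,   0.040]])
best = select_cameras(mu, cov, k=2, risk_aversion=5.0)
```

In this toy setting the score favors pairing one strong camera with the independent one over pairing the two strongly correlated cameras, which is the diversification effect that portfolio theory is meant to capture.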
The portfolio theoretic optimization problem is solved using a genetic algorithm that converges quickly for realistic system settings. Using publicly available and customized 3D datasets, we demonstrate the proposed camera selection strategy's benefits in guaranteeing reliable 3D reconstruction against traditional baseline strategies, under spatiotemporal disruptions.
[65] XDR-LVLM: An Explainable Vision-Language Large Model for Diabetic Retinopathy Diagnosis
Masato Ito,Kaito Tanaka,Keisuke Matsuda,Aya Nakayama
Main category: cs.CV
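TL;DR: This paper proposes XDR-LVLM, an explainable framework for diabetic retinopathy diagnosis that combines vision and language models to deliver high-precision diagnoses with natural-language explanations.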
Details
Motivation: Deep learning models for diabetic retinopathy detection suffer from a black-box problem; their lack of transparency and interpretability hinders clinical adoption. Method: Proposes XDR-LVLM, a new framework that combines a medical vision encoder, an LVLM core, multi-task prompt engineering, and multi-stage fine-tuning. Result: Experiments on the DDR dataset show that XDR-LVLM achieves a balanced accuracy of 84.55% and an F1 score of 79.92% for disease diagnosis, and performs strongly on concept detection (77.95% BACC, 66.88% F1). Conclusion: XDR-LVLM effectively improves the transparency and interpretability of diabetic retinopathy diagnosis, providing a reliable and explainable automated tool for clinical diagnosis. Abstract: Diabetic Retinopathy (DR) is a major cause of global blindness, necessitating early and accurate diagnosis. While deep learning models have shown promise in DR detection, their black-box nature often hinders clinical adoption due to a lack of transparency and interpretability. To address this, we propose XDR-LVLM (eXplainable Diabetic Retinopathy Diagnosis with LVLM), a novel framework that leverages Vision-Language Large Models (LVLMs) for high-precision DR diagnosis coupled with natural language-based explanations. XDR-LVLM integrates a specialized Medical Vision Encoder, an LVLM Core, and employs Multi-task Prompt Engineering and Multi-stage Fine-tuning to deeply understand pathological features within fundus images and generate comprehensive diagnostic reports. These reports explicitly include DR severity grading, identification of key pathological concepts (e.g., hemorrhages, exudates, microaneurysms), and detailed explanations linking observed features to the diagnosis. Extensive experiments on the Diabetic Retinopathy (DDR) dataset demonstrate that XDR-LVLM achieves state-of-the-art performance, with a Balanced Accuracy of 84.55% and an F1 Score of 79.92% for disease diagnosis, and superior results for concept detection (77.95% BACC, 66.88% F1). Furthermore, human evaluations confirm the high fluency, accuracy, and clinical utility of the generated explanations, showcasing XDR-LVLM's ability to bridge the gap between automated diagnosis and clinical needs by providing robust and interpretable insights.
[66] MeSS: City Mesh-Guided Outdoor Scene Generation with Cross-View Consistent Diffusion
Xuyang Chen,Zhijun Zhai,Kaixuan Zhou,Zengmao Wang,Jianan He,Dong Wang,Yanfeng Zhang,mingwei Sun,Rüdiger Westermann,Konrad Schindler,Liqiu Meng
Main category: cs.CV
TL;DR: This paper introduces MeSS, a novel method for generating realistic 3D urban scenes from mesh models by improving cross-view consistency using enhanced diffusion models and 3D reconstruction techniques.
Details
Motivation: Mesh models lack realistic textures, limiting their application in virtual urban navigation and autonomous driving. Existing diffusion models struggle with either 3D scene generation or maintaining visual consistency along predefined camera paths. Method: MeSS enhances image diffusion models to improve cross-view consistency through a three-stage pipeline: generating sparse views with Cascaded Outpainting ControlNets, propagating denser intermediate views via AGInpaint, and eliminating visual inconsistencies with GCAlign. Concurrently, a 3D Gaussian Splatting scene is reconstructed. Result: High-quality, geometrically aligned, and style-consistent outdoor scenes generated from city mesh models, supporting diverse style rendering and improved 3D scene synthesis. Conclusion: The proposed MeSS method outperforms existing approaches in generating high-quality, style-consistent outdoor scenes using city mesh models and allows for diverse style rendering through relighting and style transfer techniques. Abstract: Mesh models have become increasingly accessible for numerous cities; however, the lack of realistic textures restricts their application in virtual urban navigation and autonomous driving. To address this, this paper proposes MeSS (Mesh-based Scene Synthesis) for generating high-quality, style-consistent outdoor scenes with city mesh models serving as the geometric prior. While image and video diffusion models can leverage spatial layouts (such as depth maps or HD maps) as control conditions to generate street-level perspective views, they are not directly applicable to 3D scene generation. Video diffusion models excel at synthesizing consistent view sequences that depict scenes but often struggle to adhere to predefined camera paths or align accurately with rendered control videos. In contrast, image diffusion models, though unable to guarantee cross-view visual consistency, can produce more geometry-aligned results when combined with ControlNet. 
Building on this insight, our approach enhances image diffusion models by improving cross-view consistency. The pipeline comprises three key stages: first, we generate geometrically consistent sparse views using Cascaded Outpainting ControlNets; second, we propagate denser intermediate views via a component dubbed AGInpaint; and third, we globally eliminate visual inconsistencies (e.g., varying exposure) using the GCAlign module. Concurrently with generation, a 3D Gaussian Splatting (3DGS) scene is reconstructed by initializing Gaussian balls on the mesh surface. Our method outperforms existing approaches in both geometric alignment and generation quality. Once synthesized, the scene can be rendered in diverse styles through relighting and style transfer techniques.
[67] Adversarial Agent Behavior Learning in Autonomous Driving Using Deep Reinforcement Learning
Arjun Srinivasan,Anubhav Paras,Aniket Bera
Main category: cs.CV
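TL;DR: This paper proposes a new adversarial behavior learning method for discovering and triggering failure scenarios in rule-based agents, and validates its effectiveness in reducing the agents' cumulative reward.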
Details
Motivation: In safety-critical applications such as autonomous driving, correctly modeling the behavior of rule-based agents is essential. Several behavior modeling strategies and IDM models are currently used to model surrounding agents, but potential failure scenarios remain understudied. Method: Proposes a method for learning adversarial behavior and applies it to rule-based agents to trigger failure scenarios. Result: Evaluation of the adversarial agent shows a significant decrease in the cumulative reward of the rule-based agents. Conclusion: The paper presents a learning-based method for deriving adversarial behavior for rule-based agents and demonstrates through evaluation that it reduces their cumulative reward. Abstract: Existing approaches in reinforcement learning train an agent to learn desired optimal behavior in an environment with rule-based surrounding agents. In safety-critical applications such as autonomous driving, it is crucial that the rule-based agents are modelled properly. Several behavior modelling strategies and IDM models are used currently to model the surrounding agents. We present a learning-based method to derive the adversarial behavior for the rule-based agents to cause failure scenarios. We evaluate our adversarial agent against all the rule-based agents and show the decrease in cumulative reward.
[68] DyMorph-B2I: Dynamic and Morphology-Guided Binary-to-Instance Segmentation for Renal Pathology
Leiyue Zhao,Yuechen Yang,Yanfan Zhu,Haichun Yang,Yuankai Huo,Paul D. Simonson,Kenji Ikemura,Mert R. Sabuncu,Yihe Yang,Ruining Deng
Main category: cs.CV
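TL;DR: This study proposes DyMorph-B2I for more precise morphological quantification in renal pathology, significantly outperforming conventional methods.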
Details
Motivation: Existing datasets and automated methods provide only binary masks, limiting precise morphological quantification of functional units in renal pathology. Method: Integrates watershed, skeletonization, and morphological operations, and achieves dynamic segmentation through adaptive geometric refinement and hyperparameter tuning. Result: DyMorph-B2I outperforms conventional methods on instance segmentation and separates adherent and heterogeneous structures more accurately. Conclusion: DyMorph-B2I is an effective binary-to-instance segmentation pipeline that improves the accuracy of morphometric analysis in renal pathology. Abstract: Accurate morphological quantification of renal pathology functional units relies on instance-level segmentation, yet most existing datasets and automated methods provide only binary (semantic) masks, limiting the precision of downstream analyses. Although classical post-processing techniques such as watershed, morphological operations, and skeletonization are often used to separate semantic masks into instances, their individual effectiveness is constrained by the diverse morphologies and complex connectivity found in renal tissue. In this study, we present DyMorph-B2I, a dynamic, morphology-guided binary-to-instance segmentation pipeline tailored for renal pathology. Our approach integrates watershed, skeletonization, and morphological operations within a unified framework, complemented by adaptive geometric refinement and customizable hyperparameter tuning for each class of functional unit. Through systematic parameter optimization, DyMorph-B2I robustly separates adherent and heterogeneous structures present in binary masks. Experimental results demonstrate that our method outperforms individual classical approaches and naïve combinations, enabling superior instance separation and facilitating more accurate morphometric analysis in renal pathology workflows. The pipeline is publicly available at: https://github.com/ddrrnn123/DyMorph-B2I.
[69] STAGNet: A Spatio-Temporal Graph and LSTM Framework for Accident Anticipation
Vipooshan Vipulananthan,Kumudu Mohottala,Kavindu Chinthana,Nimsara Paramulla,Charith D Chitraranjan
Main category: cs.CV
TL;DR: This study develops STAGNet, a model for anticipating accidents from dash-cam video that outperforms existing methods.
Details
Motivation: To find a more cost-effective and more easily deployable solution for accident prediction in order to improve road safety. Method: The STAGNet model incorporates better spatio-temporal features and aggregates them through a recurrent network to improve accident prediction from dash-cam videos. Result: Experiments show that STAGNet achieves higher average precision and mean time-to-collision values than existing methods on three publicly available datasets. Conclusion: STAGNet outperforms existing methods at accident prediction and performs well across different datasets. Abstract: Accident prediction and timely warnings play a key role in improving road safety by reducing the risk of injury to road users and minimizing property damage. Advanced Driver Assistance Systems (ADAS) are designed to support human drivers and are especially useful when they can anticipate potential accidents before they happen. While many existing systems depend on a range of sensors such as LiDAR, radar, and GPS, relying solely on dash-cam video input presents a more challenging but more cost-effective and easily deployable solution. In this work, we incorporate better spatio-temporal features and aggregate them through a recurrent network to improve upon state-of-the-art graph neural networks for predicting accidents from dash-cam videos. Experiments using three publicly available datasets show that our proposed STAGNet model achieves higher average precision and mean time-to-collision values than previous methods, both when cross-validated on a given dataset and when trained and tested on different datasets.
[70] Collaborative Multi-Modal Coding for High-Quality 3D Generation
Ziang Cao,Zhaoxi Chen,Liang Pan,Ziwei Liu
Main category: cs.CV
TL;DR: TriMM is a novel 3D-native generative model that leverages multi-modal data (RGB, RGBD, point clouds) through collaborative coding and auxiliary supervision to produce high-quality 3D assets, performing well even with small datasets.
Details
Motivation: To overcome the limitations of existing 3D generative models that focus on single-modality paradigms or 3D structures, limiting the use of available multi-modal data. Method: TriMM uses collaborative multi-modal coding, auxiliary 2D and 3D supervision, and a triplane latent diffusion model to generate detailed 3D assets. Result: TriMM demonstrates competitive performance with models trained on large datasets, generates high-quality 3D assets with enhanced texture and geometry, and shows feasibility in using other multi-modal datasets. Conclusion: TriMM, a 3D-native generative model, successfully integrates multi-modal data like RGB, RGBD, and point clouds to generate high-quality 3D assets, showing effectiveness even with limited training data. Abstract: 3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms-thus overlooking the complementary benefits of multi-modality data-or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision are introduced to raise the robustness and performance of multi-modal coding. 
3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both the texture and the geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves competitive performance with models trained on large-scale datasets, despite utilizing a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.
[71] Center-Oriented Prototype Contrastive Clustering
Shihao Dong,Xiaotong Zhou,Yuhui Zheng,Huiying Xu,Xinzhong Zhu
Main category: cs.CV
TL;DR: This paper proposes a novel contrastive clustering framework that improves prototype calculation and consistency learning, achieving better performance than existing methods.
Details
Motivation: Existing contrastive clustering methods face inter-class conflicts and inaccuracies in hard prototype calculation, which this paper aims to address. Method: The method introduces a soft prototype contrastive module that calculates prototype weights based on sample-to-cluster probability and a dual consistency learning module that aligns transformations and neighborhoods to enhance feature invariance and intra-cluster compactness. Result: Extensive experiments on five datasets demonstrate the effectiveness of the proposed method compared to state-of-the-art approaches. Conclusion: The proposed center-oriented prototype contrastive clustering framework, including a soft prototype contrastive module and a dual consistency learning module, effectively addresses inter-class conflicts and prototype deviation in existing clustering methods. Abstract: Contrastive learning is widely used in clustering tasks due to its discriminative representation. However, the conflict problem between classes is difficult to solve effectively. Existing methods try to solve this problem through prototype contrast, but there is a deviation between the calculation of hard prototypes and the true cluster center. To address this problem, we propose a center-oriented prototype contrastive clustering framework, which consists of a soft prototype contrastive module and a dual consistency learning module. In short, the soft prototype contrastive module uses the probability that the sample belongs to the cluster center as a weight to calculate the prototype of each category, while avoiding inter-class conflicts and reducing prototype drift. The dual consistency learning module aligns different transformations of the same sample and the neighborhoods of different samples respectively, ensuring that the features have transformation-invariant semantic information and compact intra-cluster distribution, while providing reliable guarantees for the calculation of prototypes. 
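The soft prototype idea described above, where each sample contributes to a cluster prototype in proportion to its assignment probability, can be sketched as a probability-weighted mean. This is a generic illustration rather than the authors' implementation; the function name `soft_prototypes` and the array shapes are assumptions.

```python
import numpy as np

def soft_prototypes(features, probs, eps=1e-12):
    """Probability-weighted cluster means.

    features: (n_samples, dim) embeddings.
    probs:    (n_samples, n_clusters) soft assignments, each row summing to 1.
    Returns an (n_clusters, dim) array of prototypes.
    """
    weighted_sum = probs.T @ features        # (n_clusters, dim)
    mass = probs.sum(axis=0)[:, None]        # total assignment weight per cluster
    return weighted_sum / (mass + eps)       # eps guards against empty clusters

# With one-hot assignments the soft prototype reduces to the ordinary cluster mean.
features = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
one_hot = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
protos = soft_prototypes(features, one_hot)
```

With genuinely soft assignments, samples near a cluster boundary pull each prototype only weakly, which is one plausible reading of how the weighting reduces prototype drift relative to hard assignment.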
Extensive experiments on five datasets show that the proposed method is effective compared to the SOTA. Our code is published on https://github.com/LouisDong95/CPCC.
[72] AeroDuo: Aerial Duo for UAV-based Vision and Language Navigation
Ruipu Wu,Yige Zhang,Jinyu Chen,Linjiang Huang,Shifeng Zhang,Xu Zhou,Liang Wang,Si Liu
Main category: cs.CV
TL;DR: This paper introduces DuAl-VLN, a new UAV navigation task where two UAVs work together—one for broad environmental understanding and the other for precise navigation—supported by a new dataset and framework for improved performance.
Details
Motivation: The motivation stems from the challenges in existing UAV-VLN tasks, such as extended trajectories, complex UAV maneuverability, and the need for overly detailed instructions. The study aims to leverage UAVs' high mobility to provide multi-grained perspectives while maintaining manageable learning complexity. Method: The researchers introduced the DuAl-VLN task and developed the HaL-13k dataset with 13,838 trajectories. They also proposed the AeroDuo framework, where a high-altitude UAV uses a multimodal large language model for reasoning, and a low-altitude UAV uses a lightweight policy for navigation. Result: The result includes the creation of the HaL-13k dataset and the development of the AeroDuo framework, which enables efficient UAV collaboration with minimal information exchange. Conclusion: The study concludes that through the collaboration of two UAVs at different altitudes, the DuAl-VLN task achieves efficient and precise navigation using both broad environmental reasoning and detailed target navigation. Abstract: Aerial Vision-and-Language Navigation (VLN) is an emerging task that enables Unmanned Aerial Vehicles (UAVs) to navigate outdoor environments using natural language instructions and visual cues. However, due to the extended trajectories and complex maneuverability of UAVs, achieving reliable UAV-VLN performance is challenging and often requires human intervention or overly detailed instructions. To harness the advantages of UAVs' high mobility, which could provide multi-grained perspectives, while maintaining a manageable motion space for learning, we introduce a novel task called Dual-Altitude UAV Collaborative VLN (DuAl-VLN). In this task, two UAVs operate at distinct altitudes: a high-altitude UAV responsible for broad environmental reasoning, and a low-altitude UAV tasked with precise navigation. 
To support the training and evaluation of the DuAl-VLN, we construct the HaL-13k, a dataset comprising 13,838 collaborative high-low UAV demonstration trajectories, each paired with target-oriented language instructions. This dataset includes both unseen maps and an unseen object validation set to systematically evaluate the model's generalization capabilities across novel environments and unfamiliar targets. To consolidate their complementary strengths, we propose a dual-UAV collaborative VLN framework, AeroDuo, where the high-altitude UAV integrates a multimodal large language model (Pilot-LLM) for target reasoning, while the low-altitude UAV employs a lightweight multi-stage policy for navigation and target grounding. The two UAVs work collaboratively and only exchange minimal coordinate information to ensure efficiency.
[73] Pretrained Diffusion Models Are Inherently Skipped-Step Samplers
Wenju Xu
Main category: cs.CV
TL;DR: The paper introduces skipped-step sampling, a mechanism that bypasses intermediate denoising steps in diffusion models, achieving accelerated sampling while maintaining quality.
Details
Motivation: To understand whether the original diffusion process can achieve the same efficiency as non-Markovian processes like DDIM without resorting to them. Method: A skipped-step sampling mechanism that bypasses multiple intermediate denoising steps in the iterative generation process. Result: Experiments show that the proposed method achieves high-quality generation with significantly reduced sampling steps on popular pretrained diffusion models. Conclusion: Skipped-step sampling is an intrinsic property of pretrained diffusion models and can be integrated with DDIM for enhanced generation. Abstract: Diffusion models have been achieving state-of-the-art results across various generation tasks. However, a notable drawback is their sequential generation process, requiring long-sequence step-by-step generation. Existing methods, such as DDIM, attempt to reduce sampling steps by constructing a class of non-Markovian diffusion processes that maintain the same training objective. However, there remains a gap in understanding whether the original diffusion process can achieve the same efficiency without resorting to non-Markovian processes. In this paper, we provide an affirmative answer and introduce skipped-step sampling, a mechanism that bypasses multiple intermediate denoising steps in the iterative generation process, in contrast with the traditional step-by-step refinement of standard diffusion inference. Crucially, we demonstrate that this skipped-step sampling mechanism is derived from the same training objective as the standard diffusion model, indicating that accelerated sampling through skipped steps in a Markovian manner is an intrinsic property of pretrained diffusion models. Additionally, we propose an enhanced generation method by integrating our accelerated sampling technique with DDIM. 
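As background for how steps can be skipped without leaving the Markovian forward process, the sketch below computes the Gaussian posterior q(x_s | x_t, x_0) for an arbitrary earlier step s < t under the standard DDPM forward process. This is the textbook strided posterior, shown under the assumption of a linear beta schedule; it is not claimed to be the exact sampler derived in the paper, and `skipped_posterior` is a hypothetical name.

```python
import numpy as np

def skipped_posterior(x_t, x0_hat, abar_t, abar_s):
    """Mean and variance of q(x_s | x_t, x_0) for s < t.

    abar_t, abar_s are cumulative products of (1 - beta) up to steps t and s.
    Setting s = t - 1 recovers the usual single-step DDPM posterior.
    """
    alpha_ts = abar_t / abar_s  # effective alpha for the jump from s to t
    coef_x0 = np.sqrt(abar_s) * (1.0 - alpha_ts) / (1.0 - abar_t)
    coef_xt = np.sqrt(alpha_ts) * (1.0 - abar_s) / (1.0 - abar_t)
    mean = coef_x0 * x0_hat + coef_xt * x_t
    var = (1.0 - abar_s) * (1.0 - alpha_ts) / (1.0 - abar_t)
    return mean, var

# Assumed linear schedule with 1000 steps; jump 50 steps at once.
betas = np.linspace(1e-4, 0.02, 1000)
abar = np.cumprod(1.0 - betas)
t, s = 999, 949
x0_hat = np.ones(4)                     # stand-in for the model's x_0 prediction
x_t = np.sqrt(abar[t]) * x0_hat         # noiseless x_t, i.e. the mean of q(x_t | x_0)
mean, var = skipped_posterior(x_t, x0_hat, abar[t], abar[s])
```

A quick sanity check on the formulas is that marginalizing the posterior over x_t recovers q(x_s | x_0): the mean becomes sqrt(abar_s) * x_0 and the total variance becomes 1 - abar_s.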
Extensive experiments on popular pretrained diffusion models, including the OpenAI ADM, Stable Diffusion, and Open Sora models, show that our method achieves high-quality generation with significantly reduced sampling steps.
[74] Comp-X: On Defining an Interactive Learned Image Compression Paradigm With Expert-driven LLM Agent
Yixin Gao,Xin Li,Xiaohan Pan,Runsen Feng,Bingchen Li,Yunpeng Qi,Yiting Lu,Zhengxue Cheng,Zhibo Chen,Jörn Ostermann
Main category: cs.CV
TL;DR: Comp-X is an innovatively interactive image compression paradigm that uses an LLM-powered agent to unify multiple coding modes, enabling efficient and user-friendly compression while maintaining performance.
Details
Motivation: Commonly used image codecs suffer from limited coding modes and rely on manual mode selection, making them unfriendly to non-expert users. This limitation motivates the development of Comp-X, an intelligently interactive image compression paradigm. Method: The Comp-X paradigm incorporates a multi-functional coding framework, an interactive coding agent powered by an LLM, and a dedicated benchmark called IIC-bench for evaluation. The interactive agent utilizes augmented in-context learning with expert feedback to understand and respond to coding requests. Result: Extensive experimental results demonstrate that Comp-X can efficiently understand coding requests and achieve impressive textual interaction capability while maintaining comparable compression performance. Conclusion: Comp-X provides a promising avenue for AGI in image compression by maintaining comparable compression performance with a single coding framework while efficiently understanding coding requests and offering impressive textual interaction capability. Abstract: We present Comp-X, the first intelligently interactive image compression paradigm empowered by the impressive reasoning capability of a large language model (LLM) agent. Notably, commonly used image codecs usually suffer from limited coding modes and rely on manual mode selection by engineers, making them unfriendly to non-expert users. To overcome this, we advance the evolution of the image coding paradigm by introducing three key innovations: (i) a multi-functional coding framework, which unifies different coding modes for various objectives/requirements, including human-machine perception, variable coding, and spatial bit allocation, into one framework; (ii) an interactive coding agent, where we propose an augmented in-context learning method with coding expert feedback to teach the LLM agent how to understand the coding request, select the mode, and use the coding tools; (iii) IIC-bench, the first dedicated benchmark comprising diverse user requests and the corresponding annotations from coding experts, which is systematically designed for intelligently interactive image compression evaluation. Extensive experimental results demonstrate that our proposed Comp-X can understand the coding requests efficiently and achieve impressive textual interaction capability. Meanwhile, it can maintain comparable compression performance even with a single coding framework, providing a promising avenue for artificial general intelligence (AGI) in image compression.
[75] Normal and Abnormal Pathology Knowledge-Augmented Vision-Language Model for Anomaly Detection in Pathology Images
Jinsol Song,Jiamu Wang,Anh Tien Nguyen,Keunho Byeon,Sangjeong Ahn,Sung Hak Lee,Jin Tae Kwak
Main category: cs.CV
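TL;DR: Ano-NAViLa is a lightweight vision-language model that combines normal and abnormal pathology knowledge for anomaly detection in pathology images, offering high accuracy and interpretability.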
Details
Motivation: Existing anomaly detection methods, designed primarily for industrial settings, face challenges in pathology such as computational constraints, diverse tissue structures, and a lack of interpretability, so a new approach is needed to improve anomaly detection in pathology images. Method: Ano-NAViLa builds on a pre-trained vision-language model with a lightweight trainable MLP and incorporates both normal and abnormal pathology knowledge to improve accuracy and robustness. Result: Evaluated on two lymph node datasets from different organs, Ano-NAViLa achieves state-of-the-art performance in anomaly detection and localization, outperforming existing models. Conclusion: By combining normal and abnormal pathology knowledge, Ano-NAViLa achieves state-of-the-art anomaly detection and localization in pathology images, improves accuracy and robustness, and provides interpretability through image-text associations. Abstract: Anomaly detection in computational pathology aims to identify rare and scarce anomalies where disease-related data are often limited or missing. Existing anomaly detection methods, primarily designed for industrial settings, face limitations in pathology due to computational constraints, diverse tissue structures, and lack of interpretability. To address these challenges, we propose Ano-NAViLa, a Normal and Abnormal pathology knowledge-augmented Vision-Language model for Anomaly detection in pathology images. Ano-NAViLa is built on a pre-trained vision-language model with a lightweight trainable MLP. By incorporating both normal and abnormal pathology knowledge, Ano-NAViLa enhances accuracy and robustness to variability in pathology images and provides interpretability through image-text associations. Evaluated on two lymph node datasets from different organs, Ano-NAViLa achieves state-of-the-art performance in anomaly detection and localization, outperforming competing models.
[76] RATopo: Improving Lane Topology Reasoning via Redundancy Assignment
Han Li,Shaofei Huang,Longfei Xu,Yulu Gao,Beipeng Mu,Si Liu
Main category: cs.CV
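TL;DR: This paper proposes RATopo, a strategy that restructures the Transformer decoder and introduces a redundancy assignment mechanism to improve lane topology reasoning for autonomous driving.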
Details
Motivation: Existing detect-then-reason methods yield suboptimal topology reasoning performance because of the one-to-one matching constraint in the detection stage. Method: The Transformer decoder is restructured by swapping the cross-attention and self-attention layers so that redundant lane predictions are retained, and multiple parallel cross-attention blocks with independent parameters are used to increase the diversity of detected lanes. Result: Extensive experiments on OpenLane-V2 show that RATopo enables quantity-rich and geometry-diverse topology supervision. Conclusion: RATopo is model-agnostic and can be seamlessly integrated into existing topology reasoning frameworks, consistently improving lane-lane and lane-traffic topology performance. Abstract: Lane topology reasoning plays a critical role in autonomous driving by modeling the connections among lanes and the topological relationships between lanes and traffic elements. Most existing methods adopt a first-detect-then-reason paradigm, where topological relationships are supervised based on the one-to-one assignment results obtained during the detection stage. This supervision strategy results in suboptimal topology reasoning performance due to the limited range of valid supervision. In this paper, we propose RATopo, a Redundancy Assignment strategy for lane Topology reasoning that enables quantity-rich and geometry-diverse topology supervision. Specifically, we restructure the Transformer decoder by swapping the cross-attention and self-attention layers. This allows redundant lane predictions to be retained before suppression, enabling effective one-to-many assignment. We also instantiate multiple parallel cross-attention blocks with independent parameters, which further enhances the diversity of detected lanes. Extensive experiments on OpenLane-V2 demonstrate that our RATopo strategy is model-agnostic and can be seamlessly integrated into existing topology reasoning frameworks, consistently improving both lane-lane and lane-traffic topology performance.
[77] DesignCLIP: Multimodal Learning with CLIP for Design Patent Understanding
Zhu Wang,Homaira Huda Shomee,Sathya N. Ravi,Sourav Medya
Main category: cs.CV
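TL;DR: This paper introduces DesignCLIP, a unified framework that uses vision-language models for design patent analysis and improves patent classification and retrieval through class-aware classification and contrastive learning.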
Details
Motivation: Traditional design patent analysis tasks depend on image data, but patent images often fail to convey comprehensive visual context and semantic information, leading to ambiguities in evaluation. Advances in vision-language models offer an opportunity for more reliable AI-driven patent analysis. Method: Building on CLIP models, DesignCLIP combines class-aware classification and contrastive learning, uses generated detailed captions and multi-view image learning, and is validated on a large-scale dataset of U.S. design patents. Result: DesignCLIP outperforms baseline and state-of-the-art models on all tasks, including patent classification and patent retrieval. The work also explores multimodal patent retrieval, which offers more diverse sources of inspiration for creativity and innovation in design. Conclusion: By combining vision-language models with the characteristics of patent data, DesignCLIP provides a unified framework for design patent analysis, validates its effectiveness on several downstream tasks, and highlights the potential of multimodal approaches in patent analysis. Abstract: In the field of design patent analysis, traditional tasks such as patent classification and patent image retrieval heavily depend on the image data. However, patent images -- typically consisting of sketches with abstract and structural elements of an invention -- often fall short in conveying comprehensive visual context and semantic information. This inadequacy can lead to ambiguities in evaluation during prior art searches. Recent advancements in vision-language models, such as CLIP, offer promising opportunities for more reliable and accurate AI-driven patent analysis. In this work, we leverage CLIP models to develop a unified framework DesignCLIP for design patent applications with a large-scale dataset of U.S. design patents. To address the unique characteristics of patent data, DesignCLIP incorporates class-aware classification and contrastive learning, utilizing generated detailed captions for patent images and multi-view image learning. We validate the effectiveness of DesignCLIP across various downstream tasks, including patent classification and patent retrieval. Additionally, we explore multimodal patent retrieval, which provides the potential to enhance creativity and innovation in design by offering more diverse sources of inspiration. Our experiments show that DesignCLIP consistently outperforms baseline and SOTA models in the patent domain on all tasks. Our findings underscore the promise of multimodal approaches in advancing patent analysis. 
The codebase is available here: https://anonymous.4open.science/r/PATENTCLIP-4661/README.md.
[78] TPA: Temporal Prompt Alignment for Fetal Congenital Heart Defect Classification
Darya Taratynova,Alya Almsouti,Beknur Kalmakhanbet,Numan Saeed,Mohammad Yaqub
Main category: cs.CV
TL;DR: Temporal Prompt Alignment (TPA) improves fetal heart defect detection in ultrasound videos by incorporating temporal modeling, contrastive learning with text prompts, and uncertainty calibration, achieving state-of-the-art results.
Details
Motivation: Automated detection of congenital heart defects in ultrasound videos is challenging due to image noise, probe variability, and the neglect of temporal information in existing methods. There is a need for better calibration and multi-class classification in clinical settings. Method: TPA uses an image encoder to extract features from ultrasound video frames, employs a temporal extractor to capture heart motion, aligns video representations with text prompts using contrastive learning, and enhances calibration through a CVAESM module that quantifies uncertainty. Result: TPA achieved an 85.40% macro F1 score for CHD diagnosis, reduced expected calibration error by 5.38%, and improved performance on EchoNet-Dynamic by increasing macro F1 from 53.89% to 58.62%. It also reduced adaptive ECE by 6.8%. Conclusion: Temporal Prompt Alignment (TPA) is an effective framework for fetal congenital heart defect (CHD) classification that integrates temporal modeling, prompt-aware contrastive learning, and uncertainty quantification, achieving state-of-the-art performance and improved calibration. Abstract: Congenital heart defect (CHD) detection in ultrasound videos is hindered by image noise and probe positioning variability. While automated methods can reduce operator dependence, current machine learning approaches often neglect temporal information, limit themselves to binary classification, and do not account for prediction calibration. We propose Temporal Prompt Alignment (TPA), a method leveraging a foundation image-text model and prompt-aware contrastive learning to classify fetal CHD on cardiac ultrasound videos. TPA extracts features from each frame of video subclips using an image encoder, aggregates them with a trainable temporal extractor to capture heart motion, and aligns the video representation with class-specific text prompts via a margin-hinge contrastive loss. 
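The alignment of a video representation with class-specific text prompts through a margin-hinge contrastive loss can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the sum over non-matching prompts, and the margin value of 0.2 are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_hinge_loss(video_emb, prompt_embs, label, margin=0.2):
    """Hinge loss that pushes the matching prompt's similarity above
    every non-matching prompt's similarity by at least `margin`.

    video_emb   : embedding of one video clip (hypothetical input)
    prompt_embs : one text-prompt embedding per class (hypothetical input)
    label       : index of the correct class
    """
    sims = [cosine(video_emb, p) for p in prompt_embs]
    pos = sims[label]
    return sum(
        max(0.0, margin - pos + neg)
        for i, neg in enumerate(sims)
        if i != label
    )


# Toy example with two classes and 2-D embeddings.
prompts = [[1.0, 0.0], [0.0, 1.0]]
video = [1.0, 0.0]
loss_correct = margin_hinge_loss(video, prompts, label=0)
loss_wrong = margin_hinge_loss(video, prompts, label=1)
```

In this toy case the loss is zero when the clip already matches its class prompt by more than the margin, and positive when the clip sits closer to the wrong prompt.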
To enhance calibration for clinical reliability, we introduce a Conditional Variational Autoencoder Style Modulation (CVAESM) module, which learns a latent style vector to modulate embeddings and quantifies classification uncertainty. Evaluated on a private dataset for CHD detection and on a large public dataset, EchoNet-Dynamic, for systolic dysfunction, TPA achieves state-of-the-art macro F1 scores of 85.40% for CHD diagnosis, while also reducing expected calibration error by 5.38% and adaptive ECE by 6.8%. On EchoNet-Dynamic's three-class task, it boosts macro F1 by 4.73% (from 53.89% to 58.62%). Temporal Prompt Alignment (TPA) is a framework for fetal congenital heart defect (CHD) classification in ultrasound videos that integrates temporal modeling, prompt-aware contrastive learning, and uncertainty quantification.
[79] BasketLiDAR: The First LiDAR-Camera Multimodal Dataset for Professional Basketball MOT
Ryunosuke Hayashi,Kohei Torimi,Rokuto Nagata,Kazuma Ikeda,Ozora Sako,Taichi Nakamura,Masaki Tani,Yoshimitsu Aoki,Kentaro Yoshioka
Main category: cs.CV
TL;DR: This paper introduces BasketLiDAR, a new multi-modal dataset combining LiDAR and camera data, and a novel MOT framework for real-time, high-accuracy basketball player tracking even under occlusion.
Details
Motivation: Traditional multi-camera systems for real-time 3D player tracking are limited by their 2D nature and complex 3D reconstruction, prompting the need for a more efficient solution in challenging sports scenarios like basketball with frequent occlusions. Method: The paper proposes a novel multi-object tracking (MOT) framework with two pipelines: one based solely on LiDAR and another that fuses LiDAR with camera data, leveraging LiDAR's high-precision 3D spatial information. Result: The experimental results show that the method enables real-time operation and achieves superior tracking performance under occlusion conditions, which was difficult with conventional camera-only methods. Conclusion: This paper concludes that the proposed MOT algorithm using LiDAR and camera data fusion enhances tracking accuracy and efficiency in complex basketball scenarios, with the BasketLiDAR dataset being made available for further research. Abstract: Real-time 3D trajectory player tracking in sports plays a crucial role in tactical analysis, performance evaluation, and enhancing spectator experience. Traditional systems rely on multi-camera setups, but are constrained by the inherently two-dimensional nature of video data and the need for complex 3D reconstruction processing, making real-time analysis challenging. Basketball, in particular, represents one of the most difficult scenarios in the MOT field, as ten players move rapidly and complexly within a confined court space, with frequent occlusions caused by intense physical contact. To address these challenges, this paper constructs BasketLiDAR, the first multimodal dataset in the sports MOT field that combines LiDAR point clouds with synchronized multi-view camera footage in a professional basketball environment, and proposes a novel MOT framework that simultaneously achieves improved tracking accuracy and reduced computational cost. 
The BasketLiDAR dataset contains a total of 4,445 frames and 3,105 player IDs, with fully synchronized IDs between three LiDAR sensors and three multi-view cameras. We recorded 5-on-5 and 3-on-3 game data from actual professional basketball players, providing complete 3D positional information and ID annotations for each player. Based on this dataset, we developed a novel MOT algorithm that leverages LiDAR's high-precision 3D spatial information. The proposed method consists of a real-time tracking pipeline using LiDAR alone and a multimodal tracking pipeline that fuses LiDAR and camera data. Experimental results demonstrate that our approach achieves real-time operation, which was difficult with conventional camera-only methods, while achieving superior tracking performance even under occlusion conditions. The dataset is available upon request at: https://sites.google.com/keio.jp/keio-csg/projects/basket-lidar
[80] First RAG, Second SEG: A Training-Free Paradigm for Camouflaged Object Detection
Wutao Liu,YiDan Wang,Pan Gao
Main category: cs.CV
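TL;DR: RAG-SEG is a training-free camouflaged object detection method that uses retrieval-augmented generation and SAM-based segmentation to achieve efficient, practical performance.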
Details
Motivation: To address the high similarity between objects and their backgrounds in camouflaged object detection while reducing reliance on large computational resources and manually crafted high-quality prompts. Method: RAG-SEG is proposed, a method with two stages: Retrieval-Augmented Generation (RAG) and SAM-based segmentation (SEG). Result: Extensive experiments on benchmark COD datasets show that RAG-SEG performs on par with or surpasses state-of-the-art methods. Conclusion: RAG-SEG requires no conventional training while maintaining competitive performance, and all experiments were run on a personal laptop, highlighting its computational efficiency and practicality. Abstract: Camouflaged object detection (COD) poses a significant challenge in computer vision due to the high similarity between objects and their backgrounds. Existing approaches often rely on heavy training and large computational resources. While foundation models such as the Segment Anything Model (SAM) offer strong generalization, they still struggle to handle COD tasks without fine-tuning and require high-quality prompts to yield good performance. However, generating such prompts manually is costly and inefficient. To address these challenges, we propose First RAG, Second SEG (RAG-SEG), a training-free paradigm that decouples COD into two stages: Retrieval-Augmented Generation (RAG) for generating coarse masks as prompts, followed by SAM-based segmentation (SEG) for refinement. RAG-SEG constructs a compact retrieval database via unsupervised clustering, enabling fast and effective feature retrieval. During inference, the retrieved features produce pseudo-labels that guide precise mask generation using SAM2. Our method eliminates the need for conventional training while maintaining competitive performance. Extensive experiments on benchmark COD datasets demonstrate that RAG-SEG performs on par with or surpasses state-of-the-art methods. Notably, all experiments are conducted on a personal laptop, highlighting the computational efficiency and practicality of our approach. We present further analysis in the Appendix, covering limitations, salient object detection extension, and possible improvements.
[81] VideoEraser: Concept Erasure in Text-to-Video Diffusion Models
Naen Xu,Jinghuai Zhang,Changjiang Li,Zhi Chen,Chunyi Zhou,Qingming Li,Tianyu Du,Shouling Ji
Main category: cs.CV
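TL;DR: VideoEraser is a training-free framework that prevents text-to-video diffusion models from generating undesirable content through two stages: selective prompt embedding adjustment and adversarial-resilient noise guidance.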
Details
Motivation: Text-to-video diffusion models can be misused to generate harmful or misleading content, raising privacy, copyright, and safety concerns, so an effective method for preventing such generation is needed. Method: VideoEraser is a training-free framework that prevents the generation of undesirable content through two stages: Selective Prompt Embedding Adjustment (SPEA) and Adversarial-Resilient Noise Guidance (ARNG). Result: Experiments on four tasks show that VideoEraser consistently outperforms prior methods in efficacy, integrity, fidelity, robustness, and generalizability, reducing undesirable content generation by 46% on average. Conclusion: VideoEraser is an effective solution for preventing text-to-video diffusion models from generating undesirable content without retraining the model. Abstract: The rapid growth of text-to-video (T2V) diffusion models has raised concerns about privacy, copyright, and safety due to their potential misuse in generating harmful or misleading content. These models are often trained on numerous datasets, including unauthorized personal identities, artistic creations, and harmful materials, which can lead to uncontrolled production and distribution of such content. To address this, we propose VideoEraser, a training-free framework that prevents T2V diffusion models from generating videos with undesirable concepts, even when explicitly prompted with those concepts. Designed as a plug-and-play module, VideoEraser can seamlessly integrate with representative T2V diffusion models via a two-stage process: Selective Prompt Embedding Adjustment (SPEA) and Adversarial-Resilient Noise Guidance (ARNG). We conduct extensive evaluations across four tasks, including object erasure, artistic style erasure, celebrity erasure, and explicit content erasure. Experimental results show that VideoEraser consistently outperforms prior methods regarding efficacy, integrity, fidelity, robustness, and generalizability. Notably, VideoEraser achieves state-of-the-art performance in suppressing undesirable content during T2V generation, reducing it by 46% on average across four tasks compared to baselines.
[82] Predicting Road Crossing Behaviour using Pose Detection and Sequence Modelling
Subhasis Dasgupta,Preetam Saha,Agniva Roy,Jaydip Sen
Main category: cs.CV
TL;DR: This paper explores deep learning models like GRU, LSTM, and 1D CNN to predict pedestrian road-crossing intentions using video and pose analysis.
Details
Motivation: Autonomous vehicles need to predict pedestrian intent to cross roads for safer navigation. Method: Deep learning models for pose prediction and sequence modeling were analyzed, including GRU, LSTM, and 1D CNN. Result: The study found GRU to be more effective than LSTM for intent prediction, while 1D CNN offered the highest speed. Conclusion: GRU outperforms LSTM, and 1D CNN is the fastest model for predicting pedestrian road-crossing intent. Abstract: The world is constantly moving towards AI-based systems, and autonomous vehicles are now a reality in different parts of the world. These vehicles require sensors and cameras to detect objects and maneuver accordingly. It becomes important for such vehicles to also predict from a distance whether a person is about to cross a road. The current study focused on predicting pedestrians' intent to cross the road in an experimental setup. The study involved working with deep learning models to predict poses and sequence modelling for temporal predictions. The study analysed three different sequence models to understand the prediction behaviour and found that GRU was better at predicting the intent than the LSTM model, while 1D CNN was the best model in terms of speed. The study involved video analysis, and the output of the pose detection model was later integrated with sequence modelling techniques to form an end-to-end deep learning framework for predicting road-crossing intent.
[83] RCDINO: Enhancing Radar-Camera 3D Object Detection with DINOv2 Semantic Features
Olga Matykina,Dmitry Yudin
Main category: cs.CV
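TL;DR: This paper proposes RCDINO, a multimodal 3D object detection model that fuses visual features with semantic representations from a pretrained model, achieving strong performance for autonomous driving and robotics.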
Details
Motivation: 3D object detection is essential for autonomous driving and robotics and requires effective fusion of multimodal data from cameras and radar. Method: RCDINO is a multimodal transformer-based model that enhances visual representations by fusing visual backbone features with semantically rich representations from the pretrained DINOv2 foundation model. Result: Experiments on the nuScenes dataset show that RCDINO achieves state-of-the-art performance among radar-camera models, with 56.4 NDS and 48.1 mAP. Conclusion: RCDINO achieves state-of-the-art performance among radar-camera models, demonstrating the effectiveness of multimodal data fusion for 3D object detection. Abstract: Three-dimensional object detection is essential for autonomous driving and robotics, relying on effective fusion of multimodal data from cameras and radar. This work proposes RCDINO, a multimodal transformer-based model that enhances visual backbone features by fusing them with semantically rich representations from the pretrained DINOv2 foundation model. This approach enriches visual representations and improves the model's detection performance while preserving compatibility with the baseline architecture. Experiments on the nuScenes dataset demonstrate that RCDINO achieves state-of-the-art performance among radar-camera models, with 56.4 NDS and 48.1 mAP. Our implementation is available at https://github.com/OlgaMatykina/RCDINO.
[84] An Empirical Study on How Video-LLMs Answer Video Questions
Chenhui Gou,Ziyu Ma,Zicheng Duan,Haoyu He,Feng Chen,Akide Liu,Bohan Zhuang,Jianfei Cai,Hamid Rezatofighi
Main category: cs.CV
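TL;DR: Using attention knockouts, this paper systematically analyzes Video-LLMs, reveals how they internally process video content, and proposes a way to improve computational efficiency.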
Details
Motivation: Although Video-LLMs show strong capabilities in answering video questions, most existing work focuses on improving performance, and understanding of their internal mechanisms is limited. This paper aims to fill that gap through a systematic empirical study. Method: Attention knockouts are used as the primary analytical tool, with three variants: Video Temporal Knockout, Video Spatial Knockout, and Language-to-Video Knockout. The three knockouts are applied over different numbers of layers. Result: The study reveals three key findings: (1) the global setting shows that video information extraction occurs primarily in early layers, forming a clear two-stage process; (2) in the fine-grained setting, certain intermediate layers have an outsized impact on video question answering; (3) in both settings, spatial-temporal modeling relies more on language-guided retrieval. Conclusion: This is the first work to systematically uncover how Video-LLMs internally process and understand video content, offering interpretability and efficiency perspectives for future research. Abstract: Taking advantage of large-scale data and pretrained language models, Video Large Language Models (Video-LLMs) have shown strong capabilities in answering video questions. However, most existing efforts focus on improving performance, with limited attention to understanding their internal mechanisms. This paper aims to bridge this gap through a systematic empirical study. To interpret existing Video-LLMs, we adopt attention knockouts as our primary analytical tool and design three variants: Video Temporal Knockout, Video Spatial Knockout, and Language-to-Video Knockout. Then, we apply these three knockouts on different numbers of layers (window of layers). By carefully controlling the window of layers and types of knockouts, we provide two settings: a global setting and a fine-grained setting. Our study reveals three key findings: (1) Global setting indicates Video information extraction primarily occurs in early layers, forming a clear two-stage process -- lower layers focus on perceptual encoding, while higher layers handle abstract reasoning; (2) In the fine-grained setting, certain intermediate layers exert an outsized impact on video question answering, acting as critical outliers, whereas most other layers contribute minimally; (3) In both settings, we observe that spatial-temporal modeling relies more on language-guided retrieval than on intra- and inter-frame self-attention among video tokens, despite the latter's high computational cost. Finally, we demonstrate that these insights can be leveraged to reduce attention computation in Video-LLMs. 
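An attention knockout of the kind described above can be sketched as masking selected query-key pairs before the softmax. This is an illustrative sketch rather than the paper's code: the helper names, the token index layout, and the reading of Language-to-Video Knockout as blocking language queries from attending to video keys are assumptions.

```python
import math


def language_to_video_pairs(video_idx, language_idx):
    """Hypothetical helper: every (query, key) pair in which a language
    token would attend to a video token."""
    return {(q, k) for q in language_idx for k in video_idx}


def knockout_attention(scores, blocked_pairs):
    """Row-wise softmax over raw attention scores, with the blocked
    (query, key) pairs set to -inf so they receive zero weight."""
    weights = []
    for q, row in enumerate(scores):
        masked = [
            -math.inf if (q, k) in blocked_pairs else s
            for k, s in enumerate(row)
        ]
        top = max(masked)
        if top == -math.inf:
            raise ValueError(f"query {q} has no key left to attend to")
        exps = [math.exp(s - top) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy sequence: tokens 0 and 1 are video tokens, token 2 is a language token.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
blocked = language_to_video_pairs(video_idx=[0, 1], language_idx=[2])
weights = knockout_attention(scores, blocked)
```

Restricting the mask to a window of layers, as the abstract describes, would amount to passing `blocked` only in the chosen layers and an empty set elsewhere.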
To our knowledge, this is the first work to systematically uncover how Video-LLMs internally process and understand video content, offering interpretability and efficiency perspectives for future research.
[85] Transfer learning optimization based on evolutionary selective fine tuning
Jacinto Colan,Ana Davila,Yasuhisa Hasegawa
Main category: cs.CV
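TL;DR: BioTune is an evolutionary adaptive fine-tuning technique that selectively fine-tunes specific layers of a model to make transfer learning more efficient. It reduces the number of trainable parameters, which lowers computational cost and supports efficient transfer learning across diverse data characteristics and distributions.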
Details
Motivation: The computational demands of large, fully trained models remain a challenge, and traditional fine-tuning can lead to overfitting and higher computational costs, so a more efficient transfer learning strategy is needed. Method: BioTune uses an evolutionary algorithm to identify and focus on a targeted set of layers for fine-tuning, in order to optimize model performance on a given target task. Result: Evaluation on nine image classification datasets from different domains shows that BioTune achieves competitive or improved accuracy and efficiency compared with existing fine-tuning methods such as AutoRGN and LoRA. Conclusion: By concentrating fine-tuning on relevant layers, BioTune reduces the number of trainable parameters and improves transfer learning efficiency, offering an effective solution for efficient transfer learning across diverse data characteristics and distributions. Abstract: Deep learning has shown substantial progress in image analysis. However, the computational demands of large, fully trained models remain a consideration. Transfer learning offers a strategy for adapting pre-trained models to new tasks. Traditional fine-tuning often involves updating all model parameters, which can potentially lead to overfitting and higher computational costs. This paper introduces BioTune, an evolutionary adaptive fine-tuning technique that selectively fine-tunes layers to enhance transfer learning efficiency. BioTune employs an evolutionary algorithm to identify a focused set of layers for fine-tuning, aiming to optimize model performance on a given target task. Evaluation across nine image classification datasets from various domains indicates that BioTune achieves competitive or improved accuracy and efficiency compared to existing fine-tuning methods such as AutoRGN and LoRA. By concentrating the fine-tuning process on a subset of relevant layers, BioTune reduces the number of trainable parameters, potentially leading to decreased computational cost and facilitating more efficient transfer learning across diverse data characteristics and distributions.
[86] Image-Conditioned 3D Gaussian Splat Quantization
Xinshuang Liu,Runfa Blark Li,Keito Suzuki,Truong Nguyen
Main category: cs.CV
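TL;DR: This paper proposes ICGS-Quantizer, an efficient 3D Gaussian splat compression method that addresses the shortcomings of existing methods in storage constraints and adaptability to scene changes.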
Details
Motivation: Existing 3DGS compression methods have two limitations: they compress medium-scale scenes only to the megabyte range, which is impractical for large-scale scenes or scene collections, and they lack mechanisms to accommodate scene changes after long-term archival. Method: An Image-Conditioned Gaussian Splat Quantizer (ICGS-Quantizer) is proposed that improves quantization efficiency by exploiting inter-Gaussian and inter-attribute correlations together with codebooks shared across all training scenes. Result: ICGS-Quantizer consistently outperforms state-of-the-art methods in compression efficiency and adaptability to scene changes. Conclusion: ICGS-Quantizer effectively reduces the storage requirements of 3D Gaussian splats and adapts to scene changes by conditioning decoding on images captured at decoding time. Abstract: 3D Gaussian Splatting (3DGS) has attracted considerable attention for enabling high-quality real-time rendering. Although 3DGS compression methods have been proposed for deployment on storage-constrained devices, two limitations hinder archival use: (1) they compress medium-scale scenes only to the megabyte range, which remains impractical for large-scale scenes or extensive scene collections; and (2) they lack mechanisms to accommodate scene changes after long-term archival. To address these limitations, we propose an Image-Conditioned Gaussian Splat Quantizer (ICGS-Quantizer) that substantially enhances compression efficiency and provides adaptability to scene changes after archiving. ICGS-Quantizer improves quantization efficiency by jointly exploiting inter-Gaussian and inter-attribute correlations and by using shared codebooks across all training scenes, which are then fixed and applied to previously unseen test scenes, eliminating the overhead of per-scene codebooks. This approach effectively reduces the storage requirements for 3DGS to the kilobyte range while preserving visual fidelity. To enable adaptability to post-archival scene changes, ICGS-Quantizer conditions scene decoding on images captured at decoding time. The encoding, quantization, and decoding processes are trained jointly, ensuring that the codes, which are quantized representations of the scene, are effective for conditional decoding. We evaluate ICGS-Quantizer on 3D scene compression and 3D scene updating. Experimental results show that ICGS-Quantizer consistently outperforms state-of-the-art methods in compression efficiency and adaptability to scene changes. 
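The storage saving that comes from a fixed, shared codebook can be illustrated with plain nearest-neighbour vector quantization. This is a generic sketch, not ICGS-Quantizer itself: the function names, the toy codebook, and the squared-Euclidean distance are assumptions, and the learned encoder and image-conditioned decoder are omitted.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry
    (squared Euclidean distance). Only these indices need storing
    per scene when the codebook is shared across scenes."""
    codes = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        codes.append(dists.index(min(dists)))
    return codes


def dequantize(codes, codebook):
    """Recover approximate vectors by looking the indices back up."""
    return [list(codebook[i]) for i in codes]


# Toy shared codebook with two entries, reused for any scene.
shared_codebook = [[0.0, 0.0], [1.0, 1.0]]
scene_features = [[0.1, -0.1], [0.9, 1.2]]

codes = quantize(scene_features, shared_codebook)
restored = dequantize(codes, shared_codebook)
```

Because the codebook is fixed after training, a new scene costs only its list of integer codes, which is the intuition behind avoiding per-scene codebook overhead.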
Our code, model, and data will be publicly available on GitHub.
[87] DriveSplat: Decoupled Driving Scene Reconstruction with Geometry-enhanced Partitioned Neural Gaussians
Cong Wang,Xianda Guo,Wenbo Xu,Wei Tian,Ruiqi Song,Chenming Zhang,Lingxi Li,Long Chen
Main category: cs.CV
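TL;DR: DriveSplat is a high-quality 3D reconstruction method for driving scenes that combines a dynamic-static decoupling strategy with neural Gaussian representations to address motion blur and improve the accuracy of geometric structures.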
Details
Motivation: When handling dynamic objects and static backgrounds in driving scenes, existing 3D Gaussian Splatting methods fail to optimize the background with adequate geometric relationships, which leads to poor novel-view rendering and inaccurate geometric representation. Method: DriveSplat introduces a region-wise voxel initialization scheme that partitions the scene into near, middle, and far regions, and uses deformable neural Gaussians to model non-rigid dynamic actors. The whole framework is also supervised by depth and normal priors from pre-trained models to improve the accuracy of geometric reconstruction. Result: Experiments on the Waymo and KITTI datasets show that DriveSplat achieves state-of-the-art performance in novel-view synthesis for driving scenes. Conclusion: Through dynamic-static decoupling and deformable neural Gaussians, DriveSplat improves the quality and geometric accuracy of 3D reconstruction in driving scenes, addressing the shortcomings of existing methods in background optimization and geometric representation. Abstract: In the realm of driving scenarios, the presence of rapidly moving vehicles, pedestrians in motion, and large-scale static backgrounds poses significant challenges for 3D scene reconstruction. Recent methods based on 3D Gaussian Splatting address the motion blur problem by decoupling dynamic and static components within the scene. However, these decoupling strategies overlook background optimization with adequate geometry relationships and rely solely on fitting each training view by adding Gaussians. Therefore, these models exhibit limited robustness in rendering novel views and lack an accurate geometric representation. To address the above issues, we introduce DriveSplat, a high-quality reconstruction method for driving scenarios based on neural Gaussian representations with dynamic-static decoupling. To better accommodate the predominantly linear motion patterns of driving viewpoints, a region-wise voxel initialization scheme is employed, which partitions the scene into near, middle, and far regions to enhance close-range detail representation. Deformable neural Gaussians are introduced to model non-rigid dynamic actors, whose parameters are temporally adjusted by a learnable deformation network. The entire framework is further supervised by depth and normal priors from pre-trained models, improving the accuracy of geometric structures. Our method has been rigorously evaluated on the Waymo and KITTI datasets, demonstrating state-of-the-art performance in novel-view synthesis for driving scenarios.
[88] DIO: Refining Mutual Information and Causal Chain to Enhance Machine Abstract Reasoning Ability
Ruizhuo Song,Beiming Yuan
Main category: cs.CV
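TL;DR: This paper studies the bottleneck of deep learning models on abstract reasoning tasks, designs the DIO model through causal chain modeling, and proposes improvements to strengthen machine abstract reasoning.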
Details
Motivation: Current deep learning models have a fundamental bottleneck in abstract reasoning, and Raven's Progressive Matrices (RPM) problems have been introduced as an authoritative benchmark for evaluating abstract reasoning ability. Method: A "causal chain modeling" perspective is adopted to analyze the complete causal chain in RPM tasks, and the network architecture of the baseline model DIO is designed accordingly. Result: Experiments show that DIO's optimization objective fails to make the model genuinely acquire the predefined human reasoning logic, mainly because of the tightness of the mutual information lower bound and because a statistical measure cannot capture causal relationships. Conclusion: The paper proposes three improvement methods to overcome the limitations of the DIO model and thereby enhance the abstract reasoning ability of machine intelligence. Abstract: Despite the outstanding performance of current deep learning models across various domains, their fundamental bottleneck in abstract reasoning remains unresolved. To address this challenge, the academic community has introduced Raven's Progressive Matrices (RPM) problems as an authoritative benchmark for evaluating the abstract reasoning capabilities of deep learning algorithms, with a focus on core intelligence dimensions such as abstract reasoning, pattern recognition, and complex problem-solving. Therefore, this paper centers on solving RPM problems, aiming to contribute to enhancing the abstract reasoning abilities of machine intelligence. Firstly, this paper adopts a "causal chain modeling" perspective to systematically analyze the complete causal chain in RPM tasks: image → abstract attributes → progressive attribute patterns → pattern consistency → correct answer. Based on this analysis, the network architecture of the baseline model DIO is designed. However, experiments reveal that the optimization objective formulated for DIO, namely maximizing the variational lower bound of mutual information between the context and the correct option, fails to enable the model to genuinely acquire the predefined human reasoning logic. This is attributed to two main reasons: the tightness of the lower bound significantly impacts the effectiveness of mutual information maximization, and mutual information, as a statistical measure, does not capture the causal relationship between subjects and objects. To overcome these limitations, this paper progressively proposes three improvement methods:
[89] Spiking Variational Graph Representation Inference for Video Summarization
Wenrui Li,Wei Han,Liang-Jian Deng,Ruiqin Xiong,Xiaopeng Fan
Main category: cs.CV
TL;DR: This paper introduces the SpiVG Network, a novel approach for video summarization that improves information density, handles noise, and reduces computational complexity, showing superior performance across multiple datasets.
Details
Motivation: Efficient video summarization techniques are essential due to the rise of short video content. Existing methods struggle with capturing global temporal dependencies, maintaining semantic coherence, and managing noise during multi-channel feature fusion. Method: The paper proposes a Spiking Variational Graph (SpiVG) Network, which includes a keyframe extractor based on Spiking Neural Networks (SNN), a Dynamic Aggregation Graph Reasoner for fine-grained reasoning, and a Variational Inference Reconstruction Module to handle noise and uncertainty. Result: Experimental results show that the SpiVG Network outperforms existing methods across multiple datasets including SumMe, TVSum, VideoXum, and QFVS. Conclusion: SpiVG surpasses existing methods in video summarization by efficiently extracting key information while reducing computational complexity, with codes and pre-trained models publicly available. Abstract: With the rise of short video content, efficient video summarization techniques for extracting key information have become crucial. However, existing methods struggle to capture the global temporal dependencies and maintain the semantic coherence of video content. Additionally, these methods are also influenced by noise during multi-channel feature fusion. We propose a Spiking Variational Graph (SpiVG) Network, which enhances information density and reduces computational complexity. First, we design a keyframe extractor based on Spiking Neural Networks (SNN), leveraging the event-driven computation mechanism of SNNs to learn keyframe features autonomously. To enable fine-grained and adaptable reasoning across video frames, we introduce a Dynamic Aggregation Graph Reasoner, which decouples contextual object consistency from semantic perspective coherence. We present a Variational Inference Reconstruction Module to address uncertainty and noise arising during multi-channel feature fusion. 
In this module, we employ Evidence Lower Bound Optimization (ELBO) to capture the latent structure of multi-channel feature distributions, using posterior distribution regularization to reduce overfitting. Experimental results show that SpiVG surpasses existing methods across multiple datasets such as SumMe, TVSum, VideoXum, and QFVS. Our code and pre-trained models are available at https://github.com/liwrui/SpiVG.
[90] From Linearity to Non-Linearity: How Masked Autoencoders Capture Spatial Correlations
Anthony Bisulco,Rahul Ramesh,Randall Balestriero,Pratik Chaudhari
Main category: cs.CV
TL;DR: This study explores how Masked Autoencoders (MAEs) learn spatial correlations in images and how hyperparameters like masking ratio and patch size can be used to improve performance on downstream tasks.
Details
Motivation: Despite the effectiveness of Masked Autoencoders (MAEs), there is a lack of exploration regarding the connection between their hyperparameters and performance on downstream tasks. Method: The study analytically derives the features learned by a linear MAE and extends this analysis to non-linear MAEs to understand their adaptation to spatial correlations. Result: The research reveals that MAEs learn spatial correlations in input images and provides insights into selecting MAE hyperparameters to optimize performance. Conclusion: MAE representations adapt to spatial correlations in the dataset beyond second-order statistics, and hyperparameters such as masking ratio and patch size can be utilized to capture short- and long-range spatial correlations. Abstract: Masked Autoencoders (MAEs) have emerged as a powerful pretraining technique for vision foundation models. Despite their effectiveness, they require extensive hyperparameter tuning (masking ratio, patch size, encoder/decoder layers) when applied to novel datasets. While prior theoretical works have analyzed MAEs in terms of their attention patterns and hierarchical latent variable models, the connection between MAE hyperparameters and performance on downstream tasks is relatively unexplored. This work investigates how MAEs learn spatial correlations in the input image. We analytically derive the features learned by a linear MAE and show that masking ratio and patch size can be used to select for features that capture short- and long-range spatial correlations. We extend this analysis to non-linear MAEs to show that MAE representations adapt to spatial correlations in the dataset, beyond second-order statistics. Finally, we discuss some insights on how to select MAE hyperparameters in practice.
[91] Bidirectional Temporal Information Propagation for Moving Infrared Small Target Detection
Dengyan Luo,Yanping Xiang,Hu Wang,Luping Ji,Shuai Li,Mao Ye
Main category: cs.CV
TL;DR: This paper proposes BIRD, a bidirectional temporal information propagation method for moving infrared small target detection, which effectively combines local and global temporal information for improved performance and faster inference.
Details
Motivation: Existing sliding-window-based multi-frame methods ignore global temporal information and result in redundant computation and sub-optimal performance, which the proposed method aims to address. Method: BIRD utilizes a bidirectional propagation strategy with Local Temporal Motion Fusion (LTMF) and Global Temporal Motion Fusion (GTMF) modules for recursive modeling of local and global temporal information. It also involves joint optimization using detection loss and Spatio-Temporal Fusion (STF) loss. Result: The BIRD method demonstrates superior performance in moving infrared small target detection along with faster inference speed. Conclusion: The proposed BIRD method achieves state-of-the-art performance and shows fast inference speed. Abstract: Moving infrared small target detection is broadly adopted in infrared search and track systems, and has attracted considerable research focus in recent years. The existing learning-based multi-frame methods mainly aggregate the information of adjacent frames in a sliding window fashion to assist the detection of the current frame. However, the sliding-window-based methods do not consider joint optimization of the entire video clip and ignore the global temporal information outside the sliding window, resulting in redundant computation and sub-optimal performance. In this paper, we propose a Bidirectional temporal information propagation method for moving InfraRed small target Detection, dubbed BIRD. The bidirectional propagation strategy simultaneously utilizes local temporal information of adjacent frames and global temporal information of past and future frames in a recursive fashion. Specifically, in the forward and backward propagation branches, we first design a Local Temporal Motion Fusion (LTMF) module to model local spatio-temporal dependency between a target frame and its two adjacent frames. 
Then, a Global Temporal Motion Fusion (GTMF) module is developed to further aggregate the global propagation feature with the local fusion feature. Finally, the bidirectional aggregated features are fused and input into the detection head for detection. In addition, the entire video clip is jointly optimized by the traditional detection loss and the additional Spatio-Temporal Fusion (STF) loss. Extensive experiments demonstrate that the proposed BIRD method not only achieves state-of-the-art performance but also shows a fast inference speed.
[92] A Curated Dataset and Deep Learning Approach for Minor Dent Detection in Vehicles
Danish Zia Baig,Mohsin Kamal
Main category: cs.CV
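TL;DR: This study proposes a deep learning method based on the YOLOv8 framework for automatically detecting microscopic defects on car surfaces, particularly small dents. By creating a custom dataset of annotated images covering varied lighting conditions, angles, and textures, and by training the YOLOv8m model and its customized variants YOLOv8m-t4 and YOLOv8m-t42 with real-time data augmentation, it achieves high-accuracy, low-latency real-time detection.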
Details
Motivation: Conventional car damage inspection techniques are labor-intensive and manual, and frequently overlook tiny surface imperfections such as microscopic dents. Machine learning offers an innovative solution to the growing demand for faster and more precise inspection methods. Method: The paper adopts the YOLOv8 object recognition framework and trains the YOLOv8m model and its customized variants, YOLOv8m-t4 and YOLOv8m-t42, with real-time data augmentation to build a deep learning-based solution for automatically detecting microscopic surface flaws, notably tiny dents, on car exteriors. Result: Experimental results show that the method has excellent detection accuracy and low inference latency, making it suitable for real-time applications such as automated insurance evaluations and automobile inspections. Evaluation metrics including mean Average Precision (mAP), precision, recall, and F1-score verified the model's efficacy. The YOLOv8m-t42 model performed well at identifying microscopic surface defects, with a precision of 0.86, recall of 0.84, and F1-score of 0.85. YOLOv8m-t42's PR curve area was 0.88, indicating more consistent performance than YOLOv8m-t4 (0.82). Conclusion: The YOLOv8m-t42 model outperforms the YOLOv8m-t4 model at identifying microscopic surface defects; although it converges more slowly, its higher accuracy makes it better suited to practical dent detection applications. Abstract: Conventional car damage inspection techniques are labor-intensive, manual, and frequently overlook tiny surface imperfections like microscopic dents. Machine learning provides an innovative solution to the increasing demand for quicker and more precise inspection methods. The paper uses the YOLOv8 object recognition framework to provide a deep learning-based solution for automatically detecting microscopic surface flaws, notably tiny dents, on car exteriors. Traditional automotive damage inspection procedures are manual, time-consuming, and frequently unreliable at detecting tiny flaws. To solve this, a bespoke dataset containing annotated photos of car surfaces under various lighting circumstances, angles, and textures was created. To improve robustness, the YOLOv8m model and its customized variants, YOLOv8m-t4 and YOLOv8m-t42, were trained employing real-time data augmentation approaches. Experimental results show that the technique has excellent detection accuracy and low inference latency, making it suited for real-time applications such as automated insurance evaluations and automobile inspections. Evaluation parameters such as mean Average Precision (mAP), precision, recall, and F1-score verified the model's efficacy. With a precision of 0.86, recall of 0.84, and F1-score of 0.85, the YOLOv8m-t42 model outperformed the YOLOv8m-t4 model (precision: 0.81, recall: 0.79, F1-score: 0.80) in identifying microscopic surface defects.
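As a quick consistency check on the figures quoted above, recall that the F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short Python sketch below recomputes it from the reported precision and recall values; the function and variable names are illustrative and are not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f1_t42 = f1_score(0.86, 0.84)  # YOLOv8m-t42
f1_t4 = f1_score(0.81, 0.79)   # YOLOv8m-t4

print(f"YOLOv8m-t42 F1: {f1_t42:.2f}")
print(f"YOLOv8m-t4  F1: {f1_t4:.2f}")
```

Both recomputed values agree, to two decimal places, with the F1-scores stated in the abstract (0.85 and 0.80).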
The mAP@0.5 for YOLOv8m-t42 stabilized at 0.60, with a slightly reduced mAP@0.5:0.95 of 0.20. Furthermore, YOLOv8m-t42's PR curve area was 0.88, suggesting more consistent performance than YOLOv8m-t4 (0.82). YOLOv8m-t42 has greater accuracy and is more appropriate for practical dent detection applications, even though its convergence is slower.
[93] Aligning Moments in Time using Video Queries
Yogesh Kumar,Uday Agarwal,Manish Gupta,Anand Mishra
Main category: cs.CV
TL;DR: MATR is a transformer-based model for video-to-video moment retrieval that achieves state-of-the-art results on benchmark datasets by effectively aligning query and target videos.
Details
Motivation: The Vid2VidMR task involves challenges like semantic alignment and modeling dependencies between videos, which motivated the development of MATR. Method: MATR uses a transformer-based architecture with dual-stage sequence alignment and a self-supervised pre-training technique for video moment retrieval. Result: MATR achieves significant performance gains of 13.1% in R@1 and 8.1% in mIoU on ActivityNet-VRL dataset, and 14.7% in R@1 and 14.4% in mIoU on the SportsMoments dataset. Conclusion: The proposed MATR model excels in the Vid2VidMR task by effectively aligning query and target videos, leading to accurate moment localization. Abstract: Video-to-video moment retrieval (Vid2VidMR) is the task of localizing unseen events or moments in a target video using a query video. This task poses several challenges, such as the need for semantic frame-level alignment and modeling complex dependencies between query and target videos. To tackle this challenging problem, we introduce MATR (Moment Alignment TRansformer), a transformer-based model designed to capture semantic context as well as the temporal details necessary for precise moment localization. MATR conditions target video representations on query video features using dual-stage sequence alignment that encodes the required correlations and dependencies. These representations are then used to guide foreground/background classification and boundary prediction heads, enabling the model to accurately identify moments in the target video that semantically match with the query video. Additionally, to provide a strong task-specific initialization for MATR, we propose a self-supervised pre-training technique that involves training the model to localize random clips within videos. Extensive experiments demonstrate that MATR achieves notable performance improvements of 13.1% in R@1 and 8.1% in mIoU on an absolute scale compared to state-of-the-art methods on the popular ActivityNet-VRL dataset. 
Additionally, on our newly proposed dataset, SportsMoments, MATR shows a 14.7% gain in R@1 and a 14.4% gain in mIoU on an absolute scale over strong baselines.
[94] Enhancing Novel View Synthesis from extremely sparse views with SfM-free 3D Gaussian Splatting Framework
Zongqi He,Hanmin Li,Kin-Chung Chan,Yushen Zuo,Hao Xie,Zhe Xiao,Jun Xiao,Kin-Man Lam
Main category: cs.CV
TL;DR: This paper proposes an SfM-free 3D Gaussian Splatting-based method that effectively reconstructs high-quality 3D scenes from extremely sparse-view inputs, significantly outperforming existing methods in both quantitative metrics and visual quality.
Details
Motivation: 3D Gaussian Splatting (3DGS) relies heavily on dense multi-view inputs and accurate camera poses, which are often unavailable in real-world scenarios. Sparse input views lead to poor 3D reconstruction and degraded rendering quality, motivating the need for an SfM-free approach that works under sparse-view conditions. Method: The method jointly estimates camera poses and reconstructs 3D scenes without relying on SfM. It employs a dense stereo module for initialization, a coherent view interpolation module for generating additional supervision signals, and introduces multi-scale Laplacian consistent regularization and adaptive spatial-aware multi-scale geometry regularization to enhance geometry and rendering quality. Result: The proposed method achieves a 2.75dB improvement in PSNR under extremely sparse-view conditions (using only 2 training views) and produces high-quality, distortion-free synthesized images with rich high-frequency details, outperforming existing 3DGS-based approaches. Conclusion: The proposed method significantly outperforms state-of-the-art 3DGS-based approaches under extremely sparse-view conditions, achieving a 2.75dB improvement in PSNR and producing high-quality, distortion-free synthesized images with rich high-frequency details. Abstract: 3D Gaussian Splatting (3DGS) has demonstrated remarkable real-time performance in novel view synthesis, yet its effectiveness relies heavily on dense multi-view inputs with precisely known camera poses, which are rarely available in real-world scenarios. When input views become extremely sparse, the Structure-from-Motion (SfM) method that 3DGS depends on for initialization fails to accurately reconstruct the 3D geometric structures of scenes, resulting in degraded rendering quality. In this paper, we propose a novel SfM-free 3DGS-based method that jointly estimates camera poses and reconstructs 3D scenes from extremely sparse-view inputs. 
Specifically, instead of SfM, we propose a dense stereo module that progressively estimates camera pose information and reconstructs a global dense point cloud for initialization. To address the inherent problem of information scarcity in extremely sparse-view settings, we propose a coherent view interpolation module that interpolates camera poses based on training view pairs and generates viewpoint-consistent content as additional supervision signals for training. Furthermore, we introduce multi-scale Laplacian consistent regularization and adaptive spatial-aware multi-scale geometry regularization to enhance the quality of geometrical structures and rendered content. Experiments show that our method significantly outperforms other state-of-the-art 3DGS-based approaches, achieving a remarkable 2.75dB improvement in PSNR under extremely sparse-view conditions (using only 2 training views). The images synthesized by our method exhibit minimal distortion while preserving rich high-frequency details, resulting in superior visual quality compared to existing techniques.
[95] LGMSNet: Thinning a medical image segmentation model via dual-level multiscale fusion
Chengqi Dong,Fenghe Tang,Rongge Mao,Xinpei Gao,S. Kevin Zhou
Main category: cs.CV
TL;DR: LGMSNet is a novel lightweight framework for medical image segmentation that effectively balances efficiency and performance by addressing channel redundancy and integrating global context through hybrid architectural components.
Details
Motivation: To develop an efficient and generalizable medical image segmentation model that overcomes the limitations of existing lightweight models in terms of performance, global context perception, and handling channel redundancy. Method: The authors proposed LGMSNet, a lightweight framework using a local and global dual multiscale architecture with heterogeneous intra-layer kernels and sparse transformer-convolutional hybrid branches. Result: LGMSNet achieved state-of-the-art performance across six public datasets and showed strong zero-shot generalization on four unseen datasets with minimal computational overhead. Conclusion: LGMSNet demonstrates superior performance in medical image segmentation, especially in resource-constrained settings, by addressing channel redundancy and incorporating global contextual perception. Abstract: Medical image segmentation plays a pivotal role in disease diagnosis and treatment planning, particularly in resource-constrained clinical settings where lightweight and generalizable models are urgently needed. However, existing lightweight models often compromise performance for efficiency and rarely adopt computationally expensive attention mechanisms, severely restricting their global contextual perception capabilities. Additionally, current architectures neglect the channel redundancy issue under the same convolutional kernels in medical imaging, which hinders effective feature extraction. To address these challenges, we propose LGMSNet, a novel lightweight framework based on local and global dual multiscale that achieves state-of-the-art performance with minimal computational overhead. LGMSNet employs heterogeneous intra-layer kernels to extract local high-frequency information while mitigating channel redundancy. In addition, the model integrates sparse transformer-convolutional hybrid branches to capture low-frequency global information. 
Extensive experiments across six public datasets demonstrate LGMSNet's superiority over existing state-of-the-art methods. In particular, LGMSNet maintains exceptional performance in zero-shot generalization tests on four unseen datasets, underscoring its potential for real-world deployment in resource-limited medical scenarios. The full project code is available at https://github.com/cq-dong/LGMSNet.
[96] MExECON: Multi-view Extended Explicit Clothed humans Optimized via Normal integration
Fulden Ece Uğur,Rafael Redondo,Albert Barreiro,Stefan Hristov,Roger Marí
Main category: cs.CV
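TL;DR: MExECON is a new pipeline for 3D reconstruction of clothed human avatars from sparse multi-view RGB images; it improves geometry and body pose estimation through a joint multi-view body optimization algorithm.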
Details
Motivation: Building on the single-view method ECON, MExECON extends its capabilities to leverage multiple viewpoints, improving the quality and accuracy of 3D reconstruction. Method: At the core of MExECON is the Joint Multi-view Body Optimization (JMBO) algorithm, which fits a single SMPL-X body model across all input views, enforcing multi-view consistency. The optimized body model serves as a low-frequency prior that guides the subsequent surface reconstruction, where geometric details are added via normal map integration. Result: Experimental results show that MExECON improves fidelity over the single-view baseline in multi-view settings and is competitive with modern few-shot 3D reconstruction methods. Conclusion: MExECON offers an effective new method for 3D reconstruction of clothed human avatars, achieving multi-view gains without any network re-training. Abstract: This work presents MExECON, a novel pipeline for 3D reconstruction of clothed human avatars from sparse multi-view RGB images. Building on the single-view method ECON, MExECON extends its capabilities to leverage multiple viewpoints, improving geometry and body pose estimation. At the core of the pipeline is the proposed Joint Multi-view Body Optimization (JMBO) algorithm, which fits a single SMPL-X body model jointly across all input views, enforcing multi-view consistency. The optimized body model serves as a low-frequency prior that guides the subsequent surface reconstruction, where geometric details are added via normal map integration. MExECON integrates normal maps from both front and back views to accurately capture fine-grained surface details such as clothing folds and hairstyles. All multi-view gains are achieved without requiring any network re-training. Experimental results show that MExECON consistently improves fidelity over the single-view baseline and achieves competitive performance compared to modern few-shot 3D reconstruction methods.
[97] Task-Generalized Adaptive Cross-Domain Learning for Multimodal Image Fusion
Mengyu Wang,Zhenyu Liu,Kun Li,Yu Wang,Yuwei Wang,Yanyan Wei,Fei Wang
Main category: cs.CV
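TL;DR: The paper proposes AdaSFFuse, a novel multimodal image fusion framework that combines an Adaptive Approximate Wavelet Transform with Spatial-Frequency Mamba Blocks to improve fusion quality and reduce computational cost.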
Details
Motivation: Multimodal Image Fusion (MMIF) aims to integrate complementary information from different imaging modalities, but current methods still face challenges such as modality misalignment, high-frequency detail destruction, and task-specific limitations. Method: The paper proposes AdaSFFuse, which comprises the Adaptive Approximate Wavelet Transform (AdaWAT) and Spatial-Frequency Mamba Blocks for efficient multimodal fusion. Result: Experiments on four MMIF tasks demonstrate AdaSFFuse's superior fusion performance with low computational cost and a compact network, offering a strong balance between performance and efficiency. Conclusion: AdaSFFuse is a task-generalized multimodal image fusion framework that, through adaptive cross-domain co-fusion learning, improves the alignment and integration of multimodal features, reduces frequency loss, and preserves critical details. Abstract: Multimodal Image Fusion (MMIF) aims to integrate complementary information from different imaging modalities to overcome the limitations of individual sensors. It enhances image quality and facilitates downstream applications such as remote sensing, medical diagnostics, and robotics. Despite significant advancements, current MMIF methods still face challenges such as modality misalignment, high-frequency detail destruction, and task-specific limitations. To address these challenges, we propose AdaSFFuse, a novel framework for task-generalized MMIF through adaptive cross-domain co-fusion learning. AdaSFFuse introduces two key innovations: the Adaptive Approximate Wavelet Transform (AdaWAT) for frequency decoupling, and the Spatial-Frequency Mamba Blocks for efficient multimodal fusion. AdaWAT adaptively separates the high- and low-frequency components of multimodal images from different scenes, enabling fine-grained extraction and alignment of distinct frequency characteristics for each modality. The Spatial-Frequency Mamba Blocks facilitate cross-domain fusion in both spatial and frequency domains, enhancing this process. These blocks dynamically adjust through learnable mappings to ensure robust fusion across diverse modalities. By combining these components, AdaSFFuse improves the alignment and integration of multimodal features, reduces frequency loss, and preserves critical details.
Extensive experiments on four MMIF tasks -- Infrared-Visible Image Fusion (IVF), Multi-Focus Image Fusion (MFF), Multi-Exposure Image Fusion (MEF), and Medical Image Fusion (MIF) -- demonstrate AdaSFFuse's superior fusion performance, ensuring both low computational cost and a compact network, offering a strong balance between performance and efficiency. The code will be publicly available at https://github.com/Zhen-yu-Liu/AdaSFFuse.
[98] ExtraGS: Geometric-Aware Trajectory Extrapolation with Uncertainty-Guided Generative Priors
Kaiyuan Tan,Yingying Shen,Haohui Zhu,Zhiwei Zhan,Shan Zhao,Mingfei Tu,Hongcheng Luo,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye
Main category: cs.CV
TL;DR: The paper proposes ExtraGS, a novel framework for trajectory extrapolation that combines geometric and generative priors, significantly improving the quality and consistency of synthesized driving views for autonomous vehicles.
Details
Motivation: Synthesizing extrapolated views from driving logs is crucial for autonomous vehicle simulation but remains challenging due to poor geometric consistency and over-smoothed renderings in existing methods. Method: Proposed ExtraGS, a framework integrating geometric and generative priors, utilizing Road Surface Gaussian (RSG) representation, Far Field Gaussians (FFG), and a self-supervised uncertainty estimation framework based on spherical harmonics. Result: Extensive experiments show that ExtraGS improves realism and geometric consistency of extrapolated views across multiple datasets, multi-camera setups, and generative priors. Conclusion: ExtraGS significantly enhances the realism and geometric consistency of extrapolated views while preserving high fidelity along the original trajectory. Abstract: Synthesizing extrapolated views from recorded driving logs is critical for simulating driving scenes for autonomous driving vehicles, yet it remains a challenging task. Recent methods leverage generative priors as pseudo ground truth, but often lead to poor geometric consistency and over-smoothed renderings. To address these limitations, we propose ExtraGS, a holistic framework for trajectory extrapolation that integrates both geometric and generative priors. At the core of ExtraGS is a novel Road Surface Gaussian (RSG) representation based on a hybrid Gaussian-Signed Distance Function (SDF) design, and Far Field Gaussians (FFG) that use learnable scaling factors to efficiently handle distant objects. Furthermore, we develop a self-supervised uncertainty estimation framework based on spherical harmonics that enables selective integration of generative priors only where extrapolation artifacts occur.
Extensive experiments on multiple datasets, diverse multi-camera setups, and various generative priors demonstrate that ExtraGS significantly enhances the realism and geometric consistency of extrapolated views, while preserving high fidelity along the original trajectory.
[99] Multi-Object Sketch Animation with Grouping and Motion Trajectory Priors
Guotao Liang,Juncheng Hu,Ximing Xing,Jing Zhang,Qian Yu
Main category: cs.CV
TL;DR: GroupSketch introduces a two-stage pipeline for vector sketch animation that effectively handles multi-object interactions and complex motions.
Details
Motivation: Existing approaches struggle with multi-object interactions and complex motions, either being limited to single-object cases or suffering from temporal inconsistency and poor generalization. Method: The method adopts a two-stage pipeline comprising Motion Initialization and Motion Refinement. In the first stage, the input sketch is interactively divided into semantic groups and key frames are defined. In the second stage, a Group-based Displacement Network (GDN) refines the coarse animation by predicting group-specific displacement fields, leveraging priors from a text-to-video model. Result: Extensive experiments demonstrate that GroupSketch significantly outperforms existing methods in generating high-quality, temporally consistent animations for complex, multi-object sketches. Conclusion: GroupSketch is able to generate high-quality, temporally consistent animations for complex, multi-object sketches and expands the practical applications of sketch animation. Abstract: We introduce GroupSketch, a novel method for vector sketch animation that effectively handles multi-object interactions and complex motions. Existing approaches struggle with these scenarios, either being limited to single-object cases or suffering from temporal inconsistency and poor generalization. To address these limitations, our method adopts a two-stage pipeline comprising Motion Initialization and Motion Refinement. In the first stage, the input sketch is interactively divided into semantic groups and key frames are defined, enabling the generation of a coarse animation via interpolation. In the second stage, we propose a Group-based Displacement Network (GDN), which refines the coarse animation by predicting group-specific displacement fields, leveraging priors from a text-to-video model. GDN further incorporates specialized modules, such as Context-conditioned Feature Enhancement (CCFE), to improve temporal consistency. 
Extensive experiments demonstrate that our approach significantly outperforms existing methods in generating high-quality, temporally consistent animations for complex, multi-object sketches, thus expanding the practical applications of sketch animation.
[100] D3FNet: A Differential Attention Fusion Network for Fine-Grained Road Structure Extraction in Remote Perception Systems
Chang Liu,Yang Xu,Tamas Sziranyi
Main category: cs.CV
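TL;DR: D3FNet is a Dilated Dual-Stream Differential Attention Fusion Network for fine-grained road structure segmentation, aimed in particular at challenging narrow-road extraction scenarios.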
Details
Motivation: Extracting narrow roads from high-resolution remote sensing imagery remains a significant challenge because of their limited width, fragmented topology, and frequent occlusions. Method: D3FNet builds on the encoder-decoder backbone of D-LinkNet and introduces three key innovations: (1) a Differential Attention Dilation Extraction (DADE) module that enhances subtle road features while suppressing background noise at the bottleneck; (2) a Dual-stream Decoding Fusion Mechanism (DDFM) that integrates original and attention-modulated features to balance spatial precision with semantic context; and (3) a multi-scale dilation strategy (rates 1, 3, 5, 9) that mitigates gridding artifacts and improves continuity in narrow road prediction. Result: Extensive experiments on the DeepGlobe and CHN6-CUG benchmarks show that D3FNet outperforms state-of-the-art baselines in IoU and recall on challenging road regions. Conclusion: D3FNet is confirmed as a robust solution for fine-grained narrow road extraction in complex remote and cooperative perception scenarios. Abstract: Extracting narrow roads from high-resolution remote sensing imagery remains a significant challenge due to their limited width, fragmented topology, and frequent occlusions. To address these issues, we propose D3FNet, a Dilated Dual-Stream Differential Attention Fusion Network designed for fine-grained road structure segmentation in remote perception systems. Built upon the encoder-decoder backbone of D-LinkNet, D3FNet introduces three key innovations: (1) a Differential Attention Dilation Extraction (DADE) module that enhances subtle road features while suppressing background noise at the bottleneck; (2) a Dual-stream Decoding Fusion Mechanism (DDFM) that integrates original and attention-modulated features to balance spatial precision with semantic context; and (3) a multi-scale dilation strategy (rates 1, 3, 5, 9) that mitigates gridding artifacts and improves continuity in narrow road prediction. Unlike conventional models that overfit to generic road widths, D3FNet specifically targets fine-grained, occluded, and low-contrast road segments. Extensive experiments on the DeepGlobe and CHN6-CUG benchmarks show that D3FNet achieves superior IoU and recall on challenging road regions, outperforming state-of-the-art baselines. Ablation studies further verify the complementary synergy of attention-guided encoding and dual-path decoding. These results confirm D3FNet as a robust solution for fine-grained narrow road extraction in complex remote and cooperative perception scenarios.
[101] Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment
Youjia Zhang,Youngeun Kim,Young-Geun Choi,Hongyeob Kim,Huiling Liu,Sungeun Hong
Main category: cs.CV
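TL;DR: This paper proposes ADAPT, a new test-time adaptation method that requires no source data, no gradient updates, and no full access to target data, while delivering strong performance and robustness under a variety of distribution shifts.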
Details
Motivation: Test-time adaptation methods face challenges such as poor scalability and a lack of explicit modeling of class-conditional feature distributions, which limit their broader applicability. Method: Test-time adaptation is reframed as a Gaussian probabilistic inference task by modeling class-conditional likelihoods with gradually updated class means and a shared covariance matrix. Lightweight regularization guided by CLIP priors and a historical knowledge bank is introduced to correct potential likelihood bias. Result: ADAPT performs strongly in extensive benchmark experiments, particularly under a wide range of distribution shifts, achieving state-of-the-art performance together with superior scalability and robustness. Conclusion: ADAPT is an advanced distribution-aware, backpropagation-free test-time adaptation method that achieves state-of-the-art performance under a variety of distribution shifts with excellent scalability and robustness. Abstract: Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
[102] High-Frequency First: A Two-Stage Approach for Improving Image INR
Sumit Kumar Dam,Mrityunjoy Gain,Eui-Nam Huh,Choong Seon Hong
Main category: cs.CV
TL;DR: This paper introduces a novel two-stage training strategy for implicit neural representations that adaptively highlights high-frequency image details, improving reconstruction quality and addressing the issue of spectral bias in neural networks.
Details
Motivation: The motivation is to address the spectral bias of neural networks, which struggle to capture high-frequency details in images when using implicit neural representations. Method: The method involves a two-stage training approach using a neighbor-aware soft mask to assign higher weights to pixels with strong local variations, followed by full-image training. Result: Experimental results demonstrate that the proposed approach improves reconstruction quality and complements existing INR methods by focusing on high-frequency pixel importance. Conclusion: The proposed two-stage training strategy effectively mitigates the spectral bias problem in implicit neural representations by adaptively emphasizing high-frequency details during training. Abstract: Implicit Neural Representations (INRs) have emerged as a powerful alternative to traditional pixel-based formats by modeling images as continuous functions over spatial coordinates. A key challenge, however, lies in the spectral bias of neural networks, which tend to favor low-frequency components while struggling to capture high-frequency (HF) details such as sharp edges and fine textures. While prior approaches have addressed this limitation through architectural modifications or specialized activation functions, we propose an orthogonal direction by directly guiding the training process. Specifically, we introduce a two-stage training strategy where a neighbor-aware soft mask adaptively assigns higher weights to pixels with strong local variations, encouraging early focus on fine details. The model then transitions to full-image training. Experimental results show that our approach consistently improves reconstruction quality and complements existing INR methods. 
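The two-stage idea described above can be illustrated with a small, purely hypothetical sketch. The abstract does not specify how the neighbor-aware soft mask is computed, so the code below assumes one plausible form: each pixel's weight grows with the mean absolute difference to its 4-neighbours, and stage 2 falls back to uniform weights. The names `local_variation`, `soft_mask`, `stage_weights`, and the parameter `alpha` are illustrative assumptions, not the authors' definitions.

```python
from typing import List

Image = List[List[float]]


def local_variation(img: Image) -> Image:
    """Mean absolute difference between each pixel and its 4-neighbours."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            diffs = []
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    diffs.append(abs(img[i][j] - img[ni][nj]))
            out[i][j] = sum(diffs) / len(diffs) if diffs else 0.0
    return out


def soft_mask(img: Image, alpha: float = 4.0) -> Image:
    """Assumed soft mask: weight 1 in flat regions, up to 1 + alpha at the strongest edge."""
    var = local_variation(img)
    vmax = max(max(row) for row in var)
    if vmax == 0:
        return [[1.0] * len(img[0]) for _ in img]
    return [[1.0 + alpha * v / vmax for v in row] for row in var]


def stage_weights(img: Image, stage: int) -> Image:
    """Stage 1 emphasises high-variation pixels; stage 2 is plain full-image training."""
    if stage == 1:
        return soft_mask(img)
    return [[1.0] * len(img[0]) for _ in img]


def weighted_mse(pred: Image, target: Image, weights: Image) -> float:
    """Per-pixel squared error, weighted and normalised by the total weight."""
    num = sum(w * (p - t) ** 2
              for pr, tr, wr in zip(pred, target, weights)
              for p, t, w in zip(pr, tr, wr))
    den = sum(sum(wr) for wr in weights)
    return num / den


# Toy image: left half dark, right half bright, so the only detail is a vertical edge.
image = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(4)]
mask = stage_weights(image, stage=1)
uniform = stage_weights(image, stage=2)
```

On this toy image the stage-1 weights are larger along the vertical edge than in the flat regions, so a reconstruction loss such as `weighted_mse` would penalise edge errors more heavily early in training; in stage 2 every pixel counts equally.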
As a pioneering attempt to assign frequency-aware importance to pixels in image INR, our work offers a new avenue for mitigating the spectral bias problem.
[103] Fast globally optimal Truncated Least Squares point cloud registration with fixed rotation axis
Ivo Ivanov,Carsten Markgraf
Main category: cs.CV
TL;DR: This paper introduces a fast linear-time convex relaxation and contractor method for globally optimal point cloud registration, achieving speeds two orders of magnitude faster than existing methods for rotation-only problems.
Details
Motivation: Existing methods for globally optimal point cloud registration with high outlier robustness, such as semidefinite programming (SDP), are too slow for practical use, especially with large datasets. Method: The paper introduces a linear time convex relaxation and a contractor method to accelerate Branch and Bound (BnB), enabling fast and globally optimal solutions for the rotation-only truncated least squares (TLS) problem. Result: The proposed solver achieves provable global optimality in less than half a second for 3D point clouds with 100 points when the rotation axis is known, and is two orders of magnitude faster than STRIDE, the state-of-the-art SDP solver. Conclusion: The proposed method offers a significantly faster solution for globally optimal point cloud registration compared to existing SDP-based approaches, although it is currently limited to rotation-only problems. Abstract: Recent results showed that point cloud registration with given correspondences can be made robust to outlier rates of up to 95\% using the truncated least squares (TLS) formulation. However, solving this combinatorial optimization problem to global optimality is challenging. Provably globally optimal approaches using semidefinite programming (SDP) relaxations take hundreds of seconds for 100 points. In this paper, we propose a novel linear time convex relaxation as well as a contractor method to speed up Branch and Bound (BnB). Our solver can register two 3D point clouds with 100 points to provable global optimality in less than half a second when the axis of rotation is provided. Although it currently cannot solve the full 6DoF problem, it is two orders of magnitude faster than the state-of-the-art SDP solver STRIDE when solving the rotation-only TLS problem. 
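To make the rotation-only TLS objective concrete, here is a minimal sketch. It assumes the known rotation axis is the z-axis and replaces the paper's Branch and Bound solver with a plain grid search over the angle, so it carries no optimality guarantee; `rotate_z`, `tls_cost`, `grid_search`, and the `trunc` threshold are illustrative names, not the authors' API.

```python
import math


def rotate_z(point, theta):
    """Rotate a 3D point about the z-axis (the assumed known axis)."""
    c, s = math.cos(theta), math.sin(theta)
    x, y, z = point
    return (c * x - s * y, s * x + c * y, z)


def tls_cost(theta, src, dst, trunc):
    """Truncated least squares: each squared residual is capped at trunc**2,
    so an outlier correspondence contributes at most a constant."""
    total = 0.0
    for a, b in zip(src, dst):
        r = rotate_z(a, theta)
        sq = sum((ri - bi) ** 2 for ri, bi in zip(r, b))
        total += min(sq, trunc ** 2)
    return total


def grid_search(src, dst, trunc, steps=3600):
    """Exhaustive stand-in for a global solver: try `steps` evenly spaced angles."""
    best = min(
        range(steps),
        key=lambda k: tls_cost(2 * math.pi * k / steps, src, dst, trunc),
    )
    return 2 * math.pi * best / steps
```

The cap in `tls_cost` is what makes the problem combinatorial: each correspondence is either an inlier (quadratic cost) or an outlier (constant cost), which is the structure the paper's relaxation and contractor are described as exploiting.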
In addition to providing a formal proof for global optimality, we present empirical evidence of global optimality using adversarial instances with local minima close to the global minimum.
[104] Multi-perspective monitoring of wildlife and human activities from camera traps and drones with deep learning models
Hao Chen,Fang Qiu,Li An,Douglas Stow,Eve Bohnett,Haitao Lyu,Shuang Tian
Main category: cs.CV
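TL;DR: This study combines camera traps and drone thermal imaging with deep learning models for multi-perspective monitoring of wildlife and human activities, and identifies activity hotspots and potential human-wildlife conflict zones in and around Chitwan National Park, Nepal.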
Details
Motivation: Understanding the spatial distribution of wildlife and human activities is essential for evaluating human-wildlife interactions and developing effective conservation plans. Method: Camera-trap and drone imagery are combined for multi-perspective monitoring; deep learning models (such as YOLOv11s and an enhanced Faster RCNN) automatically identify wildlife and human activities, and spatial pattern analysis locates activity hotspots and potential human-wildlife conflict zones. Result: Among the deep learning models tested, YOLOv11s performed best at detecting objects in camera-trap imagery, with a precision of 96.2%, a recall of 92.3%, and an mAP50 of 96.7%. Spatial pattern analysis identified clear hotspots of wildlife and human activity and their overlap in certain areas, indicating potential conflict. Conclusion: Integrating multi-perspective monitoring with automated object detection enhances wildlife monitoring and landscape management. Abstract: Wildlife and human activities are key components of landscape systems. Understanding their spatial distribution is essential for evaluating human-wildlife interactions and informing effective conservation planning. This study performs multi-perspective monitoring of wildlife and human activities by combining camera traps and drone imagery, capturing the spatial patterns of their distributions, which allows the identification of the overlap of their activity zones and the assessment of the degree of human-wildlife conflict. The study was conducted in Chitwan National Park (CNP), Nepal, and adjacent regions. Images collected by visible and near-infrared camera traps and thermal infrared drones from February to July 2022 were processed to create training and testing datasets, which were used to build deep learning models to automatically identify wildlife and human activities. Drone-collected thermal imagery was used for detecting targets to provide multiple monitoring perspectives. Spatial pattern analysis was performed to identify animal and resident activity hotspots and delineate potential human-wildlife conflict zones. Among the deep learning models tested, YOLOv11s achieved the highest performance with a precision of 96.2%, recall of 92.3%, mAP50 of 96.7%, and mAP50 of 81.3%, making it the most effective for detecting objects in camera trap imagery. Drone-based thermal imagery, analyzed with an enhanced Faster RCNN model, added a complementary aerial viewpoint to camera trap detections. 
Spatial pattern analysis identified clear hotspots for both wildlife and human activities and their overlapping patterns within certain areas in the CNP and buffer zones, indicating potential conflict. This study reveals human-wildlife conflicts within the conserved landscape. Integrating multiperspective monitoring with automated object detection enhances wildlife surveillance and landscape management.
[105] When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding
Pengcheng Fang,Yuxia Chen,Rui Guo
Main category: cs.CV
TL;DR: Grounded VideoDiT introduces innovations to improve temporal perception in Video LLMs, achieving state-of-the-art results on benchmarks.
Details
Motivation: Recent Video LLMs have limitations in temporal perception, including implicit timestamp encoding, weak frame-level features, and language-vision alignment drift. Method: The paper introduces a Diffusion Temporal Latent encoder, object-grounded representations, and a mixed token scheme with discrete temporal tokens to enhance boundary sensitivity, maintain temporal consistency, align visual evidence, and enable fine-grained temporal reasoning. Result: Grounded VideoDiT achieves state-of-the-art results on Charades STA, NExT GQA, and multiple VideoQA benchmarks. Conclusion: Grounded VideoDiT improves temporal perception in Video LLMs by introducing three key innovations that enable robust grounding capabilities. Abstract: Understanding videos requires more than answering open-ended questions; it demands the ability to pinpoint when events occur and how entities interact across time. While recent Video LLMs have achieved remarkable progress in holistic reasoning, they remain coarse in temporal perception: timestamps are encoded only implicitly, frame-level features are weak in capturing continuity, and language-vision alignment often drifts from the entities of interest. In this paper, we present Grounded VideoDiT, a Video LLM designed to overcome these limitations by introducing three key innovations. First, a Diffusion Temporal Latent (DTL) encoder enhances boundary sensitivity and maintains temporal consistency. Second, object-grounded representations explicitly bind query entities to localized visual evidence, strengthening alignment. Third, a mixed token scheme with discrete temporal tokens provides explicit timestamp modeling, enabling fine-grained temporal reasoning. Together, these designs equip Grounded VideoDiT with robust grounding capabilities, as validated by state-of-the-art results on Charades STA, NExT GQA, and multiple VideoQA benchmarks.
[106] Weakly-Supervised Learning for Tree Instances Segmentation in Airborne Lidar Point Clouds
Swann Emilien Céleste Destouches,Jesse Lahaye,Laurent Valentin Jospin,Jan Skaloud
Main category: cs.CV
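TL;DR: This paper proposes a weakly supervised learning method for tree instance segmentation in airborne laser scanning data that substantially improves identification accuracy and reduces false detections.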
Details
Motivation: Tree instance segmentation remains challenging because of the variability of airborne laser scanning data and the high cost of obtaining precisely labeled data. Method: A weakly supervised approach is proposed in which a human operator's quality ratings serve as labels for an initial segmentation result; these labels train a rating model, and feedback from the rating model is used to fine-tune the segmentation model. Result: The approach improves the number of correctly identified tree instances by 34% while considerably reducing the number of non-tree instances predicted. Conclusion: Although challenges remain in sparsely forested regions and complex surroundings, the proposed method substantially improves tree instance identification and reduces false non-tree predictions. Abstract: Tree instance segmentation of airborne laser scanning (ALS) data is of utmost importance for forest monitoring, but remains challenging due to variations in the data caused by factors such as sensor resolution, vegetation state at acquisition time, terrain characteristics, etc. Moreover, obtaining a sufficient amount of precisely labeled data to train fully supervised instance segmentation methods is expensive. To address these challenges, we propose a weakly supervised approach where labels of an initial segmentation result obtained either by a non-finetuned model or a closed form algorithm are provided as a quality rating by a human operator. The labels produced during the quality assessment are then used to train a rating model, whose task is to classify a segmentation output into the same classes as specified by the human operator. Finally, the segmentation model is finetuned using feedback from the rating model. This in turn improves the original segmentation model by 34% in terms of correctly identified tree instances while considerably reducing the number of non-tree instances predicted. Challenges remain for data over sparsely forested regions characterized by small trees (less than two meters in height) and for complex surroundings containing shrubs, boulders, and other objects that can be confused with trees; the performance of the proposed method is reduced in these cases.
[107] Towards a 3D Transfer-based Black-box Attack via Critical Feature Guidance
Shuchao Pang,Zhenghan Chen,Shen Zhang,Liming Lu,Siyuan Liang,Anan Du,Yongbin Zhou
Main category: cs.CV
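TL;DR: This paper proposes CFG, a new feature-guided transfer-based attack method for generating adversarial point clouds; experiments show that it outperforms existing methods.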
Details
Motivation: Deep neural networks for 3D point clouds are vulnerable to adversarial examples, but information about the target model is hard to obtain in realistic scenarios, so a transfer-based attack that does not depend on model information is needed. Method: CFG, a new feature-guided transfer-based attack, generates adversarial point clouds by computing feature importance and constrains the maximum deviation of the point clouds in the loss function. Result: Experiments show that CFG significantly outperforms existing methods at generating adversarial point clouds and effectively improves the transferability of adversarial examples. Conclusion: CFG outperforms existing attack methods at generating adversarial point clouds, as validated on the ModelNet40 and ScanObjectNN benchmark datasets. Abstract: Deep neural networks for 3D point clouds have been demonstrated to be vulnerable to adversarial examples. Previous 3D adversarial attack methods often exploit certain information about the target models, such as model parameters or outputs, to generate adversarial point clouds. However, in realistic scenarios, it is challenging to obtain any information about the target models under conditions of absolute security. Therefore, we focus on transfer-based attacks, where generating adversarial point clouds does not require any information about the target models. Based on our observation that the critical features used for point cloud classification are consistent across different DNN architectures, we propose CFG, a novel transfer-based black-box attack method that improves the transferability of adversarial point clouds via the proposed Critical Feature Guidance. Specifically, our method regularizes the search of adversarial point clouds by computing the importance of the extracted features, prioritizing the corruption of critical features that are likely to be adopted by diverse architectures. Further, we explicitly constrain the maximum deviation extent of the generated adversarial point clouds in the loss function to ensure their imperceptibility. Extensive experiments conducted on the ModelNet40 and ScanObjectNN benchmark datasets demonstrate that the proposed CFG outperforms the state-of-the-art attack methods by a large margin.
[108] MapKD: Unlocking Prior Knowledge with Cross-Modal Distillation for Efficient Online HD Map Construction
Ziyang Yan,Ruikai Li,Zhiyong Cui,Bohan Li,Han Jiang,Yilong Ren,Aoyong Li,Zhenning Li,Sijia Wen,Haiyang Yu
Main category: cs.CV
TL;DR: MapKD is a new framework for online HD map construction that enhances the performance of a vision-based student model using knowledge distillation techniques, leading to improved mIoU, mAP scores, and faster inference.
Details
Motivation: Current online HD map construction methods rely on stale offline maps and multi-modal sensor suites, leading to computational overhead. A more efficient, vision-centric approach is needed. Method: MapKD, a novel multi-level cross-modal knowledge distillation framework with a Teacher-Coach-Student paradigm, is proposed. It uses Token-Guided 2D Patch Distillation and Masked Semantic Response Distillation strategies. Result: Extensive experiments on the nuScenes dataset demonstrate the effectiveness of MapKD in improving performance and inference speed. Conclusion: MapKD improves the student model by +6.68 mIoU and +10.94 mAP while accelerating inference speed. Abstract: Online HD map construction is a fundamental task in autonomous driving systems, aiming to acquire semantic information of map elements around the ego vehicle based on real-time sensor inputs. Recently, several approaches have achieved promising results by incorporating offline priors such as SD maps and HD maps or by fusing multi-modal data. However, these methods depend on stale offline maps and multi-modal sensor suites, resulting in avoidable computational overhead at inference. To address these limitations, we employ a knowledge distillation strategy to transfer knowledge from multimodal models with prior knowledge to an efficient, low-cost, and vision-centric student model. Specifically, we propose MapKD, a novel multi-level cross-modal knowledge distillation framework with an innovative Teacher-Coach-Student (TCS) paradigm. This framework consists of: (1) a camera-LiDAR fusion model with SD/HD map priors serving as the teacher; (2) a vision-centric coach model with prior knowledge and simulated LiDAR to bridge the cross-modal knowledge transfer gap; and (3) a lightweight vision-based student model. 
Additionally, we introduce two targeted knowledge distillation strategies: Token-Guided 2D Patch Distillation (TGPD) for bird's eye view feature alignment and Masked Semantic Response Distillation (MSRD) for semantic learning guidance. Extensive experiments on the challenging nuScenes dataset demonstrate that MapKD improves the student model by +6.68 mIoU and +10.94 mAP while simultaneously accelerating inference speed. The code is available at: https://github.com/2004yan/MapKD2026.
[109] CM2LoD3: Reconstructing LoD3 Building Models Using Semantic Conflict Maps
Franz Hanke,Antonia Bieringer,Olaf Wysocki,Boris Jutzi
Main category: cs.CV
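TL;DR: CM2LoD3 is a novel LoD3 building model reconstruction method based on Conflict Maps and a Semantic Conflict Map Generator; it improves the segmentation and reconstruction accuracy of building openings and opens a path toward automated, efficient 3D city modeling.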
Details
Motivation: Existing LoD1 and LoD2 building models lack the detailed facade elements essential for advanced urban analysis, while LoD3 models usually require manual modeling and are therefore hard to adopt at scale. An automated method for reconstructing LoD3 models is needed. Method: Real-world Conflict Maps (CMs) obtained from ray-to-model-prior analysis are semantically segmented with the help of synthetic CMs produced by the authors' Semantic Conflict Map Generator (SCMG). Segmentation of textured models is additionally fused in using confidence scores to improve segmentation performance. Result: Experiments show that CM2LoD3 is effective at segmenting and reconstructing building openings, achieving a 61% performance improvement through uncertainty-aware fusion of texture segmentation. Conclusion: CM2LoD3 is an innovative LoD3 building model reconstruction method that effectively segments and reconstructs building openings, offering a new solution for automated LoD3 model reconstruction. Abstract: Detailed 3D building models are crucial for urban planning, digital twins, and disaster management applications. While Level of Detail (LoD) 1 and LoD2 building models are widely available, they lack detailed facade elements essential for advanced urban analysis. In contrast, LoD3 models address this limitation by incorporating facade elements such as windows, doors, and underpasses. However, their generation has traditionally required manual modeling, making large-scale adoption challenging. In this contribution, CM2LoD3, we present a novel method for reconstructing LoD3 building models leveraging Conflict Maps (CMs) obtained from ray-to-model-prior analysis. Unlike previous works, we concentrate on semantically segmenting real-world CMs with synthetically generated CMs from our developed Semantic Conflict Map Generator (SCMG). We also observe that additional segmentation of textured models can be fused with CMs using confidence scores to further increase segmentation performance and thus increase 3D reconstruction accuracy. Experimental results demonstrate the effectiveness of our CM2LoD3 method in segmenting and reconstructing building openings, reaching a performance of 61% with uncertainty-aware fusion of segmented building textures. This research contributes to the advancement of automated LoD3 model reconstruction, paving the way for scalable and efficient 3D city modeling. Our project is available: https://github.com/InFraHank/CM2LoD3
[110] LLM-empowered Dynamic Prompt Routing for Vision-Language Models Tuning under Long-Tailed Distributions
Yongju Jia,Jiarui Ma,Xiangxian Li,Baiqiao Zhang,Xianhui Cao,Juan Liu,Yulong Bian
Main category: cs.CV
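TL;DR: This paper proposes a new Multi-dimensional Dynamic Prompt Routing framework to address the bias that arises when fine-tuning pre-trained vision-language models in class-imbalanced scenarios.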
Details
Motivation: Fine-tuning of pre-trained vision-language models (VLMs) is prone to bias in class-imbalanced scenarios, and methods that bring in large language models (LLMs) to enhance VLM fine-tuning often overlook the class imbalance inherent in VLM pre-training, which may cause bias to accumulate in downstream tasks. Method: The Multi-dimensional Dynamic Prompt Routing (MDPR) framework builds a comprehensive class knowledge base and, during fine-tuning, uses a dynamic routing mechanism to align global visual classes, retrieve optimal prompts, and balance fine-grained semantics. Result: Extensive experiments on long-tailed benchmarks (including CIFAR-LT, ImageNet-LT, and Places-LT) show that MDPR performs on par with current state-of-the-art methods. Ablation studies further confirm the effectiveness of the semantic library for tail classes and show that dynamic routing incurs minimal computational overhead. Conclusion: MDPR is a flexible and efficient enhancement for VLM fine-tuning under data imbalance. Abstract: Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive capability in visual tasks, but their fine-tuning often suffers from bias in class-imbalanced scenarios. Recent works have introduced large language models (LLMs) to enhance VLM fine-tuning by supplementing semantic information. However, they often overlook inherent class imbalance in VLMs' pre-training, which may lead to bias accumulation in downstream tasks. To address this problem, this paper proposes a Multi-dimensional Dynamic Prompt Routing (MDPR) framework. MDPR constructs a comprehensive knowledge base for classes, spanning five visual-semantic dimensions. During fine-tuning, the dynamic routing mechanism aligns global visual classes, retrieves optimal prompts, and balances fine-grained semantics, yielding stable predictions through logits fusion. Extensive experiments on long-tailed benchmarks, including CIFAR-LT, ImageNet-LT, and Places-LT, demonstrate that MDPR achieves comparable results with current SOTA methods. Ablation studies further confirm the effectiveness of our semantic library for tail classes, and show that our dynamic routing incurs minimal computational overhead, making MDPR a flexible and efficient enhancement for VLM fine-tuning under data imbalance.
[111] StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding
Yanlai Yang,Zhuokai Zhao,Satya Narayan Shukla,Aashu Singh,Shlok Kumar Mishra,Lizhu Zhang,Mengye Ren
Main category: cs.CV
TL;DR: StreamMem is a streaming video understanding method that efficiently compresses KV cache without needing prior questions, enabling effective long video QA.
Details
Motivation: Current MLLMs face memory and computational challenges when handling long videos due to KV cache overhead, and existing compression methods are impractical in streaming or multi-turn conversational settings. Method: StreamMem compresses the KV cache using attention scores between visual tokens and generic query tokens in a streaming manner, maintaining a fixed-size KV memory. Result: StreamMem achieves state-of-the-art performance in query-agnostic KV cache compression and is competitive with query-aware approaches across multiple long and streaming video benchmarks. Conclusion: StreamMem provides an efficient, query-agnostic KV cache compression approach that enables efficient question answering in long video understanding scenarios without prior knowledge of queries. Abstract: Multimodal large language models (MLLMs) have made significant progress in visual-language reasoning, but their ability to efficiently handle long videos remains limited. Despite recent advances in long-context MLLMs, storing and attending to the key-value (KV) cache for long visual contexts incurs substantial memory and computational overhead. Existing visual compression methods require either encoding the entire visual context before compression or having access to the questions in advance, which is impractical for long video understanding and multi-turn conversational settings. In this work, we propose StreamMem, a query-agnostic KV cache memory mechanism for streaming video understanding. Specifically, StreamMem encodes new video frames in a streaming manner, compressing the KV cache using attention scores between visual tokens and generic query tokens, while maintaining a fixed-size KV memory to enable efficient question answering (QA) in memory-constrained, long-video scenarios. 
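The fixed-size memory update described here can be sketched as follows. This is an illustration under stated assumptions, not StreamMem's implementation: the `queries` argument stands in for the generic query tokens, eviction is a plain top-`budget` selection by mean softmax attention (the abstract does not say whether entries are pruned or merged), and both function names are hypothetical.

```python
import math


def attention_scores(keys, queries):
    """Mean softmax attention each cached key receives from the proxy queries."""
    dim = len(keys[0])
    totals = [0.0] * len(keys)
    for q in queries:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        norm = sum(exps)
        for i, e in enumerate(exps):
            totals[i] += e / norm
    return [t / len(queries) for t in totals]


def update_memory(memory, new_kv, queries, budget):
    """Append the new chunk's (key, value) pairs, then keep the `budget`
    highest-scoring entries in their original temporal order."""
    merged = memory + new_kv
    if len(merged) <= budget:
        return merged
    scores = attention_scores([k for k, _ in merged], queries)
    ranked = sorted(range(len(merged)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [merged[i] for i in keep]
```

Because scoring uses only the proxy queries, the update can run as each chunk of frames arrives, before any user question is known, and the memory never grows beyond `budget` entries.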
Evaluation on three long video understanding and two streaming video question answering benchmarks shows that StreamMem achieves state-of-the-art performance in query-agnostic KV cache compression and is competitive with query-aware compression approaches.
[112] WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception
Zhiheng Liu,Xueqing Deng,Shoufa Chen,Angtian Wang,Qiushan Guo,Mingfei Han,Zeyue Xue,Mengzhao Chen,Ping Luo,Linjie Yang
Main category: cs.CV
TL;DR: WorldWeaver enhances long video generation by integrating perceptual conditions and RGB modeling, leveraging depth cues and noise scheduling to reduce errors and improve consistency.
Details
Motivation: Current generative video models struggle with structural and temporal consistency over long sequences due to reliance on RGB signals, which can accumulate errors over time. Method: WorldWeaver utilizes depth cues and a memory bank to preserve contextual information, incorporates segmented noise scheduling to mitigate drift, and is applicable to both diffusion and rectified flow-based models. Result: Experiments show that WorldWeaver effectively reduces temporal drift and improves the quality and fidelity of long-horizon video generation. Conclusion: WorldWeaver improves long video generation by jointly modeling RGB frames and perceptual conditions, effectively reducing temporal drift and enhancing video fidelity. Abstract: Generative video modeling has made significant strides, yet ensuring structural and temporal consistency over long sequences remains a challenge. Current methods predominantly rely on RGB signals, leading to accumulated errors in object structure and motion over extended durations. To address these issues, we introduce WorldWeaver, a robust framework for long video generation that jointly models RGB frames and perceptual conditions within a unified long-horizon modeling scheme. Our training framework offers three key advantages. First, by jointly predicting perceptual conditions and color information from a unified representation, it significantly enhances temporal consistency and motion dynamics. Second, by leveraging depth cues, which we observe to be more resistant to drift than RGB, we construct a memory bank that preserves clearer contextual information, improving quality in long-horizon video generation. Third, we employ segmented noise scheduling for training prediction groups, which further mitigates drift and reduces computational cost. 
Extensive experiments on both diffusion- and rectified flow-based models demonstrate the effectiveness of WorldWeaver in reducing temporal drift and improving the fidelity of generated videos.
[113] Fine-grained Multi-class Nuclei Segmentation with Molecular-empowered All-in-SAM Model
Xueyuan Li,Can Cui,Ruining Deng,Yucheng Tang,Quan Liu,Tianyuan Yao,Shunxing Bao,Naweed Chowdhury,Haichun Yang,Yuankai Huo
Main category: cs.CV
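TL;DR: This paper proposes a new All-in-SAM model that, through molecular-empowered learning and adaptation of the SAM model, improves cell classification performance in computational pathology and reduces the burden on annotators.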
Details
Motivation: Recent progress in computational pathology has been driven by advances in vision foundation models, particularly the Segment Anything Model (SAM), but general vision foundation models still struggle with fine-grained semantic segmentation, such as identifying specific nuclei subtypes or particular cells. Method: The study takes a full-stack approach: (1) engaging lay annotators through molecular-empowered learning to reduce the need for detailed pixel-level annotations; (2) adapting the SAM model with a SAM adapter to emphasize specific semantics; and (3) improving segmentation accuracy by integrating Molecular-Oriented Corrective Learning (MOCL). Result: Experimental results show that the All-in-SAM model significantly improves cell classification performance across varying annotation quality. Conclusion: The study proposes a molecular-empowered All-in-SAM model that not only reduces annotator workload but also extends access to precise biomedical image analysis in resource-limited settings, advancing medical diagnostics and the automation of pathology image analysis. Abstract: Purpose: Recent developments in computational pathology have been driven by advances in Vision Foundation Models, particularly the Segment Anything Model (SAM). This model facilitates nuclei segmentation through two primary methods: prompt-based zero-shot segmentation and the use of cell-specific SAM models for direct segmentation. These approaches enable effective segmentation across a range of nuclei and cells. However, general vision foundation models often face challenges with fine-grained semantic segmentation, such as identifying specific nuclei subtypes or particular cells. Approach: In this paper, we propose the molecular-empowered All-in-SAM Model to advance computational pathology by leveraging the capabilities of vision foundation models. This model incorporates a full-stack approach, focusing on: (1) annotation: engaging lay annotators through molecular-empowered learning to reduce the need for detailed pixel-level annotations, (2) learning: adapting the SAM model to emphasize specific semantics, which utilizes its strong generalizability with SAM adapter, and (3) refinement: enhancing segmentation accuracy by integrating Molecular-Oriented Corrective Learning (MOCL). Results: Experimental results from both in-house and public datasets show that the All-in-SAM model significantly improves cell classification performance, even when faced with varying annotation quality. 
Conclusions: Our approach not only reduces the workload for annotators but also extends the accessibility of precise biomedical image analysis to resource-limited settings, thereby advancing medical diagnostics and automating pathology image analysis.
[114] Waver: Wave Your Way to Lifelike Video Generation
Yifu Zhang,Hao Yang,Yuqi Zhang,Yifei Hu,Fengda Zhu,Chuang Lin,Xiaofeng Mei,Yi Jiang,Zehuan Yuan,Bingyue Peng
Main category: cs.CV
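TL;DR: Waver is a unified model for generating high-quality images and videos; by introducing a new architecture and data processing methods, it performs strongly across multiple tasks.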
Details
Motivation: To advance video generation technology and provide a unified model that efficiently generates high-quality video. Method: The work introduces a Hybrid Stream DiT architecture, establishes a comprehensive data curation pipeline, and uses an MLLM-based video quality model to filter for high-quality samples. Result: Waver natively generates 5- to 10-second videos at 720p, which are then upscaled to 1080p, and it ranks near the top of multiple leaderboards. Conclusion: Waver is an efficient foundation model for unified image and video generation that performs well on text-to-video and image-to-video tasks, and the authors hope it will advance video generation technology. Abstract: We present Waver, a high-performance foundation model for unified image and video generation. Waver can directly generate videos with durations ranging from 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and manually annotate and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community more efficiently train high-quality video generation models and accelerate progress in video generation technologies. Official page: https://github.com/FoundationVision/Waver.
[115] ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling
Jinhyung Park,Javier Romero,Shunsuke Saito,Fabian Prada,Takaaki Shiratori,Yichen Xu,Federica Bogo,Shoou-I Yu,Kris Kitani,Rawal Khirodkar
Main category: cs.CV
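TL;DR: ATLAS is a high-fidelity body model learned from 600k high-resolution scans; by grounding the mesh representation in the human skeleton, it explicitly decouples the shape and skeleton bases, improving shape expressivity and fine-grained customization of body attributes.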
Details
Motivation: Existing human mesh modeling approaches struggle to capture detailed variations across diverse body poses and shapes, largely because of limited training data diversity and restrictive modeling assumptions. In addition, conventional methods first optimize the external body surface with a linear basis and then regress internal skeletal joints from surface vertices, which introduces dependencies between the internal skeleton and outer soft tissue and limits direct control over body height and bone lengths. Method: The ATLAS model explicitly decouples the shape and skeleton bases by grounding the mesh representation in the human skeleton. Result: ATLAS outperforms existing methods when fitting unseen subjects in diverse poses, and quantitative evaluations show that its non-linear pose correctives capture complex poses more effectively than linear models. Conclusion: ATLAS is a stronger high-fidelity body model with greater shape expressivity and fine-grained customization of body attributes. Abstract: Parametric body models offer expressive 3D representation of humans across a wide range of poses, shapes, and facial expressions, typically derived by learning a basis over registered 3D meshes. However, existing human mesh modeling approaches struggle to capture detailed variations across diverse body poses and shapes, largely due to limited training data diversity and restrictive modeling assumptions. Moreover, the common paradigm first optimizes the external body surface using a linear basis, then regresses internal skeletal joints from surface vertices. This approach introduces problematic dependencies between internal skeleton and outer soft tissue, limiting direct control over body height and bone lengths. To address these issues, we present ATLAS, a high-fidelity body model learned from 600k high-resolution scans captured using 240 synchronized cameras. Unlike previous methods, we explicitly decouple the shape and skeleton bases by grounding our mesh representation in the human skeleton. This decoupling enables enhanced shape expressivity, fine-grained customization of body attributes, and keypoint fitting independent of external soft-tissue characteristics. ATLAS outperforms existing methods by fitting unseen subjects in diverse poses more accurately, and quantitative evaluations show that our non-linear pose correctives more effectively capture complex poses compared to linear models.
[116] SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass
Yanxu Meng,Haoning Wu,Ya Zhang,Weidi Xie
Main category: cs.CV
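TL;DR: SceneGen is a new framework that generates multiple 3D assets from a scene image and corresponding object masks, requires no optimization or asset retrieval, and shows efficient, robust generation across multiple evaluations.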
Details
Motivation: 3D content generation has attracted wide attention for its applications in VR/AR and embodied AI. Generating multiple 3D assets from a single scene image is a challenging task, and it is the problem SceneGen sets out to solve. Method: SceneGen uses a new feature aggregation module that integrates local and global scene information from visual and geometric encoders. Coupled with a position head, this makes it possible to generate 3D assets and their relative spatial positions in a single feedforward pass. SceneGen also extends directly to multi-image input scenarios. Result: SceneGen shows efficient and robust generation across multiple quantitative and qualitative evaluations. Although trained only on single-image inputs, it also achieves improved generation performance with multi-image inputs. Conclusion: SceneGen is a new framework that generates multiple 3D assets with geometry and texture from a scene image and corresponding object masks, with no need for optimization or asset retrieval. Abstract: 3D content generation has recently attracted significant research interest due to its applications in VR/AR and embodied AI. In this work, we address the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen's direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architectural design enables improved generation performance with multi-image inputs; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robust generation abilities of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.
[117] Visual Autoregressive Modeling for Instruction-Guided Image Editing
Qingyang Mao,Qi Cai,Yehao Li,Yingwei Pan,Mingyue Cheng,Ting Yao,Qi Liu,Tao Mei
Main category: cs.CV
TL;DR: The paper presents VAREdit, a novel visual autoregressive framework for image editing that offers better adherence to instructions and improved efficiency compared to diffusion-based methods.
Details
Motivation: Diffusion-based editors deliver high visual fidelity, but their global denoising process entangles the edited region with the rest of the image, causing unintended modifications and weaker adherence to editing instructions; this motivates an alternative approach based on autoregressive models. Method: The paper introduces VAREdit, a visual autoregressive framework that reframes image editing as a next-scale prediction problem, using a Scale-Aligned Reference module to improve conditioning on source image tokens. Result: VAREdit achieves a 30%+ higher GPT-Balance score than leading diffusion-based methods and completes a 512x512 edit in 1.2 seconds, making it 2.2x faster than the similarly sized UltraEdit. Conclusion: VAREdit demonstrates significant advancements in both editing adherence and efficiency compared to diffusion-based methods. Abstract: Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently entangles the edited region with the entire image context, leading to unintended spurious modifications and compromised adherence to editing instructions. In contrast, autoregressive models offer a distinct paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally circumvents the adherence challenges of diffusion-based methods. In this paper, we present VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to effectively condition the source image tokens. We observe that finest-scale source features cannot effectively guide the prediction of coarser target features. To bridge this gap, we introduce a Scale-Aligned Reference (SAR) module, which injects scale-matched conditioning information into the first self-attention layer. VAREdit demonstrates significant advancements in both editing adherence and efficiency. On standard benchmarks, it outperforms leading diffusion-based methods with a 30%+ higher GPT-Balance score. Moreover, it completes a 512x512 edit in 1.2 seconds, making it 2.2x faster than the similarly sized UltraEdit. The models are available at https://github.com/HiDream-ai/VAREdit.
[118] Scaling Group Inference for Diverse and High-Quality Generation
Gaurav Parmar,Or Patashnik,Daniil Ostashev,Kuan-Chieh Wang,Kfir Aberman,Srinivasa Narasimhan,Jun-Yan Zhu
Main category: cs.CV
TL;DR: This paper proposes a scalable group inference method for generative models that improves both the diversity and quality of output groups, applicable across various tasks like text-to-image and video generation.
Details
Motivation: Generative models typically produce redundant results when generating multiple outputs, limiting user choice and exploration; this work aims to enhance the diversity and quality of output groups. Method: The method formulates group inference as a quadratic integer assignment problem, using unary terms for sample quality and binary terms for group diversity, with progressive pruning for efficiency. Result: Experiments show significant improvements in group diversity and quality compared to baselines, with scalability to large candidate sets. Conclusion: The proposed scalable group inference method enhances both diversity and quality of generative model outputs, applicable across a wide range of tasks. Abstract: Generative models typically sample outputs independently, and recent inference-time guidance and scaling algorithms focus on improving the quality of individual samples. However, in real-world applications, users are often presented with a set of multiple images (e.g., 4-8) for each prompt, where independent sampling tends to lead to redundant results, limiting user choices and hindering idea exploration. In this work, we introduce a scalable group inference method that improves both the diversity and quality of a group of samples. We formulate group inference as a quadratic integer assignment problem: candidate outputs are modeled as graph nodes, and a subset is selected to optimize sample quality (unary term) while maximizing group diversity (binary term). To substantially improve runtime efficiency, we progressively prune the candidate set using intermediate predictions, allowing our method to scale up to large candidate sets. Extensive experiments show that our method significantly improves group diversity and quality compared to independent sampling baselines and recent inference algorithms. 
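The subset-selection objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the weight `lam`, the toy quality scores, and the pairwise distance matrix are all assumptions made for illustration, and the exhaustive search stands in for the paper's progressive pruning, which is not reproduced here.

```python
from itertools import combinations


def group_score(subset, quality, distance, lam=1.0):
    """Score a candidate group: unary quality plus weighted pairwise diversity.

    `lam` is a hypothetical trade-off weight, not a parameter named in the paper.
    """
    unary = sum(quality[i] for i in subset)
    binary = sum(distance[i][j] for i, j in combinations(subset, 2))
    return unary + lam * binary


def select_group(quality, distance, k, lam=1.0):
    """Exhaustively pick the size-k subset with the highest group score.

    Only feasible for small candidate sets; it enumerates every subset.
    """
    n = len(quality)
    return max(
        combinations(range(n), k),
        key=lambda s: group_score(s, quality, distance, lam),
    )


# Toy data (assumed): candidates 0 and 1 are high quality but near-duplicates.
quality = [0.9, 0.88, 0.5, 0.7]
distance = [
    [0.0, 0.05, 0.8, 0.7],
    [0.05, 0.0, 0.8, 0.7],
    [0.8, 0.8, 0.0, 0.3],
    [0.7, 0.7, 0.3, 0.0],
]

print(select_group(quality, distance, k=2, lam=1.0))  # diversity-aware: (0, 3)
print(select_group(quality, distance, k=2, lam=0.0))  # quality only: (0, 1)
```

With the diversity term switched off, the two near-duplicate candidates are chosen; with it on, one of them is replaced by a more distinct sample, which is the behavior the quadratic formulation is meant to encourage.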
Our framework generalizes across a wide range of tasks, including text-to-image, image-to-image, image prompting, and video generation, enabling generative models to treat multiple outputs as cohesive groups rather than independent samples.
[119] CineScale: Free Lunch in High-Resolution Cinematic Visual Generation
Haonan Qiu,Ning Yu,Ziqi Huang,Paul Debevec,Ziwei Liu
Main category: cs.CV
TL;DR: CineScale is a novel inference paradigm that enables higher-resolution visual generation without extensive fine-tuning.
Details
Motivation: Visual diffusion models are trained at limited resolutions, which restricts their ability to generate high-fidelity images or videos at higher resolutions. Method: The paper proposes CineScale, a novel inference paradigm, with dedicated variants for two types of video generation architectures. Result: CineScale achieves 8k image generation without any fine-tuning and 4k video generation with only minimal LoRA fine-tuning. Conclusion: CineScale is a novel inference paradigm that enables higher-resolution visual generation without extensive fine-tuning. Abstract: Visual diffusion models achieve remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to exploit the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns deriving from the accumulated errors. In this work, we propose CineScale, a novel inference paradigm to enable higher-resolution visual generation. To tackle the various issues introduced by the two types of video generation architectures, we propose dedicated variants tailored to each. Unlike existing baseline methods that are confined to high-resolution T2I and T2V generation, CineScale broadens the scope by enabling high-resolution I2V and V2V synthesis, built atop state-of-the-art open-source video generation frameworks. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Remarkably, our approach enables 8k image generation without any fine-tuning, and achieves 4k video generation with only minimal LoRA fine-tuning. Generated video samples are available at our website: https://eyeline-labs.github.io/CineScale/.
cs.AI [Back]
[120] SurgWound-Bench: A Benchmark for Surgical Wound Diagnosis
Jiahao Xu,Changchang Yin,Odysseas Chatzipanagiotou,Diamantis Tsilimigras,Kevin Clear,Bingsheng Yao,Dakuo Wang,Timothy Pawlik,Ping Zhang
Main category: cs.AI
TL;DR: This paper introduces WoundQwen, a three-stage learning framework for surgical wound diagnosis, and SurgWound, an open-source surgical wound dataset.