cs.CL

[1] From Image Captioning to Visual Storytelling

Admitos Passadakis,Yingjin Song,Albert Gatt

Main category: cs.CL

TL;DR: This paper proposes a novel approach to Visual Storytelling by combining vision-to-language and language-to-language methods, resulting in more coherent and narrative stories while introducing a new metric called ideality to simulate human-likeness.

Details

Motivation: The motivation stems from the challenge of creating both narrative and coherent stories grounded in image sequences, aiming to improve the balance between these aspects in Visual Storytelling. Method: The method involves employing a vision-to-language model to obtain image captions, which are then transformed into coherent narratives using language-to-language techniques within a unified framework. Result: The results indicate that integrating captioning and storytelling in a unified framework positively impacts story quality, accelerates training time, and enhances reusability and reproducibility. Conclusion: The study concludes that treating Visual Storytelling as a superset of Image Captioning effectively balances narrative coherence with grounding in the image sequence, enhancing the quality of generated stories. Abstract: Visual Storytelling is a challenging multimodal task between Vision & Language, where the purpose is to generate a story for a stream of images. Its difficulty lies in the fact that the story should be grounded in the image sequence while also being narrative and coherent. The aim of this work is to balance these aspects by treating Visual Storytelling as a superset of Image Captioning, an approach quite different from most prior studies. This means that we first employ a vision-to-language model to obtain captions of the input images, and these captions are then transformed into coherent narratives using language-to-language methods. Our multifarious evaluation shows that integrating captioning and storytelling under a unified framework has a positive impact on the quality of the produced stories. In addition, compared to numerous previous studies, this approach accelerates training time and makes our framework readily reusable and reproducible by anyone interested. 
Lastly, we propose a new metric/tool, named ideality, that can be used to simulate how far some results are from an oracle model, and we apply it to emulate human-likeness in visual storytelling.

[2] Benchmarking Sociolinguistic Diversity in Swahili NLP: A Taxonomy-Guided Approach

Kezia Oketch,John P. Lalor,Ahmed Abbasi

Main category: cs.CL

TL;DR: This paper introduces a taxonomy-guided evaluation for Swahili NLP to address sociolinguistic diversity gaps, using a dataset from Kenyan speakers to highlight the influence of cultural and linguistic variation on model performance.

Details

Motivation: Address the lack of sociolinguistic diversity in current NLP evaluations and create a more inclusive and culturally relevant framework for assessing language models. Method: A taxonomy was developed and applied to evaluate pre-trained and instruction-tuned language models using a dataset of 2,170 free-text responses from Kenyan Swahili speakers, focusing on psychometric tasks in health contexts. Result: The dataset revealed tribal influences, urban vernacular, code-mixing, and loanwords, while the taxonomy-guided evaluation uncovered patterns in model prediction errors. Conclusion: Taxonomy-guided evaluation of Swahili NLP highlights the importance of sociolinguistic diversity in shaping model performance and advances culturally grounded evaluation frameworks. Abstract: We introduce the first taxonomy-guided evaluation of Swahili NLP, addressing gaps in sociolinguistic diversity. Drawing on health-related psychometric tasks, we collect a dataset of 2,170 free-text responses from Kenyan speakers. The data exhibits tribal influences, urban vernacular, code-mixing, and loanwords. We develop a structured taxonomy and use it as a lens for examining model prediction errors across pre-trained and instruction-tuned language models. Our findings advance culturally grounded evaluation frameworks and highlight the role of sociolinguistic variation in shaping model performance.

[3] Contrastive Analysis of Constituent Order Preferences Within Adverbial Roles in English and Chinese News: A Large-Language-Model-Driven Approach

Yiran Rex Ma

Main category: cs.CL

TL;DR: This paper explores the differences in constituent order of English-Chinese news from the perspective of functional chunks with adverbial roles and finds that English news prefers linear narrative while Chinese news prefers overall presentation mode.

Details

Motivation: This paper aims to explore the differences in constituent order of English-Chinese news from the perspective of functional chunks with adverbial roles and analyze their typical positional preferences and distribution patterns. Method: Based on comparable English-Chinese news corpora annotated by Large Language Model (LLM), this paper explores the differences in constituent order of English-Chinese news from the perspective of functional chunks with adverbial roles. Result: English news prefers linear narrative of core information first, functional chunks are mostly post-positioned; Chinese news prefers overall presentation mode of background first, functional chunks are often pre-positioned. In SVO structure, both show differences in the distribution of functional chunks, but the tendency of Chinese pre-positioning is more significant. When function blocks are co-occurring, both show high flexibility. Conclusion: The study reveals that word order has both systematic preference and dynamic adaptability, providing new empirical support for contrastive study of English-Chinese information structure. Abstract: Based on comparable English-Chinese news corpora annotated by Large Language Model (LLM), this paper attempts to explore the differences in constituent order of English-Chinese news from the perspective of functional chunks with adverbial roles, and analyze their typical positional preferences and distribution patterns. 
It is found that: (1) English news prefers linear narrative of core information first, and functional chunks are mostly post-positioned, while Chinese news prefers overall presentation mode of background first, and functional chunks are often pre-positioned; (2) In SVO structure, both English and Chinese news show differences in the distribution of functional chunks, but the tendency of Chinese pre-positioning is more significant, while that of English post-positioning is relatively mild; (3) When function blocks are co-occurring, both English and Chinese news show high flexibility, and the order adjustment is driven by information and pragmatic purposes. The study reveals that word order has both systematic preference and dynamic adaptability, providing new empirical support for contrastive study of English-Chinese information structure.

[4] T-REX: Table -- Refute or Entail eXplainer

Tim Luka Horstmann,Baptiste Geisenberger,Mehwish Alam

Main category: cs.CL

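TL;DR: This paper introduces T-REX, a tool that uses state-of-the-art instruction-tuned reasoning LLMs to let non-expert users verify claims over multimodal, multilingual tables.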

Details

Motivation: Although large language models have enabled significant progress in table fact-checking, current solutions remain inaccessible to non-experts. Method: The authors developed T-REX, an interactive tool built on state-of-the-art instruction-tuned reasoning LLMs, for claim verification over multimodal, multilingual tables. Result: T-REX is the first live, interactive tool for table claim verification, designed to improve the accuracy and transparency of fact-checking technology. Conclusion: T-REX is an openly available online tool that uses advanced instruction-tuned reasoning LLMs to let non-expert users verify claims, making fact-checking technology more accessible. Abstract: Verifying textual claims against structured tabular data is a critical yet challenging task in Natural Language Processing with broad real-world impact. While recent advances in Large Language Models (LLMs) have enabled significant progress in table fact-checking, current solutions remain inaccessible to non-experts. We introduce T-REX (T-REX: Table -- Refute or Entail eXplainer), the first live, interactive tool for claim verification over multimodal, multilingual tables using state-of-the-art instruction-tuned reasoning LLMs. Designed for accuracy and transparency, T-REX empowers non-experts by providing access to advanced fact-checking technology. The system is openly available online.

[5] Confidence Estimation for Text-to-SQL in Large Language Models

Sepideh Entezari Maleki,Mohammadreza Pourreza,Davood Rafiei

Main category: cs.CL

TL;DR: This paper explores confidence estimation strategies for text-to-SQL in large language models, showing that consistency-based methods work well for black-box models and SQL-syntax-aware approaches are effective in white-box settings, with execution-based grounding improving both.

Details

Motivation: Confidence estimation for text-to-SQL is important to assess the reliability of model-generated SQL queries without access to gold answers, especially in large language models where access to weights and gradients is limited. Method: Evaluated black-box and white-box confidence estimation strategies for assessing the reliability of model-generated SQL queries on cross-domain text-to-SQL benchmarks. Result: Consistency-based methods perform better among black-box models, while SQL-syntax-aware approaches show advantages in white-box settings for interpreting LLM logits. Conclusion: Execution-based grounding of queries enhances the effectiveness of both black-box and white-box confidence estimation approaches. Abstract: Confidence estimation for text-to-SQL aims to assess the reliability of model-generated SQL queries without having access to gold answers. We study this problem in the context of large language models (LLMs), where access to model weights and gradients is often constrained. We explore both black-box and white-box confidence estimation strategies, evaluating their effectiveness on cross-domain text-to-SQL benchmarks. Our evaluation highlights the superior performance of consistency-based methods among black-box models and the advantage of SQL-syntax-aware approaches for interpreting LLM logits in white-box settings. Furthermore, we show that execution-based grounding of queries provides a valuable supplementary signal, improving the effectiveness of both approaches.
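The combination of a consistency-based score with execution-based grounding can be illustrated with a small sketch. It assumes several SQL candidates have already been sampled from a black-box model for one question; the function name `execution_consistency`, the toy `singer` table, and the sample queries are hypothetical and are not taken from the paper.

```python
import sqlite3
from collections import Counter


def execution_consistency(conn, candidates):
    """Group sampled SQL candidates by execution result and return
    (most common result, fraction of all candidates that agree with it)."""
    results = []
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
            # Sort rows so that row order does not affect agreement.
            key = tuple(sorted(rows))
        except sqlite3.Error:
            key = None  # Queries that fail to execute never count as agreement.
        results.append(key)
    valid = [r for r in results if r is not None]
    if not valid:
        return None, 0.0
    top, count = Counter(valid).most_common(1)[0]
    return top, count / len(candidates)


# Toy database (hypothetical schema and rows, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?)",
    [("Ann", 40), ("Bob", 45), ("Cy", 52)],
)

# Hypothetical samples for "How many singers are older than 40?"
samples = [
    "SELECT COUNT(*) FROM singer WHERE age > 40",
    "SELECT COUNT(name) FROM singer WHERE age > 40",
    "SELECT COUNT(*) FROM singer WHERE age >= 40",   # different result
    "SELECT COUNT(*) FROM singers WHERE age > 40",   # wrong table name, fails
]
answer, confidence = execution_consistency(conn, samples)
```

Comparing execution results rather than query strings lets syntactically different but equivalent queries count as agreeing, which is one plausible reading of why execution grounding adds signal; the paper's actual scoring functions may differ.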

[6] Assessing and Mitigating Data Memorization Risks in Fine-Tuned Large Language Models

Badrinath Ramakrishnan,Akshaya Balaji

Main category: cs.CL

TL;DR: This paper investigates data memorization in fine-tuned large language models (LLMs) and proposes a multi-layered framework to reduce privacy leakage, achieving 0% data leakage while preserving 94.7% of model utility.

Details

Motivation: The motivation stems from the growing concern about privacy risks in Large Language Models (LLMs), particularly during fine-tuning processes, as these models tend to memorize training data, which can lead to unintended data leakage. Method: The authors conducted controlled experiments on modern LLM architectures, including GPT-2, Phi-3, and Gemma-2, to evaluate the impact of repeated sensitive data on privacy leakage. They proposed and evaluated four privacy protection methods: semantic data deduplication, differential privacy during generation, entropy-based filtering, and pattern-based content filtering. Result: The experiments showed that fine-tuning with repeated sensitive data increases privacy leakage rates from baseline levels of 0-5% to 60-75%, representing a 64.2% average increase across tested models. The proposed privacy protection techniques were able to reduce data leakage to 0% while maintaining 94.7% of the original model utility. Conclusion: The paper concludes that fine-tuning LLMs with repeated sensitive data significantly increases privacy leakage, but this risk can be effectively mitigated using the proposed multi-layered privacy protection framework. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, but their tendency to memorize training data poses significant privacy risks, particularly during fine-tuning processes. This paper presents a comprehensive empirical analysis of data memorization in fine-tuned LLMs and introduces a novel multi-layered privacy protection framework. Through controlled experiments on modern LLM architectures including GPT-2, Phi-3, and Gemma-2, we demonstrate that fine-tuning with repeated sensitive data increases privacy leakage rates from baseline levels of 0-5% to 60-75%, representing a 64.2% average increase across tested models. 
We propose and rigorously evaluate four complementary privacy protection methods: semantic data deduplication, differential privacy during generation, entropy-based filtering, and pattern-based content filtering. Our experimental results show that these techniques can reduce data leakage to 0% while maintaining 94.7% of original model utility.
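Of the four methods listed, pattern-based content filtering is the simplest to illustrate. The following is a minimal sketch, assuming the filter is applied to generated text as a post-processing step; the two regular expressions, the placeholder labels, and the `redact` function are hypothetical and are not the patterns used in the paper.

```python
import re

# Hypothetical patterns for two kinds of sensitive strings.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text):
    """Replace every match of a known sensitive pattern with a placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


filtered = redact("Contact jane.doe@example.com, SSN 123-45-6789.")
```

A filter of this kind only catches strings with a recognizable surface form, which is presumably why the paper pairs it with deduplication, differential privacy, and entropy-based filtering.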

[7] Punctuation and Predicates in Language Models

Sonakshi Chauhan,Maheep Chaudhary,Koby Choy,Samuel Nellessen,Nandi Schoots

Main category: cs.CL

TL;DR: This paper analyzes how information is propagated in LLMs, highlighting the role of punctuation as both necessary and sufficient in some models, and reveals differences in how LLMs process reasoning rules like conditionals and universal quantification.

Details

Motivation: The motivation is to understand where information is collected and how it is propagated throughout layers in large language models, particularly focusing on the computational importance of punctuation tokens and how different components of input are processed. Method: The authors used intervention-based techniques, interchange intervention, and layer-swapping experiments to evaluate the necessity and sufficiency of punctuation tokens and to investigate how different reasoning rules are processed by LLMs. Result: The results show that punctuation tokens are both necessary and sufficient in multiple layers for GPT-2, less so for DeepSeek, and not at all for Gemma. Additionally, the paper found that conditional statements and universal quantification are processed very differently in LLMs. Conclusion: The paper concludes that different reasoning rules, such as conditional statements and universal quantification, are processed differently by LLMs, and there are stark model-specific differences in the necessity and sufficiency of punctuation tokens across layers in various models like GPT-2, DeepSeek, and Gemma. Abstract: In this paper we explore where information is collected and how it is propagated throughout layers in large language models (LLMs). We begin by examining the surprising computational importance of punctuation tokens which previous work has identified as attention sinks and memory aids. Using intervention-based techniques, we evaluate the necessity and sufficiency (for preserving model performance) of punctuation tokens across layers in GPT-2, DeepSeek, and Gemma. Our results show stark model-specific differences: for GPT-2, punctuation is both necessary and sufficient in multiple layers, while this holds far less in DeepSeek and not at all in Gemma. 
Extending beyond punctuation, we ask whether LLMs process different components of input (e.g., subjects, adjectives, punctuation, full sentences) by forming early static summaries reused across the network, or if the model remains sensitive to changes in these components across layers. We also investigate whether different reasoning rules are processed differently by LLMs. In particular, through interchange intervention and layer-swapping experiments, we find that conditional statements (if, then) and universal quantification (for all) are processed very differently. Our findings offer new insight into the internal mechanisms of punctuation usage and reasoning in LLMs and have implications for interpretability.

[8] DLLMQuant: Quantizing Diffusion-based Large Language Models

Chen Xu,Dawei Yang

Main category: cs.CL

TL;DR: This paper proposes DLLMQuant, a novel PTQ framework specifically designed for diffusion-based large language models (DLLMs), which addresses quantization challenges through three techniques: Temporal-Mask Adaptive Sampling (TMAS), Interaction-Aware Activation Quantization (IA-AQ), and Certainty-Guided Quantization (CGQ).

Details

Motivation: DLLMs face deployment challenges due to large model sizes and high computational costs. Existing PTQ methods, when applied directly to DLLMs, result in severe accuracy degradation and reduced generalization performance. Method: The paper proposes DLLMQuant, a PTQ framework tailored for DLLMs, incorporating three techniques: Temporal-Mask Adaptive Sampling (TMAS), Interaction-Aware Activation Quantization (IA-AQ), and Certainty-Guided Quantization (CGQ). Result: Experiments show that DLLMQuant achieves significant performance gains while enhancing efficiency for DLLMs. Conclusion: DLLMQuant effectively addresses the quantization issues in DLLMs, achieving significant performance gains while enhancing efficiency. Abstract: Diffusion-based large language models (DLLMs) have shown promise for non-autoregressive text generation, but their deployment is constrained by large model sizes and heavy computational costs. Post-training quantization (PTQ), a widely used method for compressing and accelerating Large Language Models (LLMs), suffers from severe accuracy degradation and reduced generalization performance when directly applied to DLLMs (e.g., AWQ suffers a 16% accuracy drop on LLADA under W4A4). This paper explores how DLLMs' key mechanisms - dynamic masking, iterative generation, bidirectional attention - clash with quantization. We identify three core issues: 1) Iterative generation and dynamic masking ratios lead to distinct token distributions across decoding steps, which are not adequately captured by existing PTQ calibration methods; 2) Quantization errors are accumulated and amplified progressively during iteration in DLLMs, causing quantized models to perform worse as decoding steps progress; 3) Unmasked tokens stabilize while masked remain probabilistic, making overall feature distribution incompatible with existing PTQ methods. 
To address these issues, we propose DLLMQuant, a PTQ framework tailored for DLLMs, which incorporates three novel techniques: 1) Temporal-Mask Adaptive Sampling (TMAS), a calibration method that accounts for both time and mask factors, with the capacity to capture distributions across timesteps. 2) Interaction-Aware Activation Quantization (IA-AQ), which utilizes bidirectional attention's interaction signals to dynamically allocate quantization resources. 3) Certainty-Guided Quantization (CGQ), which integrates mask status and token scores as key weighting criteria into error compensation, making weight quantization more suitable for DLLMs. Experiments show that DLLMQuant achieves significant performance gains while enhancing efficiency.

[9] MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation

Xian Gao,Jiacheng Ruan,Zongyun Zhang,Jingsheng Gao,Ting Liu,Yuzhuo Fu

Main category: cs.CL

TL;DR: This paper introduces MMReview, a comprehensive benchmark for evaluating large language models in automated peer review tasks involving multimodal content across multiple academic disciplines.

Details

Motivation: The motivation stems from the lack of a unified evaluation benchmark for assessing LLM-based peer review systems, especially in handling multimodal content like figures and tables. Method: The authors proposed MMReview, a benchmark with multimodal content and expert-written reviews for 240 papers across 17 domains. It includes 13 tasks grouped into four categories to assess LLMs and MLLMs on review generation, outcome formulation, human preference alignment, and robustness. Experiments were conducted on 16 open-source and 5 closed-source models. Result: MMReview was shown to thoroughly evaluate LLMs and MLLMs across various tasks and models, demonstrating its effectiveness and potential as a standardized benchmark. Conclusion: The study concludes that MMReview serves as a comprehensive benchmark for evaluating LLMs and MLLMs in automated peer review tasks across multiple disciplines and modalities, offering a standardized foundation for future development in this area. Abstract: With the rapid growth of academic publications, peer review has become an essential yet time-consuming responsibility within the research community. Large Language Models (LLMs) have increasingly been adopted to assist in the generation of review comments; however, current LLM-based review tasks lack a unified evaluation benchmark to rigorously assess the models' ability to produce comprehensive, accurate, and human-aligned assessments, particularly in scenarios involving multimodal content such as figures and tables. To address this gap, we propose MMReview, a comprehensive benchmark that spans multiple disciplines and modalities. MMReview includes multimodal content and expert-written review comments for 240 papers across 17 research domains within four major academic disciplines: Artificial Intelligence, Natural Sciences, Engineering Sciences, and Social Sciences. 
We design a total of 13 tasks grouped into four core categories, aimed at evaluating the performance of LLMs and Multimodal LLMs (MLLMs) in step-wise review generation, outcome formulation, alignment with human preferences, and robustness to adversarial input manipulation. Extensive experiments conducted on 16 open-source models and 5 advanced closed-source models demonstrate the thoroughness of the benchmark. We envision MMReview as a critical step toward establishing a standardized foundation for the development of automated peer review systems.

[10] DPad: Efficient Diffusion Language Models with Suffix Dropout

Xinhua Chen,Sitao Huang,Cong Guo,Chiyue Wei,Yintao He,Jianyi Zhang,Hai "Hellen" Li,Yiran Chen

Main category: cs.CL

TL;DR: DPad is a training-free method that improves computational efficiency in diffusion-based Large Language Models by restricting attention to nearby suffix tokens, achieving significant speedup with minimal impact on accuracy.

Details

Motivation: The motivation is to reduce the high computational overhead in diffusion-based Large Language Models where all future suffix tokens are predicted at each step. Method: The method involves restricting attention to a small set of nearby suffix tokens using a sliding window and distance-decay dropout. Result: DPad achieves up to a 61.4x speedup over vanilla diffusion-based Large Language Models while maintaining comparable accuracy. Conclusion: The paper concludes that DPad, a training-free method for diffusion-based Large Language Models, significantly improves computational efficiency with minimal impact on accuracy. Abstract: Diffusion-based Large Language Models (dLLMs) parallelize text generation by framing decoding as a denoising process, but suffer from high computational overhead since they predict all future suffix tokens at each step while retaining only a small fraction. We propose Diffusion Scratchpad (DPad), a training-free method that restricts attention to a small set of nearby suffix tokens, preserving fidelity while eliminating redundancy. DPad integrates two strategies: (i) a sliding window, which maintains a fixed-length suffix window, and (ii) distance-decay dropout, which deterministically removes distant suffix tokens before attention computation. This simple design is compatible with existing optimizations such as prefix caching and can be implemented with only a few lines of code. Comprehensive evaluations across multiple benchmarks on LLaDA-1.5 and Dream models demonstrate that DPad delivers up to 61.4x speedup over vanilla dLLMs while maintaining comparable accuracy, highlighting its potential for efficient and scalable long-sequence inference. Our code is available at https://github.com/Crys-Chen/DPad.
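The two strategies can be pictured as a position-selection step that runs before attention. The sketch below is only an illustration of that idea: the abstract does not specify the decay schedule, so the geometric spacing used here, the function name `kept_suffix_positions`, and the `window` and `dense` parameters are all assumptions rather than DPad's actual implementation (see the linked repository for that).

```python
def kept_suffix_positions(suffix_start, seq_len, window=16, dense=4):
    """Return the suffix token positions that remain visible to attention.

    Positions at or beyond suffix_start + window are dropped (sliding window).
    Inside the window, the first `dense` positions are all kept and farther
    positions are kept at geometrically growing gaps (an assumed
    deterministic distance-decay schedule).
    """
    window_end = min(seq_len, suffix_start + window)
    kept = []
    distance = 0
    step = 1
    while suffix_start + distance < window_end:
        kept.append(suffix_start + distance)
        if distance + 1 >= dense:
            step *= 2  # Sample more sparsely as distance grows.
        distance += step
    return kept


positions = kept_suffix_positions(suffix_start=100, seq_len=1024, window=16, dense=4)
```

Because the selection depends only on position, it can be computed once per decoding step and used to index the suffix tokens before the attention call, which is consistent with the abstract's claim that the method needs only a few lines of code.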

[11] Comparing energy consumption and accuracy in text classification inference

Johannes Zschache,Tilman Hartwig

Main category: cs.CL

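TL;DR: This study examines the trade-off between energy consumption and accuracy during LLM inference on natural language processing tasks, and finds that the best-performing model can be energy-efficient while maintaining high accuracy.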

Details

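Motivation: The widespread deployment of large language models in natural language processing tasks has raised concerns about their energy efficiency and sustainability. Prior research has focused mainly on energy consumption during training, while the inference phase has received comparatively little attention. Method: The study systematically evaluates the trade-off between model accuracy and energy consumption in text classification inference across various model architectures and hardware configurations. Result: The most accurate model can also be energy-efficient, whereas larger LLMs tend to consume more energy while achieving lower classification accuracy. Inference energy consumption varies substantially (from milliwatt-hours to kilowatt-hours) and depends on model type, model size, and hardware specifications. In addition, inference energy consumption correlates strongly with model runtime, so execution time can serve as a practical proxy for energy usage when direct measurement is not feasible. Conclusion: The findings have implications for sustainable AI development and offer actionable insights for researchers, industry practitioners, and policymakers seeking to balance performance and resource efficiency in NLP applications. Abstract: The increasing deployment of large language models (LLMs) in natural language processing (NLP) tasks raises concerns about energy efficiency and sustainability. While prior research has largely focused on energy consumption during model training, the inference phase has received comparatively less attention. This study systematically evaluates the trade-offs between model accuracy and energy consumption in text classification inference across various model architectures and hardware configurations. Our empirical analysis shows that the best-performing model in terms of accuracy can also be energy-efficient, while larger LLMs tend to consume significantly more energy with lower classification accuracy. We observe substantial variability in inference energy consumption (<mWh to >kWh), influenced by model type, model size, and hardware specifications. Additionally, we find a strong correlation between inference energy consumption and model runtime, indicating that execution time can serve as a practical proxy for energy usage in settings where direct measurement is not feasible. These findings have implications for sustainable AI development, providing actionable insights for researchers, industry practitioners, and policymakers seeking to balance performance and resource efficiency in NLP applications.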
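The runtime-as-proxy finding amounts to a high correlation between per-run execution time and measured energy. A minimal sketch of that check follows; the `pearson` helper is a plain textbook implementation, and the runtime and energy values are made-up illustrative numbers, not measurements from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical per-configuration measurements (illustrative values only).
runtime_s = [1.0, 2.0, 4.0, 8.0]
energy_wh = [0.5, 1.1, 1.9, 4.2]

r = pearson(runtime_s, energy_wh)
```

When `r` is close to 1 on a given hardware setup, runtime can stand in for energy on that setup; the proxy would need recalibration across devices with different power draw.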

[12] Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper

Krishna Garg,Firoz Shaikh,Sambaran Bandyopadhyay,Cornelia Caragea

Main category: cs.CL

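TL;DR: This paper introduces the SciIG task, which evaluates the ability of large language models (LLMs) to generate high-quality research paper introductions from titles, abstracts, and related works, and assesses several state-of-the-art models along multiple dimensions on newly constructed datasets.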

Details

Motivation: As researchers increasingly use LLMs as writing assistants, generating high-quality research paper introductions has become both more important and more challenging. Method: The authors propose the Scientific Introduction Generation (SciIG) task, build new datasets from NAACL 2025 and ICLR 2025 papers, and evaluate several state-of-the-art LLMs along multiple dimensions using automated metrics and LLM-as-a-judge evaluations. Result: LLaMA-4 Maverick performs best on most metrics, particularly semantic similarity and faithfulness; in addition, three-shot prompting outperforms fewer-shot approaches overall. Conclusion: The study offers practical insights for developing effective research writing assistants, sets realistic expectations for LLM-assisted academic writing, and commits to releasing all code and datasets publicly to support future research. Abstract: As researchers increasingly adopt LLMs as writing assistants, generating high-quality research paper introductions remains both challenging and essential. We introduce Scientific Introduction Generation (SciIG), a task that evaluates LLMs' ability to produce coherent introductions from titles, abstracts, and related works. Curating new datasets from NAACL 2025 and ICLR 2025 papers, we assess five state-of-the-art models, including both open-source (DeepSeek-v3, Gemma-3-12B, LLaMA 4-Maverick, MistralAI Small 3.1) and closed-source GPT-4o systems, across multiple dimensions: lexical overlap, semantic similarity, content coverage, faithfulness, consistency, citation correctness, and narrative quality. Our comprehensive framework combines automated metrics with LLM-as-a-judge evaluations. Results demonstrate LLaMA-4 Maverick's superior performance on most metrics, particularly in semantic similarity and faithfulness. Moreover, three-shot prompting consistently outperforms fewer-shot approaches. These findings provide practical insights into developing effective research writing assistants and set realistic expectations for LLM-assisted academic writing. To foster reproducibility and future research, we will publicly release all code and datasets.

[13] Disentangling concept semantics via multilingual averaging in Sparse Autoencoders

Cliff O'Reilly,Ernesto Jimenez-Ruiz,Tillman Weyde

Main category: cs.CL

TL;DR: This paper proposes averaging concept activations from multilingual LLM prompts to isolate concept semantics, improving interpretability and alignment with formal knowledge structures.

Details

Motivation: LLMs often entangle semantic information with syntactic and language-specific details. This work aims to isolate concept semantics for better interpretability and to bridge LLMs with formal knowledge representation. Method: We created English text representations from OWL ontology classes, translated them into French and Chinese, and used the Gemma 2B LLM with Sparse Autoencoders to obtain concept activations. Activations across languages were averaged, and the results were correlated with ground truth ontology class mappings. Result: The conceptual average of activations across languages showed a strong alignment with the true relationships between ontology classes, indicating higher accuracy compared to using single-language activations. Conclusion: Averaging concept activations across languages helps isolate concept semantics in LLMs, providing a new technique for more accurate mechanistic interpretation of internal network states. Abstract: Connecting LLMs with formal knowledge representation and reasoning is a promising approach to address their shortcomings. Embeddings and sparse autoencoders are widely used to represent textual content, but the semantics are entangled with syntactic and language-specific information. We propose a method that isolates concept semantics in Large Language Models by averaging concept activations derived via Sparse Autoencoders. We create English text representations from OWL ontology classes, translate the English into French and Chinese and then pass these texts as prompts to the Gemma 2B LLM. Using the open source Gemma Scope suite of Sparse Autoencoders, we obtain concept activations for each class and language version. We average the different language activations to derive a conceptual average. We then correlate the conceptual averages with a ground truth mapping between ontology classes. 
Our results give a strong indication that the conceptual average aligns to the true relationship between classes when compared with a single language by itself. The result hints at a new technique which enables mechanistic interpretation of internal network states with higher accuracy.
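The averaging step can be illustrated on toy vectors. The sketch below assumes each class already has one activation vector per language; the five feature dimensions (two concept features, three language features), the class names, and all values are hypothetical stand-ins for Sparse Autoencoder activations, chosen only to show why averaging can suppress language-specific overlap.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy features: [animal, vehicle, lang_en, lang_fr, lang_zh] (hypothetical).
activations = {
    "Dog": {
        "en": [1.0, 0.0, 1.0, 0.0, 0.0],
        "fr": [1.0, 0.0, 0.0, 1.0, 0.0],
        "zh": [1.0, 0.0, 0.0, 0.0, 1.0],
    },
    "Car": {
        "en": [0.0, 1.0, 1.0, 0.0, 0.0],
        "fr": [0.0, 1.0, 0.0, 1.0, 0.0],
        "zh": [0.0, 1.0, 0.0, 0.0, 1.0],
    },
}

# Conceptual average: mean of the per-language activations for each class.
averages = {name: mean_vector(list(by_lang.values())) for name, by_lang in activations.items()}

# Similarity between two unrelated classes, single language vs. averaged.
english_only = cosine(activations["Dog"]["en"], activations["Car"]["en"])
averaged = cosine(averages["Dog"], averages["Car"])
```

In this toy setup the unrelated classes look half-similar in English alone because they share the language feature, and averaging dilutes that shared component; whether real activations behave this cleanly is exactly what the paper's correlation with the ontology mapping tests.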

[14] GRILE: A Benchmark for Grammar Reasoning and Explanation in Romanian LLMs

Adrian-Marius Dumitran,Alexandra-Mihaela Danila,Angela-Liliana Dumitran

Main category: cs.CL

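TL;DR: GRILE is a new benchmark for assessing the pedagogical value of large language models in low-resource languages such as Romanian, with a focus on grammar reasoning and explanation generation.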

Details

Motivation: Explore the potential of large language models for education in low-resource languages and establish a dedicated benchmark. Method: The authors build GRILE, a benchmark of 1,151 multiple-choice questions, and test multilingual and Romanian-specific LLMs on selecting the correct answer and producing linguistically accurate explanations. Result: Gemini 2.5 Pro reaches 83% accuracy, most open-weight models stay below 65%, and 48% of their explanations contain factual or pedagogical flaws. Conclusion: The study exposes trustworthiness challenges for educational NLP in low-resource languages and establishes GRILE as a new test-bed for controllable explanation generation and evaluation. Abstract: LLMs (Large language models) have revolutionized NLP (Natural Language Processing), yet their pedagogical value for low-resource languages remains unclear. We present GRILE (Grammar Romanian Inference and Language Explanations), the first open benchmark of 1,151 multiple-choice questions harvested from Romanian high-stakes exams (National Evaluation, Baccalaureate, university admissions). GRILE enables us to probe two complementary abilities of seven state-of-the-art multilingual and Romanian-specific LLMs: (i) selecting the correct answer, and (ii) producing linguistically accurate explanations. While Gemini 2.5 Pro reaches 83% accuracy, most open-weight models stay below 65%, and 48% of their explanations contain factual or pedagogical flaws according to expert review. A detailed error analysis pinpoints systematic weaknesses in morphology and in applying the latest DOOM3 orthographic norms. All data, code and a public web demo are released to catalyze future research. Our findings expose open challenges for trustworthy educational NLP in low-resource settings and establish GRILE as a new test-bed for controllable explanation generation and evaluation.

[15] Tokens with Meaning: A Hybrid Tokenization Approach for NLP

M. Ali Bayram,Ali Arda Fincan,Ahmet Semih Gümüş,Sercan Karakaş,Banu Diri,Savaş Yıldırım,Demircan Çelik

Main category: cs.CL

TL;DR: This paper introduces a hybrid tokenization framework that improves tokenization for morphologically rich languages like Turkish by combining rule-based analysis with statistical segmentation, achieving better performance and linguistic coherence compared to existing methods.

Details

Motivation: Subword tokenization methods like BPE and WordPiece struggle with morphologically rich and agglutinative languages because they rely on frequency rather than linguistic structure. This work aims to address this limitation. Method: The method integrates phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency. It assigns shared identifiers to phonologically variant affixes and altered root forms, and uses special tokens for whitespace and case. BPE is applied for out-of-vocabulary coverage. Result: On the TR-MMLU benchmark, the tokenizer achieved the highest Turkish Token Percentage (90.29%) and Pure Token Percentage (85.8%). Comparisons with LLaMA, Gemma, and GPT tokenizers showed more linguistically meaningful and coherent tokens. Conclusion: The hybrid tokenization framework improves tokenization for morphologically rich languages by combining rule-based morphological analysis with statistical subword segmentation, offering a language-independent approach adaptable to multilingual NLP systems. Abstract: Tokenization plays a pivotal role in natural language processing (NLP), shaping how text is segmented and interpreted by language models. While subword methods such as Byte Pair Encoding (BPE) and WordPiece have been effective, they often struggle with morphologically rich and agglutinative languages because they rely on frequency rather than linguistic structure. We introduce a hybrid tokenization framework that combines rule-based morphological analysis with statistical subword segmentation. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency. It assigns shared identifiers to phonologically variant affixes (e.g., -ler and -lar) and altered root forms (e.g., kitap vs. kitabı), reducing redundancy while maintaining semantic integrity. Special tokens are added for whitespace and case, including an UPPERCASE marker to avoid vocabulary inflation from capitalization. BPE is integrated for out-of-vocabulary coverage without harming morphological coherence. On the TR-MMLU benchmark, the tokenizer achieves the highest Turkish Token Percentage (90.29%) and Pure Token Percentage (85.8%). Comparisons with tokenizers from LLaMA, Gemma, and GPT show more linguistically meaningful and coherent tokens. Although demonstrated on Turkish, the approach is language-independent and adaptable to other languages, offering a practical path toward more interpretable and effective multilingual NLP systems.
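
The shared-identifier idea in this abstract can be illustrated with a small sketch. It maps the plural allomorphs -ler and -lar to one token ID, maps kitap and its altered form kitab to one root ID, and emits an UPPERCASE marker before capitalized words. The dictionaries, the ID values, the greedy longest-match strategy, and the per-character fallback (standing in for BPE) are all assumptions made for this example, not the authors' algorithm.

```python
# Toy dictionaries; every entry and ID below is hypothetical.
ROOTS = {"kitap": 100, "kitab": 100, "ev": 101}          # altered root forms share an ID
AFFIXES = {"ler": 200, "lar": 200, "ı": 201, "i": 201}   # allomorphs share an ID
UPPERCASE = 1          # marker token emitted before a capitalized word
CHAR_OFFSET = 1000     # per-character fallback, standing in for BPE


def tokenize_word(word):
    """Return token IDs for one word using greedy root and suffix matching."""
    ids = []
    if word[:1].isupper():
        ids.append(UPPERCASE)
        word = word.lower()
    # Pick the longest known root that prefixes the word.
    root = max((r for r in ROOTS if word.startswith(r)), key=len, default=None)
    if root is None:
        return ids + [CHAR_OFFSET + ord(c) for c in word]
    ids.append(ROOTS[root])
    rest = word[len(root):]
    while rest:
        # Strip the longest known affix; fall back to characters if none fits.
        affix = max((a for a in AFFIXES if rest.startswith(a)), key=len, default=None)
        if affix is None:
            ids.extend(CHAR_OFFSET + ord(c) for c in rest)
            break
        ids.append(AFFIXES[affix])
        rest = rest[len(affix):]
    return ids


print(tokenize_word("kitaplar"), tokenize_word("evler"), tokenize_word("Evler"))
```

In this sketch both plural forms end in the same ID (200), so the vocabulary does not grow with vowel harmony variants, and capitalization costs one marker token instead of a duplicate vocabulary entry.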

[16] A Joint Multitask Model for Morpho-Syntactic Parsing

Demian Inostroza,Mel Mistica,Ekaterina Vylomova,Chris Guest,Kemal Kurniawan

Main category: cs.CL

TL;DR: This paper proposes a joint multitask model for morpho-syntactic parsing that achieves strong performance across multiple languages.

Details

Motivation: Design a joint multitask model that predicts morpho-syntactic analyses. Method: A shared XLM-RoBERTa encoder with three specialized decoders for content word identification, dependency parsing, and morphosyntactic feature prediction. Result: The model achieves the best overall performance on the shared task leaderboard covering nine typologically diverse languages, with an average MSLAS of 78.7%, LAS of 80.1%, and Feats F1 of 90.3%. Conclusion: The model struggles with core grammatical cases and nominal features but achieves the best overall performance. Abstract: We present a joint multitask model for the UniDive 2025 Morpho-Syntactic Parsing shared task, where systems predict both morphological and syntactic analyses following a novel UD annotation scheme. Our system uses a shared XLM-RoBERTa encoder with three specialized decoders for content word identification, dependency parsing, and morphosyntactic feature prediction. Our model achieves the best overall performance on the shared task's leaderboard covering nine typologically diverse languages, with an average MSLAS score of 78.7 percent, LAS of 80.1 percent, and Feats F1 of 90.3 percent. Our ablation studies show that matching the task's gold tokenization and content word identification are crucial to model performance. Error analysis reveals that our model struggles with core grammatical cases (particularly Nom-Acc) and nominal features across languages.

[17] Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency

Aman Goel,Daniel Schwartz,Yanjun Qi

Main category: cs.CL

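TL;DR: Finch-Zk is a black-box framework that needs no external knowledge sources; it detects and mitigates hallucinations in LLM outputs through cross-model consistency checking and targeted mitigation.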

Details

Motivation: Large language models (LLMs) are prone to hallucinations, generating content that appears plausible but contains factual errors. Finch-Zk is proposed to improve the factual reliability of LLM-generated content. Method: Finch-Zk introduces two key innovations: 1) a cross-model consistency checking strategy that reveals fine-grained errors by comparing the responses of different models to semantically equivalent prompts, and 2) a targeted mitigation technique that applies precise corrections to problematic segments while preserving accurate content. Result: Experiments on the FELM dataset show that Finch-Zk improves hallucination detection F1 by 6-39% over existing methods; on GPQA-diamond, applying it to models such as Llama 4 Maverick and Claude 4 Sonnet improves answer accuracy by 7-8 percentage points. Conclusion: Finch-Zk offers a practical, deployable way to strengthen the factual reliability of content generated by production LLM systems. Abstract: Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, but they remain susceptible to hallucinations--generating content that appears plausible but contains factual inaccuracies. We present Finch-Zk, a black-box framework that leverages FINe-grained Cross-model consistency to detect and mitigate Hallucinations in LLM outputs without requiring external knowledge sources. Finch-Zk introduces two key innovations: 1) a cross-model consistency checking strategy that reveals fine-grained inaccuracies by comparing responses generated by diverse models from semantically-equivalent prompts, and 2) a targeted mitigation technique that applies precise corrections to problematic segments while preserving accurate content. Experiments on the FELM dataset show Finch-Zk improves hallucination detection F1 scores by 6-39% compared to existing approaches. For mitigation, Finch-Zk achieves 7-8 absolute percentage points improvement in answer accuracy on the GPQA-diamond dataset when applied to state-of-the-art models like Llama 4 Maverick and Claude 4 Sonnet. Extensive evaluation across multiple models demonstrates that Finch-Zk provides a practical, deployment-ready safeguard for enhancing factual reliability in production LLM systems.
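
A rough sketch of sentence-level cross-model consistency checking, in the spirit of the first innovation above, follows. It flags the sentences of a response that too few samples from other models support. The function names, the thresholds, and especially the use of word-overlap (Jaccard) similarity are assumptions for illustration; the abstract does not say how Finch-Zk compares segments, and a real system would likely use a far stronger semantic judge.

```python
import re


def sentences(text):
    """Split text into sentences on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def jaccard(a, b):
    """Word-level Jaccard similarity; a crude stand-in for a semantic judge."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def flag_inconsistent(response, cross_model_samples, threshold=0.8, min_support=0.5):
    """Return sentences of `response` supported by fewer than `min_support` of the samples."""
    if not cross_model_samples:
        raise ValueError("need at least one cross-model sample")
    flagged = []
    for sent in sentences(response):
        supported = 0
        for sample in cross_model_samples:
            # A sample supports the sentence if any of its sentences is close enough.
            best = max((jaccard(sent, s) for s in sentences(sample)), default=0.0)
            if best >= threshold:
                supported += 1
        if supported / len(cross_model_samples) < min_support:
            flagged.append(sent)
    return flagged


response = "Paris is the capital of France. The Eiffel Tower was built in 1850."
samples = [
    "The capital of France is Paris. The Eiffel Tower was completed in 1889.",
    "Paris is the capital of France. The Eiffel Tower opened in 1889.",
]
print(flag_inconsistent(response, samples))
```

Only the flagged sentences would then be passed to a correction step, which is what lets a targeted mitigation leave the consistent parts of the response untouched.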

[18] SurveyGen-I: Consistent Scientific Survey Generation with Evolving Plans and Memory-Guided Writing

Jing Chen,Zhiheng Yang,Yixian Shen,Jie Liu,Adam Belloum,Chrysa Papagainni,Paola Grosso

Main category: cs.CL

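TL;DR: SurveyGen-I is an automatic survey generation framework that combines retrieval, adaptive planning, and a memory mechanism, improving the quality of generated surveys and the consistency of their citation coverage.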

Details

Motivation: Existing LLM-based methods often struggle to stay coherent when generating long, multi-section surveys and fall short of comprehensive citation coverage. Method: SurveyGen-I first performs survey-level retrieval to build an initial outline and writing plan, then dynamically refines both during generation through a memory mechanism that stores previously written content and terminology, ensuring coherence across subsections. Result: Experiments across four scientific domains show that SurveyGen-I consistently outperforms previous methods in content quality, consistency, and citation coverage. Conclusion: SurveyGen-I is an automatic survey generation framework that combines coarse-to-fine retrieval, adaptive planning, and memory-guided generation, addressing the limitations of existing LLM-based methods in maintaining coherence across multi-section surveys and providing comprehensive citation coverage. Abstract: Survey papers play a critical role in scientific communication by consolidating progress across a field. Recent advances in Large Language Models (LLMs) offer a promising solution by automating key steps in the survey-generation pipeline, such as retrieval, structuring, and summarization. However, existing LLM-based approaches often struggle with maintaining coherence across long, multi-section surveys and providing comprehensive citation coverage. To address these limitations, we introduce SurveyGen-I, an automatic survey generation framework that combines coarse-to-fine retrieval, adaptive planning, and memory-guided generation. SurveyGen-I first performs survey-level retrieval to construct the initial outline and writing plan, and then dynamically refines both during generation through a memory mechanism that stores previously written content and terminology, ensuring coherence across subsections. When the system detects insufficient context, it triggers fine-grained subsection-level retrieval. During generation, SurveyGen-I leverages this memory mechanism to maintain coherence across subsections. Experiments across four scientific domains demonstrate that SurveyGen-I consistently outperforms previous works in content quality, consistency, and citation coverage.

[19] Beyond Semantic Similarity: Reducing Unnecessary API Calls via Behavior-Aligned Retriever

Yixin Chen,Ying Xiong,Shangyu Wu,Yufei Cui,Xue Liu,Nan Guan,Chun Jason Xue

Main category: cs.CL

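TL;DR: This paper proposes a behavior-aligned retriever (BAR) that helps large language models make more accurate tool calls, reducing erroneous calls and improving efficiency.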

Details

Motivation: Existing approaches to tool-augmented LLMs, which fine-tune the LLM or use demonstration-based prompting, incur high training overhead and fail to account for inconsistent demonstration samples, which misguide the model's invocation behavior. Method: Build a corpus covering different function-calling behaviors (calling or non-calling) and train BAR in a contrastive learning framework with customized positive/negative pairs and a dual-negative contrastive loss, ensuring robust retrieval of behaviorally consistent examples. Result: Experiments show that the approach significantly reduces erroneous function calls while maintaining high task performance. Conclusion: The paper proposes a behavior-aligned retriever (BAR) that supplies behaviorally consistent demonstrations to help LLMs make more accurate tool-calling decisions, significantly reducing erroneous function calls while maintaining high task performance and offering a cost-effective, efficient solution for tool-augmented LLMs. Abstract: Tool-augmented large language models (LLMs) leverage external functions to extend their capabilities, but inaccurate function calls can lead to inefficiencies and increased costs. Existing methods address this challenge by fine-tuning LLMs or using demonstration-based prompting, yet they often suffer from high training overhead and fail to account for inconsistent demonstration samples, which misguide the model's invocation behavior. In this paper, we train a behavior-aligned retriever (BAR), which provides behaviorally consistent demonstrations to help LLMs make more accurate tool-using decisions. To train the BAR, we construct a corpus including different function-calling behaviors, i.e., calling or non-calling. We use the contrastive learning framework to train the BAR with customized positive/negative pairs and a dual-negative contrastive loss, ensuring robust retrieval of behaviorally consistent examples. Experiments demonstrate that our approach significantly reduces erroneous function calls while maintaining high task performance, offering a cost-effective and efficient solution for tool-augmented LLMs.

[20] ISCA: A Framework for Interview-Style Conversational Agents

Charles Welch,Allison Lahnala,Vasudha Varadarajan,Lucie Flek,Rada Mihalcea,J. Lomax Boyd,João Sedoc

Main category: cs.CL

TL;DR: This paper introduces an accessible, low-compute system for creating conversational agents to facilitate controlled qualitative and quantitative data collection, demonstrated through two case studies and offered as open-source for further development.

Details

Motivation: The motivation behind this research is to provide a controlled and standardized conversational flow for qualitative data collection and quantitative analysis, particularly useful in tracking attitude formation or behavior change. Method: The method involves developing a low-compute non-generative system for conversational agents, which can be adjusted using an online administrative panel without requiring coding knowledge. The system was tested through two case studies. Result: The result is a flexible, open-source system that allows the creation of interview-style conversational agents via an online interface, demonstrated through two case studies involving COVID-19 and public opinion on neurotechnology. Conclusion: The paper concludes that the presented low-compute non-generative system for conversational agents is an effective and accessible tool for facilitating qualitative data collection and quantitative analysis through controlled interactions. Abstract: We present a low-compute non-generative system for implementing interview-style conversational agents which can be used to facilitate qualitative data collection through controlled interactions and quantitative analysis. Use cases include applications to tracking attitude formation or behavior change, where control or standardization over the conversational flow is desired. We show how our system can be easily adjusted through an online administrative panel to create new interviews, making the tool accessible without coding. Two case studies are presented as example applications, one regarding the Expressive Interviewing system for COVID-19 and the other a semi-structured interview to survey public opinion on emerging neurotechnology. Our code is open-source, allowing others to build off of our work and develop extensions for additional functionality.

Wenhan Dong,Zhen Sun,Yuemeng Zhao,Zifan Peng,Jun Wu,Jingyi Zheng,Yule Liu,Xinlei He,Yu Wang,Ruiming Wang,Xinyi Huang,Lei Mo

Main category: cs.CL

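TL;DR: This paper introduces ZPD-SCA, a new benchmark for evaluating how well LLMs assess the difficulty of Chinese reading comprehension material, and reveals both their potential and their limitations in educational cognitive alignment.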

Details

Motivation: The potential of large language models (LLMs) in educational applications remains underexplored, especially for assessing the cognitive alignment of reading materials with students' developmental stages. Method: Introduce ZPD-SCA, a novel benchmark for assessing stage-level Chinese reading comprehension difficulty, annotated by 60 Special Grade teachers, and use it to test LLMs. Result: LLMs perform poorly in zero-shot scenarios but improve substantially when given in-context examples. Even the best models show systematic biases, indicating that aligning material difficulty with students' cognitive abilities remains difficult. Conclusion: ZPD-SCA provides a foundation for evaluating LLMs in cognitively aligned educational applications and reveals both the potential and the limitations of current LLMs in accurately assessing the difficulty of reading materials. Abstract: Large language models (LLMs) have demonstrated potential in educational applications, yet their capacity to accurately assess the cognitive alignment of reading materials with students' developmental stages remains insufficiently explored. This gap is particularly critical given the foundational educational principle of the Zone of Proximal Development (ZPD), which emphasizes the need to match learning resources with Students' Cognitive Abilities (SCA). Despite the importance of this alignment, there is a notable absence of comprehensive studies investigating LLMs' ability to evaluate reading comprehension difficulty across different student age groups, especially in the context of Chinese language education. To fill this gap, we introduce ZPD-SCA, a novel benchmark specifically designed to assess stage-level Chinese reading comprehension difficulty. The benchmark is annotated by 60 Special Grade teachers, a group that represents the top 0.15% of all in-service teachers nationwide. Experimental results reveal that LLMs perform poorly in zero-shot learning scenarios, with Qwen-max and GLM even falling below the probability of random guessing. When provided with in-context examples, LLM performance improves substantially, with some models achieving nearly double the accuracy of their zero-shot baselines. These results reveal that LLMs possess emerging abilities to assess reading difficulty, while also exposing limitations in their current training for educationally aligned judgment. Notably, even the best-performing models display systematic directional biases, suggesting difficulties in accurately aligning material difficulty with SCA. Furthermore, significant variations in model performance across different genres underscore the complexity of the task. We envision that ZPD-SCA can provide a foundation for evaluating and improving LLMs in cognitively aligned educational applications.

[22] Credence Calibration Game? Calibrating Large Language Models through Structured Play

Ke Fang,Tianyi Zhao,Lu Cheng

Main category: cs.CL

TL;DR: This paper introduces a novel prompt-based calibration framework inspired by the Credence Calibration Game, which dynamically improves model calibration through feedback-driven prompting and performance summaries, without requiring additional supervision or parameter updates.

Details

Motivation: The motivation stems from the need to ensure that Large Language Models' (LLMs) confidence estimates align with their actual correctness, especially as they are increasingly used in decision-critical domains. Method: The method involves a prompt-based calibration framework inspired by the Credence Calibration Game, which creates a structured interaction loop where LLMs receive feedback on their confidence estimates compared to actual correctness. Result: Extensive experiments showed consistent improvements in evaluation metrics across models and game configurations, demonstrating the effectiveness of the proposed calibration framework. Conclusion: The proposed game-based prompting framework effectively improves LLM calibration without requiring additional supervision or parameter updates. Abstract: As Large Language Models (LLMs) are increasingly deployed in decision-critical domains, it becomes essential to ensure that their confidence estimates faithfully correspond to their actual correctness. Existing calibration methods have primarily focused on post-hoc adjustments or auxiliary model training; however, many of these approaches necessitate additional supervision or parameter updates. In this work, we propose a novel prompt-based calibration framework inspired by the Credence Calibration Game. Our method establishes a structured interaction loop wherein LLMs receive feedback based on the alignment of their predicted confidence with correctness. Through feedback-driven prompting and natural language summaries of prior performance, our framework dynamically improves model calibration. Extensive experiments across models and game configurations demonstrate consistent improvements in evaluation metrics. Our results highlight the potential of game-based prompting as an effective strategy for LLM calibration. Code and data are available at https://anonymous.4open.science/r/LLM-Calibration/.
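
To make the feedback loop described in this abstract concrete, here is a minimal sketch of scoring a model's stated confidence and turning past rounds into a natural-language summary for the next prompt. The base-2 logarithmic scoring rule, the function names, and the wording of the summary are assumptions for illustration; the abstract does not specify the scoring rule or prompt format the authors use.

```python
import math


def log_score(confidence, correct):
    """Score one answer with a base-2 logarithmic rule.

    A 50% guess scores 0; confident correct answers gain a little,
    confident wrong answers lose a lot.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    p = confidence if correct else 1.0 - confidence
    return 100.0 * math.log2(2.0 * p)


def feedback_summary(history):
    """Summarize prior (confidence, correct) rounds as text for the next prompt."""
    total = sum(log_score(c, ok) for c, ok in history)
    accuracy = sum(ok for _, ok in history) / len(history)
    mean_conf = sum(c for c, _ in history) / len(history)
    trend = "overconfident" if mean_conf > accuracy else "underconfident or calibrated"
    return (f"Rounds: {len(history)}, score: {total:.0f}, accuracy: {accuracy:.2f}, "
            f"mean confidence: {mean_conf:.2f}. You appear {trend}.")


# Hypothetical history: (stated confidence, whether the answer was correct).
history = [(0.9, True), (0.9, False), (0.75, True)]
print(feedback_summary(history))
```

The asymmetry of the logarithmic rule is the point of the game: a wrong answer at 90% confidence costs far more than a right one earns, so the cheapest way to raise the score is to report honest confidence. No parameters are updated; only the prompt changes between rounds.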

Yupei Yang,Fan Feng,Lin Yang,Wanxi Deng,Lin Qu,Biwei Huang,Shikui Tu,Lei Xu

Main category: cs.CL

TL;DR: DEPTH is a novel relation extraction framework that improves accuracy and reduces hallucinations by integrating sentence simplification, hierarchical refinement, and causality-driven reward models.

Details

Motivation: Existing relation extraction methods focusing on relation classification struggle with reliably determining whether a relation exists, especially in complex cases, leading to spurious predictions and noisy edges in knowledge graphs. Method: DEPTH integrates Dependency-aware sentence simplification and Two-tiered Hierarchical refinement into the relation extraction pipeline, combined with a causality-driven reward model for robust fine-tuning. Result: Experiments on six benchmarks show DEPTH reduces the average hallucination rate to 7.0% while achieving a 17.2% improvement in average F1 score over state-of-the-art baselines. Conclusion: DEPTH demonstrates significant improvements in reducing hallucinations and enhancing relation extraction accuracy, making it a promising framework for structured knowledge construction. Abstract: Relation extraction enables the construction of structured knowledge for many downstream applications. While large language models (LLMs) have shown great promise in this domain, most existing methods concentrate on relation classification, which predicts the semantic relation type between a related entity pair. However, we observe that LLMs often struggle to reliably determine whether a relation exists, especially in cases involving complex sentence structures or intricate semantics, which leads to spurious predictions. Such hallucinations can introduce noisy edges in knowledge graphs, compromising the integrity of structured knowledge and downstream reliability. To address these challenges, we propose DEPTH, a framework that integrates Dependency-aware sEntence simPlification and Two-tiered Hierarchical refinement into the relation extraction pipeline. 
Given a sentence and its candidate entity pairs, DEPTH operates in two stages: (1) the Grounding module extracts relations for each pair by leveraging their shortest dependency path, distilling the sentence into a minimal yet coherent relational context that reduces syntactic noise while preserving key semantics; (2) the Refinement module aggregates all local predictions and revises them based on a holistic understanding of the sentence, correcting omissions and inconsistencies. We further introduce a causality-driven reward model that mitigates reward hacking by disentangling spurious correlations, enabling robust fine-tuning via reinforcement learning with human feedback. Experiments on six benchmarks demonstrate that DEPTH reduces the average hallucination rate to 7.0% while achieving a 17.2% improvement in average F1 score over state-of-the-art baselines.
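
The shortest dependency path used by the Grounding module can be sketched as a breadth-first search over the dependency tree treated as an undirected graph. The parse below is hand-written for illustration and nodes are plain word strings, which assumes each word occurs once in the sentence; a real pipeline would take edges from a dependency parser and index tokens by position. The function name is hypothetical, and how DEPTH turns the path into a simplified sentence is not covered here.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Return the token path between two words in an undirected dependency graph."""
    graph = {}
    for head, dependent in edges:
        graph.setdefault(head, set()).add(dependent)
        graph.setdefault(dependent, set()).add(head)
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the parent links to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # the two words are not connected


# Hand-written (head, dependent) edges for:
# "Marie Curie, who was born in Warsaw, discovered polonium."
edges = [
    ("discovered", "Curie"), ("Curie", "Marie"), ("Curie", "born"),
    ("born", "who"), ("born", "was"), ("born", "Warsaw"),
    ("Warsaw", "in"), ("discovered", "polonium"),
]
print(shortest_dependency_path(edges, "Curie", "polonium"))
print(shortest_dependency_path(edges, "Curie", "Warsaw"))
```

For the pair (Curie, polonium) the path skips the relative clause entirely, which is the kind of syntactic noise reduction the abstract describes.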

[24] Cognitive Surgery: The Awakening of Implicit Territorial Awareness in LLMs

Yinghan Zhou,Weifeng Zhu,Juan Wen,Wanli Peng,Zhengxian Wu,Yiming Xue

Main category: cs.CL

TL;DR: This paper identifies and addresses the issue of poor self-recognition in large language models under the Individual Presentation Paradigm (IPP) by introducing a new framework called Cognitive Surgery (CoSur), which successfully enhances model performance in identifying self-generated text.

Details

Motivation: The motivation of this study is to understand why large language models struggle to identify self-generated text in the Individual Presentation Paradigm (IPP), despite performing well under the Pair Presentation Paradigm (PPP), and to develop a method to enhance their performance in IPP. Method: The authors propose a novel framework called Cognitive Surgery (CoSur), which includes four modules: representation extraction, territory construction, authorship discrimination, and cognitive editing. This framework aims to awaken the model's latent ability, known as Implicit Territorial Awareness (ITA), to distinguish between self- and other-generated texts. Result: The proposed Cognitive Surgery (CoSur) framework significantly improved the performance of three different large language models in the IPP scenario, achieving average accuracies of 83.25%, 66.19%, and 88.01%. Conclusion: The study concludes that through the Cognitive Surgery (CoSur) framework, the Implicit Territorial Awareness (ITA) in large language models can be effectively awakened, significantly improving their ability to distinguish self-generated from other-generated text in the Individual Presentation Paradigm. Abstract: Large language models (LLMs) have been shown to possess a degree of self-recognition capability-the ability to identify whether a given text was generated by themselves. Prior work has demonstrated that this capability is reliably expressed under the Pair Presentation Paradigm (PPP), where the model is presented with two texts and asked to choose which one it authored. However, performance deteriorates sharply under the Individual Presentation Paradigm (IPP), where the model is given a single text to judge authorship. Although this phenomenon has been observed, its underlying causes have not been systematically analyzed. In this paper, we first replicate existing findings to confirm that LLMs struggle to distinguish self- from other-generated text under IPP. 
We then investigate the reasons for this failure and attribute it to a phenomenon we term Implicit Territorial Awareness (ITA)-the model's latent ability to distinguish self- and other-texts in representational space, which remains unexpressed in its output behavior. To awaken the ITA of LLMs, we propose Cognitive Surgery (CoSur), a novel framework comprising four main modules: representation extraction, territory construction, authorship discrimination and cognitive editing. Experimental results demonstrate that our proposed method improves the performance of three different LLMs in the IPP scenario, achieving average accuracies of 83.25%, 66.19%, and 88.01%, respectively.

[25] Knowledge Graph-Infused Fine-Tuning for Structured Reasoning in Large Language Models

Wuyang Zhang,Yexin Tian,Xiandong Meng,Mengjie Wang,Junliang Du

Main category: cs.CL

TL;DR: The paper proposes a structure-aware fine-tuning framework that enhances the ability of large language models to represent complex semantic units by integrating knowledge graphs, resulting in better performance in tasks involving structural reasoning and entity extraction.

Details

Motivation: The paper aims to address the issues of missing reasoning chains and insufficient entity-level semantic understanding in large language models when handling tasks that require structured knowledge. Method: The paper proposes a fine-tuning algorithm framework based on knowledge graph injection. It introduces structured graph information for auxiliary learning using a graph neural network to encode entities and their relations. A fusion mechanism and a gating mechanism are designed to integrate knowledge graph embeddings with contextual representations from the language model, while a joint loss function is used to optimize task performance and structural alignment. Result: Experimental findings show that the proposed method improves the accuracy of entity prediction and semantic reasoning, validating its effectiveness and stability across tasks like entity recognition, question answering, and language generation. Conclusion: The proposed structure-aware fine-tuning framework significantly enhances the model's ability to represent complex semantic units and demonstrates better semantic consistency and contextual logic modeling involving structural reasoning and entity extraction. Abstract: This paper addresses the problems of missing reasoning chains and insufficient entity-level semantic understanding in large language models when dealing with tasks that require structured knowledge. It proposes a fine-tuning algorithm framework based on knowledge graph injection. The method builds on pretrained language models and introduces structured graph information for auxiliary learning. A graph neural network is used to encode entities and their relations, constructing a graph-based semantic representation. A fusion mechanism is then designed to jointly model the knowledge graph embeddings with the contextual representations from the language model. 
To enhance the robustness of knowledge integration, a gating mechanism is introduced to dynamically balance the contributions of linguistic semantics and structural knowledge. This effectively mitigates conflicts between different representational spaces. During training, a joint loss function is constructed to account for both task performance and structural alignment objectives. This helps improve the accuracy of entity prediction and semantic reasoning. The study also includes a series of systematic sensitivity experiments. It evaluates the effects of learning rate, graph coverage, and structural perturbations on model performance. The results further validate the effectiveness and stability of the proposed method across tasks such as entity recognition, question answering, and language generation. Experimental findings show that the proposed structure-aware fine-tuning framework significantly enhances the model's ability to represent complex semantic units. It demonstrates better semantic consistency and contextual logic modeling in scenarios involving structural reasoning and entity extraction.
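
A minimal sketch of a gating mechanism of the kind described above: a sigmoid gate decides, per dimension, how much of the language-model representation and how much of the knowledge-graph embedding enters the fused vector. The per-dimension (diagonal) gate weights, the function names, and the convex-combination form are assumptions chosen to keep the example small; the abstract does not give the paper's exact formulation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(h_text, h_graph, w_text, w_graph, bias):
    """Blend two same-sized vectors with a per-dimension sigmoid gate.

    gate = sigmoid(w_text * h_text + w_graph * h_graph + bias)
    fused = gate * h_text + (1 - gate) * h_graph
    """
    fused = []
    for h, k, wt, wg, b in zip(h_text, h_graph, w_text, w_graph, bias):
        gate = sigmoid(wt * h + wg * k + b)
        fused.append(gate * h + (1.0 - gate) * k)
    return fused


h_text = [1.0, 2.0]    # hypothetical contextual representation
h_graph = [3.0, 4.0]   # hypothetical knowledge-graph embedding
zeros = [0.0, 0.0]

# With zero weights and bias the gate is 0.5, so the result is the plain average.
print(gated_fusion(h_text, h_graph, zeros, zeros, zeros))
```

In training, the gate parameters would be learned together with the joint loss, so the model can lean on the graph embedding where it is informative and on the text representation elsewhere.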

[26] NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

NVIDIA,:,Aarti Basant,Abhijit Khairnar,Abhijit Paithankar,Abhinav Khattar,Adi Renduchintala,Adithya Renduchintala,Aditya Malte,Akhiad Bercovich,Akshay Hazare,Alejandra Rico,Aleksander Ficek,Alex Kondratenko,Alex Shaposhnikov,Ali Taghibakhshi,Amelia Barton,Ameya Sunil Mahabaleshwarkar,Amy Shen,Andrew Tao,Ann Guan,Anna Shors,Anubhav Mandarwal,Arham Mehta,Arun Venkatesan,Ashton Sharabiani,Ashwath Aithal,Ashwin Poojary,Ayush Dattagupta,Balaram Buddharaju,Banghua Zhu,Barnaby Simkin,Bilal Kartal,Bita Darvish Rouhani,Bobby Chen,Boris Ginsburg,Brandon Norick,Brian Yu,Bryan Catanzaro,Charles Wang,Charlie Truong,Chetan Mungekar,Chintan Patel,Chris Alexiuk,Christian Munley,Christopher Parisien,Dan Su,Daniel Afrimi,Daniel Korzekwa,Daniel Rohrer,Daria Gitman,David Mosallanezhad,Deepak Narayanan,Dima Rekesh,Dina Yared,Dmytro Pykhtar,Dong Ahn,Duncan Riach,Eileen Long,Elliott Ning,Eric Chung,Erick Galinkin,Evelina Bakhturina,Gargi Prasad,Gerald Shen,Haim Elisha,Harsh Sharma,Hayley Ross,Helen Ngo,Herman Sahota,Hexin Wang,Hoo Chang Shin,Hua Huang,Iain Cunningham,Igor Gitman,Ivan Moshkov,Jaehun Jung,Jan Kautz,Jane Polak Scowcroft,Jared Casper,Jimmy Zhang,Jinze Xue,Jocelyn Huang,Joey Conway,John Kamalu,Jonathan Cohen,Joseph Jennings,Julien Veron Vialard,Junkeun Yi,Jupinder Parmar,Kari Briski,Katherine Cheung,Katherine Luna,Keith Wyss,Keshav Santhanam,Kezhi Kong,Krzysztof Pawelec,Kumar Anik,Kunlun Li,Kushan Ahmadian,Lawrence McAfee,Laya Sleiman,Leon Derczynski,Luis Vega,Maer Rodrigues de Melo,Makesh Narsimhan Sreedhar,Marcin Chochowski,Mark Cai,Markus Kliegl,Marta Stepniewska-Dziubinska,Matvei Novikov,Mehrzad Samadi,Meredith Price,Meriem Boubdir,Michael Boone,Michael Evans,Michal Bien,Michal Zawalski,Miguel Martinez,Mike Chrzanowski,Mohammad Shoeybi,Mostofa Patwary,Namit Dhameja,Nave Assaf,Negar Habibi,Nidhi Bhatia,Nikki Pope,Nima Tajbakhsh,Nirmal Kumar Juluru,Oleg Rybakov,Oleksii Hrinchuk,Oleksii Kuchaiev,Oluwatobi Olabiyi,Pablo Ribalta,Padmavathy Subramanian,Parth Chadha,Pavlo 
Molchanov,Peter Dykas,Peter Jin,Piotr Bialecki,Piotr Januszewski,Pradeep Thalasta,Prashant Gaikwad,Prasoon Varshney,Pritam Gundecha,Przemek Tredak,Rabeeh Karimi Mahabadi,Rajen Patel,Ran El-Yaniv,Ranjit Rajan,Ria Cheruvu,Rima Shahbazyan,Ritika Borkar,Ritu Gala,Roger Waleffe,Ruoxi Zhang,Russell J. Hewett,Ryan Prenger,Sahil Jain,Samuel Kriman,Sanjeev Satheesh,Saori Kaji,Sarah Yurick,Saurav Muralidharan,Sean Narenthiran,Seonmyeong Bak,Sepehr Sameni,Seungju Han,Shanmugam Ramasamy,Shaona Ghosh,Sharath Turuvekere Sreenivas,Shelby Thomas,Shizhe Diao,Shreya Gopal,Shrimai Prabhumoye,Shubham Toshniwal,Shuoyang Ding,Siddharth Singh,Siddhartha Jain,Somshubra Majumdar,Stefania Alborghetti,Syeda Nahida Akter,Terry Kong,Tim Moon,Tomasz Hliwiak,Tomer Asida,Tony Wang,Twinkle Vashishth,Tyler Poon,Udi Karpas,Vahid Noroozi,Venkat Srinivasan,Vijay Korthikanti,Vikram Fugro,Vineeth Kalluru,Vitaly Kurin,Vitaly Lavrukhin,Wasi Uddin Ahmad,Wei Du,Wonmin Byeon,Ximing Lu,Xin Dong,Yashaswi Karnati,Yejin Choi,Yian Zhang,Ying Lin,Yonggan Fu,Yoshi Suhara,Zhen Dong,Zhiyu Li,Zhongbo Zhu,Zijia Chen

Main category: cs.CL

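TL;DR: This paper introduces Nemotron-Nano-9B-v2, a new hybrid Mamba-Transformer language model that increases inference throughput while achieving better accuracy than similarly sized models.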

Details

Motivation: Increase throughput for reasoning workloads while achieving better accuracy than similarly sized models. Method: The model builds on the Nemotron-H architecture, in which most self-attention layers of the Transformer are replaced with Mamba-2 layers. A 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) is first pre-trained on 20 trillion tokens using an FP8 training recipe. The Minitron strategy is then applied to compress and distill the model so that it can run inference on a single NVIDIA A10G GPU. Result: Compared to existing similarly sized models (e.g., Qwen3-8B), Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks and up to 6x higher inference throughput in reasoning settings such as 8k input and 16k output tokens. Conclusion: Nemotron-Nano-9B-v2 is a hybrid Mamba-Transformer language model that increases throughput on reasoning workloads and achieves state-of-the-art accuracy compared to similarly sized models. Abstract: We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.

[27] In2x at WMT25 Translation Task

Lei Pang,Hanyi Mao,Quanjia Xiao,HaiXiao Liu,Xiangyi Li

Main category: cs.CL

TL;DR: The In2x research team submitted an open-system approach to the WMT25 General Machine Translation Shared Task, focusing on extending large language models to Japanese and other less commonly spoken languages through data construction methods and reward model design.

Details

Motivation: The motivation behind this research is to enable large language model systems to achieve exceptional performance in low-resource or less commonly spoken languages, thereby increasing the accessibility and usability of these models across a broader range of languages. Method: The submission employs a paradigm that includes data construction methods and reward model design to extend the capabilities of large language models to other languages. Result: The specific results of the submission are not detailed in the abstract, but the focus is on the exploration and development of a generalizable paradigm for extending LLMs to other languages. Conclusion: The In2x research team's open-system submission for the WMT25 General Machine Translation Shared Task successfully explores a generalizable paradigm for extending large language models (LLMs) to other languages, particularly focusing on Japanese-related translation tasks. Abstract: This paper presents the open-system submission by the In2x research team for the WMT25 General Machine Translation Shared Task. Our submission focuses on Japanese-related translation tasks, aiming to explore a generalizable paradigm for extending large language models (LLMs) to other languages. This paradigm encompasses aspects such as data construction methods and reward model design. The ultimate goal is to enable large language model systems to achieve exceptional performance in low-resource or less commonly spoken languages.

[28] Reasoning is about giving reasons

Krunal Shah,Dan Roth

Main category: cs.CL

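TL;DR: This paper proposes an intermediate representation, the Representation of the Logical Structure (RLS), for identifying the logical structure of natural language arguments, thereby strengthening reasoning and explanation generation.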

Details

Motivation: Current rule-chaining approaches fall short in interpretability and in their ability to extend to theoretically equivalent reasoning tasks. Method: The paper proposes an intermediate representation (RLS) for identifying and extracting the logical structure of natural language arguments, and demonstrates high accuracy on three popular reasoning datasets. Result: On three popular reasoning datasets, the method identifies and extracts the logical structure of natural language arguments with high accuracy, significantly extending reasoning capabilities and supporting explanation generation. Conclusion: Identifying an intermediate representation (RLS) can significantly strengthen reasoning and explanation generation for natural language arguments, supporting various forms of reasoning that depend on an argument's logical structure. Abstract: Convincing someone of the truth value of a premise requires understanding and articulating the core logical structure of the argument which proves or disproves the premise. Understanding the logical structure of an argument refers to understanding the underlying "reasons" which make up the proof or disproof of the premise - as a function of the "logical atoms" in the argument. While it has been shown that transformers can "chain" rules to derive simple arguments, the challenge of articulating the "reasons" remains. Not only do current approaches to chaining rules suffer in terms of their interpretability, they are also quite constrained in their ability to accommodate extensions to theoretically equivalent reasoning tasks - a model trained to chain rules cannot support abduction or identify contradictions. In this work we suggest addressing these shortcomings by identifying an intermediate representation (which we call the Representation of the Logical Structure (RLS) of the argument) that possesses an understanding of the logical structure of a natural language argument - the logical atoms in the argument and the rules incorporating them. Given the logical structure, reasoning is deterministic and easy to compute. Therefore, our approach supports all forms of reasoning that depend on the logical structure of the natural language argument, including arbitrary depths of reasoning, on-the-fly mistake rectification and interactive discussion with respect to an argument. 
We show that we can identify and extract the logical structure of natural language arguments in three popular reasoning datasets with high accuracies, thus supporting explanation generation and extending the reasoning capabilities significantly.

[29] EmoTale: An Enacted Speech-emotion Dataset in Danish

Maja J. Hjuler,Harald V. Skat-Rørdam,Line H. Clemmensen,Sneha Das

Main category: cs.CL

TL;DR: This paper introduces EmoTale, a Danish and English emotional speech dataset, validated through competitive SER model performance using SSLM embeddings.

Details

Motivation: There is a lack of functional emotional speech datasets for smaller languages like Danish, and DES is currently the only available Danish emotional speech database. Method: Development of SER models using SSLM embeddings and openSMILE feature extractor; leave-one-speaker-out cross-validation was applied. Result: The best SER model achieved a UAR of 64.1% on EmoTale, showing its validity and predictive power. Conclusion: The EmoTale dataset demonstrates comparable performance to DES, with SSLM embeddings outperforming hand-crafted features in SER models. Abstract: While multiple emotional speech corpora exist for commonly spoken languages, there is a lack of functional datasets for smaller (spoken) languages, such as Danish. To our knowledge, Danish Emotional Speech (DES), published in 1997, is the only other database of Danish emotional speech. We present EmoTale; a corpus comprising Danish and English speech recordings with their associated enacted emotion annotations. We demonstrate the validity of the dataset by investigating and presenting its predictive power using speech emotion recognition (SER) models. We develop SER models for EmoTale and the reference datasets using self-supervised speech model (SSLM) embeddings and the openSMILE feature extractor. We find the embeddings superior to the hand-crafted features. The best model achieves an unweighted average recall (UAR) of 64.1% on the EmoTale corpus using leave-one-speaker-out cross-validation, comparable to the performance on DES.
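
The evaluation protocol named in this abstract (unweighted average recall under leave-one-speaker-out cross-validation) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, speaker IDs, and emotion labels are hypothetical.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally regardless of size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def leave_one_speaker_out(speakers):
    """Yield (held_out_speaker, train_indices, test_indices), one fold per speaker."""
    for held_out in sorted(set(speakers)):
        train = [i for i, s in enumerate(speakers) if s != held_out]
        test = [i for i, s in enumerate(speakers) if s == held_out]
        yield held_out, train, test


# Hypothetical labels: "angry" recall is 3/4, "sad" recall is 1/2.
y_true = ["angry", "angry", "angry", "angry", "sad", "sad"]
y_pred = ["angry", "angry", "angry", "sad", "sad", "angry"]
uar = unweighted_average_recall(y_true, y_pred)

# Hypothetical speaker IDs for four recordings.
folds = list(leave_one_speaker_out(["a", "b", "a", "c"]))
```

On the toy labels the UAR is 0.625 while plain accuracy is 4/6, which shows why UAR is preferred when emotion classes are imbalanced. Holding out one speaker per fold keeps a speaker's recordings from appearing in both training and test data.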

[30] Towards Skeletal and Signer Noise Reduction in Sign Language Production via Quaternion-Based Pose Encoding and Contrastive Learning

Guilhem Fauré,Mostafa Sadeghi,Sam Bigeard,Slim Ouni

Main category: cs.CL

TL;DR: This paper introduces improvements to the Progressive Transformers architecture for neural sign language production, focusing on pose encoding and semantic structuring, which result in significant performance improvements on the Phoenix14T dataset.

Details

Motivation: The high intra-class variability of signs due to signer morphology and stylistic variety in training data presents a challenge for neural sign language production. Method: Two enhancements were proposed to the Progressive Transformers architecture: encoding poses using bone rotations in quaternion space with a geodesic loss, and introducing a contrastive loss to structure decoder embeddings by semantic similarity. Result: On the Phoenix14T dataset, the contrastive loss improved Probability of Correct Keypoint by 16%, and combining it with quaternion-based pose encoding reduced Mean Bone Angle Error by 6%. Conclusion: Incorporating skeletal structure modeling and semantically guided contrastive objectives improves sign pose representations in Transformer-based SLP models. Abstract: One of the main challenges in neural sign language production (SLP) lies in the high intra-class variability of signs, arising from signer morphology and stylistic variety in the training data. To improve robustness to such variations, we propose two enhancements to the standard Progressive Transformers (PT) architecture (Saunders et al., 2020). First, we encode poses using bone rotations in quaternion space and train with a geodesic loss to improve the accuracy and clarity of angular joint movements. Second, we introduce a contrastive loss to structure decoder embeddings by semantic similarity, using either gloss overlap or SBERT-based sentence similarity, aiming to filter out anatomical and stylistic features that do not convey relevant semantic information. On the Phoenix14T dataset, the contrastive loss alone yields a 16% improvement in Probability of Correct Keypoint over the PT baseline. When combined with quaternion-based pose encoding, the model achieves a 6% reduction in Mean Bone Angle Error. 
These results point to the benefit of incorporating skeletal structure modeling and semantically guided contrastive objectives on sign pose representations into the training of Transformer-based SLP models.
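
The abstract does not spell out the geodesic loss, so the sketch below shows only the standard geodesic angle between two unit quaternions, which is one natural choice for comparing bone rotations. The function names and example rotations are assumptions made for illustration.

```python
import math


def normalize(q):
    """Scale a quaternion (w, x, y, z) to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)


def geodesic_angle(q1, q2):
    """Rotation angle in radians between the orientations encoded by q1 and q2.

    The absolute value of the dot product treats q and -q as the same rotation.
    """
    q1, q2 = normalize(q1), normalize(q2)
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return 2.0 * math.acos(dot)


identity = (1.0, 0.0, 0.0, 0.0)
# 90-degree rotation about the z axis.
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
angle = geodesic_angle(identity, quarter_turn_z)
```

A training loss could average this angle over all bones and frames. Unlike a Euclidean distance on joint coordinates, it measures angular error directly and does not depend on bone length.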

[31] Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek

Mukhammadsaid Mamasaidov,Azizullah Aral,Abror Shopulatov,Mironshoh Inomjonov

Main category: cs.CL

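TL;DR: This paper addresses the shortage of natural language processing resources for Southern Uzbek and introduces new resources, including a dev set, a dataset, a model, and a post-processing method, to support future research on Southern Uzbek and other low-resource languages.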

Details

Motivation: Southern Uzbek (uzs) is underrepresented in natural language processing even though it has around 5 million speakers and differs significantly from Northern Uzbek (uzn) in phonology, lexicon, and orthography. Method: The paper proposes a post-processing method for restoring Arabic-script half-space characters, and builds machine translation resources from a FLORES+ dev set, a parallel sentence dataset, and a fine-tuned NLLB-200 model. Result: The resources comprise a 997-sentence FLORES+ dev set, 39,994 parallel sentences from dictionary, literary, and web sources, a fine-tuned NLLB-200 model (lutfiy), and a post-processing method that improves handling of morphological boundaries. Conclusion: Publicly releasing all datasets, models, and tools will support future work on Southern Uzbek and other low-resource languages. Abstract: Southern Uzbek (uzs) is a Turkic language variety spoken by around 5 million people in Afghanistan and differs significantly from Northern Uzbek (uzn) in phonology, lexicon, and orthography. Despite the large number of speakers, Southern Uzbek is underrepresented in natural language processing. We present new resources for Southern Uzbek machine translation, including a 997-sentence FLORES+ dev set, 39,994 parallel sentences from dictionary, literary, and web sources, and a fine-tuned NLLB-200 model (lutfiy). We also propose a post-processing method for restoring Arabic-script half-space characters, which improves handling of morphological boundaries. All datasets, models, and tools are released publicly to support future work on Southern Uzbek and other low-resource languages.

[32] Continuous sentiment scores for literary and multilingual contexts

Laurits Lyngbaek,Pascale Feldkamp,Yuri Bizzoni,Kristoffer Nielbo,Kenneth Enevoldsen

Main category: cs.CL

TL;DR: The paper introduces a new method for continuous sentiment scoring in literary texts using concept vector projection, which outperforms traditional tools in capturing nuanced sentiment across languages and genres.

Details

Motivation: The motivation stems from the challenges traditional sentiment analysis tools face with literary texts, such as figurative language and coarse categorical outputs, particularly in low-resource languages. Method: The method involves a novel continuous sentiment scoring approach using concept vector projection, trained on multilingual literary data for improved sentiment analysis. Result: The proposed method outperforms existing tools on English and Danish texts, producing sentiment scores closely aligned with human ratings. Conclusion: The paper concludes that the introduced continuous sentiment scoring method based on concept vector projection outperforms existing tools in capturing nuanced sentiment expressions in literary texts across languages and genres. Abstract: Sentiment Analysis is widely used to quantify sentiment in text, but its application to literary texts poses unique challenges due to figurative language, stylistic ambiguity, as well as sentiment evocation strategies. Traditional dictionary-based tools often underperform, especially for low-resource languages, and transformer models, while promising, typically output coarse categorical labels that limit fine-grained analysis. We introduce a novel continuous sentiment scoring method based on concept vector projection, trained on multilingual literary data, which more effectively captures nuanced sentiment expressions across genres, languages, and historical periods. Our approach outperforms existing tools on English and Danish texts, producing sentiment scores whose distribution closely matches human ratings, enabling more accurate analysis and sentiment arc modeling in literature.
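
The abstract does not describe how the concept vector is built. One common construction, assumed here purely for illustration, takes the difference between the mean embedding of positive examples and the mean embedding of negative examples, then scores a new text by projecting its embedding onto that direction. The 2-D embeddings below are made up; a real system would use sentence embeddings from a multilingual encoder.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def concept_vector(positive, negative):
    """Unit-length direction pointing from the negative mean to the positive mean."""
    diff = [p - n for p, n in zip(mean_vector(positive), mean_vector(negative))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def sentiment_score(embedding, concept):
    """Continuous score: scalar projection of the embedding onto the concept direction."""
    return sum(e * c for e, c in zip(embedding, concept))


# Hypothetical toy embeddings of labelled example sentences.
positive_examples = [(1.0, 0.2), (0.8, 0.0)]
negative_examples = [(-1.0, 0.2), (-0.8, 0.0)]
concept = concept_vector(positive_examples, negative_examples)

score_a = sentiment_score((0.5, 0.3), concept)
score_b = sentiment_score((-0.25, 0.9), concept)
```

Because the score is a real number and not a class label, it can be tracked sentence by sentence to draw a sentiment arc over a text.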

[33] Improving in-context learning with a better scoring function

Omar Naim,Swarnadeep Bhar,Jérôme Bolte,Nicholas Asher

Main category: cs.CL

TL;DR: This paper proposes scaled signed averaging (SSA) to overcome the limitations of Softmax in in-context learning for large language models, particularly improving performance on tasks involving first-order quantifiers and linear functions.

Details

Motivation: Recent studies have revealed limitations in the in-context learning ability of large language models using the Softmax scoring function. This paper aims to address these constraints, particularly for tasks involving first-order quantifiers and linear functions. Method: The paper analyzes limitations of Softmax in attention mechanisms for in-context learning, particularly on tasks with first-order quantifiers and linear functions. It proposes scaled signed averaging (SSA) as an alternative to Softmax and evaluates its performance in both encoder-only and decoder-only transformer models. Result: Empirical results show that scaled signed averaging (SSA) dramatically improves performance on the target tasks. Encoder-only and decoder-only transformer models using SSA match or exceed their Softmax-based counterparts in various linguistic probing tasks. Conclusion: Scaled signed averaging (SSA) improves performance on tasks involving first-order quantifiers and linear functions in in-context learning, outperforming Softmax-based models in linguistic probing tasks. Abstract: Large language models (LLMs) exhibit a remarkable capacity to learn by analogy, known as in-context learning (ICL). However, recent studies have revealed limitations in this ability. In this paper, we examine these limitations on tasks involving first-order quantifiers such as "all" and "some", as well as on ICL with linear functions. We identify Softmax, the scoring function in the attention mechanism, as a contributing factor to these constraints. To address this, we propose scaled signed averaging (SSA), a novel alternative to Softmax. Empirical results show that SSA dramatically improves performance on our target tasks. Furthermore, we evaluate both encoder-only and decoder-only transformer models with SSA, demonstrating that they match or exceed their Softmax-based counterparts across a variety of linguistic probing tasks.

[34] ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine

Junying Chen,Zhenyang Cai,Zhiheng Liu,Yunjin Yang,Rongsheng Wang,Qingying Xiao,Xiangyi Feng,Zhan Su,Jing Guo,Xiang Wan,Guangjun Yu,Haizhou Li,Benyou Wang

Main category: cs.CL

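TL;DR: This paper introduces ShizhenGPT, the first multimodal large language model designed for Traditional Chinese Medicine (TCM). It aims to overcome the scarcity of TCM data and the challenges of multimodal diagnosis, and performs strongly in TCM visual understanding and multimodal perception.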

Details

Motivation: The potential of large language models in Traditional Chinese Medicine (TCM) remains underexplored because high-quality TCM data is scarce and TCM diagnostics are inherently multimodal. Method: The authors curate a TCM dataset containing large amounts of text and multimodal data, then pretrain and instruction-tune ShizhenGPT to achieve deep TCM knowledge and multimodal reasoning. Result: Experiments show that ShizhenGPT outperforms comparable-scale LLMs on TCM qualification exam evaluations and competes with larger proprietary models. Conclusion: ShizhenGPT is the first multimodal LLM tailored for TCM; it leads in TCM visual understanding and demonstrates unified perception across modalities, paving the way toward holistic multimodal perception and diagnosis in TCM. Abstract: Despite the success of large language models (LLMs) in various domains, their potential in Traditional Chinese Medicine (TCM) remains largely underexplored due to two critical barriers: (1) the scarcity of high-quality TCM data and (2) the inherently multimodal nature of TCM diagnostics, which involve looking, listening, smelling, and pulse-taking. These sensory-rich modalities are beyond the scope of conventional LLMs. To address these challenges, we present ShizhenGPT, the first multimodal LLM tailored for TCM. To overcome data scarcity, we curate the largest TCM dataset to date, comprising 100GB+ of text and 200GB+ of multimodal data, including 1.2M images, 200 hours of audio, and physiological signals. ShizhenGPT is pretrained and instruction-tuned to achieve deep TCM knowledge and multimodal reasoning. For evaluation, we collect recent national TCM qualification exams and build a visual benchmark for Medicinal Recognition and Visual Diagnosis. Experiments demonstrate that ShizhenGPT outperforms comparable-scale LLMs and competes with larger proprietary models. Moreover, it leads in TCM visual understanding among existing multimodal LLMs and demonstrates unified perception across modalities like sound, pulse, smell, and vision, paving the way toward holistic multimodal perception and diagnosis in TCM. Datasets, models, and code are publicly available. We hope this work will inspire further exploration in this field.

[35] The Digital Sous Chef -- A Comparative Study on Fine-Tuning Language Models for Recipe Generation

Shubham Pundhir,Ganesh Bagler

Main category: cs.CL

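TL;DR: This paper presents a GPT-2-based approach to text-based recipe generation and significantly improves generation quality through an improved tokenization strategy.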

Details

Motivation: Text-based recipe generation is a fundamental task in natural language generation. Method: The study compares a fine-tuned GPT-2 large model against the GPT-2 small model and traditional LSTM/RNN baselines, and proposes a targeted tokenization strategy that enhances domain specificity. Result: Experiments show that the large transformer-based approach yields a >20% relative improvement in BERTScore (F1) (0.92 vs 0.72) over the best recurrent baseline, while reducing perplexity by 69.8%. Conclusion: The authors discuss remaining challenges, particularly regarding factual accuracy, and outline how this foundational study paves the way for integrating real-world constraints and multi-modal inputs in advanced recipe generation research. Abstract: We established a rigorous benchmark for text-based recipe generation, a fundamental task in natural language generation. We present a comprehensive comparative study contrasting a fine-tuned GPT-2 large (774M) model against the GPT-2 small (124M) model and traditional LSTM/RNN baselines on the 5-cuisine corpus from RecipeDB. Our key contribution is a targeted tokenization strategy that augments the vocabulary with 23 common fraction tokens and custom structural markers. This approach addresses a critical limitation of generic tokenizers by preserving essential recipe structures and precise numerical quantities, thereby enhancing domain specificity. Performance is evaluated using a comprehensive suite of seven automatic metrics spanning fluency (BLEU-4, METEOR), coherence (ROUGE-L), semantic relevance (BERTScore), and diversity. Our experiments show that the large transformer-based approach yields a >20% relative improvement in BERTScore (F1) (0.92 vs 0.72) over the best recurrent baseline, while reducing perplexity by 69.8%. We conclude with a discussion of remaining challenges, particularly regarding factual accuracy, and outline how this foundational study paves the way for integrating real-world constraints and multi-modal inputs in advanced recipe generation research.
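
The headline numbers in this abstract can be checked with a few lines of arithmetic. The helper name below is hypothetical; the two BERTScore values and the 69.8% figure are the ones quoted above.

```python
def relative_improvement(new_score, baseline_score):
    """Fractional gain of new_score over baseline_score."""
    return (new_score - baseline_score) / baseline_score


# BERTScore (F1) values quoted in the abstract.
gain = relative_improvement(0.92, 0.72)
print(f"relative BERTScore gain: {gain:.1%}")

# A 69.8% perplexity reduction leaves 30.2% of the baseline perplexity.
remaining_fraction = 1.0 - 0.698
print(f"perplexity relative to baseline: {remaining_fraction:.1%}")
```

The relative gain works out to roughly 27.8%, which is consistent with the ">20%" claim. The absolute difference between the two scores is 0.20.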

[36] Transplant Then Regenerate: A New Paradigm for Text Data Augmentation

Guangzhan Wang,Hongyu Zhang,Beijun Shen,Xiaodong Gu

Main category: cs.CL

TL;DR: This paper introduces LMTransplant, a novel text augmentation method leveraging LLMs through a transplant-then-regenerate strategy, resulting in more diverse and creative outputs while maintaining original text attributes.

Details

Motivation: Traditional data augmentation methods focus on lexical-level rephrasing, resulting in semantically similar variations. While LLMs offer enhanced text augmentation through their knowledge emergence capability, controlling output style and structure remains challenging. Method: LMTransplant uses a transplant-then-regenerate strategy, where seed text is incorporated into an expanded context generated by LLMs, and then a variant is regenerated based on this context. Result: LMTransplant enables the creation of more diverse and creative content-level variants while preserving the core attributes of the original text. It outperforms existing methods across various text-related tasks and scales well with increasing amounts of augmented data. Conclusion: LMTransplant demonstrates superior performance and scalability over existing text augmentation methods by leveraging the knowledge embedded in LLMs. Abstract: Data augmentation is a critical technique in deep learning. Traditional methods like Back-translation typically focus on lexical-level rephrasing, which primarily produces variations with the same semantics. While large language models (LLMs) have enhanced text augmentation by their "knowledge emergence" capability, controlling the style and structure of these outputs remains challenging and requires meticulous prompt engineering. In this paper, we propose LMTransplant, a novel text augmentation paradigm leveraging LLMs. The core idea of LMTransplant is transplant-then-regenerate: incorporating seed text into a context expanded by LLM, and asking the LLM to regenerate a variant based on the expanded context. This strategy allows the model to create more diverse and creative content-level variants by fully leveraging the knowledge embedded in LLMs, while preserving the core attributes of the original text. We evaluate LMTransplant across various text-related tasks, demonstrating its superior performance over existing text augmentation methods. 
Moreover, LMTransplant demonstrates exceptional scalability as the size of augmented data grows.

[37] Evaluating Multilingual and Code-Switched Alignment in LLMs via Synthetic Natural Language Inference

Samir Abdaljalil,Erchin Serpedin,Khalid Qaraqe,Hasan Kurban

Main category: cs.CL

TL;DR: This paper introduces a controlled evaluation framework for multilingual NLI to assess LLM performance in monolingual and code-switched conditions. It finds that code-switching can improve performance, suggesting that lexical variation from translation acts as a regularization signal. The study highlights both the potential and limitations of current LLM cross-lingual reasoning and identifies code-switching as a promising approach for improving multilingual robustness.

Details

Motivation: The motivation behind this study is to explore the under-researched area of consistent, logically grounded alignment across languages in large language models (LLMs). As LLMs are increasingly used in multilingual contexts, understanding their performance and robustness in cross-lingual reasoning is essential. Method: The study introduces a controlled evaluation framework for multilingual NLI. It generates synthetic, logic-based premise-hypothesis pairs and translates them into typologically diverse languages. The framework tests both monolingual and mixed-language (code-switched) conditions. Semantic preservation is validated through embedding-based similarity analyses and cross-lingual alignment visualizations. Result: The results show that code-switching does not degrade, and can even improve, the performance of LLMs in multilingual NLI tasks. The translation-induced lexical variation appears to serve as a regularization signal. Additionally, the study validates the fidelity of translated premise-hypothesis pairs through embedding-based similarity analyses and cross-lingual alignment visualizations. Conclusion: The study concludes that code-switching can improve the performance of large language models (LLMs) in multilingual natural language inference (NLI), suggesting that translation-induced lexical variation acts as a regularization signal. This highlights the potential and identifies the brittleness of current LLM cross-lingual reasoning while emphasizing code-switching as a promising method for enhancing multilingual robustness. Abstract: Large language models (LLMs) are increasingly applied in multilingual contexts, yet their capacity for consistent, logically grounded alignment across languages remains underexplored. We present a controlled evaluation framework for multilingual natural language inference (NLI) that generates synthetic, logic-based premise-hypothesis pairs and translates them into a typologically diverse set of languages. 
This design enables precise control over semantic relations and allows testing in both monolingual and mixed-language (code-switched) conditions. Surprisingly, code-switching does not degrade, and can even improve, performance, suggesting that translation-induced lexical variation may serve as a regularization signal. We validate semantic preservation through embedding-based similarity analyses and cross-lingual alignment visualizations, confirming the fidelity of translated pairs. Our findings expose both the potential and the brittleness of current LLM cross-lingual reasoning, and identify code-switching as a promising lever for improving multilingual robustness. Code available at: https://github.com/KurbanIntelligenceLab/nli-stress-testing

[38] TransLLM: A Unified Multi-Task Foundation Framework for Urban Transportation via Learnable Prompting

Jiaming Leng,Yunying Bi,Chuan Qin,Bing Yin,Yanyong Zhang,Chao Wang

Main category: cs.CL

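TL;DR: This paper proposes TransLLM, a unified framework that combines spatiotemporal modeling with large language models and uses learnable prompt composition to address multiple tasks in urban transportation.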

Details

Motivation: Existing approaches are limited: small-scale deep learning models are task-specific and data-hungry, while large language models struggle with structured spatiotemporal data and numerical reasoning. Method: TransLLM captures complex dependencies with a lightweight spatiotemporal encoder and dynamically personalizes prompts with an instance-level prompt routing mechanism trained via reinforcement learning. Result: Experiments show that TransLLM performs well across multiple datasets and tasks, delivering competitive performance against ten baseline models and showing strong generalization and cross-task adaptability in both supervised and zero-shot settings. Conclusion: TransLLM provides a flexible and efficient foundation framework for multiple tasks in urban transportation systems, effectively combining the strengths of spatiotemporal modeling and large language models. Abstract: Urban transportation systems encounter diverse challenges across multiple tasks, such as traffic forecasting, electric vehicle (EV) charging demand prediction, and taxi dispatch. Existing approaches suffer from two key limitations: small-scale deep learning models are task-specific and data-hungry, limiting their generalizability across diverse scenarios, while large language models (LLMs), despite offering flexibility through natural language interfaces, struggle with structured spatiotemporal data and numerical reasoning in transportation domains. To address these limitations, we propose TransLLM, a unified foundation framework that integrates spatiotemporal modeling with large language models through learnable prompt composition. Our approach features a lightweight spatiotemporal encoder that captures complex dependencies via dilated temporal convolutions and dual-adjacency graph attention networks, seamlessly interfacing with LLMs through structured embeddings. A novel instance-level prompt routing mechanism, trained via reinforcement learning, dynamically personalizes prompts based on input characteristics, moving beyond fixed task-specific templates. The framework operates by encoding spatiotemporal patterns into contextual representations, dynamically composing personalized prompts to guide LLM reasoning, and projecting the resulting representations through specialized output layers to generate task-specific predictions. Experiments across seven datasets and three tasks demonstrate the exceptional effectiveness of TransLLM in both supervised and zero-shot settings. 
Compared to ten baseline models, it delivers competitive performance on both regression and planning problems, showing strong generalization and cross-task adaptability. Our code is available at https://github.com/BiYunying/TransLLM.

[39] Evaluating Retrieval-Augmented Generation vs. Long-Context Input for Clinical Reasoning over EHRs

Skatje Myers,Dmitriy Dligach,Timothy A. Miller,Samantha Barr,Yanjun Gao,Matthew Churpek,Anoop Mayampurath,Majid Afshar

Main category: cs.CL

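TL;DR: This paper studies the effectiveness of retrieval-augmented generation (RAG) for processing electronic health records (EHRs), finding that it closely matches or exceeds the performance of using the most recent clinical notes and remains efficient and competitive as models handle longer text.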

Details

Motivation: EHRs are long, noisy, and often redundant, posing a major challenge for the clinicians who must navigate them. Large language models (LLMs) offer a promising solution, but the length of clinical notes often exceeds even state-of-the-art models' extended context windows. Method: Using EHRs from actual hospitalized patients, the study tests three state-of-the-art LLMs with varying amounts of provided context, using either targeted text retrieval or the most recent clinical notes. Result: RAG closely matches or exceeds the performance of using recent notes, and approaches the performance of using the models' full context while requiring drastically fewer input tokens. Conclusion: RAG is a competitive and efficient approach and remains so even as newer models become capable of handling increasingly long text. Abstract: Electronic health records (EHRs) are long, noisy, and often redundant, posing a major challenge for the clinicians who must navigate them. Large language models (LLMs) offer a promising solution for extracting and reasoning over this unstructured text, but the length of clinical notes often exceeds even state-of-the-art models' extended context windows. Retrieval-augmented generation (RAG) offers an alternative by retrieving task-relevant passages from across the entire EHR, potentially reducing the amount of required input tokens. In this work, we propose three clinical tasks designed to be replicable across health systems with minimal effort: 1) extracting imaging procedures, 2) generating timelines of antibiotic use, and 3) identifying key diagnoses. Using EHRs from actual hospitalized patients, we test three state-of-the-art LLMs with varying amounts of provided context, using either targeted text retrieval or the most recent clinical notes. We find that RAG closely matches or exceeds the performance of using recent notes, and approaches the performance of using the models' full context while requiring drastically fewer input tokens. Our results suggest that RAG remains a competitive and efficient approach even as newer models become capable of handling increasingly long amounts of text.
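
The two context-selection strategies compared in this abstract can be contrasted with a toy sketch. The abstract does not name the retriever, so simple query-term overlap stands in for it here as an assumption, and the notes, query, and token budget are invented for illustration.

```python
def tokenize(text):
    """Whitespace tokenizer used both for scoring and for counting the budget."""
    return text.lower().split()


def fill_budget(ranked_notes, token_budget):
    """Take notes in ranked order, skipping any that would exceed the budget."""
    picked, used = [], 0
    for note in ranked_notes:
        cost = len(tokenize(note["text"]))
        if used + cost <= token_budget:
            picked.append(note)
            used += cost
    return picked


def retrieve(notes, query, token_budget):
    """Targeted retrieval: rank notes by how many query terms they contain."""
    query_terms = set(tokenize(query))
    ranked = sorted(
        notes,
        key=lambda n: len(query_terms & set(tokenize(n["text"]))),
        reverse=True,
    )
    return fill_budget(ranked, token_budget)


def most_recent(notes, token_budget):
    """Baseline: rank notes by recency only."""
    ranked = sorted(notes, key=lambda n: n["day"], reverse=True)
    return fill_budget(ranked, token_budget)


# Hypothetical record with three short notes.
notes = [
    {"day": 1, "text": "chest ct performed showing pneumonia"},
    {"day": 2, "text": "patient resting comfortably no complaints"},
    {"day": 3, "text": "diet advanced tolerating oral intake well"},
]
query = "ct imaging performed"
rag_days = [n["day"] for n in retrieve(notes, query, token_budget=6)]
recent_days = [n["day"] for n in most_recent(notes, token_budget=6)]
```

With the same six-token budget, retrieval surfaces the day-1 imaging note while the recency baseline keeps only the day-3 note. This illustrates how retrieval can reach task-relevant text from anywhere in the record without spending more input tokens.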

[40] Long Chain-of-Thought Reasoning Across Languages

Josh Barua,Seun Eisape,Kayo Yin,Alane Suhr

Main category: cs.CL

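TL;DR: This paper examines how long chain-of-thought (CoT) reasoning performs across languages and finds that its effectiveness varies by language. By fine-tuning Qwen models and analyzing multilingual datasets, the authors reveal the potential and challenges of cross-lingual reasoning and show that the trade-off between data quality and scale is language dependent.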

Details

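Motivation: Although long chain-of-thought (CoT) reasoning has significantly improved the reasoning capabilities of large language models (LLMs), the process remains almost exclusively English-centric. The authors therefore aim to explore the effectiveness of long CoT reasoning in multilingual settings and the factors that influence it. Method: The authors construct translated versions of two popular English reasoning datasets, fine-tune Qwen 2.5 (7B) and Qwen 3 (8B) models, and systematically study long CoT generation in French, Japanese, Latvian, and Swahili. Result: The experiments reveal three key findings: 1) the efficacy of English as a pivot language varies by language: it provides no benefit for French, improves performance for Japanese and Latvian, and proves insufficient for Swahili; 2) extensive multilingual pretraining in Qwen 3 narrows but does not eliminate the cross-lingual performance gap, while a lightweight fine-tune using only 1k reasoning traces improves performance by over 30% in Swahili; 3) the trade-off between data quality and scale is language dependent: small, carefully curated datasets suffice for English and French, whereas larger but noisier corpora are more effective for Swahili and Latvian. Conclusion: The effectiveness of long CoT reasoning varies across languages: the benefit of English as a pivot language differs by language, and extensive multilingual pretraining in Qwen 3 narrows but does not eliminate the cross-lingual performance gap. In addition, the trade-off between data quality and scale is language dependent, and lightweight fine-tuning can substantially improve performance in languages such as Swahili. Abstract: Scaling inference through long chains-of-thought (CoTs) has unlocked impressive reasoning capabilities in large language models (LLMs), yet the reasoning process remains almost exclusively English-centric. We construct translated versions of two popular English reasoning datasets, fine-tune Qwen 2.5 (7B) and Qwen 3 (8B) models, and present a systematic study of long CoT generation across French, Japanese, Latvian, and Swahili. Our experiments reveal three key findings. First, the efficacy of using English as a pivot language varies by language: it provides no benefit for French, improves performance when used as the reasoning language for Japanese and Latvian, and proves insufficient for Swahili where both task comprehension and reasoning remain poor. Second, extensive multilingual pretraining in Qwen 3 narrows but does not eliminate the cross-lingual performance gap. A lightweight fine-tune using only 1k traces still improves performance by over 30% in Swahili. Third, data quality versus scale trade-offs are language dependent: small, carefully curated datasets suffice for English and French, whereas larger but noisier corpora prove more effective for Swahili and Latvian. Together, these results clarify when and why long CoTs transfer across languages and provide translated datasets to foster equitable multilingual reasoning research.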

[41] MedReseacher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework

Ailing Yu,Lan Yao,Jingnan Liu,Zhe Chen,Jiajun Yin,Yuan Wang,Xinhao Liao,Zhiling Ye,Ji Li,Yun Yue,Hansong Xiao,Hualei Zhou,Chunxiao Guo,Peng Wei,Jinjie Gu

Main category: cs.CL

TL;DR: MedResearcher-R1-32B combines a medical knowledge graph with a custom retrieval engine to outperform larger proprietary systems in the medical domain.

Details

Motivation: Off-the-shelf general-purpose deep research agents struggle in the medical domain, as shown by their limited accuracy on medical benchmarks. Method: The authors develop a novel data synthesis framework that uses medical knowledge graphs to generate complex multi-hop question-answer pairs, and integrate a custom-built private medical retrieval engine. Result: The approach generates 2100+ diverse trajectories across 12 medical specialties and achieves new state-of-the-art results on medical benchmarks. Conclusion: Through domain-specific innovations in architecture, tool design, and training data construction, the MedResearcher-R1-32B model shows that smaller open-source models can outperform much larger proprietary systems in specialized domains. Abstract: Recent developments in Large Language Model (LLM)-based agents have shown impressive capabilities spanning multiple domains, exemplified by deep research systems that demonstrate superior performance on complex information-seeking and synthesis tasks. While general-purpose deep research agents have shown impressive capabilities, they struggle significantly with medical domain challenges, as evidenced by leading proprietary systems achieving limited accuracy on complex medical benchmarks. The key limitations are: (1) the model lacks sufficient dense medical knowledge for clinical reasoning, and (2) the framework is constrained by the absence of specialized retrieval tools tailored for medical contexts. We present a medical deep research agent that addresses these challenges through two core innovations. First, we develop a novel data synthesis framework using medical knowledge graphs, extracting the longest chains from subgraphs around rare medical entities to generate complex multi-hop question-answer pairs. Second, we integrate a custom-built private medical retrieval engine alongside general-purpose tools, enabling accurate medical information synthesis. Our approach generates 2100+ diverse trajectories across 12 medical specialties, each averaging 4.2 tool interactions. Through a two-stage training paradigm combining supervised fine-tuning and online reinforcement learning with composite rewards, our MedResearcher-R1-32B model demonstrates exceptional performance, establishing new state-of-the-art results on medical benchmarks while maintaining competitive performance on general deep research tasks. 
Our work demonstrates that strategic domain-specific innovations in architecture, tool design, and training data construction can enable smaller open-source models to outperform much larger proprietary systems in specialized domains.

[42] Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs

Haokun Lin,Haobo Xu,Yichen Wu,Ziyu Guo,Renrui Zhang,Zhichao Lu,Ying Wei,Qingfu Zhang,Zhenan Sun

Main category: cs.CL

TL;DR: This paper presents the first systematic study on quantizing diffusion-based language models, exploring post-training quantization methods to enable efficient deployment on edge devices.

Details

Motivation: Diffusion large language models (dLLMs) face deployment challenges on edge devices due to their large parameter scale and high resource demands, while the applicability of PTQ to dLLMs remains largely unexplored. Method: The researchers implemented state-of-the-art post-training quantization (PTQ) methods and conducted a comprehensive evaluation across multiple task types and model variants, analyzing quantization along four dimensions: bit-width, quantization method, task category, and model type. Result: The study identified activation outliers that hinder low-bit quantization and presented a systematic evaluation of PTQ methods on dLLMs, offering insights into their quantization behavior. Conclusion: The study provides practical insights into the quantization behavior of diffusion large language models (dLLMs) under various configurations, aiming to support future research in efficient dLLM deployment. Abstract: Recent advances in diffusion large language models (dLLMs) have introduced a promising alternative to autoregressive (AR) LLMs for natural language generation tasks, leveraging full attention and denoising-based decoding strategies. However, the deployment of these models on edge devices remains challenging due to their massive parameter scale and high resource demands. While post-training quantization (PTQ) has emerged as a widely adopted technique for compressing AR LLMs, its applicability to dLLMs remains largely unexplored. In this work, we present the first systematic study on quantizing diffusion-based language models. We begin by identifying the presence of activation outliers, characterized by abnormally large activation values that dominate the dynamic range. These outliers pose a key challenge to low-bit quantization, as they make it difficult to preserve precision for the majority of values. 
More importantly, we implement state-of-the-art PTQ methods and conduct a comprehensive evaluation across multiple task types and model variants. Our analysis is structured along four key dimensions: bit-width, quantization method, task category, and model type. Through this multi-perspective evaluation, we offer practical insights into the quantization behavior of dLLMs under different configurations. We hope our findings provide a foundation for future research in efficient dLLM deployment. All codes and experimental setups will be released to support the community.
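
The effect of activation outliers on low-bit quantization described in this abstract can be illustrated with a small sketch. It is not the paper's method: it assumes plain symmetric per-tensor absmax quantization, and the values, the 4-bit setting, and the helper names `quantize_absmax` and `mean_abs_error` are hypothetical choices made for illustration.

```python
def quantize_absmax(values, bits):
    """Symmetric per-tensor quantization: snap to a signed integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit signed
    scale = max(abs(v) for v in values) / qmax      # one scale for the whole tensor
    return [round(v / scale) * scale for v in values]


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Toy "activations": 17 ordinary values in [-0.8, 0.8] (hypothetical data).
normal = [0.1 * i for i in range(-8, 9)]
# Same values plus a single abnormally large activation.
with_outlier = normal + [60.0]

err_clean = mean_abs_error(normal, quantize_absmax(normal, 4))

deq = quantize_absmax(with_outlier, 4)
# Error measured only on the ordinary values, ignoring the outlier itself.
err_outlier = mean_abs_error(normal, deq[:-1])

print(err_clean, err_outlier)
```

Because the single outlier sets the scale, every ordinary value falls into the zero bin in this toy case, which is the loss of precision for the majority of values that the abstract refers to.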

cs.CV [Back]

[43] A comparative study of some wavelet and sampling operators on various features of an image

Digvijay Singh,Rahul Shukla,Karunesh Kumar Singh

Main category: cs.CV

TL;DR: This paper analyzes the convergence and approximation properties of sampling Kantorovich and related operators in image analysis, showing that their effectiveness varies depending on image features and conditions.

Details

Motivation: The motivation is to understand the convergence properties and approximation behavior of positive sampling Kantorovich operators and related mathematical tools in image analysis, particularly under non-ideal conditions. Method: The study uses sampling Kantorovich (SK), Gaussian, Bilateral, and thresholding wavelet-based operators to analyze convergence and approximation properties. Error measurement and evaluation of mathematical parameters like MSE, SI, SSI, SMPI, and ENL are conducted at various resolution levels. Result: The research demonstrates the performance of various operators through numerical examples, including the analysis of a 2D Shepp-Logan Phantom slice. It highlights the justification of the Fundamental Theorem of Approximation and shows how different operators perform under varying image features and conditions. Conclusion: The research concludes that different operators have varying effectiveness in studying different image features due to the uneven nature of images under non-ideal conditions. Some operators perform well for specific features while others do not. Abstract: This research includes the study of some positive sampling Kantorovich operators (SK operators) and their convergence properties. A comprehensive analysis of both local and global approximation properties is presented using sampling Kantorovich (SK), Gaussian, Bilateral and the thresholding wavelet-based operators in the framework of SK-operators. Explicitly, we start the article by introducing the basic terminology and state the fundamental theorem of approximation (FTA) by imposing the various required conditions corresponding to the various defined operators. We measure the error and study the other mathematical parameters such as the mean square error (MSE), the speckle index (SI), the speckle suppression index (SSI), the speckle mean preservation index (SMPI), and the equivalent number of looks (ENL) at various levels of resolution parameters. 
The nature of these operators is demonstrated via an example under ideal conditions, tabulated at a certain level of samples. A second numerical example then discusses the region of interest (ROI) via SI, SSI and SMPI for a 2D slice taken from the 3D Shepp-Logan Phantom image, which justifies the fundamental theorem of approximation (FTA). From the derivations and illustrations we observe that the various operators each have their own significance when studying different features of an image, because of the uneven nature of an image (non-ideal condition). Therefore, to some extent, some operators work well for specific features of the image and others do not.
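
For readers unfamiliar with the speckle metrics named in this abstract, the following sketch computes MSE, SI, SSI, and ENL on a toy homogeneous region. The definitions used here (SI as standard deviation over mean, SSI as the ratio of filtered SI to noisy SI, ENL as squared mean over variance) are the commonly used ones and are an assumption; the paper may define them differently. SMPI is omitted because its definition varies across sources. The pixel values are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def mse(reference, estimate):
    """Mean squared error between two equal-length pixel lists."""
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def speckle_index(region):
    """SI: standard deviation divided by mean (assumed definition)."""
    return sqrt(pvariance(region)) / mean(region)


def speckle_suppression_index(filtered, noisy):
    """SSI: SI of the filtered region relative to the noisy one; below 1 means suppression."""
    return speckle_index(filtered) / speckle_index(noisy)


def equivalent_number_of_looks(region):
    """ENL: squared mean over variance; larger means a smoother homogeneous region."""
    return mean(region) ** 2 / pvariance(region)


# Hypothetical homogeneous region: true intensity 100 everywhere.
clean = [100] * 6
noisy = [90, 110, 95, 105, 100, 100]
filtered = [98, 102, 99, 101, 100, 100]   # output of some smoothing operator

print("MSE noisy/filtered:", mse(clean, noisy), mse(clean, filtered))
print("SSI:", speckle_suppression_index(filtered, noisy))
print("ENL noisy/filtered:", equivalent_number_of_looks(noisy),
      equivalent_number_of_looks(filtered))
```

Under these definitions ENL equals 1/SI², so the two metrics carry the same information about a single region; SSI adds the comparison against the unfiltered input.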

[44] Federated Action Recognition for Smart Worker Assistance Using FastPose

Vinit Hegiste,Vidit Goyal,Tatjana Legler,Martin Ruskowski

Main category: cs.CV

TL;DR: This paper presents a federated learning approach for privacy-preserving human activity recognition in industrial settings, showing improved performance over centralized methods.

Details

Motivation: The motivation is to address the need for accurate and real-time recognition of worker actions in smart manufacturing environments while preserving privacy, particularly in scenarios where centralized datasets are impractical due to privacy concerns. Method: The paper employs a federated learning framework for pose-based HAR using a custom skeletal dataset. Two temporal backbones, LSTM and Transformer encoder, are trained and evaluated under four paradigms: centralized, local (per-client), FL with weighted federated averaging (FedAvg), and federated ensemble learning (FedEnsemble). Result: On the global test set, the FL Transformer improved over centralized training by +12.4 percentage points, with FedEnsemble delivering a +16.3 percentage points gain. For an unseen external client, FL and FedEnsemble exceeded centralized accuracy by +52.6 and +58.3 percentage points, respectively. Conclusion: The study concludes that federated learning (FL) not only preserves privacy but also significantly enhances cross-user generalization for pose-based human activity recognition (HAR), making it a practical solution for scalable and privacy-aware HAR in heterogeneous industrial settings. Abstract: In smart manufacturing environments, accurate and real-time recognition of worker actions is essential for productivity, safety, and human-machine collaboration. While skeleton-based human activity recognition (HAR) offers robustness to lighting, viewpoint, and background variations, most existing approaches rely on centralized datasets, which are impractical in privacy-sensitive industrial scenarios. This paper presents a federated learning (FL) framework for pose-based HAR using a custom skeletal dataset of eight industrially relevant upper-body gestures, captured from five participants and processed using a modified FastPose model. 
Two temporal backbones, an LSTM and a Transformer encoder, are trained and evaluated under four paradigms: centralized, local (per-client), FL with weighted federated averaging (FedAvg), and federated ensemble learning (FedEnsemble). On the global test set, the FL Transformer improves over centralized training by +12.4 percentage points, with FedEnsemble delivering a +16.3 percentage points gain. On an unseen external client, FL and FedEnsemble exceed centralized accuracy by +52.6 and +58.3 percentage points, respectively. These results demonstrate that FL not only preserves privacy but also substantially enhances cross-user generalization, establishing it as a practical solution for scalable, privacy-aware HAR in heterogeneous industrial settings.
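
The weighted federated averaging (FedAvg) step mentioned above can be sketched as follows. This is a minimal illustration of the general technique, not the authors' training code: models are reduced to flat parameter lists, and the function name `fed_avg`, the client parameters, and the sample counts are hypothetical.

```python
def fed_avg(client_params, client_sizes):
    """Average client parameter vectors, weighting each client by its number of samples."""
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical clients with locally trained two-parameter models.
client_params = [[1.0, 2.0], [3.0, 4.0]]
client_sizes = [1, 3]            # the second client holds three times as much data

global_params = fed_avg(client_params, client_sizes)
print(global_params)             # [2.5, 3.5]
```

Only parameters and sample counts leave each client in this scheme; the raw skeleton sequences stay local, which is the privacy property the abstract relies on.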

[45] LENS: Learning to Segment Anything with Unified Reinforced Reasoning

Lianghui Zhu,Bin Ouyang,Yuxuan Zhang,Tianheng Cheng,Rui Hu,Haocheng Shen,Longjin Ran,Xiaoxin Chen,Li Yu,Wenyu Liu,Xinggang Wang

Main category: cs.CV

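TL;DR: LENS is a reinforcement-learning framework that improves the generalization of text-prompted image segmentation through chain-of-thought reasoning, outperforming existing methods.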

Details

Motivation: Existing supervised fine-tuning methods ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. Method: The authors introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation, using rewards that span sentence-, box-, and segment-level cues. Result: LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming GLaMM by up to 5.6%. Conclusion: Through RL-driven CoT reasoning, LENS offers a more generalizable model for text-prompted segmentation that outperforms existing fine-tuning methods. Abstract: Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. We propose unified reinforcement-learning rewards that span sentence-, box-, and segment-level cues, encouraging the model to generate informative CoT rationales while refining mask quality. Using a publicly available 3-billion-parameter vision-language model, i.e., Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%. These results demonstrate that RL-driven CoT reasoning serves as a robust prior for text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models. Code is available at https://github.com/hustvl/LENS.
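
The cIoU figure quoted above is, in the referring-segmentation literature, usually the cumulative IoU: total intersection over total union across the whole evaluation set, rather than the mean of per-image IoUs. The sketch below assumes that definition (the abstract does not spell it out) and uses hypothetical flattened binary masks; `cumulative_iou` and `mean_iou` are illustrative names.

```python
def cumulative_iou(pred_masks, gt_masks):
    """Sum intersections and unions over all samples, then divide once."""
    inter = union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        inter += sum(1 for p, g in zip(pred, gt) if p and g)
        union += sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 0.0


def mean_iou(pred_masks, gt_masks):
    """Per-sample IoU averaged over samples, shown for contrast."""
    ious = [cumulative_iou([p], [g]) for p, g in zip(pred_masks, gt_masks)]
    return sum(ious) / len(ious)


# Two hypothetical 4-pixel masks (flattened, 1 = object).
preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
gts = [[1, 0, 0, 0], [1, 1, 1, 1]]

print(cumulative_iou(preds, gts))   # 2 / 6
print(mean_iou(preds, gts))         # (0.5 + 0.25) / 2
```

The two metrics differ because cumulative IoU weights each sample by its mask area, so large objects dominate the score.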

[46] RynnEC: Bringing MLLMs into Embodied World

Ronghao Dang,Yuqian Yuan,Yunxuan Mao,Kehan Li,Jiangpin Liu,Zhikai Wang,Xin Li,Fan Wang,Deli Zhao

Main category: cs.CV

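TL;DR: RynnEC is a video multimodal large language model for embodied cognition that enables flexible region-level video interaction through a region encoder and a mask decoder, and performs strongly across a range of tasks.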

Details

Motivation: To address the scarcity of annotated 3D datasets and to provide a region-centric video multimodal large language model for embodied cognition. Method: RynnEC builds on a general-purpose vision-language foundation model and adds a region encoder and a mask decoder, enabling flexible region-level video interaction. The authors also propose an egocentric video data generation pipeline and the RynnEC-Bench benchmark. Result: Despite its compact architecture, RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning. Conclusion: RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning, and offers a region-centric video paradigm for the brain of embodied agents that facilitates generalization across diverse embodied tasks. Abstract: We introduce RynnEC, a video multimodal large language model designed for embodied cognition. Built upon a general-purpose vision-language foundation model, RynnEC incorporates a region encoder and a mask decoder, enabling flexible region-level video interaction. Despite its compact architecture, RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning. Conceptually, it offers a region-centric video paradigm for the brain of embodied agents, providing fine-grained perception of the physical world and enabling more precise interactions. To mitigate the scarcity of annotated 3D datasets, we propose an egocentric video based pipeline for generating embodied cognition data. Furthermore, we introduce RynnEC-Bench, a region-centered benchmark for evaluating embodied cognitive capabilities. We anticipate that RynnEC will advance the development of general-purpose cognitive cores for embodied agents and facilitate generalization across diverse embodied tasks. The code, model checkpoints, and benchmark are available at: https://github.com/alibaba-damo-academy/RynnEC

[47] Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer

Md Ashiqur Rahman,Chiao-An Yang,Michael N. Cheng,Lim Jun Hao,Jeremiah Jiang,Teck-Yian Lim,Raymond A. Yeh

Main category: cs.CV

TL;DR: This paper proposes a deep equilibrium canonicalizer (DEC) to enhance local scale equivariance in computer vision models, demonstrating improved performance and consistency across multiple pre-trained networks on the ImageNet benchmark.

Details

Motivation: Scale variation is a fundamental challenge in computer vision. Objects of the same class can have different sizes, and their perceived size is further affected by the distance from the camera. These variations are local to the objects, i.e., different object sizes may change differently within the same image. Method: We present a deep equilibrium canonicalizer (DEC) to improve the local scale equivariance of a model. DEC can be easily incorporated into existing network architectures and can be adapted to a pre-trained model. Result: On the competitive ImageNet benchmark, DEC improves both model performance and local scale consistency across four popular pre-trained deep-nets. Conclusion: DEC improves both model performance and local scale consistency across four popular pre-trained deep-nets, e.g., ViT, DeiT, Swin, and BEiT. Abstract: Scale variation is a fundamental challenge in computer vision. Objects of the same class can have different sizes, and their perceived size is further affected by the distance from the camera. These variations are local to the objects, i.e., different object sizes may change differently within the same image. To effectively handle scale variations, we present a deep equilibrium canonicalizer (DEC) to improve the local scale equivariance of a model. DEC can be easily incorporated into existing network architectures and can be adapted to a pre-trained model. Notably, we show that on the competitive ImageNet benchmark, DEC improves both model performance and local scale consistency across four popular pre-trained deep-nets, e.g., ViT, DeiT, Swin, and BEiT. Our code is available at https://github.com/ashiq24/local-scale-equivariance.

[48] CLIPSym: Delving into Symmetry Detection with CLIP

Tinghan Yang,Md Ashiqur Rahman,Raymond A. Yeh

Main category: cs.CV

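TL;DR: This paper proposes CLIPSym, which uses the CLIP model together with a newly designed decoder and prompting technique to achieve better symmetry detection performance.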

Details

Motivation: Symmetry is important in computer vision yet challenging to detect; building on recent advances in vision-language models such as CLIP, the authors explore how a CLIP model can be used to improve symmetry detection. Method: The authors propose CLIPSym, which combines CLIP's image and language encoders with a rotation-equivariant decoder based on a hybrid of Transformer and G-Convolution, and develop Semantic-Aware Prompt Grouping (SAPG) to better integrate semantic cues. Result: CLIPSym outperforms the current state-of-the-art on three standard symmetry detection datasets (DENDI, SDRW, and LDRS), and ablations verify the benefits of CLIP's pre-training, the equivariant decoder, and SAPG. Conclusion: By leveraging CLIP's image and language encoders together with the proposed rotation-equivariant decoder and SAPG, CLIPSym outperforms the current state-of-the-art on symmetry detection. Abstract: Symmetry is one of the most fundamental geometric cues in computer vision, and detecting it has been an ongoing challenge. With the recent advances in vision-language models, i.e., CLIP, we investigate whether a pre-trained CLIP model can aid symmetry detection by leveraging the additional symmetry cues found in the natural image descriptions. We propose CLIPSym, which leverages CLIP's image and language encoders and a rotation-equivariant decoder based on a hybrid of Transformer and $G$-Convolution to detect rotation and reflection symmetries. To fully utilize CLIP's language encoder, we have developed a novel prompting technique called Semantic-Aware Prompt Grouping (SAPG), which aggregates a diverse set of frequent object-based prompts to better integrate the semantic cues for symmetry detection. Empirically, we show that CLIPSym outperforms the current state-of-the-art on three standard symmetry detection datasets (DENDI, SDRW, and LDRS). Finally, we conduct detailed ablations verifying the benefits of CLIP's pre-training, the proposed equivariant decoder, and the SAPG technique. The code is available at https://github.com/timyoung2333/CLIPSym.

[49] A Survey on Video Anomaly Detection via Deep Learning: Human, Vehicle, and Environment

Ghazal Alinezhad Noghre,Armin Danesh Pazho,Hamed Tabkhi

Main category: cs.CV

TL;DR: This survey provides a comprehensive overview of Video Anomaly Detection (VAD), organizing the literature across different supervision levels and adaptive learning techniques, while identifying key challenges and contributions in human-centric, vehicle-centric, and environment-centric applications.

Details

Motivation: The field of Video Anomaly Detection (VAD) is fragmented across domains and learning paradigms. The authors aim to provide a comprehensive perspective and structured foundation for advancing research and applications in this area. Method: The paper systematically organizes existing literature on VAD across various supervision levels and adaptive learning methods, while analyzing methodologies and applications in three major categories: human-centric, vehicle-centric, and environment-centric scenarios. Result: The survey identifies the fundamental contributions and limitations of current VAD methodologies, and provides a structured foundation for advancing theoretical understanding and real-world applicability of VAD systems. Conclusion: The survey concludes that consolidating insights from subfields can advance both theoretical understanding and real-world applicability of VAD systems, highlighting open challenges in research and deployment. Abstract: Video Anomaly Detection (VAD) has emerged as a pivotal task in computer vision, with broad relevance across multiple fields. Recent advances in deep learning have driven significant progress in this area, yet the field remains fragmented across domains and learning paradigms. This survey offers a comprehensive perspective on VAD, systematically organizing the literature across various supervision levels, as well as adaptive learning methods such as online, active, and continual learning. We examine the state of VAD across three major application categories: human-centric, vehicle-centric, and environment-centric scenarios, each with distinct challenges and design considerations. In doing so, we identify fundamental contributions and limitations of current methodologies. By consolidating insights from subfields, we aim to provide the community with a structured foundation for advancing both theoretical understanding and real-world applicability of VAD systems. 
This survey aims to support researchers by providing a useful reference, while also drawing attention to the broader set of open challenges in anomaly detection, including both fundamental research questions and practical obstacles to real-world deployment.

[50] Accelerating Image Classification with Graph Convolutional Neural Networks using Voronoi Diagrams

Mustafa Mohammadi Gharasuie,Luis Rueda

Main category: cs.CV

TL;DR: This study introduces a new image classification framework using GCNs with Voronoi diagrams, resulting in the faster normalized Voronoi Graph Convolution Network (NVGCN), which improves pre-processing time and classification accuracy.

Details

Motivation: Recent advances in image classification have been driven by the integration of Graph Convolutional Networks (GCNs), which offer a new way to handle complex data structures. This study aims to further improve image classification by introducing a novel method that leverages the capabilities of GCNs and Voronoi diagrams. Method: The study employs an innovative framework that uses Graph Convolutional Networks (GCNs) in conjunction with Voronoi diagrams for image classification. It utilizes a graph-based representation of images where pixels or regions are treated as vertices of a graph, simplified through Delaunay triangulations. Result: The model demonstrates significant improvements in pre-processing time and classification accuracy on benchmark datasets, particularly in scenarios involving complex scenes and fine-grained categories, surpassing existing state-of-the-art models. Conclusion: This research contributes to the field of image classification by introducing a novel approach that combines GCNs with Voronoi diagrams, leading to the development of the normalized Voronoi Graph Convolution Network (NVGCN), which outperforms traditional GCNs in terms of speed. Abstract: Recent advances in image classification have been significantly propelled by the integration of Graph Convolutional Networks (GCNs), offering a novel paradigm for handling complex data structures. This study introduces an innovative framework that employs GCNs in conjunction with Voronoi diagrams to perform image classification, leveraging their exceptional capability to model relational data. Unlike conventional convolutional neural networks, our approach utilizes a graph-based representation of images, where pixels or regions are treated as vertices of a graph, which are then simplified in the form of the corresponding Delaunay triangulations. 
Our model yields significant improvement in pre-processing time and classification accuracy on several benchmark datasets, surpassing existing state-of-the-art models, especially in scenarios that involve complex scenes and fine-grained categories. The experimental results, validated via cross-validation, underscore the potential of integrating GCNs with Voronoi diagrams in advancing image classification tasks. This research contributes to the field by introducing a novel approach to image classification, while opening new avenues for developing graph-based learning paradigms in other domains of computer vision and non-structured data. In particular, we have proposed a new version of the GCN in this paper, namely normalized Voronoi Graph Convolution Network (NVGCN), which is faster than the regular GCN.

[51] Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models

Thanh-Dat Truong,Huu-Thien Tran,Tran Thai Son,Bhiksha Raj,Khoa Luu

Main category: cs.CV

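TL;DR: This paper proposes a new method for improving the alignment and understanding capabilities of large multimodal models by introducing new tasks and loss functions, achieving breakthrough performance on multiple benchmarks.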

Details

Motivation: Existing LMMs are limited in robustness and generalization, mainly because of problems in the alignment and correlation between visual and textual features. Method: Two new tasks, reconstructing the image order and the text order, are introduced into the LMM's pre-training and fine-tuning phases, together with a new directed-token approach and an Image-to-Response Guided loss. Result: The proposed approach achieves state-of-the-art performance on both academic task-oriented and instruction-following LMM benchmarks. Conclusion: The paper proposes a simple but efficient learning mechanism that substantially improves the robustness and understanding of large multimodal models (LMMs) in aligning visual and textual modalities, achieving state-of-the-art performance on multiple benchmarks. Abstract: Large multimodal models (LMMs) have gained impressive performance due to their outstanding capability in various understanding tasks. However, these models still suffer from some fundamental limitations related to robustness and generalization due to the alignment and correlation between visual and textual features. In this paper, we introduce a simple but efficient learning mechanism for improving the robust alignment between visual and textual modalities by solving shuffling problems. In particular, the proposed approach can improve reasoning capability, visual understanding, and cross-modality alignment by introducing two new tasks: reconstructing the image order and the text order into the LMM's pre-training and fine-tuning phases. In addition, we propose a new directed-token approach to capture visual and textual knowledge, enabling the capability to reconstruct the correct order of visual inputs. Then, we introduce a new Image-to-Response Guided loss to further improve the visual understanding of the LMM in its responses. The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs on academic task-oriented and instruction-following LMM benchmarks.

[52] Effect of Data Augmentation on Conformal Prediction for Diabetic Retinopathy

Rizwan Ahamed,Annahita Amireskandari,Joel Palko,Carol Laxson,Binod Bhattarai,Prashnna Gyawali

Main category: cs.CV

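TL;DR: This paper studies how different data augmentation strategies affect conformal prediction performance in diabetic retinopathy grading, and finds that sample-mixing strategies improve both predictive accuracy and uncertainty quantification.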

Details

Motivation: Clinically deploying deep learning models for high-stakes tasks such as diabetic retinopathy grading requires reliable uncertainty quantification. Method: Using the DDR dataset, the study evaluates two backbone architectures (ResNet-50 and CoaT) under five augmentation strategies and analyzes the effect on conformal prediction metrics. Result: Sample-mixing strategies such as Mixup and CutMix not only improve predictive accuracy but also yield more reliable and efficient uncertainty estimates, whereas methods such as CLAHE can negatively affect model certainty. Conclusion: The findings show that co-designing augmentation strategies with downstream uncertainty quantification is essential for building genuinely trustworthy AI systems for medical imaging. Abstract: The clinical deployment of deep learning models for high-stakes tasks such as diabetic retinopathy (DR) grading requires demonstrable reliability. While models achieve high accuracy, their clinical utility is limited by a lack of robust uncertainty quantification. Conformal prediction (CP) offers a distribution-free framework to generate prediction sets with statistical guarantees of coverage. However, the interaction between standard training practices like data augmentation and the validity of these guarantees is not well understood. In this study, we systematically investigate how different data augmentation strategies affect the performance of conformal predictors for DR grading. Using the DDR dataset, we evaluate two backbone architectures -- ResNet-50 and a Co-Scale Conv-Attentional Transformer (CoaT) -- trained under five augmentation regimes: no augmentation, standard geometric transforms, CLAHE, Mixup, and CutMix. We analyze the downstream effects on conformal metrics, including empirical coverage, average prediction set size, and correct efficiency. Our results demonstrate that sample-mixing strategies like Mixup and CutMix not only improve predictive accuracy but also yield more reliable and efficient uncertainty estimates. Conversely, methods like CLAHE can negatively impact model certainty. These findings highlight the need to co-design augmentation strategies with downstream uncertainty quantification in mind to build genuinely trustworthy AI systems for medical imaging.
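
How conformal prediction turns softmax outputs into prediction sets can be sketched with the standard split-conformal recipe. This is an assumption about the general technique, not the paper's exact procedure: the nonconformity score is taken as one minus the probability of the true class, and the three-class probabilities, labels, and `alpha` are hypothetical (DR grading itself typically uses more grades).

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal quantile of the scores 1 - p(true class) on a calibration set."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))   # finite-sample corrected rank
    if k > n:
        return 1.0                         # too few calibration points: include every class
    return scores[k - 1]


def prediction_set(probs, qhat):
    """All classes whose score 1 - p stays within the calibrated threshold."""
    return [cls for cls, p in enumerate(probs) if 1.0 - p <= qhat]


# Hypothetical calibration data: softmax outputs and true labels.
cal_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]
cal_labels = [0, 1, 2, 1]

qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

uncertain = prediction_set([0.5, 0.35, 0.15], qhat)   # two plausible grades
confident = prediction_set([0.9, 0.05, 0.05], qhat)   # a single grade

print(qhat, uncertain, confident)
```

Average prediction set size, one of the metrics named in the abstract, is then the mean length of these sets over a test split; a model with sharper, better calibrated probabilities yields smaller sets at the same coverage level.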

[53] Tooth-Diffusion: Guided 3D CBCT Synthesis with Fine-Grained Tooth Conditioning

Said Djafar Said,Torkan Gholamalizadeh,Mostafa Mehdipour Ghazi

Main category: cs.CV

TL;DR: This paper proposes a novel conditional diffusion framework for generating realistic 3D dental CBCT scans with precise control over individual teeth, enabling applications in dental diagnosis, treatment planning, and data augmentation.

Details

Motivation: The motivation is to address the challenge of generating anatomically realistic dental CBCT scans with fine-grained control, which is crucial for diagnosis and treatment planning in dentistry. Method: The paper introduces a 3D dental volume generation framework using wavelet-based denoising diffusion, FiLM conditioning, and masked loss functions, guided by tooth-level binary attributes for precise control over tooth presence and configuration. Result: The results demonstrate strong fidelity and generalization with low FID scores, robust inpainting performance, and SSIM values above 0.91 even on unseen scans, indicating high-quality 3D dental volume generation. Conclusion: This paper concludes that the proposed conditional diffusion framework enables realistic and localized modification of dental CBCT scans, offering benefits for surgical planning, patient communication, and data augmentation in dental AI workflows. Abstract: Despite the growing importance of dental CBCT scans for diagnosis and treatment planning, generating anatomically realistic scans with fine-grained control remains a challenge in medical image synthesis. In this work, we propose a novel conditional diffusion framework for 3D dental volume generation, guided by tooth-level binary attributes that allow precise control over tooth presence and configuration. Our approach integrates wavelet-based denoising diffusion, FiLM conditioning, and masked loss functions to focus learning on relevant anatomical structures. We evaluate the model across diverse tasks, such as tooth addition, removal, and full dentition synthesis, using both paired and distributional similarity metrics. Results show strong fidelity and generalization with low FID scores, robust inpainting performance, and SSIM values above 0.91 even on unseen scans. 
By enabling realistic, localized modification of dentition without rescanning, this work opens opportunities for surgical planning, patient communication, and targeted data augmentation in dental AI workflows. The codes are available at: https://github.com/djafar1/tooth-diffusion.
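
FiLM conditioning, named in the method above, modulates each feature channel with a scale and shift predicted from the conditioning input. The sketch below shows only that general mechanism. The linear mapping from binary tooth attributes to per-channel `gamma` and `beta`, the weight matrices, and the toy sizes are all hypothetical and are not taken from the paper or its repository.

```python
def condition_to_film(tooth_flags, w_gamma, w_beta):
    """Hypothetical linear map from binary tooth attributes to per-channel (gamma, beta)."""
    gammas = [1.0 + sum(w * t for w, t in zip(row, tooth_flags)) for row in w_gamma]
    betas = [sum(w * t for w, t in zip(row, tooth_flags)) for row in w_beta]
    return gammas, betas


def film(feature_maps, gammas, betas):
    """Apply y = gamma * x + beta independently to every channel."""
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(feature_maps, gammas, betas)
    ]


# Toy setup: 3 tooth attributes (1 = tooth present) and 2 feature channels.
tooth_flags = [1, 0, 1]
w_gamma = [[0.5, 0.0, 0.0], [0.0, 1.0, 0.0]]   # illustrative weights, not learned
w_beta = [[0.0, 0.0, 2.0], [0.0, 3.0, 0.0]]

gammas, betas = condition_to_film(tooth_flags, w_gamma, w_beta)
features = [[2.0, 4.0], [1.0, 3.0]]            # flattened activations per channel
modulated = film(features, gammas, betas)

print(gammas, betas, modulated)
```

In this toy case the second channel depends only on the absent tooth, so it passes through unchanged, while the first channel is rescaled and shifted by the two present teeth.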

[54] GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting

Elena Alegret Regalado,Kunyi Li,Sen Wang,Siyun Liang,Michael Niemeyer,Stefano Gasperini,Nassir Navab,Federico Tombari

Main category: cs.CV

TL;DR: GALA is a new framework for open-vocabulary 3D scene understanding that improves performance while reducing memory usage.

Details

Motivation: Existing methods struggle to capture fine-grained, language-aware 3D representations from 2D images. Method: GALA uses a cross-attention module with two learnable codebooks for encoding view-independent semantic embeddings, along with self-supervised contrastive learning to distill a scene-specific 3D instance feature field. Result: Extensive experiments on real-world datasets validate the effectiveness of GALA in open-vocabulary 3D scene understanding. Conclusion: GALA demonstrates remarkable open-vocabulary performance on both 2D and 3D tasks while reducing memory consumption. Abstract: 3D scene reconstruction and understanding have gained increasing popularity, yet existing methods still struggle to capture fine-grained, language-aware 3D representations from 2D images. In this paper, we present GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. To extend to generalized language feature fields, we introduce the core contribution of GALA, a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings. This design not only ensures intra-instance feature similarity but also supports seamless 2D and 3D open-vocabulary queries. It reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning. Extensive experiments on real-world datasets demonstrate GALA's remarkable open-vocabulary performance on both 2D and 3D.

[55] Multi-Rationale Explainable Object Recognition via Contrastive Conditional Inference

Ali Rasekh,Sepehr Kazemi Ranjbar,Simon Gottschalk

Main category: cs.CV

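TL;DR: This paper proposes a training-free contrastive conditional inference framework to address the limitations of CLIP-based explainable object recognition, and introduces a benchmark dataset with multiple ground-truth rationales per image.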

Details

Motivation: Existing methods rely on prompt-based conditioning, which is limited by CLIP's text encoder and provides only weak conditioning on explanatory structures. In addition, prior datasets are usually restricted to a single, possibly noisy, rationale and fail to capture the full diversity of image features. Method: The authors propose a contrastive conditional inference framework that explicitly models the probabilistic relationships among image embeddings, category labels, and rationales. Result: The approach achieves state-of-the-art results on the multi-rationale explainable object recognition benchmark, including strong zero-shot performance, and sets a new standard for both classification accuracy and rationale quality. Conclusion: Beyond the new framework, the paper establishes a more comprehensive standard for evaluating future explainable object recognition models. Abstract: Explainable object recognition using vision-language models such as CLIP involves predicting accurate category labels supported by rationales that justify the decision-making process. Existing methods typically rely on prompt-based conditioning, which suffers from limitations in CLIP's text encoder and provides weak conditioning on explanatory structures. Additionally, prior datasets are often restricted to single, and frequently noisy, rationales that fail to capture the full diversity of discriminative image features. In this work, we introduce a multi-rationale explainable object recognition benchmark comprising datasets in which each image is annotated with multiple ground-truth rationales, along with evaluation metrics designed to offer a more comprehensive representation of the task. To overcome the limitations of previous approaches, we propose a contrastive conditional inference (CCI) framework that explicitly models the probabilistic relationships among image embeddings, category labels, and rationales. Without requiring any training, our framework enables more effective conditioning on rationales to predict accurate object categories. Our approach achieves state-of-the-art results on the multi-rationale explainable object recognition benchmark, including strong zero-shot performance, and sets a new standard for both classification accuracy and rationale quality. Together with the benchmark, this work provides a more complete framework for evaluating future models in explainable object recognition. The code will be made available online.

[56] OccluNet: Spatio-Temporal Deep Learning for Occlusion Detection on DSA

Anushka A. Kore,Frank G. te Nijenhuis,Matthijs van der Sluijs,Wim van Zwam,Charles Majoie,Geert Lycklama à Nijeholt,Danny Ruijters,Frans Vos,Sandra Cornelissen,Ruisheng Su,Theo van Walsum

Main category: cs.CV

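TL;DR: OccluNet is a spatio-temporal deep learning model for automatically detecting vascular occlusions in DSA sequences, and it significantly outperforms baseline models.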

Details

Motivation: In acute ischemic stroke (AIS), accurate detection of vascular occlusions is critical for endovascular thrombectomy (EVT), but interpreting digital subtraction angiography (DSA) sequences is challenging because of anatomical complexity and time constraints. Method: The paper proposes OccluNet, a spatio-temporal deep learning model that combines YOLOX with transformer-based temporal attention mechanisms. Result: Evaluated on DSA images from the MR CLEAN Registry, OccluNet captures temporally consistent features and achieves precision and recall of 89.02% and 74.87%, respectively. Conclusion: OccluNet significantly outperforms the baseline models, and both attention variants reach similar performance. Abstract: Accurate detection of vascular occlusions during endovascular thrombectomy (EVT) is critical in acute ischemic stroke (AIS). Interpretation of digital subtraction angiography (DSA) sequences poses challenges due to anatomical complexity and time constraints. This work proposes OccluNet, a spatio-temporal deep learning model that integrates YOLOX, a single-stage object detector, with transformer-based temporal attention mechanisms to automate occlusion detection in DSA sequences. We compared OccluNet with a YOLOv11 baseline trained on either individual DSA frames or minimum intensity projections. Two spatio-temporal variants were explored for OccluNet: pure temporal attention and divided space-time attention. Evaluation on DSA images from the MR CLEAN Registry revealed the model's capability to capture temporally consistent features, achieving precision and recall of 89.02% and 74.87%, respectively. OccluNet significantly outperformed the baseline models, and both attention variants attained similar performance. Source code is available at https://github.com/anushka-kore/OccluNet.git

[57] Pixels to Play: A Foundation Model for 3D Gameplay

Yuguang Yue,Chris Green,Samuel Hunt,Irakli Salia,Wenzhe Shi,Jonathan J Hunt

Main category: cs.CV

TL;DR: Pixels2Play-0.1 (P2P0.1) is a transformer-based AI agent that learns to play diverse 3D games with human-like behavior using only pixel inputs, trained via behavior cloning with labeled and unlabeled data, aiming to enable applications like AI teammates and controllable NPCs.

Details

Motivation: Motivated by use cases such as AI teammates, controllable NPCs, personalized live-streamers, and assistive testers, the research aims to develop an agent that relies on the same pixel stream available to players and generalizes to new games. Method: P2P0.1 is trained end-to-end using behavior cloning, combining labeled human gameplay demonstrations with unlabeled public videos where actions are imputed via an inverse-dynamics model. The architecture employs a decoder-only transformer with auto-regressive action output, optimized for performance on a single consumer GPU. Result: The model demonstrates competent gameplay across simple Roblox and classic MS-DOS titles, with ablation studies showing the effectiveness of unlabeled data. The paper outlines steps needed to achieve expert-level, text-conditioned control. Conclusion: P2P0.1 is a foundation model capable of playing a wide range of 3D video games with human-like behavior, and it can generalize to new titles with minimal game-specific engineering. Abstract: We introduce Pixels2Play-0.1 (P2P0.1), a foundation model that learns to play a wide range of 3D video games with recognizable human-like behavior. Motivated by emerging consumer and developer use cases - AI teammates, controllable NPCs, personalized live-streamers, assistive testers - we argue that an agent must rely on the same pixel stream available to players and generalize to new titles with minimal game-specific engineering. P2P0.1 is trained end-to-end with behavior cloning: labeled demonstrations collected from instrumented human game-play are complemented by unlabeled public videos, to which we impute actions via an inverse-dynamics model. A decoder-only transformer with auto-regressive action output handles the large action space while remaining latency-friendly on a single consumer GPU. 
We report qualitative results showing competent play across simple Roblox and classic MS-DOS titles, ablations on unlabeled data, and outline the scaling and evaluation steps required to reach expert-level, text-conditioned control.
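
The data pipeline described in the abstract (labeled demonstrations complemented by unlabeled videos whose actions are imputed by an inverse-dynamics model) can be sketched roughly as follows. This is a toy illustration under strong assumptions, not the authors' implementation: a frame is reduced to a single integer position, the action vocabulary is three strings, and `toy_inverse_dynamics`, `impute_actions`, and `build_bc_dataset` are hypothetical names.

```python
from typing import Callable, List, Tuple

Frame = int    # toy stand-in for an image: a 1-D player position (assumption)
Action = str   # toy action vocabulary: "left", "right", "stay" (assumption)
Sample = Tuple[Frame, Action]


def toy_inverse_dynamics(prev: Frame, nxt: Frame) -> Action:
    """Hypothetical inverse-dynamics model: infer the action explaining prev -> nxt."""
    if nxt > prev:
        return "right"
    if nxt < prev:
        return "left"
    return "stay"


def impute_actions(video: List[Frame],
                   idm: Callable[[Frame, Frame], Action]) -> List[Sample]:
    """Label every frame except the last with the action the IDM infers."""
    return [(video[t], idm(video[t], video[t + 1])) for t in range(len(video) - 1)]


def build_bc_dataset(labeled: List[Sample],
                     unlabeled_videos: List[List[Frame]],
                     idm: Callable[[Frame, Frame], Action]) -> List[Sample]:
    """Merge human-labeled samples with IDM-labeled samples for behavior cloning."""
    dataset = list(labeled)
    for video in unlabeled_videos:
        dataset.extend(impute_actions(video, idm))
    return dataset


labeled_demos = [(0, "right"), (1, "right")]   # instrumented human play
public_videos = [[5, 4, 4, 6]]                 # unlabeled footage
dataset = build_bc_dataset(labeled_demos, public_videos, toy_inverse_dynamics)
print(dataset)
```

In a real system the inverse-dynamics model would itself be a learned network trained on the labeled demonstrations, and the merged dataset would feed a supervised policy loss.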

Guile Wu,David Huang,Dongfeng Bai,Bingbing Liu

Main category: cs.CV

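TL;DR: This paper proposes a unified multi-modal multi-view video generation framework for autonomous driving. Using a diffusion model with modal-shared and modal-specific components, it generates videos with high fidelity and controllability, and experiments show it outperforms current state-of-the-art methods.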

Details

Motivation: Existing video generation methods for autonomous driving focus mainly on RGB video and cannot generate multi-modal video, even though multi-modal data are crucial for holistic urban scene understanding. Method: The paper builds a unified diffusion transformer model composed of modal-shared and modal-specific components, and uses diverse conditioning inputs to encode controllable scene structure and content cues. Result: Experiments on the nuScenes dataset show that the method surpasses existing state-of-the-art methods in multi-modal multi-view urban scene video generation, with high fidelity and controllability. Conclusion: The paper proposes a new multi-modal multi-view video generation method for autonomous driving and demonstrates on nuScenes that it generates high-fidelity, controllable multi-modal multi-view urban scene videos. Abstract: Video generation has recently shown superiority in urban scene synthesis for autonomous driving. Existing video generation approaches to autonomous driving primarily focus on RGB video generation and lack the ability to support multi-modal video generation. However, multi-modal data, such as depth maps and semantic maps, are crucial for holistic urban scene understanding in autonomous driving. Although it is feasible to use multiple models to generate different modalities, this increases the difficulty of model deployment and does not leverage complementary cues for multi-modal data generation. To address this problem, in this work, we propose a novel multi-modal multi-view video generation approach to autonomous driving. Specifically, we construct a unified diffusion transformer model composed of modal-shared components and modal-specific components. Then, we leverage diverse conditioning inputs to encode controllable scene structure and content cues into the unified diffusion model for multi-modal multi-view video generation. In this way, our approach is capable of generating multi-modal multi-view driving scene videos in a unified framework. Our experiments on the challenging real-world autonomous driving dataset, nuScenes, show that our approach can generate multi-modal multi-view urban scene videos with high fidelity and controllability, surpassing the state-of-the-art methods.

[59] Inter-Class Relational Loss for Small Object Detection: A Case Study on License Plates

Dian Ning,Dong Seog Han

Main category: cs.CV

TL;DR: This paper proposes an inter-class relational loss to improve small object detection by leveraging spatial relationships between objects, achieving significant performance gains without additional tuning.

Details

Motivation: The motivation stems from the limitation of IoU-based losses in updating gradients for small objects due to flat gradients, which leads to poor learning performance for small objects during multi-object detection. Method: The method introduces an inter-class relational loss that leverages spatial relationships between objects (e.g., a car plate and a car) to guide learning. This loss adds punishment inversely proportional to the overlap between bounding boxes, enhancing gradient updates for small objects using information from larger ones. Result: The results show an improvement of 10.3% and 1.6% in mAP$^{\text{test}}_{50}$ for YOLOv12-T and UAV-DETR models, respectively, demonstrating the effectiveness of the proposed method in enhancing detection performance. Conclusion: The paper concludes that the proposed inter-class relational loss effectively improves small object detection performance in one-stage multi-object detection tasks, achieving significant gains in mAP without additional hyperparameter tuning. Abstract: In one-stage multi-object detection tasks, various intersection over union (IoU)-based solutions aim at smooth and stable convergence near the targets during training. However, IoU-based losses fail to correctly update the gradient of small objects due to an extremely flat gradient. During the update of multiple objects, the learning of small objects' gradients suffers more because of insufficient gradient updates. Therefore, we propose an inter-class relational loss to efficiently update the gradient of small objects while not sacrificing the learning efficiency of other objects based on the simple fact that an object has a spatial relationship to another object (e.g., a car plate is attached to a car in a similar position).
When the predicted car plate's bounding box is not within its car, a loss punishment is added to guide the learning, which is inversely proportional to the overlapped area of the car's and predicted car plate's bounding box. By leveraging the spatial relationship at the inter-class level, the loss guides small object predictions using larger objects and enhances latent information in deeper feature maps. In this paper, we present twofold contributions using license plate detection as a case study: (1) a new small vehicle multi-license plate dataset (SVMLP), featuring diverse real-world scenarios with high-quality annotations; and (2) a novel inter-class relational loss function designed to promote effective detection performance. We highlight the proposed ICR loss penalty can be easily added to existing IoU-based losses and enhance the performance. These contributions improve the standard mean Average Precision (mAP) metric, achieving gains of 10.3% and 1.6% in mAP$^{\text{test}}_{50}$ for YOLOv12-T and UAV-DETR, respectively, without any additional hyperparameter tuning. Code and dataset will be available soon.
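
The idea of penalizing a predicted plate box that falls outside its car can be illustrated with a small sketch. The exact form of the ICR loss is not given in this summary, so the code below substitutes a simple stand-in: a penalty equal to the fraction of the predicted plate area lying outside the car box, which shrinks as the overlap grows and vanishes when the plate is fully inside. The function names and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); assumed layout


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two axis-aligned boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)


def containment_penalty(plate: Box, car: Box) -> float:
    """Stand-in penalty: fraction of the plate box outside the car box (0 to 1)."""
    plate_area = area(plate)
    if plate_area == 0.0:
        return 1.0  # a degenerate prediction gets the maximum penalty
    return 1.0 - intersection_area(plate, car) / plate_area


def loss_with_penalty(iou_loss: float, plate: Box, car: Box) -> float:
    """Add the relational penalty on top of an existing IoU-based loss value."""
    return iou_loss + containment_penalty(plate, car)


car_box = (0.0, 0.0, 10.0, 10.0)
inside = containment_penalty((2.0, 2.0, 4.0, 3.0), car_box)      # plate within car
straddling = containment_penalty((9.0, 2.0, 11.0, 3.0), car_box)  # half outside
outside = containment_penalty((20.0, 20.0, 22.0, 21.0), car_box)  # no overlap
print(inside, straddling, outside)
```

Because the penalty is added to whatever IoU-based loss is already in use, it matches the abstract's description of a term that "can be easily added to existing IoU-based losses", though the real loss may be shaped differently.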

[60] HandCraft: Dynamic Sign Generation for Synthetic Data Augmentation

Gaston Gustavo Rios

Main category: cs.CV

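TL;DR: This paper proposes a lightweight sign generation model based on CMLPe and uses synthetic data pretraining to improve sign language recognition accuracy, achieving state-of-the-art results on multiple datasets.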

Details

Motivation: Sign language recognition (SLR) models face significant performance limitations due to insufficient training data, so methods that improve data availability are needed. Method: A lightweight sign generation model based on CMLPe, combined with a synthetic data pretraining approach. Result: The approach establishes new state-of-the-art results on the LSFB and DiSPLaY datasets and validates the effectiveness of synthetic data pretraining. Conclusion: Synthetic data pretraining offers complementary benefits when combined with traditional augmentation methods and significantly improves recognition accuracy across multiple datasets. Abstract: Sign Language Recognition (SLR) models face significant performance limitations due to insufficient training data availability. In this article, we address the challenge of limited data in SLR by introducing a novel and lightweight sign generation model based on CMLPe. This model, coupled with a synthetic data pretraining approach, consistently improves recognition accuracy, establishing new state-of-the-art results for the LSFB and DiSPLaY datasets using our Mamba-SL and Transformer-SL classifiers. Our findings reveal that synthetic data pretraining outperforms traditional augmentation methods in some cases and yields complementary benefits when implemented alongside them. Our approach democratizes sign generation and synthetic data pretraining for SLR by providing computationally efficient methods that achieve significant performance improvements across diverse datasets.

[61] Deep Learning for Taxol Exposure Analysis: A New Cell Image Dataset and Attention-Based Baseline Model

Sean Fletcher,Gabby Scott,Douglas Currie,Xin Zhang,Yuqi Song,Bruce MacLeod

Main category: cs.CV

TL;DR: This paper introduces a new microscopy image dataset of Taxol-treated C6 glioma cells and proposes the ResAttention-KNN model for automated Taxol concentration classification, aiming to support high-throughput and real-time biomedical analysis.

Details

Motivation: Monitoring the effects of Taxol at the cellular level is crucial for clinical evaluation and biomedical research, but current detection methods are expensive, labor-intensive, and unsuitable for high-throughput or real-time analysis. Deep learning approaches offer promise in automating this process. Method: The study introduces a new microscopy image dataset of C6 glioma cells treated with varying concentrations of Taxol and proposes a baseline model named ResAttention-KNN, combining ResNet-50 with Convolutional Block Attention Modules and using a k-Nearest Neighbors classifier for classification. Result: A new microscopy image dataset was created, and the proposed ResAttention-KNN model integrates attention-based refinement and non-parametric classification to enhance robustness and interpretability for Taxol concentration classification. Conclusion: Both the dataset and the ResAttention-KNN model are publicly released to support reproducibility and future research in vision-based biomedical analysis. Abstract: Monitoring the effects of the chemotherapeutic agent Taxol at the cellular level is critical for both clinical evaluation and biomedical research. However, existing detection methods require specialized equipment, skilled personnel, and extensive sample preparation, making them expensive, labor-intensive, and unsuitable for high-throughput or real-time analysis. Deep learning approaches have shown great promise in medical and biological image analysis, enabling automated, high-throughput assessment of cellular morphology. Yet, no publicly available dataset currently exists for automated morphological analysis of cellular responses to Taxol exposure. To address this gap, we introduce a new microscopy image dataset capturing C6 glioma cells treated with varying concentrations of Taxol. 
To provide an effective solution for Taxol concentration classification and establish a benchmark for future studies on this dataset, we propose a baseline model named ResAttention-KNN, which combines a ResNet-50 with Convolutional Block Attention Modules and uses a k-Nearest Neighbors classifier in the learned embedding space. This model integrates attention-based refinement and non-parametric classification to enhance robustness and interpretability. Both the dataset and implementation are publicly released to support reproducibility and facilitate future research in vision-based biomedical analysis.
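
The final stage of ResAttention-KNN, a k-Nearest Neighbors vote in the learned embedding space, can be sketched as below. This is a minimal illustration under stated assumptions rather than the released implementation: the 2-D vectors stand in for features from the ResNet-50 plus CBAM backbone, the class names `"low"` and `"high"` are placeholders for Taxol concentration classes, and `knn_predict` is a hypothetical helper.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Embedding = Sequence[float]


def euclidean(a: Embedding, b: Embedding) -> float:
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(query: Embedding,
                bank: List[Tuple[Embedding, str]],
                k: int = 3) -> str:
    """Return the majority label among the k stored embeddings nearest to query."""
    nearest = sorted(bank, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy embedding bank; a real one would hold backbone features of training images.
bank = [
    ((0.0, 0.0), "low"), ((0.1, 0.2), "low"), ((0.2, 0.1), "low"),
    ((1.0, 1.0), "high"), ((0.9, 1.1), "high"), ((1.1, 0.9), "high"),
]
pred_a = knn_predict((0.15, 0.10), bank)
pred_b = knn_predict((1.00, 0.95), bank)
print(pred_a, pred_b)
```

A non-parametric vote like this keeps each prediction traceable to specific training images, which is one plausible reading of the abstract's claim about interpretability.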

[62] Learning Point Cloud Representations with Pose Continuity for Depth-Based Category-Level 6D Object Pose Estimation

Zhujun Li,Shuo Zhang,Ioannis Stamos

Main category: cs.CV

TL;DR: HRC-Pose is a novel depth-only framework for category-level object pose estimation that uses contrastive learning to capture the continuity of 6D poses, decouples rotation and translation components, and outperforms state-of-the-art methods on benchmark datasets.

Details

Motivation: Existing approaches for category-level object pose estimation rely solely on 6D poses as supervisory signals without capturing the intrinsic continuity of poses, leading to prediction inconsistencies and reduced generalization to unseen poses. This work aims to address this limitation by explicitly modeling the continuity of 6D poses. Method: HRC-Pose uses contrastive learning to learn continuous point cloud representations while decoupling object pose into rotation and translation components. It introduces a multi-task, multi-category contrastive learning strategy based on a 6D pose-aware hierarchical ranking scheme and separately processes rotation-aware and translation-aware embeddings. Result: HRC-Pose successfully learns continuous feature spaces and demonstrates superior performance on the REAL275 and CAMERA25 benchmarks compared to existing depth-only methods. Conclusion: HRC-Pose is an effective and real-time depth-only framework that outperforms existing state-of-the-art methods in category-level object pose estimation, showing strong potential for real-world applications. Abstract: Category-level object pose estimation aims to predict the 6D pose and 3D size of objects within given categories. Existing approaches for this task rely solely on 6D poses as supervisory signals without explicitly capturing the intrinsic continuity of poses, leading to inconsistencies in predictions and reduced generalization to unseen poses. To address this limitation, we propose HRC-Pose, a novel depth-only framework for category-level object pose estimation, which leverages contrastive learning to learn point cloud representations that preserve the continuity of 6D poses. HRC-Pose decouples object pose into rotation and translation components, which are separately encoded and leveraged throughout the network. 
Specifically, we introduce a contrastive learning strategy for multi-task, multi-category scenarios based on our 6D pose-aware hierarchical ranking scheme, which contrasts point clouds from multiple categories by considering rotational and translational differences as well as categorical information. We further design pose estimation modules that separately process the learned rotation-aware and translation-aware embeddings. Our experiments demonstrate that HRC-Pose successfully learns continuous feature spaces. Results on REAL275 and CAMERA25 benchmarks show that our method consistently outperforms existing depth-only state-of-the-art methods and runs in real-time, demonstrating its effectiveness and potential for real-world applications. Our code is at https://github.com/zhujunli1993/HRC-Pose.

[63] Taming Transformer for Emotion-Controllable Talking Face Generation

Ziqi Zhang,Cheng Deng

Main category: cs.CV

TL;DR: The paper proposes a new method for emotion-controllable talking face generation that effectively models audio-visual relationships and synthesizes identity preserving emotional videos, outperforming existing methods on the MEAD dataset.

Details

Motivation: The motivation is to overcome challenges in emotion-controllable talking face generation, particularly in modeling multimodal relationships related to specific emotions and synthesizing identity preserving emotional videos. Method: The method involves pre-training strategies to disentangle audio and quantize videos, the use of emotion-anchor representation, and an autoregressive transformer for modeling and predicting visual tokens. Result: The result is a novel method that effectively tackles the emotion-controllable talking face generation task, showing superiority both qualitatively and quantitatively on the MEAD dataset. Conclusion: The paper concludes that their proposed method for emotion-controllable talking face generation outperforms existing methods, as demonstrated by extensive experiments on the MEAD dataset. Abstract: Talking face generation is a novel and challenging generation task, aiming at synthesizing a vivid speaking-face video given a specific audio. To fulfill emotion-controllable talking face generation, current methods need to overcome two challenges: One is how to effectively model the multimodal relationship related to the specific emotion, and the other is how to leverage this relationship to synthesize identity preserving emotional videos. In this paper, we propose a novel method to tackle the emotion-controllable talking face generation task discretely. Specifically, we employ two pre-training strategies to disentangle audio into independent components and quantize videos into combinations of visual tokens. Subsequently, we propose the emotion-anchor (EA) representation that integrates the emotional information into visual tokens. Finally, we introduce an autoregressive transformer to model the global distribution of the visual tokens under the given conditions and further predict the index sequence for synthesizing the manipulated videos. 
We conduct experiments on the MEAD dataset that controls the emotion of videos conditioned on multiple emotional audios. Extensive experiments demonstrate the superiorities of our method both qualitatively and quantitatively.

[64] FastTracker: Real-Time and Accurate Visual Tracking

Hamidreza Hashempoor,Yu Dong Hwang

Main category: cs.CV

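TL;DR: This paper introduces a generalized multi-object tracking framework with particular emphasis on vehicle tracking in complex traffic scenes. The framework includes an occlusion-aware re-identification mechanism and a road-structure-aware tracklet refinement strategy, and experiments show robust performance on multiple benchmarks.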

Details

Motivation: Conventional multi-object tracking systems are designed mainly for pedestrian tracking and generalize poorly to other object categories. This work aims to develop a generalized tracking framework that handles multiple object types, particularly vehicle tracking in complex traffic scenes. Method: The method has two key components: (1) an occlusion-aware re-identification mechanism that improves identity preservation for heavily occluded objects, and (2) a road-structure-aware tracklet refinement strategy that uses semantic scene priors (such as lane directions, crosswalks, and road boundaries) to improve trajectory continuity and accuracy. Result: Experiments show that the proposed method performs robustly on both the newly introduced dataset and several public benchmarks, with HOTA scores of 66.4 and 65.7 on the MOT17 and MOT20 test sets, respectively. Conclusion: The paper proposes a generalized multi-object tracking framework that performs well on vehicle tracking and also achieves good results on conventional benchmarks, demonstrating its effectiveness for general-purpose object tracking. Abstract: Conventional multi-object tracking (MOT) systems are predominantly designed for pedestrian tracking and often exhibit limited generalization to other object categories. This paper presents a generalized tracking framework capable of handling multiple object types, with a particular emphasis on vehicle tracking in complex traffic scenes. The proposed method incorporates two key components: (1) an occlusion-aware re-identification mechanism that enhances identity preservation for heavily occluded objects, and (2) a road-structure-aware tracklet refinement strategy that utilizes semantic scene priors such as lane directions, crosswalks, and road boundaries to improve trajectory continuity and accuracy. In addition, we introduce a new benchmark dataset comprising diverse vehicle classes with frame-level tracking annotations, specifically curated to support evaluation of vehicle-focused tracking methods. Extensive experimental results demonstrate that the proposed approach achieves robust performance on both the newly introduced dataset and several public benchmarks, highlighting its effectiveness in general-purpose object tracking. While our framework is designed for generalized multi-class tracking, it also achieves strong performance on conventional benchmarks, with HOTA scores of 66.4 on MOT17 and 65.7 on MOT20 test sets. Code and Benchmark are available: github.com/Hamidreza-Hashempoor/FastTracker, huggingface.co/datasets/Hamidreza-Hashemp/FastTracker-Benchmark.

[65] TCFNet: Bidirectional face-bone transformation via a Transformer-based coarse-to-fine point movement network

Runshi Zhang,Bimeng Jie,Yang He,Junchen Wang

Main category: cs.CV

TL;DR: This paper proposes a novel Transformer-based coarse-to-fine point movement network (TCFNet) for accurate face-bone shape transformation simulation, outperforming traditional and deep learning-based methods.

Details

Motivation: Traditional biomechanical simulation methods are limited by computational time, labor-intensive data processing, and low accuracy. Deep learning-based methods have limitations in processing large-scale points and require complex preprocessing and postprocessing. Method: Transformer-based coarse-to-fine point movement network (TCFNet) with local information aggregation network (LIA-Net) and auxiliary loss. Result: TCFNet achieves outstanding evaluation metrics and visualization results compared to existing state-of-the-art methods. Conclusion: The proposed TCFNet is effective in accurately simulating face-bone shape transformations and outperforms existing methods. Abstract: Computer-aided surgical simulation is a critical component of orthognathic surgical planning, where accurately simulating face-bone shape transformations is significant. The traditional biomechanical simulation methods are limited by their computational time consumption levels, labor-intensive data processing strategies and low accuracy. Recently, deep learning-based simulation methods have been proposed to view this problem as a point-to-point transformation between skeletal and facial point clouds. However, these approaches cannot process large-scale points, have limited receptive fields that lead to noisy points, and employ complex preprocessing and postprocessing operations based on registration. These shortcomings limit the performance and widespread applicability of such methods. Therefore, we propose a Transformer-based coarse-to-fine point movement network (TCFNet) to learn unique, complicated correspondences at the patch and point levels for dense face-bone point cloud transformations. This end-to-end framework adopts a Transformer-based network and a local information aggregation network (LIA-Net) in the first and second stages, respectively, which reinforce each other to generate precise point movement paths. 
LIA-Net can effectively compensate for the neighborhood precision loss of the Transformer-based network by modeling local geometric structures (edges, orientations and relative position features). The previous global features are employed to guide the local displacement using a gated recurrent unit. Inspired by deformable medical image registration, we propose an auxiliary loss that can utilize expert knowledge for reconstructing critical organs. Compared with the existing state-of-the-art (SOTA) methods on gathered datasets, TCFNet achieves outstanding evaluation metrics and visualization results. The code is available at https://github.com/Runshi-Zhang/TCFNet.

[66] QuadINR: Hardware-Efficient Implicit Neural Representations Through Quadratic Activation

Wenyong Zhou,Boyu Li,Jiachen Ren,Taiqiang Wu,Zhilin Ai,Zhengwu Liu,Ngai Wong

Main category: cs.CV

TL;DR: QuadINR is a hardware-efficient method for neural representations that improves performance while significantly reducing power and resource consumption.

Details

Motivation: To address spectral bias in INRs without incurring significant hardware overhead. Method: Utilizes piecewise quadratic activation functions and an N-stage pipeline framework for efficient hardware implementation. Result: Achieved up to 2.06dB PSNR improvement, reduced area to 1914μm², and lowered dynamic power to 6.14mW, with up to 97% reduction in resource/power consumption and 93% improvement in latency. Conclusion: QuadINR provides a hardware-efficient solution for Implicit Neural Representations, significantly reducing resource and power consumption while improving performance. Abstract: Implicit Neural Representations (INRs) encode discrete signals continuously while addressing spectral bias through activation functions (AFs). Previous approaches mitigate this bias by employing complex AFs, which often incur significant hardware overhead. To tackle this challenge, we introduce QuadINR, a hardware-efficient INR that utilizes piecewise quadratic AFs to achieve superior performance with dramatic reductions in hardware consumption. The quadratic functions encompass rich harmonic content in their Fourier series, delivering enhanced expressivity for high-frequency signals, as verified through Neural Tangent Kernel (NTK) analysis. We develop a unified $N$-stage pipeline framework that facilitates efficient hardware implementation of various AFs in INRs. We demonstrate FPGA implementations on the VCU128 platform and an ASIC implementation in a 28nm process. Experiments across images and videos show that QuadINR achieves up to 2.06dB PSNR improvement over prior work, with an area of only 1914$\mu$m$^2$ and a dynamic power of 6.14mW, reducing resource and power consumption by up to 97% and improving latency by up to 93% vs existing baselines.
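
For readers unfamiliar with the PSNR figure quoted above, the standard definition is PSNR = 10 log10(MAX^2 / MSE), so at a fixed peak value a gain of 2.06 dB corresponds to a mean squared error lower by a factor of about 1.6. The sketch below computes PSNR for flat lists of pixel values; it is a generic illustration, not code from the paper, and the 8-bit peak value of 255 is an assumption.

```python
import math
from typing import Sequence


def mse(reference: Sequence[float], reconstructed: Sequence[float]) -> float:
    """Mean squared error between two equally sized signals."""
    if len(reference) != len(reconstructed) or not reference:
        raise ValueError("signals must be non-empty and of equal length")
    return sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)


def psnr(reference: Sequence[float], reconstructed: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for a perfect reconstruction."""
    err = mse(reference, reconstructed)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


reference = [10.0, 20.0, 30.0, 40.0]
coarse = [12.0, 22.0, 32.0, 42.0]   # every pixel off by 2
fine = [11.0, 21.0, 31.0, 41.0]     # every pixel off by 1
gain_db = psnr(reference, fine) - psnr(reference, coarse)
print(round(gain_db, 2))
```

Halving the per-pixel error quarters the MSE, which is why the toy example shows a gain of roughly 6 dB.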

[67] Img2ST-Net: Efficient High-Resolution Spatial Omics Prediction from Whole Slide Histology Images via Fully Convolutional Image-to-Image Learning

Junchao Zhu,Ruining Deng,Junlin Guo,Tianyuan Yao,Juming Xiong,Chongyu Qu,Mengmeng Yin,Yu Wang,Shilin Zhao,Haichun Yang,Daguang Xu,Yucheng Tang,Yuankai Huo

Main category: cs.CV

TL;DR: Img2ST-Net is a novel framework that efficiently generates high-resolution spatial transcriptomics (ST) data from histology images using a fully convolutional architecture, addressing computational challenges and preserving spatial organization.

Details

Motivation: The increasing resolution of spatial transcriptomics (ST) data introduces significant computational and modeling challenges, especially due to inefficiency and instability of conventional spot-by-spot sequential regression frameworks at high resolutions. High-resolution ST also presents issues with extreme sparsity and low expression levels, complicating prediction and evaluation. Method: Img2ST-Net uses a fully convolutional architecture to generate dense, HD gene expression maps in a parallelized manner by modeling HD ST data as super-pixel representations. Additionally, SSIM-ST, a structural-similarity-based evaluation metric, was introduced to enhance robustness under sparse expression patterns. Result: Img2ST-Net improves computational efficiency while better preserving the spatial organization intrinsic to spatial omics data, offering efficient and parallel high-resolution ST prediction with enhanced robustness under sparse expression patterns. Conclusion: Img2ST-Net provides a scalable and robust framework for high-resolution ST prediction, laying the groundwork for next-generation ST modeling that is efficient, accurate, and resolution-aware. Abstract: Recent advances in multi-modal AI have demonstrated promising potential for generating the currently expensive spatial transcriptomics (ST) data directly from routine histology images, offering a means to reduce the high cost and time-intensive nature of ST data acquisition. However, the increasing resolution of ST, particularly with platforms such as Visium HD achieving 8um or finer, introduces significant computational and modeling challenges. Conventional spot-by-spot sequential regression frameworks become inefficient and unstable at this scale, while the inherent extreme sparsity and low expression levels of high-resolution ST further complicate both prediction and evaluation. 
To address these limitations, we propose Img2ST-Net, a novel histology-to-ST generation framework for efficient and parallel high-resolution ST prediction. Unlike conventional spot-by-spot inference methods, Img2ST-Net employs a fully convolutional architecture to generate dense, HD gene expression maps in a parallelized manner. By modeling HD ST data as super-pixel representations, the task is reformulated from image-to-omics inference into a super-content image generation problem with hundreds or thousands of output channels. This design not only improves computational efficiency but also better preserves the spatial organization intrinsic to spatial omics data. To enhance robustness under sparse expression patterns, we further introduce SSIM-ST, a structural-similarity-based evaluation metric tailored for high-resolution ST analysis. We present a scalable, biologically coherent framework for high-resolution ST prediction. Img2ST-Net offers a principled solution for efficient and accurate ST inference at scale. Our contributions lay the groundwork for next-generation ST modeling that is robust and resolution-aware. The source code has been made publicly available at https://github.com/hrlblab/Img2ST-Net.

[68] CTA-Flux: Integrating Chinese Cultural Semantics into High-Quality English Text-to-Image Communities

Yue Gong,Shanyuan Liu,Liuzhuozheng Li,Jian Zhu,Bo Cheng,Liebucha Wu,Xiaoyu Wu,Yuhang Ma,Dawei Leng,Yuhui Yin

Main category: cs.CV

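TL;DR: CTA-Flux is a new Chinese text adaptation method that uses a MultiModal Diffusion Transformer (MMDiT) to directly control Flux, an English text-to-image model, effectively improving Chinese semantic understanding and image generation quality while remaining compatible with existing plugins.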

Details

Motivation: Flux performs well on English text inputs but poorly on non-English prompts, particularly because of linguistic and cultural biases arising from English-centric training datasets. Existing methods do not adequately handle Chinese semantics, so a new method is needed to improve the model's understanding of Chinese semantics and its image generation quality. Method: CTA-Flux uses a MultiModal Diffusion Transformer (MMDiT) to control the Flux model directly, reducing the number of parameters and enhancing the understanding of Chinese semantics. Result: CTA-Flux supports both Chinese and English prompts while achieving superior image generation quality, visual realism, and faithful depiction of Chinese semantics. Conclusion: By directly controlling the English-based TTI model Flux, CTA-Flux effectively improves the model's understanding of Chinese semantics while maintaining compatibility with existing TTI plugins. Abstract: We proposed the Chinese Text Adapter-Flux (CTA-Flux). An adaptation method fits the Chinese text inputs to Flux, a powerful text-to-image (TTI) generative model initially trained on the English corpus. Despite the notable image generation ability conditioned on English text inputs, Flux performs poorly when processing non-English prompts, particularly due to linguistic and cultural biases inherent in predominantly English-centric training datasets. Existing approaches, such as translating non-English prompts into English or finetuning models for bilingual mappings, inadequately address culturally specific semantics, compromising image authenticity and quality. To address this issue, we introduce a novel method to bridge Chinese semantic understanding with compatibility in English-centric TTI model communities. Existing approaches relying on ControlNet-like architectures typically require a massive parameter scale and lack direct control over Chinese semantics. In comparison, CTA-flux leverages MultiModal Diffusion Transformer (MMDiT) to control the Flux backbone directly, significantly reducing the number of parameters while enhancing the model's understanding of Chinese semantics. This integration significantly improves the generation quality and cultural authenticity without extensive retraining of the entire model, thus maintaining compatibility with existing text-to-image plugins such as LoRA, IP-Adapter, and ControlNet.
Empirical evaluations demonstrate that CTA-Flux supports Chinese and English prompts and achieves superior image generation quality, visual realism, and faithful depiction of Chinese semantics.

[69] MoCHA-former: Moiré-Conditioned Hybrid Adaptive Transformer for Video Demoiréing

Jeahun Sung,Changhyun Roh,Chanho Eom,Jihyong Oh

Main category: cs.CV

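TL;DR: This paper proposes MoCHA-former, which uses its DMAD and STAD components to remove moiré patterns from captured video and performs well across multiple metrics.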

Details

Motivation: Frequency aliasing between the camera's CFA and the display's sub-pixels produces moiré patterns that severely degrade screen-captured video, and existing methods have several limitations in handling them. Method: A Transformer-based model, MoCHA-former, with two key components, DMAD and STAD, which respectively decouple moiré from content and perform spatio-temporally adaptive demoiréing. Result: MoCHA-former outperforms existing methods on PSNR, SSIM, and LPIPS, and ensures temporal consistency without an explicit alignment module. Conclusion: MoCHA-former effectively removes moiré patterns from captured video and surpasses existing methods across multiple datasets and evaluation metrics. Abstract: Recent advances in portable imaging have made camera-based screen capture ubiquitous. Unfortunately, frequency aliasing between the camera's color filter array (CFA) and the display's sub-pixels induces moiré patterns that severely degrade captured photos and videos. Although various demoiréing models have been proposed to remove such moiré patterns, these approaches still suffer from several limitations: (i) spatially varying artifact strength within a frame, (ii) large-scale and globally spreading structures, (iii) channel-dependent statistics and (iv) rapid temporal fluctuations across frames. We address these issues with the Moiré Conditioned Hybrid Adaptive Transformer (MoCHA-former), which comprises two key components: Decoupled Moiré Adaptive Demoiréing (DMAD) and Spatio-Temporal Adaptive Demoiréing (STAD). DMAD separates moiré and content via a Moiré Decoupling Block (MDB) and a Detail Decoupling Block (DDB), then produces moiré-adaptive features using a Moiré Conditioning Block (MCB) for targeted restoration. STAD introduces a Spatial Fusion Block (SFB) with window attention to capture large-scale structures, and a Feature Channel Attention (FCA) to model channel dependence in RAW frames. To ensure temporal consistency, MoCHA-former performs implicit frame alignment without any explicit alignment module. We analyze moiré characteristics through qualitative and quantitative studies, and evaluate on two video datasets covering RAW and sRGB domains. MoCHA-former consistently surpasses prior methods across PSNR, SSIM, and LPIPS.
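
For readers unfamiliar with the first of the reported metrics, PSNR can be sketched as below. This is a minimal illustration of the standard definition, not the paper's evaluation code; the function name `psnr`, the flat-list image representation, and the default 8-bit peak value of 255 are assumptions made for the example.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are given as flat sequences of pixel intensities (an assumption
    for this sketch); max_value is the largest possible intensity.
    """
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate a restored frame closer to the reference; SSIM and LPIPS measure structural and perceptual similarity and are not covered by this sketch.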

[70] HyperDiff: Hypergraph Guided Diffusion Model for 3D Human Pose Estimation

Bing Han,Yuhua Huang,Pan Gao

Main category: cs.CV

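TL;DR: HyperDiff combines a diffusion model with HyperGCN to address depth ambiguity and occlusion in monocular 3D human pose estimation, improving the accuracy and adaptability of pose estimation.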

Details

Motivation: Monocular 3D human pose estimation often runs into depth ambiguity and occlusion during 2D-to-3D lifting, and traditional methods may overlook multi-scale features in the skeleton structure, which hurts accuracy. Method: HyperDiff integrates a diffusion model with HyperGCN. The diffusion model captures data uncertainty, alleviating depth ambiguity and occlusion; HyperGCN serves as the denoiser and uses multi-granularity structures to model high-order correlations between joints. Result: Experiments show that HyperDiff achieves state-of-the-art performance on the Human3.6M and MPI-INF-3DHP datasets, with stronger denoising capability on complex poses in particular. Conclusion: HyperDiff is a new 3D human pose estimation method that combines a diffusion model with HyperGCN to address depth ambiguity and occlusion, and it can flexibly adapt to different computational budgets to balance performance and efficiency. Abstract: Monocular 3D human pose estimation (HPE) often encounters challenges such as depth ambiguity and occlusion during the 2D-to-3D lifting process. Additionally, traditional methods may overlook multi-scale skeleton features when utilizing skeleton structure information, which can negatively impact the accuracy of pose estimation. To address these challenges, this paper introduces a novel 3D pose estimation method, HyperDiff, which integrates diffusion models with HyperGCN. The diffusion model effectively captures data uncertainty, alleviating depth ambiguity and occlusion. Meanwhile, HyperGCN, serving as a denoiser, employs multi-granularity structures to accurately model high-order correlations between joints. This improves the model's denoising capability especially for complex poses. Experimental results demonstrate that HyperDiff achieves state-of-the-art performance on the Human3.6M and MPI-INF-3DHP datasets and can flexibly adapt to varying computational resources to balance performance and efficiency.

[71] FOCUS: Frequency-Optimized Conditioning of DiffUSion Models for mitigating catastrophic forgetting during Test-Time Adaptation

Gabriel Tjio,Jie Zhang,Xulei Yang,Yun Xing,Nhat Chung,Xiaofeng Cao,Ivor W. Tsang,Chee Keong Kwoh,Qing Guo

Main category: cs.CV

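TL;DR: This paper proposes FOCUS, a frequency-based conditioning approach that separates the high- and low-frequency information of images to adapt models at test time while preventing the forgetting of task-relevant knowledge.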

Details

Motivation: Test-time adaptation must balance preserving knowledge against adapting to domain shifts, and conventional methods can cause task-relevant knowledge to be forgotten. Method: A new frequency-mixing data augmentation method, FrequencyMix, is proposed, and Y-FPN is used to separate the high- and low-frequency information of images. Result: The approach is shown to be effective for semantic segmentation and monocular depth estimation across 15 corruption types and three datasets, achieving state-of-the-art averaged performance. Conclusion: FOCUS improves standalone performance and also complements existing model adaptation methods, mitigating catastrophic forgetting even under limited supervision. Abstract: Test-time adaptation enables models to adapt to evolving domains. However, balancing the tradeoff between preserving knowledge and adapting to domain shifts remains challenging for model adaptation methods, since adapting to domain shifts can induce forgetting of task-relevant knowledge. To address this problem, we propose FOCUS, a novel frequency-based conditioning approach within a diffusion-driven input-adaptation framework. Utilising learned, spatially adaptive frequency priors, our approach conditions the reverse steps during diffusion-driven denoising to preserve task-relevant semantic information for dense prediction. FOCUS leverages a trained, lightweight, Y-shaped Frequency Prediction Network (Y-FPN) that disentangles high and low frequency information from noisy images. This minimizes the computational costs involved in implementing our approach in a diffusion-driven framework. We train Y-FPN with FrequencyMix, a novel data augmentation method that perturbs the images across diverse frequency bands, which improves the robustness of our approach to diverse corruptions. We demonstrate the effectiveness of FOCUS for semantic segmentation and monocular depth estimation across 15 corruption types and three datasets, achieving state-of-the-art averaged performance. In addition to improving standalone performance, FOCUS complements existing model adaptation methods since we can derive pseudo labels from FOCUS-denoised images for additional supervision. Even under limited, intermittent supervision with the pseudo labels derived from the FOCUS denoised images, we show that FOCUS mitigates catastrophic forgetting for recent model adaptation methods.
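
To make the idea of separating high- and low-frequency content concrete, here is a hand-written stand-in: a moving-average low-pass filter on a 1D signal, with the high-frequency part taken as the residual. It is explicitly not Y-FPN, which the abstract describes as a trained network operating on noisy images; the function name `split_frequencies` and the `radius` parameter are hypothetical.

```python
def split_frequencies(signal, radius=1):
    """Split a 1D signal into low- and high-frequency parts.

    low  = moving average over a window of up to 2*radius+1 samples
           (the window is truncated at the borders)
    high = signal - low, so low[i] + high[i] reconstructs signal[i]
    """
    n = len(signal)
    low = []
    for i in range(n):
        start = max(0, i - radius)
        stop = min(n, i + radius + 1)
        window = signal[start:stop]
        low.append(sum(window) / len(window))
    high = [s - l for s, l in zip(signal, low)]
    return low, high
```

A constant signal yields an all-zero high-frequency part, while a sharp spike ends up mostly in the high-frequency residual. A learned predictor could replace the fixed filter when the split has to be spatially adaptive.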

[72] MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion

Fei Peng,Junqiang Wu,Yan Li,Tingting Gao,Di Zhang,Huiyuan Fu

Main category: cs.CV

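TL;DR: MUSE is a unified synthesis framework that uses a concatenated cross-attention mechanism to combine layout control with textual guidance, addressing spatial control and identity preservation in multi-subject synthesis.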

Details

Motivation: Existing text-to-image diffusion models struggle with spatial control and identity preservation in multi-subject compositional synthesis. Method: The MUSE framework uses a concatenated cross-attention (CCA) mechanism to combine layout specifications with textual guidance, and adopts a progressive two-stage training strategy. Result: On layout-controllable multi-subject synthesis, MUSE shows better spatial accuracy and identity consistency than existing methods. Conclusion: MUSE effectively addresses the challenges of layout control and subject synthesis, achieving zero-shot end-to-end generation with improved spatial accuracy and identity consistency. Abstract: Existing text-to-image diffusion models have demonstrated remarkable capabilities in generating high-quality images guided by textual prompts. However, achieving multi-subject compositional synthesis with precise spatial control remains a significant challenge. In this work, we address the task of layout-controllable multi-subject synthesis (LMS), which requires both faithful reconstruction of reference subjects and their accurate placement in specified regions within a unified image. While recent advancements have separately improved layout control and subject synthesis, existing approaches struggle to simultaneously satisfy the dual requirements of spatial precision and identity preservation in this composite task. To bridge this gap, we propose MUSE, a unified synthesis framework that employs concatenated cross-attention (CCA) to seamlessly integrate layout specifications with textual guidance through explicit semantic space expansion. The proposed CCA mechanism enables bidirectional modality alignment between spatial constraints and textual descriptions without interference. Furthermore, we design a progressive two-stage training strategy that decomposes the LMS task into learnable sub-objectives for effective optimization. Extensive experiments demonstrate that MUSE achieves zero-shot end-to-end generation with superior spatial accuracy and identity consistency compared to existing solutions, advancing the frontier of controllable image synthesis. Our code and model are available at https://github.com/pf0607/MUSE.

[73] Reconstruction Using the Invisible: Intuition from NIR and Metadata for Enhanced 3D Gaussian Splatting

Gyusam Chang,Tuan-Anh Vu,Vivek Alumootil,Harris Song,Deanna Pham,Sangpil Kim,M. Khalid Jawed

Main category: cs.CV

TL;DR: This paper introduces the NIRPlant dataset and the NIRSplat method for 3D reconstruction in agricultural scenes.

Details

Motivation: Uneven illumination, occlusions, and a limited field of view make 3D reconstruction in agricultural scenes uniquely challenging, and the application of 3D Gaussian Splatting (3DGS) in this domain remains underexplored. Method: The paper introduces the NIRPlant dataset, which contains Near-Infrared (NIR) imagery, RGB imagery, textual metadata, depth, and LiDAR data, and proposes NIRSplat, a multimodal Gaussian splatting method that uses a cross-attention mechanism and 3D point-based positional encoding. Result: Experiments show that NIRSplat outperforms existing methods such as 3DGS, CoR-GS, and InstantSplat in challenging agricultural scenes. Conclusion: NIRSplat performs well on 3D reconstruction of agricultural scenes by combining multimodal data with a new architecture design. Abstract: While 3D Gaussian Splatting (3DGS) has rapidly advanced, its application in agriculture remains underexplored. Agricultural scenes present unique challenges for 3D reconstruction methods, particularly due to uneven illumination, occlusions, and a limited field of view. To address these limitations, we introduce NIRPlant, a novel multimodal dataset encompassing Near-Infrared (NIR) imagery, RGB imagery, textual metadata, Depth, and LiDAR data collected under varied indoor and outdoor lighting conditions. By integrating NIR data, our approach enhances robustness and provides crucial botanical insights that extend beyond the visible spectrum. Additionally, we leverage text-based metadata derived from vegetation indices, such as NDVI, NDWI, and the chlorophyll index, which significantly enriches the contextual understanding of complex agricultural environments. To fully exploit these modalities, we propose NIRSplat, an effective multimodal Gaussian splatting architecture employing a cross-attention mechanism combined with 3D point-based positional encoding, providing robust geometric priors. Comprehensive experiments demonstrate that NIRSplat outperforms existing landmark methods, including 3DGS, CoR-GS, and InstantSplat, highlighting its effectiveness in challenging agricultural scenarios. The code and dataset are publicly available at: https://github.com/StructuresComp/3D-Reconstruction-NIR

[74] Generalizable Engagement Estimation in Conversation via Domain Prompting and Parallel Attention

Yangche Yu,Yin Chen,Jia Li,Peng Jia,Yu Zhang,Li Dai,Zhenzhen Hu,Meng Wang,Richang Hong

Main category: cs.CV

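TL;DR: This paper proposes DAPA, a framework that introduces a Domain Prompting mechanism and a Parallel Cross-Attention module to reach state-of-the-art results on cross-cultural and cross-linguistic benchmarks, improving the accuracy of engagement estimation and the robustness of deployment.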

Details

Motivation: Adaptive human-computer interaction systems need accurate engagement estimation, but robust deployment is held back by poor generalization across domains and the difficulty of modeling complex interaction dynamics. Method: The DAPA (Domain-Adaptive Parallel Attention) framework introduces a Domain Prompting mechanism and a Parallel Cross-Attention module to achieve domain adaptation and capture interactional synchrony. Result: DAPA reaches state-of-the-art performance on several cross-cultural and cross-linguistic benchmarks, with an absolute improvement of 0.45 in Concordance Correlation Coefficient (CCC) on the NoXi-J test set. Conclusion: DAPA sets a new state of the art on cross-cultural and cross-linguistic benchmarks and won first place in the Multi-Domain Engagement Estimation Challenge at MultiMediate'25, demonstrating its strengths in modeling complex interaction dynamics and in robust deployment. Abstract: Accurate engagement estimation is essential for adaptive human-computer interaction systems, yet robust deployment is hindered by poor generalizability across diverse domains and challenges in modeling complex interaction dynamics. To tackle these issues, we propose DAPA (Domain-Adaptive Parallel Attention), a novel framework for generalizable conversational engagement modeling. DAPA introduces a Domain Prompting mechanism by prepending learnable domain-specific vectors to the input, explicitly conditioning the model on the data's origin to facilitate domain-aware adaptation while preserving generalizable engagement representations. To capture interactional synchrony, the framework also incorporates a Parallel Cross-Attention module that explicitly aligns reactive (forward BiLSTM) and anticipatory (backward BiLSTM) states between participants. Extensive experiments demonstrate that DAPA establishes a new state-of-the-art performance on several cross-cultural and cross-linguistic benchmarks, notably achieving an absolute improvement of 0.45 in Concordance Correlation Coefficient (CCC) over a strong baseline on the NoXi-J test set. The superiority of our method was also confirmed by winning the first place in the Multi-Domain Engagement Estimation Challenge at MultiMediate'25.
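
The headline number above is reported in Concordance Correlation Coefficient (CCC). As a reminder of what that metric measures, the sketch below implements the standard definition, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), using population statistics. The function name `concordance_cc` is an assumption for this example, and nothing here is taken from the DAPA codebase.

```python
def concordance_cc(pred, target):
    """Concordance Correlation Coefficient between two equal-length sequences.

    Unlike Pearson correlation, CCC also penalizes differences in mean and
    scale, so it equals 1 only when predictions match targets exactly.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2.0 * cov / denom
```

Because CCC ranges from -1 to 1, an absolute gain of 0.45 is a large change on this scale.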

[75] D^3-Talker: Dual-Branch Decoupled Deformation Fields for Few-Shot 3D Talking Head Synthesis

Yuhang Guo,Kaijun Deng,Siyang Song,Jindong Xie,Wenhui Ma,Linlin Shen

Main category: cs.CV

TL;DR: This paper proposes D^3-Talker, a new method that addresses key challenges in 3D talking head synthesis.

Details

Motivation: 3D talking head synthesis requires a long talking head video to train a new model from scratch for each target identity. Method: The approach constructs a static 3D Gaussian attribute field and uses audio and facial motion signals to control two distinct Gaussian attribute deformation fields. Result: Experiments show that D^3-Talker outperforms prior work in both high-fidelity rendering and accurate audio-lip synchronization. Conclusion: D^3-Talker effectively addresses 3D talking head synthesis with limited training data. Abstract: A key challenge in 3D talking head synthesis lies in the reliance on a long-duration talking head video to train a new model for each target identity from scratch. Recent methods have attempted to address this issue by extracting general features from audio through pre-training models. However, since audio contains information irrelevant to lip motion, existing approaches typically struggle to map the given audio to realistic lip behaviors in the target face when trained on only a few frames, causing poor lip synchronization and talking head image quality. This paper proposes D^3-Talker, a novel approach that constructs a static 3D Gaussian attribute field and employs audio and Facial Motion signals to independently control two distinct Gaussian attribute deformation fields, effectively decoupling the predictions of general and personalized deformations. We design a novel similarity contrastive loss function during pre-training to achieve more thorough decoupling. Furthermore, we integrate a Coarse-to-Fine module to refine the rendered images, alleviating blurriness caused by head movements and enhancing overall image quality. Extensive experiments demonstrate that D^3-Talker outperforms state-of-the-art methods in both high-fidelity rendering and accurate audio-lip synchronization with limited training data. Our code will be provided upon acceptance.

[76] Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering

Shanlin Sun,Yifan Wang,Hanwen Zhang,Yifeng Xiong,Qin Ren,Ruogu Fang,Xiaohui Xie,Chenyu You

Main category: cs.CV

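TL;DR: Ouroboros is a new framework that uses two single-step diffusion models so that forward and inverse rendering reinforce each other, improving inference speed and the consistency of results.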

Details

Motivation: Existing multi-step diffusion models usually treat forward and inverse rendering independently, leading to cycle inconsistency and slow inference. Method: The approach extends intrinsic decomposition to both indoor and outdoor scenes and introduces a cycle consistency mechanism that keeps forward and inverse rendering outputs coherent. Result: Experiments show that Ouroboros achieves state-of-the-art performance across diverse scenes with substantially faster inference than other diffusion-based methods. Ouroboros can also transfer to video decomposition without training, reducing temporal inconsistency in video sequences while maintaining high-quality per-frame inverse rendering. Conclusion: By handling forward and inverse rendering with two mutually reinforcing single-step diffusion models, the Ouroboros framework speeds up inference and reduces temporal inconsistency in video sequences. Abstract: While multi-step diffusion models have advanced both forward and inverse rendering, existing approaches often treat these problems independently, leading to cycle inconsistency and slow inference speed. In this work, we present Ouroboros, a framework composed of two single-step diffusion models that handle forward and inverse rendering with mutual reinforcement. Our approach extends intrinsic decomposition to both indoor and outdoor scenes and introduces a cycle consistency mechanism that ensures coherence between forward and inverse rendering outputs. Experimental results demonstrate state-of-the-art performance across diverse scenes while achieving substantially faster inference speed compared to other diffusion-based methods. We also demonstrate that Ouroboros can transfer to video decomposition in a training-free manner, reducing temporal inconsistency in video sequences while maintaining high-quality per-frame inverse rendering.

[77] DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing

Weitao Wang,Zichen Wang,Hongdeng Shen,Yulei Lu,Xirui Fan,Suhui Wu,Jun Zhang,Haoqian Wang,Hao Zhang

Main category: cs.CV

TL;DR: DreamSwapV is a mask-guided, subject-agnostic, end-to-end framework for video customization that replaces any subject in a video with a user-specified one.

Details

Motivation: With the rapid progress of video generation, demand for customized video editing is growing; subject swapping is a key component of it but remains under-explored. Existing swapping methods either focus on narrow domains or rely on indirect editing paradigms or ambiguous text prompts, which reduces final fidelity. Method: The framework introduces multiple conditions and a dedicated condition fusion module, and designs an adaptive mask strategy to accommodate subjects of varying scales and attributes. A two-phase dataset construction and training scheme enables subject swapping with improved contextual interaction. Result: DreamSwapV outperforms existing methods, as validated by comprehensive experiments on VBench indicators and the DreamSwapV-Benchmark. Conclusion: DreamSwapV offers a flexible solution for video subject swapping that meets video customization needs and improves both swap quality and interaction with the surrounding context. Abstract: With the rapid progress of video generation, demand for customized video editing is surging, where subject swapping constitutes a key component yet remains under-explored. Prevailing swapping approaches either specialize in narrow domains--such as human-body animation or hand-object interaction--or rely on some indirect editing paradigm or ambiguous text prompts that compromise final fidelity. In this paper, we propose DreamSwapV, a mask-guided, subject-agnostic, end-to-end framework that swaps any subject in any video for customization with a user-specified mask and reference image. To inject fine-grained guidance, we introduce multiple conditions and a dedicated condition fusion module that integrates them efficiently. In addition, an adaptive mask strategy is designed to accommodate subjects of varying scales and attributes, further improving interactions between the swapped subject and its surrounding context. Through our elaborate two-phase dataset construction and training scheme, our DreamSwapV outperforms existing methods, as validated by comprehensive experiments on VBench indicators and our first introduced DreamSwapV-Benchmark.

Boxiao Pan,Adam W. Harley,C. Karen Liu,Leonidas J. Guibas

Main category: cs.CV

TL;DR: This paper introduces a framework for predicting future 6D head poses in egocentric navigation and presents a new dataset (AND) collected using Project Aria glasses. The model learns human-like behaviors and generalizes well to new environments.

Details

Motivation: The motivation is to address the challenge of predicting collision-free future trajectories in egocentric navigation, which is important for applications like humanoid robotics, VR/AR, and assistive navigation. The lack of suitable training data also drives the need for a new dataset. Method: The authors propose a framework that uses temporally aggregated 3D latent features to model geometric and semantic constraints in the environment. They also introduce a new dataset, the Aria Navigation Dataset (AND), collected using Project Aria glasses, to address the lack of training data. Result: The experiments show that the model successfully learns human-like navigation behaviors such as waiting/slowing down, rerouting, and looking around for traffic. The dataset provides 4 hours of real-world egocentric navigation data, contributing to the learning of real-world policies. Conclusion: The paper concludes that their proposed framework effectively predicts future 6D head poses from egocentric videos, capturing human-like navigation behaviors and generalizing well to unseen environments. Abstract: The ability to predict collision-free future trajectories from egocentric observations is crucial in applications such as humanoid robotics, VR / AR, and assistive navigation. In this work, we introduce the challenging problem of predicting a sequence of future 6D head poses from an egocentric video. In particular, we predict both head translations and rotations to learn the active information-gathering behavior expressed through head-turning events. To solve this task, we propose a framework that reasons over temporally aggregated 3D latent features, which models the geometric and semantic constraints for both the static and dynamic parts of the environment. Motivated by the lack of training data in this space, we further contribute a data collection pipeline using the Project Aria glasses, and present a dataset collected through this approach. 
Our dataset, dubbed Aria Navigation Dataset (AND), consists of 4 hours of recording of users navigating in real-world scenarios. It includes diverse situations and navigation behaviors, providing a valuable resource for learning real-world egocentric navigation policies. Extensive experiments show that our model learns human-like navigation behaviors such as waiting / slowing down, rerouting, and looking around for traffic while generalizing to unseen environments. Check out our project webpage at https://sites.google.com/stanford.edu/lookout.

[79] Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration

Haoran Bai,Xiaoxu Chen,Canqian Yang,Zongyao He,Sibin Deng,Ying Chen

Main category: cs.CV

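TL;DR: This paper proposes Vivid-VR, a video restoration method that redesigns the ControlNet-based control architecture and introduces a concept distillation strategy, improving the quality and controllability of generation.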

Details

Motivation: Conventional fine-tuning of controllable video generation pipelines often suffers from distribution drift caused by imperfect multimodal alignment, which degrades texture realism and temporal coherence; both controllability and generation quality need improvement. Method: A DiT-based generative video restoration method that uses ControlNet to control the generation process, together with a concept distillation training strategy and a redesigned control architecture comprising a control feature projector and a dual-branch ControlNet connector. Result: Vivid-VR performs well on multiple benchmarks, improving texture realism, visual vividness, and temporal consistency in video restoration. Conclusion: Vivid-VR performs well on synthetic and real-world benchmarks as well as AIGC videos, achieving impressive texture realism, visual vividness, and temporal consistency; the code and checkpoints are publicly available. Abstract: We present Vivid-VR, a DiT-based generative video restoration method built upon an advanced T2V foundation model, where ControlNet is leveraged to control the generation process, ensuring content consistency. However, conventional fine-tuning of such controllable pipelines frequently suffers from distribution drift due to limitations in imperfect multimodal alignment, resulting in compromised texture realism and temporal coherence. To tackle this challenge, we propose a concept distillation training strategy that utilizes the pretrained T2V model to synthesize training samples with embedded textual concepts, thereby distilling its conceptual understanding to preserve texture and temporal quality. To enhance generation controllability, we redesign the control architecture with two key components: 1) a control feature projector that filters degradation artifacts from input video latents to minimize their propagation through the generation pipeline, and 2) a new ControlNet connector employing a dual-branch design. This connector synergistically combines MLP-based feature mapping with cross-attention mechanism for dynamic control feature retrieval, enabling both content preservation and adaptive control signal modulation. Extensive experiments show that Vivid-VR performs favorably against existing approaches on both synthetic and real-world benchmarks, as well as AIGC videos, achieving impressive texture realism, visual vividness, and temporal consistency. The codes and checkpoints are publicly available at https://github.com/csbhr/Vivid-VR.

[80] WeedSense: Multi-Task Learning for Weed Segmentation, Height Estimation, and Growth Stage Classification

Toqi Tahamid Sarker,Khaled R Ahmed,Taminul Islam,Cristiana Bernardi Rankrape,Karla Gage

Main category: cs.CV

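TL;DR: WeedSense is an efficient multi-task deep learning model that jointly performs weed segmentation, height estimation, and growth stage classification, with strong performance and real-time inference.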

Details

Motivation: Weed management is a critical challenge in agriculture, and effective monitoring and analysis strategies are needed to support sustainable practices and site-specific management. Method: The model uses a dual-path encoder and a Multi-Task Bifurcated Decoder with transformer-based feature fusion to generate and fuse multi-scale features, and is trained and evaluated on a new dataset covering 16 weed species over an 11-week growth cycle. Result: On this dataset, WeedSense achieves 89.78% mIoU, 1.67 cm MAE, and 99.99% accuracy at 160 FPS, with 32.4% fewer parameters and 3× faster inference than sequential single-task execution. Conclusion: WeedSense is a novel multi-task learning architecture that efficiently performs semantic segmentation, height estimation, and growth stage classification, outperforms existing models, and runs in real time. Abstract: Weed management represents a critical challenge in agriculture, significantly impacting crop yields and requiring substantial resources for control. Effective weed monitoring and analysis strategies are crucial for implementing sustainable agricultural practices and site-specific management approaches. We introduce WeedSense, a novel multi-task learning architecture for comprehensive weed analysis that jointly performs semantic segmentation, height estimation, and growth stage classification. We present a unique dataset capturing 16 weed species over an 11-week growth cycle with pixel-level annotations, height measurements, and temporal labels. WeedSense leverages a dual-path encoder incorporating Universal Inverted Bottleneck blocks and a Multi-Task Bifurcated Decoder with transformer-based feature fusion to generate multi-scale features and enable simultaneous prediction across multiple tasks. WeedSense outperforms other state-of-the-art models on our comprehensive evaluation. On our multi-task dataset, WeedSense achieves mIoU of 89.78% for segmentation, 1.67 cm MAE for height estimation, and 99.99% accuracy for growth stage classification while maintaining real-time inference at 160 FPS. Our multitask approach achieves 3× faster inference than sequential single-task execution and uses 32.4% fewer parameters. Please see our project page at weedsense.github.io.
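
The segmentation and height results above are stated as mIoU and MAE. The sketch below shows the usual definitions of both on toy data. It is illustrative only: the function names, the flat label lists, and the choice to skip classes absent from both prediction and ground truth are assumptions, and the paper's own evaluation protocol may differ.

```python
def mean_iou(pred_labels, true_labels, num_classes):
    """Mean intersection-over-union over classes, given per-pixel class ids.

    Classes that appear in neither prediction nor ground truth are skipped
    (one common convention, assumed here).
    """
    if len(pred_labels) != len(true_labels):
        raise ValueError("label sequences must have equal length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred_labels, true_labels) if p == c and t == c)
        union = sum(1 for p, t in zip(pred_labels, true_labels) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in the inputs")
    return sum(ious) / len(ious)


def mean_absolute_error(pred, target):
    """Mean absolute error, e.g. between predicted and measured heights in cm."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
```

For example, predictions `[0, 0, 1, 1]` against ground truth `[0, 1, 1, 1]` give per-class IoUs of 1/2 and 2/3, so the mIoU is 7/12.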

[81] SATURN: Autoregressive Image Generation Guided by Scene Graphs

Thanh-Nhan Vo,Trong-Thuan Nguyen,Tam V. Nguyen,Minh-Triet Tran

Main category: cs.CV

TL;DR: SATURN improves text-to-image generation by incorporating scene graphs into a modified VAR-CLIP model, achieving better performance on layout and object relationships without extra modules or multi-stage training.

Details

Motivation: State-of-the-art text-to-image models struggle to capture layout and object relationships implied by complex prompts, while scene graphs provide a natural structural prior. Method: SATURN (Structured Arrangement of Triplets for Unified Rendering Networks), a lightweight extension to VAR-CLIP that translates a scene graph into a salience-ordered token sequence. Result: On the Visual Genome dataset, SATURN reduces FID from 56.45% to 21.62% and increases the Inception Score from 16.03 to 24.78. Conclusion: SATURN effectively combines structural awareness with state-of-the-art autoregressive fidelity. Abstract: State-of-the-art text-to-image models excel at photorealistic rendering but often struggle to capture the layout and object relationships implied by complex prompts. Scene graphs provide a natural structural prior, yet previous graph-guided approaches have typically relied on heavy GAN or diffusion pipelines, which lag behind modern autoregressive architectures in both speed and fidelity. We introduce SATURN (Structured Arrangement of Triplets for Unified Rendering Networks), a lightweight extension to VAR-CLIP that translates a scene graph into a salience-ordered token sequence, enabling a frozen CLIP-VQ-VAE backbone to interpret graph structure while fine-tuning only the VAR transformer. On the Visual Genome dataset, SATURN reduces FID from 56.45% to 21.62% and increases the Inception Score from 16.03 to 24.78, outperforming prior methods such as SG2IM and SGDiff without requiring extra modules or multi-stage training. Qualitative results further confirm improvements in object count fidelity and spatial relation accuracy, showing that SATURN effectively combines structural awareness with state-of-the-art autoregressive fidelity.
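
The abstract says a scene graph is translated into a salience-ordered token sequence but does not say how salience is computed or what the tokens look like. The sketch below therefore assumes each triplet arrives with a precomputed numeric salience score and uses a hypothetical `<sep>` separator token. It only illustrates the ordering-and-flattening idea and is not SATURN's actual serialization.

```python
def serialize_scene_graph(triplets, separator="<sep>"):
    """Flatten (subject, predicate, object, salience) triplets into tokens.

    Triplets are emitted from most to least salient; ties keep their input
    order because Python's sort is stable. The salience score and the
    separator token are assumptions made for this sketch.
    """
    ordered = sorted(triplets, key=lambda t: -t[3])
    tokens = []
    for subject, predicate, obj, _salience in ordered:
        tokens.extend([subject, predicate, obj, separator])
    return tokens
```

With the input `[("sky", "above", "tree", 0.2), ("man", "riding", "horse", 0.9)]`, the more salient "man riding horse" triplet is emitted first.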

[82] Virtual Community: An Open World for Humans, Robots, and Society

Qinhong Zhou,Hongxin Zhang,Xiangye Lin,Zheyuan Zhang,Yutian Chen,Wenjun Liu,Zunzhe Zhang,Sunli Chen,Lixing Fang,Qiushi Lyu,Xinyu Sun,Jincheng Yang,Zeyuan Wang,Bao Chi Dang,Zhehuan Chen,Daksha Ladia,Jiageng Liu,Chuang Gan

Main category: cs.CV

TL;DR: This paper introduces Virtual Community, an open-world platform for studying human-robot interactions and coexistence, featuring a physics-based simulator and real-world aligned environments, along with new challenges for evaluating multi-agent reasoning and cooperation.

Details

Motivation: The motivation stems from the rapid progress in AI and robotics, which may lead to profound societal changes as humans and robots coexist in shared communities. The authors aim to explore the opportunities and challenges of this future by studying embodied social intelligence at scale. Method: The authors developed Virtual Community, an open-world platform based on a universal physics engine and real-world 3D scenes, which includes a multi-agent physics simulator and a community generation pipeline. They also proposed two novel challenges—Community Planning Challenge and Community Robot Challenge—to evaluate multi-agent reasoning, planning, and cooperation. Result: The authors evaluated various baselines on the proposed challenges and demonstrated the difficulties in high-level open-world task planning and low-level cooperation controls. The results show that Virtual Community can serve as a foundation for further research on human-robot coexistence. Conclusion: The paper concludes that Virtual Community provides a scalable platform for studying embodied social intelligence and human-robot coexistence in open-world environments, and it introduces challenges that highlight the complexities of multi-agent reasoning, planning, and cooperation. Abstract: The rapid progress in AI and Robotics may lead to a profound societal transformation, as humans and robots begin to coexist within shared communities, introducing both opportunities and challenges. To explore this future, we present Virtual Community-an open-world platform for humans, robots, and society-built on a universal physics engine and grounded in real-world 3D scenes. With Virtual Community, we aim to study embodied social intelligence at scale: 1) How robots can intelligently cooperate or compete; 2) How humans develop social relations and build community; 3) More importantly, how intelligent robots and humans can co-exist in an open world. 
To support these, Virtual Community features: 1) An open-source multi-agent physics simulator that supports robots, humans, and their interactions within a society; 2) A large-scale, real-world aligned community generation pipeline, including vast outdoor space, diverse indoor scenes, and a community of grounded agents with rich characters and appearances. Leveraging Virtual Community, we propose two novel challenges. The Community Planning Challenge evaluates multi-agent reasoning and planning ability in open-world settings, such as cooperating to help agents with daily activities and efficiently connecting other agents. The Community Robot Challenge requires multiple heterogeneous robots to collaborate in solving complex open-world tasks. We evaluate various baselines on these tasks and demonstrate the challenges in both high-level open-world task planning and low-level cooperation controls. We hope that Virtual Community will unlock further study of human-robot coexistence within open-world environments.

[83] PB-IAD: Utilizing multimodal foundation models for semantic industrial anomaly detection in dynamic manufacturing environments

Bernd Hofmann,Albert Scheck,Joerg Franke,Patrick Bruendl

Main category: cs.CV

TL;DR: PB-IAD is a novel framework that uses foundation models for industrial anomaly detection, offering better adaptability and performance in dynamic and data-sparse environments without requiring data science expertise.

Details

Motivation: Traditional statistical and data-driven methods for industrial anomaly detection are limited by their reliance on large annotated datasets and lack of flexibility in dynamic settings. This work aims to overcome these limitations using foundation models. Method: PB-IAD uses a prompt-based approach leveraging the multimodal and reasoning capabilities of foundation models like GPT-4.1. It includes a pre-processing module and a prompt template for incorporating domain-specific knowledge. Result: PB-IAD demonstrated superior performance in anomaly detection, especially in data-sparse and low-shot scenarios, outperforming state-of-the-art methods like PatchCore through semantic instructions alone. Conclusion: The PB-IAD framework effectively addresses key challenges in industrial anomaly detection, including data sparsity, agile adaptability, and domain user centricity, proving to be a promising solution for dynamic production environments. Abstract: The detection of anomalies in manufacturing processes is crucial to ensure product quality and identify process deviations. Statistical and data-driven approaches remain the standard in industrial anomaly detection, yet their adaptability and usability are constrained by the dependence on extensive annotated datasets and limited flexibility under dynamic production conditions. Recent advances in the perception capabilities of foundation models provide promising opportunities for their adaptation to this downstream task. This paper presents PB-IAD (Prompt-based Industrial Anomaly Detection), a novel framework that leverages the multimodal and reasoning capabilities of foundation models for industrial anomaly detection. Specifically, PB-IAD addresses three key requirements of dynamic production environments: data sparsity, agile adaptability, and domain user centricity. 
In addition to the anomaly detection, the framework includes a prompt template that is specifically designed for iteratively implementing domain-specific process knowledge, as well as a pre-processing module that translates domain user inputs into effective system prompts. This user-centric design allows domain experts to customise the system flexibly without requiring data science expertise. The proposed framework is evaluated by utilizing GPT-4.1 across three distinct manufacturing scenarios, two data modalities, and an ablation study to systematically assess the contribution of semantic instructions. Furthermore, PB-IAD is benchmarked to state-of-the-art methods for anomaly detection such as PatchCore. The results demonstrate superior performance, particularly in data-sparse scenarios and low-shot settings, achieved solely through semantic instructions.

[84] Adversarial Generation and Collaborative Evolution of Safety-Critical Scenarios for Autonomous Vehicles

Jiangfan Liu,Yongkang Guo,Fangzhi Zhong,Tianyuan Zhang,Zonglei Jing,Siyuan Liang,Jiakai Wang,Mingchuan Zhang,Aishan Liu,Xianglong Liu

Main category: cs.CV

TL;DR: The paper proposes ScenGE, a framework for generating safety-critical scenarios for autonomous vehicles using adversarial agent behaviors and complex traffic flow amplification, demonstrating improved collision detection and model robustness.

Details

Motivation: Current approaches for generating safety-critical scenarios for autonomous vehicles rely on predefined threat patterns or rule-based strategies, limiting their ability to expose diverse and unforeseen failure modes. ScenGE aims to overcome these limitations. Method: The paper introduces ScenGE, which includes Meta-Scenario Generation and Complex Scenario Evolution. Meta-Scenario Generation uses a large language model to infer adversarial agent behaviors, while Complex Scenario Evolution amplifies threats using background vehicles and optimizes key agent trajectories. Result: Experiments show ScenGE uncovers more severe collision cases (+31.96%) on average than SoTA baselines. It can be applied to large model-based AV systems and deployed on different simulators. Adversarial training on the generated scenarios improves model robustness. Conclusion: ScenGE provides a framework for generating safety-critical scenarios for autonomous vehicles, enhancing their safety evaluation. It demonstrates effectiveness in uncovering collision cases and improving model robustness through adversarial training. The framework is validated through real-world tests and human evaluation. Abstract: The generation of safety-critical scenarios in simulation has become increasingly crucial for safety evaluation in autonomous vehicles prior to road deployment in society. However, current approaches largely rely on predefined threat patterns or rule-based strategies, which limit their ability to expose diverse and unforeseen failure modes. To overcome these, we propose ScenGE, a framework that can generate plentiful safety-critical scenarios by reasoning novel adversarial cases and then amplifying them with complex traffic flows. 
Given a simple prompt of a benign scene, it first performs Meta-Scenario Generation, where a large language model, grounded in structured driving knowledge, infers an adversarial agent whose behavior poses a threat that is both plausible and deliberately challenging. This meta-scenario is then specified in executable code for precise in-simulator control. Subsequently, Complex Scenario Evolution uses background vehicles to amplify the core threat introduced by Meta-Scenario. It builds an adversarial collaborator graph to identify key agent trajectories for optimization. These perturbations are designed to simultaneously reduce the ego vehicle's maneuvering space and create critical occlusions. Extensive experiments conducted on multiple reinforcement learning based AV models show that ScenGE uncovers more severe collision cases (+31.96%) on average than SoTA baselines. Additionally, our ScenGE can be applied to large model based AV systems and deployed on different simulators; we further observe that adversarial training on our scenarios improves the model robustness. Finally, we validate our framework through real-world vehicle tests and human evaluation, confirming that the generated scenarios are both plausible and critical. We hope our paper can build up a critical step towards building public trust and ensuring their safe deployment.

[85] WISE-FUSE: Efficient Whole Slide Image Encoding via Coarse-to-Fine Patch Selection with VLM and LLM Knowledge Fusion

Yonghan Shin,SeungKyu Kim,Won-Ki Jeong

Main category: cs.CV

TL;DR: WISE-FUSE is a framework that speeds up the processing of large-scale whole slide images in pathology by focusing on diagnostically relevant regions, significantly reducing computation time without sacrificing accuracy.

Details

Motivation: Whole slide images in computational pathology are computationally expensive to process due to their gigapixel scale, often requiring days or weeks for preprocessing and training. This makes WSI encoding a major bottleneck in real-world deployment. Method: WISE-FUSE computes similarity scores between low-resolution patches and class-specific textual descriptions using a knowledge distillation mechanism. It selects informative regions based on these scores, selectively encodes high-resolution patches, and fuses them with textual embeddings. Result: Experiments show that WISE-FUSE reduces WSI encoding time by over threefold while achieving diagnostic performance comparable to or better than exhaustive patch processing. Conclusion: WISE-FUSE offers a scalable and practical solution for computational pathology by reducing WSI encoding time while maintaining or improving diagnostic performance. Abstract: Whole slide images (WSIs) in computational pathology (CPath) pose a major computational challenge due to their gigapixel scale, often requiring the processing of tens to hundreds of thousands of high-resolution patches per slide. This results in prohibitive encoding costs, with preprocessing and training times extending to days or even weeks, making WSI encoding the most significant bottleneck in real-world deployment. In this work, we propose WISE-FUSE, an adaptive WSI encoding framework that leverages pathology-domain vision-language models and large language models to address this challenge by selectively processing diagnostically relevant regions. WISE-FUSE first computes similarity scores between low-resolution patches and class-specific textual descriptions using a knowledge distillation mechanism that preserves fine-grained diagnostic features. Based on these similarity scores, we select a small subset of informative regions for the target task, which quickly eliminates irrelevant patches at the coarse level. 
The corresponding high-resolution patches are then selectively encoded and fused with textual embeddings to reinforce diagnostic context. Extensive experiments demonstrate that WISE-FUSE reduces WSI encoding time by over threefold while achieving diagnostic performance comparable to or surpassing that of exhaustive patch processing, offering a scalable and practical solution for CPath.
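
The coarse-level selection step described for WISE-FUSE can be illustrated with a minimal sketch. It assumes that low-resolution patches and class descriptions have already been embedded into a shared vector space, and it scores each patch by its best cosine similarity to any description. The function names, the `keep_ratio` parameter, and the toy vectors are hypothetical; the paper's actual scoring uses a knowledge distillation mechanism that is not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def select_patches(patch_embs, text_embs, keep_ratio=0.25):
    """Return sorted indices of the patches most similar to any class description.

    keep_ratio is a hypothetical knob for the fraction of patches kept
    for high-resolution encoding.
    """
    # Score each low-resolution patch by its best match over all descriptions.
    scores = [max(cosine(p, t) for t in text_embs) for p in patch_embs]
    k = max(1, int(len(patch_embs) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy data: four 2-D patch embeddings and one class-description embedding.
patches = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
texts = [[1.0, 0.1]]
selected = select_patches(patches, texts, keep_ratio=0.5)
print(selected)
```

Only the selected indices would then be passed to the expensive high-resolution encoder, which is where the reported savings in encoding time would come from.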

[86] Improving OCR using internal document redundancy

Diego Belzarena,Seginus Mowlavi,Aitor Artola,Camilo Mariño,Marina Gardella,Ignacio Ramírez,Antoine Tadros,Roy He,Natalia Bottaioli,Boshra Rajaei,Gregory Randall,Jean-Michel Morel

Main category: cs.CV

TL;DR: This paper proposes an unsupervised OCR enhancement method using an extended GMM to exploit character redundancy, showing improvements in low-quality document recognition.

Details

Motivation: Current OCR systems struggle with low-quality data due to limited exploitation of document redundancy, particularly in printed documents with high inter-domain variability. Method: An extended Gaussian Mixture Model (GMM) was introduced, alternating an Expectation-Maximization (EM) algorithm with intra-cluster realignment and normality statistical testing. Result: Improvements were demonstrated in documents with various degradation levels, including Uruguayan military archives and European newspapers from the 17th to mid-20th century. Conclusion: The proposed unsupervised method enhances OCR systems' performance by leveraging character redundancy within documents, especially for low-quality data. Abstract: Current OCR systems are based on deep learning models trained on large amounts of data. Although they have shown some ability to generalize to unseen data, especially in detection tasks, they can struggle with recognizing low-quality data. This is particularly evident for printed documents, where intra-domain data variability is typically low, but inter-domain data variability is high. In that context, current OCR methods do not fully exploit each document's redundancy. We propose an unsupervised method by leveraging the redundancy of character shapes within a document to correct imperfect outputs of a given OCR system and suggest better clustering. To this aim, we introduce an extended Gaussian Mixture Model (GMM) by alternating an Expectation-Maximization (EM) algorithm with an intra-cluster realignment process and normality statistical testing. We demonstrate improvements in documents with various levels of degradation, including recovered Uruguayan military archives and 17th to mid-20th century European newspapers.

[87] A Comprehensive Review of Agricultural Parcel and Boundary Delineation from Remote Sensing Images: Recent Progress and Future Perspectives

Juepeng Zheng,Zi Ye,Yibin Wen,Jianxi Huang,Zhiwei Zhang,Qingmei Li,Qiong Hu,Baodong Xu,Lingyuan Zhao,Haohuan Fu

Main category: cs.CV

TL;DR: This review paper explores Agricultural Parcel and Boundary Delineation (APBD) methods using remote sensing data, categorizing approaches and discussing future research directions and applications.

Details

Motivation: The motivation is to provide a clear understanding of the current state of Agricultural Parcel and Boundary Delineation (APBD) research and its application to remote sensing images, enabling cost-efficient and high-accuracy agricultural inventory and analysis. Method: The paper conducts a comprehensive review of recent APBD papers, categorizing methods into traditional image processing, traditional machine learning, and deep learning-based methods. It also includes a meta-data analysis covering algorithm, study site, crop type, sensor type, and evaluation method. Result: The paper systematically reviews APBD-related research, categorizing methods and discussing deep learning-based approaches, such as semantic segmentation-based, object detection-based, and Transformer-based methods. It also addresses APBD-related issues like multi-sensor data and comparisons between different learning and algorithmic approaches. Conclusion: This review aims to provide a knowledge map of existing APBD efforts and discusses future prospects and potential hot topics in APBD research, helping researchers keep track of its development and tendencies. Abstract: Powered by advances in multiple remote sensing sensors, the production of high spatial resolution images provides great potential to achieve cost-efficient and high-accuracy agricultural inventory and analysis in an automated way. Lots of studies that aim at providing an inventory of the level of each agricultural parcel have generated many methods for Agricultural Parcel and Boundary Delineation (APBD). This review covers APBD methods for detecting and delineating agricultural parcels and systematically reviews the past and present of APBD-related research applied to remote sensing images. 
With the goal of providing a clear knowledge map of existing APBD efforts, we conduct a comprehensive review of recent APBD papers to build a meta-data analysis, including the algorithm, the study site, the crop type, the sensor type, the evaluation method, etc. We categorize the methods into three classes: (1) traditional image processing methods (including pixel-based, edge-based and region-based); (2) traditional machine learning methods (such as random forest, decision tree); and (3) deep learning-based methods. With deep learning-oriented approaches contributing to a majority, we further discuss deep learning-based methods like semantic segmentation-based, object detection-based and Transformer-based methods. In addition, we discuss five APBD-related issues to further comprehend the APBD domain using remote sensing data, such as multi-sensor data in the APBD task, comparisons between single-task learning and multi-task learning in the APBD domain, comparisons among different algorithms and different APBD tasks, etc. Finally, this review proposes some APBD-related applications and a few exciting prospects and potential hot topics in future APBD research. We hope this review helps researchers involved in the APBD domain keep track of its development and trends.

[88] Making Pose Representations More Expressive and Disentangled via Residual Vector Quantization

Sukhyun Jeong,Hong-Gi Shin,Yong-Hoon Choi

Main category: cs.CV

TL;DR: This paper proposes a new motion generation method that improves controllability and detail capture by combining pose codes with continuous motion features using RVQ.

Details

Motivation: Discrete pose codes alone fail to capture fine-grained motion details, which limits the expressiveness of controllable motion generation models. This work aims to address this limitation. Method: The method uses residual vector quantization (RVQ) to integrate continuous motion features with pose code-based latent representations for enhanced motion generation. Result: On the HumanML3D dataset, the model achieved a reduction in FID from 0.041 to 0.015 and an improvement in Top-1 R-Precision from 0.508 to 0.510, with qualitative analysis confirming enhanced controllability. Conclusion: The proposed method enhances controllable motion generation by combining pose code-based representations with continuous motion features through RVQ, achieving improved performance and controllability. Abstract: Recent progress in text-to-motion has advanced both 3D human motion generation and text-based motion control. Controllable motion generation (CoMo), which enables intuitive control, typically relies on pose code representations, but discrete pose codes alone cannot capture fine-grained motion details, limiting expressiveness. To overcome this, we propose a method that augments pose code-based latent representations with continuous motion features using residual vector quantization (RVQ). This design preserves the interpretability and manipulability of pose codes while effectively capturing subtle motion characteristics such as high-frequency details. Experiments on the HumanML3D dataset show that our model reduces Frechet inception distance (FID) from 0.041 to 0.015 and improves Top-1 R-Precision from 0.508 to 0.510. Qualitative analysis of pairwise direction similarity between pose codes further confirms the model's controllability for motion editing.
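
To make the role of residual vector quantization (RVQ) concrete, here is a minimal, generic sketch. Each stage quantizes whatever the previous stage left unexplained, so later stages capture progressively finer detail. The codebooks, the input vector, and the function names are illustrative assumptions; in the paper the first level is tied to interpretable pose codes, which this toy example does not model.

```python
def nearest(codebook, x):
    """Index of the codeword closest to x in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], x)),
    )

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices, residual

def rvq_decode(indices, codebooks):
    """Reconstruct the vector by summing the selected codeword of each stage."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out

# Hypothetical two-stage codebooks: a coarse stage and a fine residual stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]
x = [1.2, 0.8]
indices, residual = rvq_encode(x, codebooks)
recon = rvq_decode(indices, codebooks)
print(indices, recon)
```

With only the first stage, the reconstruction would be the coarse codeword `[1.0, 1.0]`; adding the second stage moves it closer to `x`, which mirrors the idea of coarse codes supplemented by finer residual detail.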

[89] Locality-aware Concept Bottleneck Model

Sujin Jeon,Hyundo Lee,Eungseo Kim,Sanghack Lee,Byoung-Tak Zhang,Inwoo Hwang

Main category: cs.CV

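TL;DR: This paper proposes a new concept bottleneck model, LCBM, that uses prototype learning to improve the spatial localization accuracy of concepts, addressing the localization shortcomings of existing label-free CBMs.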

Details

Motivation: Existing label-free CBMs often fail to localize concepts in images and attend to regions unrelated to the concept. LCBM is proposed to improve the accuracy of concept spatial localization. Method: The paper proposes a framework called Locality-aware Concept Bottleneck Model (LCBM), which assigns one prototype to each concept and learns these prototypes by encouraging them to encode similar local regions. The prototypes then help the model identify the local region from which each concept should be predicted. Result: Experiments show that LCBM effectively identifies the concepts present in images and improves localization while maintaining comparable classification performance. Conclusion: By leveraging information from foundation models together with prototype learning, LCBM improves the spatial localization accuracy of concepts while maintaining comparable classification performance. Abstract: Concept bottleneck models (CBMs) are inherently interpretable models that make predictions based on human-understandable visual cues, referred to as concepts. As obtaining dense concept annotations with human labeling is demanding and costly, recent approaches utilize foundation models to determine the concepts existing in the images. However, such label-free CBMs often fail to localize concepts in relevant regions, attending to visually unrelated regions when predicting concept presence. To this end, we propose a framework, coined Locality-aware Concept Bottleneck Model (LCBM), which utilizes rich information from foundation models and adopts prototype learning to ensure accurate spatial localization of the concepts. Specifically, we assign one prototype to each concept, promoted to represent a prototypical image feature of that concept. These prototypes are learned by encouraging them to encode similar local regions, leveraging foundation models to assure the relevance of each prototype to its associated concept. Then we use the prototypes to facilitate the learning process of identifying the proper local region from which each concept should be predicted. Experimental results demonstrate that LCBM effectively identifies present concepts in the images and exhibits improved localization while maintaining comparable classification performance.

[90] GOGS: High-Fidelity Geometry and Relighting for Glossy Objects via Gaussian Surfels

Xingyuan Yang,Min Wei

Main category: cs.CV

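TL;DR: This paper proposes GOGS, an efficient inverse rendering framework for reconstructing and relighting glossy objects from RGB images, achieving better geometry and material modeling.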

Details

Motivation: Existing methods for inverse rendering of glossy objects suffer from multi-view inconsistencies, high-frequency surface noise, structural artifacts, and ambiguous material properties, which lead to unrealistic relighting results. Method: The paper proposes GOGS, a two-stage framework based on 2D Gaussian surfels. The first stage achieves robust surface reconstruction through physics-based rendering and geometric priors; the second stage performs material decomposition and refines specular details via Monte Carlo importance sampling and differentiable 2D Gaussian ray tracing. Result: GOGS achieves state-of-the-art results in geometry reconstruction, material separation, and relighting under novel illumination, addressing the shortcomings of existing methods. Conclusion: GOGS achieves state-of-the-art performance in geometry reconstruction, material separation, and photorealistic relighting under novel illumination, outperforming existing inverse rendering methods. Abstract: Inverse rendering of glossy objects from RGB imagery remains fundamentally limited by inherent ambiguity. Although NeRF-based methods achieve high-fidelity reconstruction via dense-ray sampling, their computational cost is prohibitive. Recent 3D Gaussian Splatting achieves high reconstruction efficiency but exhibits limitations under specular reflections. Multi-view inconsistencies introduce high-frequency surface noise and structural artifacts, while simplified rendering equations obscure material properties, leading to implausible relighting results. To address these issues, we propose GOGS, a novel two-stage framework based on 2D Gaussian surfels. First, we establish robust surface reconstruction through physics-based rendering with split-sum approximation, enhanced by geometric priors from foundation models. Second, we perform material decomposition by leveraging Monte Carlo importance sampling of the full rendering equation, modeling indirect illumination via differentiable 2D Gaussian ray tracing and refining high-frequency specular details through spherical mipmap-based directional encoding that captures anisotropic highlights. Extensive experiments demonstrate state-of-the-art performance in geometry reconstruction, material separation, and photorealistic relighting under novel illuminations, outperforming existing inverse rendering approaches.

[91] Safety-Critical Learning for Long-Tail Events: The TUM Traffic Accident Dataset

Walter Zimmer,Ross Greer,Xingcheng Zhou,Rui Song,Marc Pavel,Daniel Lehmberg,Ahmed Ghita,Akshay Gopalkrishnan,Mohan Trivedi,Alois Knoll

Main category: cs.CV

TL;DR: The paper presents the TUMTraf-A dataset and the Accid3nD accident detection model, aiming to improve understanding and detection of highway accidents.

Details

Motivation: Despite efforts to improve transportation safety, accidents still occur regularly, necessitating a deeper understanding of these events and the development of effective detection models. Method: The paper introduces the TUM Traffic Accident (TUMTraf-A) dataset, which includes labeled 2D and 3D boxes and track IDs, and proposes the Accid3nD accident detection model that integrates a rule-based approach with a learning-based approach. Result: The TUM Traffic Accident (TUMTraf-A) dataset was created, containing ten sequences of vehicle crashes with extensive labeled data, and the proposed Accid3nD model demonstrated robustness in experiments and ablation studies. Conclusion: The paper concludes that accidents are unavoidable and sporadic outcomes of traffic networks, and the proposed Accid3nD model effectively detects accidents by combining rule-based and learning-based approaches. Abstract: Even though a significant amount of work has been done to increase the safety of transportation networks, accidents still occur regularly. They must be understood as an unavoidable and sporadic outcome of traffic networks. We present the TUM Traffic Accident (TUMTraf-A) dataset, a collection of real-world highway accidents. It contains ten sequences of vehicle crashes at high-speed driving with 294,924 labeled 2D and 93,012 labeled 3D boxes and track IDs within 48,144 labeled frames recorded from four roadside cameras and LiDARs at 10 Hz. The dataset contains ten object classes and is provided in the OpenLABEL format. We propose Accid3nD, an accident detection model that combines a rule-based approach with a learning-based one. Experiments and ablation studies on our dataset show the robustness of our proposed method. The dataset, model, and code are available on our project website: https://tum-traffic-dataset.github.io/tumtraf-a.

[92] Controllable Latent Space Augmentation for Digital Pathology

Sofiène Boutaj,Marin Scalbert,Pierre Marza,Florent Couzinie-Devy,Maria Vakalopoulou,Stergios Christodoulidis

Main category: cs.CV

TL;DR: HistAug is a generative model for controllable augmentations in digital pathology that improves MIL model performance, especially in low-data regimes.

Details

Motivation: Digital pathology presents challenges due to the high resolution of whole slide images and the scarcity of dense supervision signals. While Multiple Instance Learning (MIL) is suitable for slide-level tasks, training robust models requires large and diverse datasets, which are often not available. Image augmentation techniques can help increase data variability and reduce overfitting, but existing methods are either prohibitively expensive or lack control over transformation semantics. Method: HistAug conditions on explicit patch-level transformations (e.g., hue, erosion) to generate realistic augmented embeddings while preserving initial semantic information, enabling efficient processing of a large number of patches in a single forward pass. Result: Experiments across multiple slide-level tasks and diverse organs show that HistAug outperforms existing methods, particularly in low-data regimes. Ablation studies confirm the benefits of learned transformations over noise-based perturbations and highlight the importance of uniform WSI-wise augmentation. Conclusion: HistAug is a fast and efficient generative model that improves MIL model performance in digital pathology by enabling controllable augmentations in the latent space. Abstract: Whole slide image (WSI) analysis in digital pathology presents unique challenges due to the gigapixel resolution of WSIs and the scarcity of dense supervision signals. While Multiple Instance Learning (MIL) is a natural fit for slide-level tasks, training robust models requires large and diverse datasets. Even though image augmentation techniques could be utilized to increase data variability and reduce overfitting, implementing them effectively is not a trivial task. Traditional patch-level augmentation is prohibitively expensive due to the large number of patches extracted from each WSI, and existing feature-level augmentation methods lack control over transformation semantics. 
We introduce HistAug, a fast and efficient generative model for controllable augmentations in the latent space for digital pathology. By conditioning on explicit patch-level transformations (e.g., hue, erosion), HistAug generates realistic augmented embeddings while preserving initial semantic information. Our method allows the processing of a large number of patches in a single forward pass efficiently, while at the same time consistently improving MIL model performance. Experiments across multiple slide-level tasks and diverse organs show that HistAug outperforms existing methods, particularly in low-data regimes. Ablation studies confirm the benefits of learned transformations over noise-based perturbations and highlight the importance of uniform WSI-wise augmentation. Code is available at https://github.com/MICS-Lab/HistAug.

[93] Reliable Smoke Detection via Optical Flow-Guided Feature Fusion and Transformer-Based Uncertainty Modeling

Nitish Kumar Mahala,Muzammil Khan,Pushpendra Kumar

Main category: cs.CV

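TL;DR: This paper proposes an information-fusion framework for smoke detection from monocular imagery, using a two-phase uncertainty-aware Shifted Windows Transformer to improve the reliability of early fire warning systems.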

Details

Motivation: The spatiotemporal dynamics of smoke are affected by illumination variability, fluid motion, and environmental noise, making reliable detection difficult for traditional detectors. This work aims to develop a highly reliable smoke detection method that does not require multi-sensor arrays. Method: Optical flow is estimated with a four-color-theorem-inspired dual-phase level-set fractional-order variational model, and the color-encoded optical flow maps are fused with appearance cues via a Gaussian Mixture Model to generate binary segmentation masks of smoke regions. These fused representations are fed into a novel Shifted Windows Transformer, which is augmented with a multi-scale uncertainty estimation head and trained under a two-phase learning regimen. Result: Experiments using multiple evaluation metrics and comparisons with state-of-the-art methods show superior generalization and robustness, offering a reliable solution for early fire detection in surveillance, industrial safety, and autonomous monitoring applications. Conclusion: The method improves the accuracy and reliability of smoke detection for early fire warning systems and has broad application prospects. Abstract: Fire outbreaks pose critical threats to human life and infrastructure, necessitating high-fidelity early-warning systems that detect combustion precursors such as smoke. However, smoke plumes exhibit complex spatiotemporal dynamics influenced by illumination variability, flow kinematics, and environmental noise, undermining the reliability of traditional detectors. To address these challenges without the logistical complexity of multi-sensor arrays, we propose an information-fusion framework by integrating smoke feature representations extracted from monocular imagery. Specifically, a Two-Phase Uncertainty-Aware Shifted Windows Transformer for robust and reliable smoke detection, leveraging a novel smoke segmentation dataset, constructed via optical flow-based motion encoding, is proposed. The optical flow estimation is performed with a four-color-theorem-inspired dual-phase level-set fractional-order variational model, which preserves motion discontinuities. The resulting color-encoded optical flow maps are fused with appearance cues via a Gaussian Mixture Model to generate binary segmentation masks of the smoke regions. These fused representations are fed into the novel Shifted-Windows Transformer, which is augmented with a multi-scale uncertainty estimation head and trained under a two-phase learning regimen. The first learning phase optimizes smoke detection accuracy, while during the second phase, the model learns to estimate plausibility confidence in its predictions by jointly modeling aleatoric and epistemic uncertainties. 
Extensive experiments using multiple evaluation metrics and comparative analysis with state-of-the-art approaches demonstrate superior generalization and robustness, offering a reliable solution for early fire detection in surveillance, industrial safety, and autonomous monitoring applications.

[94] Incremental Object Detection with Prompt-based Methods

Matthias Neuwirth-Trapp,Maarten Bieshaar,Danda Pani Paudel,Luc Van Gool

Main category: cs.CV

TL;DR: The paper explores the use of visual prompt-based methods in incremental object detection and finds that combining these methods with replaying previous data achieves the best results.

Details

Motivation: The motivation is to understand the generalizability of visual prompt-based methods in incremental object detection, as no prior work has applied such methods to this field. Method: The paper analyzes three different prompt-based methods and provides a wide range of reference baselines for comparison. Result: The empirical results show that the tested prompt-based approaches underperform in the complex domain-incremental learning setting. Conclusion: Visual prompt-based methods can be improved for incremental object detection by combining them with replaying previous data. Abstract: Visual prompt-based methods have seen growing interest in incremental learning (IL) for image classification. These approaches learn additional embedding vectors while keeping the model frozen, making them efficient to train. However, no prior work has applied such methods to incremental object detection (IOD), leaving their generalizability unclear. In this paper, we analyze three different prompt-based methods under a complex domain-incremental learning setting. We additionally provide a wide range of reference baselines for comparison. Empirically, we show that the prompt-based approaches we tested underperform in this setting. However, a strong yet practical method, combining visual prompts with replaying a small portion of previous data, achieves the best results. Together with additional experiments on prompt length and initialization, our findings offer valuable insights for advancing prompt-based IL in IOD.

[95] UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling

Peiming Li,Ziyi Wang,Yulin Yuan,Hong Liu,Xiangming Meng,Junsong Yuan,Mengyuan Liu

Main category: cs.CV

TL;DR: The paper proposes UST-SSM, a novel method for modeling point cloud videos by addressing spatio-temporal disorder, achieving effective recognition of dynamic human actions.

Details

Motivation: Point cloud videos are effective for recognizing subtle human actions but are limited by spatio-temporal disorder. Selective State Space Models (SSMs) have linear complexity but struggle with unordered data, necessitating a new approach to model such videos effectively. Method: The Unified Spatio-Temporal State Space Model (UST-SSM) is introduced, featuring Spatial-Temporal Selection Scanning (STSS), Spatio-Temporal Structure Aggregation (STSA), and Temporal Interaction Sampling (TIS) to process unordered point cloud data and enhance temporal interactions. Result: The experimental results on MSR-Action3D, NTU RGB+D, and Synthia 4D datasets confirm the effectiveness of the proposed UST-SSM method. Conclusion: UST-SSM effectively addresses the spatio-temporal disorder in point cloud videos, improving performance in recognizing dynamic human actions. Abstract: Point cloud videos capture dynamic 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Although Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity, the spatio-temporal disorder of point cloud videos hinders their unidirectional modeling when directly unfolding the point cloud video into a 1D sequence through temporally sequential scanning. To address this challenge, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos. Specifically, we introduce Spatial-Temporal Selection Scanning (STSS), which reorganizes unordered points into semantic-aware sequences through prompt-guided clustering, thereby enabling the effective utilization of points that are spatially and temporally distant yet similar within the sequence. For missing 4D geometric and motion details, Spatio-Temporal Structure Aggregation (STSA) aggregates spatio-temporal features and compensates. 
To improve temporal interaction within the sampled sequence, Temporal Interaction Sampling (TIS) enhances fine-grained temporal dependencies through non-anchor frame utilization and expanded receptive fields. Experimental results on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of our method. Our code is available at https://github.com/wangzy01/UST-SSM.

[96] SMTrack: End-to-End Trained Spiking Neural Networks for Multi-Object Tracking in RGB Videos

Pengzhi Zhong,Xinzhe Wang,Dan Zeng,Qihua Zhou,Feixiang He,Shuiwang Li

Main category: cs.CV

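TL;DR: This paper proposes SMTrack, the first directly trained deep SNN framework for end-to-end multi-object tracking on standard RGB videos, which improves tracking performance by introducing Asa-NWDLoss and the TrackTrack identity module.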

Details

Motivation: Although brain-inspired Spiking Neural Networks (SNNs) show great potential for low-power computation, their application in visual tasks has been largely confined to image classification, object detection, and event-based tracking, while their potential for complex temporal tasks on conventional RGB video streams, such as multi-object tracking (MOT), remains underexplored. Method: The paper proposes SMTrack, a directly trained deep SNN framework for end-to-end multi-object tracking on standard RGB videos. It introduces an adaptive and scale-aware Normalized Wasserstein Distance loss (Asa-NWDLoss) and incorporates the TrackTrack identity module to keep object trajectories robust and consistent. Result: Extensive evaluations on the BEE24, MOT17, MOT20, and DanceTrack datasets show that SMTrack achieves performance on par with leading ANN-based MOT methods. Conclusion: SMTrack demonstrates the potential for robust and accurate SNN-based tracking in complex scenarios and achieves performance comparable to leading ANN-based MOT methods. Abstract: Brain-inspired Spiking Neural Networks (SNNs) exhibit significant potential for low-power computation, yet their application in visual tasks remains largely confined to image classification, object detection, and event-based tracking. In contrast, real-world vision systems still widely use conventional RGB video streams, where the potential of directly-trained SNNs for complex temporal tasks such as multi-object tracking (MOT) remains underexplored. To address this challenge, we propose SMTrack, the first directly trained deep SNN framework for end-to-end multi-object tracking on standard RGB videos. SMTrack introduces an adaptive and scale-aware Normalized Wasserstein Distance loss (Asa-NWDLoss) to improve detection and localization performance under varying object scales and densities. Specifically, the method computes the average object size within each training batch and dynamically adjusts the normalization factor, thereby enhancing sensitivity to small objects. For the association stage, we incorporate the TrackTrack identity module to maintain robust and consistent object trajectories. Extensive evaluations on BEE24, MOT17, MOT20, and DanceTrack show that SMTrack achieves performance on par with leading ANN-based MOT methods, advancing robust and accurate SNN-based tracking in complex scenarios.
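
The idea behind a batch-adaptive Normalized Wasserstein Distance loss can be sketched as follows. The sketch uses the common formulation in which a box `(cx, cy, w, h)` is treated as a 2D Gaussian and the similarity is `exp(-W2 / C)`. How Asa-NWDLoss actually derives its normalization factor is not stated in the abstract, so setting `C` to the square root of the mean ground-truth box area in the batch is purely an assumption, and all function names here are hypothetical.

```python
import math

def nwd(box_a, box_b, c):
    """Normalized Wasserstein similarity between two (cx, cy, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # 2-Wasserstein distance between the Gaussians that model the two boxes.
    w2 = math.sqrt(
        (ax - bx) ** 2 + (ay - by) ** 2
        + ((aw - bw) / 2) ** 2 + ((ah - bh) / 2) ** 2
    )
    return math.exp(-w2 / c)

def batch_scale(gt_boxes):
    """Assumed adaptive constant: square root of the mean box area in the batch."""
    mean_area = sum(w * h for _, _, w, h in gt_boxes) / len(gt_boxes)
    return math.sqrt(mean_area)

def adaptive_nwd_loss(pred_boxes, gt_boxes):
    """Mean of (1 - NWD) over matched prediction and ground-truth pairs."""
    c = batch_scale(gt_boxes)
    losses = [1.0 - nwd(p, g, c) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(losses) / len(losses)

# The same 2-pixel centre shift in a small-object and a large-object batch.
small_loss = adaptive_nwd_loss([(12, 10, 4, 4)], [(10, 10, 4, 4)])
large_loss = adaptive_nwd_loss([(12, 10, 40, 40)], [(10, 10, 40, 40)])
print(small_loss, large_loss)
```

Under this assumed scaling, a fixed pixel error is penalized more heavily when the batch contains small objects, which is one plausible reading of "enhancing sensitivity to small objects".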

[97] AnchorSync: Global Consistency Optimization for Long Video Editing

Zichi Liu,Yinggui Wang,Tao Wei,Chao Ma

Main category: cs.CV

TL;DR: AnchorSync improves long-term video editing by maintaining structural and temporal consistency, producing high-quality results.

Details

Motivation: Editing long videos is challenging due to the need for global consistency and temporal coherence, which existing methods often fail to maintain. Method: AnchorSync decouples video editing into sparse anchor frame editing and intermediate frame interpolation, using a progressive denoising process and multimodal guidance. Result: AnchorSync achieves high-fidelity, coherent video edits and surpasses prior methods in both visual quality and temporal stability. Conclusion: AnchorSync offers a novel diffusion-based solution for long-term video editing, outperforming previous methods in visual quality and temporal stability. Abstract: Editing long videos remains a challenging task due to the need for maintaining both global consistency and temporal coherence across thousands of frames. Existing methods often suffer from structural drift or temporal artifacts, particularly in minute-long sequences. We introduce AnchorSync, a novel diffusion-based framework that enables high-quality, long-term video editing by decoupling the task into sparse anchor frame editing and smooth intermediate frame interpolation. Our approach enforces structural consistency through a progressive denoising process and preserves temporal dynamics via multimodal guidance. Extensive experiments show that AnchorSync produces coherent, high-fidelity edits, surpassing prior methods in visual quality and temporal stability.

[98] Towards PerSense++: Advancing Training-Free Personalized Instance Segmentation in Dense Images

Muhammad Ibraheem Siddiqui,Muhammad Umer Sheikh,Hassan Abid,Kevin Henry,Muhammad Haris Khan

Main category: cs.CV

TL;DR: This paper proposes PerSense, a personalized instance segmentation framework for dense images, and further proposes PerSense++ to improve robustness in cluttered scenes.

Details

Motivation: Segmentation in dense visual scenes faces challenges such as occlusion, background clutter, and scale variation, which call for a new approach to improve segmentation quality. Method: PerSense uses an Instance Detection Module (IDM) and a Point Prompt Selection Module (PPSM), and improves density map quality through a feedback mechanism. PerSense++ adds a diversity-aware exemplar selection strategy, a hybrid IDM, and an Irrelevant Mask Rejection Module (IMRM) to improve performance. Result: Experiments show that PerSense++ outperforms existing methods on multiple benchmarks, particularly in dense scenes. Conclusion: PerSense and its enhanced variant PerSense++ provide an effective solution for personalized instance segmentation in dense images. Abstract: Segmentation in dense visual scenes poses significant challenges due to occlusions, background clutter, and scale variations. To address this, we introduce PerSense, an end-to-end, training-free, and model-agnostic one-shot framework for Personalized instance Segmentation in dense images. PerSense employs a novel Instance Detection Module (IDM) that leverages density maps (DMs) to generate instance-level candidate point prompts, followed by a Point Prompt Selection Module (PPSM) that filters false positives via adaptive thresholding and spatial gating. A feedback mechanism further enhances segmentation by automatically selecting effective exemplars to improve DM quality. We additionally present PerSense++, an enhanced variant that incorporates three additional components to improve robustness in cluttered scenes: (i) a diversity-aware exemplar selection strategy that leverages feature and scale diversity for better DM generation; (ii) a hybrid IDM combining contour and peak-based prompt generation for improved instance separation within complex density patterns; and (iii) an Irrelevant Mask Rejection Module (IMRM) that discards spatially inconsistent masks using outlier analysis. Finally, to support this underexplored task, we introduce PerSense-D, a dedicated benchmark for personalized segmentation in dense images. Extensive experiments across multiple benchmarks demonstrate that PerSense++ outperforms existing methods in dense settings.

[99] GeMS: Efficient Gaussian Splatting for Extreme Motion Blur

Gopi Raju Matta,Trisha Reddypalli,Vemunuri Divya Madhuri,Kaushik Mitra

Main category: cs.CV

TL;DR: GeMS is a 3D Gaussian Splatting framework that reconstructs scenes directly from severely motion-blurred images, overcoming limitations of existing methods that require sharp images. GeMS-E enhances this with event-based deblurring.

Details

Motivation: Existing methods like ExBluRF, Deblur-GS, and BAD-Gaussians fail under severe motion blur due to reliance on sharp images or unreliable feature correspondences. This work aims to enable 3D reconstruction directly from severely blurred images. Method: GeMS integrates (1) VGGSfM for pose estimation and point cloud generation from blurred inputs, (2) 3DGS-MCMC for robust scene initialization, and (3) joint optimization of camera trajectories and Gaussian parameters. GeMS-E adds (4) EDI deblurring for progressive refinement. Result: Both GeMS and GeMS-E achieve state-of-the-art performance on synthetic and real-world datasets, with GeMS-E further improving pose estimation, point cloud generation, and overall reconstruction through EDI deblurring. Conclusion: GeMS and GeMS-E represent the first frameworks to address extreme motion blur within 3D Gaussian Splatting directly from severely blurred inputs, achieving state-of-the-art performance on synthetic and real-world datasets. Abstract: We introduce GeMS, a framework for 3D Gaussian Splatting (3DGS) designed to handle severely motion-blurred images. State-of-the-art deblurring methods for extreme blur, such as ExBluRF, as well as Gaussian Splatting-based approaches like Deblur-GS, typically assume access to sharp images for camera pose estimation and point cloud generation, an unrealistic assumption. Methods relying on COLMAP initialization, such as BAD-Gaussians, also fail due to unreliable feature correspondences under severe blur. To address these challenges, we propose GeMS, a 3DGS framework that reconstructs scenes directly from extremely blurred images. 
GeMS integrates: (1) VGGSfM, a deep learning-based Structure-from-Motion pipeline that estimates poses and generates point clouds directly from blurred inputs; (2) 3DGS-MCMC, which enables robust scene initialization by treating Gaussians as samples from a probability distribution, eliminating heuristic densification and pruning; and (3) joint optimization of camera trajectories and Gaussian parameters for stable reconstruction. While this pipeline produces strong results, inaccuracies may remain when all inputs are severely blurred. To mitigate this, we propose GeMS-E, which integrates a progressive refinement step using events: (4) Event-based Double Integral (EDI) deblurring restores sharper images that are then fed into GeMS, improving pose estimation, point cloud generation, and overall reconstruction. Both GeMS and GeMS-E achieve state-of-the-art performance on synthetic and real-world datasets. To our knowledge, this is the first framework to address extreme motion blur within 3DGS directly from severely blurred inputs.
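
The Event-based Double Integral step used by GeMS-E can be sketched for a single pixel. The sketch assumes the usual EDI model, in which each event multiplies the latent intensity by exp(±c) for a contrast threshold `c`, and the blurred value is the time average of the latent intensity over the exposure. The function name, argument layout, and the choice of the exposure start as the reference time are illustrative assumptions, not taken from the GeMS-E implementation.

```python
import math

def edi_latent_intensity(blurred, events, t_start, t_end, c):
    """Recover the latent intensity at t_start for one pixel.

    blurred: blurred pixel value (time average over the exposure).
    events:  list of (timestamp, polarity), polarity in {+1, -1},
             sorted by time and lying inside [t_start, t_end].
    c:       contrast threshold of the event camera (assumed known).
    """
    integral = 0.0      # running value of the integral of exp(c * E(t)) dt
    cumulative = 0      # E(t): sum of polarities since t_start
    previous_t = t_start
    for timestamp, polarity in events:
        integral += math.exp(c * cumulative) * (timestamp - previous_t)
        cumulative += polarity
        previous_t = timestamp
    integral += math.exp(c * cumulative) * (t_end - previous_t)
    exposure = t_end - t_start
    # blurred = latent * integral / exposure, solved for latent
    return blurred * exposure / integral
```

Applying this per pixel gives a sharper frame that could then be passed to the pose estimation and point cloud stages in place of the blurred input.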

[100] Seeing Further on the Shoulders of Giants: Knowledge Inheritance for Vision Foundation Models

Jiabo Huang,Chen Chen,Lingjuan Lyu

Main category: cs.CV

TL;DR: This paper introduces a model-driven approach for training vision foundation models by combining knowledge from multiple pretrained models, eliminating reliance on large-scale labeled data and outperforming traditional data-centric methods across diverse vision tasks.

Details

Motivation: The motivation stems from the limitations of data-centric methods that require vast, high-quality labeled data and the underutilization of pretrained open-source vision models that can transfer core domain knowledge. Method: The method involves joint knowledge transfer and preservation by unifying multiple pretrained teacher models in a shared latent space and using an adapter module to integrate knowledge from specific teachers into a general-purpose teacher. Result: The proposed method outperforms existing data-centric models in four fundamental vision tasks: image classification, object detection, semantic segmentation, and instance segmentation. Conclusion: The proposed model-driven approach effectively trains vision foundation models by leveraging existing pretrained models, offering superior performance over data-centric models across multiple vision tasks. Abstract: Vision foundation models (VFMs) are predominantly developed using data-centric methods. These methods require training on vast amounts of data usually with high-quality labels, which poses a bottleneck for most institutions that lack both large-scale data and high-end GPUs. On the other hand, many open-source vision models have been pretrained on domain-specific data, enabling them to distill and represent core knowledge in a form that is transferable across diverse applications. Even though these models are highly valuable assets, they remain largely under-explored in empowering the development of a general-purpose VFM. In this paper, we present a new model-driven approach for training VFMs through joint knowledge transfer and preservation. Our method unifies multiple pre-trained teacher models in a shared latent space to mitigate the "imbalanced transfer" issue caused by their distributional gaps. 
Besides, we introduce a knowledge preservation strategy to take a general-purpose teacher as a knowledge base for integrating knowledge from the remaining purpose-specific teachers using an adapter module. By unifying and aggregating existing models, we build a powerful VFM to inherit teachers' expertise without needing to train on a large amount of labeled data. Our model not only provides generalizable visual features, but also inherently supports multiple downstream tasks. Extensive experiments demonstrate that our VFM outperforms existing data-centric models across four fundamental vision tasks, including image classification, object detection, semantic and instance segmentation.

[101] GSFix3D: Diffusion-Guided Repair of Novel Views in Gaussian Splatting

Jiaxin Wei,Stefan Leutenegger,Simon Schaefer

Main category: cs.CV

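TL;DR: This paper proposes GSFix3D, which combines diffusion models with 3D representations to improve visual fidelity in under-constrained regions and enable robust novel view repair, reaching state-of-the-art performance with only minimal scene-specific fine-tuning.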

Details

Motivation: 3D Gaussian Splatting still struggles to produce high-quality renderings from extreme novel viewpoints or partially observed regions, while diffusion models' lack of awareness of scene-specific information limits their accurate use in 3D reconstruction. Method: The paper introduces GSFix3D, whose core is GSFixer, a latent diffusion model obtained through a customized fine-tuning protocol, and proposes a random mask augmentation strategy so that missing regions can be plausibly inpainted. Result: On challenging benchmarks, GSFix3D and GSFixer achieve state-of-the-art performance with only minimal scene-specific fine-tuning on captured data, and are robust to potential pose errors. Conclusion: By distilling prior knowledge from diffusion models, GSFix3D improves visual fidelity in under-constrained regions while preserving consistency with observed scene details. Abstract: Recent developments in 3D Gaussian Splatting have significantly enhanced novel view synthesis, yet generating high-quality renderings from extreme novel viewpoints or partially observed regions remains challenging. Meanwhile, diffusion models exhibit strong generative capabilities, but their reliance on text prompts and lack of awareness of specific scene information hinder accurate 3D reconstruction tasks. To address these limitations, we introduce GSFix3D, a novel framework that improves the visual fidelity in under-constrained regions by distilling prior knowledge from diffusion models into 3D representations, while preserving consistency with observed scene details. At its core is GSFixer, a latent diffusion model obtained via our customized fine-tuning protocol that can leverage both mesh and 3D Gaussians to adapt pretrained generative models to a variety of environments and artifact types from different reconstruction methods, enabling robust novel view repair for unseen camera poses. Moreover, we propose a random mask augmentation strategy that empowers GSFixer to plausibly inpaint missing regions. Experiments on challenging benchmarks demonstrate that our GSFix3D and GSFixer achieve state-of-the-art performance, requiring only minimal scene-specific fine-tuning on captured data. A real-world test further confirms its resilience to potential pose errors. Our code and data will be made publicly available. Project page: https://gsfix3d.github.io.

[102] Multiscale Video Transformers for Class Agnostic Segmentation in Autonomous Driving

Leila Cheshmi,Mennatullah Siam

Main category: cs.CV

TL;DR: This paper proposes an efficient, class-agnostic video transformer for detecting unknown objects using motion cues, enabling robust and real-time dense prediction in autonomous driving.

Details

Motivation: The motivation stems from the need to handle unknown objects and unforeseen driving scenarios in autonomous driving, where traditional segmentation methods fail due to reliance on known classes and computational inefficiency. Method: The method uses multi-stage multiscale query-memory decoding and a scale-specific random drop-token mechanism, avoiding optical flow and relying solely on motion cues for unknown object detection. Result: Evaluated on DAVIS'16, KITTI, and Cityscapes, the method outperforms multiscale baselines in performance while maintaining efficiency in GPU memory and runtime. Conclusion: The proposed video transformer demonstrates efficiency and accuracy in class-agnostic segmentation, offering a promising solution for real-time dense prediction in autonomous driving. Abstract: Ensuring safety in autonomous driving is a complex challenge requiring handling unknown objects and unforeseen driving scenarios. We develop multiscale video transformers capable of detecting unknown objects using only motion cues. Video semantic and panoptic segmentation often relies on known classes seen during training, overlooking novel categories. Recent visual grounding with large language models is computationally expensive, especially for pixel-level output. We propose an efficient video transformer trained end-to-end for class-agnostic segmentation without optical flow. Our method uses multi-stage multiscale query-memory decoding and a scale-specific random drop-token to ensure efficiency and accuracy, maintaining detailed spatiotemporal features with a shared, learnable memory module. Unlike conventional decoders that compress features, our memory-centric design preserves high-resolution information at multiple scales. We evaluate on DAVIS'16, KITTI, and Cityscapes. 
Our method consistently outperforms multiscale baselines while being efficient in GPU memory and run-time, demonstrating a promising direction for real-time, robust dense prediction in safety-critical robotics.

[103] Improved Mapping Between Illuminations and Sensors for RAW Images

Abhijith Punnappurath,Luxi Zhao,Hoang Le,Abdelrahman Abdelhamed,SaiKiran Kumar Tedla,Michael S. Brown

Main category: cs.CV

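TL;DR: This paper presents a lightweight neural network approach for RAW image processing and releases a new dataset covering many illumination conditions, multiple cameras, and multiple scenes.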

Details

Motivation: RAW images have sensor-specific RGB values and strong color casts caused by the sensor's response to the spectral properties of scene illumination, which makes capturing RAW datasets for deep learning methods challenging. Method: Several scenes are captured using a customized lightbox with tunable illumination spectra, and a lightweight neural network approach is proposed for illumination and sensor mapping. Result: The paper introduces the first dataset of its kind, with 390 illuminations, 4 cameras, and 18 scenes, and uses it to demonstrate a lightweight neural network approach that outperforms existing methods. Conclusion: The paper presents a lightweight neural network approach for illumination and sensor mapping and demonstrates its benefit on the downstream task of training a neural ISP. Abstract: RAW images are unprocessed camera sensor output with sensor-specific RGB values based on the sensor's color filter spectral sensitivities. RAW images also incur strong color casts due to the sensor's response to the spectral properties of scene illumination. The sensor- and illumination-specific nature of RAW images makes it challenging to capture RAW datasets for deep learning methods, as scenes need to be captured for each sensor and under a wide range of illumination. Methods for illumination augmentation for a given sensor and the ability to map RAW images between sensors are important for reducing the burden of data capture. To explore this problem, we introduce the first-of-its-kind dataset comprising carefully captured scenes under a wide range of illumination. Specifically, we use a customized lightbox with tunable illumination spectra to capture several scenes with different cameras. Our illumination and sensor mapping dataset has 390 illuminations, four cameras, and 18 scenes. Using this dataset, we introduce a lightweight neural network approach for illumination and sensor mapping that outperforms competing methods. We demonstrate the utility of our approach on the downstream task of training a neural ISP. Link to project page: https://github.com/SamsungLabs/illum-sensor-mapping.

[104] Fusing Monocular RGB Images with AIS Data to Create a 6D Pose Estimation Dataset for Marine Vessels

Fabian Holst,Emre Gülsoylu,Simone Frintrop

Main category: cs.CV

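TL;DR: The paper presents a new method for generating a 6D pose estimation dataset for marine vessels by fusing monocular images with AIS data, introduces the BONK-pose dataset, and validates its usefulness for 6D pose estimation.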

Details

Motivation: The work addresses the limitations of relying solely on AIS for location information, such as equipment reliability, data manipulation, and transmission delays. Method: Monocular RGB images are combined with AIS data: the YOLOX-X object detection network detects vessels, and two transformation methods (homography and Perspective-n-Point) are compared for aligning AIS data with image coordinates. Result: The Perspective-n-Point method yields lower projection error than the homography-based method, and the YOLOX-X model reaches a mean Average Precision (mAP) of 0.80 at an IoU threshold of 0.5; the work produces the BONK-pose dataset (3753 images with 3D bounding box annotations) and a set of 1000 images with 2D bounding box annotations. Conclusion: The paper proposes a novel technique that fuses monocular RGB images with AIS data to generate a 6D pose estimation dataset for marine vessels. The approach addresses the limitations of relying solely on AIS for location information and makes it possible to create a 6D pose estimation dataset without manual annotation. Abstract: The paper presents a novel technique for creating a 6D pose estimation dataset for marine vessels by fusing monocular RGB images with Automatic Identification System (AIS) data. The proposed technique addresses the limitations of relying purely on AIS for location information, caused by issues like equipment reliability, data manipulation, and transmission delays. By combining vessel detections from monocular RGB images, obtained using an object detection network (YOLOX-X), with AIS messages, the technique generates 3D bounding boxes that represent the vessels' 6D poses, i.e. spatial and rotational dimensions. The paper evaluates different object detection models to locate vessels in image space. We also compare two transformation methods (homography and Perspective-n-Point) for aligning AIS data with image coordinates. The results of our work demonstrate that the Perspective-n-Point (PnP) method achieves a significantly lower projection error compared to homography-based approaches used before, and the YOLOX-X model achieves a mean Average Precision (mAP) of 0.80 at an Intersection over Union (IoU) threshold of 0.5 for relevant vessel classes. Our results indicate that our approach allows the creation of a 6D pose estimation dataset without needing manual annotation. Additionally, we introduce the Boats on Nordelbe Kehrwieder (BONK-pose), a publicly available dataset comprising 3753 images with 3D bounding box annotations for pose estimation, created by our data fusion approach. 
This dataset can be used for training and evaluating 6D pose estimation networks. In addition, we introduce a set of 1000 images with 2D bounding box annotations for ship detection from the same scene.

[105] 6-DoF Object Tracking with Event-based Optical Flow and Frames

Zhichao Li,Arren Glover,Chiara Bartolozzi,Lorenzo Natale

Main category: cs.CV

TL;DR: This paper proposes a 6-DoF object pose tracking method combining event-based optical flow and RGB-based pose estimation, demonstrating effectiveness in high-speed scenarios.

Details

Motivation: Tracking the 6-DoF pose of objects in real time is crucial in robotics, but becomes challenging at high speeds due to motion blur and frame rate limitations of conventional cameras. Event cameras offer potential advantages with high temporal resolution and low latency. Method: The method combines event-based optical flow for object motion measurement with a global object pose estimator using RGB data. The event-based optical flow tracks the 6-DoF velocity of objects, which is then integrated with low-frequency pose estimates to achieve accurate high-speed pose tracking. Result: The algorithm was tested on both synthetic and real-world data, showing effectiveness in high-speed motion scenarios. Conclusion: The proposed algorithm effectively tracks the 6-DoF pose of high-speed objects by integrating event-based optical flow with RGB-based global pose estimation, leveraging the advantages of both sensor types. Abstract: Tracking the position and orientation of objects in space (i.e., in 6-DoF) in real time is a fundamental problem in robotics for environment interaction. It becomes more challenging when objects move at high speed due to frame rate limitations in conventional cameras and motion blur. Event cameras are characterized by high temporal resolution, low latency and high dynamic range, which can potentially overcome the impacts of motion blur. Traditional RGB cameras provide rich visual information that is more suitable for the challenging task of single-shot object pose estimation. In this work, we propose using event-based optical flow combined with an RGB-based global object pose estimator for 6-DoF pose tracking of objects at high speed, exploiting the core advantages of both types of vision sensors. Specifically, we propose an event-based optical flow algorithm for object motion measurement to implement an object 6-DoF velocity tracker. 
By integrating the tracked object 6-DoF velocity with low-frequency pose estimates from the global pose estimator, the method can track pose when objects move at high speed. The proposed algorithm is tested and validated on both synthetic and real-world data, demonstrating its effectiveness, especially in high-speed motion scenarios.
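
The idea of integrating a high-rate 6-DoF velocity between low-rate global pose estimates can be sketched as follows. This is a minimal illustration, not the paper's algorithm: velocities are assumed to be expressed in the world frame, rotations are composed with Rodrigues' formula, and a global pose simply overwrites the integrated one when it arrives. All function names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotation_from_rotvec(rotvec):
    """Rodrigues' formula: rotation matrix for an axis-angle vector (radians)."""
    theta = math.sqrt(sum(x * x for x in rotvec))
    identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    if theta < 1e-12:
        return identity
    kx, ky, kz = (x / theta for x in rotvec)
    K = [[0.0, -kz, ky], [kz, 0.0, -kx], [-ky, kx, 0.0]]
    K2 = matmul(K, K)
    s, c = math.sin(theta), 1.0 - math.cos(theta)
    return [[identity[i][j] + s * K[i][j] + c * K2[i][j] for j in range(3)]
            for i in range(3)]

def propagate(position, rotation, linear_vel, angular_vel, dt):
    """Advance a pose by one step of constant world-frame velocity."""
    new_position = [p + v * dt for p, v in zip(position, linear_vel)]
    delta = rotation_from_rotvec([w * dt for w in angular_vel])
    return new_position, matmul(delta, rotation)

def track(initial_pose, velocity_samples, global_poses=None):
    """Integrate velocity samples, resetting to global poses when available.

    velocity_samples: list of (dt, linear_vel, angular_vel).
    global_poses: optional {step_index: (position, rotation)} from the slow
    estimator; overriding the integrated pose is an assumed fusion rule.
    """
    global_poses = global_poses or {}
    position, rotation = initial_pose
    for step, (dt, linear_vel, angular_vel) in enumerate(velocity_samples):
        position, rotation = propagate(position, rotation,
                                       linear_vel, angular_vel, dt)
        if step in global_poses:
            position, rotation = global_poses[step]
    return position, rotation
```

A real tracker would more likely blend the two sources, for example with a filter, rather than overwrite; the hard reset keeps the sketch short.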

[106] Adversarial Hospital-Invariant Feature Learning for WSI Patch Classification

Mengliang Zhang,Jacob M. Luber

Main category: cs.CV

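TL;DR: This study addresses the domain bias that hospital differences introduce into pathology foundation models and proposes a lightweight adversarial framework that effectively improves cross-domain generalization.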

Details

Motivation: Pathology images from different hospitals vary because of differences in scanning hardware and preprocessing, which may lead models to learn hospital-specific features and poses a risk for clinical deployment. A method that mitigates hospital bias is therefore needed to improve cross-domain generalization. Method: The authors build a pipeline for quantifying domain bias in pathology foundation models, evaluate and compare the performance of multiple models, and propose a lightweight adversarial framework in which a trainable adapter and a domain classifier are connected through a gradient reversal layer (GRL) to remove latent hospital-specific features. Result: Experiments show that the method substantially reduces domain predictability while maintaining or even improving disease classification performance, particularly in out-of-domain (unseen hospital) scenarios. Hospital detection and feature space visualization further confirm its effectiveness. Conclusion: By introducing a trainable adapter and a domain classifier, the study proposes an effective adversarial framework that removes hospital-specific features without modifying the encoder, improving the cross-domain generalization of pathology image analysis. Abstract: Pathology foundation models (PFMs) have demonstrated remarkable potential in whole-slide image (WSI) diagnosis. However, pathology images from different hospitals often vary due to differences in scanning hardware and preprocessing styles, which may lead PFMs to inadvertently learn hospital-specific features, posing risks for clinical deployment. In this work, we present the first systematic study of domain bias in PFMs arising from hospital source characteristics. Specifically, we (1) construct a pipeline for quantifying domain bias in PFMs, (2) evaluate and compare the performance of multiple models, and (3) propose a lightweight adversarial framework that removes latent hospital-specific features from frozen representations without modifying the encoder itself. By introducing a trainable adapter and a domain classifier connected through a gradient reversal layer (GRL), our method learns task-discriminative yet domain-invariant representations. Experiments on multi-center histopathology datasets demonstrate that our approach substantially reduces domain predictability while maintaining or even improving disease classification performance, particularly in out-of-domain (unseen hospital) scenarios. Further analyses, including hospital detection and feature space visualization, confirm the effectiveness of our method in mitigating hospital bias. We will provide our code upon acceptance.
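
A gradient reversal layer is the identity in the forward pass and negates (and optionally scales) the gradient in the backward pass, so the adapter is pushed to make hospital prediction harder while the domain classifier still learns normally. The framework-free sketch below shows only that mechanism; the class name, the `strength` parameter, and the `adapter_gradient` helper are hypothetical and do not come from the paper's code.

```python
class GradientReversal:
    """Identity on the forward pass; multiplies gradients by -strength
    on the backward pass."""

    def __init__(self, strength=1.0):
        self.strength = strength

    def forward(self, features):
        # Features reach the domain classifier unchanged.
        return list(features)

    def backward(self, grad_from_domain_classifier):
        # The adapter receives the reversed domain gradient.
        return [-self.strength * g for g in grad_from_domain_classifier]


def adapter_gradient(task_grad, domain_grad, grl):
    """Gradient reaching the adapter output: the disease-classification
    gradient plus the reversed hospital-classification gradient."""
    reversed_grad = grl.backward(domain_grad)
    return [t + r for t, r in zip(task_grad, reversed_grad)]
```

In an autograd framework the same behavior is usually written as a custom function with a hand-defined backward pass.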

[107] MF-LPR$^2$: Multi-Frame License Plate Image Restoration and Recognition using Optical Flow

Kihyun Na,Junseok Oh,Youngkwan Cho,Bumjin Kim,Sungmin Cho,Jinyoung Choi,Injung Kim

Main category: cs.CV

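TL;DR: This paper proposes MF-LPR$^2$, a new multi-frame license plate restoration and recognition framework that improves recognition accuracy on low-quality images through frame alignment and aggregation; experiments show it significantly outperforms existing methods.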

Details

Motivation: License plate regions in dash cam images often suffer from low resolution, motion blur, and glare, which makes accurate recognition difficult, and existing generative models cannot reliably restore such low-quality images; this motivates the MF-LPR$^2$ framework. Method: The paper proposes MF-LPR$^2$, a new multi-frame license plate restoration and recognition framework that uses an optical flow estimator for frame alignment, together with algorithms that detect and correct erroneous optical flow estimates by exploiting the spatio-temporal consistency of license plate image sequences. Result: MF-LPR$^2$ significantly outperforms eight recent restoration models on PSNR, SSIM, and LPIPS, and reaches a recognition accuracy of 86.44%, exceeding the best single-frame LPR (14.04%) and multi-frame LPR (82.55%) methods. Conclusion: Through multi-frame alignment and aggregation, MF-LPR$^2$ significantly improves both restoration quality and recognition accuracy for low-quality license plate images, and outperforms existing single-frame and multi-frame LPR methods. Abstract: License plate recognition (LPR) is important for traffic law enforcement, crime investigation, and surveillance. However, license plate areas in dash cam images often suffer from low resolution, motion blur, and glare, which make accurate recognition challenging. Existing generative models that rely on pretrained priors cannot reliably restore such poor-quality images, frequently introducing severe artifacts and distortions. To address this issue, we propose a novel multi-frame license plate restoration and recognition framework, MF-LPR$^2$, which addresses ambiguities in poor-quality images by aligning and aggregating neighboring frames instead of relying on pretrained knowledge. To achieve accurate frame alignment, we employ a state-of-the-art optical flow estimator in conjunction with carefully designed algorithms that detect and correct erroneous optical flow estimations by leveraging the spatio-temporal consistency inherent in license plate image sequences. Our approach enhances both image quality and recognition accuracy while preserving the evidential content of the input images. In addition, we constructed a novel Realistic LPR (RLPR) dataset to evaluate MF-LPR$^2$. The RLPR dataset contains 200 pairs of low-quality license plate image sequences and high-quality pseudo ground-truth images, reflecting the complexities of real-world scenarios. In experiments, MF-LPR$^2$ outperformed eight recent restoration models in terms of PSNR, SSIM, and LPIPS by significant margins. 
In recognition, MF-LPR$^2$ achieved an accuracy of 86.44%, outperforming both the best single-frame LPR (14.04%) and the multi-frame LPR (82.55%) among the eleven baseline models. The results of ablation studies confirm that our filtering and refinement algorithms significantly contribute to these improvements.

[108] DINOv3 with Test-Time Training for Medical Image Registration

Shansong Wang,Mojtaba Safari,Mingzhe Hu,Qiang Li,Chih-Wei Chang,Richard LJ Qiu,Xiaofeng Yang

Main category: cs.CV

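TL;DR: This paper proposes a training-free medical image registration method that uses a frozen DINOv3 encoder and optimizes the deformation field in feature space, removing the dependence of conventional learning-based methods on large amounts of training data.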

Details

Motivation: Existing medical image registration methods, especially learning-based ones, typically require large amounts of training data, which limits their clinical use. To address this, the paper proposes a new training-free method. Method: The method uses a frozen DINOv3 encoder and optimizes the deformation field in feature space at test time to register medical images. Result: The method performs well on two representative medical image datasets (Abdomen MR-CT and ACDC cardiac MRI). On Abdomen MR-CT, it reaches a mean Dice similarity coefficient (DSC) of 0.790, a 95th percentile Hausdorff distance (HD95) of 4.9±5.0, and a standard deviation of the Log-Jacobian (SDLogJ) of 0.08±0.02. On ACDC cardiac MRI, mean DSC improves to 0.769, SDLogJ drops to 0.11, and HD95 drops to 4.8, a marked gain over the initial alignment. Conclusion: The study indicates that test-time optimization in a compact foundation feature space offers a practical and general solution for clinical image registration without additional training. Abstract: Prior medical image registration approaches, particularly learning-based methods, often require large amounts of training data, which constrains clinical adoption. To overcome this limitation, we propose a training-free pipeline that relies on a frozen DINOv3 encoder and test-time optimization of the deformation field in feature space. Across two representative benchmarks, the method is accurate and yields regular deformations. On Abdomen MR-CT, it attained the best mean Dice score (DSC) of 0.790 together with the lowest 95th percentile Hausdorff Distance (HD95) of 4.9±5.0 and the lowest standard deviation of Log-Jacobian (SDLogJ) of 0.08±0.02. On ACDC cardiac MRI, it improves mean DSC to 0.769 and reduces SDLogJ to 0.11 and HD95 to 4.8, a marked gain over the initial alignment. The results indicate that operating in a compact foundation feature space at test time offers a practical and general solution for clinical registration without additional training.
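
For readers unfamiliar with the Dice score (DSC) reported above, the following sketch shows how it is computed for one anatomical label. Representing each segmentation as a set of voxel coordinates is an illustrative choice, and returning 1.0 when both masks are empty is a convention assumed here, not something stated in the paper.

```python
def dice_score(mask_a, mask_b):
    """Dice similarity coefficient: 2 * |A and B| / (|A| + |B|).

    mask_a, mask_b: iterables of voxel coordinates (hashable tuples)
    labeled as foreground in the warped and fixed segmentations.
    """
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))
```

A registration benchmark typically averages this value over labels and image pairs, which is what a figure such as a mean DSC of 0.790 summarizes.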

[109] Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization

Canyu Zhao,Xiaoman Li,Tianjian Feng,Zhiyue Zhao,Hao Chen,Chunhua Shen

Main category: cs.CV

TL;DR: Tinker is a high-fidelity 3D editing framework that needs no per-scene finetuning and produces multi-view consistent edits from only a few images.

Details

Motivation: Existing methods need extensive per-scene optimization to ensure multi-view consistency, whereas Tinker aims for efficient 3D editing in a zero-shot or few-shot manner. Method: Pretrained diffusion models are repurposed to build a framework that requires no per-scene finetuning and comprises two new components: a referring multi-view editor and an any-view-to-video synthesizer. Result: Tinker generates robust, multi-view consistent edits from as few as one or two images, and the authors build the first large-scale multi-view editing dataset and data pipeline. Conclusion: Tinker is a key step toward zero-shot 3D editing; it significantly lowers the barrier to generalizable 3D content creation and achieves state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks. Abstract: We introduce Tinker, a versatile framework for high-fidelity 3D editing that operates in both one-shot and few-shot regimes without any per-scene finetuning. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, Tinker delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop our framework capable of generating multi-view consistent edited views without per-scene training, which consists of two novel components: (1) Referring multi-view editor: Enables precise, reference-driven edits that remain coherent across all viewpoints. (2) Any-view-to-video synthesizer: Leverages spatial-temporal priors from video diffusion to perform high-quality scene completion and novel-view generation even from sparse inputs. Through extensive experiments, Tinker significantly reduces the barrier to generalizable 3D content creation, achieving state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks. We believe that Tinker represents a key step towards truly scalable, zero-shot 3D editing. Project webpage: https://aim-uofa.github.io/Tinker

[110] Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives

Haoyu Zhao,Jiaxi Gu,Shicong Wang,Xing Zhang,Hang Xu,Zuxuan Wu,Yu-Gang Jiang

Main category: cs.CV

TL;DR: This paper proposes a novel framework and inference pipeline for video-language retrieval that enhances performance without additional training, leveraging fine-grained features and keyword repetition insights.

Details

Motivation: The motivation is to address the challenges of achieving high accuracy and low training costs in video-language retrieval given the explosive growth of video streaming. Method: The method includes a coarse-to-fine strategy for learning fine-grained features, a Granularity-Aware Representation module for extracting detailed data, and an inference pipeline incorporating a voting mechanism and Matching Entropy metric. Result: Experimental results showed significant performance improvements, including a 2.1% increase in Recall@1 on the MSR-VTT dataset and a 1.6% increase on the DiDeMo dataset. Conclusion: The proposed framework achieves better video-language retrieval performance without additional training, outperforming previous approaches on multiple benchmarks. Abstract: The explosive growth of video streaming presents challenges in achieving high accuracy and low training costs for video-language retrieval. However, existing methods rely on large-scale pre-training to improve video retrieval performance, resulting in significant computational demands. Additionally, the fine-grained information in videos and texts remains underexplored. To alleviate these problems, we propose a novel framework to learn fine-grained features for better alignment and introduce an inference pipeline to improve performance without additional training. Specifically, we employ coarse-to-fine objectives to understand the semantic information of video-text pairs, including contrastive and matching learning. The fine-grained data used for training is obtained through the Granularity-Aware Representation module, which is designed based on similarity analysis between video frames and words in captions. Furthermore, we observe that the repetition of keywords in the original captions, referred to as "Repetition", can enhance retrieval performance and improve alignment between video and text. 
Based on this insight, we propose a novel and effective inference pipeline that incorporates a voting mechanism and a new Matching Entropy metric to achieve better retrieval performance without requiring additional pre-training. Experimental results on four benchmarks demonstrate that the proposed method outperforms previous approaches. Additionally, our inference pipeline achieves significant performance improvements, with a 2.1% increase in Recall@1 on the MSR-VTT dataset and a 1.6% increase on the DiDeMo dataset.
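
The Recall@1 gains quoted above refer to a standard retrieval metric, which can be sketched as below. The sketch assumes a square similarity matrix in which text query `i` has video `i` as its only correct match; it does not reproduce the paper's voting mechanism or Matching Entropy metric, whose definitions are not given in the abstract.

```python
def recall_at_k(similarity, k=1):
    """Fraction of text queries whose ground-truth video is in the top k.

    similarity[i][j]: score of text query i against video j; the
    ground-truth video for query i is assumed to be video i.
    """
    hits = 0
    for query_index, scores in enumerate(similarity):
        ranked = sorted(range(len(scores)),
                        key=lambda video: scores[video], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return hits / len(similarity)
```

An inference-time reranking step such as the one described would change the entries of `similarity` before this metric is computed, leaving the trained model untouched.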

[111] TransLight: Image-Guided Customized Lighting Control with Generative Decoupling

Zongming Li,Lianghui Zhu,Haocheng Shen,Longjin Ran,Wenyu Liu,Xinggang Wang

Main category: cs.CV

TL;DR: TransLight is a new framework that enables high-quality, customizable transfer of light effects between images, overcoming limitations in existing methods by decoupling content and lighting using diffusion models.

Details

Motivation: Most existing illumination-editing approaches struggle to balance customized control of light effects with content integrity, especially when transferring complex light effects from one image to another. Method: The method involves a 'Generative Decoupling' strategy using two fine-tuned diffusion models to separate image content and light effects, followed by training an IC-Light model with image-content-light triplets and injecting a reference lighting image as a conditioning signal. Result: The proposed TransLight model successfully transfers light effects across disparate images with high fidelity and flexibility, outperforming existing techniques in customization and naturalness. Conclusion: TransLight provides a novel framework for high-fidelity and high-freedom transfer of light effects, offering customized illumination control and advancing research in illumination harmonization and editing. Abstract: Most existing illumination-editing approaches fail to simultaneously provide customized control of light effects and preserve content integrity. This makes them less effective for practical lighting stylization requirements, especially in the challenging task of transferring complex light effects from a reference image to a user-specified target image. To address this problem, we propose TransLight, a novel framework that enables high-fidelity and high-freedom transfer of light effects. Extracting the light effect from the reference image is the most critical and challenging step in our method. The difficulty lies in the complex geometric structure features embedded in light effects that are highly coupled with content in real-world scenarios. To achieve this, we first present Generative Decoupling, where two fine-tuned diffusion models are used to accurately separate image content and light effects, generating a newly curated, million-scale dataset of image-content-light triplets. 
Then, we employ IC-Light as the generative model and train our model with our triplets, injecting the reference lighting image as an additional conditioning signal. The resulting TransLight model enables customized and natural transfer of diverse light effects. Notably, by thoroughly disentangling light effects from reference images, our generative decoupling strategy endows TransLight with highly flexible illumination control. Experimental results establish TransLight as the first method to successfully transfer light effects across disparate images, delivering more customized illumination control than existing techniques and charting new directions for research in illumination harmonization and editing.

[112] EventSSEG: Event-driven Self-Supervised Segmentation with Probabilistic Attention

Lakshmi Annamalai,Chetan Singh Thakur

Main category: cs.CV

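TL;DR: EventSSEG is a new road segmentation method that uses event cameras and self-supervised learning to achieve low latency, low compute requirements, and minimal labeled-data requirements.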

Details

Motivation: Achieving low-latency, low-compute road segmentation with frame-based cameras remains a challenge; event cameras offer a promising alternative. Method: The paper introduces EventSSEG, a road segmentation method that uses event-only computing and a probabilistic attention mechanism, and employs event-based self-supervised learning to avoid the need for large amounts of labeled data. Result: Experiments on DSEC-Semantic and DDD17 show that EventSSEG achieves state-of-the-art performance while requiring minimal labeled events. Conclusion: EventSSEG is a promising approach that makes full use of event cameras' capabilities and addresses the scarcity of labeled event data. Abstract: Road segmentation is pivotal for autonomous vehicles, yet achieving low-latency, low-compute solutions using frame-based cameras remains a challenge. Event cameras offer a promising alternative. To leverage their low-power sensing, we introduce EventSSEG, a method for road segmentation that uses event-only computing and a probabilistic attention mechanism. Event-only computing poses a challenge in transferring pretrained weights from the conventional camera domain, requiring abundant labeled data, which is scarce. To overcome this, EventSSEG employs event-based self-supervised learning, eliminating the need for extensive labeled data. Experiments on DSEC-Semantic and DDD17 show that EventSSEG achieves state-of-the-art performance with minimal labeled events. This approach maximizes event cameras' capabilities and addresses the lack of labeled events.

[113] Lifespan Pancreas Morphology for Control vs Type 2 Diabetes using AI on Largescale Clinical Imaging

Lucas W. Remedios,Chloe Cho,Trent M. Schwartz,Dingjie Su,Gaurav Rudravaram,Chenyu Gao,Aravind R. Krishnan,Adam M. Saunders,Michael E. Kim,Shunxing Bao,Thomas A. Lasko,Alvin C. Powers,Bennett A. Landman,John Virostko

Main category: cs.CV

TL;DR: This study identifies significant differences in pancreas size and shape between patients with type 2 diabetes and non-diabetic controls, using AI-based analysis of CT and MRI scans. Lifespan trends show that the pancreas is smaller in type 2 diabetes, and MRI measurements differ from CT.

Details

Motivation: Understanding how the pancreas changes with age is critical for detecting deviations in type 2 diabetes and other pancreatic diseases. The motivation is to establish normative morphological aging trends and detect potential deviations linked to type 2 diabetes. Method: The study used a dataset of 2533 patients imaged with abdominal CT or MRI. Scans were resampled to 3mm isotropic resolution, pancreas segmentation was performed using automated methods, and 13 morphological features were extracted. GAMLSS regression was used to model pancreas morphology trends in 1350 patients matched for age, sex, and type 2 diabetes status. Result: When adjusting for confounders, 10 out of 13 morphological features showed significantly different aging trends between patients with type 2 diabetes and non-diabetic controls (p < 0.05 after multiple comparisons corrections). MRI appeared to yield different pancreas measurements compared to CT using the AI-based method. Conclusion: The study concludes that there are significant differences in pancreas size and shape between patients with type 2 diabetes and non-diabetic controls, reinforcing that the pancreas is smaller in type 2 diabetes. The research also provides lifespan trends of pancreas morphology and highlights differences in measurement outcomes between MRI and CT scans using AI-based methods. Abstract: Purpose: Understanding how the pancreas changes is critical for detecting deviations in type 2 diabetes and other pancreatic disease. We measure pancreas size and shape using morphological measurements from ages 0 to 90. Our goals are to 1) identify reliable clinical imaging modalities for AI-based pancreas measurement, 2) establish normative morphological aging trends, and 3) detect potential deviations in type 2 diabetes. Approach: We analyzed a clinically acquired dataset of 2533 patients imaged with abdominal CT or MRI. 
We resampled the scans to 3mm isotropic resolution, segmented the pancreas using automated methods, and extracted 13 morphological pancreas features across the lifespan. First, we assessed CT and MRI measurements to determine which modalities provide consistent lifespan trends. Second, we characterized distributions of normative morphological patterns stratified by age group and sex. Third, we used GAMLSS regression to model pancreas morphology trends in 1350 patients matched for age, sex, and type 2 diabetes status to identify any deviations from normative aging associated with type 2 diabetes. Results: When adjusting for confounders, the aging trends for 10 of 13 morphological features were significantly different between patients with type 2 diabetes and non-diabetic controls (p < 0.05 after multiple comparisons corrections). Additionally, MRI appeared to yield different pancreas measurements than CT using our AI-based method. Conclusions: We provide lifespan trends demonstrating that the size and shape of the pancreas is altered in type 2 diabetes using 675 control patients and 675 diabetes patients. Moreover, our findings reinforce that the pancreas is smaller in type 2 diabetes. Additionally, we contribute a reference of lifespan pancreas morphology from a large cohort of non-diabetic control patients in a clinical setting.

[114] MS-CLR: Multi-Skeleton Contrastive Learning for Human Action Recognition

Mert Kiray,Alvaro Ritter,Nassir Navab,Benjamin Busam

Main category: cs.CV

TL;DR: This paper proposes Multi-Skeleton Contrastive Learning (MS-CLR), a self-supervised framework that aligns pose representations across multiple skeleton conventions, achieving improved performance and state-of-the-art results in skeleton-based action recognition.

Details

Motivation: Existing contrastive learning methods rely on a single skeleton convention, limiting their ability to generalize across datasets with diverse joint structures and anatomical coverage. Method: Multi-Skeleton Contrastive Learning (MS-CLR) is proposed, which aligns pose representations across multiple skeleton conventions extracted from the same sequence. The ST-GCN architecture is adapted to handle skeletons with varying joint layouts and scales through a unified representation scheme. Result: Experiments on the NTU RGB+D 60 and 120 datasets demonstrate that MS-CLR consistently improves performance over strong single-skeleton contrastive learning baselines. A multi-skeleton ensemble further boosts performance, setting new state-of-the-art results on both datasets. Conclusion: MS-CLR improves performance in skeleton-based action recognition by aligning pose representations across multiple skeleton conventions and achieves state-of-the-art results. Abstract: Contrastive learning has gained significant attention in skeleton-based action recognition for its ability to learn robust representations from unlabeled data. However, existing methods rely on a single skeleton convention, which limits their ability to generalize across datasets with diverse joint structures and anatomical coverage. We propose Multi-Skeleton Contrastive Learning (MS-CLR), a general self-supervised framework that aligns pose representations across multiple skeleton conventions extracted from the same sequence. This encourages the model to learn structural invariances and capture diverse anatomical cues, resulting in more expressive and generalizable features. To support this, we adapt the ST-GCN architecture to handle skeletons with varying joint layouts and scales through a unified representation scheme. Experiments on the NTU RGB+D 60 and 120 datasets demonstrate that MS-CLR consistently improves performance over strong single-skeleton contrastive learning baselines. 
A multi-skeleton ensemble further boosts performance, setting new state-of-the-art results on both datasets.
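
The idea of aligning pose representations across skeleton conventions can be illustrated with a generic InfoNCE-style contrastive loss. The sketch below is an assumption-laden illustration, not the MS-CLR objective itself: the function name `info_nce`, the temperature value, and the two-dimensional toy embeddings are hypothetical, and the abstract does not specify the exact loss used.

```python
import math


def info_nce(view_a, view_b, temperature=0.1):
    """Generic InfoNCE loss between two batches of embeddings.

    view_a[i] and view_b[i] are assumed to be embeddings of the SAME sequence
    under two different skeleton conventions (the positive pair); all other
    pairings in the batch act as negatives.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of a_i to every embedding in the other view.
        logits = [sum(x * y for x, y in zip(a_i, b_j)) / temperature for b_j in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # negative log-probability of the positive
    return total / len(a)


# Toy embeddings (made up): matched views agree, mismatched views are swapped.
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss, mismatched_loss)
```

The loss is small when embeddings of the same sequence agree across conventions and large when they do not, which is the kind of pressure that encourages a model to learn invariance to joint layout.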

[115] GaussianArt: Unified Modeling of Geometry and Motion for Articulated Objects

Licheng Shen,Saining Zhang,Honghan Li,Peilin Yang,Zihao Huang,Zongzheng Zhang,Hao Zhao

Main category: cs.CV

TL;DR: This paper proposes a unified representation using articulated 3D Gaussians to enhance the reconstruction and motion decomposition of complex multi-part objects, significantly outperforming prior methods.

Details

Motivation: The motivation stems from the limitations of prior methods that decouple geometry and motion, complicating the reconstruction pipeline and restricting scalability for objects with complex, multi-part articulation. Method: The method involves introducing a unified representation that jointly models geometry and motion using articulated 3D Gaussians, which improves robustness in motion decomposition. Result: The result is a scalable approach that supports articulated objects with up to 20 parts, outperforming previous methods that struggle beyond 2-3 parts. Extensive experiments validate its superior accuracy in geometry reconstruction and motion estimation. Conclusion: The paper concludes that the proposed unified representation of articulated 3D Gaussians significantly enhances the reconstruction and motion decomposition of complex articulated objects, demonstrating superior performance over existing methods. Abstract: Reconstructing articulated objects is essential for building digital twins of interactive environments. However, prior methods typically decouple geometry and motion by first reconstructing object shape in distinct states and then estimating articulation through post-hoc alignment. This separation complicates the reconstruction pipeline and restricts scalability, especially for objects with complex, multi-part articulation. We introduce a unified representation that jointly models geometry and motion using articulated 3D Gaussians. This formulation improves robustness in motion decomposition and supports articulated objects with up to 20 parts, significantly outperforming prior approaches that often struggle beyond 2-3 parts due to brittle initialization. To systematically assess scalability and generalization, we propose MPArt-90, a new benchmark consisting of 90 articulated objects across 20 categories, each with diverse part counts and motion configurations. 
Extensive experiments show that our method consistently achieves superior accuracy in part-level geometry reconstruction and motion estimation across a broad range of object types. We further demonstrate applicability to downstream tasks such as robotic simulation and human-scene interaction modeling, highlighting the potential of unified articulated representations in scalable physical modeling.

Table of Contents

cs.CL [Back]

[1] From Image Captioning to Visual Storytelling

[2] Benchmarking Sociolinguistic Diversity in Swahili NLP: A Taxonomy-Guided Approach

[3] Contrastive Analysis of Constituent Order Preferences Within Adverbial Roles in English and Chinese News: A Large-Language-Model-Driven Approach

[4] T-REX: Table -- Refute or Entail eXplainer

[5] Confidence Estimation for Text-to-SQL in Large Language Models

[6] Assessing and Mitigating Data Memorization Risks in Fine-Tuned Large Language Models

[7] Punctuation and Predicates in Language Models

[8] DLLMQuant: Quantizing Diffusion-based Large Language Models

[9] MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation

[10] DPad: Efficient Diffusion Language Models with Suffix Dropout

[11] Comparing energy consumption and accuracy in text classification inference

[12] Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper

[13] Disentangling concept semantics via multilingual averaging in Sparse Autoencoders

[14] GRILE: A Benchmark for Grammar Reasoning and Explanation in Romanian LLMs

[15] Tokens with Meaning: A Hybrid Tokenization Approach for NLP

[16] A Joint Multitask Model for Morpho-Syntactic Parsing

[17] Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency

[18] SurveyGen-I: Consistent Scientific Survey Generation with Evolving Plans and Memory-Guided Writing

[19] Beyond Semantic Similarity: Reducing Unnecessary API Calls via Behavior-Aligned Retriever

[20] ISCA: A Framework for Interview-Style Conversational Agents

[21] ZPD-SCA: Unveiling the Blind Spots of LLMs in Assessing Students' Cognitive Abilities

[22] Credence Calibration Game? Calibrating Large Language Models through Structured Play

[23] DEPTH: Hallucination-Free Relation Extraction via Dependency-Aware Sentence Simplification and Two-tiered Hierarchical Refinement

[24] Cognitive Surgery: The Awakening of Implicit Territorial Awareness in LLMs

[25] Knowledge Graph-Infused Fine-Tuning for Structured Reasoning in Large Language Models

[26] NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

[27] In2x at WMT25 Translation Task

[28] Reasoning is about giving reasons

[29] EmoTale: An Enacted Speech-emotion Dataset in Danish

[30] Towards Skeletal and Signer Noise Reduction in Sign Language Production via Quaternion-Based Pose Encoding and Contrastive Learning

[31] Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek

[32] Continuous sentiment scores for literary and multilingual contexts

[33] Improving in-context learning with a better scoring function

[34] ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine

[35] The Digital Sous Chef -- A Comparative Study on Fine-Tuning Language Models for Recipe Generation

[36] Transplant Then Regenerate: A New Paradigm for Text Data Augmentation

[37] Evaluating Multilingual and Code-Switched Alignment in LLMs via Synthetic Natural Language Inference

[38] TransLLM: A Unified Multi-Task Foundation Framework for Urban Transportation via Learnable Prompting

[39] Evaluating Retrieval-Augmented Generation vs. Long-Context Input for Clinical Reasoning over EHRs

[40] Long Chain-of-Thought Reasoning Across Languages

[41] MedReseacher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework

[42] Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs

cs.CV [Back]

[43] A comparative study of some wavelet and sampling operators on various features of an image

[44] Federated Action Recognition for Smart Worker Assistance Using FastPose

[45] LENS: Learning to Segment Anything with Unified Reinforced Reasoning

[46] RynnEC: Bringing MLLMs into Embodied World

[47] Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer

[48] CLIPSym: Delving into Symmetry Detection with CLIP

[49] A Survey on Video Anomaly Detection via Deep Learning: Human, Vehicle, and Environment

[50] Accelerating Image Classification with Graph Convolutional Neural Networks using Voronoi Diagrams

[51] Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models

[52] Effect of Data Augmentation on Conformal Prediction for Diabetic Retinopathy

[53] Tooth-Diffusion: Guided 3D CBCT Synthesis with Fine-Grained Tooth Conditioning

[54] GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting

[55] Multi-Rationale Explainable Object Recognition via Contrastive Conditional Inference

[56] OccluNet: Spatio-Temporal Deep Learning for Occlusion Detection on DSA

[57] Pixels to Play: A Foundation Model for 3D Gameplay

[58] MoVieDrive: Multi-Modal Multi-View Urban Scene Video Generation

[59] Inter-Class Relational Loss for Small Object Detection: A Case Study on License Plates

[60] HandCraft: Dynamic Sign Generation for Synthetic Data Augmentation

[61] Deep Learning for Taxol Exposure Analysis: A New Cell Image Dataset and Attention-Based Baseline Model

[62] Learning Point Cloud Representations with Pose Continuity for Depth-Based Category-Level 6D Object Pose Estimation

[63] Taming Transformer for Emotion-Controllable Talking Face Generation

[64] FastTracker: Real-Time and Accurate Visual Tracking

[65] TCFNet: Bidirectional face-bone transformation via a Transformer-based coarse-to-fine point movement network

[66] QuadINR: Hardware-Efficient Implicit Neural Representations Through Quadratic Activation

[67] Img2ST-Net: Efficient High-Resolution Spatial Omics Prediction from Whole Slide Histology Images via Fully Convolutional Image-to-Image Learning

[68] CTA-Flux: Integrating Chinese Cultural Semantics into High-Quality English Text-to-Image Communities

[69] MoCHA-former: Moiré-Conditioned Hybrid Adaptive Transformer for Video Demoiréing

[70] HyperDiff: Hypergraph Guided Diffusion Model for 3D Human Pose Estimation

[71] FOCUS: Frequency-Optimized Conditioning of DiffUSion Models for mitigating catastrophic forgetting during Test-Time Adaptation

[72] MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion

[73] Reconstruction Using the Invisible: Intuition from NIR and Metadata for Enhanced 3D Gaussian Splatting

[74] Generalizable Engagement Estimation in Conversation via Domain Prompting and Parallel Attention

[75] D^3-Talker: Dual-Branch Decoupled Deformation Fields for Few-Shot 3D Talking Head Synthesis

[76] Ouroboros: Single-step Diffusion Models for Cycle-consistent Forward and Inverse Rendering

[77] DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing