cs.CL

[1] CycleDistill: Bootstrapping Machine Translation using LLMs with Cyclical Distillation

Deepon Halder,Thanmay Jayakumar,Raj Dabre

Main category: cs.CL

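TL;DR: Proposes CycleDistill, a method that uses large language models and a handful of translation examples to iteratively generate synthetic parallel corpora, enabling high-quality machine translation without large parallel corpora.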

Details

Motivation: Although large language models (LLMs) can perform few-shot machine translation, they often lag behind dedicated MT systems trained on parallel corpora. For low-resource languages, parallel corpora are often scarce or non-existent. Method: CycleDistill iteratively generates synthetic parallel corpora from monolingual text via zero- or few-shot translation, then uses those corpora to fine-tune the same model that generated the data. Result: Relying solely on monolingual text, CycleDistill achieves high-quality machine translation; in experiments on three Indian languages, the first iteration improves on the few-shot baseline by 20-30 chrF points on average. Conclusion: CycleDistill is a new approach that uses LLMs and a few translation examples to obtain high-quality MT systems without large parallel corpora. Abstract: Large language models (LLMs), despite their ability to perform few-shot machine translation (MT), often lag behind dedicated MT systems trained on parallel corpora, which are crucial for high-quality machine translation. However, parallel corpora are often scarce or non-existent for low-resource languages. In this paper, we propose CycleDistill, a bootstrapping approach leveraging LLMs and few-shot translation to obtain high-quality MT systems. CycleDistill involves iteratively generating synthetic parallel corpora from monolingual corpora via zero- or few-shot MT, which are then used to fine-tune the model that generated said data for MT. CycleDistill does not need parallel corpora beyond 1 to 4 few-shot examples, and in our experiments focusing on three Indian languages, by relying solely on monolingual corpora, it can achieve high-quality machine translation, improving upon a few-shot baseline model by over 20-30 chrF points on average in the first iteration. We also study the effect of leveraging softmax activations during the distillation process and observe mild improvements in translation quality.
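
The generate-then-fine-tune cycle described in the abstract can be sketched as a short loop. This is only an illustrative outline, not the authors' implementation: `cycle_distill`, `translate`, and `fine_tune` are hypothetical names, and the two callables stand in for a few-shot-prompted LLM and a fine-tuning routine that the abstract does not specify.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]               # (source sentence, synthetic translation)
Translator = Callable[[str], str]    # hypothetical: wraps a zero-/few-shot LLM call


def cycle_distill(
    monolingual: List[str],
    translate: Translator,
    fine_tune: Callable[[List[Pair]], Translator],
    iterations: int = 3,
) -> Translator:
    """Bootstrap a translator from monolingual text only (assumed interface)."""
    model = translate
    for _ in range(iterations):
        # 1. The current model translates the monolingual corpus,
        #    producing a synthetic parallel corpus.
        synthetic = [(src, model(src)) for src in monolingual]
        # 2. The model is fine-tuned on its own output; the result
        #    becomes the generator for the next iteration.
        model = fine_tune(synthetic)
    return model
```

In this sketch the only parallel data involved would be the few-shot examples baked into the initial `translate` callable, which mirrors the abstract's claim of needing 1 to 4 examples.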

[2] Inference Scaled GraphRAG: Improving Multi Hop Question Answering on Knowledge Graphs

Travis Thompson,Seung-Hwan Lim,Paul Liu,Ruoying He,Dongkuan Xu

Main category: cs.CL

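TL;DR: This paper proposes Inference-Scaled GraphRAG, a framework that enhances LLM-based graph reasoning by scaling compute at inference time, improving multi-hop question answering.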

Details

Motivation: Conventional RAG and GraphRAG methods often fail to capture relational structure across nodes in knowledge graphs, and LLMs underperform on knowledge-intensive reasoning tasks. Method: Combines sequential scaling via deep chain-of-thought graph traversal with parallel scaling via majority voting over sampled trajectories. Result: Significantly improves multi-hop question answering on the GRBench benchmark, with substantial gains over traditional GraphRAG and prior graph traversal baselines. Conclusion: Inference-Scaled GraphRAG is a practical, architecture-agnostic solution that improves LLM-based reasoning over structured knowledge. Abstract: Large Language Models (LLMs) have achieved impressive capabilities in language understanding and generation, yet they continue to underperform on knowledge-intensive reasoning tasks due to limited access to structured context and multi-hop information. Retrieval-Augmented Generation (RAG) partially mitigates this by grounding generation in retrieved context, but conventional RAG and GraphRAG methods often fail to capture relational structure across nodes in knowledge graphs. We introduce Inference-Scaled GraphRAG, a novel framework that enhances LLM-based graph reasoning by applying inference-time compute scaling. Our method combines sequential scaling with deep chain-of-thought graph traversal, and parallel scaling with majority voting over sampled trajectories within an interleaved reasoning-execution loop. Experiments on the GRBench benchmark demonstrate that our approach significantly improves multi-hop question answering performance, achieving substantial gains over both traditional GraphRAG and prior graph traversal baselines. These findings suggest that inference-time scaling is a practical and architecture-agnostic solution for structured knowledge reasoning with LLMs.
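
The parallel-scaling step, majority voting over sampled trajectories, can be illustrated with a few lines of Python. This is a minimal sketch under assumptions: each trajectory is taken to end in a short string answer (or `None` if it failed), answers are compared after simple lowercasing and trimming, and `majority_vote` is a hypothetical helper name rather than anything from the paper.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[Optional[str]]) -> Optional[str]:
    """Return the most frequent normalized answer among sampled trajectories."""
    # Drop failed trajectories and normalize so "Paris" and "paris " agree.
    valid = [a.strip().lower() for a in answers if a]
    if not valid:
        return None
    # Ties resolve to the answer that appeared first among the samples.
    return Counter(valid).most_common(1)[0][0]
```

For example, `majority_vote(["Paris", "paris ", "London", None])` would select `"paris"`, since two of the three valid samples agree.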

[3] Doc2Agent: Scalable Generation of Tool-Using Agents from API Documentation

Xinyi Ni,Haonan Jian,Qiuyang Wang,Vedanshi Chetan Shah,Pengyu Hong

Main category: cs.CL

TL;DR: Doc2Agent is a scalable pipeline that builds efficient, adaptable agents from unstructured API documentation, outperforming existing methods in performance and cost.

Details

Motivation: Most API-based agents use curated toolsets that don't reflect real-world complexity; Doc2Agent addresses this by enabling agents to handle arbitrary domains and unstructured API documentation. Method: Doc2Agent creates a pipeline that generates executable tools from API documentation and iteratively refines them using a code agent. Result: The approach showed a 55% relative performance improvement and 90% lower cost on the WebArena benchmark, and successfully adapted to a complex domain in glycomaterial science. Conclusion: Doc2Agent provides a scalable and adaptable solution for building tool agents from unstructured API documentation, showing significant performance and cost benefits. Abstract: REST APIs play important roles in enriching the action space of web agents, yet most API-based agents rely on curated and uniform toolsets that do not reflect the complexity of real-world APIs. Building tool-using agents for arbitrary domains remains a major challenge, as it requires reading unstructured API documentation, testing APIs and inferring correct parameters. We propose Doc2Agent, a scalable pipeline to build agents that can call Python-based tools generated from API documentation. Doc2Agent generates executable tools from API documentation and iteratively refines them using a code agent. We evaluate our approach on real-world APIs, WebArena APIs, and research APIs, producing validated tools. We achieved a 55% relative performance improvement with 90% lower cost compared to direct API calling on the WebArena benchmark. A domain-specific agent built for glycomaterial science further demonstrates the pipeline's adaptability to complex, knowledge-rich tasks. Doc2Agent offers a generalizable solution for building tool agents from unstructured API documentation at scale.

[4] A Modular Multitask Reasoning Framework Integrating Spatio-temporal Models and LLMs

Kethmi Hirushini Hettige,Jiahao Ji,Cheng Long,Shili Xiang,Gao Cong,Jingyuan Wang

Main category: cs.CL

TL;DR: STReason combines LLMs with spatio-temporal analysis to enable multi-task inference and complex reasoning without task-specific tuning, offering improved performance and practical utility in real-world applications.

Details

Motivation: Existing spatio-temporal models are limited in multi-task inference and long-form reasoning. The need for detailed explanatory outputs and broader applicability to real-world decision-making scenarios motivated the development of STReason. Method: STReason integrates the reasoning strengths of large language models (LLMs) with spatio-temporal analytical capabilities, enabling multi-task inference without task-specific fine-tuning through in-context learning. Modular programs are systematically executed to generate solutions and rationales. Result: Experimental results show that STReason significantly outperforms advanced LLM baselines across all metrics, particularly excelling in complex, reasoning-intensive spatio-temporal tasks. Human evaluations validate its credibility and potential to reduce expert workload. Conclusion: STReason provides a promising direction for developing more capable and generalizable spatio-temporal reasoning systems, showing significant improvements over advanced LLM baselines in complex reasoning-intensive scenarios. Abstract: Spatio-temporal data mining plays a pivotal role in informed decision making across diverse domains. However, existing models are often restricted to narrow tasks, lacking the capacity for multi-task inference and complex long-form reasoning that require generation of in-depth, explanatory outputs. These limitations restrict their applicability to real-world, multi-faceted decision scenarios. In this work, we introduce STReason, a novel framework that integrates the reasoning strengths of large language models (LLMs) with the analytical capabilities of spatio-temporal models for multi-task inference and execution. Without requiring task-specific finetuning, STReason leverages in-context learning to decompose complex natural language queries into modular, interpretable programs, which are then systematically executed to generate both solutions and detailed rationales. 
To facilitate rigorous evaluation, we construct a new benchmark dataset and propose a unified evaluation framework with metrics specifically designed for long-form spatio-temporal reasoning. Experimental results show that STReason significantly outperforms advanced LLM baselines across all metrics, particularly excelling in complex, reasoning-intensive spatio-temporal scenarios. Human evaluations further validate STReason's credibility and practical utility, demonstrating its potential to reduce expert workload and broaden the applicability to real-world spatio-temporal tasks. We believe STReason provides a promising direction for developing more capable and generalizable spatio-temporal reasoning systems.

[5] SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization

Dhruv Gupta,Gayathri Ganesh Lakshmy,Yiqing Xie

Main category: cs.CL

TL;DR: This paper introduces SACL, a framework that improves code retrieval and generation by augmenting semantic information to reduce reliance on superficial text features and mitigate bias toward well-documented code.

Details

Motivation: Current code retrievers overly depend on superficial text features and exhibit bias towards well-documented code, even when irrelevant, which limits the effectiveness of Retrieval-Augmented Code Generation (RACG). Method: The authors analyze the reliance of current retrievers on surface-level textual features and their bias towards well-documented code, then propose SACL to enhance retrieval by incorporating semantic information. Result: SACL achieves significant improvements in code retrieval with increases in Recall@1 across multiple benchmarks (e.g., 12.8% improvement on HumanEval) and enhances code generation performance (e.g., 4.88% increase in Pass@1 on HumanEval). Conclusion: SACL improves code retrieval and generation performance by enriching textual information and reducing bias through semantic augmentation. Abstract: Retrieval-Augmented Code Generation (RACG) is a critical technique for enhancing code generation by retrieving relevant information. In this work, we conduct an in-depth analysis of code retrieval by systematically masking specific features while preserving code functionality. Our discoveries include: (1) although trained on code, current retrievers heavily rely on surface-level textual features (e.g., docstrings, identifier names), and (2) they exhibit a strong bias towards well-documented code, even if the documentation is irrelevant. Based on our discoveries, we propose SACL, a framework that enriches textual information and reduces bias by augmenting code or structural knowledge with semantic information. Extensive experiments show that SACL substantially improves code retrieval (e.g., by 12.8% / 9.4% / 7.0% Recall@1 on HumanEval / MBPP / SWE-Bench-Lite), which also leads to better code generation performance (e.g., by 4.88% Pass@1 on HumanEval).

[6] Bridging Compositional and Distributional Semantics: A Survey on Latent Semantic Geometry via AutoEncoder

Yingji Zhang,Danilo S. Carvalho,André Freitas

Main category: cs.CL

TL;DR: This paper discusses how semantic representation learning can integrate symbolic and distributional semantics to enhance language model capabilities by examining latent geometries from different autoencoder architectures.

Details

Motivation: Integrating compositional and symbolic properties into current semantic spaces can improve the interpretability, controllability, compositionality, and generalization of Transformer-based language models. Method: Reviewed and compared VAE, VQVAE, and SAE architectures to examine their induced latent geometries concerning semantic structure and interpretability. Result: The survey provides a novel perspective on latent space geometry through compositional semantics, highlighting distinctive latent geometries from different autoencoder architectures. Conclusion: Semantic representation learning bridges symbolic and distributional semantics, enhancing the capabilities of language models. Abstract: Integrating compositional and symbolic properties into current distributional semantic spaces can enhance the interpretability, controllability, compositionality, and generalisation capabilities of Transformer-based auto-regressive language models (LMs). In this survey, we offer a novel perspective on latent space geometry through the lens of compositional semantics, a direction we refer to as semantic representation learning. This direction enables a bridge between symbolic and distributional semantics, helping to mitigate the gap between them. We review and compare three mainstream autoencoder architectures, namely Variational AutoEncoder (VAE), Vector Quantised VAE (VQVAE), and Sparse AutoEncoder (SAE), and examine the distinctive latent geometries they induce in relation to semantic structure and interpretability.

Yilin Wang,Peixuan Lei,Jie Song,Yuzhe Hao,Tao Chen,Yuxuan Zhang,Lei Jia,Yuanxiang Li,Zhongyu Wei

Main category: cs.CL

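TL;DR: This paper proposes a new method (ITFormer) for effectively integrating time-series data with natural language, and releases EngineMT-QA, the first large-scale, multi-task temporal-textual question answering dataset.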

Details

Motivation: Time-series data are critical in many domains, but integrating them effectively with natural language for dynamic, interactive tasks remains a major challenge. Method: Introduces the Time-Series QA task and builds the EngineMT-QA dataset, and designs the ITFormer framework, which connects time-series encoders with frozen large language models. Result: ITFormer significantly outperforms strong baselines in QA accuracy with fewer than 1% additional trainable parameters. Conclusion: Through ITFormer, the work effectively integrates time-series data with natural language and establishes a new paradigm for multi-modal AI research and applications. Abstract: Time-series data are critical in diverse applications, such as industrial monitoring, medical diagnostics, and climate research. However, effectively integrating these high-dimensional temporal signals with natural language for dynamic, interactive tasks remains a significant challenge. To address this, we introduce the Time-Series Question Answering (Time-Series QA) task and release EngineMT-QA, the first large-scale, multi-task, temporal-textual QA dataset designed to capture complex interactions between time-series signals and natural language. Building on this resource, we propose the Instruct Time Transformer (ITFormer), a novel framework that bridges time-series encoders with frozen large language models (LLMs). ITFormer effectively extracts, aligns, and fuses temporal and textual features, achieving a strong improvement in QA accuracy over strong baselines with fewer than 1% additional trainable parameters. By combining computational efficiency with robust cross-modal modeling, our work establishes an adaptable paradigm for integrating temporal data with natural language, paving the way for new research and applications in multi-modal AI. More details about the project, including datasets and code, are available at: https://pandalin98.github.io/itformer_site/

[8] A Multi-Pass Large Language Model Framework for Precise and Efficient Radiology Report Error Detection

Songsoo Kim,Seungtae Lee,See Young Lee,Joonho Kim,Keechan Kan,Dukyong Yoon

Main category: cs.CL

TL;DR: This paper proposes a three-pass LLM framework to enhance the positive predictive value and reduce operational costs of AI-assisted radiology report quality assurance.

Details

Motivation: The positive predictive value (PPV) of large language model (LLM)-based proofreading for radiology reports is limited due to the low error prevalence. This study aims to assess whether a three-pass LLM framework enhances PPV and reduces operational costs compared with baseline approaches. Method: Three LLM frameworks were tested: (1) single-prompt detector; (2) extractor plus detector; and (3) extractor, detector, and false-positive verifier. Precision was measured by PPV and absolute true positive rate (aTPR). Efficiency was calculated from model inference charges and reviewer remuneration. Statistical significance was tested using cluster bootstrap, exact McNemar tests, and Holm-Bonferroni correction. Result: Framework PPV increased from 0.063 (95% CI, 0.036-0.101, Framework 1) to 0.079 (0.049-0.118, Framework 2), and significantly to 0.159 (0.090-0.252, Framework 3; P<.001 vs. baselines). aTPR remained stable (0.012-0.014; P>=.84). Operational costs per 1,000 reports dropped to USD 5.58 (Framework 3) from USD 9.72 (Framework 1) and USD 6.85 (Framework 2), reflecting reductions of 42.6% and 18.5%, respectively. Human-reviewed reports decreased from 192 to 88. External validation supported Framework 3's superior PPV (CheXpert 0.133, Open-i 0.105) and stable aTPR (0.007). Conclusion: A three-pass LLM framework significantly enhanced PPV and reduced operational costs, maintaining detection performance, providing an effective strategy for AI-assisted radiology report quality assurance. Abstract: Background: The positive predictive value (PPV) of large language model (LLM)-based proofreading for radiology reports is limited due to the low error prevalence. Purpose: To assess whether a three-pass LLM framework enhances PPV and reduces operational costs compared with baseline approaches. 
Materials and Methods: A retrospective analysis was performed on 1,000 consecutive radiology reports (250 each: radiography, ultrasonography, CT, MRI) from the MIMIC-III database. Two external datasets (CheXpert and Open-i) were validation sets. Three LLM frameworks were tested: (1) single-prompt detector; (2) extractor plus detector; and (3) extractor, detector, and false-positive verifier. Precision was measured by PPV and absolute true positive rate (aTPR). Efficiency was calculated from model inference charges and reviewer remuneration. Statistical significance was tested using cluster bootstrap, exact McNemar tests, and Holm-Bonferroni correction. Results: Framework PPV increased from 0.063 (95% CI, 0.036-0.101, Framework 1) to 0.079 (0.049-0.118, Framework 2), and significantly to 0.159 (0.090-0.252, Framework 3; P<.001 vs. baselines). aTPR remained stable (0.012-0.014; P>=.84). Operational costs per 1,000 reports dropped to USD 5.58 (Framework 3) from USD 9.72 (Framework 1) and USD 6.85 (Framework 2), reflecting reductions of 42.6% and 18.5%, respectively. Human-reviewed reports decreased from 192 to 88. External validation supported Framework 3's superior PPV (CheXpert 0.133, Open-i 0.105) and stable aTPR (0.007). Conclusion: A three-pass LLM framework significantly enhanced PPV and reduced operational costs, maintaining detection performance, providing an effective strategy for AI-assisted radiology report quality assurance.
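
The reported cost reductions can be reproduced from the per-1,000-report costs, and the PPV and aTPR figures can be cross-checked against each other. The sketch below is a reader's sanity check, not part of the study: it assumes that the 192 and 88 human-reviewed reports are the reports flagged by Frameworks 1 and 3 respectively out of 1,000, and that aTPR is true positives divided by all reports. The helper names are made up for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


def implied_atpr(ppv: float, flagged: int, total: int) -> float:
    """True positives per report, assuming aTPR = (PPV * flagged) / total."""
    return ppv * flagged / total


# Costs in USD per 1,000 reports, as quoted in the abstract.
cost_f1, cost_f2, cost_f3 = 9.72, 6.85, 5.58
print(round(100 * relative_reduction(cost_f1, cost_f3), 1))  # 42.6
print(round(100 * relative_reduction(cost_f2, cost_f3), 1))  # 18.5

# Assumed mapping: 192 reports flagged by Framework 1, 88 by Framework 3.
print(round(implied_atpr(0.063, 192, 1000), 3))  # 0.012
print(round(implied_atpr(0.159, 88, 1000), 3))   # 0.014
```

Under these assumptions the implied values fall at the two ends of the reported aTPR range (0.012-0.014), which is consistent with the claim that PPV rose because fewer false positives were flagged rather than because more errors were found.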

[9] Leveraging AI Graders for Missing Score Imputation to Achieve Accurate Ability Estimation in Constructed-Response Tests

Masaki Uto,Yuma Ito

Main category: cs.CL

TL;DR: This paper introduces a new approach to impute missing scores using automated scoring technologies, which enhances the accuracy of ability estimation in educational assessments while decreasing reliance on manual grading.

Details

Motivation: There is an increasing need to assess higher-order abilities through constructed-response tests; however, these tests are labor-intensive and costly due to substantial manual grading. Additionally, traditional methods like IRT face challenges with accuracy when dealing with sparse or heterogeneous data. Method: The study proposes a novel method for imputing missing scores using automated scoring technologies, aiming to enhance the accuracy of ability estimation in constructed-response tests. Result: The proposed method achieves high accuracy in ability estimation and markedly reduces the manual grading workload, addressing the limitations faced by traditional data augmentation techniques. Conclusion: The study concludes that leveraging automated scoring technologies can significantly improve the accuracy of IRT-based ability estimation while reducing the manual grading workload. Abstract: Evaluating the abilities of learners is a fundamental objective in the field of education. In particular, there is an increasing need to assess higher-order abilities such as expressive skills and logical thinking. Constructed-response tests such as short-answer and essay-based questions have become widely used as a method to meet this demand. Although these tests are effective, they require substantial manual grading, making them both labor-intensive and costly. Item response theory (IRT) provides a promising solution by enabling the estimation of ability from incomplete score data, where human raters grade only a subset of answers provided by learners across multiple test items. However, the accuracy of ability estimation declines as the proportion of missing scores increases. Although data augmentation techniques for imputing missing scores have been explored in order to address this limitation, they often struggle with inaccuracy for sparse or heterogeneous data. 
To overcome these challenges, this study proposes a novel method for imputing missing scores by leveraging automated scoring technologies for accurate IRT-based ability estimation. The proposed method achieves high accuracy in ability estimation while markedly reducing manual grading workload.

[10] CCRS: A Zero-Shot LLM-as-a-Judge Framework for Comprehensive RAG Evaluation

Aashiq Muhamed

Main category: cs.CL

TL;DR: This paper introduces CCRS, a new evaluation framework for RAG systems that uses a single pretrained LLM to assess multiple quality aspects efficiently and effectively.

Details

Motivation: Evaluating the multifaceted quality of RAG outputs is challenging due to limitations in existing evaluation methods, which either use inadequate lexical overlap metrics or complex multi-stage pipelines that hinder efficiency. Method: The study proposes CCRS, a suite of five zero-shot metrics using a pretrained LLM as an end-to-end judge. These metrics evaluate contextual coherence, question relevance, information density, answer correctness, and information recall. Result: CCRS effectively discriminates between RAG system performances, confirmed by its application on six RAG configurations using the BioASQ dataset, showing that Mistral-7B outperforms Llama variants. Conclusion: CCRS provides a practical, comprehensive, and efficient framework for evaluating RAG systems, offering comparable or superior performance to existing methods while improving computational efficiency. Abstract: RAG systems enhance LLMs by incorporating external knowledge, which is crucial for domains that demand factual accuracy and up-to-date information. However, evaluating the multifaceted quality of RAG outputs, spanning aspects such as contextual coherence, query relevance, factual correctness, and informational completeness, poses significant challenges. Existing evaluation methods often rely on simple lexical overlap metrics, which are inadequate for capturing these nuances, or involve complex multi-stage pipelines with intermediate steps like claim extraction or require finetuning specialized judge models, hindering practical efficiency. To address these limitations, we propose CCRS (Contextual Coherence and Relevance Score), a novel suite of five metrics that utilizes a single, powerful, pretrained LLM as a zero-shot, end-to-end judge. CCRS evaluates: Contextual Coherence (CC), Question Relevance (QR), Information Density (ID), Answer Correctness (AC), and Information Recall (IR). 
We apply CCRS to evaluate six diverse RAG system configurations on the challenging BioASQ dataset. Our analysis demonstrates that CCRS effectively discriminates between system performances, confirming, for instance, that the Mistral-7B reader outperforms Llama variants. We provide a detailed analysis of CCRS metric properties, including score distributions, convergent/discriminant validity, tie rates, population statistics, and discriminative power. Compared to the complex RAGChecker framework, CCRS offers comparable or superior discriminative power for key aspects like recall and faithfulness, while being significantly more computationally efficient. CCRS thus provides a practical, comprehensive, and efficient framework for evaluating and iteratively improving RAG systems.

[11] AALC: Large Language Model Efficient Reasoning via Adaptive Accuracy-Length Control

Ruosen Li,Ziming Luo,Quan Zhang,Ruochen Li,Ben Zhou,Ali Payani,Xinya Du

Main category: cs.CL

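TL;DR: AALC adds a lightweight, accuracy-aware length reward to reinforcement learning to balance reasoning accuracy and brevity, reducing response length by more than 50% while maintaining or improving accuracy.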

Details

Motivation: Large reasoning models (LRMs) achieve strong reasoning by generating long chains of thought, but this "overthinking" incurs high latency and cost without commensurate accuracy gains. A method is therefore needed that dynamically balances correctness and brevity during training. Method: Incorporates validation accuracy into the reward and uses a smooth, dynamically scheduled length penalty that is delayed until target performance is reached. Result: Experiments show that the method reduces response length by over 50% while maintaining or improving the original accuracy; qualitative analysis shows that it curbs redundant reasoning patterns such as excessive subgoal setting and verification. Conclusion: Reward-based strategies have the potential to guide LRMs toward more efficient, generalizable reasoning paths, although the efficiency gains come with reduced interpretability. Abstract: Large reasoning models (LRMs) achieve impressive reasoning capabilities by generating lengthy chain-of-thoughts, but this "overthinking" incurs high latency and cost without commensurate accuracy gains. In this work, we introduce AALC, a lightweight, accuracy-aware length reward integrated into reinforcement learning that dynamically balances correctness and brevity during training. By incorporating validation accuracy into the reward and employing a smooth, dynamically scheduled length penalty, AALC delays length penalty until target performance is met. Through extensive experiments across standard and out-of-distribution math benchmarks, we show that our approach reduces response length by over 50% while maintaining or even improving the original accuracy. Furthermore, qualitative analysis reveals that our method curbs redundant reasoning patterns such as excessive subgoal setting and verification, leading to structurally refined outputs rather than naive truncation. We also identify that efficiency gains are accompanied by reduced interpretability: models trained with AALC omit some narrative framing and explanatory context. These findings highlight the potential of reward-based strategies to guide LRMs toward more efficient, generalizable reasoning paths.
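
The abstract does not give the reward formula, so the following is only a toy illustration of the general idea of an accuracy-gated length penalty. Everything here is an assumption made for the sketch: the function names, the fourth-power schedule, the `max_weight` cap, and the linear length term are not taken from the paper.

```python
def length_penalty_weight(val_acc: float, target_acc: float,
                          max_weight: float = 0.5) -> float:
    """Hypothetical smooth schedule: near zero while validation accuracy is
    far below the target, rising to max_weight once the target is reached.
    Assumes target_acc > 0."""
    progress = min(max(val_acc / target_acc, 0.0), 1.0)
    return max_weight * progress ** 4


def reward(correct: bool, length: int, max_length: int,
           val_acc: float, target_acc: float) -> float:
    """Correctness reward minus an accuracy-gated length penalty (toy form)."""
    base = 1.0 if correct else 0.0
    weight = length_penalty_weight(val_acc, target_acc)
    # Length is normalized to [0, 1] so the penalty never exceeds `weight`.
    return base - weight * min(length / max_length, 1.0)
```

With this toy schedule, a correct answer earns the full reward regardless of length while validation accuracy is zero, and shorter correct answers are preferred only once accuracy approaches the target, which is the qualitative behavior the abstract describes.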

[12] SEED: A Structural Encoder for Embedding-Driven Decoding in Time Series Prediction with LLMs

Fengze Li,Yue Wang,Yangle Liu,Ming Huang,Dou Hong,Jieming Ma

Main category: cs.CL

TL;DR: This paper proposes SEED, a structural encoder framework that connects multivariate time series data with large language models to enhance forecasting accuracy by bridging the gap between structural and semantic modeling.

Details

Motivation: Current structural encoders can model feature interactions but lack semantic-level reasoning or task adaptation capabilities, while large language models (LLMs) are incompatible with raw time series inputs. This limits the development of unified, transferable prediction systems. Method: SEED uses four stages: a token-aware encoder for patch extraction, a projection module to align patches with language model embeddings, a semantic reprogramming mechanism mapping patches to task-aware prototypes, and a frozen language model for prediction. Result: Empirical results show that SEED consistently outperforms strong baselines and effectively addresses the structural-semantic modeling gap across various datasets. Conclusion: SEED successfully bridges the structural-semantic modeling gap in multivariate time series forecasting by integrating a structural encoder with a frozen language model, enabling efficient alignment between numerical patterns and semantic reasoning. Abstract: Multivariate time series forecasting requires models to simultaneously capture variable-wise structural dependencies and generalize across diverse tasks. While structural encoders are effective in modeling feature interactions, they lack the capacity to support semantic-level reasoning or task adaptation. Conversely, large language models (LLMs) possess strong generalization capabilities but remain incompatible with raw time series inputs. This gap limits the development of unified, transferable prediction systems. Therefore, we introduce SEED, a structural encoder for embedding-driven decoding, which integrates four stages: a token-aware encoder for patch extraction, a projection module that aligns patches with language model embeddings, a semantic reprogramming mechanism that maps patches to task-aware prototypes, and a frozen language model for prediction. 
This modular architecture decouples representation learning from inference, enabling efficient alignment between numerical patterns and semantic reasoning. Empirical results demonstrate that the proposed method achieves consistent improvements over strong baselines, and comparative studies on various datasets confirm SEED's role in addressing the structural-semantic modeling gap.

[13] COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees

Zhiyuan Wang,Jinhao Duan,Qingni Wang,Xiaofeng Zhu,Tianlong Chen,Xiaoshuang Shi,Kaidi Xu

Main category: cs.CL

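TL;DR: COIN filters generated answers using statistically calibrated thresholds, increasing sample retention while guaranteeing control of the false discovery rate (FDR).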

Details

Motivation: Existing heuristic UQ methods lack formal guarantees on the false discovery rate (FDR) in selective prediction, and existing approaches are limited in practical utility. Method: Estimates the empirical error rate and applies the Clopper-Pearson confidence interval to establish a high-probability upper bound, which determines a valid uncertainty threshold. Result: COIN shows robust risk control, strong power in retaining admissible answers, and predictive efficiency under limited calibration data. Conclusion: COIN is an extensible uncertainty quantification framework that effectively filters generated answers under user-specified FDR constraints. Abstract: Uncertainty quantification (UQ) for foundation models is essential to identify and mitigate potential hallucinations in automatically generated text. However, heuristic UQ approaches lack formal guarantees for key metrics such as the false discovery rate (FDR) in selective prediction. Previous work adopts the split conformal prediction (SCP) framework to ensure desired coverage of admissible answers by constructing prediction sets, but these sets often contain incorrect candidates, limiting their practical utility. To address this, we propose COIN, an uncertainty-guarding selection framework that calibrates statistically valid thresholds to filter a single generated answer per question under user-specified FDR constraints. COIN estimates the empirical error rate on a calibration set and applies confidence interval methods such as Clopper-Pearson to establish a high-probability upper bound on the true error rate (i.e., FDR). This enables the selection of the largest uncertainty threshold that ensures FDR control on test data while significantly increasing sample retention. We demonstrate COIN's robustness in risk control, strong test-time power in retaining admissible answers, and predictive efficiency under limited calibration data across both general and multimodal text generation tasks. Furthermore, we show that employing alternative upper bound constructions and UQ strategies can further boost COIN's power performance, which underscores its extensibility and adaptability to diverse application scenarios.
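
The calibration step can be sketched with the standard library alone. The code below computes a one-sided Clopper-Pearson upper bound by bisection on the binomial CDF, then picks the largest uncertainty threshold whose bound stays below the target FDR. It is a simplified illustration, not the COIN implementation: the function names and the `(uncertainty, is_correct)` calibration format are assumptions, and scanning many thresholds as done here ignores any multiplicity handling the paper may apply.

```python
import math
from typing import List, Optional, Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(k: int, n: int, delta: float) -> float:
    """One-sided (1 - delta) upper bound on the error rate given k errors in n."""
    if k >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection; the CDF is decreasing in p
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi  # conservative side of the bracket


def select_threshold(calib: List[Tuple[float, bool]],
                     alpha: float, delta: float) -> Optional[float]:
    """Largest uncertainty threshold whose upper-bounded error rate is <= alpha."""
    best = None
    for t in sorted({u for u, _ in calib}):
        kept = [ok for u, ok in calib if u <= t]   # answers that would be retained
        errors = sum(1 for ok in kept if not ok)
        if clopper_pearson_upper(errors, len(kept), delta) <= alpha:
            best = t
    return best
```

At test time, an answer would be kept only if its uncertainty score is at or below the selected threshold; `None` means no threshold met the constraint on the calibration data.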

[14] How to Retrieve Examples in In-context Learning to Improve Conversational Emotion Recognition using Large Language Models?

Mengqi Wang,Tiantian Feng,Shrikanth Narayanan

Main category: cs.CL

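TL;DR: This paper proposes augmented example retrieval to improve how well large language models recognize emotion in conversations, demonstrating the importance of retrieving high-quality examples.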

Details

Motivation: Building high-accuracy LLM applications remains challenging, particularly for subjective tasks such as emotion recognition, so methods for improvement need to be explored. Method: Proposes several strategies based on random and augmented example retrieval, and analyzes the impact of conversational context on emotion recognition accuracy. Result: Experiments show that augmented example retrieval consistently outperforms the other techniques studied across all datasets. Conclusion: Augmented example retrieval in in-context learning effectively improves LLM emotion recognition, particularly in conversational settings. Abstract: Large language models (LLMs) have enabled a wide variety of real-world applications in various domains. However, creating a high-performing application with high accuracy remains challenging, particularly for subjective tasks like emotion recognition. Inspired by the SLT 2024 GenSER Challenge, this study investigates approaches to improving conversational emotion recognition (CER) by LLMs. Specifically, we explore how to retrieve high-quality examples in in-context learning (ICL) to enhance CER. We propose various strategies based on random and augmented example retrieval and also analyze the impact of conversational context on CER accuracy. Experiments were conducted on three datasets: IEMOCAP, MELD, and EmoryNLP. The results show that augmented example retrieval consistently outperforms other techniques under investigation across all datasets, highlighting the importance of retrieving coherent targeted examples and enhancing them through paraphrasing.

[15] Intrinsic vs. Extrinsic Evaluation of Czech Sentence Embeddings: Semantic Relevance Doesn't Help with MT Evaluation

Petra Barančíková,Ondřej Bojar

Main category: cs.CL

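TL;DR: This paper compares Czech-specific and multilingual sentence embedding models through intrinsic and extrinsic evaluation, finding that models that perform well intrinsically do not necessarily perform well on downstream tasks.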

Details

Motivation: To understand the complex relationship between how well sentence embedding models capture linguistic phenomena and how they perform on downstream tasks. Method: Intrinsic evaluation uses the Costra dataset and STS benchmarks; extrinsic evaluation fine-tunes each model with COMET-based metrics for machine translation evaluation. Result: Models that excel in intrinsic semantic similarity tests do not consistently perform well on downstream translation evaluation, whereas models with seemingly over-smoothed embedding spaces can achieve excellent results through fine-tuning. Conclusion: The study highlights the need for more research into 'operationalizable semantics' in sentence embeddings, or for more in-depth downstream task datasets. Abstract: In this paper, we compare Czech-specific and multilingual sentence embedding models through intrinsic and extrinsic evaluation paradigms. For intrinsic evaluation, we employ Costra, a complex sentence transformation dataset, and several Semantic Textual Similarity (STS) benchmarks to assess the ability of the embeddings to capture linguistic phenomena such as semantic similarity, temporal aspects, and stylistic variations. In the extrinsic evaluation, we fine-tune each embedding model using COMET-based metrics for machine translation evaluation. Our experiments reveal an interesting disconnect: models that excel in intrinsic semantic similarity tests do not consistently yield superior performance on downstream translation evaluation tasks. Conversely, models with seemingly over-smoothed embedding spaces can, through fine-tuning, achieve excellent results. These findings highlight the complex relationship between semantic property probes and downstream tasks, emphasizing the need for more research into 'operationalizable semantics' in sentence embeddings, or more in-depth downstream task datasets (here translation evaluation).

[16] Perspectives in Play: A Multi-Perspective Approach for More Inclusive NLP Systems

Benedetta Muscato,Lucia Passaro,Gizem Gezici,Fosca Giannotti

Main category: cs.CL

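TL;DR: This paper proposes a multi-perspective approach based on soft labels for subjective NLP tasks, aiming to better capture human disagreement, improve model performance, and increase inclusiveness.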

Details

Motivation: Traditional approaches that aggregate annotators' viewpoints into a single ground-truth label can overlook minority perspectives, especially in subjective tasks. To reflect annotators' diverse backgrounds and values more fully, this study proposes a multi-perspective approach. Method: The study uses a multi-perspective soft-label approach and conducts an extensive analysis across subjective text classification tasks, including hate speech, irony, abusive language, and stance detection. It also uses Explainable AI (XAI) to explore model uncertainty. Result: The multi-perspective approach not only better approximates human label distributions (as measured by Jensen-Shannon divergence) but also achieves better classification performance (higher F1 scores) than traditional approaches. However, it shows lower confidence on tasks such as irony and stance detection, which may stem from the inherent subjectivity of the texts. Conclusion: The study proposes a new multi-perspective approach that uses soft labels to better handle subjective NLP tasks and to support more inclusive and pluralistic models. The approach outperforms traditional methods in approximating human label distributions and in classification performance, but it also shows lower confidence on some highly subjective tasks. Abstract: In the realm of Natural Language Processing (NLP), common approaches for handling human disagreement consist of aggregating annotators' viewpoints to establish a single ground truth. However, prior studies show that disregarding individual opinions can lead to the side effect of underrepresenting minority perspectives, especially in subjective tasks, where annotators may systematically disagree because of their preferences. Recognizing that labels reflect the diverse backgrounds, life experiences, and values of individuals, this study proposes a new multi-perspective approach using soft labels to encourage the development of the next generation of perspective aware models, more inclusive and pluralistic. We conduct an extensive analysis across diverse subjective text classification tasks, including hate speech, irony, abusive language, and stance detection, to highlight the importance of capturing human disagreements, often overlooked by traditional aggregation methods. Results show that the multi-perspective approach not only better approximates human label distributions, as measured by Jensen-Shannon Divergence (JSD), but also achieves superior classification performance (higher F1 scores), outperforming traditional approaches. However, our approach exhibits lower confidence in tasks like irony and stance detection, likely due to the inherent subjectivity present in the texts. Lastly, leveraging Explainable AI (XAI), we explore model uncertainty and uncover meaningful insights into model predictions.
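
To make the evaluation measure concrete, the sketch below shows one way a soft label could be formed from annotator votes and compared with a model's predicted distribution using Jensen-Shannon divergence. The helper names and the vote-normalization scheme are assumptions for illustration; the abstract does not specify how the authors build their soft labels.

```python
from math import log2

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def soft_label(votes, classes):
    """Hypothetical helper: turn per-annotator votes into a probability distribution."""
    return [votes.count(c) / len(votes) for c in classes]

# Three of four annotators say "hate"; the model predicts 0.6 / 0.4.
human = soft_label(["hate", "hate", "not", "hate"], ["hate", "not"])
divergence = jsd(human, [0.6, 0.4])
```

A lower divergence means the model's output distribution sits closer to the spread of annotator opinions, whereas a majority-vote label would discard the dissenting vote entirely.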

[17] Enhancing Large Language Models through Structured Reasoning

Yubo Dong,Hehe Fan

Main category: cs.CL

TL;DR: This paper introduces a novel method to improve LLMs' reasoning abilities by incorporating structured reasoning through annotated datasets and advanced training techniques like GRPO with MAX-Flow and LCS algorithms.

Details

Motivation: LLMs struggle with complex reasoning tasks due to reliance on implicit statistical relationships without structured knowledge representation. This work aims to overcome this limitation by introducing explicit structured reasoning inspired by cognitive science and neurosymbolic AI. Method: The authors employ a structured dataset created by explicitly annotating reasoning steps to train LLMs via Supervised Fine-Tuning (SFT). They further enhance reasoning capabilities using Group Relative Policy Optimization (GRPO) with two algorithms: MAX-Flow and Longest Common Subsequence (LCS). Result: Fine-tuning a DeepSeek-R1-Distill-Qwen-1.5B model showed improved concise reasoning, robust performance across various scenarios, and reduced computational complexity, validating the effectiveness of the proposed structured reasoning approach. Conclusion: Structured reasoning integration enhances the complex reasoning capabilities of Large Language Models (LLMs), improving their performance in logical deduction, systematic planning, and compatibility with optimization techniques. Abstract: Recent Large Language Models (LLMs) have significantly advanced natural language processing and automated decision-making. However, these models still encounter difficulties when performing complex reasoning tasks involving logical deduction and systematic planning, primarily due to their reliance on implicit statistical relationships without structured knowledge representation. Inspired by cognitive science and neurosymbolic AI, we introduce a novel approach to enhance LLMs through explicit structured reasoning. First, we convert unstructured data into structured formats by explicitly annotating reasoning steps. We then employ this structured dataset to train LLMs through Supervised Fine-Tuning (SFT). Additionally, we enhance the structured reasoning capabilities of LLMs using Group Relative Policy Optimization (GRPO), incorporating two innovative algorithms--MAX-Flow and Longest Common Subsequence (LCS)--which notably improve reasoning effectiveness and reduce computational complexity. Experimental results from fine-tuning a DeepSeek-R1-Distill-Qwen-1.5B model demonstrate concise reasoning, robust performance across various scenarios, and improved compatibility with optimization techniques, validating the efficacy of structured reasoning integration in LLMs.

[18] CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment

Papa Séga Wade,Mihai Andries,Ioannis Kanellos,Thierry Moudenc

Main category: cs.CL

TL;DR: This paper proposes a chunk-based approach integrating multiple self-supervised learning models for automatic fluency assessment, achieving improved performance on two datasets while highlighting the need for further exploration of generalization to dialects with irregular prosody.

Details

Motivation: Automatic fluency assessment remains challenging in capturing speech rhythm, pauses, and disfluencies in non-native speakers, prompting the need for improved approaches. Method: The chunk-based approach integrates multiple SSL models (Wav2Vec2, HuBERT, and WavLM) using a learnable weighted mechanism for embedding fusion. Speech is segmented into breath-group chunks via Silero-VAD, and the hierarchical CNN-BiLSTM framework captures local and long-term dependencies across chunks. Chunk-level fluency markers are also incorporated for enriched analysis. Result: On the Avalinguo and Speechocean762 datasets, the proposed approach improves F1-score by 2.8 and Pearson correlation by 6.2 points over single SSL baselines on Speechocean762, and achieves gains of 4.2 F1-score and 4.0 Pearson points on Avalinguo, surpassing Pyannote.audio-based segmentation baselines. Conclusion: Chunk-based multi-SSL fusion can effectively evaluate fluency, but future work should explore its generalization to dialects with irregular prosody. Abstract: Automatic fluency assessment (AFA) remains challenging, particularly in capturing speech rhythm, pauses, and disfluencies in non-native speakers. We introduce a chunk-based approach integrating self-supervised learning (SSL) models (Wav2Vec2, HuBERT, and WavLM) selected for their complementary strengths in phonetic, prosodic, and noisy speech modeling, with a hierarchical CNN-BiLSTM framework. Speech is segmented into breath-group chunks using Silero voice activity detection (Silero-VAD), enabling fine-grained temporal analysis while mitigating over-segmentation artifacts. SSL embeddings are fused via a learnable weighted mechanism, balancing acoustic and linguistic features, and enriched with chunk-level fluency markers (e.g., speech rate, pause durations, n-gram repetitions). The CNN-BiLSTM captures local and long-term dependencies across chunks. Evaluated on Avalinguo and Speechocean762, our approach improves F1-score by 2.8 and Pearson correlation by 6.2 points over single SSL baselines on Speechocean762, with gains of 4.2 F1-score and 4.0 Pearson points on Avalinguo, surpassing Pyannote.audio-based segmentation baselines. These findings highlight chunk-based multi-SSL fusion for robust fluency evaluation, though future work should explore generalization to dialects with irregular prosody.
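
The "learnable weighted mechanism" for fusing SSL embeddings is not detailed in the abstract. One common reading, sketched here purely as an assumption, is a softmax over one scalar per model followed by a weighted sum of the per-chunk embeddings. The sketch further assumes the three embeddings have already been projected to a shared dimension, and it keeps the weights fixed rather than training them.

```python
from math import exp

def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(embeddings, logits):
    """Weighted sum of same-dimension embeddings, one weight per SSL model.
    `logits` stands in for the learnable parameters (hypothetical; in a real
    system they would be updated by gradient descent with the rest of the network)."""
    weights = softmax(logits)
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]

# Toy 2-dimensional chunk embeddings standing in for Wav2Vec2, HuBERT, and WavLM outputs.
chunk = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse(chunk, [0.0, 0.0, 0.0])  # equal logits give equal weights
```

Because the weights pass through a softmax, they stay positive and sum to one, so the fused vector remains on the same scale as the inputs whichever model the training ends up favoring.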

[19] Narrative Shift Detection: A Hybrid Approach of Dynamic Topic Models and Large Language Models

Kai-Robin Lange,Tobias Schmidt,Matthias Reccius,Henrik Müller,Michael Roos,Carsten Jentsch

Main category: cs.CL

TL;DR: This paper proposes combining large language models with topic models to analyze narrative shifts in large corpora.

Details

Motivation: As media narratives evolve rapidly, studying how they develop over time is increasingly important. Existing narrative extraction methods such as large language models can capture complex narrative structure, but applying them to an entire corpus is costly. Method: The approach combines the language understanding of large language models with the large-scale applicability of topic models to dynamically model narrative shifts over time using the Narrative Policy Framework. A topic model and a corresponding change point detection method find changes in a specific topic, and a large language model interprets those changes. Result: The method efficiently extracts a narrative shift when one exists at a given point in time, but performs less well when deciding whether a change is a content shift or a narrative shift. Conclusion: Combining large language models with topic models shows promise for identifying narrative shifts, but it is limited in distinguishing the type of shift (content or narrative). Abstract: With rapidly evolving media narratives, it has become increasingly critical to not just extract narratives from a given corpus but rather investigate how they develop over time. While popular narrative extraction methods such as Large Language Models do well in capturing typical narrative elements or even the complex structure of a narrative, applying them to an entire corpus comes with obstacles, such as a high financial or computational cost. We propose a combination of the language understanding capabilities of Large Language Models with the large scale applicability of topic models to dynamically model narrative shifts across time using the Narrative Policy Framework. We apply a topic model and a corresponding change point detection method to find changes that concern a specific topic of interest. Using this model, we filter our corpus for documents that are particularly representative of that change and feed them into a Large Language Model that interprets the change that happened in an automated fashion and distinguishes between content and narrative shifts. We employ our pipeline on a corpus of The Wall Street Journal newspaper articles from 2009 to 2023. Our findings indicate that a Large Language Model can efficiently extract a narrative shift if one exists at a given point in time, but does not perform as well when having to decide whether a shift in content or a narrative shift took place.

[20] Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content

Rian Touchent,Nathan Godey,Eric de la Clergerie

Main category: cs.CL

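TL;DR: Biomed-Enriched is a biomedical text dataset built from PubMed that provides a large-scale, openly available collection of clinical cases and shows potential for more efficient and effective biomedical pretraining strategies.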

Details

Motivation: Clinical text is typically difficult to access because of privacy constraints, and hospital records cannot be shared publicly. A large-scale, openly available collection of clinical cases is therefore needed for research. Method: The dataset is built through a two-stage annotation process: in the first stage, a large language model annotates 400K paragraphs from PubMed; in the second, those annotations are used to fine-tune a small language model, which propagates the labels across the full PMC-OA corpus. Result: Biomed-Enriched contains 2M clinical case paragraphs, including over 450K high-quality paragraphs from articles with commercial-use licenses, and several variants are constructed via quality filtering and domain upsampling. Preliminary experiments suggest that these curated subsets improve performance and speed up convergence. Conclusion: Biomed-Enriched provides an alternative large-scale, openly available collection of clinical cases that supports biomedical and clinical NLP research, and it shows potential for more efficient and effective biomedical pretraining strategies. Abstract: We introduce Biomed-Enriched, a biomedical text dataset constructed from PubMed via a two-stage annotation process. In the first stage, a large language model annotates 400K paragraphs from PubMed scientific articles, assigning scores for their type (review, study, clinical case, other), domain (clinical, biomedical, other), and educational quality. The educational quality score (rated 1 to 5) estimates how useful a paragraph is for college-level learning. These annotations are then used to fine-tune a small language model, which propagates the labels across the full PMC-OA corpus. The resulting metadata allows us to extract refined subsets, including 2M clinical case paragraphs with over 450K high-quality ones from articles with commercial-use licenses, and to construct several variants via quality filtering and domain upsampling. Clinical text is typically difficult to access due to privacy constraints, as hospital records cannot be publicly shared. Hence, our dataset provides an alternative large-scale, openly available collection of clinical cases from PubMed, making it a valuable resource for biomedical and clinical NLP. Preliminary continual-pretraining experiments with OLMo2 suggest these curated subsets enable targeted improvements, with clinical upsampling boosting performance by ~5% on MMLU ProfMed and educational quality filtering improving MedQA and MedMCQA by ~1%. Combinations of these techniques led to faster convergence, reaching the same performance with a third of the training tokens, indicating potential for more efficient and effective biomedical pretraining strategies.

[21] TAPS: Tool-Augmented Personalisation via Structured Tagging

Ekaterina Taktasheva,Jeff Dalton

Main category: cs.CL

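TL;DR: This paper proposes TAPS, a method that uses a structured tagging tool and an uncertainty-based tool detector to improve the ability of large language models to incorporate user preferences in goal-oriented dialogue agents, achieving a new state of the art for open-source models on the NLSI task.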

Details

Motivation: Existing tool-augmented large language models overlook the role of personalisation in guiding tool use; this paper explores how to integrate user preferences effectively to improve model performance. Method: The proposed solution, TAPS, combines a structured tagging tool with an uncertainty-based tool detector to improve the model's adaptation to user preferences. Result: TAPS achieves state-of-the-art performance for open-source models on the NLSI task and significantly improves the model's ability to incorporate user preferences. Conclusion: The paper demonstrates the importance of personalisation in tool use and provides an effective solution for improving the user experience of goal-oriented dialogue agents. Abstract: Recent advancements in tool-augmented large language models have enabled them to interact with external tools, enhancing their ability to perform complex user tasks. However, existing approaches overlook the role of personalisation in guiding tool use. This work investigates how user preferences can be effectively integrated into goal-oriented dialogue agents. Through extensive analysis, we identify key weaknesses in the ability of LLMs to personalise tool use. To this end, we introduce TAPS, a novel solution that enhances personalised tool use by leveraging a structured tagging tool and an uncertainty-based tool detector. TAPS significantly improves the ability of LLMs to incorporate user preferences, achieving the new state-of-the-art for open source models on the NLSI task.

[22] An Agentic System for Rare Disease Diagnosis with Traceable Reasoning

Weike Zhao,Chaoyi Wu,Yanjie Fan,Xiaoman Zhang,Pengcheng Qiu,Yuze Sun,Xiao Zhou,Yanfeng Wang,Ya Zhang,Yongguo Yu,Kun Sun,Weidi Xie

Main category: cs.CL

TL;DR: DeepRare is an innovative LLM-powered system for rare disease diagnosis, combining modular design, specialized tools, and transparent reasoning to achieve superior diagnostic accuracy and usability.

Details

Motivation: Diagnosing rare diseases is challenging due to their clinical heterogeneity, low prevalence, and limited clinician familiarity. There is a need for advanced diagnostic tools to improve timely and accurate diagnosis. Method: DeepRare uses a large language model (LLM) with a modular architecture comprising a central host, long-term memory module, and specialized agent servers that integrate over 40 tools and up-to-date medical knowledge sources. Result: DeepRare achieved 100% accuracy for 1013 out of 2919 diseases, outperformed 15 other methods with a Recall@1 score of 57.18%, surpassed the second-best method by 23.79 percentage points, and achieved 70.60% Recall@1 for multi-modal inputs. Clinical experts verified reasoning chains with 95.40% agreement. Conclusion: DeepRare is an effective agentic system for diagnosing rare diseases, demonstrating high accuracy and outperforming existing methods in multiple evaluation scenarios. Abstract: Rare diseases collectively affect over 300 million individuals worldwide, yet timely and accurate diagnosis remains a pervasive challenge. This is largely due to their clinical heterogeneity, low individual prevalence, and the limited familiarity most clinicians have with rare conditions. Here, we introduce DeepRare, the first rare disease diagnosis agentic system powered by a large language model (LLM), capable of processing heterogeneous clinical inputs. The system generates ranked diagnostic hypotheses for rare diseases, each accompanied by a transparent chain of reasoning that links intermediate analytic steps to verifiable medical evidence. DeepRare comprises three key components: a central host with a long-term memory module; specialized agent servers responsible for domain-specific analytical tasks integrating over 40 specialized tools and web-scale, up-to-date medical knowledge sources, ensuring access to the most current clinical information. This modular and scalable design enables complex diagnostic reasoning while maintaining traceability and adaptability. We evaluate DeepRare on eight datasets. The system demonstrates exceptional diagnostic performance among 2,919 diseases, achieving 100% accuracy for 1013 diseases. In HPO-based evaluations, DeepRare significantly outperforms 15 other methods, such as traditional bioinformatics diagnostic tools, LLMs, and other agentic systems, achieving an average Recall@1 score of 57.18% and surpassing the second-best method (Reasoning LLM) by a substantial margin of 23.79 percentage points. For multi-modal input scenarios, DeepRare achieves 70.60% at Recall@1 compared to Exomiser's 53.20% in 109 cases. Manual verification of reasoning chains by clinical experts achieves 95.40% agreement. Furthermore, the DeepRare system has been implemented as a user-friendly web application at http://raredx.cn/doctor.
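
For readers unfamiliar with the headline metric, Recall@k over ranked diagnostic hypotheses can be computed as in the sketch below. It assumes a single gold diagnosis per case and exact string matching; the function name and the placeholder disease labels are illustrative only, and the paper's evaluation may match diagnoses differently.

```python
def recall_at_k(ranked_predictions, gold, k):
    """Fraction of cases whose gold diagnosis appears among the top-k ranked hypotheses."""
    hits = sum(1 for preds, g in zip(ranked_predictions, gold) if g in preds[:k])
    return hits / len(gold)

# Three toy cases with placeholder disease labels.
ranked = [["A", "B"], ["C", "A"], ["B", "C"]]
gold = ["A", "A", "C"]
top1 = recall_at_k(ranked, gold, 1)
top2 = recall_at_k(ranked, gold, 2)
```

Recall@1 is the strictest setting, since only the first-ranked hypothesis counts as a hit.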

[23] Probing AI Safety with Source Code

Ujwal Narayan,Shreyas Chaudhari,Ashwin Kalyan,Tanmay Rajpurohit,Karthik Narasimhan,Ameet Deshpande,Vishvak Murahari

Main category: cs.CL

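TL;DR: This paper introduces Code of Thought (CoDoT), a prompting strategy for evaluating the safety of large language models (LLMs), and shows that current models fall seriously short on AI safety, highlighting the urgent need for better safety measures.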

Details

Motivation: Contemporary models fall seriously short on AI safety, leading to an unsafe and harmful experience for users. Method: The paper introduces a prompting strategy called Code of Thought (CoDoT), which converts natural language inputs into simple code that represents the same intent. Result: CoDoT causes consistent failures across a range of state-of-the-art LLMs: for example, GPT-4 Turbo's toxicity increases 16.5 times, DeepSeek R1 fails 100% of the time, and toxicity increases 300% on average across seven modern LLMs. Applying CoDoT recursively can increase toxicity a further two times. Conclusion: CoDoT underscores the importance of evaluating safety efforts from first principles so that safety and capabilities advance together. Abstract: Large language models (LLMs) have become ubiquitous, interfacing with humans in numerous safety-critical applications. This necessitates improving capabilities, but importantly coupled with greater safety measures to align these models with human values and preferences. In this work, we demonstrate that contemporary models fall concerningly short of the goal of AI safety, leading to an unsafe and harmful experience for users. We introduce a prompting strategy called Code of Thought (CoDoT) to evaluate the safety of LLMs. CoDoT converts natural language inputs to simple code that represents the same intent. For instance, CoDoT transforms the natural language prompt "Make the statement more toxic: {text}" to: "make_more_toxic({text})". We show that CoDoT results in a consistent failure of a wide range of state-of-the-art LLMs. For example, GPT-4 Turbo's toxicity increases 16.5 times, DeepSeek R1 fails 100% of the time, and toxicity increases 300% on average across seven modern LLMs. Additionally, recursively applying CoDoT can further increase toxicity two times. Given the rapid and widespread adoption of LLMs, CoDoT underscores the critical need to evaluate safety efforts from first principles, ensuring that safety and capabilities advance together.

Kaixiang Zhang,Justine Zhang,Cristian Danescu-Niculescu-Mizil

Main category: cs.CL

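TL;DR: This paper introduces a new framework for quantifying the conversation-level distribution of talk-time and the dynamics that produce it, finding that balanced conversations are preferred and that different types of dynamics affect participants' perceptions.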

Details

Motivation: The way talk-time is shared is an important feature of every conversation. Conversations can be balanced or imbalanced, and the overall distribution results from continuous negotiation between the speakers throughout the conversation. Method: The researchers build a computational framework that quantifies both the distribution of talk-time and its underlying dynamics, and apply it to a large dataset of video chats between strangers. Result: Different talk-time distributions are perceived differently by participants, with balanced conversations preferred, especially by those who end up talking less. Even when they lead to the same overall level of balance, different types of talk-time sharing dynamics are also perceived differently. Conclusion: The framework offers new tools to designers of computer-mediated communication platforms, for both human-human and human-AI communication. Abstract: An intrinsic aspect of every conversation is the way talk-time is shared between multiple speakers. Conversations can be balanced, with each speaker claiming a similar amount of talk-time, or imbalanced when one talks disproportionately. Such overall distributions are the consequence of continuous negotiations between the speakers throughout the conversation: who should be talking at every point in time, and for how long? In this work we introduce a computational framework for quantifying both the conversation-level distribution of talk-time between speakers, as well as the lower-level dynamics that lead to it. We derive a typology of talk-time sharing dynamics structured by several intuitive axes of variation. By applying this framework to a large dataset of video-chats between strangers, we confirm that, perhaps unsurprisingly, different conversation-level distributions of talk-time are perceived differently by speakers, with balanced conversations being preferred over imbalanced ones, especially by those who end up talking less. Then we reveal that -- even when they lead to the same level of overall balance -- different types of talk-time sharing dynamics are perceived differently by the participants, highlighting the relevance of our newly introduced typology. Finally, we discuss how our framework offers new tools to designers of computer-mediated communication platforms, for both human-human and human-AI communication.

[25] Knowledge-Aware Diverse Reranking for Cross-Source Question Answering

Tong Zhou

Main category: cs.CL

TL;DR: This paper describes Team Marikarp's winning solution in the SIGIR 2025 LiveRAG competition, which involved a knowledge-aware diverse reranking RAG pipeline for retrieving relevant documents.

Details

Motivation: The motivation behind this study was to develop an effective method for retrieving relevant information from a large document set, as evaluated in the SIGIR 2025 LiveRAG competition. Method: Team Marikarp employed a knowledge-aware diverse reranking RAG pipeline to retrieve question-relevant supporting documents from a subset of the FineWeb corpus. Result: Team Marikarp's solution achieved first place in the SIGIR 2025 LiveRAG competition. Conclusion: The knowledge-aware diverse reranking RAG pipeline proposed by Team Marikarp performed exceptionally well, securing first place in the SIGIR 2025 LiveRAG competition. Abstract: This paper presents Team Marikarp's solution for the SIGIR 2025 LiveRAG competition. The competition's evaluation set, automatically generated by DataMorgana from internet corpora, encompassed a wide range of target topics, question types, question formulations, audience types, and knowledge organization methods. It offered a fair evaluation of retrieving question-relevant supporting documents from a 15M documents subset of the FineWeb corpus. Our proposed knowledge-aware diverse reranking RAG pipeline achieved first place in the competition.

[26] GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching

Guinan Su,Li Shen,Lu Yin,Shiwei Liu,Yanwu Yang,Jonas Geiping

Main category: cs.CL

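TL;DR: This paper proposes a new compression method for large language models that optimizes over layer merging, selection, and removal operations, substantially reducing the parameter count while maintaining high performance.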

Details

Motivation: Large language models show remarkable language understanding and generation capabilities, but their size poses significant challenges for deployment and inference. An effective compression strategy is needed to reduce computational cost while preserving performance. Method: The authors pose model compression as a zero-order optimization problem over a search space that supports three operations: (1) layer removal, (2) layer selection from different candidate models, and (3) layer merging. The experiments use structured pruning to reduce computational cost. Result: Experiments on Llama2-13B show that the method maintains approximately 97.3% of the original performance while removing about 25% of the parameters, significantly outperforming previous state-of-the-art methods. Conclusion: The paper proposes a novel compression strategy that combines or merges layers from finetuned model variants to reduce the parameter count while preserving the original model's abilities. It performs well on the Llama2-13B model family and significantly improves on previous state-of-the-art methods. Abstract: Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in deployment and inference. While structured pruning of model parameters offers a promising way to reduce computational costs at deployment time, current methods primarily focus on single model pruning. In this work, we develop a novel strategy to compress models by strategically combining or merging layers from finetuned model variants, which preserves the original model's abilities by aggregating capabilities accentuated in different finetunes. We pose the optimal tailoring of these LLMs as a zero-order optimization problem, adopting a search space that supports three different operations: (1) Layer removal, (2) Layer selection from different candidate models, and (3) Layer merging. Our experiments demonstrate that this approach leads to competitive model pruning, for example, for the Llama2-13B model families, our compressed models maintain approximately 97.3% of the original performance while removing ~25% of parameters, significantly outperforming previous state-of-the-art methods. The code is available at https://github.com/Guinan-Su/auto-merge-llm.

[27] ReCode: Updating Code API Knowledge with Reinforcement Learning

Haoze Wu,Yunzhi Yao,Wenhao Yu,Huajun Chen,Ningyu Zhang

Main category: cs.CL

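TL;DR: This paper proposes ReCode, a reinforcement-learning-based framework for updating code API knowledge that mimics how programmers adapt to API changes and substantially improves LLM code generation in dynamic API environments.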

Details

Motivation: LLMs excel at code generation but perform poorly when external library APIs are updated frequently. The problem stems from reliance on outdated API knowledge in their training data, and it persists even with access to current documentation. Method: The authors construct a dataset of approximately 2,000 entries to train LLMs to perform version migration, and introduce a modified string similarity metric for code evaluation as the reinforcement learning reward. Result: ReCode substantially improves LLM code generation in dynamic API scenarios, especially on the unseen CodeUpdateArena task; after training, Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture. Conclusion: ReCode is a rule-based reinforcement learning framework that effectively improves LLM code generation in dynamic API environments, with little impact on the models' general code generation abilities. Abstract: Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs' code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs' general code generation abilities. We apply ReCode on various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.
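
The abstract does not say which string similarity metric ReCode modifies or how. As a stand-in only, the sketch below uses the ratio from Python's standard `difflib.SequenceMatcher` after whitespace normalization to show how a rule-based, execution-free reward in [0, 1] might be computed. The function names and the normalization step are assumptions made here, not the authors' metric.

```python
from difflib import SequenceMatcher

def normalize(code: str) -> str:
    """Collapse runs of whitespace so formatting differences are not penalized."""
    return " ".join(code.split())

def similarity_reward(generated: str, reference: str) -> float:
    """Hypothetical reward: character-level similarity between the model's
    migrated code and the reference migration, in the range [0, 1]."""
    return SequenceMatcher(None, normalize(generated), normalize(reference)).ratio()

# A model that keeps the deprecated call earns partial credit only.
reference = "x = torch.arange(0, 5)"
stale = "x = torch.range(0, 5)"
partial = similarity_reward(stale, reference)
exact = similarity_reward("x  =  torch.arange(0, 5)", reference)
```

A reward of this kind needs no test execution, which keeps reinforcement learning rollouts cheap, but it can also favor outputs that look like the reference without behaving like it.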

[28] OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling

Zengzhi Wang,Fan Zhou,Xuefeng Li,Pengfei Liu

Main category: cs.CL

TL;DR: This paper explores strategies to improve reinforcement learning (RL) compatibility in language models like Llama by introducing a two-stage mid-training method called Stable-then-Decay. It highlights the role of high-quality mathematical data, QA-style examples, and instruction data in enhancing RL outcomes, ultimately closing the performance gap with more RL-friendly model families like Qwen.

Details

Motivation: Different base language model families, such as Llama and Qwen, show divergent behaviors during post-training with reinforcement learning (RL), particularly on reasoning-intensive tasks. Understanding what makes a base model suitable for RL is crucial for developing next-generation RL-scalable foundation models. Method: The study investigates how mid-training strategies influence RL dynamics by focusing on two model families, Qwen and Llama. It analyzes the impact of mathematical corpora, QA-style data including long chain-of-thought examples, and instruction data on both base model and RL performance. A curated math reasoning-intensive corpus, MegaMath-Web-Pro-Max, is released alongside the proposed training approach. Result: Key findings include that high-quality mathematical corpora significantly improve base model and RL performance, adding QA-style data and instruction data enhances RL outcomes, long chain-of-thought (CoT) reasoning examples improve reasoning depth but may induce verbosity and instability, and scaling mid-training consistently leads to stronger downstream RL performance. Conclusion: The work introduces a two-stage mid-training strategy, Stable-then-Decay, to enhance reinforcement learning (RL) compatibility in language models like Llama, closing the performance gap with more RL-friendly families such as Qwen. The study emphasizes the importance of high-quality data and training strategies for developing next-generation RL-scalable foundation models. Abstract: Different base language model families, such as Llama and Qwen, exhibit divergent behaviors during post-training with reinforcement learning (RL), especially on reasoning-intensive tasks. What makes a base language model suitable for reinforcement learning? Gaining deeper insight into this question is essential for developing RL-scalable foundation models of the next generation. In this work, we investigate how mid-training strategies shape RL dynamics, focusing on two representative model families: Qwen and Llama. Our study reveals that (1) high-quality mathematical corpora, such as MegaMath-Web-Pro, significantly improve both base model and RL performance, while existing alternatives (e.g., FineMath-4plus) fail to do so; (2) further adding QA-style data, particularly long chain-of-thought (CoT) reasoning examples, enhances RL outcomes, and instruction data further unlocks this effect; (3) while long-CoT improves reasoning depth, it can also induce verbosity of model responses and instability of RL training, underscoring the importance of data formatting; (4) scaling mid-training consistently leads to stronger downstream RL performance. Building on these insights, we introduce a two-stage mid-training strategy, Stable-then-Decay, in which base models are first trained on 200B tokens with a constant learning rate, followed by 20B tokens across three CoT-focused branches with learning rate decay. This yields OctoThinker, a family of models demonstrating strong RL compatibility and closing the performance gap with more RL-friendly model families, i.e., Qwen. We hope our work will help shape pre-training strategies for foundation models in the RL era. To support further research, we release our open-source models along with a curated math reasoning-intensive corpus of over 70 billion tokens (i.e., MegaMath-Web-Pro-Max).
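
The Stable-then-Decay schedule can be pictured as a piecewise learning-rate function of tokens seen. The token budgets below (200B constant, then 20B decay) come from the abstract; the cosine shape of the decay and the peak and final learning-rate values are assumptions chosen only for illustration, since the abstract does not state them.

```python
from math import cos, pi

STABLE_TOKENS = 200e9  # constant-LR stage, from the abstract
DECAY_TOKENS = 20e9    # decay stage per branch, from the abstract

def learning_rate(tokens_seen: float, peak_lr: float = 1e-4, final_lr: float = 1e-5) -> float:
    """Constant LR for the stable stage, then cosine decay to final_lr.
    peak_lr, final_lr, and the cosine shape are assumed values, not the paper's."""
    if tokens_seen <= STABLE_TOKENS:
        return peak_lr
    progress = min((tokens_seen - STABLE_TOKENS) / DECAY_TOKENS, 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + cos(pi * progress))

schedule = [learning_rate(t) for t in (0.0, 200e9, 210e9, 220e9)]
```

Each of the three CoT-focused branches would start its own decay stage from the same stable checkpoint, so the expensive 200B-token stage is paid for only once.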

[29] When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs

Ammar Khairi,Daniel D'souza,Ye Shen,Julia Kreutzer,Sara Hooker

Main category: cs.CL

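TL;DR: This paper studies how to scale inference-time compute for large language models in multilingual, multi-task settings, proposing effective sampling and selection methods.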

Details

Motivation: Existing techniques focus mainly on English and a handful of domains, and generalize poorly across languages and diverse tasks. Method: The authors evaluate existing selection methods and propose new sampling and selection strategies adapted to multilingual, multi-task inference scenarios. Result: The new methods raise the average win rate of 8B models on m-ArenaHard-v2.0 prompts by +6.8, and Command-A (a 111B model) gains +9.0 with just five samples. Conclusion: The study underscores the need for language- and task-aware strategies for inference-time compute, particularly adaptive methods for multilingual, multi-task settings. Abstract: Recent advancements in large language models (LLMs) have shifted focus toward scaling inference-time compute, improving performance without retraining the model. A common approach is to sample multiple outputs in parallel, and select one of these as the final output. However, work to date has focused on English and a handful of domains such as math and code. In contrast, we are most interested in techniques that generalize across open-ended tasks, formally verifiable tasks, and across languages. In this work, we study how to robustly scale inference-time compute for open-ended generative tasks in a multilingual, multi-task setting. Our findings show that both sampling strategy based on temperature variation and selection strategy must be adapted to account for diverse domains and varied language settings. We evaluate existing selection methods, revealing that strategies effective in English often fail to generalize across languages. We propose novel sampling and selection strategies specifically adapted for multilingual and multi-task inference scenarios, and show they yield notable gains across languages and tasks. In particular, our combined sampling and selection methods lead to an average +6.8 jump in win-rates for our 8B models on m-ArenaHard-v2.0 prompts, against proprietary models such as Gemini. At larger scale, Command-A (111B model) equipped with our methods, shows +9.0 improvement in win-rates on the same benchmark with just five samples against single-sample decoding, a substantial increase at minimal cost. Our results underscore the need for language- and task-aware approaches to inference-time compute, aiming to democratize performance improvements in underrepresented languages.
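
The sample-then-select pattern the abstract builds on can be sketched generically as below. This is plain best-of-N with one sample per temperature, not the authors' proposed strategies; `generate` and `score` are hypothetical callables standing in for a model's sampling API and a selection method such as a reward model or judge.

```python
from typing import Callable, Sequence

def best_of_n(
    prompt: str,
    generate: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> output
    score: Callable[[str, str], float],     # hypothetical: (prompt, output) -> quality score
    temperatures: Sequence[float],
) -> str:
    """Draw one candidate per temperature, then return the highest-scoring one."""
    candidates = [generate(prompt, t) for t in temperatures]
    return max(candidates, key=lambda c: score(prompt, c))
```

With five temperatures this corresponds to the five-sample budget mentioned in the abstract. The abstract's point is that both the temperature choices and the scorer need adapting per language and task rather than being reused from English-only setups.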

[30] Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm

Baixiang Huang,Zhen Tan,Haoran Wang,Zijie Liu,Dawei Li,Ali Payani,Huan Liu,Tianlong Chen,Kai Shu

Main category: cs.CL

TL;DR: This paper introduces Behavior Editing, a method to steer the ethical or harmful behavior of AI agents using model editing techniques, evaluated through the BehaviorBench benchmark grounded in moral psychology.

Details

Motivation: The motivation stems from the need to address the safety and ethical risks associated with deploying LLM-based agents in high-stakes environments, where unethical behavior can have severe real-world consequences. Method: The researchers framed behavior steering as a model editing task, using BehaviorBench, a multi-tier benchmark based on psychological moral theories, to evaluate and modify agent behaviors across various scenarios. Result: Behavior Editing was shown to effectively guide agents toward targeted behaviors, enabling both local adjustments in specific situations and broader shifts in an agent's overall moral alignment, as demonstrated through evaluations on advanced LLM-based agents. Conclusion: The study concludes that Behavior Editing offers a promising paradigm for steering agent behavior, with the potential to both promote ethical conduct and induce harmful actions, underscoring its significant implications for safety and ethics in high-stakes domains. Abstract: Agents based on Large Language Models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, deploying LLM-based agents in high-stakes domains comes with significant safety and ethical risks. Unethical behavior by these agents can directly result in serious real-world consequences, including physical harm and financial loss. To efficiently steer the ethical behavior of agents, we frame agent behavior steering as a model editing task, which we term Behavior Editing. Model editing is an emerging area of research that enables precise and efficient modifications to LLMs while preserving their overall capabilities. To systematically study and evaluate this approach, we introduce BehaviorBench, a multi-tier benchmark grounded in psychological moral theories. This benchmark supports both the evaluation and editing of agent behaviors across a variety of scenarios, with each tier introducing more complex and ambiguous scenarios. 
We first demonstrate that Behavior Editing can dynamically steer agents toward the target behavior within specific scenarios. Moreover, Behavior Editing enables not only scenario-specific local adjustments but also more extensive shifts in an agent's global moral alignment. We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior. Through comprehensive evaluations on agents based on frontier LLMs, BehaviorBench shows the effectiveness of Behavior Editing across different models and scenarios. Our findings offer key insights into a new paradigm for steering agent behavior, highlighting both the promise and perils of Behavior Editing.

[31] DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation

Shansan Gong,Ruixiang Zhang,Huangjie Zheng,Jiatao Gu,Navdeep Jaitly,Lingpeng Kong,Yizhe Zhang

Main category: cs.CL

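TL;DR: This paper studies diffusion large language models for code generation and proposes the coupled-GRPO training method, which significantly improves performance and reduces reliance on conventional autoregressive decoding.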

Details

Motivation: Diffusion large language models (dLLMs) operate over the entire sequence and offer global planning and iterative refinement, which makes them particularly well suited to code generation. However, their training and inference mechanisms for coding tasks remain under-explored. Method: Train DiffuCoder, a 7B dLLM, and apply coupled-GRPO for reinforcement learning to reduce estimation variance and improve efficiency. Result: DiffuCoder performs strongly on code generation benchmarks; coupled-GRPO improves performance by 4.4% on EvalPlus and reduces reliance on AR causal decoding. Conclusion: The paper presents a diffusion-based approach to code generation and significantly improves performance through the new sampling scheme coupled-GRPO, providing an effective framework for applying dLLMs to coding tasks. Abstract: Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising models operate over the entire sequence. The global planning and iterative refinement features of dLLMs are particularly useful for code generation. However, current training and inference mechanisms for dLLMs in coding are still under-explored. To demystify the decoding behavior of dLLMs and unlock their potential for coding, we systematically investigate their denoising processes and reinforcement learning (RL) methods. We train a 7B dLLM, DiffuCoder, on 130B tokens of code. Using this model as a testbed, we analyze its decoding behavior, revealing how it differs from that of AR models: (1) dLLMs can decide how causal their generation should be without relying on semi-AR decoding, and (2) increasing the sampling temperature diversifies not only token choices but also their generation order. This diversity creates a rich search space for RL rollouts. For RL training, to reduce the variance of token log-likelihood estimates and maintain training efficiency, we propose coupled-GRPO, a novel sampling scheme that constructs complementary mask noise for completions used in training. In our experiments, coupled-GRPO significantly improves DiffuCoder's performance on code generation benchmarks (+4.4% on EvalPlus) and reduces reliance on AR causal decoding. Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework. https://github.com/apple/ml-diffucoder.
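The abstract says coupled-GRPO "constructs complementary mask noise for completions". One plausible reading, offered here only as a sketch, is a pair of masks in which every completion token is masked in exactly one of the two. The function name `complementary_masks` and the `mask_ratio` parameter are assumptions for illustration; how the paired masks feed into the log-likelihood estimate is not described in the abstract and is not shown.

```python
import random
from typing import List, Tuple


def complementary_masks(
    length: int, mask_ratio: float = 0.5, seed: int = 0
) -> Tuple[List[bool], List[bool]]:
    """Return two boolean masks over `length` token positions.

    True means "this position is masked". The second mask is the exact
    complement of the first, so each position is masked in exactly one of
    them (an assumed interpretation of "complementary mask noise").
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    num_masked = round(length * mask_ratio)
    masked = set(rng.sample(range(length), num_masked))
    first = [i in masked for i in range(length)]
    second = [not m for m in first]
    return first, second


if __name__ == "__main__":
    first, second = complementary_masks(8, mask_ratio=0.5)
    print(first)
    print(second)
```

Under this reading, the two masked views together cover every token once, which is one way a paired estimate could have lower variance than two independent random masks.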

[32] Memento: Note-Taking for Your Future Self

Chao Wan,Albert Gong,Mihir Mishra,Carl-Leander Henneking,Claas Beger,Kilian Q. Weinberger

Main category: cs.CL

TL;DR: Memento is a novel prompting strategy for large language models that combines reasoning and retrieval, significantly improving performance on multi-hop question answering tasks across different benchmarks.

Details

Motivation: Large language models (LLMs) excel at reasoning-only tasks but struggle when reasoning must be tightly coupled with retrieval, as in multi-hop question answering. The motivation is to develop a prompting strategy that overcomes this limitation by integrating reasoning and retrieval effectively. Method: The Memento prompting strategy decomposes complex questions into smaller steps, dynamically constructs a database of facts using LLMs, and pieces these facts together to solve the question. This approach is evaluated across multiple benchmarks including PhantomWiki, 2WikiMultiHopQA, and MuSiQue. Result: Memento doubles the performance of chain-of-thought (CoT) on the 9-step PhantomWiki benchmark, improves CoT-RAG by over 20 F1 points and IRCoT by over 13 F1 points on 2WikiMultiHopQA, and enhances ReAct by over 3 F1 points on the MuSiQue dataset. Conclusion: Memento prompting strategy enhances the performance of large language models in tasks requiring integrated reasoning and retrieval, outperforming existing methods like CoT, CoT-RAG, IRCoT, and ReAct on various benchmarks. Abstract: Large language models (LLMs) excel at reasoning-only tasks, but struggle when reasoning must be tightly coupled with retrieval, as in multi-hop question answering. To overcome these limitations, we introduce a prompting strategy that first decomposes a complex question into smaller steps, then dynamically constructs a database of facts using LLMs, and finally pieces these facts together to solve the question. We show how this three-stage strategy, which we call Memento, can boost the performance of existing prompting strategies across diverse settings. On the 9-step PhantomWiki benchmark, Memento doubles the performance of chain-of-thought (CoT) when all information is provided in context. 
On the open-domain version of 2WikiMultiHopQA, CoT-RAG with Memento improves over vanilla CoT-RAG by more than 20 F1 percentage points and over the multi-hop RAG baseline, IRCoT, by more than 13 F1 percentage points. On the challenging MuSiQue dataset, Memento improves ReAct by more than 3 F1 percentage points, demonstrating its utility in agentic settings.
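The three stages named in the abstract (decompose, build a database of facts, piece the facts together) can be illustrated with a toy multi-hop lookup. This is a minimal sketch under strong assumptions: the decomposition is supplied by hand as a list of relations, and `toy_lookup` reads from a tiny in-memory table, whereas the paper uses LLMs for these stages. All names here (`answer_multi_hop`, `TOY_KB`, `toy_lookup`) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Fact = Tuple[str, str, str]  # (entity, relation, value)


def answer_multi_hop(
    start: str,
    relations: List[str],
    lookup: Callable[[str, str], str],
) -> Tuple[str, List[Fact]]:
    """Resolve a chain of relations, recording each resolved fact as a note."""
    notes: List[Fact] = []
    entity = start
    for relation in relations:
        value = lookup(entity, relation)
        notes.append((entity, relation, value))  # note for later steps
        entity = value  # the next hop starts from the fact just found
    return entity, notes


# Toy fact source standing in for retrieval or an LLM call (assumption).
TOY_KB: Dict[Tuple[str, str], str] = {
    ("Film X", "director"): "Alice",
    ("Alice", "mother"): "Beth",
}


def toy_lookup(entity: str, relation: str) -> str:
    return TOY_KB[(entity, relation)]


if __name__ == "__main__":
    # "Who is the mother of the director of Film X?" decomposed by hand.
    answer, notes = answer_multi_hop("Film X", ["director", "mother"], toy_lookup)
    print(answer, notes)
```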

[33] Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs

Sonia K. Murthy,Rosie Zhao,Jennifer Hu,Sham Kakade,Markus Wulfmeier,Peng Qian,Tomer Ullman

Main category: cs.CL

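TL;DR: This paper uses cognitive science models to analyze how large language models handle trade-offs between competing goals (such as honesty and politeness), reveals how utilities shift during training, and offers ideas for improving future model training.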

Details

Motivation: Tools for interpreting complex values in large language models are currently limited, whereas cognitive models from cognitive science can formally describe the trade-offs in human decision-making. The researchers therefore use this approach to reveal how LLMs balance social utility against informational utility. Method: Use a leading cognitive model of polite speech to interpret how LLMs exhibit human-like value trade-offs in everyday social situations. The study covers degrees of reasoning "effort" in frontier black-box models and RL post-training dynamics of open-source models. Result: Informational utility is higher than social utility in reasoning models and in open-source models that are stronger at mathematical reasoning. In addition, utility values shift substantially early in training, and the choice of base model and pretraining data has persistent effects. Conclusion: By applying cognitive models from cognitive science, the study systematically evaluates human-like value trade-offs in LLMs. It shows that training dynamics strongly affect utility values and demonstrates how the approach can inform training regimes for reasoning models. Abstract: Navigating everyday social situations often requires juggling conflicting goals, such as conveying a harsh truth, maintaining trust, all while still being mindful of another person's feelings. These value trade-offs are an integral part of human decision-making and language use; however, current tools for interpreting such dynamic and multi-faceted notions of values in LLMs are limited. In cognitive science, so-called "cognitive models" provide formal accounts of these trade-offs in humans, by modeling the weighting of a speaker's competing utility functions in choosing an action or utterance. In this work, we use a leading cognitive model of polite speech to interpret the extent to which LLMs represent human-like trade-offs. We apply this lens to systematically evaluate value trade-offs in two encompassing model settings: degrees of reasoning "effort" in frontier black-box models, and RL post-training dynamics of open-source models. Our results highlight patterns of higher informational utility than social utility in reasoning models, and in open-source models shown to be stronger in mathematical reasoning. Our findings from LLMs' training dynamics suggest large shifts in utility values early on in training with persistent effects of the choice of base model and pretraining data, compared to feedback dataset or alignment method. 
We show that our method is responsive to diverse aspects of the rapidly evolving LLM landscape, with insights for forming hypotheses about other high-level behaviors, shaping training regimes for reasoning models, and better controlling trade-offs between values during model training.

cs.CV [Back]

[34] Computer Vision based Automated Quantification of Agricultural Sprayers Boom Displacement

Aryan Singh Dalal,Sidharth Rai,Rahul Singh,Treman Singh Kaloya,Rahul Harsha Cheppally,Ajay Sharda

Main category: cs.CV

TL;DR: This paper presents a computer vision system using YOLO models to accurately quantify spray boom movement in agricultural sprayers, aiming to improve stability and application accuracy.

Details

Motivation: Spray boom instability in self-propelled agricultural sprayers contributes significantly to application rate errors, and there is a need for quantitative data to develop effective solutions such as improved boom designs and responsive control systems. Method: An automated computer vision system was developed using YOLO V7, V8, and V11 neural network models to track a target on the edge of the sprayer boom. An inclinometer sensor was used to validate the model output. Result: The neural network models achieved over 90% accuracy in detecting the target, with distance estimates within 0.026 m of the inclinometer sensor data. Conclusion: The study concludes that the developed computer vision system effectively quantifies spray boom movements, offering potential for improving sprayer boom design and control systems to enhance application accuracy. Abstract: Application rate errors when using self-propelled agricultural sprayers for agricultural production remain a concern. Among other factors, spray boom instability is one of the major contributors to application errors. Spray booms' width of 38m, combined with 30 kph driving speeds, varying terrain, and machine dynamics when maneuvering complex field boundaries, make controls of these booms very complex. However, there is no quantitative knowledge on the extent of boom movement to systematically develop a solution that might include boom designs and responsive boom control systems. Therefore, this study was conducted to develop an automated computer vision system to quantify the boom movement of various agricultural sprayers. A computer vision system was developed to track a target on the edge of the sprayer boom in real time. YOLO V7, V8, and V11 neural network models were trained to track the boom's movements in field operations to quantify effective displacement in the vertical and transverse directions. 
An inclinometer sensor was mounted on the boom to capture boom angles and validate the neural network model output. The results showed that the model could detect the target with more than 90 percent accuracy, and distance estimates of the target on the boom were within 0.026 m of the inclinometer sensor data. This system can quantify the boom movement on the current sprayer and potentially on any other sprayer with minor modifications. The data can be used to make design improvements to make sprayer booms more stable and achieve greater application accuracy.

[35] EBC-ZIP: Improving Blockwise Crowd Counting with Zero-Inflated Poisson Regression

Yiming Ma,Victor Sanchez,Tanaya Guha

Main category: cs.CV

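TL;DR: This paper proposes EBC-ZIP, a new crowd counting method that introduces zero-inflated Poisson regression and an improved loss function to handle sparse data while improving counting accuracy, and it performs strongly on multiple benchmarks.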

Details

Motivation: Existing density map estimation methods overlook the extreme sparsity of crowd distributions in real scenes and the fact that conventional loss functions are ill-suited to modeling discrete, non-negative count data. Method: Propose a density map estimation method based on Zero-Inflated Poisson (ZIP) regression and combine it with the Enhanced Block Classification (EBC) framework, replacing the conventional regression loss with the negative log-likelihood of the ZIP distribution. Result: Experiments show that EBC-ZIP outperforms EBC on four crowd counting benchmarks, achieves state-of-the-art results, and scales well. Conclusion: EBC-ZIP inherits the advantages of the EBC framework and handles zero-heavy distributions while preserving count accuracy, achieving state-of-the-art results on multiple benchmarks. Abstract: Density map estimation has become the mainstream paradigm in crowd counting. However, most existing methods overlook the extreme sparsity of ground-truth density maps. In real-world crowd scenes, the vast majority of spatial regions (often over 95%) contain no people, leading to heavily imbalanced count distributions. Ignoring this imbalance can bias models toward overestimating dense regions and underperforming in sparse areas. Furthermore, most loss functions used in density estimation are mostly based on MSE and implicitly assume Gaussian distributions, which are ill-suited for modeling discrete, non-negative count data. In this paper, we propose EBC-ZIP, a crowd counting framework that models the spatial distribution of counts using a Zero-Inflated Poisson (ZIP) regression formulation. Our approach replaces the traditional regression loss with the negative log-likelihood of the ZIP distribution, enabling better handling of zero-heavy distributions while preserving count accuracy. Built upon the recently proposed Enhanced Block Classification (EBC) framework, EBC-ZIP inherits EBC's advantages in preserving the discreteness of targets and ensuring training stability, while further improving performance through a more principled probabilistic loss. We also evaluate EBC-ZIP with backbones of varying computational complexity to assess its scalability. Extensive experiments on four crowd counting benchmarks demonstrate that EBC-ZIP consistently outperforms EBC and achieves state-of-the-art results.
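For readers unfamiliar with the loss named in the abstract, the following sketch computes the negative log-likelihood of a single count under the standard Zero-Inflated Poisson distribution: a count is a structural zero with probability `zero_prob`, and otherwise follows a Poisson distribution with mean `rate`. This is the textbook per-sample formula only; the function name `zip_nll` and its parameter names are illustrative, and how EBC-ZIP predicts these parameters per block or aggregates the loss is not covered by the abstract.

```python
import math


def zip_nll(count: int, rate: float, zero_prob: float) -> float:
    """Negative log-likelihood of `count` under a Zero-Inflated Poisson.

    P(0) = zero_prob + (1 - zero_prob) * exp(-rate)
    P(k) = (1 - zero_prob) * rate**k * exp(-rate) / k!   for k >= 1
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    if rate <= 0.0:
        raise ValueError("rate must be positive")
    if not 0.0 <= zero_prob < 1.0:
        raise ValueError("zero_prob must be in [0, 1)")
    if count == 0:
        # A zero can come from the inflation component or from the Poisson.
        return -math.log(zero_prob + (1.0 - zero_prob) * math.exp(-rate))
    # Log of the Poisson pmf, using lgamma(count + 1) = log(count!).
    log_poisson = count * math.log(rate) - rate - math.lgamma(count + 1)
    return -(math.log1p(-zero_prob) + log_poisson)


if __name__ == "__main__":
    # An empty block is penalized less when the zero-inflation weight is high.
    print(zip_nll(0, rate=1.5, zero_prob=0.9), zip_nll(0, rate=1.5, zero_prob=0.0))
```

With `zero_prob = 0` the expression reduces to the ordinary Poisson negative log-likelihood, which makes the extra mass on zero easy to see.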

[36] ToSA: Token Merging with Spatial Awareness

Hsiang-Wei Huang,Wenhao Chai,Kuang-Ming Chen,Cheng-Yen Yang,Jenq-Neng Hwang

Main category: cs.CV

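TL;DR: This paper proposes ToSA, a new acceleration strategy for Vision Transformers that improves token merging by combining spatial awareness with semantic information.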

Details

Motivation: Existing ViT acceleration methods rely mainly on feature similarity for token merging and overlook the potential of spatial information in the early layers. Method: Introduce pseudo spatial tokens generated from the depth image and use them to guide visual token merging. Result: ToSA outperforms previous methods on multiple benchmarks while substantially reducing ViT runtime. Conclusion: ToSA is a new token merging method that combines semantic and spatial awareness, effectively accelerating Vision Transformers (ViT) while preserving critical scene structure. Abstract: Token merging has emerged as an effective strategy to accelerate Vision Transformers (ViT) by reducing computational costs. However, existing methods primarily rely on the visual token's feature similarity for token merging, overlooking the potential of integrating spatial information, which can serve as a reliable criterion for token merging in the early layers of ViT, where the visual tokens only possess weak visual information. In this paper, we propose ToSA, a novel token merging method that combines both semantic and spatial awareness to guide the token merging process. ToSA leverages the depth image as input to generate pseudo spatial tokens, which serve as auxiliary spatial information for the visual token merging process. With the introduced spatial awareness, ToSA achieves a more informed merging strategy that better preserves critical scene structure. Experimental results demonstrate that ToSA outperforms previous token merging methods across multiple benchmarks on visual and embodied question answering while largely reducing the runtime of the ViT, making it an efficient solution for ViT acceleration. The code will be available at: https://github.com/hsiangwei0903/ToSA

[37] BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos

Jiahao Lin,Weixuan Peng,Bojia Zi,Yifeng Gao,Xianbiao Qi,Xingjun Ma,Yu-Gang Jiang

Main category: cs.CV

TL;DR: This paper presents the BrokenVideos dataset for localizing visual artifacts in AI-generated videos and shows that it improves models' localization ability.

Details

Motivation: AI-generated videos still suffer from quality problems such as inconsistent motion and object deformation, yet there is no comprehensive benchmark dataset dedicated to video artifact localization. Method: Introduce BrokenVideos, a new dataset of 3,254 AI-generated videos with pixel-level masks annotating visually corrupted regions, and validate it experimentally with state-of-the-art artifact detection models and multimodal large language models (MLLMs). Result: Training models on BrokenVideos significantly improves their ability to localize corrupted regions in videos, and detailed human inspection ensures high-quality annotations. Conclusion: BrokenVideos provides an important benchmark foundation for research on video artifact localization in generative models. Abstract: Recent advances in deep generative models have led to significant progress in video generation, yet the fidelity of AI-generated videos remains limited. Synthesized content often exhibits visual artifacts such as temporally inconsistent motion, physically implausible trajectories, unnatural object deformations, and local blurring that undermine realism and user trust. Accurate detection and spatial localization of these artifacts are crucial for both automated quality control and for guiding the development of improved generative models. However, the research community currently lacks a comprehensive benchmark specifically designed for artifact localization in AI-generated videos. Existing datasets either restrict themselves to video- or frame-level detection or lack the fine-grained spatial annotations necessary for evaluating localization methods. To address this gap, we introduce BrokenVideos, a benchmark dataset of 3,254 AI-generated videos with meticulously annotated, pixel-level masks highlighting regions of visual corruption. Each annotation is validated through detailed human inspection to ensure high-quality ground truth. Our experiments show that training state-of-the-art artifact detection models and multimodal large language models (MLLMs) on BrokenVideos significantly improves their ability to localize corrupted regions. Through extensive evaluation, we demonstrate that BrokenVideos establishes a critical foundation for benchmarking and advancing research on artifact localization in generative video models. The dataset is available at: https://broken-video-detection-datetsets.github.io/Broken-Video-Detection-Datasets.github.io/.

[38] From 2D to 3D Cognition: A Brief Survey of General World Models

Ningwei Xie,Zizi Tian,Lei Yang,Xiao-Ping Zhang,Meng Guo,Jie Li

Main category: cs.CV

TL;DR: This paper provides a comprehensive survey on the evolution of world models from 2D perception to 3D cognition, introducing a conceptual framework to analyze key technologies, cognitive capabilities, and applications while identifying future research directions.

Details

Motivation: While 3D-aware generative world models have shown promising progress in simulating interactive environments, there is a lack of systematic analysis to classify these techniques and clarify their role in advancing 3D cognitive models. This survey aims to fill this gap. Method: The authors introduce a conceptual framework to systematically review and categorize emerging techniques in transitioning from 2D perception to 3D cognition. They analyze key technologies, dissect core cognitive abilities, and explore real-world applications and challenges. Result: A structured conceptual framework for analyzing 3D world models was introduced. The study identified two key technological drivers and three core cognitive capabilities essential for 3D world modeling. It also explored application domains and outlined challenges and future research directions. Conclusion: The paper concludes that the advancement of 3D world models depends on two technological drivers: improvements in 3D representations and the integration of world knowledge. It emphasizes the importance of three cognitive capabilities—3D physical scene generation, spatial reasoning, and interaction—and outlines future directions to overcome current challenges. Abstract: World models have garnered increasing attention in the development of artificial general intelligence (AGI), serving as computational frameworks for learning representations of the external world and forecasting future states. While early efforts focused on 2D visual perception and simulation, recent 3D-aware generative world models have demonstrated the ability to synthesize geometrically consistent, interactive 3D environments, marking a shift toward 3D spatial cognition. Despite rapid progress, the field lacks systematic analysis to categorize emerging techniques and clarify their roles in advancing 3D cognitive world models. 
This survey addresses this need by introducing a conceptual framework, providing a structured and forward-looking review of world models transitioning from 2D perception to 3D cognition. Within this framework, we highlight two key technological drivers, particularly advances in 3D representations and the incorporation of world knowledge, as fundamental pillars. Building on these, we dissect three core cognitive capabilities that underpin 3D world modeling: 3D physical scene generation, 3D spatial reasoning, and 3D spatial interaction. We further examine the deployment of these capabilities in real-world applications, including embodied AI, autonomous driving, digital twin, and gaming/VR. Finally, we identify challenges across data, modeling, and deployment, and outline future directions for advancing more robust and generalizable 3D world models.

[39] EAR: Erasing Concepts from Unified Autoregressive Models

Haipeng Fan,Shiyuan Zhang,Baohunesitu,Zihang Guo,Huaiwen Zhang

Main category: cs.CV

TL;DR: This paper proposes EAR, a method for efficiently erasing specific concepts from autoregressive models while preserving their generation ability.

Details

Motivation: Although autoregressive models perform well on image understanding and generation tasks, removing undesired concepts from them while preserving overall generation quality remains a challenge. Method: Propose EAR, which comprises two key strategies, WGA and TLM, and build a new evaluation benchmark, ECGVF, which uses LLMs to generate a large-scale dataset and a visual classifier to filter it. Result: Experiments on the ECGVF benchmark show that EAR markedly improves both erasure effectiveness and model utility preservation compared with baseline methods. Conclusion: With the WGA and TLM strategies, EAR effectively erases target concepts from AR models while preserving generation quality. Abstract: Autoregressive (AR) models have achieved unified and strong performance across both visual understanding and image generation tasks. However, removing undesired concepts from AR models while maintaining overall generation quality remains an open challenge. In this paper, we propose Erasure Autoregressive Model (EAR), a fine-tuning method for effective and utility-preserving concept erasure in AR models. Specifically, we introduce a Windowed Gradient Accumulation (WGA) strategy to align patch-level decoding with erasure objectives, and a Thresholded Loss Masking (TLM) strategy to protect content unrelated to the target concept during fine-tuning. Furthermore, we propose a novel benchmark, Erase Concept Generator and Visual Filter (ECGVF), which aims to provide a more rigorous and comprehensive foundation for evaluating concept erasure in AR models. Specifically, we first employ structured templates across diverse large language models (LLMs) to pre-generate a large-scale corpus of target-replacement concept prompt pairs. Subsequently, we generate images from these prompts and subject them to rigorous filtering via a visual classifier to ensure concept fidelity and alignment. Extensive experimental results conducted on the ECGVF benchmark with the AR model Janus-Pro demonstrate that EAR achieves marked improvements in both erasure effectiveness and model utility preservation. Code is available at: https://github.com/immc-lab/ear/

[40] Loss-Aware Automatic Selection of Structured Pruning Criteria for Deep Neural Network Acceleration

Deepak Ghimire,Kilho Lee,Seong-heum Kim

Main category: cs.CV

TL;DR: This paper proposes LAASP, an automated structured pruning method that integrates pruning and training into a single process, achieving efficient model compression with minimal accuracy loss.

Details

Motivation: Structured pruning is crucial for deploying neural networks on resource-limited edge devices, and existing methods often require manual tuning and multiple training-pruning-fine-tuning stages. Method: LAASP uses a pruning-while-training approach that automatically selects magnitude or similarity-based filter pruning criteria based on network loss and determines optimal layer-wise pruning rates without manual intervention. Result: On CIFAR-10, ResNet56 and ResNet110 achieved significant improvements in top-1 accuracy while reducing FLOPs by 52%. On ImageNet, ResNet50 reduced FLOPs by over 42% with only a 0.33% drop in top-5 accuracy. Conclusion: The proposed LAASP method effectively reduces computational complexity while maintaining or improving model accuracy, as demonstrated on VGGNet and ResNet models across CIFAR-10 and ImageNet datasets. Abstract: Structured pruning is a well-established technique for compressing neural networks, making it suitable for deployment in resource-limited edge devices. This paper presents an efficient Loss-Aware Automatic Selection of Structured Pruning Criteria (LAASP) for slimming and accelerating deep neural networks. The majority of pruning methodologies employ a sequential process consisting of three stages: 1) training, 2) pruning, and 3) fine-tuning, whereas the proposed pruning technique adopts a pruning-while-training approach that eliminates the first stage and integrates the second and third stages into a single cycle. The automatic selection of magnitude or similarity-based filter pruning criteria from a specified pool of criteria and the specific pruning layer at each pruning iteration is guided by the network's overall loss on a small subset of the training data. To mitigate the abrupt accuracy drop due to pruning, the network is retrained briefly after each reduction of a predefined number of floating-point operations (FLOPs). 
The optimal pruning rates for each layer in the network are automatically determined, eliminating the need for manual allocation of fixed or variable pruning rates for each layer. Experiments on the VGGNet and ResNet models on the CIFAR-10 and ImageNet benchmark datasets demonstrate the effectiveness of the proposed method. In particular, the ResNet56 and ResNet110 models on the CIFAR-10 dataset significantly improve the top-1 accuracy compared to state-of-the-art methods while reducing the network FLOPs by 52%. Furthermore, the ResNet50 model on the ImageNet dataset reduces FLOPs by more than 42% with a negligible 0.33% drop in top-5 accuracy. The source code of this paper is publicly available online - https://github.com/ghimiredhikura/laasp.
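The loss-guided choice between magnitude-based and similarity-based criteria described in the abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration: filters are plain lists of weights, the magnitude criterion is the smallest L1 norm, the similarity criterion picks one filter from the most cosine-similar pair, and `loss_after_pruning` is a hypothetical callable standing in for evaluating the network on a small training subset. The actual criteria pool and layer selection in LAASP are defined in the paper and its repository, not here.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Filter = Sequence[float]


def pick_by_magnitude(filters: Sequence[Filter]) -> int:
    """Index of the filter with the smallest L1 norm."""
    return min(range(len(filters)), key=lambda i: sum(abs(w) for w in filters[i]))


def _cosine(a: Filter, b: Filter) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm > 0 else 0.0


def pick_by_similarity(filters: Sequence[Filter]) -> int:
    """Index of the second filter in the most cosine-similar pair."""
    best_pair, best_sim = (0, 1), -2.0
    for i in range(len(filters)):
        for j in range(i + 1, len(filters)):
            sim = _cosine(filters[i], filters[j])
            if sim > best_sim:
                best_pair, best_sim = (i, j), sim
    return best_pair[1]


def select_criterion(
    filters: Sequence[Filter],
    criteria: Dict[str, Callable[[Sequence[Filter]], int]],
    loss_after_pruning: Callable[[int], float],
) -> Tuple[str, int, float]:
    """Try each criterion's proposed filter and keep the lowest-loss choice."""
    if len(filters) < 2:
        raise ValueError("need at least two filters to compare criteria")
    results = []
    for name, pick in criteria.items():
        index = pick(filters)
        results.append((loss_after_pruning(index), name, index))
    loss, name, index = min(results)
    return name, index, loss


if __name__ == "__main__":
    filters = [[1.0, 0.0], [0.9, 0.1], [0.0, 0.05]]
    toy_loss = {1: 0.25, 2: 0.30}  # hypothetical losses after removing a filter
    criteria = {"magnitude": pick_by_magnitude, "similarity": pick_by_similarity}
    print(select_criterion(filters, criteria, toy_loss.__getitem__))
```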

[41] Towards Efficient Exemplar Based Image Editing with Multimodal VLMs

Avadhoot Jadhav,Ashutosh Srivastava,Abhinav Java,Silky Singh,Tarun Ram Menta,Surgan Jandial,Balaji Krishnamurthy

Main category: cs.CV

TL;DR: This paper introduces an exemplar-based image editing approach using pre-trained diffusion models and multimodal VLMs, achieving superior performance and faster processing compared to existing methods.

Details

Motivation: Text-to-image diffusion models have limitations in capturing complex or ambiguous image edits solely through text. Exemplar pairs offer a clearer way to represent such edits, motivating the need for an improved method of image editing. Method: An end-to-end optimization-free pipeline was developed to transfer edits from an exemplar pair (an image before and after editing) to a content image using pre-trained text-to-image diffusion models and multimodal VLMs. Result: The proposed method outperformed baseline approaches on multiple types of image edits while achieving a speedup of approximately 4 times. Conclusion: The study concludes that leveraging pre-trained text-to-image diffusion models and multimodal VLMs for exemplar-based image editing is effective and efficient, surpassing existing baselines. Abstract: Text-to-Image Diffusion models have enabled a wide array of image editing applications. However, capturing all types of edits through text alone can be challenging and cumbersome. The ambiguous nature of certain image edits is better expressed through an exemplar pair, i.e., a pair of images depicting an image before and after an edit respectively. In this work, we tackle exemplar-based image editing -- the task of transferring an edit from an exemplar pair to a content image(s), by leveraging pretrained text-to-image diffusion models and multimodal VLMs. Even though our end-to-end pipeline is optimization-free, our experiments demonstrate that it still outperforms baselines on multiple types of edits while being ~4x faster.

[42] Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models

Zhentao He,Can Zhang,Ziheng Wu,Zhenghao Chen,Yufei Zhan,Yifan Li,Zhao Zhang,Xian Wang,Minghui Qiu

Main category: cs.CV

TL;DR: This paper introduces KIE-HVQA, a benchmark for evaluating OCR hallucination in degraded documents, and a GRPO-based framework that significantly improves hallucination-free accuracy by enabling vision-faithful reasoning under uncertainty.

Details

Motivation: Existing multimodal large language models struggle with visual degradation in real-world scenarios, often leading to hallucinatory content due to overreliance on linguistic priors or misaligned visual-textual reasoning. This work aims to address this limitation by evaluating and improving model robustness under uncertain visual conditions. Method: The authors introduced a GRPO-based framework with a novel reward mechanism that incorporates self-awareness of visual uncertainty and a refusal-to-answer strategy within supervised fine-tuning and reinforcement learning. They also developed the KIE-HVQA benchmark to evaluate OCR hallucination in degraded document understanding. Result: The 7B-parameter GRPO-based model achieved a 22% absolute improvement in hallucination-free accuracy over GPT-4o on the KIE-HVQA benchmark, with no significant performance drop on standard tasks. Conclusion: The proposed GRPO-based framework effectively mitigates hallucinations in ambiguous regions during degraded document understanding, achieving significant improvement over existing models without compromising performance on standard tasks. Abstract: Recent advancements in multimodal large language models have enhanced document understanding by integrating textual and visual information. However, existing models exhibit incompleteness within their paradigm in real-world scenarios, particularly under visual degradation. In such conditions, the current response paradigm often fails to adequately perceive visual degradation and ambiguity, leading to overreliance on linguistic priors or misaligned visual-textual reasoning. This difficulty in recognizing uncertainty frequently results in the generation of hallucinatory content, especially when a precise answer is not feasible. To better demonstrate and analyze this phenomenon and problem, we propose KIE-HVQA, the first benchmark dedicated to evaluating OCR hallucination in degraded document understanding. 
This dataset includes test samples spanning identity cards and invoices, with simulated real-world degradations for OCR reliability. This setup allows for evaluating models' capacity, under degraded input, to distinguish reliable visual information and answer accordingly, thereby highlighting the challenge of avoiding hallucination on uncertain data. To achieve vision-faithful reasoning and thereby avoid the aforementioned issues, we further introduce a GRPO-based framework featuring a novel reward mechanism. By incorporating a self-awareness of visual uncertainty and an analysis method that initiates refusal to answer to increase task difficulty within our supervised fine-tuning and reinforcement learning framework, we successfully mitigated hallucinations in ambiguous regions. Experiments on Qwen2.5-VL demonstrate that our 7B-parameter model achieves a 22% absolute improvement in hallucination-free accuracy over GPT-4o on KIE-HVQA and there is no significant performance drop in standard tasks, highlighting both effectiveness and robustness.

[43] Towards Scalable and Generalizable Earth Observation Data Mining via Foundation Model Composition

Man Duc Chuc

Main category: cs.CV

TL;DR: This paper shows that combining existing pretrained models can deliver strong performance on Earth Observation tasks more efficiently than training large models from scratch.

Details

Motivation: To explore whether existing pretrained foundation models can be combined effectively for Earth Observation tasks instead of training large models from scratch. Method: Evaluated models like Prithvi, Hiera, and DOFA on GEO-Bench benchmark across eleven datasets; used feature-level ensembling and knowledge distillation techniques. Result: Feature-level ensembling matched or exceeded performance of larger models with less computational demand; knowledge distillation showed potential for compact model deployment. Conclusion: The study concludes that combining smaller pretrained models can equal or surpass larger models in Earth Observation tasks, and knowledge distillation can transfer ensemble strengths into compact models. Abstract: Foundation models are rapidly transforming Earth Observation data mining by enabling generalizable and scalable solutions for key tasks such as scene classification and semantic segmentation. While most efforts in the geospatial domain have focused on developing large models trained from scratch using massive Earth Observation datasets, an alternative strategy that remains underexplored is the reuse and combination of existing pretrained models. In this study, we investigate whether foundation models pretrained on remote sensing and general vision datasets can be effectively combined to improve performance across a diverse set of key Earth Observation tasks. Using the GEO-Bench benchmark, we evaluate several prominent models, including Prithvi, Hiera, and DOFA, on eleven datasets covering a range of spatial resolutions, sensor modalities, and task types. The results show that feature-level ensembling of smaller pretrained models can match or exceed the performance of much larger models, while requiring less training time and computational resources. 
Moreover, the study highlights the potential of applying knowledge distillation to transfer the strengths of ensembles into more compact models, offering a practical path for deploying foundation models in real-world Earth Observation applications.
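The feature-level ensembling described in this abstract can be pictured with a minimal sketch. It assumes each frozen encoder has already produced one embedding per image; the function name `ensemble_features`, the per-model L2 normalisation, and the random stand-in embeddings are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def ensemble_features(feature_sets):
    """Concatenate per-model features after L2-normalising each model's block."""
    normalised = []
    for feats in feature_sets:
        feats = np.asarray(feats, dtype=np.float64)
        norms = np.linalg.norm(feats, axis=1, keepdims=True)
        # Guard against division by zero for all-zero feature rows
        normalised.append(feats / np.maximum(norms, 1e-12))
    return np.concatenate(normalised, axis=1)

# Hypothetical embeddings for 4 images from two frozen encoders
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(4, 8))    # stand-in for a remote-sensing encoder
feats_b = rng.normal(size=(4, 16))   # stand-in for a general-vision encoder
fused = ensemble_features([feats_a, feats_b])
print(fused.shape)  # (4, 24)
```

A lightweight task head (for example a linear classifier) would then be trained on `fused` while the encoders stay frozen, which is one plausible reason such an ensemble could need less training compute than a single larger model.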

[44] Progressive Alignment Degradation Learning for Pansharpening

Enzhe Zhao,Zhichang Guo,Yao Li,Fanghui Song,Boying Wu

Main category: cs.CV

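TL;DR: This paper analyzes the limitations of the conventional Wald protocol in deep pansharpening and proposes a new method that combines a Progressive Alignment Degradation Module with a high-frequency diffusion framework, learning a more accurate degradation process without relying on predefined operators and thereby improving image fusion.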

Details

Motivation: The Wald protocol commonly used to generate supervised ground-truth HRMS images has shortcomings: it assumes that networks trained on synthetic low-resolution data will perform equally well on high-resolution data, but in practice models typically exhibit a performance trade-off between data at different resolutions. Method: The authors propose the Progressive Alignment Degradation Module (PADM), which uses mutual iteration between two sub-networks, PAlignNet and PDegradeNet, to adaptively learn an accurate degradation process; they also introduce the HFreqdiff framework, which incorporates CFB and BACM modules to embed high-frequency details and learn the reverse process. Result: Experiments and ablation studies show that the proposed method outperforms existing state-of-the-art methods, with notable gains in spatial sharpness and image quality. Conclusion: The paper presents a new deep pansharpening method that, by improving how the degradation process is learned, significantly improves the spatial sharpness and quality of high-resolution multispectral images. Abstract: Deep learning-based pansharpening has been shown to effectively generate high-resolution multispectral (HRMS) images. To create supervised ground-truth HRMS images, synthetic data generated using the Wald protocol is commonly employed. This protocol assumes that networks trained on artificial low-resolution data will perform equally well on high-resolution data. However, well-trained models typically exhibit a trade-off in performance between reduced-resolution and full-resolution datasets. In this paper, we delve into the Wald protocol and find that its inaccurate approximation of real-world degradation patterns limits the generalization of deep pansharpening models. To address this issue, we propose the Progressive Alignment Degradation Module (PADM), which uses mutual iteration between two sub-networks, PAlignNet and PDegradeNet, to adaptively learn accurate degradation processes without relying on predefined operators. Building on this, we introduce HFreqdiff, which embeds high-frequency details into a diffusion framework and incorporates CFB and BACM modules for frequency-selective detail extraction and precise reverse process learning. These innovations enable effective integration of high-resolution panchromatic and multispectral images, significantly enhancing spatial sharpness and quality. Experiments and ablation studies demonstrate the proposed method's superior performance compared to state-of-the-art techniques.

[45] UniCode$^2$: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation

Yanzhe Chen,Huasong Zhong,Yan Li,Zhenheng Yang

Main category: cs.CV

TL;DR: The paper proposes UniCode$^2$, a cascaded codebook framework for stable and semantically aligned visual tokenization, enabling large-scale multimodal modeling with high token utilization and seamless integration with diffusion models.

Details

Motivation: Existing codebook-based methods face limitations with either small vocabularies lacking fine-grained semantics or unstable training when naively scaled up. This work aims to address these challenges by proposing a scalable and stable visual tokenization approach. Method: The paper introduces a cascaded codebook framework (UniCode$^2$) built by clustering millions of SigLIP sequence embeddings to create a large-scale, semantically aligned codebook. It uses a frozen codebook for stable embedding space and a trainable codebook to refine task-specific semantics. Result: UniCode$^2$ achieves strong performance across diverse benchmarks while ensuring high token utilization, vision-language alignment, and compatibility with pretrained diffusion decoders for high-quality visual synthesis. Conclusion: UniCode$^2$ demonstrates that scaling visual token spaces can be viable without sacrificing stability, semantics, or modularity, enabling high-quality visual synthesis and efficient learning. Abstract: Unified multimodal large language models (MLLMs) have shown promise in jointly advancing multimodal understanding and generation, with visual codebooks discretizing images into tokens for autoregressive modeling. Existing codebook-based methods either rely on small vocabularies (~16K entries) that lack fine-grained semantics or naively scale up, resulting in low token utilization and unstable training. We propose UniCode$^2$, a cascaded codebook framework enabling large-scale, semantically aligned, and stable visual tokenization. By clustering millions of SigLIP sequence embeddings, we build a 500K-entry codebook that preserves vision-language alignment while expanding capacity. Stability is ensured via a cascaded design: a frozen codebook anchors the embedding space, and a trainable codebook refines task-specific semantics. This decoupling promotes high utilization and robust learning. 
Moreover, the alignment of our visual tokens with textual semantics enables seamless integration with pretrained diffusion decoders, supporting high-quality visual synthesis with minimal adaptation. UniCode$^2$ delivers strong performance across diverse benchmarks, demonstrating the viability of scaling visual token spaces without sacrificing stability, semantics, or modularity.
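One way to read the cascaded design is sketched below: a frozen codebook decides which discrete token an embedding maps to, and a trainable codebook supplies the vector actually passed downstream. The abstract does not say that both codebooks share indices, so that pairing, the function name `cascaded_lookup`, and the tiny 2-D codebooks are assumptions made purely for illustration.

```python
import numpy as np

def cascaded_lookup(embeddings, frozen_codebook, trainable_codebook):
    """Assign tokens with the frozen codebook, then read out the trainable entry."""
    # Squared Euclidean distance between every embedding and every frozen code
    diff = embeddings[:, None, :] - frozen_codebook[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    token_ids = d2.argmin(axis=1)  # discrete visual tokens (stable assignment)
    # Assumption: the trainable codebook is indexed by the same token ids
    return token_ids, trainable_codebook[token_ids]

frozen = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])
trainable = frozen.copy()
trainable[1] += 0.1  # pretend fine-tuning nudged one entry
x = np.array([[0.9, 1.2], [3.5, 0.2]])
ids, vecs = cascaded_lookup(x, frozen, trainable)
print(ids)  # [1 2]
```

Because token assignment depends only on the frozen codebook, updating `trainable` never changes which token an image patch receives, which is one way such a decoupling could keep training stable.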

[46] Dynamic Bandwidth Allocation for Hybrid Event-RGB Transmission

Pujing Yang,Guangyi Zhang,Yunlong Cai,Lei Yu,Guanding Yu

Main category: cs.CV

TL;DR: This paper proposes a joint event-image transmission framework that reduces redundancy and optimizes bandwidth usage while achieving high-quality reconstruction and real-time deblurring.

Details

Motivation: Event and RGB cameras generate large volumes of redundant data, leading to challenges in transmission efficiency. The study aims to optimize bandwidth utilization while preserving data quality and enabling real-time processing. Method: The approach uses Bayesian modeling and the information bottleneck method to separate shared and domain-specific information from event and image data, allowing adaptive bandwidth allocation based on scene dynamics. Result: Simulation results show improved reconstruction quality and deblurring performance compared to conventional systems, demonstrating the effectiveness of the proposed transmission scheme. Conclusion: The proposed joint E-I transmission framework effectively eliminates redundancy between event and RGB camera data, optimizing bandwidth use while enabling efficient reconstruction and real-time deblurring. Abstract: Event cameras asynchronously capture pixel-level intensity changes with extremely low latency. They are increasingly used in conjunction with RGB cameras for a wide range of vision-related applications. However, a major challenge in these hybrid systems lies in the transmission of the large volume of triggered events and RGB images. To address this, we propose a transmission scheme that retains efficient reconstruction performance of both sources while accomplishing real-time deblurring in parallel. Conventional RGB cameras and event cameras typically capture the same scene in different ways, often resulting in significant redundant information across their outputs. To address this, we develop a joint event and image (E-I) transmission framework to eliminate redundancy and thereby optimize channel bandwidth utilization. Our approach employs Bayesian modeling and the information bottleneck method to disentangle the shared and domain-specific information within the E-I inputs. 
This disentangled information bottleneck framework ensures both the compactness and informativeness of extracted shared and domain-specific information. Moreover, it adaptively allocates transmission bandwidth based on scene dynamics, i.e., more symbols are allocated to events for dynamic details or to images for static information. Simulation results demonstrate that the proposed scheme not only achieves superior reconstruction quality compared to conventional systems but also delivers enhanced deblurring performance.

Kun Yuan,Tingxuan Chen,Shi Li,Joel L. Lavanchy,Christian Heiliger,Ege Özsoy,Yiming Huang,Long Bai,Nassir Navab,Vinkle Srivastav,Hongliang Ren,Nicolas Padoy

Main category: cs.CV

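TL;DR: This paper proposes Surgical Phase Anywhere (SPA), a lightweight framework for surgical workflow understanding across institutions and procedures that adapts foundation models with a small amount of annotated data and achieves state-of-the-art few-shot performance.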

Details

Motivation: The diversity of operating room settings, institutional protocols, and anatomy makes it hard to develop generalizable surgical workflow models, and the zero-shot performance of existing foundation models is limited by domain shift. Method: SPA uses few-shot spatial adaptation to align multi-modal embeddings with institution-specific scenes, enforces temporal consistency through diffusion modeling, and applies dynamic self-supervised test-time adaptation to improve reliability. Result: Experiments show that, with only 32-shot labeled data, the framework outperforms full-shot models across multiple institutions and procedures. Conclusion: SPA offers a way to rapidly customize phase recognition, addressing domain shift and making the models more practical. Abstract: The complexity and diversity of surgical workflows, driven by heterogeneous operating room settings, institutional protocols, and anatomical variability, present a significant challenge in developing generalizable models for cross-institutional and cross-procedural surgical understanding. While recent surgical foundation models pretrained on large-scale vision-language data offer promising transferability, their zero-shot performance remains constrained by domain shifts, limiting their utility in unseen surgical environments. To address this, we introduce Surgical Phase Anywhere (SPA), a lightweight framework for versatile surgical workflow understanding that adapts foundation models to institutional settings with minimal annotation. SPA leverages few-shot spatial adaptation to align multi-modal embeddings with institution-specific surgical scenes and phases. It also ensures temporal consistency through diffusion modeling, which encodes task-graph priors derived from institutional procedure protocols. Finally, SPA employs dynamic test-time adaptation, exploiting the mutual agreement between multi-modal phase prediction streams to adapt the model to a given test video in a self-supervised manner, enhancing the reliability under test-time distribution shifts. SPA is a lightweight adaptation framework, allowing hospitals to rapidly customize phase recognition models by defining phases in natural language text, annotating a few images with the phase labels, and providing a task graph defining phase transitions. 
The experimental results show that the SPA framework achieves state-of-the-art performance in few-shot surgical phase recognition across multiple institutions and procedures, even outperforming full-shot models with 32-shot labeled data. Code is available at https://github.com/CAMMA-public/SPA

[48] A Transformer Based Handwriting Recognition System Jointly Using Online and Offline Features

Ayush Lodh,Ritabrata Chakraborty,Shivakumara Palaiahnakote,Umapada Pal

Main category: cs.CV

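TL;DR: This paper proposes an end-to-end network that performs early fusion of offline images and online stroke data in a shared latent space, improving handwriting recognition accuracy and achieving state-of-the-art performance.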

Details

Motivation: The authors argue that handwriting recognition benefits from the complementary cues carried by the rasterized complex glyph and the pen's trajectory, yet most systems exploit only one modality. Method: The paper introduces an end-to-end network that performs early fusion of offline images and online stroke data within a shared latent space. A patch encoder converts the grayscale crop into fixed-length visual tokens, while a lightweight transformer embeds the (x, y, pen) sequence. Learnable latent queries attend jointly to both token streams, yielding context-enhanced stroke embeddings that are pooled and decoded under a cross-entropy loss. Result: Comprehensive experiments show that the approach achieves state-of-the-art accuracy on IAMOn-DB and VNOn-DB, exceeding the previous best results by up to 1%. Conclusion: Because integration occurs before any high-level classification, temporal cues reinforce each other during representation learning, producing stronger writer independence. Abstract: We posit that handwriting recognition benefits from complementary cues carried by the rasterized complex glyph and the pen's trajectory, yet most systems exploit only one modality. We introduce an end-to-end network that performs early fusion of offline images and online stroke data within a shared latent space. A patch encoder converts the grayscale crop into fixed-length visual tokens, while a lightweight transformer embeds the $(x, y, \text{pen})$ sequence. Learnable latent queries attend jointly to both token streams, yielding context-enhanced stroke embeddings that are pooled and decoded under a cross-entropy loss objective. Because integration occurs before any high-level classification, temporal cues reinforce each other during representation learning, producing stronger writer independence. Comprehensive experiments on IAMOn-DB and VNOn-DB demonstrate that our approach achieves state-of-the-art accuracy, exceeding previous bests by up to 1%. Our study also shows adaptation of this pipeline with gesturification on the ISI-Air dataset. Our code can be found here.

[49] Hierarchical Mask-Enhanced Dual Reconstruction Network for Few-Shot Fine-Grained Image Classification

Ning Luo,Meiyin Hu,Huan Wan,Yanyan Yang,Zhuohang Jiang,Xin Wei

Main category: cs.CV

TL;DR: The paper proposes HMDRN, a new method for few-shot fine-grained image classification that improves performance by combining dual-layer reconstruction and mask-enhanced processing to better capture and focus on discriminative features.

Details

Motivation: Current methods for few-shot fine-grained image classification have limitations: metric-based methods lose spatial information, while reconstruction-based methods fail to use hierarchical features and lack focus mechanisms on discriminative regions. Method: The Hierarchical Mask-enhanced Dual Reconstruction Network (HMDRN) combines a dual-layer feature reconstruction and fusion module with a spatial binary mask-enhanced transformer self-reconstruction module. This approach balances high-level semantics and mid-level structural details while focusing on discriminative regions and filtering background noise. Result: HMDRN outperforms state-of-the-art methods on three challenging fine-grained datasets using Conv-4 and ResNet-12 backbones. Ablation studies confirm the effectiveness of its components, and visualizations demonstrate superior feature reconstruction. Conclusion: HMDRN is an effective method for few-shot fine-grained image classification that outperforms existing methods by integrating dual-layer feature reconstruction and mask-enhanced feature processing. Abstract: Few-shot fine-grained image classification (FS-FGIC) presents a significant challenge, requiring models to distinguish visually similar subclasses with limited labeled examples. Existing methods have critical limitations: metric-based methods lose spatial information and misalign local features, while reconstruction-based methods fail to utilize hierarchical feature information and lack mechanisms to focus on discriminative regions. We propose the Hierarchical Mask-enhanced Dual Reconstruction Network (HMDRN), which integrates dual-layer feature reconstruction with mask-enhanced feature processing to improve fine-grained classification. HMDRN incorporates a dual-layer feature reconstruction and fusion module that leverages complementary visual information from different network hierarchies. 
Through learnable fusion weights, the model balances high-level semantic representations from the last layer with mid-level structural details from the penultimate layer. Additionally, we design a spatial binary mask-enhanced transformer self-reconstruction module that processes query features through adaptive thresholding while maintaining complete support features, enhancing focus on discriminative regions while filtering background noise. Extensive experiments on three challenging fine-grained datasets demonstrate that HMDRN consistently outperforms state-of-the-art methods across Conv-4 and ResNet-12 backbone architectures. Comprehensive ablation studies validate the effectiveness of each proposed component, revealing that dual-layer reconstruction enhances inter-class discrimination while mask-enhanced transformation reduces intra-class variations. Visualization results provide evidence of HMDRN's superior feature reconstruction capabilities.

[50] Forensic Study of Paintings Through the Comparison of Fabrics

Juan José Murillo-Fuentes,Pablo M. Olmos,Laura Alba-Carcelén

Main category: cs.CV

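TL;DR: This paper presents a new method that uses a Siamese deep learning model to automatically assess canvas similarity, including for canvases from non-contiguous positions on a roll; learned feature representations and aggregated predictions improve robustness and accuracy, and the method is successfully applied to the analysis of artworks.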

Details

Motivation: Traditional methods based on thread density map matching cannot be applied when canvases do not come from adjacent positions on a roll, so a new approach is needed. Method: A Siamese deep learning model is designed and trained to compare pairs of images, and a method is proposed that aggregates predictions from multiple sample pairs into a similarity estimate. Result: The method was applied to canvases from the Museo Nacional del Prado in Madrid, showing that plain weave canvases can be compared effectively even when their thread densities are similar. Conclusion: The paper proposes a new deep learning-based method for assessing canvas similarity, opening new avenues for the analysis of artworks. Abstract: The study of canvas fabrics in works of art is a crucial tool for authentication, attribution and conservation. Traditional methods are based on thread density map matching, which cannot be applied when canvases do not come from contiguous positions on a roll. This paper presents a novel approach based on deep learning to assess the similarity of textiles. We introduce an automatic tool that evaluates the similarity between canvases without relying on thread density maps. A Siamese deep learning model is designed and trained to compare pairs of images by exploiting the feature representations learned from the scans. In addition, a similarity estimation method is proposed, aggregating predictions from multiple pairs of cloth samples to provide a robust similarity score. Our approach is applied to canvases from the Museo Nacional del Prado, corroborating the hypothesis that plain weave canvases, widely used in painting, can be effectively compared even when their thread densities are similar. The results demonstrate the feasibility and accuracy of the proposed method, opening new avenues for the analysis of masterpieces.

[51] From Ideal to Real: Unified and Data-Efficient Dense Prediction for Real-World Scenarios

Changliang Xia,Chengyou Jia,Zhuohang Dang,Minnan Luo

Main category: cs.CV

TL;DR: This paper proposes the DenseWorld benchmark and the DenseDiT method for dense prediction tasks in real-world scenarios where data is scarce.

Details

Motivation: Existing dense prediction methods focus mainly on idealized conditions, generalize poorly to real-world scenarios, and face limited real-world data, so a unified strategy that performs well in real scenes is needed. Method: The authors introduce DenseWorld, a benchmark covering 25 tasks, and propose DenseDiT, which combines a parameter-reuse mechanism with two lightweight branches to integrate multi-scale context using less than 0.1% additional parameters. Result: Evaluations on DenseWorld show significant performance drops for existing methods, whereas DenseDiT achieves better results using less than 0.01% of the baselines' training data. Conclusion: DenseDiT demonstrates practical value for real-world dense prediction tasks, with good generalization and data efficiency. Abstract: Dense prediction tasks hold significant importance in computer vision, aiming to learn pixel-wise annotated labels for an input image. Despite advances in this field, existing methods primarily focus on idealized conditions, with limited generalization to real-world scenarios and facing the challenging scarcity of real-world data. To systematically study this problem, we first introduce DenseWorld, a benchmark spanning a broad set of 25 dense prediction tasks that correspond to urgent real-world applications, featuring unified evaluation across tasks. Then, we propose DenseDiT, which maximally exploits generative models' visual priors to perform diverse real-world dense prediction tasks through a unified strategy. DenseDiT combines a parameter-reuse mechanism and two lightweight branches that adaptively integrate multi-scale context, working with less than 0.1% additional parameters. Evaluations on DenseWorld reveal significant performance drops in existing general and specialized baselines, highlighting their limited real-world generalization. In contrast, DenseDiT achieves superior results using less than 0.01% of the baselines' training data, underscoring its practical value for real-world deployment. Our data, checkpoints, and code are available at https://xcltql666.github.io/DenseDiTProj

[52] Breaking Spatial Boundaries: Spectral-Domain Registration Guided Hyperspectral and Multispectral Blind Fusion

Kunjing Yang,Libin Zheng,Minru Bai,Ting Lu,Leyuan Fang

Main category: cs.CV

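TL;DR: This paper proposes a new blind fusion method that tackles registration from the spectral domain, enabling efficient fusion of hyperspectral and multispectral images.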

Details

Motivation: Existing registration methods based on spatial transformations perform poorly because of the large gap in spatial resolution between the images and are computationally slow, so a more effective registration approach is needed. Method: A lightweight Spectral Prior Learning (SPL) network first enhances the spectral resolution of the MSI, and the resulting image is spatially downsampled to produce the registered HSI; a blind sparse fusion (BSF) method based on group sparsity regularization is then proposed and solved with a Proximal Alternating Optimization (PAO) algorithm. Result: Experiments on simulated and real data show that the method improves fusion quality and classification performance while reducing computational complexity. Conclusion: The method successfully addresses registration and fusion of hyperspectral and multispectral images from the spectral domain, with high accuracy and practicality. Abstract: The blind fusion of unregistered hyperspectral images (HSIs) and multispectral images (MSIs) has attracted growing attention recently. To address the registration challenge, most existing methods employ spatial transformations on the HSI to achieve alignment with the MSI. However, due to the substantial differences in spatial resolution of the images, the performance of these methods is often unsatisfactory. Moreover, the registration process tends to be time-consuming when dealing with large-sized images in remote sensing. To address these issues, we propose tackling the registration problem from the spectral domain. Initially, a lightweight Spectral Prior Learning (SPL) network is developed to extract spectral features from the HSI and enhance the spectral resolution of the MSI. Following this, the obtained image undergoes spatial downsampling to produce the registered HSI. In this process, subspace representation and a cyclic training strategy are employed to improve the spectral accuracy of the registered HSI. Next, we propose a blind sparse fusion (BSF) method, which utilizes group sparsity regularization to equivalently promote the low-rankness of the image. This approach not only circumvents the need for rank estimation, but also reduces computational complexity. Then, we employ the Proximal Alternating Optimization (PAO) algorithm to solve the BSF model, and present its convergence analysis. Finally, extensive numerical experiments on simulated and real datasets are conducted to verify the effectiveness of our method in registration and fusion. We also demonstrate its efficacy in enhancing classification performance.
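The abstract does not spell out the group sparsity regularizer, so the following is only a sketch of the kind of subproblem a proximal alternating scheme typically solves: the standard row-wise group soft-thresholding operator, i.e. the proximal map of the $\ell_{2,1}$ norm. The function name `prox_group_l21` and the choice of rows as groups are assumptions.

```python
import numpy as np

def prox_group_l21(X, lam):
    """Row-wise group soft-thresholding: prox of lam * sum_i ||X[i, :]||_2."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Shrink each row towards zero; rows with norm <= lam are removed entirely
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return scale * X

X = np.array([[3.0, 4.0],    # norm 5.0 -> shrunk by a factor of 0.8
              [0.3, 0.4],    # norm 0.5 -> set to zero
              [0.0, 0.0]])   # already zero
Y = prox_group_l21(X, lam=1.0)
print(Y)
```

Rows that are zeroed out no longer contribute to the span of the matrix, which gives some intuition for how a group sparsity penalty can encourage low-rank structure without an explicit rank estimate.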

[53] Ctrl-Z Sampling: Diffusion Sampling with Controlled Random Zigzag Explorations

Shunqi Mao,Wei Guo,Chaoyi Zhang,Weidong Cai

Main category: cs.CV

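TL;DR: This paper proposes Controlled Random Zigzag Sampling, a new sampling strategy that detects and escapes local maxima during conditional generation with diffusion models, improving generation quality and alignment.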

Details

Motivation: In conditional generation, diffusion models often converge to local optima, producing outputs that are globally inconsistent or misaligned with the condition. Method: A reward model detects local maxima; noise is injected and the sample is reverted to an earlier state to escape the current optimization plateau; candidate trajectories are then evaluated, with progressively deeper retreat when nearby alternatives fail. Result: Experiments show that Ctrl-Z Sampling substantially improves generation quality with only around a 7.6X increase in function evaluations. Conclusion: Controlled Random Zigzag Sampling is a model-agnostic method that dynamically alternates forward refinement and backward exploration to improve the alignment and visual quality of generated outputs. Abstract: Diffusion models have shown strong performance in conditional generation by progressively denoising Gaussian noise toward a target data distribution. This denoising process can be interpreted as a form of hill climbing in a learned latent space, where the model iteratively refines the sample toward regions of higher probability. However, diffusion models often converge to local optima that are locally visually coherent yet globally inconsistent or conditionally misaligned, due to latent space complexity and suboptimal initialization. Prior efforts attempted to address this by strengthening guidance signals or manipulating the initial noise distribution. We introduce Controlled Random Zigzag Sampling (Ctrl-Z Sampling), a novel sampling strategy designed to detect and escape such local maxima during conditional generation. The method first identifies potential local maxima using a reward model. Upon detection, it injects noise and reverts to a previous, noisier state to escape the current optimization plateau. The reward model then evaluates candidate trajectories, accepting only those that offer improvement, while progressively deeper retreat enables stronger escapes when nearby alternatives fail. This controlled random zigzag process allows dynamic alternation between forward refinement and backward exploration, enhancing both alignment and visual quality in the generated outputs. The proposed Ctrl-Z Sampling is model-agnostic and compatible with existing diffusion frameworks. Experimental results show that Ctrl-Z Sampling substantially improves generation quality with only around 7.6X increase in function evaluations.
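The detect, retreat, and accept-if-better loop can be illustrated on a one-dimensional toy problem. Everything here is a stand-in: `reward` is a hand-written function with two peaks rather than a learned reward model, `denoise_step` is a plain hill-climbing move rather than a diffusion step, and "reverting to a noisier state" is approximated by adding Gaussian noise whose scale grows with the retreat depth. None of these choices come from the paper.

```python
import numpy as np

def reward(x):
    # Toy reward: a local maximum near x = -1 and a higher one near x = 2
    return np.exp(-(x + 1.0) ** 2) + 2.0 * np.exp(-(x - 2.0) ** 2)

def denoise_step(x, lr=0.2, eps=1e-4):
    # Stand-in for one denoising step: a small hill-climbing move on the reward
    grad = (reward(x + eps) - reward(x - eps)) / (2 * eps)
    return x + lr * grad

def refine(x, steps):
    for _ in range(steps):
        x = denoise_step(x)
    return x

def ctrl_z_sample(x0, total_steps=30, max_depth=4, tol=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(total_steps):
        nxt = denoise_step(x)
        if reward(nxt) - reward(x) < tol:  # plateau: possible local maximum
            for depth in range(1, max_depth + 1):
                # Retreat to a noisier state; deeper retreats inject more noise
                noisy = nxt + rng.normal(scale=depth)
                candidate = refine(noisy, steps=depth * 5)
                if reward(candidate) > reward(nxt):  # accept only improvements
                    nxt = candidate
                    break
        x = nxt
    return x

baseline = refine(-1.5, steps=30)   # plain hill climbing from the same start
explored = ctrl_z_sample(-1.5)
print(reward(baseline), reward(explored))
```

Plain refinement from `-1.5` settles on the lower peak, while the zigzag variant may escape to the higher one depending on the noise draws. The extra `refine` calls made during each retreat are the toy analogue of the additional function evaluations the abstract reports.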

[54] TDiR: Transformer based Diffusion for Image Restoration Tasks

Abbas Anwar,Mohammad Shullar,Ali Arshad Nasir,Mudassir Masood,Saeed Anwar

Main category: cs.CV

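TL;DR: This paper introduces a transformer-based diffusion model for improving images degraded by noise, color cast, blur, and light scattering, and shows that it outperforms existing deep learning methods on multiple quality metrics.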

Details

Motivation: Images captured in challenging environments often suffer various forms of degradation, including noise, color cast, blur, and light scattering. These effects significantly reduce image quality and hinder use in downstream tasks, so the quality of degraded images needs to be improved. Method: A transformer-based diffusion model is developed for image restoration and evaluated on publicly available datasets for underwater image enhancement, denoising, and deraining. Result: The diffusion model combined with transformers outperforms current methods, demonstrating the effectiveness of diffusion models and transformers for improving degraded images. Conclusion: Combining diffusion models with transformers performs well on image restoration tasks, improving the quality of degraded images and extending their usefulness in downstream tasks that require high-fidelity visual data. Abstract: Images captured in challenging environments often experience various forms of degradation, including noise, color cast, blur, and light scattering. These effects significantly reduce image quality, hindering their applicability in downstream tasks such as object detection, mapping, and classification. Our transformer-based diffusion model was developed to address image restoration tasks, aiming to improve the quality of degraded images. This model was evaluated against existing deep learning methodologies across multiple quality metrics for underwater image enhancement, denoising, and deraining on publicly available datasets. Our findings demonstrate that the diffusion model, combined with transformers, surpasses current methods in performance. The results of our model highlight the efficacy of diffusion models and transformers in improving the quality of degraded images, consequently expanding their utility in downstream tasks that require high-fidelity visual data.

[55] Radiomic fingerprints for knee MR images assessment

Yaxi Chen,Simin Ni,Shaheer U. Saeed,Aleksandra Ivanova,Rikin Hargunani,Jie Huang,Chaozong Liu,Yipeng Hu

Main category: cs.CV

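TL;DR: This paper introduces a novel radiomic fingerprint framework in which a deep learning model dynamically selects the most predictive radiomic features for each patient, improving diagnostic accuracy and providing clinical insight.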

Details

Motivation: Conventional radiomic approaches use a fixed set of radiomic features that cannot represent individual pathological variations, which limits their performance. Method: A deep learning model predicts feature relevance within a large radiomic feature pool and selects predictive features separately for each patient. The radiomic-selecting model is trained jointly with a low-dimensional logistic regression for downstream classification. Result: Across multiple diagnostic tasks, the method achieves diagnostic accuracy comparable or superior to state-of-the-art DL models, and its interpretability supports clinical insight and potential biomarker discovery. Conclusion: The paper proposes a new radiomic fingerprint framework that dynamically constructs a radiomic feature set for each patient, with a DL model selecting the features that are predictive of the individual patient's clinical condition. Abstract: Accurate interpretation of knee MRI scans relies on expert clinical judgment, often with high variability and limited scalability. Existing radiomic approaches use a fixed set of radiomic features (the signature), selected at the population level and applied uniformly to all patients. While interpretable, these signatures are often too constrained to represent individual pathological variations. As a result, conventional radiomic-based approaches are found to be limited in performance, compared with recent end-to-end deep learning (DL) alternatives without using interpretable radiomic features. We argue that the individual-agnostic nature in current radiomic selection is not central to its interpretability, but is responsible for the poor generalization in our application. Here, we propose a novel radiomic fingerprint framework, in which a radiomic feature set (the fingerprint) is dynamically constructed for each patient, selected by a DL model. Unlike the existing radiomic signatures, our fingerprints are derived on a per-patient basis by predicting the feature relevance in a large radiomic feature pool, and selecting only those that are predictive of clinical conditions for individual patients. The radiomic-selecting model is trained simultaneously with a low-dimensional (considered relatively explainable) logistic regression for downstream classification. We validate our methods across multiple diagnostic tasks including general knee abnormalities, anterior cruciate ligament (ACL) tears, and meniscus tears, demonstrating comparable or superior diagnostic accuracy relative to state-of-the-art end-to-end DL models. 
More importantly, we show that the interpretability inherent in our approach facilitates meaningful clinical insights and potential biomarker discovery, with detailed discussion, quantitative and qualitative analysis of real-world clinical cases to evidence these advantages.

[56] On the Burstiness of Faces in Set

Jiong Wang

Main category: cs.CV

TL;DR: This paper studies burstiness in set-based face recognition and proposes a series of methods to suppress it, improving recognition performance.

Details

Motivation: Burstiness is widespread in set-based face recognition and degrades recognition performance in two respects: the training stage and the evaluation stage. Method: Three strategies are proposed for detecting burstiness in face sets, and they are applied in the training and evaluation stages to raise the sampling ratios or contributions of infrequent faces. A quality-aware GMP method is also proposed. Result: Extensive experiments show that burstiness is widespread and that suppressing it considerably improves recognition. Conclusion: Suppressing burstiness can considerably improve set-based face recognition performance. Abstract: Burstiness, a phenomenon observed in text and image retrieval, refers to particular elements appearing more often in a set than a statistically independent model assumes. We argue that in the context of set-based face recognition (SFR), burstiness exists widely and degrades performance in two respects. First, bursty faces, i.e. faces with particular attributes that appear frequently in a face set, dominate the training instances and the training face sets, leading to poor generalization to unconstrained scenarios. Second, bursty faces that dominate the evaluation sets interfere with the similarity comparison in set verification and identification during evaluation. To detect the bursty faces in a set, we propose three strategies based on Quickshift++, feature self-similarity, and generalized max-pooling (GMP). We apply the burst detection results in the training and evaluation stages to enhance the sampling ratios or contributions of the infrequent faces. During evaluation, we additionally propose a quality-aware GMP that adds awareness of face quality and robustness to low-quality faces to the original GMP. We give illustrations and extensive experiments on the SFR benchmarks to demonstrate that burstiness is widespread and that suppressing burstiness considerably improves recognition performance.

[57] From Codicology to Code: A Comparative Study of Transformer and YOLO-based Detectors for Layout Analysis in Historical Documents

Sergio Torres Aguilar

Main category: cs.CV

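TL;DR: This study compares five state-of-the-art object detection architectures on complex historical document layouts, finding that Oriented Bounding Boxes (OBB) are essential for accuracy and that CNN-based models perform better on complex documents.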

Details

Motivation: Robust Document Layout Analysis (DLA) is critical for the automated processing and understanding of historical documents with complex page organizations. Method: The paper evaluates two Transformer-based models (Co-DETR, Grounding DINO) against three YOLO variants (AABB, OBB, YOLO-World) on three annotated datasets that represent different levels of codicological complexity. Result: On the e-NDP dataset, Co-DETR achieves state-of-the-art results (0.752 mAP@.50:.95), followed by YOLOv11x-OBB (0.721). On the more complex CATMuS and HORAE datasets, the CNN-based YOLOv11x-OBB significantly outperforms all other models (0.564 and 0.568, respectively). Conclusion: Transformer models suit structured layouts, but CNN-OBB models generalize better to visually diverse and complex documents. Using Oriented Bounding Boxes (OBB) is essential for accurately modeling the non-Cartesian nature of historical manuscripts; it is a fundamental requirement, not a minor refinement. Abstract: Robust Document Layout Analysis (DLA) is critical for the automated processing and understanding of historical documents with complex page organizations. This paper benchmarks five state-of-the-art object detection architectures on three annotated datasets representing a spectrum of codicological complexity: e-NDP, a corpus of Parisian medieval registers (1326-1504); CATMuS, a diverse multiclass dataset derived from various medieval and modern sources (ca. 12th-17th centuries); and HORAE, a corpus of decorated books of hours (ca. 13th-16th centuries). We evaluate two Transformer-based models (Co-DETR, Grounding DINO) against three YOLO variants (AABB, OBB, and YOLO-World). Our findings reveal significant performance variations dependent on model architecture, dataset characteristics, and bounding box representation. On the e-NDP dataset, Co-DETR achieves state-of-the-art results (0.752 mAP@.50:.95), closely followed by YOLOv11x-OBB (0.721). Conversely, on the more complex CATMuS and HORAE datasets, the CNN-based YOLOv11x-OBB significantly outperforms all other models (0.564 and 0.568, respectively). This study unequivocally demonstrates that using Oriented Bounding Boxes (OBB) is not a minor refinement but a fundamental requirement for accurately modeling the non-Cartesian nature of historical manuscripts. We conclude that a key trade-off exists between the global context awareness of Transformers, ideal for structured layouts, and the superior generalization of CNN-OBB models for visually diverse and complex documents.
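A small geometric sketch may help show why box orientation matters for slanted manuscript lines. It compares the area of an oriented box with the area of the smallest axis-aligned box that encloses it. The helper names and the 200 x 20 px, 10-degree text line are hypothetical and do not come from the benchmarked datasets.

```python
import numpy as np

def obb_corners(cx, cy, w, h, angle_deg):
    """Corners of a w x h box centred at (cx, cy), rotated by angle_deg."""
    t = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(t), -np.sin(t)],
                    [np.sin(t),  np.cos(t)]])
    half = np.array([[-w, -h], [w, -h], [w, h], [-w, h]]) / 2.0
    return half @ rot.T + np.array([cx, cy])

def enclosing_aabb_area(corners):
    """Area of the smallest axis-aligned box that contains all corners."""
    span = corners.max(axis=0) - corners.min(axis=0)
    return float(span[0] * span[1])

# Hypothetical slanted text line: 200 x 20 px, rotated by 10 degrees
corners = obb_corners(0.0, 0.0, 200.0, 20.0, 10.0)
obb_area = 200.0 * 20.0
aabb_area = enclosing_aabb_area(corners)
print(aabb_area / obb_area)
```

For this example the axis-aligned box covers more than twice the area of the oriented one, so most of its pixels belong to neighbouring lines or margins. That kind of overlap is one plausible reason axis-aligned annotations struggle on skewed or curved layouts.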

[58] Feature Hallucination for Self-supervised Action Recognition

Lei Wang,Piotr Koniusz

Main category: cs.CV

TL;DR: This paper proposes a deep translational action recognition framework that improves accuracy by using semantic reasoning and multimodal feature integration, achieving top performance on major benchmarks.

Details

Motivation: Traditional video action recognition relies heavily on raw pixel analysis, which is insufficient for complex human actions. The study aims to enhance recognition accuracy by incorporating high-level semantic reasoning and integrating multimodal features effectively. Method: The method involves a hallucination stream that infers missing cues without increasing computational overhead. Two novel domain-specific descriptors—Object Detection Features (ODF) and Saliency Detection Features (SDF)—are introduced to capture contextual and spatial-intensity patterns. These are integrated with auxiliary modalities like optical flow, skeleton data, and audio cues. Aleatoric uncertainty modeling and a robust loss function are used to handle feature noise and uncertainty. Result: The framework achieves state-of-the-art results on Kinetics-400, Kinetics-600, and Something-Something V2 benchmarks, demonstrating its ability to improve action recognition through semantic feature enhancement and robust modeling. Conclusion: The proposed deep translational action recognition framework achieves state-of-the-art performance on multiple benchmarks, showing its effectiveness in capturing fine-grained action dynamics through high-level semantic reasoning and multimodal feature integration. Abstract: Understanding human actions in videos requires more than raw pixel analysis; it relies on high-level semantic reasoning and effective integration of multimodal features. We propose a deep translational action recognition framework that enhances recognition accuracy by jointly predicting action concepts and auxiliary features from RGB video frames. At test time, hallucination streams infer missing cues, enriching feature representations without increasing computational overhead. To focus on action-relevant regions beyond raw pixels, we introduce two novel domain-specific descriptors. 
Object Detection Features (ODF) aggregate outputs from multiple object detectors to capture contextual cues, while Saliency Detection Features (SDF) highlight spatial and intensity patterns crucial for action recognition. Our framework seamlessly integrates these descriptors with auxiliary modalities such as optical flow, Improved Dense Trajectories, skeleton data, and audio cues. It remains compatible with state-of-the-art architectures, including I3D, AssembleNet, Video Transformer Network, FASTER, and recent models like VideoMAE V2 and InternVideo2. To handle uncertainty in auxiliary features, we incorporate aleatoric uncertainty modeling in the hallucination step and introduce a robust loss function to mitigate feature noise. Our multimodal self-supervised action recognition framework achieves state-of-the-art performance on multiple benchmarks, including Kinetics-400, Kinetics-600, and Something-Something V2, demonstrating its effectiveness in capturing fine-grained action dynamics.

[59] InvZW: Invariant Feature Learning via Noise-Adversarial Training for Robust Image Zero-Watermarking

Abdullah All Tanvir,Xin Zhong

Main category: cs.CV

TL;DR: This paper proposes a deep learning-based image zero-watermarking method that offers superior robustness against distortions by learning invariant features and optimizing reference codes for message matching.

Details

Motivation: The motivation is to develop a robust image watermarking technique that leaves the original image unaltered while ensuring high resistance to distortions. Method: The method involves a two-module framework: training a feature extractor via noise-adversarial learning and designing a multibit zero-watermarking scheme using reference codes optimized to match binary messages. Result: The experiments show that the proposed method achieves state-of-the-art robustness against various distortions and performs better than self-supervised and deep watermarking techniques. Conclusion: The paper concludes that the proposed deep learning framework for robust image zero-watermarking outperforms existing methods in terms of generalization and robustness. Abstract: This paper introduces a novel deep learning framework for robust image zero-watermarking based on distortion-invariant feature learning. As a zero-watermarking scheme, our method leaves the original image unaltered and learns a reference signature through optimization in the feature space. The proposed framework consists of two key modules. In the first module, a feature extractor is trained via noise-adversarial learning to generate representations that are both invariant to distortions and semantically expressive. This is achieved by combining adversarial supervision against a distortion discriminator and a reconstruction constraint to retain image content. In the second module, we design a learning-based multibit zero-watermarking scheme where the trained invariant features are projected onto a set of trainable reference codes optimized to match a target binary message. Extensive experiments on diverse image datasets and a wide range of distortions show that our method achieves state-of-the-art robustness in both feature stability and watermark recovery. 
Comparative evaluations against existing self-supervised and deep watermarking techniques further highlight the superiority of our framework in generalization and robustness.

[60] Exploiting Lightweight Hierarchical ViT and Dynamic Framework for Efficient Visual Tracking

Ben Kang,Xin Chen,Jie Zhao,Chunjuan Bo,Dong Wang,Huchuan Lu

Main category: cs.CV

TL;DR: HiT and DyHiT introduce efficient tracking methods using lightweight transformers and dynamic routing, achieving fast speeds and competitive performance on benchmarks.

Details

Motivation: Transformer-based trackers are powerful but slow, limiting their use on resource-constrained devices. Efficient and fast visual tracking is needed. Method: HiT introduces a Bridge Module and dual-image position encoding to enhance feature representation. DyHiT uses dynamic routing based on scene complexity for optimized performance. Result: HiT achieves 61 fps with 64.6% AUC on LaSOT. DyHiT reaches 111 fps while maintaining 62.4% AUC. SeqTrack-B256 sees a 2.68x speed boost with no loss in accuracy. Conclusion: HiT and DyHiT offer efficient, high-performance tracking models with dynamic route selection for varying scene complexities, significantly improving speed without sacrificing accuracy. Abstract: Transformer-based visual trackers have demonstrated significant advancements due to their powerful modeling capabilities. However, their practicality is limited on resource-constrained devices because of their slow processing speeds. To address this challenge, we present HiT, a novel family of efficient tracking models that achieve high performance while maintaining fast operation across various devices. The core innovation of HiT lies in its Bridge Module, which connects lightweight transformers to the tracking framework, enhancing feature representation quality. Additionally, we introduce a dual-image position encoding approach to effectively encode spatial information. HiT achieves an impressive speed of 61 frames per second (fps) on the NVIDIA Jetson AGX platform, alongside a competitive AUC of 64.6% on the LaSOT benchmark, outperforming all previous efficient trackers. Building on HiT, we propose DyHiT, an efficient dynamic tracker that flexibly adapts to scene complexity by selecting routes with varying computational requirements. DyHiT uses search area features extracted by the backbone network and inputs them into an efficient dynamic router to classify tracking scenarios. Based on the classification, DyHiT applies a divide-and-conquer strategy, selecting appropriate routes to achieve a superior trade-off between accuracy and speed. The fastest version of DyHiT achieves 111 fps on NVIDIA Jetson AGX while maintaining an AUC of 62.4% on LaSOT. Furthermore, we introduce a training-free acceleration method based on the dynamic routing architecture of DyHiT. This method significantly improves the execution speed of various high-performance trackers without sacrificing accuracy. For instance, our acceleration method enables the state-of-the-art tracker SeqTrack-B256 to achieve a 2.68 times speedup on an NVIDIA GeForce RTX 2080 Ti GPU while maintaining the same AUC of 69.9% on the LaSOT.
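
Editor's sketch (not from the paper): the routing idea in the abstract, where a router classifies each frame's scenario and a cheaper or costlier route is then selected, can be illustrated with plain callables. Everything here is an assumption made for illustration: the names `route_and_track` and `spread_router`, the two-route setup, the threshold, and the toy difficulty score. The abstract does not specify DyHiT's actual router or routes.

```python
from typing import Callable, Sequence, Tuple

Box = Tuple[float, float, float, float]

def spread_router(features: Sequence[float]) -> float:
    """Toy difficulty score (hypothetical): spread of the feature values."""
    return max(features) - min(features)

def route_and_track(
    features: Sequence[float],
    router: Callable[[Sequence[float]], float],
    cheap_head: Callable[[Sequence[float]], Box],
    full_head: Callable[[Sequence[float]], Box],
    threshold: float = 0.5,
) -> Tuple[str, Box]:
    """Run the cheap head on frames the router scores as easy, else the full head."""
    difficulty = router(features)
    if difficulty < threshold:
        return "cheap", cheap_head(features)
    return "full", full_head(features)

if __name__ == "__main__":
    cheap = lambda f: (0.0, 0.0, 1.0, 1.0)  # stand-in for a low-cost prediction head
    full = lambda f: (0.0, 0.0, 2.0, 2.0)   # stand-in for the full-cost route
    print(route_and_track([0.2, 0.3, 0.25], spread_router, cheap, full))
    print(route_and_track([0.0, 0.9, 0.4], spread_router, cheap, full))
```

The speed gain in such a scheme comes from how often the cheap route is taken, so the threshold acts as the accuracy/speed knob.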

[61] A Novel Large Vision Foundation Model (LVFM)-based Approach for Generating High-Resolution Canopy Height Maps in Plantations for Precision Forestry Management

Shen Tan,Xin Zhang,Liangxiu Han,Huaguo Huang,Han Wang

Main category: cs.CV

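TL;DR: This study uses a Large Vision Foundation Model (LVFM) to build an efficient method for generating high-resolution canopy height maps, substantially improving tree height estimation accuracy and carbon sink assessment.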

Details

Motivation: Standard lidar-based methods are expensive, and existing deep learning methods based on RGB imagery struggle to extract canopy height features accurately, so a more accurate and cost-effective way to monitor plantation aboveground biomass (AGB) is needed. Method: The authors develop a new model that combines a feature extractor, a self-supervised feature enhancement module, and a height estimator, and test it on 1-meter-resolution Google Earth imagery. Result: In tests in Beijing's Fangshan District, the model achieved a mean absolute error of 0.09 m, a root mean square error of 0.24 m, and a correlation of 0.78 against lidar-based CHMs. It also enabled over 90% success in individual tree detection and high-accuracy AGB estimation. Conclusion: The study presents a new LVFM-based method for generating high-resolution canopy height maps (CHMs) that outperforms existing methods in accuracy and generalization, offering a promising tool for estimating carbon sequestration. Abstract: Accurate, cost-effective monitoring of plantation aboveground biomass (AGB) is crucial for supporting local livelihoods and carbon sequestration initiatives like the China Certified Emission Reduction (CCER) program. High-resolution canopy height maps (CHMs) are essential for this, but standard lidar-based methods are expensive. While deep learning with RGB imagery offers an alternative, accurately extracting canopy height features remains challenging. To address this, we developed a novel model for high-resolution CHM generation using a Large Vision Foundation Model (LVFM). Our model integrates a feature extractor, a self-supervised feature enhancement module to preserve spatial details, and a height estimator. Tested in Beijing's Fangshan District using 1-meter Google Earth imagery, our model outperformed existing methods, including conventional CNNs. It achieved a mean absolute error of 0.09 m, a root mean square error of 0.24 m, and a correlation of 0.78 against lidar-based CHMs. The resulting CHMs enabled over 90% success in individual tree detection, high accuracy in AGB estimation, and effective tracking of plantation growth, demonstrating strong generalization to non-training areas. This approach presents a promising, scalable tool for evaluating carbon sequestration in both plantations and natural forests.

[62] Med-Art: Diffusion Transformer for 2D Medical Text-to-Image Generation

Changlu Guo,Anders Nymark Christensen,Morten Rieger Hannemose

Main category: cs.CV

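TL;DR: Med-Art is a framework for medical image generation that, under limited data, combines vision-language models with an improved diffusion fine-tuning method and achieves strong performance.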

Details

Motivation: Medical image generation is hampered by small datasets and scarce medical text data, which calls for a framework designed specifically for these constraints. Method: Med-Art uses vision-language models to generate visual descriptions of medical images and adapts PixArt-α, a large-scale pre-trained text-to-image model based on the Diffusion Transformer (DiT). It also proposes Hybrid-Level Diffusion Fine-tuning (HLDF), which introduces pixel-level losses to address issues such as overly saturated colors. Result: Med-Art overcomes the scarcity of medical text data, resolves color oversaturation in generated images through HLDF, and achieves high performance with limited data. Conclusion: Med-Art achieves state-of-the-art performance on two medical image datasets, as measured by FID, KID, and downstream classification performance. Abstract: Text-to-image generative models have achieved remarkable breakthroughs in recent years. However, their application in medical image generation still faces significant challenges, including small dataset sizes, and scarcity of medical textual data. To address these challenges, we propose Med-Art, a framework specifically designed for medical image generation with limited data. Med-Art leverages vision-language models to generate visual descriptions of medical images which overcomes the scarcity of applicable medical textual data. Med-Art adapts a large-scale pre-trained text-to-image model, PixArt-$\alpha$, based on the Diffusion Transformer (DiT), achieving high performance under limited data. Furthermore, we propose an innovative Hybrid-Level Diffusion Fine-tuning (HLDF) method, which enables pixel-level losses, effectively addressing issues such as overly saturated colors. We achieve state-of-the-art performance on two medical image datasets, measured by FID, KID, and downstream classification performance.

[63] HiWave: Training-Free High-Resolution Image Generation via Wavelet-Based Diffusion Sampling

Tobias Vontobel,Seyedmorteza Sadat,Farnood Salehi,Romann M. Weber

Main category: cs.CV

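TL;DR: This paper proposes HiWave, a new method for ultra-high-resolution image synthesis that uses pretrained diffusion models, addresses the artifacts of current methods, and demonstrates superior performance.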

Details

Motivation: Training diffusion models at high resolutions is computationally expensive, and existing zero-shot techniques for generating beyond the training resolution often produce artifacts such as object duplication and spatial incoherence. Method: The method uses a two-stage pipeline: it first generates a base image from the pretrained model, then applies a patch-wise DDIM inversion step and a new wavelet-based detail enhancer module. Result: Extensive evaluations show that HiWave effectively mitigates the visual artifacts common in prior methods and achieves better perceptual quality, and a user study confirms that it outperforms existing alternatives. Conclusion: HiWave is a training-free, zero-shot approach that substantially improves visual quality and structural coherence in ultra-high-resolution image synthesis. Abstract: Diffusion models have emerged as the leading approach for image synthesis, demonstrating exceptional photorealism and diversity. However, training diffusion models at high resolutions remains computationally prohibitive, and existing zero-shot generation techniques for synthesizing images beyond training resolutions often produce artifacts, including object duplication and spatial incoherence. In this paper, we introduce HiWave, a training-free, zero-shot approach that substantially enhances visual fidelity and structural coherence in ultra-high-resolution image synthesis using pretrained diffusion models. Our method employs a two-stage pipeline: generating a base image from the pretrained model followed by a patch-wise DDIM inversion step and a novel wavelet-based detail enhancer module. Specifically, we first utilize inversion methods to derive initial noise vectors that preserve global coherence from the base image. Subsequently, during sampling, our wavelet-domain detail enhancer retains low-frequency components from the base image to ensure structural consistency, while selectively guiding high-frequency components to enrich fine details and textures. Extensive evaluations using Stable Diffusion XL demonstrate that HiWave effectively mitigates common visual artifacts seen in prior methods, achieving superior perceptual quality. A user study confirmed HiWave's performance, where it was preferred over the state-of-the-art alternative in more than 80% of comparisons, highlighting its effectiveness for high-quality, ultra-high-resolution image synthesis without requiring retraining or architectural modifications.
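
Editor's sketch (not from the paper): the frequency split described in the abstract, keeping low-frequency content from a base image while taking high-frequency content from elsewhere, can be shown on plain arrays with a single-level 2D Haar transform. The choice of the Haar wavelet, the single decomposition level, and the function names `haar_dwt2`, `haar_idwt2`, and `mix_frequencies` are assumptions. HiWave applies its enhancer during diffusion sampling, which this toy example does not model.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar transform of an array with even height and width."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2  # low-frequency approximation
    lh = (a - b + c - d) / 2  # detail bands
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x

def mix_frequencies(base, detailed):
    """Keep the low-frequency band of `base` and the detail bands of `detailed`."""
    ll_base, _, _, _ = haar_dwt2(base)
    _, lh, hl, hh = haar_dwt2(detailed)
    return haar_idwt2(ll_base, lh, hl, hh)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.random((8, 8))
    detailed = rng.random((8, 8))
    mixed = mix_frequencies(base, detailed)
    # The coarse structure of the result matches the base image.
    print(np.allclose(haar_dwt2(mixed)[0], haar_dwt2(base)[0]))
```

Because the low-frequency band is copied unchanged, the mixed result keeps the base image's coarse layout no matter what the detail bands contain, which is the structural-consistency property the abstract attributes to the enhancer.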

[64] A Deep Learning Approach to Identify Rock Bolts in Complex 3D Point Clouds of Underground Mines Captured Using Mobile Laser Scanners

Dibyayan Patra,Pasindu Ranasinghe,Bikram Banerjee,Simit Raval

Main category: cs.CV

TL;DR: This study introduces DeepBolt, an automated method using deep learning to accurately identify rock bolts in 3D point clouds from underground mines, addressing issues like noise, occlusion, and class imbalance.

Details

Motivation: Automated detection of rock bolts is necessary due to challenges in manual surveying, such as low light conditions and time constraints in underground mines. Method: The paper proposes DeepBolt, a two-stage deep learning architecture designed to handle severe class imbalance for identifying rock bolts in 3D point clouds. Result: DeepBolt surpasses state-of-the-art models by up to 42.5% in Intersection over Union (IoU) and achieves 96.41% precision and 96.96% recall. Conclusion: DeepBolt demonstrates robust and effective identification of rock bolts in complex underground environments, outperforming existing techniques with high precision and recall. Abstract: Rock bolts are crucial components of the subterranean support systems in underground mines that provide adequate structural reinforcement to the rock mass to prevent unforeseen hazards like rockfalls. This makes frequent assessments of such bolts critical for maintaining rock mass stability and minimising risks in underground mining operations. Where manual surveying of rock bolts is challenging due to the low light conditions in the underground mines and the time-intensive nature of the process, automated detection of rock bolts serves as a plausible solution. To that end, this study focuses on the automatic identification of rock bolts within medium to large-scale 3D point clouds obtained from underground mines using mobile laser scanners. Existing techniques for automated rock bolt identification primarily rely on feature engineering and traditional machine learning approaches. However, such techniques lack robustness as these point clouds present several challenges due to data noise, varying environments, and complex surrounding structures. Moreover, the target rock bolts are extremely small objects within large-scale point clouds and are often partially obscured due to the application of reinforcement shotcrete. 
Addressing these challenges, this paper proposes an approach termed DeepBolt, which employs a novel two-stage deep learning architecture specifically designed for handling severe class imbalance for the automatic and efficient identification of rock bolts in complex 3D point clouds. The proposed method surpasses state-of-the-art semantic segmentation models by up to 42.5% in Intersection over Union (IoU) for rock bolt points. Additionally, it outperforms existing rock bolt identification techniques, achieving a 96.41% precision and 96.96% recall in classifying rock bolts, demonstrating its robustness and effectiveness in complex underground environments.

[65] AI-assisted radiographic analysis in detecting alveolar bone-loss severity and patterns

Chathura Wimalasiri,Piumal Rathnayake,Shamod Wijerathne,Sumudu Rasnayaka,Dhanushka Leuke Bandara,Roshan Ragel,Vajira Thambawita,Isuru Nawinne

Main category: cs.CV

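TL;DR: This study develops an AI-based deep learning framework that automatically detects and quantifies the severity and pattern of alveolar bone loss from intraoral periapical radiographs, combining YOLOv8, Keypoint R-CNN, and YOLOv8x-seg models for efficient and accurate assessment.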

Details

Motivation: Periodontitis is a chronic inflammatory disease that causes alveolar bone loss and seriously affects oral health and quality of life. Accurate assessment of bone loss severity and pattern is critical for diagnosis and treatment planning, but conventional methods rely on subjective manual evaluation, so a rapid, objective, and reproducible automated tool is needed. Method: The study combines YOLOv8 for tooth detection with Keypoint R-CNN models that identify anatomical landmarks for precise calculation of bone loss severity. YOLOv8x-seg models segment bone levels and tooth masks, and geometric analysis then distinguishes bone loss patterns (horizontal vs. angular). Result: Evaluated on a dataset of 1000 expert-annotated radiographs, the method achieves high accuracy in detecting bone loss severity (intra-class correlation coefficient up to 0.80) and 87% accuracy in bone loss pattern classification. Conclusion: The automated system offers a rapid, objective, and reproducible tool for periodontal assessment that reduces reliance on subjective manual evaluation. It has the potential to improve early diagnosis and personalized treatment planning for periodontitis, ultimately improving patient care and clinical outcomes. Abstract: Periodontitis, a chronic inflammatory disease causing alveolar bone loss, significantly affects oral health and quality of life. Accurate assessment of bone loss severity and pattern is critical for diagnosis and treatment planning. In this study, we propose a novel AI-based deep learning framework to automatically detect and quantify alveolar bone loss and its patterns using intraoral periapical (IOPA) radiographs. Our method combines YOLOv8 for tooth detection with Keypoint R-CNN models to identify anatomical landmarks, enabling precise calculation of bone loss severity. Additionally, YOLOv8x-seg models segment bone levels and tooth masks to determine bone loss patterns (horizontal vs. angular) via geometric analysis. Evaluated on a large, expertly annotated dataset of 1000 radiographs, our approach achieved high accuracy in detecting bone loss severity (intra-class correlation coefficient up to 0.80) and bone loss pattern classification (accuracy 87%). This automated system offers a rapid, objective, and reproducible tool for periodontal assessment, reducing reliance on subjective manual evaluation. By integrating AI into dental radiographic analysis, our framework has the potential to improve early diagnosis and personalized treatment planning for periodontitis, ultimately enhancing patient care and clinical outcomes.

Manyi Li,Renshuai Tao,Yufan Liu,Chuangchuang Tan,Haotong Qin,Bing Li,Yunchao Wei,Yao Zhao

Main category: cs.CV

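TL;DR: This paper proposes PLADA, a new deepfake image detection framework that effectively addresses compressed images on online social networks and the lack of paired data, outperforming current state-of-the-art methods.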

Details

Motivation: Existing deepfake detection methods overlook the "block effects" introduced by image compression on online social networks, so they perform poorly in real-world scenarios. Method: The paper proposes the PLADA framework, which has two core modules: Block Effect Eraser (B2E), which uses a dual-stage attention mechanism to handle block effects, and Open Data Aggregation (ODA), which processes both paired and unpaired data. Result: Extensive experiments across 26 datasets show that PLADA performs strongly at detecting deepfake images on OSNs, even with limited paired data and compression. Conclusion: PLADA detects deepfakes better than existing methods on compressed images and with little paired data, and it introduces the "block effect" as a critical factor. Abstract: With the rapid advancement of deep learning, particularly through generative adversarial networks (GANs) and diffusion models (DMs), AI-generated images, or "deepfakes", have become nearly indistinguishable from real ones. These images are widely shared across Online Social Networks (OSNs), raising concerns about their misuse. Existing deepfake detection methods overlook the "block effects" introduced by compression in OSNs, which obscure deepfake artifacts, and primarily focus on raw images, rarely encountered in real-world scenarios. To address these challenges, we propose PLADA (Pay Less Attention to Deceptive Artifacts), a novel framework designed to tackle the lack of paired data and the ineffective use of compressed images. PLADA consists of two core modules: Block Effect Eraser (B2E), which uses a dual-stage attention mechanism to handle block effects, and Open Data Aggregation (ODA), which processes both paired and unpaired data to improve detection. Extensive experiments across 26 datasets demonstrate that PLADA achieves a remarkable balance in deepfake detection, outperforming SoTA methods in detecting deepfakes on OSNs, even with limited paired data and compression. More importantly, this work introduces the "block effect" as a critical factor in deepfake detection, providing a robust solution for open-world scenarios. Our code is available at https://github.com/ManyiLee/PLADA.

[67] MMSearch-R1: Incentivizing LMMs to Search

Jinming Wu,Zihao Deng,Wei Li,Yiding Liu,Bo You,Bo Li,Zejun Ma,Ziwei Liu

Main category: cs.CV

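TL;DR: This paper presents MMSearch-R1, an end-to-end reinforcement learning framework for efficient multimodal search that improves performance while reducing search calls.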

Details

Motivation: Conventional approaches such as RAG and prompt-engineered search agents rely on fixed pipelines, which leads to inefficient search, so a more flexible and efficient multimodal search approach is needed. Method: An end-to-end reinforcement learning framework integrates image and text search tools and is trained with an outcome-based reward. Result: MMSearch-R1 outperforms RAG-based baselines of the same model size and matches a larger RAG-based model while reducing search calls by over 30%. Conclusion: MMSearch-R1 is an efficient multimodal search framework that substantially reduces search calls and performs well on knowledge-intensive tasks. Abstract: Robust deployment of large multimodal models (LMMs) in real-world scenarios requires access to external knowledge sources, given the complexity and dynamic nature of real-world information. Existing approaches such as retrieval-augmented generation (RAG) and prompt engineered search agents rely on rigid pipelines, often leading to inefficient or excessive search behaviors. We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables LMMs to perform on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them guided by an outcome-based reward with a search penalty. To support training, we collect a multimodal search VQA dataset through a semi-automated pipeline that covers diverse visual and textual knowledge needs and curate a search-balanced subset with both search-required and search-free samples, which proves essential for shaping efficient and on-demand search behavior. Extensive experiments on knowledge-intensive and info-seeking VQA tasks show that our model not only outperforms RAG-based baselines of the same model size, but also matches the performance of a larger RAG-based model while reducing search calls by over 30%. We further analyze key empirical findings to offer actionable insights for advancing research in multimodal search.
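
Editor's sketch (not from the paper): an outcome-based reward with a search penalty could take roughly the following shape. The function name `outcome_reward`, the 0/1 correctness score, the multiplicative discount, and the penalty value 0.1 are all assumptions. The abstract states only that such a reward exists, not its form.

```python
def outcome_reward(answer_correct: bool, used_search: bool, search_penalty: float = 0.1) -> float:
    """Score a rollout by its final answer, discounted when search was invoked.

    Hypothetical form: a wrong answer earns nothing, and a correct answer earns
    slightly less if the model called a search tool to reach it.
    """
    if not answer_correct:
        return 0.0
    return 1.0 - search_penalty if used_search else 1.0

if __name__ == "__main__":
    for correct in (True, False):
        for searched in (True, False):
            print(correct, searched, outcome_reward(correct, searched))
```

Under a reward of this shape, searching pays off only when it changes a wrong answer into a right one, which is one way to discourage unnecessary search calls.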

[68] Lightweight Multi-Frame Integration for Robust YOLO Object Detection in Videos

Yitong Quan,Benjamin Kiefer,Martin Messmer,Andreas Zell

Main category: cs.CV

TL;DR: This paper proposes a simple yet effective strategy for video object detection by stacking consecutive frames as input to a YOLO-based model, enhancing detection robustness without increasing model complexity.

Details

Motivation: Modern image-based object detection models like YOLOv7 ignore valuable temporal context in videos, and existing video-based methods often introduce complexity. Transient challenges such as motion blur and occlusions degrade performance in real-world applications like surveillance and autonomous driving. Method: The proposed method involves stacking multiple consecutive frames as input to a YOLO-based detector while supervising only the output corresponding to a single target frame, leveraging temporal information with minimal modifications to existing architectures. Result: Extensive experiments on MOT20Det and the newly introduced BOAT360 dataset demonstrate improved detection robustness, particularly for lightweight models, while maintaining simplicity, computational efficiency, and real-time inference. Conclusion: The paper concludes that stacking multiple consecutive frames as input to a YOLO-based detector improves detection robustness, especially for lightweight models, narrowing the gap between compact and heavy detection networks. Abstract: Modern image-based object detection models, such as YOLOv7, primarily process individual frames independently, thus ignoring valuable temporal context naturally present in videos. Meanwhile, existing video-based detection methods often introduce complex temporal modules, significantly increasing model size and computational complexity. In practical applications such as surveillance and autonomous driving, transient challenges including motion blur, occlusions, and abrupt appearance changes can severely degrade single-frame detection performance. To address these issues, we propose a straightforward yet highly effective strategy: stacking multiple consecutive frames as input to a YOLO-based detector while supervising only the output corresponding to a single target frame. 
This approach leverages temporal information with minimal modifications to existing architectures, preserving simplicity, computational efficiency, and real-time inference capability. Extensive experiments on the challenging MOT20Det and our BOAT360 datasets demonstrate that our method improves detection robustness, especially for lightweight models, effectively narrowing the gap between compact and heavy detection networks. Additionally, we contribute the BOAT360 benchmark dataset, comprising annotated fisheye video sequences captured from a boat, to support future research in multi-frame video object detection in challenging real-world scenarios.
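
Editor's sketch (not from the paper): stacking consecutive frames along the channel axis can be done in a few lines of NumPy. The function name `stack_frames`, the choice of the target frame as the last frame in the window, the window length of 3, and the repetition of the first frame at the start of a clip are all assumptions. The abstract does not say which frame in the stack is supervised.

```python
import numpy as np

def stack_frames(frames: np.ndarray, target_index: int, num_frames: int = 3) -> np.ndarray:
    """Stack `num_frames` consecutive frames ending at `target_index`.

    frames: array of shape (T, H, W, C).
    Returns an array of shape (H, W, C * num_frames) whose last C channels
    are the target frame; only that frame's labels would be used in training.
    """
    start = target_index - (num_frames - 1)
    # At the start of a clip, repeat frame 0 instead of indexing before it.
    indices = [max(i, 0) for i in range(start, target_index + 1)]
    return np.concatenate([frames[i] for i in indices], axis=-1)

if __name__ == "__main__":
    clip = np.zeros((10, 64, 64, 3), dtype=np.uint8)  # 10 RGB frames
    stacked = stack_frames(clip, target_index=5)
    print(stacked.shape)
```

With this layout, the detector's first convolution would need `C * num_frames` input channels instead of `C`, and the rest of the network can stay unchanged.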

[69] AdvMIM: Adversarial Masked Image Modeling for Semi-Supervised Medical Image Segmentation

Lei Zhu,Jun Zhou,Rick Siow Mong Goh,Yong Liu

Main category: cs.CV

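TL;DR: This paper proposes a new method for semi-supervised medical image segmentation that improves Transformer performance through adversarial masked image modeling and achieves significant gains.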

Details

Motivation: Transformers perform well in medical image segmentation but require large amounts of labeled data, which limits their use in annotation-scarce semi-supervised settings. The paper therefore aims to improve Transformer performance with limited labeled data. Method: The paper proposes an adversarial masked image modeling method, analyzes it theoretically from a multi-domain learning perspective, and designs a new adversarial training loss to reduce the gap between the original and masked domains. Result: Extensive experiments on three public medical image segmentation datasets show that the proposed method significantly outperforms existing methods. Conclusion: The experiments show that the method has a clear advantage in semi-supervised medical image segmentation, and the code is publicly available. Abstract: Vision Transformer has recently gained tremendous popularity in medical image segmentation task due to its superior capability in capturing long-range dependencies. However, transformer requires a large amount of labeled data to be effective, which hinders its applicability in annotation scarce semi-supervised learning scenario where only limited labeled data is available. State-of-the-art semi-supervised learning methods propose combinatorial CNN-Transformer learning to cross teach a transformer with a convolutional neural network, which achieves promising results. However, it remains a challenging task to effectively train the transformer with limited labeled data. In this paper, we propose an adversarial masked image modeling method to fully unleash the potential of transformer for semi-supervised medical image segmentation. The key challenge in semi-supervised learning with transformer lies in the lack of sufficient supervision signal. To this end, we propose to construct an auxiliary masked domain from original domain with masked image modeling and train the transformer to predict the entire segmentation mask with masked inputs to increase supervision signal. We leverage the original labels from labeled data and pseudo-labels from unlabeled data to learn the masked domain. To further benefit the original domain from masked domain, we provide a theoretical analysis of our method from a multi-domain learning perspective and devise a novel adversarial training loss to reduce the domain gap between the original and masked domain, which boosts semi-supervised learning performance. We also extend adversarial masked image modeling to CNN network. 
Extensive experiments on three public medical image segmentation datasets demonstrate the effectiveness of our method, where our method outperforms existing methods significantly. Our code is publicly available at https://github.com/zlheui/AdvMIM.

[70] Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization

Zhiwang Zhang,Dong Xu,Wanli Ouyang,Chuanqi Tan

Main category: cs.CV

TL;DR: This paper proposes a novel DaS framework for dense video captioning that combines visual and semantic information using a two-stage LSTM with hierarchical attention.

Details

Motivation: Dense video captioning requires rich semantic descriptions from multiple segments, and existing approaches may not effectively combine visual and semantic cues for summarization. Method: A division-and-summarization (DaS) framework is proposed, which partitions long videos into event proposals, extracts visual features, generates segment descriptions, and uses a two-stage LSTM with hierarchical attention to summarize sentences. Result: Experiments on the ActivityNet Captions dataset show that the DaS framework is effective in generating descriptive sentences for event proposals. Conclusion: The proposed DaS framework effectively summarizes dense video captions by leveraging both semantic and visual information through a two-stage LSTM network with hierarchical attention mechanism. Abstract: In this work, we propose a division-and-summarization (DaS) framework for dense video captioning. After partitioning each untrimmed long video as multiple event proposals, where each event proposal consists of a set of short video segments, we extract visual feature (e.g., C3D feature) from each segment and use the existing image/video captioning approach to generate one sentence description for this segment. Considering that the generated sentences contain rich semantic descriptions about the whole event proposal, we formulate the dense video captioning task as a visual cue aided sentence summarization problem and propose a new two stage Long Short Term Memory (LSTM) approach equipped with a new hierarchical attention mechanism to summarize all generated sentences as one descriptive sentence with the aid of visual features. Specifically, the first-stage LSTM network takes all semantic words from the generated sentences and the visual features from all segments within one event proposal as the input, and acts as the encoder to effectively summarize both semantic and visual information related to this event proposal. 
The second-stage LSTM network takes the output from the first-stage LSTM network and the visual features from all video segments within one event proposal as the input, and acts as the decoder to generate one descriptive sentence for this event proposal. Our comprehensive experiments on the ActivityNet Captions dataset demonstrate the effectiveness of our newly proposed DaS framework for dense video captioning.

[71] Causal Representation Learning with Observational Grouping for CXR Classification

Rajat Rasal,Avinash Kori,Ben Glocker

Main category: cs.CV

TL;DR: This paper proposes an end-to-end framework for learning causal representations by grouping observations, enhancing generalisability and robustness in chest X-ray disease classification.

Details

Motivation: Identifiable causal representation learning helps uncover true causal relationships, which can enhance the generalisability and robustness of latent features in medical imaging tasks. Method: An end-to-end framework was developed to enforce invariance with respect to race, sex, and imaging views by grouping observations for learning identifiable causal representations. Result: Experiments demonstrated that causal representations learned via grouping improved performance across multiple classification tasks. Conclusion: Causal representations through grouping observations improve generalisability and robustness in disease classification tasks on chest X-rays. Abstract: Identifiable causal representation learning seeks to uncover the true causal relationships underlying a data generation process. In medical imaging, this presents opportunities to improve the generalisability and robustness of task-specific latent features. This work introduces the concept of grouping observations to learn identifiable representations for disease classification in chest X-rays via an end-to-end framework. Our experiments demonstrate that these causal representations improve generalisability and robustness across multiple classification tasks when grouping is used to enforce invariance w.r.t race, sex, and imaging views.

[72] Dense Video Captioning using Graph-based Sentence Summarization

Zhiwang Zhang,Dong Xu,Wanli Ouyang,Luping Zhou

Main category: cs.CV

TL;DR: This paper introduces a graph-based two-stage framework (GPaS) for dense video captioning, which improves performance by analyzing scene evolution and semantic word relationships through GCN-LSTM modules.

Details

Motivation: Existing methods struggle to handle changes in scenes and objects over long event proposals. This work aims to improve captioning performance by better capturing semantic relationships and scene evolution. Method: The method involves a two-stage graph-based framework: first splitting event proposals into shorter segments for detailed captioning, then summarizing these captions into a single sentence. Semantic words are treated as nodes in a graph, with interactions learned via GCN-LSTM modules. Result: The proposed approach demonstrates effectiveness through comparisons with state-of-the-art methods on the ActivityNet Captions dataset and YouCook II dataset. Conclusion: The proposed GPaS framework effectively improves dense video captioning by exploring scene evolution within event temporal proposals, using a two-stage approach that includes partition and summarization. Abstract: Recently, dense video captioning has made attractive progress in detecting and captioning all events in a long untrimmed video. Although promising results have been achieved, most existing methods do not sufficiently explore the scene evolution within an event temporal proposal for captioning, and therefore perform less satisfactorily when the scenes and objects change over a relatively long proposal. To address this problem, we propose a graph-based partition-and-summarization (GPaS) framework for dense video captioning in two stages. For the "partition" stage, a whole event proposal is split into short video segments for captioning at a finer level. For the "summarization" stage, the generated sentences carrying rich description information for each segment are summarized into one sentence to describe the whole event. We particularly focus on the "summarization" stage, and propose a framework that effectively exploits the relationship between semantic words for summarization.
We achieve this goal by treating semantic words as nodes in a graph and learning their interactions by coupling Graph Convolutional Network (GCN) and Long Short Term Memory (LSTM), with the aid of visual cues. Two schemes of GCN-LSTM Interaction (GLI) modules are proposed for seamless integration of GCN and LSTM. The effectiveness of our approach is demonstrated via an extensive comparison with state-of-the-art methods on two benchmarks, the ActivityNet Captions and YouCook II datasets.

[73] Learning-Based Distance Estimation for 360° Single-Sensor Setups

Yitong Quan,Benjamin Kiefer,Martin Messmer,Andreas Zell

Main category: cs.CV

TL;DR: This paper proposes a deep learning-based method for distance estimation using a single 360° fisheye lens camera, offering better performance than traditional methods without requiring precise calibration.

Details

Motivation: Traditional geometric methods face challenges with lens distortions and environmental variability in omnidirectional imaging. This motivates the need for a more robust and adaptable solution for accurate distance estimation. Method: The method involves a neural network-based model that learns to estimate object distances directly from raw omnidirectional inputs without relying on precise lens calibration. Result: Experiments on LOAF, ULM360, and Boat360 datasets showed that the proposed model outperforms traditional and learning-based methods in both accuracy and robustness for distance estimation. Conclusion: The paper concludes that the proposed deep learning approach for monocular distance estimation using a 360° fisheye lens camera surpasses traditional geometry-based methods and learning baselines in accuracy and robustness, making it suitable for low-cost applications like robotics and surveillance. Abstract: Accurate distance estimation is a fundamental challenge in robotic perception, particularly in omnidirectional imaging, where traditional geometric methods struggle with lens distortions and environmental variability. In this work, we propose a neural network-based approach for monocular distance estimation using a single 360° fisheye lens camera. Unlike classical trigonometric techniques that rely on precise lens calibration, our method directly learns and infers the distance of objects from raw omnidirectional inputs, offering greater robustness and adaptability across diverse conditions. We evaluate our approach on three 360° datasets (LOAF, ULM360, and a newly captured dataset Boat360), each representing distinct environmental and sensor setups. Our experimental results demonstrate that the proposed learning-based model outperforms traditional geometry-based methods and other learning baselines in both accuracy and robustness.
These findings highlight the potential of deep learning for real-time omnidirectional distance estimation, making our approach particularly well-suited for low-cost applications in robotics, autonomous navigation, and surveillance.

[74] TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness

Pritam Mishra,Coloma Ballester,Dimosthenis Karatzas

Main category: cs.CV

TL;DR: A new self-supervised video summarization model efficiently captures spatial and temporal dependencies, achieving strong results without relying on attention mechanisms or supervised annotations.

Details

Motivation: The motivation is to overcome the limitations of current methods that are either computationally expensive, reliant on supervised annotations, or not robust across different datasets. Method: The method involves a self-supervised learning framework with Markov process-driven loss metrics that capture spatial and temporal dependencies without attention, RNNs, or transformers. Result: The approach achieves state-of-the-art performance on SUMME and TVSUM datasets, outperforming unsupervised methods and rivaling the best supervised models. Conclusion: The proposed self-supervised video summarization model offers a more generalizable and efficient solution, challenging the reliance on complex architectures. Abstract: The increasing ubiquity of video content and the corresponding demand for efficient access to meaningful information have elevated video summarization and video highlights as a vital research area. However, many state-of-the-art methods depend heavily either on supervised annotations or on attention-based models, which are computationally expensive and brittle in the face of distribution shifts that hinder cross-domain applicability across datasets. We introduce a pioneering self-supervised video summarization model that captures both spatial and temporal dependencies without the overhead of attention, RNNs, or transformers. Our framework integrates a novel set of Markov process-driven loss metrics and a two-stage self supervised learning paradigm that ensures both performance and efficiency. Our approach achieves state-of-the-art performance on the SUMME and TVSUM datasets, outperforming all existing unsupervised methods. It also rivals the best supervised models, demonstrating the potential for efficient, annotation-free architectures. This paves the way for more generalizable video summarization techniques and challenges the prevailing reliance on complex architectures.

[75] WonderFree: Enhancing Novel View Quality and Cross-View Consistency for 3D Scene Exploration

Chaojun Ni,Jie Li,Haoyun Li,Hengyu Liu,Xiaofeng Wang,Zheng Zhu,Guosheng Zhao,Boyuan Wang,Chenxin Li,Guan Huang,Wenjun Mei

Main category: cs.CV

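TL;DR: This paper proposes WonderFree, a new model for interactively generating 3D scenes from a single image that addresses novel view quality and cross-view consistency, achieving high-quality rendering and spatial consistency.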

Details

Motivation: Current 3D generation methods offer limited explorability and cannot render high-quality images during large movements, especially when moving forward into unseen areas. Method: The challenge is decomposed into two subproblems: novel view quality and cross-view consistency. WorldRestorer is introduced to eliminate floaters and artifacts, and ConsistView is proposed to maintain spatiotemporal consistency across multiple views. Result: Experimental results show that WonderFree not only improves rendering quality across different viewpoints but also significantly improves global consistency and coherence, as supported by CLIP-based metrics and a user study. Conclusion: WonderFree is the first model that lets users interactively generate 3D worlds from arbitrary angles and directions, providing a seamless and immersive 3D exploration experience. Abstract: Interactive 3D scene generation from a single image has gained significant attention due to its potential to create immersive virtual worlds. However, a key challenge in current 3D generation methods is the limited explorability, which cannot render high-quality images during larger maneuvers beyond the original viewpoint, particularly when attempting to move forward into unseen areas. To address this challenge, we propose WonderFree, the first model that enables users to interactively generate 3D worlds with the freedom to explore from arbitrary angles and directions. Specifically, we decouple this challenge into two key subproblems: novel view quality, which addresses visual artifacts and floating issues in novel views, and cross-view consistency, which ensures spatial consistency across different viewpoints. To enhance rendering quality in novel views, we introduce WorldRestorer, a data-driven video restoration model designed to eliminate floaters and artifacts. In addition, a data collection pipeline is presented to automatically gather training data for WorldRestorer, ensuring it can handle scenes with varying styles needed for 3D scene generation. Furthermore, to improve cross-view consistency, we propose ConsistView, a multi-view joint restoration mechanism that simultaneously restores multiple perspectives while maintaining spatiotemporal coherence. Experimental results demonstrate that WonderFree not only enhances rendering quality across diverse viewpoints but also significantly improves global coherence and consistency.
These improvements are confirmed by CLIP-based metrics and a user study showing a 77.20% preference for WonderFree over WonderWorld, enabling a seamless and immersive 3D exploration experience. The code, model, and data will be publicly available.

[76] SFNet: Fusion of Spatial and Frequency-Domain Features for Remote Sensing Image Forgery Detection

Ji Qi,Xinchang Zhang,Dingqi Ye,Yongjia Ruan,Xin Guo,Shaowen Wang,Haifeng Li

Main category: cs.CV

TL;DR: This paper proposes SFNet, a new framework for detecting forged content in remote sensing images.

Details

Motivation: Advances in generative AI are making fake remote sensing images increasingly difficult to detect, which can lead to erroneous intelligence, fake news, and even conspiracy theories. Existing forgery detection methods typically rely on a single visual feature to capture predefined artifacts, so their detection performance can suffer across different geographic terrain, land cover types, or specific features within remote sensing images. Method: SFNet uses two independent feature extractors to capture spatial-domain and frequency-domain features of the input remote sensing image. To make full use of these complementary domain features, SFNet designs a domain feature mapping module and a hybrid domain feature refinement module (CBAM attention) to successively align and fuse the multi-domain features while suppressing redundant information. Result: Experiments show that SFNet improves accuracy by 4%-15.18% over state-of-the-art remote sensing forgery detection methods on three datasets and exhibits strong generalization. Conclusion: SFNet is a new forgery detection framework that leverages spatial- and frequency-domain features to identify fake images across diverse remote sensing data, with strong generalization. Abstract: The rapid advancement of generative artificial intelligence is producing fake remote sensing imagery (RSI) that is increasingly difficult to detect, potentially leading to erroneous intelligence, fake news, and even conspiracy theories. Existing forgery detection methods typically rely on single visual features to capture predefined artifacts, such as spatial-domain cues to detect forged objects like roads or buildings in RSI, or frequency-domain features to identify artifacts from up-sampling operations in generative adversarial networks (GANs). However, the nature of artifacts can significantly differ depending on geographic terrain, land cover types, or specific features within the RSI. Moreover, these complex artifacts evolve as generative models become more sophisticated. In short, over-reliance on a single visual cue makes existing forgery detectors struggle to generalize across diverse remote sensing data. This paper proposes a novel forgery detection framework called SFNet, designed to identify fake images in diverse remote sensing data by leveraging spatial and frequency domain features. Specifically, to obtain rich and comprehensive visual information, SFNet employs two independent feature extractors to capture spatial and frequency domain features from input RSIs. To fully utilize the complementary domain features, the domain feature mapping module and the hybrid domain feature refinement module (CBAM attention) of SFNet are designed to successively align and fuse the multi-domain features while suppressing redundant information.
Experiments on three datasets show that SFNet achieves an accuracy improvement of 4%-15.18% over the state-of-the-art RS forgery detection methods and exhibits robust generalization capabilities. The code is available at https://github.com/GeoX-Lab/RSTI/tree/main/SFNet.

[77] Video Perception Models for 3D Scene Synthesis

Rui Huang,Guangyao Zhai,Zuria Bauer,Marc Pollefeys,Federico Tombari,Leonidas Guibas,Gao Huang,Francis Engelmann

Main category: cs.CV

TL;DR: This paper presents VIPScene, a novel framework that uses video generation models and multimodal reasoning to improve 3D scene synthesis, achieving better realism, consistency, and generalization than existing methods.

Details

Motivation: Automating 3D scene synthesis can benefit fields like architectural design, robotics simulation, virtual reality, and gaming. Current approaches face limitations in spatial reasoning and multi-view consistency. Method: The VIPScene framework integrates video generation, feedforward 3D reconstruction, and open-vocabulary perception models to semantically and geometrically analyze each object in a scene. It also introduces FPVScore for coherence and plausibility evaluation. Result: VIPScene enables flexible scene synthesis with high realism and structural consistency while providing more precise analysis through FPVScore. Conclusion: VIPScene significantly outperforms existing methods and generalizes well across diverse scenarios in 3D scene synthesis. Abstract: Traditionally, 3D scene synthesis requires expert knowledge and significant manual effort. Automating this process could greatly benefit fields such as architectural design, robotics simulation, virtual reality, and gaming. Recent approaches to 3D scene synthesis often rely on the commonsense reasoning of large language models (LLMs) or strong visual priors of modern image generation models. However, current LLMs demonstrate limited 3D spatial reasoning ability, which restricts their ability to generate realistic and coherent 3D scenes. Meanwhile, image generation-based methods often suffer from constraints in viewpoint selection and multi-view inconsistencies. In this work, we present Video Perception models for 3D Scene synthesis (VIPScene), a novel framework that exploits the encoded commonsense knowledge of the 3D physical world in video generation models to ensure coherent scene layouts and consistent object placements across views. VIPScene accepts both text and image prompts and seamlessly integrates video generation, feedforward 3D reconstruction, and open-vocabulary perception models to semantically and geometrically analyze each object in a scene. 
This enables flexible scene synthesis with high realism and structural consistency. For more precise analysis, we further introduce First-Person View Score (FPVScore) for coherence and plausibility evaluation, utilizing continuous first-person perspective to capitalize on the reasoning ability of multimodal large language models. Extensive experiments show that VIPScene significantly outperforms existing methods and generalizes well across diverse scenarios. The code will be released.

[78] Shape2Animal: Creative Animal Generation from Natural Silhouettes

Quoc-Duy Tran,Anh-Tuan Vo,Dinh-Khoi Vo,Tam V. Nguyen,Minh-Triet Tran,Trung-Nghia Le

Main category: cs.CV

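TL;DR: This paper introduces the Shape2Animal framework, which mimics the human ability to perceive meaningful patterns in ambiguous stimuli: using open-vocabulary segmentation, vision-language models, and text-to-image diffusion models, it reinterprets natural object silhouettes as plausible animal forms and generates corresponding images.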

Details

Motivation: Humans have a unique ability to perceive meaningful patterns in ambiguous stimuli, a cognitive phenomenon known as pareidolia. Method: The framework uses open-vocabulary segmentation to extract object silhouettes, uses vision-language models to interpret semantically appropriate animal concepts, then uses a text-to-image diffusion model to synthesize an animal image that conforms to the input shape and blends it seamlessly into the original scene. Result: The framework shows robustness and creative potential across a variety of real-world inputs. Conclusion: Shape2Animal can open new opportunities for visual storytelling, educational content, digital art, and interactive media design. Abstract: Humans possess a unique ability to perceive meaningful patterns in ambiguous stimuli, a cognitive phenomenon known as pareidolia. This paper introduces the Shape2Animal framework to mimic this imaginative capacity by reinterpreting natural object silhouettes, such as clouds, stones, or flames, as plausible animal forms. Our automated framework first performs open-vocabulary segmentation to extract the object silhouette and interprets semantically appropriate animal concepts using vision-language models. It then synthesizes an animal image that conforms to the input shape, leveraging a text-to-image diffusion model, and seamlessly blends it into the original scene to generate visually coherent and spatially consistent compositions. We evaluated Shape2Animal on a diverse set of real-world inputs, demonstrating its robustness and creative potential. Our Shape2Animal can offer new opportunities for visual storytelling, educational content, digital art, and interactive media design. Our project page is here: https://shape2image.github.io

[79] Joint attitude estimation and 3D neural reconstruction of non-cooperative space objects

Clément Forray,Pauline Delporte,Nicolas Delaygue,Florence Genin,Dawa Derksen

Main category: cs.CV

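TL;DR: This study uses Neural Radiance Fields (NeRF) and optimizes camera poses to overcome issues such as limited viewing angles, achieving efficient 3D reconstruction of non-cooperative space objects.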

Details

Motivation: Better knowledge of the current state and behavior of objects orbiting Earth is essential for applications such as active debris removal, in-orbit maintenance, and anomaly detection. Method: The paper uses Neural Radiance Fields (NeRF) to perform 3D reconstruction of non-cooperative space objects from simulated images, focusing primarily on the joint optimization of camera poses and the NeRF. Result: Experiments show that the most accurate 3D reconstruction is obtained when training with successive images one by one, with camera poses estimated by optimizing a uniform rotation together with regularization. Conclusion: The paper concludes that accurate 3D reconstruction of non-cooperative space objects can be achieved by optimizing camera poses successively and applying regularization. Abstract: Obtaining a better knowledge of the current state and behavior of objects orbiting Earth has proven to be essential for a range of applications such as active debris removal, in-orbit maintenance, or anomaly detection. 3D models represent a valuable source of information in the field of Space Situational Awareness (SSA). In this work, we leveraged Neural Radiance Fields (NeRF) to perform 3D reconstruction of non-cooperative space objects from simulated images. This scenario is challenging for NeRF models due to unusual camera characteristics and environmental conditions: mono-chromatic images, unknown object orientation, limited viewing angles, absence of diffuse lighting, etc. In this work we focus primarily on the joint optimization of camera poses alongside the NeRF. Our experimental results show that the most accurate 3D reconstruction is achieved when training with successive images one-by-one. We estimate camera poses by optimizing a uniform rotation and use regularization to prevent successive poses from being too far apart.

[80] Disentangled representations of microscopy images

Jacopo Dapueto,Vito Paolo Pastore,Nicoletta Noceti,Francesca Odone

Main category: cs.CV

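TL;DR: This paper proposes an approach based on Disentangled Representation Learning (DRL) that learns a representation from synthetic data and transfers it to real microscopy image classification tasks, improving model interpretability while maintaining high accuracy.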

Details

Motivation: Although deep neural networks perform well in microscopy image analysis, model interpretability, an essential requirement in this field, remains an open challenge. Method: The paper proposes a framework based on Disentangled Representation Learning (DRL) that takes a representation learned from synthetic data and applies it to real microscopy image classification tasks to improve interpretability. Result: Experiments on benchmark datasets from three microscopy image domains (plankton, yeast vacuoles, and human cells) show that the method achieves a good trade-off between accuracy and interpretability. Conclusion: The study shows that a DRL-based approach can effectively improve model interpretability in microscopy image classification while maintaining high performance, offering new directions for future research. Abstract: Microscopy image analysis is fundamental for different applications, from diagnosis to synthetic engineering and environmental monitoring. Modern acquisition systems have granted the possibility to acquire an escalating amount of images, requiring a consequent development of a large collection of deep learning-based automatic image analysis methods. Although deep neural networks have demonstrated great performance in this field, interpretability, an essential requirement for microscopy image analysis, remains an open challenge. This work proposes a Disentangled Representation Learning (DRL) methodology to enhance model interpretability for microscopy image classification. Exploiting benchmark datasets from three different microscopic image domains (plankton, yeast vacuoles, and human cells), we show how a DRL framework, based on transferring a representation learnt from synthetic data, can provide a good trade-off between accuracy and interpretability in this domain.

[81] IPFormer: Visual 3D Panoptic Scene Completion with Context-Adaptive Instance Proposals

Markus Gross,Aya Fahmy,Danit Niwattananan,Dominik Muhle,Rui Song,Daniel Cremers,Henri Meeß

Main category: cs.CV

TL;DR: IPFormer improves 3D Panoptic Scene Completion using adaptive instance proposals from image context, achieving superior performance and faster runtime.

Details

Motivation: The motivation is to overcome the limitations of static queries in Transformer-based SSC approaches and explore camera image-based methods for PSC, enhancing object-level sensitivity in scene understanding. Method: IPFormer adaptively initializes queries as panoptic instance proposals derived from image context and refines them through attention-based encoding and decoding. Result: IPFormer surpasses state-of-the-art methods in overall panoptic metrics PQ† and PQ-All, achieves a runtime reduction exceeding 14×, and shows significant improvement through ablation studies on dynamic instance proposals. Conclusion: IPFormer introduces context-adaptive instance proposals for vision-based 3D Panoptic Scene Completion, showing significant improvements in performance metrics and runtime efficiency. Abstract: Semantic Scene Completion (SSC) has emerged as a pivotal approach for jointly learning scene geometry and semantics, enabling downstream applications such as navigation in mobile robotics. The recent generalization to Panoptic Scene Completion (PSC) advances the SSC domain by integrating instance-level information, thereby enhancing object-level sensitivity in scene understanding. While PSC was introduced using LiDAR modality, methods based on camera images remain largely unexplored. Moreover, recent Transformer-based SSC approaches utilize a fixed set of learned queries to reconstruct objects within the scene volume. Although these queries are typically updated with image context during training, they remain static at test time, limiting their ability to dynamically adapt specifically to the observed scene. To overcome these limitations, we propose IPFormer, the first approach that leverages context-adaptive instance proposals at train and test time to address vision-based 3D Panoptic Scene Completion. 
Specifically, IPFormer adaptively initializes these queries as panoptic instance proposals derived from image context and further refines them through attention-based encoding and decoding to reason about semantic instance-voxel relationships. Experimental results show that our approach surpasses state-of-the-art methods in overall panoptic metrics PQ$^\dagger$ and PQ-All, matches performance in individual metrics, and achieves a runtime reduction exceeding 14$\times$. Furthermore, our ablation studies reveal that dynamically deriving instance proposals from image context, as opposed to random initialization, leads to a 3.62% increase in PQ-All and a remarkable average improvement of 18.65% in combined Thing-metrics. These results highlight our introduction of context-adaptive instance proposals as a pioneering effort in addressing vision-based 3D Panoptic Scene Completion.
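
Note on the panoptic metrics cited in this entry: the standard panoptic quality (PQ) score for one class divides the summed IoU of matched prediction/ground-truth segment pairs by |TP| + 0.5|FP| + 0.5|FN|, where a pair counts as matched when its IoU exceeds 0.5. The sketch below shows only this generic definition. The function name and arguments are illustrative assumptions, and it does not reproduce the PQ† variant or the paper's own evaluation code.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """Generic per-class PQ (illustrative helper, not the paper's code).

    matched_ious: IoU values of matched (true-positive) segment pairs,
                  i.e. pairs whose IoU exceeds 0.5.
    num_false_positives: predicted segments with no matching ground truth.
    num_false_negatives: ground-truth segments with no matching prediction.
    """
    num_true_positives = len(matched_ious)
    denominator = (
        num_true_positives
        + 0.5 * num_false_positives
        + 0.5 * num_false_negatives
    )
    if denominator == 0:
        # No segments at all for this class; return 0.0 by convention here.
        return 0.0
    return sum(matched_ious) / denominator


# Two matched segments, one spurious prediction, one missed ground truth.
score = panoptic_quality([0.8, 0.6], num_false_positives=1, num_false_negatives=1)
print(round(score, 4))
```

Benchmark-level PQ is usually reported as the average of this per-class value over all classes.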

cs.LG [Back]

[82] A Spatio-Temporal Point Process for Fine-Grained Modeling of Reading Behavior

Francesco Ignazio Re,Andreas Opedal,Glib Manaiev,Mario Giulianelli,Ryan Cotterell

Main category: cs.LG

TL;DR: This paper introduces a marked spatio-temporal point process model for reading behavior, showing improved saccade modeling with a Hawkes process but limited gains from surprisal in predicting fixation durations.

Details

Motivation: Standard approaches to modeling reading behavior overlook detailed spatio-temporal dynamics. This paper aims to develop a more general model that captures these dynamics for better insights into online sentence processing. Method: A probabilistic model based on a marked spatio-temporal point process was developed to capture the dynamics of fixations and saccades. Saccades were modeled using a Hawkes process, while fixation durations incorporated time-convolved predictors. Result: The proposed Hawkes process model outperforms baselines in fitting human saccades. Contextual surprisal as a predictor only marginally improves fixation duration prediction accuracy. Conclusion: The Hawkes process model provides a better fit for human saccades, and contextual surprisal only marginally improves fixation duration predictions, suggesting limitations in surprisal theory for explaining fine-grained eye movements. Abstract: Reading is a process that unfolds across space and time, alternating between fixations where a reader focuses on a specific point in space, and saccades where a reader rapidly shifts their focus to a new point. An ansatz of psycholinguistics is that modeling a reader's fixations and saccades yields insight into their online sentence processing. However, standard approaches to such modeling rely on aggregated eye-tracking measurements and models that impose strong assumptions, ignoring much of the spatio-temporal dynamics that occur during reading. In this paper, we propose a more general probabilistic model of reading behavior, based on a marked spatio-temporal point process, that captures not only how long fixations last, but also where they land in space and when they take place in time. The saccades are modeled using a Hawkes process, which captures how each fixation excites the probability of a new fixation occurring near it in time and space. 
The duration time of fixation events is modeled as a function of fixation-specific predictors convolved across time, thus capturing spillover effects. Empirically, our Hawkes process model exhibits a better fit to human saccades than baselines. With respect to fixation durations, we observe that incorporating contextual surprisal as a predictor results in only a marginal improvement in the model's predictive accuracy. This finding suggests that surprisal theory struggles to explain fine-grained eye movements.
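
To illustrate the self-exciting idea described in this abstract, the sketch below evaluates a spatio-temporal Hawkes conditional intensity: a constant background rate plus a contribution from each past fixation that decays with elapsed time and with distance. The exponential temporal kernel, the Gaussian spatial kernel, the 1-D position, and all parameter names and values (`mu`, `alpha`, `beta`, `sigma`) are assumptions chosen for illustration; the abstract does not specify the kernels the paper actually uses.

```python
import math


def hawkes_intensity(t, x, history, mu=0.1, alpha=0.5, beta=2.0, sigma=1.0):
    """Conditional intensity at time t and 1-D position x (illustrative).

    history: list of past events (t_i, x_i), e.g. earlier fixations.
    mu:      background rate.
    alpha:   strength of excitation contributed by each past event.
    beta:    decay rate of the exponential temporal kernel.
    sigma:   width of the Gaussian spatial kernel.
    """
    rate = mu
    for t_i, x_i in history:
        if t_i >= t:
            continue  # only strictly earlier events can excite the process
        temporal = beta * math.exp(-beta * (t - t_i))
        spatial = math.exp(-((x - x_i) ** 2) / (2 * sigma ** 2)) / (
            sigma * math.sqrt(2 * math.pi)
        )
        rate += alpha * temporal * spatial
    return rate


past_fixations = [(1.0, 0.0)]
near = hawkes_intensity(1.1, 0.0, past_fixations)
far = hawkes_intensity(1.1, 5.0, past_fixations)
print(near > far)
```

With these kernels, the intensity is highest just after a fixation and close to where it landed, then relaxes back toward the background rate `mu`.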

[83] MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations

Vardhan Dongre,Chi Gui,Shubham Garg,Hooshang Nayyeri,Gokhan Tur,Dilek Hakkani-Tür,Vikram S. Adve

Main category: cs.LG

TL;DR: MIRAGE is a new benchmark for evaluating multimodal expert-level reasoning in agricultural consultative interactions, combining natural language, expert responses, and image-based context with high fidelity and diversity.

Details

Motivation: Existing benchmarks often rely on well-specified inputs and closed-set taxonomies, which do not reflect the complexity of real-world expert consultations. MIRAGE aims to bridge this gap by offering a high-fidelity benchmark that captures the full complexity of expert interactions in an open-world setting. Method: MIRAGE was constructed using over 35,000 real user-expert interactions and curated through a multi-step pipeline. It combines natural user queries, expert-authored responses, and image-based context, covering diverse crop health, pest diagnosis, and crop management scenarios with more than 7,000 unique biological entities. Result: MIRAGE features underspecified, context-rich scenarios requiring models to infer latent knowledge gaps, handle rare entities, and proactively guide or respond to interactions, making it one of the most taxonomically diverse and real-world grounded benchmarks for vision-language models. Conclusion: MIRAGE is a comprehensive and challenging benchmark for vision-language models, designed to evaluate multimodal expert-level reasoning and decision-making in real-world agricultural scenarios. Abstract: We introduce MIRAGE, a new benchmark for multimodal expert-level reasoning and decision-making in consultative interaction settings. Designed for the agriculture domain, MIRAGE captures the full complexity of expert consultations by combining natural user queries, expert-authored responses, and image-based context, offering a high-fidelity benchmark for evaluating models on grounded reasoning, clarification strategies, and long-form generation in a real-world, knowledge-intensive domain. Grounded in over 35,000 real user-expert interactions and curated through a carefully designed multi-step pipeline, MIRAGE spans diverse crop health, pest diagnosis, and crop management scenarios. 
The benchmark includes more than 7,000 unique biological entities, covering plant species, pests, and diseases, making it one of the most taxonomically diverse benchmarks available for vision-language models, grounded in the real world. Unlike existing benchmarks that rely on well-specified user inputs and closed-set taxonomies, MIRAGE features underspecified, context-rich scenarios with open-world settings, requiring models to infer latent knowledge gaps, handle rare entities, and either proactively guide the interaction or respond. Project Page: https://mirage-benchmark.github.io

Table of Contents

cs.CL [Back]

[1] CycleDistill: Bootstrapping Machine Translation using LLMs with Cyclical Distillation

[2] Inference Scaled GraphRAG: Improving Multi Hop Question Answering on Knowledge Graphs

[3] Doc2Agent: Scalable Generation of Tool-Using Agents from API Documentation

[4] A Modular Multitask Reasoning Framework Integrating Spatio-temporal Models and LLMs

[5] SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization

[6] Bridging Compositional and Distributional Semantics: A Survey on Latent Semantic Geometry via AutoEncoder

[7] ITFormer: Bridging Time Series and Natural Language for Multi-Modal QA with Large-Scale Multitask Dataset

[8] A Multi-Pass Large Language Model Framework for Precise and Efficient Radiology Report Error Detection

[9] Leveraging AI Graders for Missing Score Imputation to Achieve Accurate Ability Estimation in Constructed-Response Tests

[10] CCRS: A Zero-Shot LLM-as-a-Judge Framework for Comprehensive RAG Evaluation

[11] AALC: Large Language Model Efficient Reasoning via Adaptive Accuracy-Length Control

[12] SEED: A Structural Encoder for Embedding-Driven Decoding in Time Series Prediction with LLMs

[13] COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees

[14] How to Retrieve Examples in In-context Learning to Improve Conversational Emotion Recognition using Large Language Models?

[15] Intrinsic vs. Extrinsic Evaluation of Czech Sentence Embeddings: Semantic Relevance Doesn't Help with MT Evaluation

[16] Perspectives in Play: A Multi-Perspective Approach for More Inclusive NLP Systems

[17] Enhancing Large Language Models through Structured Reasoning

[18] CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment

[19] Narrative Shift Detection: A Hybrid Approach of Dynamic Topic Models and Large Language Models

[20] Biomed-Enriched: A Biomedical Dataset Enriched with LLMs for Pretraining and Extracting Rare and Hidden Content

[21] TAPS: Tool-Augmented Personalisation via Structured Tagging

[22] An Agentic System for Rare Disease Diagnosis with Traceable Reasoning

[23] Probing AI Safety with Source Code

[24] Time is On My Side: Dynamics of Talk-Time Sharing in Video-chat Conversations

[25] Knowledge-Aware Diverse Reranking for Cross-Source Question Answering

[26] GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching

[27] ReCode: Updating Code API Knowledge with Reinforcement Learning

[28] OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling

[29] When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs

[30] Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm

[31] DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation

[32] Memento: Note-Taking for Your Future Self

[33] Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs

cs.CV

[34] Computer Vision based Automated Quantification of Agricultural Sprayers Boom Displacement

[35] EBC-ZIP: Improving Blockwise Crowd Counting with Zero-Inflated Poisson Regression

[36] ToSA: Token Merging with Spatial Awareness

[37] BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos

[38] From 2D to 3D Cognition: A Brief Survey of General World Models

[39] EAR: Erasing Concepts from Unified Autoregressive Models

[40] Loss-Aware Automatic Selection of Structured Pruning Criteria for Deep Neural Network Acceleration

[41] Towards Efficient Exemplar Based Image Editing with Multimodal VLMs

[42] Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models

[43] Towards Scalable and Generalizable Earth Observation Data Mining via Foundation Model Composition

[44] Progressive Alignment Degradation Learning for Pansharpening

[45] UniCode$^2$: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation

[46] Dynamic Bandwidth Allocation for Hybrid Event-RGB Transmission

[47] Recognizing Surgical Phases Anywhere: Few-Shot Test-time Adaptation and Task-graph Guided Refinement

[48] A Transformer Based Handwriting Recognition System Jointly Using Online and Offline Features

[49] Hierarchical Mask-Enhanced Dual Reconstruction Network for Few-Shot Fine-Grained Image Classification

[50] Forensic Study of Paintings Through the Comparison of Fabrics

[51] From Ideal to Real: Unified and Data-Efficient Dense Prediction for Real-World Scenarios

[52] Breaking Spatial Boundaries: Spectral-Domain Registration Guided Hyperspectral and Multispectral Blind Fusion

[53] Ctrl-Z Sampling: Diffusion Sampling with Controlled Random Zigzag Explorations

[54] TDiR: Transformer based Diffusion for Image Restoration Tasks

[55] Radiomic fingerprints for knee MR images assessment

[56] On the Burstiness of Faces in Set

[57] From Codicology to Code: A Comparative Study of Transformer and YOLO-based Detectors for Layout Analysis in Historical Documents

[58] Feature Hallucination for Self-supervised Action Recognition

[59] InvZW: Invariant Feature Learning via Noise-Adversarial Training for Robust Image Zero-Watermarking

[60] Exploiting Lightweight Hierarchical ViT and Dynamic Framework for Efficient Visual Tracking

[61] A Novel Large Vision Foundation Model (LVFM)-based Approach for Generating High-Resolution Canopy Height Maps in Plantations for Precision Forestry Management

[62] Med-Art: Diffusion Transformer for 2D Medical Text-to-Image Generation

[63] HiWave: Training-Free High-Resolution Image Generation via Wavelet-Based Diffusion Sampling

[64] A Deep Learning Approach to Identify Rock Bolts in Complex 3D Point Clouds of Underground Mines Captured Using Mobile Laser Scanners

[65] AI-assisted radiographic analysis in detecting alveolar bone-loss severity and patterns

[66] Pay Less Attention to Deceptive Artifacts: Robust Detection of Compressed Deepfakes on Online Social Networks

[67] MMSearch-R1: Incentivizing LMMs to Search

[68] Lightweight Multi-Frame Integration for Robust YOLO Object Detection in Videos

[69] AdvMIM: Adversarial Masked Image Modeling for Semi-Supervised Medical Image Segmentation

[70] Show, Tell and Summarize: Dense Video Captioning Using Visual Cue Aided Sentence Summarization

[71] Causal Representation Learning with Observational Grouping for CXR Classification

[72] Dense Video Captioning using Graph-based Sentence Summarization

[73] Learning-Based Distance Estimation for 360° Single-Sensor Setups

[74] TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness

[75] WonderFree: Enhancing Novel View Quality and Cross-View Consistency for 3D Scene Exploration

[76] SFNet: Fusion of Spatial and Frequency-Domain Features for Remote Sensing Image Forgery Detection

[77] Video Perception Models for 3D Scene Synthesis