cs.CL

[1] Veracity: An Open-Source AI Fact-Checking System

Taylor Lynn Curtis,Maximilian Puelma Touzel,William Garneau,Manon Gruaz,Mike Pinder,Li Wei Wang,Sukanya Krishna,Luda Cohen,Jean-François Godbout,Reihaneh Rabbany,Kellin Pelrine

Main category: cs.CL

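TL;DR: This paper introduces Veracity, an open-source fact-checking system that combines AI with web retrieval to combat misinformation and improve public media literacy.

Details Motivation: The spread of misinformation poses a significant threat to society, and the capabilities of generative AI exacerbate the problem. Method: The system uses the synergy between large language models (LLMs) and web retrieval agents to analyze user-submitted claims and provide veracity assessments with intuitive explanations. Result: The paper shows that Veracity can not only detect misinformation but also explain its reasoning, with key features including multilingual support, numerical scoring, and an interactive interface. Conclusion: Veracity is a promising tool that can combat misinformation effectively and, through transparent and accessible fact-checking, foster media literacy and a more informed society. Abstract: The proliferation of misinformation poses a significant threat to society, exacerbated by the capabilities of generative AI. This demo paper introduces Veracity, an open-source AI system designed to empower individuals to combat misinformation through transparent and accessible fact-checking. Veracity leverages the synergy between Large Language Models (LLMs) and web retrieval agents to analyze user-submitted claims and provide grounded veracity assessments with intuitive explanations. Key features include multilingual support, numerical scoring of claim veracity, and an interactive interface inspired by familiar messaging applications. This paper will showcase Veracity's ability to not only detect misinformation but also explain its reasoning, fostering media literacy and promoting a more informed society.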

[2] Rethinking LLM Training through Information Geometry and Quantum Metrics

Riccardo Di Sipio

Main category: cs.CL

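TL;DR: This paper discusses how curvature-aware methods such as information geometry and natural gradient descent deepen our understanding of optimization in large language model (LLM) training, and explores quantum analogies that hint at efficient optimization.

Details Motivation: Because LLM optimization takes place in high-dimensional parameter spaces with non-Euclidean structure, more principled learning methods are needed to explain phenomena such as sharp minima, generalization, and observed scaling laws. Method: The paper uses the information-geometry framework, in particular the Fisher information metric, to analyze model optimization, and speculates on efficient optimization methods suggested by quantum analogies based on the Fubini-Study metric and Quantum Fisher Information. Result: Through a geometric lens, the paper clarifies how curvature-aware methods deepen the understanding of LLM training and proposes possible future optimization directions. Conclusion: Curvature-aware methods offer a valuable perspective for understanding LLM training, while quantum analogies hint at potentially efficient optimization strategies. Abstract: Optimization in large language models (LLMs) unfolds over high-dimensional parameter spaces with non-Euclidean structure. Information geometry frames this landscape using the Fisher information metric, enabling more principled learning via natural gradient descent. Though often impractical, this geometric lens clarifies phenomena such as sharp minima, generalization, and observed scaling laws. We argue that curvature-aware approaches deepen our understanding of LLM training. Finally, we speculate on quantum analogies based on the Fubini-Study metric and Quantum Fisher Information, hinting at efficient optimization in quantum-enhanced systems.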
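
To make the natural gradient idea in this abstract concrete, here is a minimal sketch on a one-parameter Bernoulli model rather than an LLM. The update rescales the ordinary gradient by the inverse Fisher information, $\theta \leftarrow \theta + \eta F^{-1} \nabla_\theta \ell$. The toy model, the function names, and the learning rate are illustrative assumptions and do not come from the paper.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def plain_gradient_step(theta, data_mean, lr=1.0):
    """Ordinary gradient ascent on the mean log-likelihood of a Bernoulli model."""
    p = sigmoid(theta)
    return theta + lr * (data_mean - p)

def natural_gradient_step(theta, data_mean, lr=1.0):
    """Same gradient, rescaled by the inverse Fisher information p * (1 - p)."""
    p = sigmoid(theta)
    grad = data_mean - p        # d/d(theta) of the mean log-likelihood
    fisher = p * (1.0 - p)      # Fisher information in the logit parameterization
    return theta + lr * grad / fisher

# Toy comparison (assumed data: 80% of observations are 1).
theta_plain = theta_natural = 0.0
for _ in range(10):
    theta_plain = plain_gradient_step(theta_plain, data_mean=0.8)
    theta_natural = natural_gradient_step(theta_natural, data_mean=0.8)
```

In this toy setting the Fisher-rescaled update reaches the maximum-likelihood estimate in far fewer steps than the plain update. For LLM-scale models the Fisher matrix is too large to invert exactly, which is one reason the abstract calls the approach often impractical.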

[3] MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

Zijian Zhou,Ao Qu,Zhaoxuan Wu,Sunghwan Kim,Alok Prakash,Daniela Rus,Jinhua Zhao,Bryan Kian Hsiang Low,Paul Pu Liang

Main category: cs.CL

TL;DR: This paper introduces MEM1, a new reinforcement learning framework that addresses memory management and reasoning in long multi-turn tasks, improving performance while reducing memory usage.

Details Motivation: Modern language agents must handle long-horizon, multi-turn interactions, but existing LLM systems rely on full-context prompting, which leads to unbounded memory growth, higher computational cost, and degraded reasoning performance. Method: The authors propose MEM1, an end-to-end reinforcement learning framework that maintains a compact shared internal state supporting both memory consolidation and reasoning, so that long multi-turn tasks run with constant memory. Result: Experiments show that, compared with Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, MEM1-7B substantially improves performance and reduces memory usage. Conclusion: MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x, and it generalizes beyond the training horizon. Abstract: Modern language agents must operate over long-horizon, multi-turn interactions, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet, most LLM systems rely on full-context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on out-of-distribution input lengths. We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. This state integrates prior memory with new observations from the environment while strategically discarding irrelevant or redundant information. To support training in more realistic and compositional settings, we propose a simple yet effective and scalable approach to constructing multi-turn environments by composing existing datasets into arbitrarily complex task sequences. Experiments across three domains, including internal retrieval QA, open-domain web QA, and multi-turn web shopping, show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, and generalizes beyond the training horizon. Our results demonstrate the promise of reasoning-driven memory consolidation as a scalable alternative to existing solutions for training long-horizon interactive agents, where both efficiency and performance are optimized.
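
The contrast between full-context prompting and a constant-size internal state can be sketched as a simple loop in which the state is overwritten at every turn instead of appended to. In MEM1 the consolidation behaviour is learned with reinforcement learning. The hand-written `consolidate` rule below, the `fact:` tagging convention, and the 200-character cap are purely hypothetical stand-ins used to show the control flow.

```python
def consolidate(state, observation, max_chars=200):
    """Toy consolidation: keep only lines tagged 'fact:' and cap the state size."""
    lines = (state + "\n" + observation).splitlines()
    facts = [line for line in lines if line.startswith("fact:")]
    return "\n".join(facts)[-max_chars:]

def run_episode(observations, max_chars=200):
    """Overwrite the state each turn and track how large it ever gets."""
    state, peak_state, full_context = "", 0, 0
    for obs in observations:
        full_context += len(obs)                   # what full-context prompting would keep
        state = consolidate(state, obs, max_chars) # replace the state, do not append
        peak_state = max(peak_state, len(state))
    return state, peak_state, full_context

observations = ["noise " * 50, "fact: capital=Paris", "noise " * 100, "fact: river=Seine"]
state, peak_state, full_context = run_episode(observations)
```

The peak state size stays bounded by `max_chars` regardless of how many turns are processed, whereas the full-context total grows with every observation.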

[4] Finance Language Model Evaluation (FLaME)

Glenn Matlin,Mika Okamoto,Huzaifa Pardawala,Yang Yang,Sudheer Chava

Main category: cs.CL

TL;DR: This paper introduces FLaME, a benchmarking framework showing that Language Models are more capable in financial NLP tasks than previously believed.

Details Motivation: Existing evaluation frameworks have gaps that lead to an incorrect understanding of LMs' performance on financial NLP tasks. There is a need to accurately assess their capabilities. Method: The study conducts an empirical analysis of 23 foundation LMs over 20 core NLP tasks in finance, comparing standard LMs with 'reasoning-reinforced' LMs. Result: The research introduces FLaME, the first holistic benchmarking suite for evaluating financial language models, revealing the true potential of LMs in finance. Conclusion: Language Models have significant potential for Financial NLP tasks, which can be effectively evaluated using the FLaME benchmarking suite. Abstract: Language Models (LMs) have demonstrated impressive capabilities with core Natural Language Processing (NLP) tasks. The effectiveness of LMs for highly specialized knowledge-intensive tasks in finance remains difficult to assess due to major gaps in the methodologies of existing evaluation frameworks, which have caused an erroneous belief in a far lower bound of LMs' performance on common Finance NLP (FinNLP) tasks. To demonstrate the potential of LMs for these FinNLP tasks, we present the first holistic benchmarking suite for Financial Language Model Evaluation (FLaME). We are the first research paper to comprehensively study LMs against 'reasoning-reinforced' LMs, with an empirical study of 23 foundation LMs over 20 core NLP tasks in finance. We open-source our framework software along with all data and results.

[5] Entropy-Driven Pre-Tokenization for Byte-Pair Encoding

Yifan Hu,Frank Liang,Dachuan Zhao,Jonathan Geuter,Varshini Reddy,Craig W. Schmidt,Chris Tanner

Main category: cs.CL

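TL;DR: This paper proposes and validates two new information-theoretic pre-tokenization strategies that improve the quality of BPE tokenization for Chinese.

Details Motivation: Byte-Pair Encoding (BPE) has become a widely adopted subword tokenization method in modern language models because of its simplicity and strong empirical performance on downstream tasks. However, applying BPE to unsegmented languages such as Chinese is challenging, because its frequency-driven merge operation is agnostic to linguistic boundaries. Method: The authors propose two pre-tokenization strategies that use information-theoretic cues to guide BPE segmentation: the first uses pointwise mutual information and left/right entropy to identify coherent character spans, and the second uses predictive entropy from a pretrained GPT-2 model to detect boundary uncertainty. Result: Both methods are evaluated on a subset of the PKU dataset and show substantial improvements in segmentation precision, recall, and F1 score over standard BPE. Conclusion: Entropy-guided pre-tokenization not only improves alignment with gold-standard linguistic units but also offers a promising direction for improving tokenization quality in low-resource and multilingual settings. Abstract: Byte-Pair Encoding (BPE) has become a widely adopted subword tokenization method in modern language models due to its simplicity and strong empirical performance across downstream tasks. However, applying BPE to unsegmented languages such as Chinese presents significant challenges, as its frequency-driven merge operation is agnostic to linguistic boundaries. To address this, we propose two entropy-informed pre-tokenization strategies that guide BPE segmentation using unsupervised information-theoretic cues. The first approach uses pointwise mutual information and left/right entropy to identify coherent character spans, while the second leverages predictive entropy derived from a pretrained GPT-2 model to detect boundary uncertainty. We evaluate both methods on a subset of the PKU dataset and demonstrate substantial improvements in segmentation precision, recall, and F1 score compared to standard BPE. Our results suggest that entropy-guided pre-tokenization not only enhances alignment with gold-standard linguistic units but also offers a promising direction for improving tokenization quality in low-resource and multilingual settings.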
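
The first strategy relies on two standard statistics: pointwise mutual information (PMI) within a character pair and the entropy of the characters seen to its left and right. The sketch below computes both for character bigrams in a tiny corpus. How the paper turns these scores into pre-tokenization boundaries is not described in the abstract, so the function names, the bigram-only scope, and the two-sentence corpus are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(counter):
    """Shannon entropy (bits) of a Counter of neighbouring characters."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counter.values())

def bigram_cohesion(corpus):
    """Return {bigram: (pmi, left_entropy, right_entropy)} for character bigrams."""
    chars, bigrams = Counter(), Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for sent in corpus:
        chars.update(sent)
        for i in range(len(sent) - 1):
            bg = sent[i:i + 2]
            bigrams[bg] += 1
            if i > 0:
                left[bg][sent[i - 1]] += 1      # character preceding the bigram
            if i + 2 < len(sent):
                right[bg][sent[i + 2]] += 1     # character following the bigram
    n_chars, n_bigrams = sum(chars.values()), sum(bigrams.values())
    stats = {}
    for bg, count in bigrams.items():
        p_xy = count / n_bigrams
        p_x, p_y = chars[bg[0]] / n_chars, chars[bg[1]] / n_chars
        stats[bg] = (math.log2(p_xy / (p_x * p_y)), entropy(left[bg]), entropy(right[bg]))
    return stats

# Hypothetical two-sentence corpus in which "喜欢" appears in two different contexts.
stats = bigram_cohesion(["我喜欢北京", "他喜欢上海"])
pmi, left_h, right_h = stats["喜欢"]
```

Here "喜欢" has positive PMI and one bit of entropy on each side, because it is preceded and followed by different characters in the two sentences. High internal cohesion combined with varied neighbours is the usual signal that a span behaves like a unit.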

[6] Language Models can perform Single-Utterance Self-Correction of Perturbed Reasoning

Sam Silver,Jimin Sun,Ivan Zhang,Sara Hooker,Eddie Kim

Main category: cs.CL

TL;DR: This paper explores the self-correction capabilities of Large Language Models (LLMs) when faced with variations in problem descriptions and reasoning errors. It finds that LLMs possess stronger intrinsic self-correction abilities than commonly believed.

Details Motivation: The performance of LLMs remains brittle to minor variations in problem description and prompting strategy. Additionally, reasoning is vulnerable to sampling-induced errors which autoregressive models must address using self-correction. Method: Experiments measuring models' ability to self-correct synthetic perturbations introduced into their Chain of Thought reasoning were conducted. Result: Robust single-utterance intrinsic self-correction behavior across a range of open-weight models and datasets was observed, ranging from subtle, implicit corrections to explicit acknowledgments and corrections of errors. Conclusion: LLMs have stronger intrinsic self-correction capabilities than previously thought, suggesting that recent model work involves amplification of existing traits. Abstract: Large Language Models (LLMs) have demonstrated impressive mathematical reasoning capabilities, yet their performance remains brittle to minor variations in problem description and prompting strategy. Furthermore, reasoning is vulnerable to sampling-induced errors which autoregressive models must primarily address using self-correction via additionally-generated tokens. To better understand self-correction capabilities of recent models, we conduct experiments measuring models' ability to self-correct synthetic perturbations introduced into their Chain of Thought (CoT) reasoning. We observe robust single-utterance intrinsic self-correction behavior across a range of open-weight models and datasets, ranging from subtle, implicit corrections to explicit acknowledgments and corrections of errors. Our findings suggest that LLMs, including those not finetuned for long CoT, may possess stronger intrinsic self-correction capabilities than commonly shown in the literature. The presence of this ability suggests that recent "reasoning" model work involves amplification of traits already meaningfully present in models.

[7] From RAG to Agentic: Validating Islamic-Medicine Responses with LLM Agents

Mohammad Amaan Sayeed,Mohammed Talha Alam,Raza Imam,Shahab Saquib Sohail,Amir Hussain

Main category: cs.CL

TL;DR: This paper introduces Tibbe-AG, an evaluation pipeline that combines centuries-old Islamic medical texts with modern AI techniques to improve culturally sensitive medical question-answering. Retrieval and self-evaluation methods significantly enhance accuracy and safety.

Details Motivation: Islamic medical texts contain valuable preventive care, nutrition, and holistic therapy knowledge but are underutilized in modern AI systems. Existing language model benchmarks do not adequately validate culturally grounded medical guidance at scale. Method: The authors proposed a unified evaluation pipeline called Tibbe-AG, which aligns 30 curated Prophetic-medicine questions with verified remedies. They tested three LLMs (LLaMA-3, Mistral-7B, Qwen2-7B) using direct generation, retrieval-augmented generation, and a scientific self-critique filter. A secondary LLM was used as an agentic judge to evaluate the answers, producing a 3C3H quality score. Result: Retrieval-augmented generation improved factual accuracy by 13%, and the addition of an agentic prompt provided an additional 10% improvement through deeper mechanistic insight and safety considerations. Conclusion: The study concludes that integrating classical Islamic medical texts with modern AI techniques like retrieval and self-evaluation can enable reliable and culturally sensitive medical question-answering. Abstract: Centuries-old Islamic medical texts like Avicenna's Canon of Medicine and the Prophetic Tibb-e-Nabawi encode a wealth of preventive care, nutrition, and holistic therapies, yet remain inaccessible to many and underutilized in modern AI systems. Existing language-model benchmarks focus narrowly on factual recall or user preference, leaving a gap in validating culturally grounded medical guidance at scale. We propose a unified evaluation pipeline, Tibbe-AG, that aligns 30 carefully curated Prophetic-medicine questions with human-verified remedies and compares three LLMs (LLaMA-3, Mistral-7B, Qwen2-7B) under three configurations: direct generation, retrieval-augmented generation, and a scientific self-critique filter. Each answer is then assessed by a secondary LLM serving as an agentic judge, yielding a single 3C3H quality score. 
Retrieval improves factual accuracy by 13%, while the agentic prompt adds another 10% improvement through deeper mechanistic insight and safety considerations. Our results demonstrate that blending classical Islamic texts with retrieval and self-evaluation enables reliable, culturally sensitive medical question-answering.

[8] Reranking-based Generation for Unbiased Perspective Summarization

Narutatsu Ri,Nicholas Deas,Kathleen McKeown

Main category: cs.CL

TL;DR: This paper improves the evaluation and performance of perspective summarization by identifying better metrics and applying reranking and preference tuning techniques.

Details Motivation: Existing evaluation frameworks for summarization lack verification of metric applicability, and improved summarization methods are still underdeveloped, especially beyond zero-shot inference settings. Method: The researchers built a test set for benchmarking metric reliability using human annotations, compared traditional metrics with language model-based metrics, and evaluated the efficacy of reranking-based methods and preference tuning with synthetically generated data. Result: Language model-based metrics outperformed traditional metrics in evaluating summary quality, while reranking-based methods and preference tuning significantly enhanced summarization performance. Conclusion: The study contributes to the reliable evaluation and development of perspective summarization methods by identifying effective metrics and demonstrating improved performance through reranking-based methods and preference tuning. Abstract: Generating unbiased summaries in real-world settings such as political perspective summarization remains a crucial application of Large Language Models (LLMs). Yet, existing evaluation frameworks rely on traditional metrics for measuring key attributes such as coverage and faithfulness without verifying their applicability, and efforts to develop improved summarizers are still nascent. We address these gaps by (1) identifying reliable metrics for measuring perspective summary quality, and (2) investigating the efficacy of LLM-based methods beyond zero-shot inference. Namely, we build a test set for benchmarking metric reliability using human annotations and show that traditional metrics underperform compared to language model-based metrics, which prove to be strong evaluators. Using these metrics, we show that reranking-based methods yield strong results, and preference tuning with synthetically generated and reranking-labeled data further boosts performance. 
Our findings aim to contribute to the reliable evaluation and development of perspective summarization methods.
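
The reranking step and the reranking-labeled preference data mentioned in the abstract can be sketched as follows: generate several candidate summaries, score each with a metric, keep the best, and pair the best with the worst as a (chosen, rejected) example for preference tuning. The paper uses language model-based metrics as scorers. The keyword-coverage scorer, the key points, and the candidate texts below are hypothetical placeholders.

```python
def rerank(candidates, score):
    """Sort candidate summaries from best to worst under a scoring metric."""
    return sorted(candidates, key=score, reverse=True)

def preference_pair(candidates, score):
    """Use the reranker's top and bottom picks as a (chosen, rejected) pair."""
    ranked = rerank(candidates, score)
    return {"chosen": ranked[0], "rejected": ranked[-1]}

def toy_coverage(summary, key_points=("tax", "healthcare", "climate")):
    """Placeholder metric: fraction of assumed key points mentioned in the summary."""
    text = summary.lower()
    return sum(point in text for point in key_points) / len(key_points)

candidates = [
    "The speech focused on tax policy.",
    "The speech covered tax, healthcare, and climate proposals.",
    "The speaker talked for an hour.",
]
best = rerank(candidates, toy_coverage)[0]
pair = preference_pair(candidates, toy_coverage)
```

Swapping `toy_coverage` for a stronger scorer changes only the `score` argument, which is why the choice of a reliable metric matters so much for this family of methods.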

[9] A Vietnamese Dataset for Text Segmentation and Multiple Choices Reading Comprehension

Toan Nguyen Hai,Ha Nguyen Viet,Truong Quan Xuan,Duc Do Minh

Main category: cs.CL

TL;DR: This paper presents VSMRC, a Vietnamese dataset for text segmentation and reading comprehension, and shows the advantage of multilingual models on Vietnamese NLP tasks.

Details Motivation: Vietnamese lacks robust resources for natural language processing tasks such as text segmentation and machine reading comprehension (MRC), so such resources need to be developed to advance research. Method: The authors build the VSMRC dataset, which contains 15,942 documents for text segmentation and 16,347 synthetic multiple-choice question-answer pairs, and compare mBERT against monolingual models experimentally. Result: mBERT outperforms monolingual models on both tasks, reaching 88.01% accuracy on the MRC test set and an F1 score of 63.15% on the text segmentation test set. Conclusion: VSMRC provides a reliable and diverse resource for Vietnamese text segmentation and machine reading comprehension and demonstrates the potential of multilingual models for low-resource languages. Abstract: Vietnamese, the 20th most spoken language with over 102 million native speakers, lacks robust resources for key natural language processing tasks such as text segmentation and machine reading comprehension (MRC). To address this gap, we present VSMRC, the Vietnamese Text Segmentation and Multiple-Choice Reading Comprehension Dataset. Sourced from Vietnamese Wikipedia, our dataset includes 15,942 documents for text segmentation and 16,347 synthetic multiple-choice question-answer pairs generated with human quality assurance, ensuring a reliable and diverse resource. Experiments show that mBERT consistently outperforms monolingual models on both tasks, achieving an accuracy of 88.01% on the MRC test set and an F1 score of 63.15% on the text segmentation test set. Our analysis reveals that multilingual models excel in NLP tasks for Vietnamese, suggesting potential applications to other under-resourced languages. VSMRC is available at HuggingFace.

[10] Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion

Markus Frohmann,Gabriel Meseguer-Brocal,Markus Schedl,Elena V. Epure

Main category: cs.CL

TL;DR: This paper introduces DE-detect, a novel approach combining transcribed lyrics and audio features to reliably detect AI-generated music, overcoming limitations of current methods.

Details Motivation: Existing AI-generated music detection methods have limitations: audio-based detectors struggle with generalization and perturbations, while lyrics-based methods require accurate, clean lyrics which are often unavailable. Method: A multimodal, modular late-fusion pipeline combining automatically transcribed sung lyrics and speech features from the audio was developed to detect AI-generated music. Result: DE-detect outperforms existing lyrics-based detectors and demonstrates increased robustness to audio perturbations. Conclusion: The proposed DE-detect method effectively detects AI-generated music in real-world scenarios and is robust against audio perturbations. Abstract: The rapid advancement of AI-based music generation tools is revolutionizing the music industry but also posing challenges to artists, copyright holders, and providers alike. This necessitates reliable methods for detecting such AI-generated content. However, existing detectors, relying on either audio or lyrics, face key practical limitations: audio-based detectors fail to generalize to new or unseen generators and are vulnerable to audio perturbations; lyrics-based methods require cleanly formatted and accurate lyrics, unavailable in practice. To overcome these limitations, we propose a novel, practically grounded approach: a multimodal, modular late-fusion pipeline that combines automatically transcribed sung lyrics and speech features capturing lyrics-related information within the audio. By relying on lyrical aspects directly from audio, our method enhances robustness, mitigates susceptibility to low-level artifacts, and enables practical applicability. Experiments show that our method, DE-detect, outperforms existing lyrics-based detectors while also being more robust to audio perturbations. Thus, it offers an effective, robust solution for detecting AI-generated music in real-world scenarios. Our code is available at https://github.com/deezer/robust-AI-lyrics-detection.

[11] From General to Targeted Rewards: Surpassing GPT-4 in Open-Ended Long-Context Generation

Zhihan Guo,Jiele Wu,Wenqian Cui,Yifei Zhang,Minda Hu,Yufei Wang,Irwin King

Main category: cs.CL

TL;DR: This paper proposes ProxyReward, a reinforcement-learning-based framework that addresses the lack of high-quality reference data for open-ended long text generation and surpasses GPT-4-Turbo on the related task.

Details Motivation: Current research on long context in large language models focuses mainly on long-context understanding, while Open-ended Long Text Generation (Open-LTG) remains underexplored. Existing methods lack accurate reward signals and high-quality annotated data, which makes it difficult to train long-context generation models effectively. Method: The authors propose the ProxyReward framework, comprising the ProxyReward Dataset and the ProxyReward Signal. The former is generated automatically by the model from simple prompts, avoiding extensive manual annotation; the latter provides a targeted evaluation of information comprehensiveness and accuracy and serves as a more effective reward signal. Result: Experiments show that ProxyReward substantially improves performance on the Open-LTG task, with a 20% gain when training open-source models, outperforming the LLM-as-a-Judge approach and even surpassing GPT-4-Turbo. Conclusion: ProxyReward offers an effective way to strengthen LLMs' ability to handle complex open-ended questions while reducing reliance on human-annotated data and general-purpose evaluation metrics. Abstract: Current research on long-form context in Large Language Models (LLMs) primarily focuses on the understanding of long contexts, while Open-ended Long Text Generation (Open-LTG) remains insufficiently explored. Training a long-context generation model requires curation of gold standard reference data, which is typically nonexistent for informative Open-LTG tasks. However, previous methods only utilize general assessments as reward signals, which limits accuracy. To bridge this gap, we introduce ProxyReward, an innovative reinforcement learning (RL) based framework, which includes a dataset and a reward signal computation method. Firstly, ProxyReward Dataset generation is accomplished through simple prompts that enable the model to create the data automatically, obviating extensive labeled data or significant manual effort. Secondly, ProxyReward Signal offers a targeted evaluation of information comprehensiveness and accuracy for specific questions. The experimental results indicate that our method ProxyReward surpasses even GPT-4-Turbo. It can significantly enhance performance by 20% on the Open-LTG task when training widely used open-source models, while also surpassing the LLM-as-a-Judge approach. Our work presents effective methods to enhance the ability of LLMs to address complex open-ended questions posed by humans.

[12] EvoLM: In Search of Lost Language Model Training Dynamics

Zhenting Qi,Fan Nie,Alexandre Alahi,James Zou,Himabindu Lakkaraju,Yilun Du,Eric Xing,Sham Kakade,Hanlin Zhang

Main category: cs.CL

TL;DR: EvoLM offers a comprehensive analysis of language model training dynamics across stages, providing insights into training efficiency, mitigation of forgetting, and trade-offs in post-training configurations.

Details Motivation: Modern language model training involves multiple stages, making it challenging for developers to assess the impact of design choices at each stage. This work aims to provide transparency and systematic analysis through EvoLM. Method: The authors trained over 100 language models with varying parameters from scratch using EvoLM. They evaluated both upstream (language modeling) and downstream (problem-solving) reasoning capabilities, including in-domain and out-of-domain generalization. Result: Key insights were identified, including diminishing returns from excessive training, the importance of mitigating forgetting during domain-specific training, the role of continued pre-training in bridging training phases, and trade-offs in configuring supervised fine-tuning and reinforcement learning. Conclusion: EvoLM provides a systematic and transparent analysis of language models' training dynamics across multiple stages, highlighting key insights such as diminishing returns from excessive training, the importance of mitigating forgetting during continued pre-training, and intricate trade-offs in configuring fine-tuning and reinforcement learning. Abstract: Modern language model (LM) training has been divided into multiple stages, making it difficult for downstream developers to evaluate the impact of design choices made at each stage. We present EvoLM, a model suite that enables systematic and transparent analysis of LMs' training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning. By training over 100 LMs with 1B and 4B parameters from scratch, we rigorously evaluate both upstream (language modeling) and downstream (problem-solving) reasoning capabilities, including considerations of both in-domain and out-of-domain generalization. 
Key insights highlight the diminishing returns from excessive pre-training and post-training, the importance and practices of mitigating forgetting during domain-specific continued pre-training, the crucial role of continued pre-training in bridging pre-training and post-training phases, and various intricate trade-offs when configuring supervised fine-tuning and reinforcement learning. To facilitate open research and reproducibility, we release all pre-trained and post-trained models, training datasets for all stages, and our entire training and evaluation pipeline.

[13] Enhancing Document-Level Question Answering via Multi-Hop Retrieval-Augmented Generation with LLaMA 3

Xinyue Huang,Ziqi Lin,Fang Sun,Wenchao Zhang,Kejian Tong,Yunbo Liu

Main category: cs.CL

TL;DR: This paper introduces a novel Retrieval-Augmented Generation framework based on LLaMA 3 that improves multi-hop reasoning and contextual understanding for complex question answering tasks.

Details Motivation: To address challenges in multi-hop reasoning and contextual understanding across lengthy documents in complex question answering tasks. Method: A dense retrieval module with context fusion and multi-hop reasoning mechanisms, built upon LLaMA 3, was integrated. A joint optimization strategy combining retrieval likelihood and generation cross-entropy was employed. Result: The proposed system showed superior performance over retrieval-augmented and generative baselines in delivering precise, contextually grounded answers. Conclusion: The proposed RAG framework outperforms existing baselines in complex question answering tasks. Abstract: This paper presents a novel Retrieval-Augmented Generation (RAG) framework tailored for complex question answering tasks, addressing challenges in multi-hop reasoning and contextual understanding across lengthy documents. Built upon LLaMA 3, the framework integrates a dense retrieval module with advanced context fusion and multi-hop reasoning mechanisms, enabling more accurate and coherent response generation. A joint optimization strategy combining retrieval likelihood and generation cross-entropy improves the model's robustness and adaptability. Experimental results show that the proposed system outperforms existing retrieval-augmented and generative baselines, confirming its effectiveness in delivering precise, contextually grounded answers.

[14] DynScaling: Efficient Verifier-free Inference Scaling via Dynamic and Integrated Sampling

Fei Wang,Xingchen Wan,Ruoxi Sun,Jiefeng Chen,Sercan Ö. Arık

Main category: cs.CL

TL;DR: This paper proposes DynScaling, a new method for improving large language model (LLM) performance during inference time while adhering to practical resource constraints. It introduces an integrated parallel-sequential sampling strategy and a bandit-based dynamic budget allocation framework. The experiments show that DynScaling outperforms existing approaches without needing external verifiers.

Details Motivation: Inference-time scaling is effective in boosting large language model (LLM) performance but is often hindered by reliance on external verifiers or lack of optimization for realistic computational constraints. Method: The proposed DynScaling approach combines an integrated parallel-sequential sampling strategy and a bandit-based dynamic budget allocation framework. Result: Experimental results demonstrate that DynScaling consistently surpasses existing verifier-free inference scaling baselines in both task performance and computational cost. Conclusion: DynScaling effectively improves LLM performance under practical resource constraints without the need for external verifiers. Abstract: Inference-time scaling has proven effective in boosting large language model (LLM) performance through increased test-time computation. Yet, its practical application is often hindered by reliance on external verifiers or a lack of optimization for realistic computational constraints. We propose DynScaling, which addresses these limitations through two primary innovations: an integrated parallel-sequential sampling strategy and a bandit-based dynamic budget allocation framework. The integrated sampling strategy unifies parallel and sequential sampling by constructing synthetic sequential reasoning chains from initially independent parallel responses, promoting diverse and coherent reasoning trajectories. The dynamic budget allocation framework formulates the allocation of computational resources as a multi-armed bandit problem, adaptively distributing the inference budget across queries based on the uncertainty of previously sampled responses, thereby maximizing computational efficiency. By combining these components, DynScaling effectively improves LLM performance under practical resource constraints without the need for external verifiers. 
Experimental results demonstrate that DynScaling consistently surpasses existing verifier-free inference scaling baselines in both task performance and computational cost.
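
The budget-allocation idea, treating each query as a bandit arm and spending more samples where earlier responses disagree, can be sketched with a UCB-style rule. The abstract does not specify the bandit algorithm or the uncertainty measure, so the answer-entropy reward, the exploration bonus, the `sample` callable, and the toy sampler below are all assumptions rather than the authors' method.

```python
import itertools
import math
from collections import Counter

def answer_entropy(answers):
    """Disagreement among sampled answers, in bits (0 when all answers agree)."""
    counts = Counter(answers)
    n = len(answers)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def allocate_budget(queries, sample, total_budget, init=2, explore=0.5):
    """Spend total_budget samples across queries, favouring uncertain ones."""
    if init < 1:
        raise ValueError("init must be at least 1")
    responses = {q: [sample(q) for _ in range(init)] for q in queries}
    spent = init * len(queries)
    while spent < total_budget:
        def score(q):
            n = len(responses[q])
            # uncertainty of past responses plus a UCB-style exploration bonus
            return answer_entropy(responses[q]) + explore * math.sqrt(math.log(spent) / n)
        chosen = max(queries, key=score)
        responses[chosen].append(sample(chosen))
        spent += 1
    return responses

# Hypothetical sampler: the "easy" query always yields the same answer,
# while the "hard" query cycles through three conflicting answers.
hard_answers = itertools.cycle(["a", "b", "c"])
def toy_sample(query):
    return "4" if query == "easy" else next(hard_answers)

responses = allocate_budget(["easy", "hard"], toy_sample, total_budget=20)
```

With this toy sampler, every sample beyond the initial two per query goes to the query whose answers disagree, which is the qualitative behaviour the abstract describes.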

[15] A Hybrid DeBERTa and Gated Broad Learning System for Cyberbullying Detection in English Text

Devesh Kumar

Main category: cs.CL

TL;DR: This paper presents a hybrid architecture for effective cyberbullying detection using a modified DeBERTa model combined with a Gated Broad Learning System classifier, achieving high accuracy while incorporating explainability features.

Details Motivation: The motivation stems from the need to combat cyberbullying on online communication platforms, which affects a significant percentage of teenagers, by leveraging the strengths of transformer-based models and broad learning systems. Method: The paper proposes a hybrid architecture combining a modified DeBERTa model with Squeeze-and-Excitation blocks and sentiment analysis capabilities, along with a Gated Broad Learning System (GBLS) classifier. Result: The proposed ModifiedDeBERTa + GBLS model achieved good performance across four English datasets: 79.3% accuracy on HateXplain, 95.41% on SOSNet, 91.37% on Mendeley-I, and 94.67% on Mendeley-II. Conclusion: The paper concludes that the ModifiedDeBERTa + GBLS model outperforms existing approaches for cyberbullying detection and incorporates explainability mechanisms to address transparency requirements. Ablation studies and failure case analysis provide insights for future improvements. Abstract: The proliferation of online communication platforms has created unprecedented opportunities for global connectivity while simultaneously enabling harmful behaviors such as cyberbullying, which affects approximately 54.4% of teenagers according to recent research. This paper presents a hybrid architecture that combines the contextual understanding capabilities of transformer-based models with the pattern recognition strengths of broad learning systems for effective cyberbullying detection. This approach integrates a modified DeBERTa model augmented with Squeeze-and-Excitation blocks and sentiment analysis capabilities with a Gated Broad Learning System (GBLS) classifier, creating a synergistic framework that outperforms existing approaches across multiple benchmark datasets. The proposed ModifiedDeBERTa + GBLS model achieved good performance on four English datasets: 79.3% accuracy on HateXplain, 95.41% accuracy on SOSNet, 91.37% accuracy on Mendeley-I, and 94.67% accuracy on Mendeley-II. 
Beyond performance gains, the framework incorporates comprehensive explainability mechanisms including token-level attribution analysis, LIME-based local interpretations, and confidence calibration, addressing critical transparency requirements in automated content moderation. Ablation studies confirm the meaningful contribution of each architectural component, while failure case analysis reveals specific challenges in detecting implicit bias and sarcastic content, providing valuable insights for future improvements in cyberbullying detection systems.

[16] Knee-Deep in C-RASP: A Transformer Depth Hierarchy

Andy Yang,Michaël Cadilhac,David Chiang

Main category: cs.CL

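TL;DR: This paper studies the relationship between transformer depth and capability, presenting a theoretical framework that explains why deeper models are more expressive and validating the theory empirically.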

Details Motivation: The paper asks which capabilities transformers actually gain with greater depth, aiming to explain formally how depth affects what a model can do. Method: The paper combines a theoretical proof with an empirical study. The theoretical part shows that a specific subclass of transformers is equivalent to the programming language C-RASP and analyzes expressivity through a temporal logic with counting operators; the empirical part tests whether the theory predicts how transformers length-generalize on sequential dependency tasks. Result: (1) A specific subclass of transformers is expressively equivalent to C-RASP; (2) deeper C-RASP programs (and the corresponding transformers) are more expressive; (3) experiments indicate that the theory predicts the depth transformers need to length-generalize on sequential dependency tasks. Conclusion: Within the studied subclass, deeper transformers are more expressive than shallower ones, and depth matters for length generalization. Abstract: It has been observed that transformers with greater depth (that is, more layers) have more capabilities, but can we establish formally which capabilities are gained with greater depth? We answer this question with a theoretical proof followed by an empirical study. First, we consider transformers that round to fixed precision except inside attention. We show that this subclass of transformers is expressively equivalent to the programming language C-RASP and this equivalence preserves depth. Second, we prove that deeper C-RASP programs are more expressive than shallower C-RASP programs, implying that deeper transformers are more expressive than shallower transformers (within the subclass mentioned above). These results are established by studying a form of temporal logic with counting operators, which was shown equivalent to C-RASP in previous work. Finally, we provide empirical evidence that our theory predicts the depth required for transformers without positional encodings to length-generalize on a family of sequential dependency tasks.
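To give a feel for what "depth" means for counting programs of this kind, here is a minimal Python sketch. It is an illustration only, not the paper's formal definition of C-RASP: the helper `prefix_count` and both example programs are hypothetical names chosen here, and the sketch assumes that the core operation is counting, at each position, how many earlier-or-equal positions satisfy a Boolean predicate. Depth then corresponds to how many counting operations are nested.

```python
from itertools import accumulate


def prefix_count(bits):
    """At each position i, count the positions j <= i where bits[j] is true."""
    return list(accumulate(int(b) for b in bits))


def majority_a(word):
    """Depth-1 sketch: accept if a's outnumber b's at the final position."""
    count_a = prefix_count(ch == "a" for ch in word)
    count_b = prefix_count(ch == "b" for ch in word)
    more_a = [ca > cb for ca, cb in zip(count_a, count_b)]
    return bool(more_a) and more_a[-1]


def mostly_ahead(word):
    """Depth-2 sketch: count the positions where the depth-1 test held,
    then accept if it held at more than half of all positions."""
    count_a = prefix_count(ch == "a" for ch in word)
    count_b = prefix_count(ch == "b" for ch in word)
    more_a = [ca > cb for ca, cb in zip(count_a, count_b)]
    ahead = prefix_count(more_a)  # a count taken over a counted predicate
    total = prefix_count(True for _ in word)
    return bool(word) and 2 * ahead[-1] > total[-1]
```

The second program needs the output of one counting step as the input of another, which is the kind of nesting that the depth hierarchy in the abstract is about.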

[17] Self-Critique-Guided Curiosity Refinement: Enhancing Honesty and Helpfulness in Large Language Models via In-Context Learning

Duc Hieu Ho,Chenglin Fan

Main category: cs.CL

TL;DR: This paper proposes a new prompting strategy to improve the honesty and helpfulness of large language model outputs through structured self-refinement without additional training.

Details Motivation: Producing consistently honest and helpful outputs from large language models remains a challenge, which this paper aims to address. Method: The paper conducts a comprehensive benchmark evaluation of ten large language models and proposes a novel prompting strategy, self-critique-guided curiosity refinement prompting, which adds self-critique and refinement steps without additional training. Result: The proposed method showed consistent improvements across all models on the HONESET dataset using the $\mathrm{H}^2$ framework, reducing poor-quality responses and achieving relative gains in $\mathrm{H}^2$ scores ranging from 1.4% to 4.3% compared to curiosity-driven prompting. Conclusion: Structured self-refinement is an effective, scalable, and training-free strategy to enhance the honesty and helpfulness of large language model outputs. Abstract: Large language models (LLMs) have demonstrated robust capabilities across various natural language tasks. However, producing outputs that are consistently honest and helpful remains an open challenge. To overcome this challenge, this paper tackles the problem through two complementary directions. It conducts a comprehensive benchmark evaluation of ten widely used large language models, including both proprietary and open-weight models from OpenAI, Meta, and Google. In parallel, it proposes a novel prompting strategy, self-critique-guided curiosity refinement prompting. The key idea behind this strategy is enabling models to self-critique and refine their responses without additional training. The proposed method extends the curiosity-driven prompting strategy by incorporating two lightweight in-context steps: a self-critique step and a refinement step. The experimental results on the HONESET dataset, evaluated using the $\mathrm{H}^2$ (honesty and helpfulness) framework with GPT-4o as the judge, show consistent improvements across all models. 
The approach reduces the number of poor-quality responses, increases high-quality responses, and achieves relative gains in $\mathrm{H}^2$ scores ranging from 1.4% to 4.3% compared to curiosity-driven prompting across evaluated models. These results highlight the effectiveness of structured self-refinement as a scalable and training-free strategy to improve the trustworthiness of LLM outputs.
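The three-stage flow described in this abstract (an initial answer, a self-critique, then a refinement, all in context and without training) could be wired up roughly as follows. This is a sketch under stated assumptions: the prompt wording is invented here and is not the paper's, and `generate` is a hypothetical callable standing in for whatever LLM client is used.

```python
from typing import Callable, Dict


def self_critique_refine(question: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Run draft -> critique -> refinement with three in-context calls.

    `generate` is a hypothetical text-in, text-out function; the prompts
    below are illustrative placeholders, not the paper's templates.
    """
    # Step 1: initial answer that flags what the model cannot know.
    draft = generate(
        f"Question: {question}\n"
        "Answer helpfully, and state clearly anything you cannot know or verify."
    )
    # Step 2 (assumed wording): the model critiques its own draft.
    critique = generate(
        f"Question: {question}\nDraft answer: {draft}\n"
        "Critique the draft: where is it dishonest, overconfident, or unhelpful?"
    )
    # Step 3 (assumed wording): the model rewrites the draft using the critique.
    refined = generate(
        f"Question: {question}\nDraft answer: {draft}\nCritique: {critique}\n"
        "Rewrite the answer so that it addresses every point in the critique."
    )
    return {"draft": draft, "critique": critique, "refined": refined}
```

Because the model is passed in as a plain callable, the same function can be exercised with a stub during testing and with a real client in an evaluation run.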

[18] Cyberbullying Detection in Hinglish Text Using MURIL and Explainable AI

Devesh Kumar

Main category: cs.CL

TL;DR: This paper proposes a MURIL-based framework for detecting cyberbullying in Hinglish text, showing improved performance over existing models and identifying areas for future research.

Details Motivation: Increased cyberbullying on digital platforms, especially in code-mixed Hinglish communication, which current systems are not well-equipped to handle. Method: Framework for cyberbullying detection using MURIL architecture with attribution analysis and cross-linguistic pattern recognition; ablation studies to evaluate components. Result: MURIL-based approach outperformed existing models (RoBERTa, IndicBERT) across six datasets, achieving accuracies ranging from 75.41% to 94.63%. Conclusion: Selective layer freezing, appropriate classification head design, and specialized preprocessing for code-mixed content improve detection performance. Failure analysis identifies challenges like context-dependent interpretation, cultural understanding, and cross-linguistic sarcasm detection. Abstract: The growth of digital communication platforms has led to increased cyberbullying incidents worldwide, creating a need for automated detection systems to protect users. The rise of code-mixed Hindi-English (Hinglish) communication on digital platforms poses challenges for existing cyberbullying detection systems, which were designed primarily for monolingual text. This paper presents a framework for cyberbullying detection in Hinglish text using the Multilingual Representations for Indian Languages (MURIL) architecture to address limitations in current approaches. Evaluation across six benchmark datasets -- Bohra et al., BullyExplain, BullySentemo, Kumar et al., HASOC 2021, and Mendeley Indo-HateSpeech -- shows that the MURIL-based approach outperforms existing multilingual models including RoBERTa and IndicBERT, with improvements of 1.36 to 13.07 percentage points and accuracies of 86.97% on Bohra et al., 84.62% on BullyExplain, 86.03% on BullySentemo, 75.41% on Kumar et al., 83.92% on HASOC 2021, and 94.63% on the Mendeley dataset. 
The framework includes explainability features through attribution analysis and cross-linguistic pattern recognition. Ablation studies show that selective layer freezing, appropriate classification head design, and specialized preprocessing for code-mixed content improve detection performance, while failure analysis identifies challenges including context-dependent interpretation, cultural understanding, and cross-linguistic sarcasm detection, providing directions for future research in multilingual cyberbullying detection.

[19] FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning

Natapong Nitarach,Warit Sirichotedumrong,Panop Pitchayarthorn,Pittawat Taveekitworachai,Potsawee Manakul,Kunat Pipatanakul

Main category: cs.CL

TL;DR: FinCoT, a domain-aligned structured prompting method, boosts performance, lowers costs, and produces expert-aligned reasoning in financial NLP tasks.

Details Motivation: Prior work in FinNLP has primarily focused on standard or unstructured CoT prompting, with limited attention given to structured CoT prompting. Additionally, reasoning structures in structured CoT prompting are often designed by non-domain experts. Method: The study evaluates three prompting styles (standard prompting, unstructured CoT prompting, structured CoT prompting) and FinCoT on CFA-style questions across ten financial domains. Result: FinCoT improved performance from 63.2% to 80.5% and, for Qwen-2.5-7B-Instruct, from 69.7% to 74.2%. It also reduced generated tokens eight-fold compared to structured CoT prompting. Conclusion: Structured CoT prompting, specifically FinCoT, significantly improves performance in financial NLP tasks while reducing inference costs and generating more interpretable reasoning traces compared to standard or unstructured CoT prompting. Abstract: This paper presents FinCoT, a structured chain-of-thought (CoT) prompting approach that incorporates insights from domain-specific expert financial reasoning to guide the reasoning traces of large language models. We identify three main prompting styles in FinNLP: (1) standard prompting--zero-shot prompting; (2) unstructured CoT--CoT prompting without an explicit reasoning structure, such as the use of tags; and (3) structured CoT prompting--CoT prompting with explicit instructions or examples that define structured reasoning steps. Previously, FinNLP has primarily focused on prompt engineering with either standard or unstructured CoT prompting. However, structured CoT prompting has received limited attention in prior work. Furthermore, the design of reasoning structures in structured CoT prompting is often based on heuristics from non-domain experts. In this study, we investigate each prompting approach in FinNLP. We evaluate the three main prompting styles and FinCoT on CFA-style questions spanning ten financial domains. 
We observe that FinCoT improves performance from 63.2% to 80.5% and Qwen-2.5-7B-Instruct from 69.7% to 74.2%, while reducing generated tokens eight-fold compared to structured CoT prompting. Our findings show that domain-aligned structured prompts not only improve performance and reduce inference costs but also yield more interpretable and expert-aligned reasoning traces.

[20] Under the Shadow of Babel: How Language Shapes Reasoning in LLMs

Chenxi Wang,Yixuan Zhang,Lang Gao,Zixiang Xu,Zirui Song,Yanbo Wang,Xiuying Chen

Main category: cs.CL

TL;DR: The study explores how LLMs internalize logical structures from different languages using a new bilingual dataset called BICAUSE, revealing insights into attention patterns, language-specific preferences, and shared semantic understanding.

Details Motivation: To examine the hypothesis that LLMs internalize habitual logical structures from languages as suggested by linguistic relativity. Method: BICAUSE dataset was introduced for structured bilingual analysis of causal reasoning in LLMs. Result: Three key findings: typologically aligned attention patterns, internalization of language-specific preferences leading to performance degradation, and convergence toward semantically aligned abstractions across languages during successful reasoning. Conclusion: LLMs internalize reasoning biases shaped by language, not just mimic linguistic forms. Abstract: Language is not only a tool for communication but also a medium for human cognition and reasoning. If, as linguistic relativity suggests, the structure of language shapes cognitive patterns, then large language models (LLMs) trained on human language may also internalize the habitual logical structures embedded in different languages. To examine this hypothesis, we introduce BICAUSE, a structured bilingual dataset for causal reasoning, which includes semantically aligned Chinese and English samples in both forward and reversed causal forms. Our study reveals three key findings: (1) LLMs exhibit typologically aligned attention patterns, focusing more on causes and sentence-initial connectives in Chinese, while showing a more balanced distribution in English. (2) Models internalize language-specific preferences for causal word order and often rigidly apply them to atypical inputs, leading to degraded performance, especially in Chinese. (3) When causal reasoning succeeds, model representations converge toward semantically aligned abstractions across languages, indicating a shared understanding beyond surface form. Overall, these results suggest that LLMs not only mimic surface linguistic forms but also internalize the reasoning biases shaped by language. 
Rooted in cognitive linguistic theory, this phenomenon is for the first time empirically verified through structural analysis of model internals.

[21] SGIC: A Self-Guided Iterative Calibration Framework for RAG

Guanhua Chen,Yutong Yao,Lidia S. Chao,Xuebo Liu,Derek F. Wong

Main category: cs.CL

TL;DR: This paper introduces the SGIC framework, which leverages uncertainty scores to improve LLM calibration, especially over multiple rounds, enhancing response accuracy.

Details Motivation: Current RAG methodologies often overlook the reasoning capabilities of LLMs; this work addresses that gap by focusing on calibration improvements through specific cues. Method: SGIC: Self-Guided Iterative Calibration Framework that uses uncertainty scores to assess document relevance and response confidence, iteratively refining calibration. Result: The proposed framework enhances performance in capturing critical information and improving response accuracy for LLMs. Conclusion: The SGIC framework significantly improves the calibration of LLMs by using uncertainty scores, applicable to both closed-source and open-weight models. Abstract: Recent research in retrieval-augmented generation (RAG) has concentrated on retrieving useful information from candidate documents. However, numerous methodologies frequently neglect the calibration capabilities of large language models (LLMs), which capitalize on their robust in-context reasoning prowess. This work illustrates that providing LLMs with specific cues substantially improves their calibration efficacy, especially in multi-round calibrations. We present a new SGIC: Self-Guided Iterative Calibration Framework that employs uncertainty scores as a tool. Initially, this framework calculates uncertainty scores to determine both the relevance of each document to the query and the confidence level in the responses produced by the LLMs. Subsequently, it reevaluates these scores iteratively, amalgamating them with prior responses to refine calibration. Furthermore, we introduce an innovative approach for constructing an iterative self-calibration training set, which optimizes LLMs to efficiently harness uncertainty scores for capturing critical information and enhancing response accuracy. Our proposed framework significantly improves performance on both closed-source and open-weight LLMs.

[22] JETHICS: Japanese Ethics Understanding Evaluation Dataset

Masashi Takeshita,Rafal Rzepka

Main category: cs.CL

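TL;DR: JETHICS is a Japanese dataset for evaluating the ethics understanding of AI models; experiments show that current large language models still have considerable room for improvement on this task.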

Details Motivation: To evaluate and improve the ethics understanding of AI models, particularly in Japanese. Method: The authors built a dataset of 78K examples, following the construction methods of the existing English ETHICS dataset. Result: Evaluation experiments show that even GPT-4o reaches an average score of only about 0.7, while the best-performing Japanese LLM scores around 0.5. Conclusion: JETHICS is a Japanese dataset for evaluating the ethics understanding of AI models, and the experiments show that current large language models still have room for improvement. Abstract: In this work, we propose JETHICS, a Japanese dataset for evaluating ethics understanding of AI models. JETHICS contains 78K examples and is built by following the construction methods of the existing English ETHICS dataset. It includes four categories based on normative theories and concepts from ethics and political philosophy, and one representing commonsense morality. Our evaluation experiments on non-proprietary large language models (LLMs) and on GPT-4o reveal that even GPT-4o achieves only an average score of about 0.7, while the best-performing Japanese LLM attains around 0.5, indicating relatively large room for improvement in current LLMs.

[23] Web(er) of Hate: A Survey on How Hate Speech Is Typed

Luna Wang,Andrew Caines,Alice Hutchings

Main category: cs.CL

TL;DR: This paper argues for a more reflective and transparent methodology in creating hate speech datasets to improve their reliability and validity.

Details Motivation: The motivation is to balance competing priorities and improve dataset reliability by acknowledging value judgments during dataset construction. Method: Drawing on Max Weber's notion of ideal types, the paper critically examines methodological choices in a diverse range of hate speech datasets. Result: The result is an argument for reflexivity in dataset creation, highlighting common themes and practices in hate speech dataset design decisions. Conclusion: The paper concludes that a reflexive approach should be adopted in dataset creation to enhance transparency and methodological rigor. Abstract: The curation of hate speech datasets involves complex design decisions that balance competing priorities. This paper critically examines these methodological choices in a diverse range of datasets, highlighting common themes and practices, and their implications for dataset reliability. Drawing on Max Weber's notion of ideal types, we argue for a reflexive approach in dataset creation, urging researchers to acknowledge their own value judgments during dataset construction, fostering transparency and methodological rigour.

[24] Comparative Analysis of Abstractive Summarization Models for Clinical Radiology Reports

Anindita Bhattacharya,Tohida Rehman,Debarshi Kumar Sanyal,Samiran Chattopadhyay

Main category: cs.CL

TL;DR: This paper explores the use of advanced summarization models to automatically generate concise impressions from the detailed findings in radiology reports, comparing multiple models and evaluating their effectiveness using various metrics.

Details Motivation: The findings section of radiology reports is often lengthy, while the impression section is more concise and captures key diagnostic conclusions. This research aims to automate the generation of concise impressions from these detailed findings. Method: A comparative analysis was conducted on pre-trained and open-source large language models, including T5-base, BART-base, PEGASUS-x-base, ChatGPT-4, LLaMA-3-8B, and a custom Pointer Generator Network with a coverage mechanism, using the MIMIC-CXR dataset. Multiple evaluation metrics such as ROUGE, METEOR, and BERTScore were employed. Result: The study analyzed the performance of several advanced abstractive summarization models, identifying their strengths and limitations in handling medical text summarization. Conclusion: This study provides insights into the effectiveness of various abstractive summarization models in generating concise impressions from detailed radiology findings, offering valuable information for medical professionals seeking automation in healthcare. Abstract: The findings section of a radiology report is often detailed and lengthy, whereas the impression section is comparatively more compact and captures key diagnostic conclusions. This research explores the use of advanced abstractive summarization models to generate the concise impression from the findings section of a radiology report. We have used the publicly available MIMIC-CXR dataset. A comparative analysis is conducted on leading pre-trained and open-source large language models, including T5-base, BART-base, PEGASUS-x-base, ChatGPT-4, LLaMA-3-8B, and a custom Pointer Generator Network with a coverage mechanism. To ensure a thorough assessment, multiple evaluation metrics are employed, including ROUGE-1, ROUGE-2, ROUGE-L, METEOR, and BERTScore. By analyzing the performance of these models, this study identifies their respective strengths and limitations in the summarization of medical text. 
The findings of this paper provide helpful information for medical professionals who need automated summarization solutions in the healthcare sector.
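For readers unfamiliar with the overlap metrics listed above, the following is a minimal sketch of a ROUGE-1 F1 score in plain Python. It is a simplification made for illustration: it assumes lowercased whitespace tokenization and applies no stemming, so it will not exactly match the scores of a standard ROUGE package, and the function name is chosen here rather than taken from any library.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

For a reference impression "no acute cardiopulmonary process" and a generated impression "no acute process", all three generated words match, so precision is 1 and recall is 3/4, giving an F1 of 6/7.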

[25] End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled Data

Aishwarya Pothula,Bhavana Akkiraju,Srihari Bandarupalli,Charan D,Santosh Kesiraju,Anil Kumar Vuppala

Main category: cs.CL

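TL;DR: This paper proposes a method for building speech-to-text translation systems for low-resource languages from weakly labeled data and shows that its performance is comparable to mainstream baseline systems.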

Details Motivation: The scarcity of high-quality annotated data is a major challenge in developing effective end-to-end speech-to-text translation systems, especially for low-resource languages. Method: The authors built the Shrutilipi-anuvaad dataset through bitext mining with state-of-the-art sentence encoders, and created multiple versions of the training data with varying quality and quantity to study the effect on model performance. Result: ST systems can be built from weakly labeled data, with performance comparable to baselines such as SONAR and SeamlessM4T. Conclusion: Weakly labeled data can be used to build speech-to-text translation systems for low-resource language pairs, with performance comparable to massive multi-modal multilingual baselines. Abstract: The scarcity of high-quality annotated data presents a significant challenge in developing effective end-to-end speech-to-text translation (ST) systems, particularly for low-resource languages. This paper explores the hypothesis that weakly labeled data can be used to build ST models for low-resource language pairs. We constructed speech-to-text translation datasets with the help of bitext mining using state-of-the-art sentence encoders. We mined the multilingual Shrutilipi corpus to build Shrutilipi-anuvaad, a dataset comprising ST data for language pairs Bengali-Hindi, Malayalam-Hindi, Odia-Hindi, and Telugu-Hindi. We created multiple versions of training data with varying degrees of quality and quantity to investigate the effect of quality versus quantity of weakly labeled data on ST model performance. Results demonstrate that ST systems can be built using weakly labeled data, with performance comparable to massive multi-modal multilingual baselines such as SONAR and SeamlessM4T.
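The quality-versus-quantity trade-off in bitext mining can be illustrated with a small sketch. The assumptions are stated up front: sentence embeddings are taken as given (the toy vectors in the usage note are made up), each source sentence is paired with its nearest target by plain cosine similarity, and a single threshold decides which pairs are kept. The paper's actual encoder and scoring function are not specified in the abstract, and real mining pipelines often use margin-based scores instead.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_pairs(src_vecs, tgt_vecs, threshold):
    """Pair each source sentence with its best-scoring target sentence,
    keeping the pair only if the similarity reaches the threshold."""
    if not tgt_vecs:
        return []
    pairs = []
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs
```

With toy embeddings `src = [[1, 0], [0, 1], [1, 1]]` and `tgt = [[0.9, 0.1], [0.1, 0.9]]`, a threshold of 0.9 keeps two pairs while a threshold of 0.7 keeps all three, which mirrors how a looser threshold yields more but noisier training data.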

[26] Advancing Automated Speaking Assessment Leveraging Multifaceted Relevance and Grammar Information

Hao-Chien Lu,Jhen-Ke Lin,Hong-Yun Lin,Chung-Chun Wang,Berlin Chen

Main category: cs.CL

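TL;DR: This paper presents an improved automated speaking assessment system that integrates a multifaceted relevance module and fine-grained grammar error features, significantly improving overall assessment performance.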

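Details Motivation: Current automated speaking assessment systems for multi-aspect evaluation often fail to make full use of content relevance, overlook image or exemplar cues, and rely on superficial grammar analysis that does not identify detailed error types. Method: The paper introduces two enhancements: (1) a multifaceted relevance module that integrates the question, the image content, the exemplar, and the spoken response of an L2 speaker for a comprehensive assessment of content relevance; (2) fine-grained grammar error features derived using advanced grammar error correction (GEC) and detailed annotation. Result: Experiments and ablation studies show that these components significantly improve the evaluation of content relevance, language use, and overall automated speaking assessment performance. Conclusion: The proposed hybrid scoring model significantly improves automated speaking assessment through the multifaceted relevance module and fine-grained grammar error features. Abstract: Current automated speaking assessment (ASA) systems for use in multi-aspect evaluations often fail to make full use of content relevance, overlooking image or exemplar cues, and employ superficial grammar analysis that lacks detailed error types. This paper ameliorates these deficiencies by introducing two novel enhancements to construct a hybrid scoring model. First, a multifaceted relevance module integrates the question, the associated image content, the exemplar, and the spoken response of an L2 speaker for a comprehensive assessment of content relevance. Second, fine-grained grammar error features are derived using advanced grammar error correction (GEC) and detailed annotation to identify specific error categories. Experiments and ablation studies demonstrate that these components significantly improve the evaluation of content relevance, language use, and overall ASA performance, highlighting the benefits of using richer, more nuanced feature sets for holistic speaking assessment.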

[27] PL-Guard: Benchmarking Language Model Safety for Polish

Aleksandra Krasnodębska,Karolina Seweryn,Szymon Łukasik,Wojciech Kusa

Main category: cs.CL

TL;DR: A new Polish language safety benchmark shows that a HerBERT-based classifier performs best against adversarial challenges.

Details Motivation: To address the lack of safety assessments and moderation tools for low-resource languages, focusing on Polish. Method: Creation of a manually annotated benchmark dataset and adversarially perturbed variants; experiments with LLM-based and classifier-based models like Llama-Guard-3-8B, HerBERT-based classifier, and PLLuM. Result: The HerBERT-based classifier outperforms other models in overall performance, particularly under adversarial conditions. Conclusion: The HerBERT-based classifier demonstrates the highest performance in language model safety classification for Polish, especially under adversarial conditions. Abstract: Despite increasing efforts to ensure the safety of large language models (LLMs), most existing safety assessments and moderation tools remain heavily biased toward English and other high-resource languages, leaving the majority of global languages underexamined. To address this gap, we introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples designed to challenge model robustness. We conduct a series of experiments to evaluate LLM-based and classifier-based models of varying sizes and architectures. Specifically, we fine-tune three models: Llama-Guard-3-8B, a HerBERT-based classifier (a Polish BERT derivative), and PLLuM, a Polish-adapted Llama-8B model. We train these models using different combinations of annotated data and evaluate their performance, comparing it against publicly available guard models. Results demonstrate that the HerBERT-based classifier achieves the highest overall performance, particularly under adversarial conditions.

[28] Generalizability of Media Frames: Corpus creation and analysis across countries

Agnese Daffara,Sourabh Dattawad,Sebastian Padó,Tanise Ceron

Main category: cs.CL

TL;DR: This paper investigates the cross-cultural applicability of the Media Frame Corpus (MFC) by applying it to Brazilian Portuguese news articles in political and economic domains.

Details Motivation: The motivation is to determine whether the U.S.-centric Media Frame Corpus (MFC) can be effectively applied to understand political discourse in a different cultural context, specifically Brazil. Method: The authors introduced FrameNews-PT, a dataset of Brazilian Portuguese news articles, and annotated it using the MFC framework. They conducted multiple annotation rounds to assess how well MFC frames generalize to Brazilian political and economic debates and evaluated model performance on out-of-domain data. Result: The results show that the 15 MFC frames remain largely applicable in the Brazilian context with minor guideline adjustments. However, some frames are seldom used, and new issues are interpreted through general 'fall-back' frames. Conclusion: The study concludes that while the Media Frame Corpus (MFC) frames are broadly applicable in the Brazilian context with minor revisions, cross-cultural application of framing requires careful consideration as some frames are rarely used or adapted. Abstract: Frames capture aspects of an issue that are emphasized in a debate by interlocutors and can help us understand how political language conveys different perspectives and ultimately shapes people's opinions. The Media Frame Corpus (MFC) is the most commonly used framework with categories and detailed guidelines for operationalizing frames. It is, however, focused on a few salient U.S. news issues, making it unclear how well these frames can capture news issues in other cultural contexts. To explore this, we introduce FrameNews-PT, a dataset of Brazilian Portuguese news articles covering political and economic news and annotate it within the MFC framework. Through several annotation rounds, we evaluate the extent to which MFC frames generalize to the Brazilian debate issues. We further evaluate how fine-tuned and zero-shot models perform on out-of-domain data. 
Results show that the 15 MFC frames remain broadly applicable with minor revisions of the guidelines. However, some MFC frames are rarely used, and novel news issues are analyzed using general 'fall-back' frames. We conclude that cross-cultural frame use requires careful consideration.

[29] Analyzing the Influence of Knowledge Graph Information on Relation Extraction

Cedric Möller,Ricardo Usbeck

Main category: cs.CL

TL;DR: Using knowledge graph data boosts relation extraction model performance, especially when training data is imbalanced.

Details Motivation: To determine whether knowledge graph entity positions can enhance relation extraction performance, especially when there is an imbalance in training examples for different relations. Method: Experiments were conducted across multiple datasets with varying relations, training examples, and knowledge graphs. Established relation extraction methods were combined with graph-aware Neural Bellman-Ford networks to evaluate the impact of knowledge graph-based features. Result: Incorporating knowledge graph information led to significant performance improvements in both supervised and zero-shot settings across various datasets. Conclusion: The integration of knowledge graph information significantly improves the performance of relation extraction models, particularly in scenarios with imbalanced training data. Abstract: We examine the impact of incorporating knowledge graph information on the performance of relation extraction models across a range of datasets. Our hypothesis is that the positions of entities within a knowledge graph provide important insights for relation extraction tasks. We conduct experiments on multiple datasets, each varying in the number of relations, training examples, and underlying knowledge graphs. Our results demonstrate that integrating knowledge graph information significantly enhances performance, especially when dealing with an imbalance in the number of training examples for each relation. We evaluate the contribution of knowledge graph-based features by combining established relation extraction methods with graph-aware Neural Bellman-Ford networks. These features are tested in both supervised and zero-shot settings, demonstrating consistent performance improvements across various datasets.

[30] DISCIE -- Discriminative Closed Information Extraction

Cedric Möller,Ricardo Usbeck

Main category: cs.CL

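TL;DR: This paper proposes an efficient new method for closed information extraction that performs well at large scale, outperforming existing approaches especially on long-tail relations and when using smaller models.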

Details Motivation: Existing end-to-end generative models struggle with large-scale closed information extraction, especially when handling millions of entities and hundreds of relations, so a more efficient and accurate method is needed. Method: The paper uses a discriminative approach that incorporates type and entity-specific information to improve relation extraction accuracy, and emphasizes efficiency by using smaller models. Result: Experiments show that the method outperforms state-of-the-art end-to-end generative models, especially on large-scale closed information extraction, and that integrating type information lets it match or surpass a larger generative model. Conclusion: The paper proposes a novel closed information extraction method with higher accuracy and efficiency, performing particularly well on long-tail relations and at large scale. Abstract: This paper introduces a novel method for closed information extraction. The method employs a discriminative approach that incorporates type and entity-specific information to improve relation extraction accuracy, particularly benefiting long-tail relations. Notably, this method demonstrates superior performance compared to state-of-the-art end-to-end generative models. This is especially evident for the problem of large-scale closed information extraction where we are confronted with millions of entities and hundreds of relations. Furthermore, we emphasize the efficiency aspect by leveraging smaller models. In particular, the integration of type-information proves instrumental in achieving performance levels on par with or surpassing those of a larger generative model. This advancement holds promise for more accurate and efficient information extraction techniques.

[31] Can structural correspondences ground real world representational content in Large Language Models?

Iwan Williams

Main category: cs.CL

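TL;DR: This paper examines whether large language models can represent real-world content, arguing that structural correspondence alone is insufficient: the correspondences must also play a real role in task performance.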

Details Motivation: Because LLMs are exposed only to text and lack direct interaction with external reality, it is unclear whether they can represent real-world content. Method: The paper examines the representational capacities of LLMs under a structural-correspondence account of representation and makes an initial survey of the relevant evidence. Result: The mere existence of structural correspondences is insufficient to ground LLM representation of the real world; only when those correspondences help explain successful task performance could they ground genuine representation. Conclusion: Whether LLMs represent real-world content depends not only on structural correspondences with the world but also on those correspondences playing a role in task performance. Abstract: Large Language Models (LLMs) such as GPT-4 produce compelling responses to a wide range of prompts. But their representational capacities are uncertain. Many LLMs have no direct contact with extra-linguistic reality: their inputs, outputs and training data consist solely of text, raising the questions (1) can LLMs represent anything and (2) if so, what? In this paper, I explore what it would take to answer these questions according to a structural-correspondence based account of representation, and make an initial survey of this evidence. I argue that the mere existence of structural correspondences between LLMs and worldly entities is insufficient to ground representation of those entities. However, if these structural correspondences play an appropriate role - they are exploited in a way that explains successful task performance - then they could ground real world contents. This requires overcoming a challenge: the text-boundedness of LLMs appears, on the face of it, to prevent them engaging in the right sorts of tasks.

[32] InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems

Kexin Huang,Qian Tu,Liwei Fan,Chenchen Yang,Dong Zhang,Shimin Li,Zhaoye Fei,Qinyuan Cheng,Xipeng Qiu

Main category: cs.CL

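TL;DR: This paper introduces InstructTTSEval, a new benchmark for measuring complex natural-language style control, and its evaluation shows that current instruction-following TTS systems have substantial room for improvement.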

Details Motivation: Traditional TTS systems offer limited flexibility in style control, the ability of existing systems to understand and execute complex instructions remains unclear, and high-quality benchmarks and automated evaluation metrics are lacking. Method: Introduces the InstructTTSEval benchmark, comprising three tasks, and uses Gemini as an automatic judge to assess instruction-following ability. Result: The evaluation shows that existing accessible instruction-following TTS systems still have substantial room for improvement. Conclusion: InstructTTSEval will drive progress toward more powerful, flexible, and accurate instruction-following TTS. Abstract: In modern speech synthesis, paralinguistic information (such as a speaker's vocal timbre, emotional state, and dynamic prosody) plays a critical role in conveying nuance beyond mere semantics. Traditional Text-to-Speech (TTS) systems rely on fixed style labels or inserting a speech prompt to control these cues, which severely limits flexibility. Recent attempts seek to employ natural-language instructions to modulate paralinguistic features, substantially improving the generalization of instruction-driven TTS models. Although many TTS systems now support customized synthesis via textual description, their actual ability to interpret and execute complex instructions remains largely unexplored. In addition, there is still a shortage of high-quality benchmarks and automated evaluation metrics specifically designed for instruction-based TTS, which hinders accurate assessment and iterative optimization of these models. To address these limitations, we introduce InstructTTSEval, a benchmark for measuring the capability of complex natural-language style control. We introduce three tasks, namely Acoustic-Parameter Specification, Descriptive-Style Directive, and Role-Play, including English and Chinese subsets, each with 1k test cases (6k in total) paired with reference audio. We leverage Gemini as an automatic judge to assess their instruction-following abilities. Our evaluation of accessible instruction-following TTS systems highlights substantial room for further improvement. We anticipate that InstructTTSEval will drive progress toward more powerful, flexible, and accurate instruction-following TTS.

[33] Large Language Models in Argument Mining: A Survey

Hao Li,Viktor Schlegel,Yizheng Sun,Riza Batista-Navarro,Goran Nenadic

Main category: cs.CL

TL;DR: This survey explores the transformative effect of Large Language Models on Argument Mining, offering a detailed overview of current methodologies, datasets, and future research directions.

Details Motivation: The motivation is to provide a comprehensive overview of the impact of Large Language Models on Argument Mining, including their influence on in-context learning, prompt-based generation, and cross-domain adaptability. Method: The paper systematically synthesizes recent advancements in LLM-driven AM, providing a concise review of foundational theories, annotation frameworks, and datasets. It presents a taxonomy of AM subtasks and examines how modern LLM techniques have influenced their execution. Result: A key result is the development of a comprehensive taxonomy of AM subtasks and an understanding of how prompting, chain-of-thought reasoning, and retrieval augmentation have changed their implementation. The survey also identifies critical challenges like long-context reasoning, interpretability, and annotation bottlenecks. Conclusion: The paper concludes by highlighting emerging trends and suggesting a forward-looking research agenda for LLM-based computational argumentation, aiming to guide researchers in this evolving field. Abstract: Argument Mining (AM), a critical subfield of Natural Language Processing (NLP), focuses on extracting argumentative structures from text. The advent of Large Language Models (LLMs) has profoundly transformed AM, enabling advanced in-context learning, prompt-based generation, and robust cross-domain adaptability. This survey systematically synthesizes recent advancements in LLM-driven AM. We provide a concise review of foundational theories and annotation frameworks, alongside a meticulously curated catalog of datasets. A key contribution is our comprehensive taxonomy of AM subtasks, elucidating how contemporary LLM techniques -- such as prompting, chain-of-thought reasoning, and retrieval augmentation -- have reconfigured their execution. 
We further detail current LLM architectures and methodologies, critically assess evaluation practices, and delineate pivotal challenges including long-context reasoning, interpretability, and annotation bottlenecks. Conclusively, we highlight emerging trends and propose a forward-looking research agenda for LLM-based computational argumentation, aiming to strategically guide researchers in this rapidly evolving domain.

[34] HausaNLP at SemEval-2025 Task 11: Advancing Hausa Text-based Emotion Detection

Sani Abdullahi Sani,Salim Abubakar,Falalu Ibrahim Lawan,Abdulhamid Abubakar,Maryam Bala

Main category: cs.CL

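TL;DR: This paper explores multi-label emotion detection in Hausa, a low-resource African language, by fine-tuning the transformer-based model AfriBERTa, reaching a validation accuracy of 74.00% and an F1-score of 73.50%.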

Details Motivation: To improve emotion detection for Hausa, a low-resource African language, and explore the potential of transformer-based models. Method: Fine-tunes AfriBERTa, a transformer model pre-trained on African languages, using the Hugging Face Trainer API, with data preprocessing and tokenization. Result: The system reached a validation accuracy of 74.00% and an F1-score of 73.50%, demonstrating its effectiveness for emotion detection in low-resource languages. Conclusion: Transformer-based models perform well on emotion detection in low-resource languages and offer an effective approach for future work. Abstract: This paper presents our approach to multi-label emotion detection in Hausa, a low-resource African language, as part of SemEval Track A. We fine-tuned AfriBERTa, a transformer-based model pre-trained on African languages, to classify Hausa text into six emotions: anger, disgust, fear, joy, sadness, and surprise. Our methodology involved data preprocessing, tokenization, and model fine-tuning using the Hugging Face Trainer API. The system achieved a validation accuracy of 74.00%, with an F1-score of 73.50%, demonstrating the effectiveness of transformer-based models for emotion detection in low-resource languages.

[35] RiOT: Efficient Prompt Refinement with Residual Optimization Tree

Chenyi Zhou,Zhengyan Shi,Yuan Yao,Lei Liang,Huajun Chen,Qiang Zhang

Main category: cs.CL

TL;DR: RiOT is a new framework for automatic prompt optimization that improves diversity, reduces semantic drift, and outperforms existing methods across various reasoning tasks.

Details Motivation: Existing automatic prompt optimization methods face two main challenges: limited diversity in prompt exploration and semantic drift, where optimizations degrade cross-task performance. This work aims to overcome these limitations. Method: RiOT uses text gradients for iterative refinement of prompts, generates diverse candidates, employs perplexity-based selection, and incorporates text residual connections to manage semantic drift. A tree structure ensures scalability and flexibility. Result: RiOT outperforms prior prompt optimization approaches and manual prompting on five benchmarks covering commonsense, mathematical, logical, temporal, and semantic reasoning. Conclusion: RiOT effectively addresses the challenges of lack of diversity and semantic drift in automatic prompt optimization, demonstrating superior performance across multiple reasoning tasks compared to previous methods and manual prompting. Abstract: Recent advancements in large language models (LLMs) have highlighted their potential across a variety of tasks, but their performance still heavily relies on the design of effective prompts. Existing methods for automatic prompt optimization face two challenges: lack of diversity, which limits the exploration of valuable and innovative directions, and semantic drift, where optimizations for one task can degrade performance in others. To address these issues, we propose Residual Optimization Tree (RiOT), a novel framework for automatic prompt optimization. RiOT iteratively refines prompts through text gradients, generating multiple semantically diverse candidates at each step, and selects the best prompt using perplexity. Additionally, RiOT incorporates the text residual connection to mitigate semantic drift by selectively retaining beneficial content across optimization iterations. A tree structure efficiently manages the optimization process, ensuring scalability and flexibility. 
Extensive experiments across five benchmarks, covering commonsense, mathematical, logical, temporal, and semantic reasoning, demonstrate that RiOT outperforms both previous prompt optimization methods and manual prompting.
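
The perplexity-based selection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each candidate prompt comes with per-token log-probabilities from some scoring model and that the lowest-perplexity candidate is kept; the abstract does not state which direction RiOT prefers, and the names `perplexity` and `select_prompt` and the sample numbers are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over the scored tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_prompt(candidates):
    """Return the candidate prompt with the lowest perplexity.

    candidates maps prompt text -> list of per-token log-probabilities.
    Preferring the lowest perplexity is an assumption of this sketch.
    """
    return min(candidates, key=lambda prompt: perplexity(candidates[prompt]))


# Hypothetical candidates with made-up log-probabilities.
candidates = {
    "Think step by step.": [-0.2, -0.4, -0.3, -0.1],
    "Answer carefully and show work.": [-1.1, -0.9, -1.3, -0.7, -1.0],
}
best = select_prompt(candidates)
```

In the framework described above, a selection like this would run once per refinement step, after the text-gradient stage has proposed several candidates.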

[36] From LLM-anation to LLM-orchestrator: Coordinating Small Models for Data Labeling

Yao Lu,Zhaiyuan Ji,Jiawei Du,Yu Shanqing,Qi Xuan,Tianyi Zhou

Main category: cs.CL

TL;DR: This paper proposes AutoAnnotator, an automatic annotation framework in which LLMs and SLMs work together, cutting annotation cost while improving accuracy.

Details Motivation: Existing LLM-based annotation methods incur high API costs at scale, and on tasks requiring fine-grained semantic understanding they are less accurate than specialized small language models. Method: Designs AutoAnnotator, a two-layer cooperative annotation framework built on large language models (LLMs) and small language models (SLMs): the upper-level meta-controller uses LLMs to select SLMs, generate code, and verify difficult samples, while the lower-level task-specialist layer annotates through multi-model voting, and the SLMs are fine-tuned in stages with a continual learning strategy. Result: Experiments show that AutoAnnotator outperforms existing open-source/API LLMs in zero-shot, one-shot, chain-of-thought (CoT), and majority voting settings; compared with annotating directly with GPT-3.5-turbo, it reduces cost by 74.15% and improves accuracy by 6.21%. Conclusion: AutoAnnotator is a fully automatic annotation framework that lowers annotation cost while improving annotation accuracy through multi-model cooperation and continual learning. Abstract: Although the annotation paradigm based on Large Language Models (LLMs) has made significant breakthroughs in recent years, its actual deployment still has two core bottlenecks: first, calling commercial APIs for large-scale annotation is very expensive; second, in scenarios that require fine-grained semantic understanding, such as sentiment classification and toxicity classification, the annotation accuracy of LLMs is even lower than that of Small Language Models (SLMs) dedicated to this field. To address these problems, we propose a new paradigm of multi-model cooperative annotation and design a fully automatic annotation framework AutoAnnotator based on this. Specifically, AutoAnnotator consists of two layers. The upper-level meta-controller layer uses the generation and reasoning capabilities of LLMs to select SLMs for annotation, automatically generate annotation code and verify difficult samples; the lower-level task-specialist layer consists of multiple SLMs that perform annotation through multi-model voting. In addition, we use the difficult samples obtained by the secondary review of the meta-controller layer as the reinforcement learning set and fine-tune the SLMs in stages through a continual learning strategy, thereby improving the generalization of SLMs. Extensive experiments show that AutoAnnotator outperforms existing open-source/API LLMs in zero-shot, one-shot, CoT, and majority voting settings. 
Notably, AutoAnnotator reduces the annotation cost by 74.15% compared to directly annotating with GPT-3.5-turbo, while still improving the accuracy by 6.21%. Project page: https://github.com/Zhaiyuan-Ji/AutoAnnotator.
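
The multi-model voting in the task-specialist layer can be sketched as a plain majority vote. How AutoAnnotator decides that a sample is "difficult" is not given in the abstract; this sketch assumes that any disagreement among the SLMs below a threshold flags the sample for secondary review, and the name `vote` and the `min_agreement` parameter are hypothetical.

```python
from collections import Counter


def vote(labels, min_agreement=1.0):
    """Majority vote over the labels proposed by several small models.

    Returns (winning_label, needs_review). A sample is flagged for review
    when the share of models backing the winner is below min_agreement;
    that rule is an assumption of this sketch, not the paper's criterion.
    """
    if not labels:
        raise ValueError("need at least one label")
    label, top = Counter(labels).most_common(1)[0]
    return label, (top / len(labels)) < min_agreement


unanimous = vote(["positive", "positive", "positive"])
split = vote(["positive", "negative", "positive"])
```

Flagged samples are the ones that would be escalated to the LLM meta-controller and later reused for staged fine-tuning of the SLMs.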

[37] OJBench: A Competition Level Code Benchmark For Large Language Models

Zhexu Wang,Yiping Liu,Yejie Wang,Wenyang He,Bofei Gao,Muxi Diao,Yanxu Chen,Kelin Fu,Flood Sung,Zhilin Yang,Tianyu Liu,Weiran Xu

Main category: cs.CL

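TL;DR: OJBench is a new benchmark designed to evaluate the competition-level code reasoning of large language models; the study shows that even state-of-the-art models struggle with such challenging problems.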

Details Motivation: Existing code benchmarks are limited in evaluating the math and code reasoning of large language models, particularly at the competitive level; the authors introduce OJBench to close this gap. Method: Builds OJBench, a dataset of 232 programming competition problems, and comprehensively evaluates 37 models, covering closed-source and open-source models as well as reasoning-oriented and non-reasoning-oriented ones. Result: Even state-of-the-art reasoning models (such as o4-mini and Gemini-2.5-pro-exp) perform poorly on highly challenging competition-level problems. Conclusion: OJBench is a challenging new benchmark for evaluating the competition-level code reasoning of large language models, and the results show that even advanced reasoning models face significant difficulty on competition-level problems. Abstract: Recent advancements in large language models (LLMs) have demonstrated significant progress in math and code reasoning capabilities. However, existing code benchmarks are limited in their ability to evaluate the full spectrum of these capabilities, particularly at the competitive level. To bridge this gap, we introduce OJBench, a novel and challenging benchmark designed to assess the competitive-level code reasoning abilities of LLMs. OJBench comprises 232 programming competition problems from NOI and ICPC, providing a more rigorous test of models' reasoning skills. We conducted a comprehensive evaluation using OJBench on 37 models, including both closed-source and open-source models, reasoning-oriented and non-reasoning-oriented models. Our results indicate that even state-of-the-art reasoning-oriented models, such as o4-mini and Gemini-2.5-pro-exp, struggle with highly challenging competition-level problems. This highlights the significant challenges that models face in competitive-level code reasoning.

[38] NepaliGPT: A Generative Language Model for the Nepali Language

Shushanta Pudasaini,Aman Shakya,Siddhartha Shrestha,Sahil Bhatta,Sunil Thapa,Sushmita Palikhe

Main category: cs.CL

TL;DR: This paper proposes NepaliGPT, the first generative large language model for the Nepali language, accompanied by an advanced corpus and benchmark dataset, showing promising results in text generation.

Details Motivation: There is no generative language model for the Nepali language, which hinders downstream tasks like fine-tuning. This research aims to fill the gap in Nepali NLP by proposing NepaliGPT. Method: The study involves the creation of an advanced corpus called the Devanagari Corpus and introduces the first NepaliGPT benchmark dataset with 4,296 question-answer pairs. The proposed model is evaluated using metrics like perplexity, ROUGE-1 score, causal coherence, and causal consistency. Result: NepaliGPT achieves a perplexity of 26.32245, ROUGE-1 score of 0.2604, causal coherence of 81.25%, and causal consistency of 85.41% in text generation. Conclusion: NepaliGPT is a successful generative large language model tailored for the Nepali language, achieving notable metrics in text generation and setting a foundation for future research in Nepali NLP. Abstract: Since the release of ChatGPT, Large Language Models (LLMs) have gained huge popularity and thousands of variants of LLMs have been released. However, there is no generative language model for the Nepali language, due to which other downstream tasks, including fine-tuning, have not been explored yet. To fill this research gap in the Nepali NLP space, this research proposes NepaliGPT, a generative large language model tailored specifically for the Nepali language. This research introduces an advanced corpus for the Nepali language collected from several sources, called the Devanagari Corpus. Likewise, the research introduces the first NepaliGPT benchmark dataset comprised of 4,296 question-answer pairs in the Nepali language. The proposed LLM NepaliGPT achieves the following metrics in text generation: Perplexity of 26.32245, ROUGE-1 score of 0.2604, causal coherence of 81.25%, and causal consistency of 85.41%.

[39] When Does Divide and Conquer Work for Long Context LLM? A Noise Decomposition Framework

Zhen Xu,Shang Zhu,Jue Wang,Junlin Wang,Ben Athiwaratkun,Chi Wang,James Zou,Ce Zhang

Main category: cs.CL

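TL;DR: This study proposes a theoretical framework for long-text processing and validates the effectiveness of multi-agent chunking, offering a new perspective on long-context handling in large language models.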

Details Motivation: To address the challenges large language models face on long texts, including task noise, model noise, and aggregator noise, and to find more effective ways to process long texts. Method: Proposes a theoretical framework that sorts the failure modes of long-context tasks into three categories: cross-chunk dependence (task noise), confusion that grows with context size (model noise), and imperfect integration of partial results (aggregator noise), then analyzes the effectiveness of multi-agent chunking under this framework. Result: Experiments on retrieval, question answering, and summarization show that multi-agent chunking is effective, and the superlinear growth of model noise with input length explains why, on large inputs, a weaker model with chunking can surpass an advanced model such as GPT4o. Conclusion: The paper presents a theoretical framework for understanding how large language models (LLMs) fail on long texts and shows how multi-agent chunking can address these challenges. Abstract: We investigate the challenge of applying Large Language Models (LLMs) to long texts. We propose a theoretical framework that distinguishes the failure modes of long context tasks into three categories: cross-chunk dependence (task noise), confusion that grows with context size (model noise), and the imperfect integration of partial results (aggregator noise). Under this view, we analyze when it is effective to use multi-agent chunking, i.e., dividing a long sequence into smaller chunks and aggregating the processed results of each chunk. Our experiments on tasks such as retrieval, question answering, and summarization confirm both the theoretical analysis and the conditions that favor multi-agent chunking. By exploring superlinear model noise growth with input length, we also explain why, for large inputs, a weaker model configured with chunk-based processing can surpass a more advanced model like GPT4o applied in a single shot. Overall, we present a principled understanding framework and our results highlight a direct pathway to handling long contexts in LLMs with carefully managed chunking and aggregator strategies.
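
The chunk-then-aggregate pattern analyzed in this paper can be sketched in a few lines. In the paper's setting the worker and the aggregator would be LLM calls; here plain functions stand in for them, and the names `chunk` and `divide_and_conquer` are hypothetical. The toy task (counting a keyword) has no cross-chunk dependence, so it corresponds to the case with zero task noise.

```python
def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def divide_and_conquer(tokens, size, worker, aggregator):
    """Run `worker` on every chunk, then merge the partial results."""
    partials = [worker(piece) for piece in chunk(tokens, size)]
    return aggregator(partials)


# Toy task: count occurrences of "needle" in a long input.
tokens = ["a", "needle", "b", "needle", "c"]
count = divide_and_conquer(
    tokens,
    size=2,
    worker=lambda piece: piece.count("needle"),  # stand-in for a per-chunk model call
    aggregator=sum,                              # stand-in for the aggregator model
)
```

A task whose answer depends on tokens in different chunks (for example, matching a question in one chunk to evidence in another) would not decompose this cleanly, which is the task-noise case the framework describes.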

[40] REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing

Kangqi Chen,Andreas Kosmas Kakolyris,Rakesh Nadig,Manos Frouzakis,Nika Mansouri Ghiasi,Yu Liang,Haiyu Mao,Jisung Park,Mohammad Sadrosadati,Onur Mutlu

Main category: cs.CL

TL;DR: REIS is an optimized In-Storage Processing system designed specifically for Retrieval-Augmented Generation, enhancing the efficiency of the retrieval process by addressing key limitations in existing methods.

Details Motivation: The retrieval stage in RAG systems becomes a bottleneck due to large database sizes and data movement overheads during Approximate Nearest Neighbor Search (ANNS). Existing In-Storage Processing (ISP) solutions are limited by non-tailored algorithms, lack of data retrieval acceleration, and significant hardware changes. REIS aims to overcome these issues. Method: REIS introduces three mechanisms: a database layout linking embedding vectors to documents, an ISP-tailored data placement technique distributing embeddings across storage planes with a lightweight Flash Translation Layer, and utilization of existing computational resources within the storage system for ANNS execution. Result: Compared to a server-grade system, REIS improves retrieval performance by an average of 13x and energy efficiency by 55x. Conclusion: REIS is an effective In-Storage Processing (ISP) system tailored for Retrieval-Augmented Generation (RAG), addressing existing limitations in Approximate Nearest Neighbor Search (ANNS) execution, data retrieval, and hardware modifications, thereby significantly improving performance and energy efficiency. Abstract: Large Language Models (LLMs) face an inherent challenge: their knowledge is confined to the data that they have been trained on. To overcome this issue, Retrieval-Augmented Generation (RAG) complements the static training-derived knowledge of LLMs with an external knowledge repository. RAG consists of three stages: indexing, retrieval, and generation. The retrieval stage of RAG becomes a significant bottleneck in inference pipelines. In this stage, a user query is mapped to an embedding vector and an Approximate Nearest Neighbor Search (ANNS) algorithm searches for similar vectors in the database to identify relevant items. Due to the large database sizes, ANNS incurs significant data movement overheads between the host and the storage system. 
To alleviate these overheads, prior works propose In-Storage Processing (ISP) techniques that accelerate ANNS by performing computations inside storage. However, existing works that leverage ISP for ANNS (i) employ algorithms that are not tailored to ISP systems, (ii) do not accelerate data retrieval operations for data selected by ANNS, and (iii) introduce significant hardware modifications, limiting performance and hindering their adoption. We propose REIS, the first ISP system tailored for RAG that addresses these limitations with three key mechanisms. First, REIS employs a database layout that links database embedding vectors to their associated documents, enabling efficient retrieval. Second, it enables efficient ANNS by introducing an ISP-tailored data placement technique that distributes embeddings across the planes of the storage system and employs a lightweight Flash Translation Layer. Third, REIS leverages an ANNS engine that uses the existing computational resources inside the storage system. Compared to a server-grade system, REIS improves the performance (energy efficiency) of retrieval by an average of 13x (55x).
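
The retrieval stage described above (map a query to an embedding, find similar vectors, return the linked documents) can be sketched with an exact, brute-force search. This is the baseline that ANNS approximates, not REIS's in-storage engine; the function names, the list-of-pairs layout, and the sample vectors are hypothetical, and embeddings are assumed to be non-zero.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, database, k=1):
    """Return the documents whose embeddings are most similar to the query.

    database is a list of (embedding, document) pairs, so each embedding
    is stored next to the document it points to.
    """
    ranked = sorted(database, key=lambda entry: cosine(query, entry[0]), reverse=True)
    return [document for _, document in ranked[:k]]


database = [
    ([1.0, 0.0], "doc about cats"),
    ([0.0, 1.0], "doc about storage"),
    ([0.7, 0.7], "doc about both"),
]
top2 = retrieve([0.1, 0.9], database, k=2)
```

Every query here scans the whole database, which is the data movement cost that grows with database size and that in-storage processing aims to avoid.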

[41] StoryWriter: A Multi-Agent Framework for Long Story Generation

Haotian Xia,Hao Peng,Yunjia Qi,Xiaozhi Wang,Bin Xu,Lei Hou,Juanzi Li

Main category: cs.CL

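TL;DR: This paper proposes StoryWriter, a multi-agent framework that addresses the problems of long story generation, and builds a large dataset of high-quality long stories.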

Details Motivation: Existing large language models face challenges of discourse coherence and narrative complexity in long story generation. Method: Proposes StoryWriter, a multi-agent story generation framework with three modules: an outline agent, a planning agent, and a writing agent. Result: Human and automated evaluation show that StoryWriter significantly outperforms existing methods in both story quality and length, and it was used to generate a dataset of about 6,000 high-quality long stories. Conclusion: StoryWriter is an effective long story generation framework that significantly outperforms existing story generation baselines, and it yields a high-quality long story dataset. Abstract: Long story generation remains a challenge for existing large language models (LLMs), primarily due to two main factors: (1) discourse coherence, which requires plot consistency, logical coherence, and completeness in the long-form generation, and (2) narrative complexity, which requires an interwoven and engaging narrative. To address these challenges, we propose StoryWriter, a multi-agent story generation framework, which consists of three main modules: (1) outline agent, which generates event-based outlines containing rich event plots, character, and event-event relationships; (2) planning agent, which further details events and plans which events should be written in each chapter to maintain an interwoven and engaging story; and (3) writing agent, which dynamically compresses the story history based on the current event to generate and reflect new plots, ensuring the coherence of the generated story. We conduct both human and automated evaluation, and StoryWriter significantly outperforms existing story generation baselines in both story quality and length. Furthermore, we use StoryWriter to generate a dataset, which contains about 6,000 high-quality long stories, with an average length of 8,000 words. We train Llama3.1-8B and GLM4-9B using supervised fine-tuning on LongStory and develop StoryWriter_GLM and StoryWriter_GLM, which demonstrate advanced performance in long story generation.

[42] Towards Generalizable Generic Harmful Speech Datasets for Implicit Hate Speech Detection

Saad Almohaimeed,Saleh Almohaimeed,Damla Turgut,Ladislau Bölöni

Main category: cs.CL

TL;DR: This paper addresses the challenge of detecting implicit hate speech on social media by enhancing existing harmful speech datasets, leading to improved detection performance.

Details Motivation: Implicit hate speech poses a critical challenge for social media platforms, necessitating more effective detection techniques. Method: The method involves influential sample identification, reannotation, and augmentation using Llama-3 70B and GPT-4o. Result: Experimental results show a +12.9-point F1 score improvement compared to the baseline in detecting implicit hate speech. Conclusion: The proposed method effectively improves the detection of implicit hate speech and enhances generalizability across diverse datasets. Abstract: Implicit hate speech has recently emerged as a critical challenge for social media platforms. While much of the research has traditionally focused on harmful speech in general, the need for generalizable techniques to detect veiled and subtle forms of hate has become increasingly pressing. Based on lexicon analysis, we hypothesize that implicit hate speech is already present in publicly available harmful speech datasets but may not have been explicitly recognized or labeled by annotators. Additionally, crowdsourced datasets are prone to mislabeling due to the complexity of the task and often influenced by annotators' subjective interpretations. In this paper, we propose an approach to address the detection of implicit hate speech and enhance generalizability across diverse datasets by leveraging existing harmful speech datasets. Our method comprises three key components: influential sample identification, reannotation, and augmentation using Llama-3 70B and GPT-4o. Experimental results demonstrate the effectiveness of our approach in improving implicit hate detection, achieving a +12.9-point F1 score improvement compared to the baseline.

[43] Relic: Enhancing Reward Model Generalization for Low-Resource Indic Languages with Few-Shot Examples

Soumya Suvra Ghosal,Vaibhav Singh,Akash Ghosh,Soumyabrata Pal,Subhadip Baidya,Sriparna Saha,Dinesh Manocha

Main category: cs.CL

TL;DR: RELIC improves reward modeling for low-resource Indic languages by selecting relevant examples from high-resource languages, outperforming current methods.

Details Motivation: Most open-source multilingual reward models are trained on high-resource language datasets, leading to unreliable reward signals for low-resource Indic languages. Collecting preference data for these languages is expensive and impractical. Method: RELIC trains a retriever with a pairwise ranking objective to select useful examples from high-resource languages to improve reward modeling for low-resource languages. Result: Experiments show that RELIC significantly improves reward model accuracy for low-resource Indic languages, achieving a 12.81% and 10.13% improvement over zero-shot prompting and state-of-the-art example selection methods, respectively, on Bodo using a LLaMA-3.2-3B model. Conclusion: RELIC is an effective in-context learning framework for improving reward model accuracy in low-resource Indic languages, outperforming zero-shot prompting and existing example selection methods. Abstract: Reward models are essential for aligning large language models (LLMs) with human preferences. However, most open-source multilingual reward models are primarily trained on preference datasets in high-resource languages, resulting in unreliable reward signals for low-resource Indic languages. Collecting large-scale, high-quality preference data for these languages is prohibitively expensive, making preference-based training approaches impractical. To address this challenge, we propose RELIC, a novel in-context learning framework for reward modeling in low-resource Indic languages. RELIC trains a retriever with a pairwise ranking objective to select in-context examples from auxiliary high-resource languages that most effectively highlight the distinction between preferred and less-preferred responses. 
Extensive experiments on three preference datasets (PKU-SafeRLHF, WebGPT, and HH-RLHF) using state-of-the-art open-source reward models demonstrate that RELIC significantly improves reward model accuracy for low-resource Indic languages, consistently outperforming existing example selection methods. For example, on Bodo, a low-resource Indic language, using a LLaMA-3.2-3B reward model, RELIC achieves a 12.81% and 10.13% improvement in accuracy over zero-shot prompting and the state-of-the-art example selection method, respectively.
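
A pairwise ranking objective of the kind the retriever is trained with can be illustrated with the standard logistic pairwise loss, -log(sigmoid(s_better - s_worse)). The abstract does not give RELIC's exact loss, so this form is an assumption, and the function name and scores are hypothetical.

```python
import math


def pairwise_ranking_loss(score_better, score_worse):
    """Logistic pairwise loss: -log(sigmoid(score_better - score_worse)).

    The loss is small when the example that should rank higher receives
    the higher score. Both branches compute log(1 + exp(-margin)); the
    split only avoids overflow for large negative margins.
    """
    margin = score_better - score_worse
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


correct_order = pairwise_ranking_loss(2.0, 0.0)  # useful example scored higher
wrong_order = pairwise_ranking_loss(0.0, 2.0)    # useful example scored lower
```

Minimizing a loss like this over many pairs pushes the retriever to score helpful in-context examples above less helpful ones.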

[44] Automatic Speech Recognition Biases in Newcastle English: an Error Analysis

Dana Serditova,Kevin Tang,Jochen Steffens

Main category: cs.CL

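TL;DR: This study finds that automatic speech recognition systems make marked errors on the Newcastle English dialect, and that these errors stem mainly from dialectal features rather than social factors.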

Details Motivation: Automatic speech recognition (ASR) systems perform poorly on regional dialects and regional bias remains underexamined, so this study aims to fill that gap. Method: A two-stage analysis: first, a manual error analysis of a subsample identified phonological, lexical, and morphosyntactic errors; second, a case study examined recognition of the regional pronouns "yous" and "wor". Result: ASR errors correlate directly with regional dialectal features, while social factors play a lesser role in ASR mismatches. Conclusion: The study underscores the need for greater dialectal diversity in ASR training data and highlights the value of sociolinguistic analysis in diagnosing and addressing regional bias. Abstract: Automatic Speech Recognition (ASR) systems struggle with regional dialects due to biased training which favours mainstream varieties. While previous research has identified racial, age, and gender biases in ASR, regional bias remains underexamined. This study investigates ASR performance on Newcastle English, a well-documented regional dialect known to be challenging for ASR. A two-stage analysis was conducted: first, a manual error analysis on a subsample identified key phonological, lexical, and morphosyntactic errors behind ASR misrecognitions; second, a case study focused on the systematic analysis of ASR recognition of the regional pronouns "yous" and "wor". Results show that ASR errors directly correlate with regional dialectal features, while social factors play a lesser role in ASR mismatches. We advocate for greater dialectal diversity in ASR training data and highlight the value of sociolinguistic analysis in diagnosing and addressing regional biases.

[45] Weight Factorization and Centralization for Continual Learning in Speech Recognition

Enes Yavuz Ugan,Ngoc-Quan Pham,Alexander Waibel

Main category: cs.CL

TL;DR: This paper proposes a continual learning approach inspired by the human waking-sleeping cycle to address catastrophic forgetting in speech recognition models, particularly effective in rehearsal-free, multilingual scenarios through factorization and centralization phases.

Details Motivation: Modern speech recognition models need to adapt to new data without retraining on the full dataset, especially when original training data is inaccessible. Continual training in multilingual settings without rehearsal often leads to catastrophic forgetting, necessitating a solution that mimics efficient human learning mechanisms. Method: Inspired by the human brain's waking-sleeping learning cycle, the method introduces a two-phase continual learning approach: factorization (knowledge learning) and centralization (knowledge merging), using multiple scattering low-rank adapters to accumulate knowledge. Result: Experiments on code-switching datasets demonstrated that the centralization phase can effectively prevent catastrophic forgetting by efficiently accumulating knowledge in low-rank adapters. Conclusion: The proposed continual learning approach with factorization and centralization phases effectively prevents catastrophic forgetting in neural network-based speech recognition models under rehearsal-free, multilingual, and language-agnostic conditions. Abstract: Modern neural network based speech recognition models are required to continually absorb new data without re-training the whole system, especially in downstream applications that use foundation models and have no access to the original training data. Continually training the models in a rehearsal-free, multilingual, and language-agnostic condition likely leads to catastrophic forgetting, where a seemingly insignificant disruption to the weights can destructively harm the quality of the models. Inspired by the ability of human brains to learn and consolidate knowledge through the waking-sleeping cycle, we propose a continual learning approach with two distinct phases: factorization and centralization, learning and merging knowledge accordingly. 
Our experiments on a sequence of varied code-switching datasets showed that the centralization stage can effectively prevent catastrophic forgetting by accumulating the knowledge in multiple scattering low-rank adapters.

[46] Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement

Tuan-Nam Nguyen,Ngoc-Quan Pham,Seymanur Akti,Alexander Waibel

Main category: cs.CL

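TL;DR: This paper presents a new streaming accent conversion model that enables real-time speech processing while preserving speaker identity and improving pronunciation.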

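Details Motivation: To process non-native speech in real time and convert it to a native-like accent while preserving the speaker's identity and prosody. Method: Modifies a previous AC architecture with an Emformer encoder and an optimized inference mechanism, and integrates a native text-to-speech (TTS) model to generate ideal ground-truth data for efficient training. Result: The streaming AC model achieves performance comparable to top AC models while maintaining stable latency. Conclusion: The paper proposes a streaming accent conversion model that converts non-native speech into a native-like accent while preserving speaker identity and prosody and improving pronunciation, and it is the first AC system capable of streaming. Abstract: We propose the first streaming accent conversion (AC) model that transforms non-native speech into a native-like accent while preserving speaker identity and prosody and improving pronunciation. Our approach enables stream processing by modifying a previous AC architecture with an Emformer encoder and an optimized inference mechanism. Additionally, we integrate a native text-to-speech (TTS) model to generate ideal ground-truth data for efficient training. Our streaming AC model achieves comparable performance to the top AC models while maintaining stable latency, making it the first AC system capable of streaming.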

[47] Measuring (a Sufficient) World Model in LLMs: A Variance Decomposition Framework

Nadav Kunievsky,James A. Evans

Main category: cs.CL

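TL;DR: This paper proposes a new framework for evaluating whether large language models possess a robust world model, finding that larger models do better in some domains but that the overall advantage is limited, and stresses the need for semantic diagnostic tools to assess model understanding more accurately.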

Details Motivation: Determining whether large language models (LLMs) possess a structured understanding of the world that supports generalization beyond surface-level patterns is central to assessing their reliability, especially in high-stakes applications. Method: The paper proposes a formal framework for evaluating whether an LLM exhibits a robust world model. The framework decomposes the variability of model responses into three components (user purpose, user articulation, and model instability) and is applied to evaluate LLMs across multiple domains. Result: Larger models attribute a greater share of output variability to changes in user purpose, indicating a more robust world model. The improvement is not uniform, however: in some domains larger models do not consistently outperform smaller ones, and their robustness advantage is usually modest. Conclusion: Evaluating whether an LLM has a sufficiently robust world model is essential for ensuring reliability in high-stakes applications. The study stresses the need to move beyond accuracy-based benchmarks toward semantic diagnostics that more directly assess the structure and stability of a model's internal understanding of the world. Abstract: Understanding whether large language models (LLMs) possess a world model (a structured understanding of the world that supports generalization beyond surface-level patterns) is central to assessing their reliability, especially in high-stakes applications. We propose a formal framework for evaluating whether an LLM exhibits a sufficiently robust world model, defined as producing consistent outputs across semantically equivalent prompts while distinguishing between prompts that express different intents. We introduce a new evaluation approach to measure this that decomposes model response variability into three components: variability due to user purpose, user articulation, and model instability. An LLM with a strong world model should attribute most of the variability in its responses to changes in foundational purpose rather than superficial changes in articulation. This approach allows us to quantify how much of a model's behavior is semantically grounded rather than driven by model instability or alternative wording. We apply this framework to evaluate LLMs across diverse domains. Our results show how larger models attribute a greater share of output variability to changes in user purpose, indicating a more robust world model. This improvement is not uniform, however: larger models do not consistently outperform smaller ones across all domains, and their advantage in robustness is often modest. 
These findings highlight the importance of moving beyond accuracy-based benchmarks toward semantic diagnostics that more directly assess the structure and stability of a model's internal understanding of the world.
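The three-way split described in this abstract can be illustrated with the law of total variance. The following is a minimal sketch, not the authors' estimator: it assumes each response has already been reduced to a scalar score (a hypothetical choice) and that the design is balanced, with articulations nested inside purposes and several repeated samples per prompt. The function name `decompose_variance` is introduced here for illustration only.

```python
import numpy as np

def decompose_variance(y):
    """Split total variance of scores y[purpose, articulation, repeat].

    Assumption: scalar scores and a balanced design; this is an
    illustrative law-of-total-variance split, not the paper's estimator.
    """
    y = np.asarray(y, dtype=float)
    cell_mean = y.mean(axis=2)             # mean per (purpose, articulation) prompt
    purpose_mean = cell_mean.mean(axis=1)  # mean per purpose

    var_purpose = purpose_mean.var()                 # between purposes
    var_articulation = cell_mean.var(axis=1).mean()  # between wordings, within purpose
    var_instability = y.var(axis=2).mean()           # between repeated samples of one prompt
    total = y.var()

    return {
        "purpose": var_purpose / total,
        "articulation": var_articulation / total,
        "instability": var_instability / total,
    }

# Synthetic example: 4 purposes, 5 wordings each, 6 samples per wording.
rng = np.random.default_rng(0)
scores = (
    rng.normal(0.0, 2.0, size=(4, 1, 1))    # strong purpose effect
    + rng.normal(0.0, 0.5, size=(4, 5, 1))  # weaker wording effect
    + rng.normal(0.0, 0.3, size=(4, 5, 6))  # sampling noise
)
shares = decompose_variance(scores)
```

With population variances (NumPy's default `ddof=0`) and a balanced design the three shares sum to one, so each can be read as a fraction of total response variability. Under this reading, a larger `purpose` share corresponds to what the abstract calls a more robust world model.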

[48] A Scoping Review of Synthetic Data Generation for Biomedical Research and Applications

Hanshu Rao,Weisi Liu,Haohan Wang,I-Chan Huang,Zhe He,Xiaolei Huang

Main category: cs.CL

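TL;DR: This scoping review surveys the trends, methods, and challenges of using large language models to generate synthetic biomedical data, highlighting both the potential and the shortcomings in clinical domains.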

Details Motivation: Synthetic data generation, especially with large language models (LLMs), has advanced rapidly as a way to address data scarcity, privacy concerns, and data quality problems in biomedical fields. Method: The review follows PRISMA-ScR guidelines and synthesizes 59 studies published between 2020 and 2025, collected from PubMed, ACM, Web of Science, and Google Scholar. Result: The analysis finds that (1) 78.0% of studies involve unstructured text, 13.6% tabular data, and 8.4% multimodal data; (2) 72.9% use prompting, 22.0% fine-tune LLMs, and 5.1% use specialized models; (3) evaluations include intrinsic metrics (27.1%), human-in-the-loop assessments (55.9%), and LLM-based evaluations (13.6%). Conclusion: The paper summarizes current trends, methods, and evaluations in LLM-based synthetic data generation for biomedicine, and identifies limitations in adaptation across clinical domains, resource accessibility, and evaluation standardization. Abstract: Synthetic data generation--mitigating data scarcity, privacy concerns, and data quality challenges in biomedical fields--has been facilitated by rapid advances of large language models (LLMs). This scoping review follows PRISMA-ScR guidelines and synthesizes 59 studies, published between 2020 and 2025 and collected from PubMed, ACM, Web of Science, and Google Scholar. The review systematically examines biomedical research and application trends in synthetic data generation, emphasizing clinical applications, methodologies, and evaluations. Our analysis identifies data modalities of unstructured texts (78.0%), tabular data (13.6%), and multimodal sources (8.4%); generation methods of prompting (72.9%), fine-tuning (22.0%) LLMs and specialized model (5.1%); and heterogeneous evaluations of intrinsic metrics (27.1%), human-in-the-loop assessments (55.9%), and LLM-based evaluations (13.6%). The analysis addresses current limitations in what, where, and how health professionals can leverage synthetic data generation for biomedical domains. Our review also highlights challenges in adaption across clinical domains, resource and model accessibility, and evaluation standardizations.

[49] Modeling Public Perceptions of Science in Media

Jiaxin Pei,Dustin Wright,Isabelle Augenstein,David Jurgens

Main category: cs.CL

TL;DR: This paper presents a computational framework for modeling public perception of science across multiple dimensions, develops NLP models to predict these perceptions, and demonstrates their impact on public engagement using a large dataset and a real-world Reddit experiment.

Details Motivation: Effective public engagement with science is essential for building trust and understanding in the scientific community. However, science communicators face challenges in anticipating how audiences will interact with scientific information due to the increasing volume of available data. Method: The authors developed a computational framework that models public perception across twelve dimensions, created a large-scale science news perception dataset, and used NLP models to predict public perception scores. They also conducted a natural experiment on Reddit to analyze public engagement. Result: A framework for modeling public perception across twelve dimensions was introduced. A dataset with 10,489 annotations from diverse populations was created. NLP models were developed to predict perception scores with strong performance. The study revealed that frequency of science news consumption influences perception more than demographic factors. On Reddit, posts with higher positive perception scores received more comments and upvotes. Conclusion: This research highlights the significance of detailed perception modeling in science communication and provides new methods for predicting public interest and engagement with scientific content. Abstract: Effectively engaging the public with science is vital for fostering trust and understanding in our scientific community. Yet, with an ever-growing volume of information, science communicators struggle to anticipate how audiences will perceive and interact with scientific news. In this paper, we introduce a computational framework that models public perception across twelve dimensions, such as newsworthiness, importance, and surprisingness. Using this framework, we create a large-scale science news perception dataset with 10,489 annotations from 2,101 participants from diverse US and UK populations, providing valuable insights into public responses to scientific information across domains. 
We further develop NLP models that predict public perception scores with a strong performance. Leveraging the dataset and model, we examine public perception of science from two perspectives: (1) Perception as an outcome: What factors affect the public perception of scientific information? (2) Perception as a predictor: Can we use the estimated perceptions to predict public engagement with science? We find that individuals' frequency of science news consumption is the driver of perception, whereas demographic factors exert minimal influence. More importantly, through a large-scale analysis and carefully designed natural experiment on Reddit, we demonstrate that the estimated public perception of scientific information has direct connections with the final engagement pattern. Posts with more positive perception scores receive significantly more comments and upvotes, which is consistent across different scientific information and for the same science, but are framed differently. Overall, this research underscores the importance of nuanced perception modeling in science communication, offering new pathways to predict public interest and engagement with scientific content.

[50] Initial Investigation of LLM-Assisted Development of Rule-Based Clinical NLP System

Jianlin Shi,Brian T. Bucher

Main category: cs.CL

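TL;DR: This study explores how large language models (LLMs) can assist the development of rule-based natural language processing (NLP) systems to make clinical NLP tasks more efficient and effective.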

Details Motivation: Despite advances in machine learning and large language models, rule-based NLP systems remain widely used in clinical settings because of their interpretability and efficiency. However, developing and maintaining them by hand is costly, especially for tasks with large linguistic variability. Method: The paper proposes a new approach that brings LLMs into the development phase of rule-based NLP systems. The experiments focus on the first two steps of development: finding relevant snippets in clinical notes, and extracting keywords from those snippets for rule-based named entity recognition (NER). Result: The LLMs showed excellent recall in identifying clinically relevant text snippets (Deepseek: 0.98, Qwen: 0.99) and scored 1.0 in extracting key terms for NER. Conclusion: The study points to a promising new direction for NLP development, enabling automated or semi-automated development of rule-based systems that is faster, more cost-effective, and more transparent than deep learning based approaches. Abstract: Despite advances in machine learning (ML) and large language models (LLMs), rule-based natural language processing (NLP) systems remain active in clinical settings due to their interpretability and operational efficiency. However, their manual development and maintenance are labor-intensive, particularly in tasks with large linguistic variability. To overcome these limitations, we proposed a novel approach employing LLMs solely during the rule-based systems development phase. We conducted the initial experiments focusing on the first two steps of developing a rule-based NLP pipeline: find relevant snippets from the clinical note; extract informative keywords from the snippets for the rule-based named entity recognition (NER) component. Our experiments demonstrated exceptional recall in identifying clinically relevant text snippets (Deepseek: 0.98, Qwen: 0.99) and 1.0 in extracting key terms for NER. This study sheds light on a promising new direction for NLP development, enabling semi-automated or automated development of rule-based systems with significantly faster, more cost-effective, and transparent execution compared with deep learning model-based solutions.

[51] GeoGuess: Multimodal Reasoning based on Hierarchy of Visual Information in Street View

Fenghua Cheng,Jinxiang Wang,Sen Wang,Zi Huang,Xue Li

Main category: cs.CL

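TL;DR: This paper proposes GeoGuess, a new multimodal reasoning task that requires identifying the location of a street view image and giving a detailed explanation, and develops SightSense, a method for high-performing reasoning and explanation generation.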

Details Motivation: Existing tasks for evaluating multimodal reasoning rarely address hierarchical visual clues at different levels of granularity, such as local details and global context, even though these are frequently involved in real scenarios. To fill this gap, the authors introduce a new task, GeoGuess. Method: The paper proposes SightSense, a multimodal and multilevel reasoning method that makes predictions and generates explanations based on a hierarchy of visual information and external knowledge. Result: A benchmark dataset, GeoExplain, is established, and SightSense shows outstanding performance on the GeoGuess task. Conclusion: GeoGuess is a new task that requires reasoning over multilevel visual information and geographic knowledge; SightSense performs well on it and can generate comprehensive explanations. Abstract: Multimodal reasoning is a process of understanding, integrating and inferring information across different data modalities. It has recently attracted surging academic attention as a benchmark for Artificial Intelligence (AI). Although there are various tasks for evaluating multimodal reasoning ability, they still have limitations. Lack of reasoning on hierarchical visual clues at different levels of granularity, e.g., local details and global context, is of little discussion, despite its frequent involvement in real scenarios. To bridge the gap, we introduce a novel and challenging task for multimodal reasoning, namely GeoGuess. Given a street view image, the task is to identify its location and provide a detailed explanation. A system that succeeds in GeoGuess should be able to detect tiny visual clues, perceive the broader landscape, and associate with vast geographic knowledge. Therefore, GeoGuess would require the ability to reason between hierarchical visual information and geographic knowledge. In this work, we establish a benchmark for GeoGuess by introducing a specially curated dataset GeoExplain which consists of panoramas-geocoordinates-explanation tuples. Additionally, we present a multimodal and multilevel reasoning method, namely SightSense which can make prediction and generate comprehensive explanation based on hierarchy of visual information and external knowledge. Our analysis and experiments demonstrate their outstanding performance in GeoGuess.

[52] Long-Context Generalization with Sparse Attention

Pavlo Vasylenko,Marcos Treviso,André F. T. Martins

Main category: cs.CL

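TL;DR: Sparse attention based on $\alpha$-entmax, extended with a learnable temperature (ASEntmax) and paired with suitable position encodings, generalizes to long contexts better than softmax-based attention.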

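Details Motivation: Softmax attention produces dense distributions over all tokens; as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse on tasks that demand precise focus on fixed-size patterns. Method: The paper uses sparse attention based on $\alpha$-entmax, which can assign exact zeros to irrelevant tokens, and introduces Adaptive-Scalable Entmax (ASEntmax), which adds a learnable temperature parameter so that attention can interpolate between sparse and dense regimes; it also studies the design of position encodings. Result: Models that integrate ASEntmax into standard transformer layers alongside proper positional encodings greatly outperform softmax, scalable softmax, and fixed-temperature $\alpha$-entmax baselines on long-context generalization. Conclusion: Sparse attention with a learnable temperature, combined with carefully designed position encodings, improves the ability to locate and generalize fixed-size patterns in long contexts. Abstract: Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that sparse attention mechanisms using $\alpha$-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows $\alpha$-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Finally, we show that the ability to locate and generalize fixed-size patterns can be further improved through a careful design of position encodings, which impacts both dense and sparse attention methods. By integrating ASEntmax into standard transformer layers alongside proper positional encodings, we show that our models greatly outperform softmax, scalable softmax, and fixed-temperature $\alpha$-entmax baselines on long-context generalization.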
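The contrast between dense and sparse attention weights can be seen in a small sketch. It uses sparsemax, the $\alpha = 2$ special case of $\alpha$-entmax, because it has a simple closed-form solution; general $\alpha$ and the learnable temperature of ASEntmax are not reproduced here, and the helper names are illustrative rather than taken from the paper.

```python
import numpy as np

def softmax(z):
    """Dense attention weights: every entry is strictly positive."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def sparsemax(z):
    """Sparsemax (alpha-entmax with alpha = 2): Euclidean projection of the
    scores onto the probability simplex, which can yield exact zeros."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]       # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum  # True for entries kept in the support
    k_max = k[support][-1]               # size of the support
    tau = (cumsum[k_max - 1] - 1) / k_max  # threshold subtracted from all scores
    return np.maximum(z - tau, 0.0)

scores = np.array([1.0, 0.8, -2.0])
dense = softmax(scores)     # all three weights are non-zero
sparse = sparsemax(scores)  # approximately [0.6, 0.4, 0.0]
```

Here the low-scoring token receives exactly zero weight under sparsemax, while softmax still assigns it a small positive probability. That residual mass is what the abstract describes as accumulating over long sequences.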

[53] Arch-Router: Aligning LLM Routing with Human Preferences

Co Tran,Salman Paracha,Adil Hafeez,Shuguang Chen

Main category: cs.CL

TL;DR: Arch-Router is a practical framework for enhancing model selection in LLMs by aligning with user-defined preferences.

Details Motivation: Existing LLM routing approaches are limited in capturing human preferences and selecting from a narrow pool of models. Method: A compact 1.5B model called Arch-Router was developed to map queries to domain-action preferences for model routing decisions. Result: Experiments showed that Arch-Router outperforms top proprietary models in matching queries with human preferences. Conclusion: The proposed Arch-Router framework improves the routing of large language models by aligning with user preferences, offering transparency and flexibility. Abstract: With the rapid proliferation of large language models (LLMs) -- each optimized for different strengths, style, or latency/cost profile -- routing has become an essential technique to operationalize the use of different models. However, existing LLM routing approaches are limited in two key ways: they evaluate performance using benchmarks that often fail to capture human preferences driven by subjective evaluation criteria, and they typically select from a limited pool of models. In this work, we propose a preference-aligned routing framework that guides model selection by matching queries to user-defined domains (e.g., travel) or action types (e.g., image editing) -- offering a practical mechanism to encode preferences in routing decisions. Specifically, we introduce \textbf{Arch-Router}, a compact 1.5B model that learns to map queries to domain-action preferences for model routing decisions. Our approach also supports seamlessly adding new models for routing without requiring retraining or architectural modifications. Experiments on conversational datasets demonstrate that our approach achieves state-of-the-art (SOTA) results in matching queries with human preferences, outperforming top proprietary models. Our approach captures subjective evaluation criteria and makes routing decisions more transparent and flexible. Our model is available at: \texttt{https://huggingface.co/katanemo/Arch-Router-1.5B}.
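The preference-based routing described above can be sketched as a lookup from user-defined routes to downstream models. This is a hedged illustration only: the keyword-overlap matcher below is a stand-in for the learned Arch-Router model, and the `Route` class, the `pick_route` function, and the model identifiers are hypothetical names introduced here.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    name: str         # user-defined domain or action label, e.g. "travel"
    description: str  # plain-language description of the preference
    model: str        # hypothetical identifier of the downstream model

def _words(text):
    """Lowercase alphabetic tokens of a string, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))

def pick_route(query, routes, default):
    """Stand-in for the router model: pick the route whose description
    shares the most words with the query, or the default if none match."""
    query_words = _words(query)
    best = max(routes, key=lambda r: len(query_words & _words(r.description)))
    if not query_words & _words(best.description):
        return default
    return best

ROUTES = [
    Route("travel", "book flight hotel trip itinerary travel", "model-a"),
    Route("image_editing", "edit crop resize image photo", "model-b"),
]
DEFAULT = Route("other", "anything else", "model-default")

chosen = pick_route("Please book a flight to Lisbon", ROUTES, DEFAULT)
```

Because the routing policy is plain data, adding a model amounts to appending another `Route`. This mirrors the abstract's point that new models can be added without retraining, although here that follows trivially from the keyword stand-in rather than from a learned router.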

[54] Mechanisms vs. Outcomes: Probing for Syntax Fails to Explain Performance on Targeted Syntactic Evaluations

Ananth Agarwal,Jasper Jian,Christopher D. Manning,Shikhar Murty

Main category: cs.CL

TL;DR: The study investigates how Large Language Models represent syntactic structure, revealing that current probing methods do not reliably predict syntactic performance in downstream tasks.

Details Motivation: While Large Language Models exhibit mastery of syntax, the precise mechanism by which they represent syntactic structure is an open area within interpretability research. Method: Adopting a 'mechanisms vs. outcomes' framework, we evaluate 32 open-weight transformer models. Result: Our results highlight a substantial disconnect between latent syntactic representations found via probing and observable syntactic behaviors in downstream tasks. Conclusion: Syntactic features extracted via probing fail to predict outcomes of targeted syntax evaluations across English linguistic phenomena. Abstract: Large Language Models (LLMs) exhibit a robust mastery of syntax when processing and generating text. While this suggests internalized understanding of hierarchical syntax and dependency relations, the precise mechanism by which they represent syntactic structure is an open area within interpretability research. Probing provides one way to identify the mechanism of syntax being linearly encoded in activations, however, no comprehensive study has yet established whether a model's probing accuracy reliably predicts its downstream syntactic performance. Adopting a "mechanisms vs. outcomes" framework, we evaluate 32 open-weight transformer models and find that syntactic features extracted via probing fail to predict outcomes of targeted syntax evaluations across English linguistic phenomena. Our results highlight a substantial disconnect between latent syntactic representations found via probing and observable syntactic behaviors in downstream tasks.

[55] LegiGPT: Party Politics and Transport Policy with Large Language Model

Hyunsoo Yun,Eun Hak Lee

Main category: cs.CL

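TL;DR: This study develops LegiGPT, a new framework that combines a large language model with explainable AI to analyze transportation-related legislative proposals, and finds that lawmakers' political affiliations and other factors significantly influence policymaking.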

Details Motivation: Lawmakers' political ideologies strongly influence legislative decision-making, so understanding their impact is critical for policymaking. Method: The framework uses a multi-stage filtering and classification pipeline based on zero-shot prompting with GPT-4, and applies XAI techniques to examine the relationships between party affiliation and associated attributes. Result: The results show that the number and proportion of conservative and progressive sponsors, along with district size and electoral population, are key determinants of legislative outcomes. Conclusion: By integrating a large language model with explainable AI, LegiGPT provides a valuable tool for understanding legislative dynamics and guiding future policy development. Abstract: Given the significant influence of lawmakers' political ideologies on legislative decision-making, understanding their impact on policymaking is critically important. We introduce a novel framework, LegiGPT, which integrates a large language model (LLM) with explainable artificial intelligence (XAI) to analyze transportation-related legislative proposals. LegiGPT employs a multi-stage filtering and classification pipeline using zero-shot prompting with GPT-4. Using legislative data from South Korea's 21st National Assembly, we identify key factors - including sponsor characteristics, political affiliations, and geographic variables - that significantly influence transportation policymaking. The LLM was used to classify transportation-related bill proposals through a stepwise filtering process based on keywords, phrases, and contextual relevance. XAI techniques were then applied to examine relationships between party affiliation and associated attributes. The results reveal that the number and proportion of conservative and progressive sponsors, along with district size and electoral population, are critical determinants shaping legislative outcomes. These findings suggest that both parties contributed to bipartisan legislation through different forms of engagement, such as initiating or supporting proposals. This integrated approach provides a valuable tool for understanding legislative dynamics and guiding future policy development, with broader implications for infrastructure planning and governance.

[56] ReasonGRM: Enhancing Generative Reward Models through Large Reasoning Models

Bin Chen,Xinzge Gao,Chuanrui Hu,Penghang Yu,Hua Zhang,Bing-Kun Bao

Main category: cs.CL

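TL;DR: This paper proposes ReasonGRM, a framework that improves the reasoning paths and training of generative reward models, markedly improving preference modeling.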

Details Motivation: Generative Reward Models (GRMs) have limited reasoning capabilities, which makes them prone to hallucinations or to missing key information, so they need improvement to perform well on complex tasks. Method: The paper proposes a three-stage generative reward modeling framework (ReasonGRM): Zero-RL generates reasoning paths, the $R^\star$ scoring metric filters those paths, and reinforcement learning further refines the model. Result: On three public benchmarks, ReasonGRM outperforms the previous best GRMs by 1.8% on average and surpasses proprietary models such as GPT-4o by up to 5.6%. Conclusion: Through reasoning-aware training and the selection of high-quality reasoning paths, ReasonGRM achieves strong performance among generative reward models, demonstrating its effectiveness for preference modeling. Abstract: Generative Reward Models (GRMs) provide greater flexibility than scalar reward models in capturing human preferences, but their effectiveness is limited by poor reasoning capabilities. This often results in incomplete or overly speculative reasoning paths, leading to hallucinations or missing key information in complex tasks. We address this challenge with ReasonGRM, a three-stage generative reward modeling framework. In the first stage, Zero-RL is used to generate concise, outcome-directed reasoning paths that reduce the likelihood of critical omissions. In the second stage, we introduce a novel evaluation metric, $R^\star$, which scores reasoning paths based on their generation likelihood. This favors paths that reach correct answers with minimal exploration, helping to reduce hallucination-prone data during training. In the final stage, the model is further refined through reinforcement learning on challenging examples to enhance its preference discrimination capabilities. Experiments on three public benchmarks show that ReasonGRM achieves competitive or state-of-the-art performance, outperforming previous best GRMs by 1.8\% on average and surpassing proprietary models such as GPT-4o by up to 5.6\%. These results demonstrate the effectiveness of reasoning-aware training and highlight the importance of high-quality rationale selection for reliable preference modeling.

[57] The Role of Model Confidence on Bias Effects in Measured Uncertainties

Xinyi Liu,Weiguang Wang,Hangfeng He

Main category: cs.CL

TL;DR: This paper investigates how prompt-induced bias impacts uncertainty quantification in LLMs, finding that reducing such bias improves uncertainty estimation, especially for epistemic uncertainty, with effects varying based on model confidence levels.

Details Motivation: Accurately assessing epistemic uncertainty in Large Language Models (LLMs) is crucial for reliable outcomes in open-ended tasks, but it is challenging due to the presence of aleatoric uncertainty and the influence of bias. Understanding how bias affects these uncertainties can improve uncertainty quantification techniques. Method: The authors conducted experiments on Visual Question Answering (VQA) tasks using GPT-4o and Qwen2-VL to analyze how prompt biases affect epistemic and aleatoric uncertainty estimation. They built on prior work showing that LLMs tend to copy input information when confidence is low. Result: Mitigating prompt-introduced bias improves uncertainty quantification in GPT-4o. Biases cause greater changes in both epistemic and aleatoric uncertainties when bias-free model confidence is lower. Lower confidence leads to underestimation of epistemic uncertainty due to bias (overconfidence), while its effect on aleatoric uncertainty estimation is neutral. Conclusion: The study concludes that mitigating prompt-introduced bias enhances uncertainty quantification in LLMs, particularly for epistemic uncertainty, and highlights the differential impact of bias on uncertainties depending on model confidence levels. Abstract: With the growing adoption of Large Language Models (LLMs) for open-ended tasks, accurately assessing epistemic uncertainty, which reflects a model's lack of knowledge, has become crucial to ensuring reliable outcomes. However, quantifying epistemic uncertainty in such tasks is challenging due to the presence of aleatoric uncertainty, which arises from multiple valid answers. While bias can introduce noise into epistemic uncertainty estimation, it may also reduce noise from aleatoric uncertainty. To investigate this trade-off, we conduct experiments on Visual Question Answering (VQA) tasks and find that mitigating prompt-introduced bias improves uncertainty quantification in GPT-4o. 
Building on prior work showing that LLMs tend to copy input information when model confidence is low, we further analyze how these prompt biases affect measured epistemic and aleatoric uncertainty across varying bias-free confidence levels with GPT-4o and Qwen2-VL. We find that all considered biases induce greater changes in both uncertainties when bias-free model confidence is lower. Moreover, lower bias-free model confidence leads to greater underestimation of epistemic uncertainty (i.e. overconfidence) due to bias, whereas it has no significant effect on the direction of changes in aleatoric uncertainty estimation. These distinct effects deepen our understanding of bias mitigation for uncertainty quantification and potentially inform the development of more advanced techniques.

[58] LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization

Daejin Jo,Jeeyoung Yun,Byungseok Roh,Sungwoong Kim

Main category: cs.CL

TL;DR: LM-SPT improves speech tokenization by enhancing semantic alignment with language models through a novel distillation approach, leading to better performance in speech-to-text and text-to-speech tasks.

Details Motivation: To address the issue of speech token sequences being longer than textual counterparts, which hampers efficient speech-language modeling, and to improve semantic alignment by avoiding distortion caused by traditional pooling techniques. Method: The paper proposes LM-SPT, which introduces a novel semantic distillation process. It reconstructs speech from semantic tokens and minimizes discrepancies between original and reconstructed waveform representations using a frozen ASR encoder. It also includes architectural improvements for encoder and decoder design. Result: LM-SPT achieves better reconstruction fidelity compared to baselines. SLMs trained on LM-SPT tokens perform competitively on speech-to-text tasks and outperform baselines on text-to-speech tasks. Conclusion: LM-SPT is a promising speech tokenization method that enhances semantic alignment with language models, offering multiple frame rates and achieving superior performance in speech-to-text and text-to-speech tasks. Abstract: With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models. In particular, previous methods use SSL teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer to suppress acoustic redundancy as well as capture content-related latent structures. However, they still produce speech token sequences significantly longer than their textual counterparts, creating challenges for efficient speech-language modeling. Reducing the frame rate is a natural solution, but standard techniques, such as rigid average pooling across frames, can distort or dilute the semantic structure required for effective LM alignment. 
To address this, we propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation. Instead of directly matching teacher and student features via pooling, we reconstruct speech solely from semantic tokens and minimize the discrepancy between the encoded representations of the original and reconstructed waveforms, obtained from a frozen automatic speech recognition (ASR) encoder. This indirect yet data-driven supervision enables the tokenizer to learn discrete units that are more semantically aligned with language models. LM-SPT further incorporates architectural improvements to the encoder and decoder for speech tokenization, and supports multiple frame rates, including 25Hz, 12.5Hz, and 6.25Hz. Experimental results show that LM-SPT achieves superior reconstruction fidelity compared to baselines, and that SLMs trained with LM-SPT tokens achieve competitive performances on speech-to-text and consistently outperform baselines on text-to-speech tasks.
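The baseline that this abstract contrasts against, rigid average pooling across frames, can be sketched in a few lines. The sketch assumes a 50 Hz input feature rate (the frame rate of HuBERT-style encoders) so that pooling factors of 2, 4, and 8 give the 25 Hz, 12.5 Hz, and 6.25 Hz rates mentioned above; `average_pool` is an illustrative helper and is not LM-SPT's distillation method.

```python
import numpy as np

def average_pool(frames, factor):
    """Reduce the frame rate by averaging non-overlapping groups of frames.

    frames: array of shape (num_frames, feature_dim).
    Trailing frames that do not fill a complete group are dropped.
    """
    frames = np.asarray(frames, dtype=float)
    usable = (frames.shape[0] // factor) * factor
    grouped = frames[:usable].reshape(-1, factor, frames.shape[1])
    return grouped.mean(axis=1)

# Two seconds of hypothetical 50 Hz features with 4 dimensions each.
features_50hz = np.arange(100 * 4, dtype=float).reshape(100, 4)
features_25hz = average_pool(features_50hz, 2)    # 50 frames
features_12_5hz = average_pool(features_50hz, 4)  # 25 frames
features_6_25hz = average_pool(features_50hz, 8)  # 12 frames (4 trailing frames dropped)
```

Every pooled frame is a fixed, content-agnostic average of its neighbours. This is the sense in which the abstract calls such pooling rigid and liable to dilute semantic structure.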

[59] Language-Informed Synthesis of Rational Agent Models for Grounded Theory-of-Mind Reasoning On-The-Fly

Lance Ying,Ryan Truong,Katherine M. Collins,Cedegao E. Zhang,Megan Wei,Tyler Brooke-Wilson,Tan Zhi-Xuan,Lionel Wong,Joshua B. Tenenbaum

Main category: cs.CL

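TL;DR: This paper proposes LIRAS, a framework that combines language and visual inputs and uses multimodal models with Bayesian inference to make social judgments, outperforming existing methods.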

Details Motivation: Real-world social reasoning requires taking multimodal information into account, and language provides both abstract and concrete information that cannot be obtained through visual observation. Method: The paper proposes the LIRAS framework, which combines multimodal language models with a Bayesian inverse planning engine to construct structured, situation-specific agent and environment representations. Result: On a range of tasks derived from cognitive science experiments, LIRAS performs strongly in capturing human judgments across domains. Conclusion: The LIRAS framework effectively integrates linguistic and visual inputs for context-specific social reasoning and outperforms existing models in capturing human judgments. Abstract: Drawing real world social inferences usually requires taking into account information from multiple modalities. Language is a particularly powerful source of information in social settings, especially in novel situations where language can provide both abstract information about the environment dynamics and concrete specifics about an agent that cannot be easily visually observed. In this paper, we propose Language-Informed Rational Agent Synthesis (LIRAS), a framework for drawing context-specific social inferences that integrate linguistic and visual inputs. LIRAS frames multimodal social reasoning as a process of constructing structured but situation-specific agent and environment representations - leveraging multimodal language models to parse language and visual inputs into unified symbolic representations, over which a Bayesian inverse planning engine can be run to produce granular probabilistic judgments. On a range of existing and new social reasoning tasks derived from cognitive science experiments, we find that our model (instantiated with a comparatively lightweight VLM) outperforms ablations and state-of-the-art models in capturing human judgments across all domains.
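Bayesian inverse planning, the inference step named in this abstract, can be illustrated on a toy problem. The sketch assumes an agent moving on a one-dimensional line that chooses steps with Boltzmann-rational (softmax) probability; the environment, the `beta` rationality parameter, and the function names are hypothetical, whereas LIRAS builds its agent and environment models from language and visual input.

```python
import math

ACTIONS = (-1, +1)  # step left or step right on a line

def action_likelihood(pos, action, goal, beta=2.0):
    """P(action | goal, pos) for a Boltzmann-rational agent whose utility
    is the negative distance to the goal after taking the step."""
    utils = {a: -abs(goal - (pos + a)) for a in ACTIONS}
    norm = sum(math.exp(beta * u) for u in utils.values())
    return math.exp(beta * utils[action]) / norm

def goal_posterior(start, actions, goals, beta=2.0):
    """Posterior over candidate goals given an observed action sequence,
    starting from a uniform prior."""
    log_post = {g: math.log(1.0 / len(goals)) for g in goals}
    pos = start
    for a in actions:
        for g in goals:
            log_post[g] += math.log(action_likelihood(pos, a, g, beta))
        pos += a
    top = max(log_post.values())  # subtract the max before exponentiating
    weights = {g: math.exp(v - top) for g, v in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

# The agent starts at 0 and steps right twice; candidate goals are -3 and 3.
posterior = goal_posterior(start=0, actions=[+1, +1], goals=(-3, 3))
```

After two steps to the right, almost all posterior mass sits on the goal at 3. The output is a probability over goals rather than a single label, which matches the graded probabilistic judgments the abstract describes.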

[60] SocialSim: Towards Socialized Simulation of Emotional Support Conversation

Zhuang Chen,Yaru Cao,Guanqun Bi,Jincenzi Wu,Jinfeng Zhou,Xiyao Xiao,Si Chen,Hongning Wang,Minlie Huang

Main category: cs.CL

TL;DR: This paper proposes SocialSim, a new framework for generating high-quality emotional support conversation data by modeling key aspects of social interaction, outperforming traditional crowdsourced datasets.

Details Motivation: Creating large-scale emotional support conversation datasets through crowdsourcing is costly. Existing methods using language models often neglect the social dynamics crucial for effective emotional support simulations. Method: The authors introduced SocialSim, a framework that simulates emotional support conversations by incorporating social disclosure (for seekers) and social awareness (for supporters). They built a synthetic dataset called SSConv using this framework and trained a chatbot on it. Result: The SSConv dataset generated via SocialSim rivals, and in some cases surpasses, the quality of crowdsourced emotional support data. The chatbot trained on SSConv achieved state-of-the-art performance in both automatic and human evaluations. Conclusion: SocialSim offers a scalable way to synthesize emotional support conversations, enhancing the accessibility and practicality of emotional care. Abstract: Emotional support conversation (ESC) helps reduce people's psychological stress and provide emotional value through interactive dialogues. Due to the high cost of crowdsourcing a large ESC corpus, recent attempts use large language models for dialogue augmentation. However, existing approaches largely overlook the social dynamics inherent in ESC, leading to less effective simulations. In this paper, we introduce SocialSim, a novel framework that simulates ESC by integrating key aspects of social interactions: social disclosure and social awareness. On the seeker side, we facilitate social disclosure by constructing a comprehensive persona bank that captures diverse and authentic help-seeking scenarios. On the supporter side, we enhance social awareness by eliciting cognitive reasoning to generate logical and supportive responses. Building upon SocialSim, we construct SSConv, a large-scale synthetic ESC corpus of which quality can even surpass crowdsourced ESC data. 
We further train a chatbot on SSConv and demonstrate its state-of-the-art performance in both automatic and human evaluations. We believe SocialSim offers a scalable way to synthesize ESC, making emotional care more accessible and practical.

[61] Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models

Lei Jiang,Zixun Zhang,Zizhou Wang,Xiaobing Sun,Zhen Li,Liangli Zhen,Xiaohua Xu

Main category: cs.CL

TL;DR: This paper proposes CAMO, a new attack method that bypasses safety measures in LVLMs by splitting and reconstructing harmful instructions using both text and images, highlighting critical security vulnerabilities.

Details Motivation: LVLMs are vulnerable to jailbreak attacks that bypass safety measures, but existing methods are detectable and inefficient. A more stealthy and efficient approach is needed. Method: The paper introduces CAMO, a black-box jailbreak framework that splits malicious prompts into benign visual and textual fragments, exploiting cross-modal reasoning to reconstruct harmful instructions covertly. Result: CAMO demonstrates strong performance and cross-model transferability while requiring fewer queries, effectively evading detection systems. Conclusion: CAMO highlights significant vulnerabilities in current LVLM safety mechanisms and emphasizes the urgent need for alignment-aware security solutions. Abstract: Large Vision-Language Models (LVLMs) demonstrate exceptional performance across multimodal tasks, yet remain vulnerable to jailbreak attacks that bypass built-in safety mechanisms to elicit restricted content generation. Existing black-box jailbreak methods primarily rely on adversarial textual prompts or image perturbations, yet these approaches are highly detectable by standard content filtering systems and exhibit low query and computational efficiency. In this work, we present Cross-modal Adversarial Multimodal Obfuscation (CAMO), a novel black-box jailbreak attack framework that decomposes malicious prompts into semantically benign visual and textual fragments. By leveraging LVLMs' cross-modal reasoning abilities, CAMO covertly reconstructs harmful instructions through multi-step reasoning, evading conventional detection mechanisms. Our approach supports adjustable reasoning complexity and requires significantly fewer queries than prior attacks, enabling both stealth and efficiency. Comprehensive evaluations conducted on leading LVLMs validate CAMO's effectiveness, showcasing robust performance and strong cross-model transferability. 
These results underscore significant vulnerabilities in current built-in safety mechanisms, emphasizing an urgent need for advanced, alignment-aware security and safety solutions in vision-language systems.

[62] DistillNote: LLM-based clinical note summaries improve heart failure diagnosis

Heloisa Oss Boll,Antonio Oss Boll,Leticia Puttlitz Boll,Ameen Abu Hanna,Iacer Calixto

Main category: cs.CL

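TL;DR: Distillnote proposes an efficient clinical note summarization method that significantly improves heart failure prediction performance and reduces hallucinations.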

Details Motivation: To ease the documentation burden on healthcare providers by using large language models to generate high-quality clinical note summaries. Method: The authors propose an LLM-based clinical note summarization framework with three techniques (One-step, Structured, and Distilled) and evaluate the summaries using compression rate and AUPRC. Result: Distilled summaries achieve 79% text compression and up to an 18.2% improvement in AUPRC, with an average compression-to-performance ratio of 6.9x. Conclusion: The Distillnote framework performs well at clinical note summarization; its distilled summaries enable efficient and accurate heart failure prediction and reduce model hallucinations. Abstract: Large language models (LLMs) offer unprecedented opportunities to generate concise summaries of patient information and alleviate the burden of clinical documentation that overwhelms healthcare providers. We present Distillnote, a framework for LLM-based clinical note summarization, and generate over 64,000 admission note summaries through three techniques: (1) One-step, direct summarization, and a divide-and-conquer approach involving (2) Structured summarization focused on independent clinical insights, and (3) Distilled summarization that further condenses the Structured summaries. We test how useful the summaries are by using them to predict heart failure compared to a model trained on the original notes. Distilled summaries achieve 79% text compression and up to 18.2% improvement in AUPRC compared to an LLM trained on the full notes. We also evaluate the quality of the generated summaries in an LLM-as-judge evaluation as well as through blinded pairwise comparisons with clinicians. Evaluations indicate that one-step summaries are favoured by clinicians according to relevance and clinical actionability, while distilled summaries offer optimal efficiency (avg. 6.9x compression-to-performance ratio) and significantly reduce hallucinations. We release our summaries on PhysioNet to encourage future research.

[63] MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning

Muyang Zheng,Yuanzhi Yao,Changting Lin,Rui Wang,Meng Han

Main category: cs.CL

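TL;DR: MIST is a method for jailbreaking black-box large language models via iterative semantic tuning, achieving high attack success rates and attack transferability.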

Details Motivation: Despite efforts to align large language models (LLMs) with societal and moral values, these models remain susceptible to jailbreak attacks. Jailbreaking black-box LLMs is considered challenging because of the discrete nature of token inputs, restricted access to the target LLM, and limited query budgets. Method: MIST uses sequential synonym search and its advanced version, order-determining optimization, to iteratively refine prompts while preserving the original semantic intent. Result: Experiments show that MIST is competitive with other state-of-the-art white-box and black-box jailbreak methods in attack success rate and attack transferability, and they validate its computational efficiency and practical viability. Conclusion: MIST offers an effective way to jailbreak black-box large language models, balancing semantic similarity with computational efficiency. Abstract: Despite efforts to align large language models (LLMs) with societal and moral values, these models remain susceptible to jailbreak attacks--methods designed to elicit harmful responses. Jailbreaking black-box LLMs is considered challenging due to the discrete nature of token inputs, restricted access to the target LLM, and limited query budget. To address the issues above, we propose an effective method for jailbreaking black-box large language Models via Iterative Semantic Tuning, named MIST. MIST enables attackers to iteratively refine prompts that preserve the original semantic intent while inducing harmful content. Specifically, to balance semantic similarity with computational efficiency, MIST incorporates two key strategies: sequential synonym search, and its advanced version--order-determining optimization. Extensive experiments across two open-source models and four closed-source models demonstrate that MIST achieves competitive attack success rates and attack transferability compared with other state-of-the-art white-box and black-box jailbreak methods. Additionally, we conduct experiments on computational efficiency to validate the practical viability of MIST.

[64] From Data to Knowledge: Evaluating How Efficiently Language Models Learn Facts

Daniel Christoph,Max Ploner,Patrick Haller,Alan Akbik

Main category: cs.CL

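TL;DR: This study compares models of different architectures and sizes on high- and low-frequency facts, revealing how model design affects the efficiency of fact learning.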

Details Motivation: Improving the sample efficiency of language models matters for training efficiency, particularly given the long-tailed distribution of information, where models must learn and recall both frequent and infrequent facts. Method: The study analyzes models of different architectures and sizes trained on the same pre-training data and annotates relational facts with their frequencies in the training corpus to examine how performance varies with fact frequency. Result: Most models perform similarly on high-frequency facts but differ notably on low-frequency facts. Conclusion: Models perform similarly on high-frequency facts but differ significantly on low-frequency facts, which offers new insights into the relationship between model architecture, size, and factual learning efficiency. Abstract: Sample efficiency is a crucial property of language models with practical implications for training efficiency. In real-world text, information follows a long-tailed distribution. Yet, we expect models to learn and recall frequent and infrequent facts. Sample-efficient models are better equipped to handle this challenge of learning and retaining rare information without requiring excessive exposure. This study analyzes multiple models of varying architectures and sizes, all trained on the same pre-training data. By annotating relational facts with their frequencies in the training corpus, we examine how model performance varies with fact frequency. Our findings show that most models perform similarly on high-frequency facts but differ notably on low-frequency facts. This analysis provides new insights into the relationship between model architecture, size, and factual learning efficiency.

[65] Language Bottleneck Models: A Framework for Interpretable Knowledge Tracing and Beyond

Antonin Berthon,Mihaela van der Schaar

Main category: cs.CL

TL;DR: This paper proposes the Language Bottleneck Model for interpretable and accurate student knowledge assessment.

Details Motivation: Traditional KT methods are opaque, and LLM-based approaches can hallucinate without accuracy guarantees, necessitating an interpretable yet precise method. Method: Reformulating Knowledge Tracing as an inverse problem to learn concise natural-language summaries that pass predictive information through a bottleneck using an encoder LLM and frozen decoder LLM. Result: LBMs perform comparably to state-of-the-art KT and LLM methods while drastically reducing the need for large datasets and enabling human interpretation. Conclusion: The Language Bottleneck Model (LBM) provides accurate and interpretable assessments of student knowledge, effectively balancing performance with interpretability. Abstract: Accurately assessing student knowledge is critical for effective education, yet traditional Knowledge Tracing (KT) methods rely on opaque latent embeddings, limiting interpretability. Even LLM-based approaches generate direct predictions or summaries that may hallucinate without any accuracy guarantees. We recast KT as an inverse problem: learning the minimum natural-language summary that makes past answers explainable and future answers predictable. Our Language Bottleneck Model (LBM) consists of an encoder LLM that writes an interpretable knowledge summary and a frozen decoder LLM that must reconstruct and predict student responses using only that summary text. By constraining all predictive information to pass through a short natural-language bottleneck, LBMs ensure that the summary contains accurate information while remaining human-interpretable. Experiments on synthetic arithmetic benchmarks and the large-scale Eedi dataset show that LBMs rival the accuracy of state-of-the-art KT and direct LLM methods while requiring orders-of-magnitude fewer student trajectories. We demonstrate that training the encoder with group-relative policy optimization, using downstream decoding accuracy as a reward signal, effectively improves summary quality.

[66] TeXpert: A Multi-Level Benchmark for Evaluating LaTeX Code Generation by LLMs

Sahil Kale,Vijaykant Nadadur

Main category: cs.CL

TL;DR: The paper introduces TeXpert, a new benchmark dataset for evaluating LLMs in LaTeX generation, revealing key challenges and findings about model performance.

Details Motivation: To assess the ability of LLMs in producing publication-ready LaTeX material, which is currently not evaluated in existing benchmarks. Method: The authors introduced TeXpert, a benchmark dataset with natural language prompts for generating LaTeX code, and evaluated both open and closed-source LLMs. Result: LLMs performing well on standard benchmarks struggle with LaTeX generation, showing a notable decrease in accuracy with increased task complexity; open-source models like DeepSeek v3 and DeepSeek Coder perform competitively against closed-source models. Conclusion: The study reveals that while LLMs show promise in generating LaTeX code, there are significant challenges related to accuracy and error types, especially as task complexity increases. Abstract: LaTeX's precision and flexibility in typesetting have made it the gold standard for the preparation of scientific documentation. Large Language Models (LLMs) present a promising opportunity for researchers to produce publication-ready material using LaTeX with natural language instructions, yet current benchmarks completely lack evaluation of this ability. By introducing TeXpert, our benchmark dataset with natural language prompts for generating LaTeX code focused on components of scientific documents across multiple difficulty levels, we conduct an in-depth analysis of LLM performance in this regard and identify frequent error types. Our evaluation across open and closed-source LLMs highlights multiple key findings: LLMs excelling on standard benchmarks perform poorly in LaTeX generation with a significant accuracy drop-off as the complexity of tasks increases; open-source models like DeepSeek v3 and DeepSeek Coder strongly rival closed-source counterparts in LaTeX tasks; and formatting and package errors are unexpectedly prevalent, suggesting a lack of diverse LaTeX examples in the training datasets of most LLMs. 
Our dataset, code, and model evaluations are available at https://github.com/knowledge-verse-ai/TeXpert.

[67] PersonalAI: Towards digital twins in the graph form

Mikhail Menschikov,Dmitry Evseev,Ruslan Kostoev,Ilya Perepechkin,Ilnaz Salimov,Victoria Dochkina,Petr Anokhin,Evgeny Burnaev,Nikita Semenov

Main category: cs.CL

TL;DR: This paper proposes a novel method for personalizing language models through external memory in the form of knowledge graphs, demonstrating robust performance in question-answering systems.

Details Motivation: The challenge of personalizing language models by accounting for a user's history during interactions remains pertinent despite recent advancements in large language models and Retrieval Augmented Generation. Method: Utilizing external memory in the form of knowledge graphs with standard edges and two types of hyperedges to personalize language models by retaining extensive personal information. Result: Experiments on TriviaQA, HotpotQA, and DiaASQ benchmarks indicate that the approach aids in making the process of graph construction and knowledge extraction unified and robust. The performance of the question-answering system remained robust even after augmenting the DiaASQ benchmark with parameters such as time and introducing contradictory statements. Conclusion: The proposed architecture effectively maintains and utilizes temporal dependencies, making the process of graph construction and knowledge extraction unified and robust. Abstract: The challenge of personalizing language models, specifically the ability to account for a user's history during interactions, is of significant interest. Despite recent advancements in large language models (LLMs) and Retrieval Augmented Generation that have enhanced the factual base of LLMs, the task of retaining extensive personal information and using it to generate personalized responses remains pertinent. To address this, we propose utilizing external memory in the form of knowledge graphs, which are constructed and updated by the LLM itself. We have expanded upon the ideas of the AriGraph architecture and for the first time introduced a combined graph featuring both standard edges and two types of hyperedges. Experiments conducted on the TriviaQA, HotpotQA and DiaASQ benchmarks indicate that this approach aids in making the process of graph construction and knowledge extraction unified and robust.
Furthermore, we augmented the DiaASQ benchmark by incorporating parameters such as time into dialogues and introducing contradictory statements made by the same speaker at different times. Despite these modifications, the performance of the question-answering system remained robust, demonstrating the proposed architecture's ability to maintain and utilize temporal dependencies.

[68] LLM-Generated Feedback Supports Learning If Learners Choose to Use It

Danielle R. Thomas,Conrad Borchers,Shambhavi Bhushan,Erin Gatz,Shivang Gupta,Kenneth R. Koedinger

Main category: cs.CL

TL;DR: This paper shows that optional LLM-generated explanatory feedback provides moderate learning benefits, particularly for learners inclined to use it, making it a cost-effective and scalable enhancement to existing educational systems.

Details Motivation: This study addresses the gap in understanding the impact of LLM-generated feedback on learning compared to traditional methods, especially in scenarios where such feedback is optional rather than mandatory. Method: The research analyzed over 2,600 lesson completions from 885 tutor learners across seven lessons, comparing posttest performance among three groups: those receiving LLM-generated feedback (using gpt-3.5-turbo), those who declined it, and those without access. Propensity scoring was used to address potential selection bias. Result: After adjusting for selection bias using propensity scoring, two out of seven lessons showed statistically significant improvements in learning outcomes with standardized effect sizes of 0.28 and 0.33. Learners also overwhelmingly rated the LLM feedback as helpful. Conclusion: The study concludes that LLM-generated feedback can offer moderate learning benefits, particularly for learners inclined to seek support, and it does not significantly increase completion time. It is presented as a low-cost, scalable enhancement to existing systems providing non-LLM feedback. Abstract: Large language models (LLMs) are increasingly used to generate feedback, yet their impact on learning remains underexplored, especially compared to existing feedback methods. This study investigates how on-demand LLM-generated explanatory feedback influences learning in seven scenario-based tutor training lessons. Analyzing over 2,600 lesson completions from 885 tutor learners, we compare posttest performance among learners across three groups: learners who received feedback generated by gpt-3.5-turbo, those who declined it, and those without access. All groups received non-LLM corrective feedback. To address potential selection bias, where higher-performing learners may be more inclined to use LLM feedback, we applied propensity scoring.
Learners with a higher predicted likelihood of engaging with LLM feedback scored significantly higher at posttest than those with lower propensity. After adjusting for this effect, two out of seven lessons showed statistically significant learning benefits from LLM feedback with standardized effect sizes of 0.28 and 0.33. These moderate effects suggest that the effectiveness of LLM feedback depends on the learners' tendency to seek support. Importantly, LLM feedback did not significantly increase completion time, and learners overwhelmingly rated it as helpful. These findings highlight LLM feedback's potential as a low-cost and scalable way to improve learning on open-ended tasks, particularly in existing systems already providing feedback without LLMs. This work contributes open datasets, LLM prompts, and rubrics to support reproducibility.

[69] Instituto de Telecomunicações at IWSLT 2025: Aligning Small-Scale Speech and Language Models for Speech-to-Text Learning

Giuseppe Attanasio,Sonal Sannigrahi,Ben Peters,André F. T. Martins

Main category: cs.CL

TL;DR: The paper proposes a unified speech-to-text model for the IWSLT 2025 Shared Task using modality alignment and instruction fine-tuning with small-scale backbones and high-quality data.

Details Motivation: The motivation is to address the IWSLT 2025 Shared Task on Instruction Following Speech Processing by focusing on efficient models that utilize small-scale language model backbones and high-quality, CC-BY data. Method: The method involves creating a unified speech-to-text model by integrating a pre-trained continuous speech encoder and text decoder through modality alignment and instruction fine-tuning. Result: The result is the successful submission of model outcomes for the Short Track tasks, demonstrating the effectiveness of the proposed approach. Conclusion: The paper concludes that the unified speech-to-text model, using small-scale language model backbones and high-quality data supplemented with synthetic generation, is effective for the Short Track tasks of speech recognition, translation, and spoken question answering. Abstract: This paper presents the IT-IST submission to the IWSLT 2025 Shared Task on Instruction Following Speech Processing. We submit results for the Short Track, i.e., speech recognition, translation, and spoken question answering. Our model is a unified speech-to-text model that integrates a pre-trained continuous speech encoder and text decoder through a first phase of modality alignment and a second phase of instruction fine-tuning. Crucially, we focus on using small-scale language model backbones (< 2B) and restrict to high-quality, CC-BY data along with synthetic data generation to supplement existing resources.

[70] MUCAR: Benchmarking Multilingual Cross-Modal Ambiguity Resolution for Multimodal Large Language Models

Xiaolong Wang,Zhaolu Kang,Wangyuxuan Zhai,Xinyue Lou,Yunghwei Lai,Ziyue Wang,Yawen Wang,Kaiyu Huang,Yile Wang,Peng Li,Yang Liu

Main category: cs.CL

TL;DR: This paper introduces MUCAR, a new benchmark for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios.

Details Motivation: Existing multimodal benchmarks typically overlook linguistic and visual ambiguities and fail to exploit the potential for mutual clarification between modalities. Method: The authors introduce MUCAR, a new benchmark comprising a multilingual dataset and a dual-ambiguity dataset, to systematically evaluate cross-modal ambiguity resolution. Result: An evaluation of 19 state-of-the-art multimodal models reveals substantial gaps relative to human performance. Conclusion: Further research is needed into more sophisticated methods for cross-modal ambiguity comprehension. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant advances across numerous vision-language tasks. Due to their strong image-text alignment capability, MLLMs can effectively understand image-text pairs with clear meanings. However, effectively resolving the inherent ambiguities in natural language and visual contexts remains challenging. Existing multimodal benchmarks typically overlook linguistic and visual ambiguities, relying mainly on unimodal context for disambiguation and thus failing to exploit the mutual clarification potential between modalities. To bridge this gap, we introduce MUCAR, a novel and challenging benchmark designed explicitly for evaluating multimodal ambiguity resolution across multilingual and cross-modal scenarios. MUCAR includes: (1) a multilingual dataset where ambiguous textual expressions are uniquely resolved by corresponding visual contexts, and (2) a dual-ambiguity dataset that systematically pairs ambiguous images with ambiguous textual contexts, with each combination carefully constructed to yield a single, clear interpretation through mutual disambiguation. Extensive evaluations involving 19 state-of-the-art multimodal models--encompassing both open-source and proprietary architectures--reveal substantial gaps compared to human-level performance, highlighting the need for future research into more sophisticated cross-modal ambiguity comprehension methods, further pushing the boundaries of multimodal reasoning.

[71] Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025

Dominik Macháček,Peter Polák

Main category: cs.CL

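TL;DR: This paper describes Charles University's submission to the IWSLT 2025 Simultaneous Speech Translation Task, which uses the Whisper model and the AlignAtt policy to achieve substantial improvements across multiple language pairs.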

Details Motivation: To provide an effective solution for the IWSLT 2025 Simultaneous Speech Translation Task covering all four language pairs. Method: The systems use the offline Whisper speech model as their backbone, with AlignAtt as the simultaneous policy for translation and transcription, and improve performance by prompting to inject in-domain terminology. Result: Compared to the organizers' baseline, on the development sets the systems improve by 2 BLEU points on Czech to English and by 13-22 BLEU points on English to German, Chinese, and Japanese. Conclusion: The Charles University submission achieves substantial improvements on the IWSLT 2025 Simultaneous Speech Translation Task and additionally proposes an enhanced measure of speech recognition latency. Abstract: This paper describes the Charles University submission to the Simultaneous Speech Translation Task of the IWSLT 2025. We cover all four language pairs with a direct or cascade approach. The backbone of our systems is the offline Whisper speech model, which we use for both translation and transcription in simultaneous mode with the state-of-the-art simultaneous policy AlignAtt. We further improve the performance by prompting to inject in-domain terminology, and we accommodate context. Our cascaded systems further use EuroLLM for unbounded simultaneous translation. Compared to the Organizers' baseline, our systems improve by 2 BLEU points on Czech to English and 13-22 BLEU points on English to German, Chinese and Japanese on the development sets. Additionally, we propose a new enhanced measure of speech recognition latency.

[72] Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs

Ricardo Rei,Nuno M. Guerreiro,José Pombal,João Alves,Pedro Teixeirinha,Amin Farajian,André F. T. Martins

Main category: cs.CL

TL;DR: Tower+ balances strong translation performance with robust general-purpose multilingual capabilities through a novel training approach, outperforming larger models in key benchmarks.

Details Motivation: Fine-tuning LLMs often sacrifices general-purpose abilities like conversational reasoning and instruction-following, limiting their real-world utility. This work aims to maintain strong performance on translation while preserving broad capabilities. Method: Tower+ builds on a training recipe combining continued pretraining, supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards. Carefully curated data is used at each stage to enhance performance across tasks like translation, code generation, math problem-solving, and instruction-following. Result: Tower+ achieves a Pareto frontier between translation performance and general-purpose skills. Smaller Tower+ models outperform larger proprietary and open LLMs (e.g., Llama 3.3 70B, GPT-4o), while the largest model excels in translation for high-resource languages and performs best in multilingual Arena Hard and IF-MT benchmarks. Conclusion: Tower+ models successfully balance translation specialization with multilingual general-purpose capabilities, rivaling frontier models in both areas. Abstract: Fine-tuning pretrained LLMs has been shown to be an effective strategy for reaching state-of-the-art performance on specific tasks like machine translation. However, this process of adaptation often implies sacrificing general-purpose capabilities, such as conversational reasoning and instruction-following, hampering the utility of the system in real-world applications that require a mixture of skills. In this paper, we introduce Tower+, a suite of models designed to deliver strong performance across both translation and multilingual general-purpose text capabilities. 
We achieve a Pareto frontier between translation specialization and multilingual general-purpose capabilities by introducing a novel training recipe that builds on Tower (Alves et al., 2024), comprising continued pretraining, supervised fine-tuning, preference optimization, and reinforcement learning with verifiable rewards. At each stage of training, we carefully generate and curate data to strengthen performance on translation as well as general-purpose tasks involving code generation, mathematics problem solving, and general instruction-following. We develop models at multiple scales: 2B, 9B, and 72B. Our smaller models often outperform larger general-purpose open-weight and proprietary LLMs (e.g., Llama 3.3 70B, GPT-4o). Our largest model delivers best-in-class translation performance for high-resource languages and top results in multilingual Arena Hard evaluations and in IF-MT, a benchmark we introduce for evaluating both translation and instruction-following. Our findings highlight that it is possible to rival frontier models in general capabilities, while optimizing for specific business domains, such as translation and localization.

[73] Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation

Jiahao Cheng,Tiancheng Su,Jia Yuan,Guoxiu He,Jiawei Liu,Xinqi Tao,Jingwen Xie,Huaxia Li

Main category: cs.CL

TL;DR: This paper explores how Chain-of-Thought prompting reduces Large Language Model hallucinations but simultaneously impairs detection capabilities by masking critical signals.

Details Motivation: Chain-of-Thought (CoT) prompting has shown potential in mitigating LLM hallucinations, but its impact on hallucination detection remains underexplored. This study aims to bridge that gap. Method: A systematic empirical evaluation was conducted, starting with a pilot experiment analyzing the impact of CoT prompting on LLM internal states and token probability distributions. The research further assessed various CoT prompting techniques' effects on mainstream hallucination detection methods across different types of LLMs. Result: Findings indicate that CoT prompting can reduce hallucination frequency. However, it simultaneously masks key signals used for hallucination detection, thus diminishing the accuracy and confidence of detection methods. Conclusion: The study reveals a trade-off in using CoT prompting: while it reduces hallucinations, it also hinders the effectiveness of detection methods by obscuring critical signals. Abstract: Large Language Models (LLMs) often exhibit *hallucinations*, generating factually incorrect or semantically irrelevant content in response to prompts. Chain-of-Thought (CoT) prompting can mitigate hallucinations by encouraging step-by-step reasoning, but its impact on hallucination detection remains underexplored. To bridge this gap, we conduct a systematic empirical evaluation. We begin with a pilot experiment, revealing that CoT reasoning significantly affects the LLM's internal states and token probability distributions. Building on this, we evaluate the impact of various CoT prompting methods on mainstream hallucination detection methods across both instruction-tuned and reasoning-oriented LLMs. Specifically, we examine three key dimensions: changes in hallucination score distributions, variations in detection accuracy, and shifts in detection confidence.
Our findings show that while CoT prompting helps reduce hallucination frequency, it also tends to obscure critical signals used for detection, impairing the effectiveness of various detection methods. Our study highlights an overlooked trade-off in the use of reasoning. Code is publicly available at: https://anonymous.4open.science/r/cot-hallu-detect.

[74] Better Language Model Inversion by Compactly Representing Next-Token Distributions

Murtaza Nazir,Matthew Finlayson,John X. Morris,Xiang Ren,Swabha Swayamdipta

Main category: cs.CL

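TL;DR: This paper proposes a new language model inversion method, prompt inversion from logprob sequences (PILS), which recovers hidden prompts using the model's next-token probabilities over multiple generation steps. Compared with existing techniques, the method recovers hidden prompts far more effectively and generalizes well.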

Details Motivation: Language model inversion seeks to recover hidden prompts using only language model outputs, which has implications for the security and accountability of language model deployments, such as leaking private information from an API-protected language model's system message. Method: The authors propose prompt inversion from logprob sequences (PILS). Its key insight is that the vector-valued outputs of a language model occupy a low-dimensional subspace, so the full next-token probability distributions over multiple generation steps can be losslessly compressed with a linear map, allowing more output information to be used for inversion. Result: Compared with previous state-of-the-art methods, the approach achieves 2-3.5 times higher exact recovery rates for hidden prompts, in one case raising the recovery rate from 17% to 60%. It also performs strongly on recovering hidden system messages and generalizes well. Conclusion: Next-token probabilities are a considerably more vulnerable attack surface for inversion attacks than previously known. Abstract: Language model inversion seeks to recover hidden prompts using only language model outputs. This capability has implications for security and accountability in language model deployments, such as leaking private information from an API-protected language model's system message. We propose a new method -- prompt inversion from logprob sequences (PILS) -- that recovers hidden prompts by gleaning clues from the model's next-token probabilities over the course of multiple generation steps. Our method is enabled by a key insight: The vector-valued outputs of a language model occupy a low-dimensional subspace. This enables us to losslessly compress the full next-token probability distribution over multiple generation steps using a linear map, allowing more output information to be used for inversion. Our approach yields massive gains over previous state-of-the-art methods for recovering hidden prompts, achieving 2-3.5 times higher exact recovery rates across test sets, in one case increasing the recovery rate from 17% to 60%. Our method also exhibits surprisingly good generalization behavior; for instance, an inverter trained on 16 generation steps gets 5-27 points higher prompt recovery when we increase the number of steps to 32 at test time. Furthermore, we demonstrate strong performance of our method on the more challenging task of recovering hidden system messages. We also analyze the role of verbatim repetition in prompt recovery and propose a new method for cross-family model transfer for logit-based inverters.
Our findings show that next-token probabilities are a considerably more vulnerable attack surface for inversion attacks than previously known.
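
The low-dimensional-subspace observation in the PILS abstract can be illustrated with a toy model. This is only a sketch under stated assumptions, not the authors' implementation: the unembedding matrix `W`, the hidden size of 2, the vocabulary of 6 tokens, and the helper `solve` are all hypothetical. Because each log-probability equals `W_i . h - log_z`, the whole vector is linear in the three unknowns `(h, log_z)`, so three entries are enough to restore all six.

```python
import math

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Hypothetical unembedding matrix: 6 vocabulary tokens, hidden size d = 2.
W = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8], [0.2, 0.2], [-1.1, -0.4], [0.9, 1.2]]

def logprobs(h):
    """Log-softmax of the logits W h."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    log_z = math.log(sum(math.exp(z) for z in logits))
    return [z - log_z for z in logits]

full = logprobs([0.4, -1.3])

# Compress: keep only d + 1 = 3 of the 6 log-probabilities (a linear selection map).
kept = [0, 1, 2]
compressed = [full[i] for i in kept]

# Decompress: solve for (h0, h1, log_z) from the kept entries, then rebuild every entry.
basis = [W[i] + [-1.0] for i in kept]
h0, h1, log_z = solve(basis, compressed)
restored = [row[0] * h0 + row[1] * h1 - log_z for row in W]

max_error = max(abs(a - b) for a, b in zip(full, restored))
print(max_error)  # tiny floating-point residue only
```

In this toy setting the compression is lossless up to floating-point error because the decoder knows `W`; how the paper obtains and applies its linear map is not specified in the abstract.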

[75] Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?

Adithya Bhaskar,Alexander Wettig,Tianyu Gao,Yihe Dong,Danqi Chen

Main category: cs.CL

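TL;DR: This study proposes the KV footprint metric to analyze the memory cost of the KV cache in long-context processing, and designs PruLong to reduce memory usage while maintaining model performance.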

Details Motivation: As language models handle increasingly long contexts, the memory cost of the KV cache grows. Existing methods are tailored to favorable settings, which obscures problems such as high peak memory and performance degradation, and fair comparison is lacking, so a unified evaluation metric and a better KV management method are needed. Method: The paper proposes a unified metric, the KV footprint, which accounts for both the number of KV cache entries stored and their lifespan in memory, and proposes PruLong, a method that optimizes memory usage while preserving long-context performance. Result: The KV footprint metric reveals that existing KV eviction methods have high peak memory; PruLong achieves a 12% smaller KV footprint than prior methods while retaining performance on long-context tasks. Conclusion: The KV footprint metric exposes the high peak memory of KV eviction methods, and PruLong effectively reduces that footprint, paving the way for future work on minimizing the KV footprint. Abstract: Language models handle increasingly long contexts for tasks such as book summarization, but this leads to growing memory costs for the key-value (KV) cache. Many prior works have proposed ways of discarding KVs from memory, but their approaches are tailored to favorable settings, obscuring caveats like high peak memory and performance degradation, and a fair comparison between methods is difficult. In this paper, we propose the *KV footprint* as a unified metric, which accounts for both the amount of KV entries stored and their lifespan in memory. We evaluate methods based on the smallest footprint they attain while preserving performance in both long-context understanding and generation, with context lengths of up to 128K tokens. This metric reveals the high peak memory of prior KV eviction methods. One class of methods -- *post-fill eviction* -- has a high footprint due to being incompatible with eviction during pre-filling. We adapt these methods to be able to evict KVs during pre-filling, achieving substantially lower KV footprints. We then turn to *recency eviction* methods, wherein we propose PruLong, an end-to-end optimization method for learning which attention heads need to retain the full KV cache and which do not. PruLong saves memory while preserving long-context performance, achieving 12% smaller KV footprint than prior methods while retaining performance in challenging recall tasks. Our paper clarifies the complex tangle of long-context inference methods and paves the way for future development to minimize the KV footprint.
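
To make the idea of a footprint that counts both how many KV entries are stored and how long they stay in memory more concrete, here is a toy sketch. The exact definition used in the paper is not given in the abstract, so the formula below (resident entries summed over time steps, divided by the same sum for a full cache) is an assumption, as are all function names and the tiny sequence lengths.

```python
def resident_full(seq_len):
    """KV entries resident after each token when nothing is evicted."""
    return [t + 1 for t in range(seq_len)]

def resident_recency(seq_len, window):
    """Recency eviction: keep at most `window` most recent entries at every step."""
    return [min(t + 1, window) for t in range(seq_len)]

def resident_post_fill(prefill_len, decode_len, budget):
    """Post-fill eviction: hold the full cache while pre-filling, then shrink to `budget`."""
    during_prefill = [t + 1 for t in range(prefill_len)]
    during_decode = [budget] * decode_len  # assumes one eviction per newly decoded token
    return during_prefill + during_decode

def footprint(resident, baseline):
    """Toy footprint: time-aggregated resident entries relative to the full cache."""
    return sum(resident) / sum(baseline)

full = resident_full(8)
recency = resident_recency(8, window=4)
post_fill = resident_post_fill(prefill_len=6, decode_len=2, budget=4)

recency_ratio = footprint(recency, full)
post_fill_ratio = footprint(post_fill, full)
print(recency_ratio, max(recency))      # lower footprint, peak of 4 entries
print(post_fill_ratio, max(post_fill))  # higher footprint, peak of 6 entries
```

Even with the same final budget of 4 entries, the post-fill variant in this sketch has a larger footprint and a higher peak because it cannot evict during pre-filling, which mirrors the caveat the abstract raises.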

[76] CLEAR-3K: Assessing Causal Explanatory Capabilities in Language Models

Naiming Liu,Richard Baraniuk,Shashank Sonkar

Main category: cs.CL

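TL;DR: This paper introduces CLEAR-3K, a dataset for evaluating the causal reasoning abilities of language models, and finds that current models readily confuse semantic relatedness with genuine causal relationships.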

Details Motivation: To evaluate how well language models distinguish semantic relatedness from genuine causal explanatory relationships, the authors propose the CLEAR-3K dataset. Method: CLEAR-3K contains 3,000 assertion-reasoning questions designed to test whether language models can determine if one statement causally explains another. The authors comprehensively evaluate 21 state-of-the-art language models (0.5B to 72B parameters). Result: First, language models frequently confuse semantic similarity with causality, relying on lexical and semantic overlap rather than inferring actual causal explanatory relationships. Second, as parameter count increases, models tend to shift from being overly skeptical of causal relationships to accepting them too readily. Even the best-performing models plateau at a Matthews Correlation Coefficient of just 0.55. Conclusion: CLEAR-3K is an important benchmark for developing and evaluating genuine causal reasoning in language models, a capability essential for applications that require accurate assessment of causal relationships. Abstract: We introduce CLEAR-3K, a dataset of 3,000 assertion-reasoning questions designed to evaluate whether language models can determine if one statement causally explains another. Each question presents an assertion-reason pair and challenges language models to distinguish between semantic relatedness and genuine causal explanatory relationships. Through comprehensive evaluation of 21 state-of-the-art language models (ranging from 0.5B to 72B parameters), we identify two fundamental findings. First, language models frequently confuse semantic similarity with causality, relying on lexical and semantic overlap instead of inferring actual causal explanatory relationships. Second, as parameter size increases, models tend to shift from being overly skeptical about causal relationships to being excessively permissive in accepting them. Despite this shift, performance measured by the Matthews Correlation Coefficient plateaus at just 0.55, even for the best-performing models. Hence, CLEAR-3K provides a crucial benchmark for developing and evaluating genuine causal reasoning in language models, which is an essential capability for applications that require accurate assessment of causal relationships.
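For readers unfamiliar with the Matthews Correlation Coefficient reported above, the following is a minimal sketch of the standard binary MCC formula. It is not the authors' evaluation code; the function name and the 0/1 label encoding ("explains" = 1, "does not explain" = 0) are assumptions made for illustration.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Binary MCC from two equal-length sequences of 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Conventional fallback when a row or column of the confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom

# A model that always answers "yes, it causally explains" scores 0,
# which is why MCC penalizes the overly permissive behavior described above.
labels = [1, 1, 0, 0]
always_yes = [1, 1, 1, 1]
print(matthews_corrcoef(labels, always_yes))
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level or constant predictions, so a plateau at 0.55 indicates performance well short of perfect agreement.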

[77] Towards AI Search Paradigm

Yuchen Li,Hengyi Cai,Rui Kong,Xinran Chen,Jiamin Chen,Jun Yang,Haojie Zhang,Jiayi Li,Jiayi Wu,Yiqun Chen,Changle Qu,Keyi Kong,Wenwen Ye,Lixin Su,Xinyu Ma,Long Xia,Daiting Shi,Jiashu Zhao,Haoyi Xiong,Shuaiqiang Wang,Dawei Yin

Main category: cs.CL

TL;DR: This paper proposes an LLM-based AI search paradigm with a modular architecture of four cooperating agents to handle complex search and reasoning tasks.

Details Motivation: To enable next-generation search systems to emulate human information processing and decision-making, and to meet a broad range of information needs from simple factual queries to complex multi-stage reasoning tasks. Method: A modular architecture of four LLM-powered agents (Master, Planner, Executor, and Writer) that collaborate through coordinated workflows, covering methods from task planning and tool integration to execution strategies and efficient LLM inference. Result: The paper presents a comprehensive blueprint called the AI Search Paradigm and systematically lays out the key methods and techniques for realizing it. Conclusion: By providing an in-depth guide to these foundational components, the paper aims to inform the design and development of trustworthy, adaptive, and scalable next-generation AI search systems. Abstract: In this paper, we introduce the AI Search Paradigm, a comprehensive blueprint for next-generation search systems capable of emulating human information processing and decision-making. The paradigm employs a modular architecture of four LLM-powered agents (Master, Planner, Executor and Writer) that dynamically adapt to the full spectrum of information needs, from simple factual queries to complex multi-stage reasoning tasks. These agents collaborate dynamically through coordinated workflows to evaluate query complexity, decompose problems into executable plans, and orchestrate tool usage, task execution, and content synthesis. We systematically present key methodologies for realizing this paradigm, including task planning and tool integration, execution strategies, aligned and robust retrieval-augmented generation, and efficient LLM inference, spanning both algorithmic techniques and infrastructure-level optimizations. By providing an in-depth guide to these foundational components, this work aims to inform the development of trustworthy, adaptive, and scalable AI search systems.

[78] Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency

Kathleen C. Fraser,Hillary Dawkins,Isar Nejadgholi,Svetlana Kiritchenko

Main category: cs.CL

TL;DR: Fine-tuning large language models can unintentionally reduce their safety features, making them vulnerable to misuse. Current safety evaluation methods are inconsistent, calling for more rigorous and standardized reporting practices.

Details Motivation: Fine-tuning is widely used to adapt LLMs to specific tasks, but it can remove critical safety features. This poses a risk when deployed by well-intentioned developers or exploited by malicious actors, necessitating better understanding and mitigation strategies. Method: The researchers conducted experiments to assess how variations in fine-tuning setups affect the robustness of safety evaluations, focusing on the impact of trivial changes and model stochasticity. Result: Experiments revealed significant variance in safety evaluation results due to minor changes in fine-tuning procedures, highlighting the fragility of current evaluation practices. Conclusion: The study concludes that fine-tuning LLMs compromises safety alignment, posing risks as developers may unknowingly deploy less safe models. The paper emphasizes the need for reliable evaluation methods and transparent reporting in future research. Abstract: Fine-tuning a general-purpose large language model (LLM) for a specific domain or task has become a routine procedure for ordinary users. However, fine-tuning is known to remove the safety alignment features of the model, even when the fine-tuning data does not contain any harmful content. We consider this to be a critical failure mode of LLMs due to the widespread uptake of fine-tuning, combined with the benign nature of the "attack". Most well-intentioned developers are likely unaware that they are deploying an LLM with reduced safety. On the other hand, this known vulnerability can be easily exploited by malicious actors intending to bypass safety guardrails. To make any meaningful progress in mitigating this issue, we first need reliable and reproducible safety evaluations. In this work, we investigate how robust a safety benchmark is to trivial variations in the experimental procedure, and the stochastic nature of LLMs. 
Our initial experiments expose surprising variance in the results of the safety evaluation, even when seemingly inconsequential changes are made to the fine-tuning setup. Our observations have serious implications for how researchers in this field should report results to enable meaningful comparisons in the future.

[79] LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles

Ho Yin 'Sam' Ng,Ting-Yao Hsu,Aashish Anantha Ramakrishnan,Branislav Kveton,Nedim Lipka,Franck Dernoncourt,Dongwon Lee,Tong Yu,Sungchul Kim,Ryan A. Rossi,Ting-Hao 'Kenneth' Huang

Main category: cs.CL

TL;DR: LaMP-Cap is introduced for personalized figure caption generation using multimodal profiles. Results show profile-based methods improve caption quality, with images being more impactful than text in profiles.

Details Motivation: Figure captions are crucial for understanding figures, but AI-generated captions often require revision to match an author's style and domain-specific context. Existing personalization approaches mainly focus on text-only settings, lacking support for multimodal inputs and profiles. Method: The paper introduces LaMP-Cap, a dataset for personalized figure caption generation with multimodal figure profiles. It uses four LLMs to experimentally evaluate the effectiveness of profile information and conducts ablation studies to assess contributions of different profile components. Result: Experiments show that incorporating profile information consistently improves caption quality by making them closer to author-written ones. Ablation studies indicate that images in the profile contribute more than figure-mentioning paragraphs. Conclusion: The paper concludes that using multimodal profiles helps generate figure captions closer to the original author-written ones, showing the advantage of multimodal profiles over text-only settings. Abstract: Figure captions are crucial for helping readers understand and remember a figure's key message. Many models have been developed to generate these captions, helping authors compose better quality captions more easily. Yet, authors almost always need to revise generic AI-generated captions to match their writing style and the domain's style, highlighting the need for personalization. Despite language models' personalization (LaMP) advances, these technologies often focus on text-only settings and rarely address scenarios where both inputs and profiles are multimodal. This paper introduces LaMP-Cap, a dataset for personalized figure caption generation with multimodal figure profiles. 
For each target figure, LaMP-Cap provides not only the needed inputs, such as figure images, but also up to three other figures from the same document--each with its image, caption, and figure-mentioning paragraphs--as a profile to characterize the context. Experiments with four LLMs show that using profile information consistently helps generate captions closer to the original author-written ones. Ablation studies reveal that images in the profile are more helpful than figure-mentioning paragraphs, highlighting the advantage of using multimodal profiles over text-only ones.

cs.CV [Back]

[80] A Strong View-Free Baseline Approach for Single-View Image Guided Point Cloud Completion

Fangzhou Lin,Zilin Dai,Rigved Sanku,Songlin Hou,Kazunori D Yamada,Haichong K. Zhang,Ziming Zhang

Main category: cs.CV

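TL;DR: This paper proposes a baseline for single-view point cloud completion that does not rely on image guidance and uses attention mechanisms to improve completion quality.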

Details Motivation: To examine whether image guidance is truly necessary in single-view image guided point cloud completion, and to build a strong baseline that does not rely on images. Method: A hierarchical self-fusion mechanism combining cross-attention and self-attention layers, taking only partial point clouds as input. Result: Experiments on the ShapeNet-ViPC dataset show that the method outperforms existing single-view image guided point cloud completion methods. Conclusion: The paper proposes an attention-based multi-branch encoder-decoder network for the single-view image guided point cloud completion task that achieves strong performance without image guidance. Abstract: The single-view image guided point cloud completion (SVIPC) task aims to reconstruct a complete point cloud from a partial input with the help of a single-view image. While previous works have demonstrated the effectiveness of this multimodal approach, the fundamental necessity of image guidance remains largely unexamined. To explore this, we propose a strong baseline approach for SVIPC based on an attention-based multi-branch encoder-decoder network that only takes partial point clouds as input, view-free. Our hierarchical self-fusion mechanism, driven by cross-attention and self-attention layers, effectively integrates information across multiple streams, enriching feature representations and strengthening the network's ability to capture geometric structures. Extensive experiments and ablation studies on the ShapeNet-ViPC dataset demonstrate that our view-free framework outperforms state-of-the-art SVIPC methods. We hope our findings provide new insights into the development of multimodal learning in SVIPC. Our demo code will be available at https://github.com/Zhang-VISLab.

[81] VLMInferSlow: Evaluating the Efficiency Robustness of Large Vision-Language Models as a Service

Xiasi Wang,Tianliang Yao,Simin Chen,Runqi Wang,Lei YE,Kuofeng Gao,Yi Huang,Yuan Yao

Main category: cs.CV

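TL;DR: This paper proposes VLMInferSlow, a new method for evaluating the efficiency robustness of vision-language models in a black-box setting. It generates adversarial examples through fine-grained efficiency modeling and zero-order optimization, substantially increasing computational cost and underscoring the importance of efficiency robustness.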

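Details Motivation: Existing research focuses mainly on improving the accuracy of vision-language models, leaving their efficiency underexplored. Given the real-time demands of many applications and the high inference overhead of VLMs, efficiency robustness is critical. Prior studies, however, rely on unrealistic assumptions that require access to the model architecture and parameters, which is impractical in ML-as-a-service settings. Method: VLMInferSlow combines fine-grained efficiency modeling tailored to VLM inference with zero-order optimization to search for adversarial examples, evaluating efficiency robustness without access to the model architecture or parameters. Result: Experiments show that VLMInferSlow generates adversarial images with imperceptible perturbations that increase computational cost by up to 128.47%. Conclusion: The paper proposes VLMInferSlow for evaluating the efficiency robustness of vision-language models (VLMs) in a realistic black-box setting. Through fine-grained efficiency modeling and zero-order optimization, the method generates adversarial examples that significantly increase computational cost. The authors hope to raise the community's awareness of VLM efficiency robustness. Abstract: Vision-Language Models (VLMs) have demonstrated great potential in real-world applications. While existing research primarily focuses on improving their accuracy, the efficiency remains underexplored. Given the real-time demands of many applications and the high inference overhead of VLMs, efficiency robustness is a critical issue. However, previous studies evaluate efficiency robustness under unrealistic assumptions, requiring access to the model architecture and parameters -- an impractical scenario in ML-as-a-service settings, where VLMs are deployed via inference APIs. To address this gap, we propose VLMInferSlow, a novel approach for evaluating VLM efficiency robustness in a realistic black-box setting. VLMInferSlow incorporates fine-grained efficiency modeling tailored to VLM inference and leverages zero-order optimization to search for adversarial examples. Experimental results show that VLMInferSlow generates adversarial images with imperceptible perturbations, increasing the computational cost by up to 128.47%. We hope this research raises the community's awareness about the efficiency robustness of VLMs.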
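The zero-order optimization mentioned in the abstract can be illustrated with a generic two-point, random-direction gradient estimator. This is a minimal sketch and not the VLMInferSlow algorithm: the abstract does not specify the estimator, and `zo_gradient`, `toy_cost`, `mu`, and `num_dirs` are hypothetical names chosen here. It only shows how a gradient can be approximated from black-box function values, which is the property that makes such methods usable against an inference API.

```python
import random

def zo_gradient(f, x, mu=1e-3, num_dirs=20, seed=0):
    """Estimate the gradient of a black-box function f at x using only
    function evaluations (two-point estimator over random directions)."""
    rng = random.Random(seed)
    grad = [0.0] * len(x)
    for _ in range(num_dirs):
        # Sample a random Gaussian probing direction.
        u = [rng.gauss(0.0, 1.0) for _ in x]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        # Directional slope obtained from two queries to the black box.
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i, ui in enumerate(u):
            grad[i] += slope * ui / num_dirs
    return grad

def toy_cost(x):
    """Toy stand-in for a queried cost signal; purely illustrative."""
    return 3.0 * x[0] - 2.0 * x[1]

estimate = zo_gradient(toy_cost, [0.0, 0.0])
print(estimate)
```

The estimate is noisy for a small number of directions, but it is positively aligned with the true gradient, so repeated ascent steps along it increase the queried cost without any access to model internals.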

[82] Weakly-supervised VLM-guided Partial Contrastive Learning for Visual Language Navigation

Ruoyu Wang,Tong Yu,Junda Wu,Yao Liu,Julian McAuley,Lina Yao

Main category: cs.CV

TL;DR: This paper proposes WPCL, a weakly-supervised method for Visual Language Navigation that improves agent performance by integrating pre-trained VLMs without fine-tuning, leading to better navigation accuracy and efficiency.

Details Motivation: Existing VLN methods face three key challenges: difficulty handling dynamic viewpoints with pre-trained backbones, limited performance when using off-the-shelf LLMs/VLMs without fine-tuning, and high computational cost when fine-tuning these models. This work aims to overcome these limitations. Method: The authors propose Weakly-supervised Partial Contrastive Learning (WPCL), which integrates pre-trained Vision-Language Models (VLMs) into the perception process without requiring fine-tuning, thereby improving object identification from dynamic viewpoints in VLN scenarios. Result: Experimental results show that the WPCL method outperforms baseline methods on multiple VLN benchmarks, demonstrating its effectiveness, robustness, and generalizability while maintaining computational efficiency. Conclusion: The proposed WPCL method effectively enhances agents' ability to navigate based on language instructions by integrating pre-trained VLM knowledge without fine-tuning, achieving superior performance and computational efficiency. Abstract: Visual Language Navigation (VLN) is a fundamental task within the field of Embodied AI, focusing on the ability of agents to navigate complex environments based on natural language instructions. Despite the progress made by existing methods, these methods often present some common challenges. First, they rely on pre-trained backbone models for visual perception, which struggle with the dynamic viewpoints in VLN scenarios. Second, the performance is limited when using pre-trained LLMs or VLMs without fine-tuning, due to the absence of VLN domain knowledge. Third, while fine-tuning LLMs and VLMs can improve results, their computational costs are higher than those without fine-tuning. 
To address these limitations, we propose Weakly-supervised Partial Contrastive Learning (WPCL), a method that enhances an agent's ability to identify objects from dynamic viewpoints in VLN scenarios by effectively integrating pre-trained VLM knowledge into the perception process, without requiring VLM fine-tuning. Our method enhances the agent's ability to interpret and respond to environmental cues while ensuring computational efficiency. Experimental results have shown that our method outperforms the baseline methods on multiple benchmarks, which validate the effectiveness, robustness and generalizability of our method.

[83] Implicit 3D scene reconstruction using deep learning towards efficient collision understanding in autonomous driving

Akarshani Ramanayake,Nihal Kodikara

Main category: cs.CV

TL;DR: This paper proposes a new method for 3D scene reconstruction using LiDAR and deep learning to improve obstacle mapping and collision detection in autonomous vehicles within dense urban environments.

Details Motivation: In dense urban traffic conditions, current technologies struggle with tight navigation. There is a need for higher boundary level accuracy in 3D scene reconstruction of object shapes, which has not been extensively considered in existing literature. Method: The method involves a learning-based 3D scene reconstruction approach that utilizes LiDAR data and deep neural networks to create static Signed Distance Function (SDF) maps, differing from traditional polygonal representations. Result: Preliminary results indicate a significant enhancement in collision detection performance, particularly in congested and dynamic environments. Conclusion: The research concludes that the developed learning-based 3D scene reconstruction methodology using LiDAR data and deep neural networks enhances collision detection performance, especially in congested and dynamic environments by mapping 3D obstacle shapes with more boundary-level details. Abstract: In crowded urban environments where traffic is dense, current technologies struggle to oversee tight navigation, but surface-level understanding allows autonomous vehicles to safely assess proximity to surrounding obstacles. 3D or 2D scene mapping of the surrounding objects is an essential task in addressing the above problem. Despite its importance in dense vehicle traffic conditions, 3D scene reconstruction of object shapes with higher boundary level accuracy is not yet entirely considered in current literature. The signed distance function represents any shape through parameters that calculate the distance from any point in space to the closest obstacle surface, making it more efficient in terms of storage. In recent studies, researchers have started to formulate problems with implicit 3D reconstruction methods in the autonomous driving domain, highlighting the possibility of using the signed distance function to map obstacles effectively. 
This research addresses this gap by developing a learning-based 3D scene reconstruction methodology that leverages LiDAR data and a deep neural network to build static Signed Distance Function (SDF) maps. Unlike traditional polygonal representations, this approach has the potential to map 3D obstacle shapes with more boundary-level details. Our preliminary results demonstrate that this method would significantly enhance collision detection performance, particularly in congested and dynamic environments.
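To make the signed distance function concrete, here is a minimal sketch that uses analytic spheres as stand-in obstacles. The paper learns the SDF from LiDAR data with a neural network; the analytic spheres, the function names, and the `safety_margin` parameter below are illustrative assumptions only.

```python
import math

def sphere_sdf(point, center, radius):
    """Signed distance to a sphere: negative inside, zero on the surface,
    positive outside."""
    return math.dist(point, center) - radius

def scene_sdf(point, spheres):
    """Distance to the closest obstacle surface is the minimum over obstacles."""
    return min(sphere_sdf(point, center, radius) for center, radius in spheres)

def collides(point, spheres, safety_margin=0.0):
    """A query point collides if it lies within the margin of any surface."""
    return scene_sdf(point, spheres) <= safety_margin

# Two hypothetical obstacles: (center, radius) in metres.
obstacles = [((0.0, 0.0, 0.0), 1.0), ((5.0, 0.0, 0.0), 0.5)]
print(scene_sdf((3.0, 0.0, 0.0), obstacles))
print(collides((0.5, 0.0, 0.0), obstacles))
```

A collision query is a single function evaluation per point, which is the appeal of an SDF map over testing against explicit polygon meshes.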

[84] ADAM-Dehaze: Adaptive Density-Aware Multi-Stage Dehazing for Improved Object Detection in Foggy Conditions

Fatmah AlHindaassi,Mohammed Talha Alam,Fakhri Karray

Main category: cs.CV

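TL;DR: This paper proposes ADAM-Dehaze, a dehazing framework that dynamically selects a processing branch according to fog density, significantly improving visual restoration and object detection performance.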

Details Motivation: Adverse weather, fog in particular, severely degrades the visual information used by safety-critical applications such as autonomous driving and surveillance systems, calling for an efficient dehazing method that handles varying fog intensities. Method: ADAM-Dehaze, a density-aware dehazing framework, combines a lightweight Haze Density Estimation Network (HDEN) with three dedicated processing branches (Light, Medium, Complex) and introduces an adaptive loss that balances physical-model coherence and perceptual fidelity. Result: On Cityscapes and the real-world RTTS dataset, PSNR improves by up to 2.1 dB, FADE drops by 30%, object detection mAP rises by up to 13 points, and inference time is cut by 20%. Conclusion: By dynamically optimizing image dehazing and object detection, ADAM-Dehaze improves visual restoration under varying fog intensities and significantly improves performance metrics. Abstract: Adverse weather conditions, particularly fog, pose a significant challenge to autonomous vehicles, surveillance systems, and other safety-critical applications by severely degrading visual information. We introduce ADAM-Dehaze, an adaptive, density-aware dehazing framework that jointly optimizes image restoration and object detection under varying fog intensities. A lightweight Haze Density Estimation Network (HDEN) classifies each input as light, medium, or heavy fog. Based on this score, the system dynamically routes the image through one of three CORUN branches: Light, Medium, or Complex, each tailored to its haze regime. A novel adaptive loss balances physical-model coherence and perceptual fidelity, ensuring both accurate defogging and preservation of fine details. On Cityscapes and the real-world RTTS benchmark, ADAM-Dehaze improves PSNR by up to 2.1 dB, reduces FADE by 30 percent, and increases object detection mAP by up to 13 points, while cutting inference time by 20 percent. These results highlight the importance of intensity-specific processing and seamless integration with downstream vision tasks. Code available at: https://github.com/talha-alam/ADAM-Dehaze.
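As context for the PSNR gain quoted above, the following is a minimal sketch of the standard peak signal-to-noise ratio on flattened pixel lists. It is not taken from the ADAM-Dehaze repository; the function name and the 8-bit `max_value` default are assumptions for illustration.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

clean = [52, 60, 61, 70]
dehazed = [50, 62, 60, 71]
print(round(psnr(clean, dehazed), 2))
```

Because PSNR is logarithmic in the mean squared error, a 2.1 dB improvement corresponds to roughly a 38% reduction in MSE (a factor of 10^(-0.21), about 0.62).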

[85] EchoShot: Multi-Shot Portrait Video Generation

Jiahao Wang,Hualian Sheng,Sijia Cai,Weizhan Zhang,Caixia Yan,Yachuang Feng,Bing Deng,Jieping Ye

Main category: cs.CV

TL;DR: EchoShot is a scalable multi-shot portrait video generation framework built on a video diffusion model, enabling identity consistency, content controllability, and flexible extensions like reference-based personalization and long video synthesis.

Details Motivation: Current video diffusion models are primarily limited to single-shot creation, whereas real-world applications require multiple shots with identity consistency and flexible content control. This limitation motivates the development of a native multi-shot framework for portrait customization. Method: The paper proposes EchoShot, a multi-shot framework based on a foundation video diffusion model. It introduces shot-aware position embedding mechanisms within the video diffusion transformer architecture to model inter-shot variations and improve correspondence between visual content and textual descriptions. Additionally, the authors extend EchoShot for reference image-based personalized generation and long video synthesis. Result: Extensive evaluations show that EchoShot achieves superior identity consistency and attribute-level controllability in multi-shot portrait video generation. The framework also enables scalable and flexible applications such as personalized generation using reference images and infinite shot count video synthesis. Conclusion: EchoShot demonstrates potential as a foundational paradigm for general multi-shot video modeling, achieving superior identity consistency and attribute-level controllability in multi-shot portrait video generation. Abstract: Video diffusion models substantially boost the productivity of artistic workflows with high-quality portrait video generative capacity. However, prevailing pipelines are primarily constrained to single-shot creation, while real-world applications urge for multiple shots with identity consistency and flexible content controllability. In this work, we propose EchoShot, a native and scalable multi-shot framework for portrait customization built upon a foundation video diffusion model. 
To start with, we propose shot-aware position embedding mechanisms within the video diffusion transformer architecture to model inter-shot variations and establish intricate correspondence between multi-shot visual content and their textual descriptions. This simple yet effective design enables direct training on multi-shot video data without introducing additional computational overhead. To facilitate model training in the multi-shot scenario, we construct PortraitGala, a large-scale and high-fidelity human-centric video dataset featuring cross-shot identity consistency and fine-grained captions such as facial attributes, outfits, and dynamic motions. To further enhance applicability, we extend EchoShot to perform reference image-based personalized multi-shot generation and long video synthesis with infinite shot counts. Extensive evaluations demonstrate that EchoShot achieves superior identity consistency as well as attribute-level controllability in multi-shot portrait video generation. Notably, the proposed framework demonstrates potential as a foundational paradigm for general multi-shot video modeling.

[86] Assessing the impact of Binarization for Writer Identification in Greek Papyrus

Dominic Akt,Marco Peer,Florian Kleber

Main category: cs.CV

TL;DR: This paper analyzes how effective binarization methods, particularly those enhanced by deep learning and data augmentation, impact writer identification performance for challenging historical documents like Greek papyri.

Details Motivation: The motivation stems from the challenge of image binarization in historical documents like Greek papyri, where the background is often non-uniform and complex. Binarization is crucial in preventing models from learning irrelevant background features, which makes it essential to evaluate its effectiveness in writer identification pipelines. Method: The paper compares traditional binarization methods with state-of-the-art Deep Learning (DL) models. DL models are trained both with and without custom data augmentation techniques, and different model selection criteria are applied. The evaluation is conducted on the DIBCO 2019 dataset, and the impact of binarization on writer identification is assessed using a state-of-the-art writer identification approach. Result: Results show that data augmentation plays a significant role in enhancing the performance of DL-based binarization methods. Furthermore, the study finds a strong correlation between the effectiveness of binarization on the DIBCO 2019 dataset and the accuracy of subsequent writer identification. Conclusion: The study concludes that data augmentation significantly influences the performance of Deep Learning methods in binarization for writer identification. Additionally, there is a strong correlation between effective binarization of papyri documents and improved downstream writer identification performance. Abstract: This paper tackles the task of writer identification for Greek papyri. A common preprocessing step in writer identification pipelines is image binarization, which prevents the model from learning background features. This is challenging in historical documents, in our case Greek papyri, as background is often non-uniform, fragmented, and discolored with visible fiber structures. We compare traditional binarization methods to state-of-the-art Deep Learning (DL) models, evaluating the impact of binarization quality on subsequent writer identification performance. 
DL models are trained with and without a custom data augmentation technique, and different model selection criteria are applied. The performance of these binarization methods is then systematically evaluated on the DIBCO 2019 dataset. The impact of binarization on writer identification is subsequently evaluated using a state-of-the-art approach for writer identification. The results of this analysis highlight the influence of data augmentation for DL methods. Furthermore, findings indicate a strong correlation between binarization effectiveness on papyri documents of DIBCO 2019 and downstream writer identification performance.

[87] Privacy-Preserving in Connected and Autonomous Vehicles Through Vision to Text Transformation

Abdolazim Rezaei,Mehdi Sookhak,Ahmad Patooghy

Main category: cs.CV

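TL;DR: This paper introduces a new privacy-preserving framework that uses feedback-based reinforcement learning and vision-language models to convert images into textual descriptions, better protecting privacy in autonomous vehicles and roadside units.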

Details Motivation: Autonomous vehicles and roadside units routinely process privacy-sensitive data. Traditional techniques such as face blurring cannot fully prevent individuals from being tracked through features such as clothing, so a more effective privacy-preserving method is needed. Method: Feedback-based reinforcement learning and vision-language models convert images into semantically equivalent textual descriptions, and a hierarchical reinforcement learning strategy iteratively refines the generated text. Result: Experiments show that, compared with existing approaches, Unique Word Count increases by approximately 77% and Detail Density by around 50%. Conclusion: The paper proposes a privacy-preserving framework based on feedback-based reinforcement learning and vision-language models to address the privacy risks that arise when autonomous vehicles and roadside units process privacy-sensitive data. Experimental results indicate that the method improves textual quality and semantic accuracy while protecting privacy. Abstract: Connected and Autonomous Vehicles (CAVs) rely on a range of devices that often process privacy-sensitive data. Among these, roadside units play a critical role particularly through the use of AI-equipped (AIE) cameras for applications such as violation detection. However, the privacy risks associated with captured imagery remain a major concern, as such data can be misused for identity theft, profiling, or unauthorized commercial purposes. While traditional techniques such as face blurring and obfuscation have been applied to mitigate privacy risks, individual privacy remains at risk, as individuals can still be tracked using other features such as their clothing. This paper introduces a novel privacy-preserving framework that leverages feedback-based reinforcement learning (RL) and vision-language models (VLMs) to protect sensitive visual information captured by AIE cameras. The main idea is to convert images into semantically equivalent textual descriptions, ensuring that scene-relevant information is retained while visual privacy is preserved. A hierarchical RL strategy is employed to iteratively refine the generated text, enhancing both semantic accuracy and privacy. Evaluation results demonstrate significant improvements in both privacy protection and textual quality, with the Unique Word Count increasing by approximately 77% and Detail Density by around 50% compared to existing approaches.

[88] Visual symbolic mechanisms: Emergent symbol processing in vision language models

Rim Assouel,Declan Campbell,Taylor Webb

Main category: cs.CV

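TL;DR: This paper examines how vision language models (VLMs) solve the 'binding problem' through emergent symbolic mechanisms, and shows that binding errors stem from failures of these mechanisms.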

Details Motivation: Recent work has found that language models solve the 'binding problem' via a set of symbol-like, content-independent indices, but it is unclear whether vision language models (VLMs) employ similar mechanisms. The question matters given the persistent failures of VLMs on tasks that require binding. Method: The authors identify a set of emergent symbolic mechanisms that support binding in VLMs via a content-independent spatial indexing scheme. Result: Binding errors can be traced directly to failures in these mechanisms. Conclusion: The study sheds light on the mechanisms that support symbol-like processing in VLMs and suggests possible avenues for addressing the models' persistent binding failures. Abstract: To accurately process a visual scene, observers must bind features together to represent individual objects. This capacity is necessary, for instance, to distinguish an image containing a red square and a blue circle from an image containing a blue square and a red circle. Recent work has found that language models solve this 'binding problem' via a set of symbol-like, content-independent indices, but it is unclear whether similar mechanisms are employed by vision language models (VLMs). This question is especially relevant, given the persistent failures of VLMs on tasks that require binding. Here, we identify a set of emergent symbolic mechanisms that support binding in VLMs via a content-independent, spatial indexing scheme. Moreover, we find that binding errors can be traced directly to failures in these mechanisms. Taken together, these results shed light on the mechanisms that support symbol-like processing in VLMs, and suggest possible avenues for addressing the persistent binding failures exhibited by these models.

[89] Pediatric Pancreas Segmentation from MRI Scans with Deep Learning

Elif Keles,Merve Yazol,Gorkem Durak,Ziliang Hong,Halil Ertugrul Aktas,Zheyuan Zhang,Linkai Peng,Onkar Susladkar,Necati Guzelyel,Oznur Leman Boyunaga,Cemal Yazici,Mark Lowe,Aliye Uc,Ulas Bagci

Main category: cs.CV

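TL;DR: PanSegNet is an efficient deep learning algorithm that accurately segments the pancreas on MRI in children with acute or chronic pancreatitis and in healthy children, with high clinical reliability.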

Details Motivation: The study aims to evaluate and validate PanSegNet, a deep learning algorithm, for pediatric pancreas segmentation in children with acute pancreatitis, chronic pancreatitis, and healthy controls. Method: 84 pediatric MRI scans were collected retrospectively, and PanSegNet-generated segmentations were assessed using the Dice Similarity Coefficient (DSC) and the 95th percentile Hausdorff distance (HD95). Result: PanSegNet achieved DSC scores of 88% in healthy controls, 81% in AP, and 80% in CP, with HD95 values of 3.98 mm (controls), 9.85 mm (AP), and 15.67 mm (CP). Inter-observer kappa values indicated strong agreement. Conclusion: PanSegNet is the first validated deep learning solution for pancreatic MRI segmentation, achieving expert-level performance across healthy and diseased states. Abstract: Objective: Our study aimed to evaluate and validate PanSegNet, a deep learning (DL) algorithm for pediatric pancreas segmentation on MRI in children with acute pancreatitis (AP), chronic pancreatitis (CP), and healthy controls. Methods: With IRB approval, we retrospectively collected 84 MRI scans (1.5T/3T Siemens Aera/Verio) from children aged 2-19 years at Gazi University (2015-2024). The dataset includes healthy children as well as patients diagnosed with AP or CP based on clinical criteria. Pediatric and general radiologists manually segmented the pancreas, and the segmentations were then confirmed by a senior pediatric radiologist. PanSegNet-generated segmentations were assessed using Dice Similarity Coefficient (DSC) and 95th percentile Hausdorff distance (HD95). Cohen's kappa measured observer agreement. Results: Pancreas MRI T2W scans were obtained from 42 children with AP/CP (mean age: 11.73 +/- 3.9 years) and 42 healthy children (mean age: 11.19 +/- 4.88 years). PanSegNet achieved DSC scores of 88% (controls), 81% (AP), and 80% (CP), with HD95 values of 3.98 mm (controls), 9.85 mm (AP), and 15.67 mm (CP). Inter-observer kappa was 0.86 (controls), 0.82 (pancreatitis), and intra-observer agreement reached 0.88 and 0.81. Strong agreement was observed between automated and manual volumes (R^2 = 0.85 in controls, 0.77 in diseased), demonstrating clinical reliability. Conclusion: PanSegNet represents the first validated deep learning solution for pancreatic MRI segmentation, achieving expert-level performance across healthy and diseased states. 
The tool and algorithm, along with our annotated dataset, are freely available on GitHub and OSF, advancing accessible, radiation-free pediatric pancreatic imaging and fostering collaborative research in this underserved domain.
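
The two segmentation metrics reported above can be illustrated with a minimal sketch. It assumes masks are given as sets of (row, col) foreground pixels, computes distances over all foreground pixels rather than extracted surface points, and pools both directed distance lists before taking the 95th percentile, which is one common HD95 variant. The function names and the `spacing` parameter are hypothetical and are not taken from the PanSegNet code.

```python
import math


def dice(a, b):
    """Dice Similarity Coefficient between two sets of foreground pixels."""
    if not a and not b:
        return 1.0  # two empty masks are treated as a perfect match
    return 2.0 * len(a & b) / (len(a) + len(b))


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def hd95(a, b, spacing=1.0):
    """95th percentile Hausdorff distance (pooled variant, an assumption).

    spacing converts pixel units to millimetres (isotropic pixels assumed).
    Both masks must be non-empty.
    """
    def directed(src, dst):
        # Distance from every pixel in src to its nearest pixel in dst.
        return [min(math.dist(p, q) for q in dst) * spacing for p in src]

    return percentile(directed(a, b) + directed(b, a), 95)


# Two 2x2 squares that overlap in one column.
manual = {(0, 0), (0, 1), (1, 0), (1, 1)}
auto = {(0, 1), (0, 2), (1, 1), (1, 2)}
print(dice(manual, auto), hd95(manual, auto))
```

DSC rewards overlap, while HD95 penalizes boundary outliers. This is why the abstract can report a moderate drop in DSC for CP (80%) alongside a much larger HD95 (15.67 mm).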

[90] MoiréXNet: Adaptive Multi-Scale Demoiréing with Linear Attention Test-Time Training and Truncated Flow Matching Prior

Liangyan Li,Yimo Ning,Kevin Le,Wei Dong,Yunzhe Li,Jun Chen,Xiaohong Liu

Main category: cs.CV

TL;DR: This paper proposes a novel hybrid framework for image and video demoiréing by combining Maximum A Posteriori (MAP) estimation with deep learning techniques, integrating supervised learning and generative models to improve restoration performance.

Details Motivation: Existing methods struggle with nonlinear degradation processes in demoiréing tasks due to constrained model capacity, scarce training data, or limitations of generative models in nonlinear cases. This work aims to overcome these challenges by integrating advanced deep learning techniques with MAP estimation. Method: The paper proposes a hybrid Maximum A Posteriori (MAP) framework that combines a supervised learning model with linear attention Test-Time Training (TTT) modules and a Truncated Flow Matching Prior (TFMP) for refining outputs. Result: The hybrid framework achieves improved restoration performance by combining computational efficiency and refinement abilities, restoring high-frequency details while suppressing artifacts in demoiréing tasks. Conclusion: The proposed hybrid MAP-based framework successfully integrates supervised learning with generative model techniques, enhancing the restoration performance in image and video demoiréing tasks. Abstract: This paper introduces a novel framework for image and video demoiréing by integrating Maximum A Posteriori (MAP) estimation with advanced deep learning techniques. Demoiréing addresses inherently nonlinear degradation processes, which pose significant challenges for existing methods. Traditional supervised learning approaches either fail to remove moiré patterns completely or produce overly smooth results. This stems from constrained model capacity and scarce training data, which inadequately represent the clean image distribution and hinder accurate reconstruction of ground-truth images. While generative models excel in image restoration for linear degradations, they struggle with nonlinear cases such as demoiréing and often introduce artifacts. To address these limitations, we propose a hybrid MAP-based framework that integrates two complementary components. 
The first is a supervised learning model enhanced with efficient linear attention Test-Time Training (TTT) modules, which directly learn nonlinear mappings for RAW-to-sRGB demoiréing. The second is a Truncated Flow Matching Prior (TFMP) that further refines the outputs by aligning them with the clean image distribution, effectively restoring high-frequency details and suppressing artifacts. These two components combine the computational efficiency of linear attention with the refinement abilities of generative models, resulting in improved restoration performance.

[91] Beyond Audio and Pose: A General-Purpose Framework for Video Synchronization

Yosub Shin,Igor Molybog

Main category: cs.CV

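TL;DR: This paper proposes VideoSync, a general-purpose video synchronization framework, and shows with new datasets and a corrected evaluation protocol that it outperforms existing methods, addressing prior work's reliance on specific signals and its evaluation biases.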

Details Motivation: Existing video synchronization methods rely on audio cues or specific visual events, which limits their applicability; in addition, the field lacks general and reproducible benchmarks. Method: The paper introduces the VideoSync framework, builds new datasets covering single-human, multi-human, and non-human scenarios, and analyzes and corrects biases in prior work. It also uses a convolutional neural network (CNN) for synchronization offset prediction. Result: Under fair experimental conditions, VideoSync outperforms existing methods, including SeSyn-Net, and the analysis reveals biases in prior SOTA work. A CNN-based model proved to be the most effective method for synchronization offset prediction. Conclusion: VideoSync is a video synchronization framework that does not depend on a specific feature extraction method; with a more rigorous evaluation protocol and diverse datasets it outperforms existing methods, offering greater generality and robustness for real-world video synchronization. Abstract: Video synchronization, the alignment of multiple video streams capturing the same event from different angles, is crucial for applications such as reality TV show production, sports analysis, surveillance, and autonomous systems. Prior work has heavily relied on audio cues or specific visual events, limiting applicability in diverse settings where such signals may be unreliable or absent. Additionally, existing benchmarks for video synchronization lack generality and reproducibility, restricting progress in the field. In this work, we introduce VideoSync, a video synchronization framework that operates independently of specific feature extraction methods, such as human pose estimation, enabling broader applicability across different content types. We evaluate our system on newly composed datasets covering single-human, multi-human, and non-human scenarios, providing both the methodology and code for dataset creation to establish reproducible benchmarks. Our analysis reveals biases in prior SOTA work, particularly in SeSyn-Net's preprocessing pipeline, leading to inflated performance claims. We correct these biases and propose a more rigorous evaluation framework, demonstrating that VideoSync outperforms existing approaches, including SeSyn-Net, under fair experimental conditions. Additionally, we explore various synchronization offset prediction methods, identifying a convolutional neural network (CNN)-based model as the most effective. Our findings advance video synchronization beyond domain-specific constraints, making it more generalizable and robust for real-world applications.

[92] Polyline Path Masked Attention for Vision Transformer

Zhongchen Zhao,Chaodong Xiao,Hui Lin,Qi Xie,Lei Zhang,Deyu Meng

Main category: cs.CV

TL;DR: This paper introduces PPMA, a novel architecture combining Vision Transformers and Mamba2 through a polyline path mask, enhancing spatial adjacency modeling while maintaining global dependency capabilities, resulting in state-of-the-art performance across multiple vision tasks.

Details Motivation: Global dependency modeling and spatial position modeling are key challenges in deep learning architectures. While Vision Transformers excel at global dependency modeling, Mamba2 shows promise in modeling spatial adjacency through structured masks. PPMA aims to combine these strengths for improved performance in computer vision tasks. Method: The paper proposes Polyline Path Masked Attention (PPMA), combining Vision Transformers and Mamba2 by introducing a 2D polyline path scanning strategy to better preserve adjacency relationships among image tokens. Theoretical analysis and an efficient algorithm are developed for computing the polyline path mask, which is then embedded into the self-attention mechanism of ViTs. Result: Experiments show that PPMA outperforms existing approaches on tasks such as image classification, object detection, and semantic segmentation. For instance, PPMA-T/S/B models achieved 48.7%/51.1%/52.3% mIoU on ADE20K, surpassing RMT-T/S/B by 0.7%/1.3%/0.3% respectively. Conclusion: PPMA successfully integrates the self-attention mechanism of ViTs with an enhanced structured mask based on Mamba2, achieving superior performance in computer vision tasks compared to previous state-of-the-art models. Abstract: Global dependency modeling and spatial position modeling are two core issues of the foundational architecture design in current deep learning frameworks. Recently, Vision Transformers (ViTs) have achieved remarkable success in computer vision, leveraging the powerful global dependency modeling capability of the self-attention mechanism. Furthermore, Mamba2 has demonstrated its significant potential in natural language processing tasks by explicitly modeling the spatial adjacency prior through the structured mask. 
In this paper, we propose Polyline Path Masked Attention (PPMA) that integrates the self-attention mechanism of ViTs with an enhanced structured mask of Mamba2, harnessing the complementary strengths of both architectures. Specifically, we first ameliorate the traditional structured mask of Mamba2 by introducing a 2D polyline path scanning strategy and derive its corresponding structured mask, polyline path mask, which better preserves the adjacency relationships among image tokens. Notably, we conduct a thorough theoretical analysis on the structural characteristics of the proposed polyline path mask and design an efficient algorithm for the computation of the polyline path mask. Next, we embed the polyline path mask into the self-attention mechanism of ViTs, enabling explicit modeling of spatial adjacency prior. Extensive experiments on standard benchmarks, including image classification, object detection, and segmentation, demonstrate that our model outperforms previous state-of-the-art approaches based on both state-space models and Transformers. For example, our proposed PPMA-T/S/B models achieve 48.7%/51.1%/52.3% mIoU on the ADE20K semantic segmentation task, surpassing RMT-T/S/B by 0.7%/1.3%/0.3%, respectively. Code is available at https://github.com/zhongchenzhao/PPMA.

[93] Heterogeneous-Modal Unsupervised Domain Adaptation via Latent Space Bridging

Jiawen Yang,Shuhao Chen,Yucong Duan,Ke Tang,Yu Zhang

Main category: cs.CV

TL;DR: This paper proposes Latent Space Bridging (LSB) for Heterogeneous-Modal Unsupervised Domain Adaptation (HMUDA), achieving excellent results in semantic segmentation across different modalities.

Details Motivation: Existing UDA methods struggle with domain adaptation when dealing with entirely distinct modalities. Method: Latent Space Bridging (LSB) uses a dual-branch architecture with feature consistency and domain alignment losses. Result: LSB achieves state-of-the-art performance on six benchmark datasets. Conclusion: LSB is an effective method under the HMUDA setting for transferring knowledge between different modalities in semantic segmentation tasks. Abstract: Unsupervised domain adaptation (UDA) methods effectively bridge domain gaps but struggle when the source and target domains belong to entirely distinct modalities. To address this limitation, we propose a novel setting called Heterogeneous-Modal Unsupervised Domain Adaptation (HMUDA), which enables knowledge transfer between completely different modalities by leveraging a bridge domain containing unlabeled samples from both modalities. To learn under the HMUDA setting, we propose Latent Space Bridging (LSB), a specialized framework designed for the semantic segmentation task. Specifically, LSB utilizes a dual-branch architecture, incorporating a feature consistency loss to align representations across modalities and a domain alignment loss to reduce discrepancies between class centroids across domains. Extensive experiments conducted on six benchmark datasets demonstrate that LSB achieves state-of-the-art performance.

[94] LBMamba: Locally Bi-directional Mamba

Jingwei Zhang,Xi Han,Hong Qin,Mahdi S. Hosseini,Dimitris Samaras

Main category: cs.CV

TL;DR: This paper introduces LBMamba and LBVim, improving Mamba-based models by enabling efficient bi-directional processing without additional computational cost, resulting in better performance on vision and medical image analysis tasks.

Details Motivation: The motivation is to overcome the unidirectional limitation of Mamba while preserving its efficiency advantages, eliminating the need for costly global backward scans that increase computational load. Method: The paper introduces LBMamba, which incorporates a lightweight local backward scan within the forward selective scan. It then proposes LBVim, a scalable vision backbone that alternates scan directions every two layers to achieve a full receptive field. Result: LBVim demonstrates superior performance across multiple tasks: it achieves higher accuracy on ImageNet-1K classification, improved mIoU on ADE20K segmentation, and better AP scores on COCO detection. In pathology applications, integrating LBMamba into MambaMIL results in improvements in AUC, F1, and accuracy on WSI datasets. Conclusion: The paper concludes that LBVim, built on LBMamba, effectively recovers a global receptive field without extra computational load, offering better performance and efficiency compared to existing Mamba-based methods. Abstract: Mamba, a State Space Model (SSM) that accelerates training by recasting recurrence as a parallel selective scan, has recently emerged as a linearly-scaling, efficient alternative to self-attention. Because of its unidirectional nature, each state in Mamba only has information about its previous states and is blind to the states that follow. Current Mamba-based computer-vision methods typically overcome this limitation by augmenting Mamba's global forward scan with a global backward scan, forming a bi-directional scan that restores a full receptive field. However, this operation doubles the computational load, eroding much of the efficiency advantage that Mamba originally had. To eliminate these extra scans, we introduce LBMamba, a locally bi-directional SSM block that embeds a lightweight locally backward scan inside the forward selective scan and executes it entirely in per-thread registers. 
Building on LBMamba, we present LBVim, a scalable vision backbone that alternates scan directions every two layers to recover a global receptive field without extra backward sweeps. We validate the versatility of our approach on both natural images and whole slide images (WSIs). We show that our LBVim consistently offers a superior performance-throughput trade-off. That is, under the same throughput, LBVim achieves 0.8% to 1.6% higher top-1 accuracy on the ImageNet-1K classification dataset, 0.6% to 2.7% higher mIoU on the ADE20K semantic segmentation dataset, and 0.9% higher APb and 1.1% higher APm on the COCO detection dataset. We also integrate LBMamba into the SOTA pathology multiple instance learning (MIL) approach, MambaMIL, which uses a single-directional scan. Experiments on 3 public WSI classification datasets show that our method achieves relative improvements of up to 3.06% in AUC, 3.39% in F1, and 1.67% in accuracy.

[95] Towards Classifying Histopathological Microscope Images as Time Series Data

Sungrae Hong,Hyeongmin Park,Youngsin Ko,Sol Lee,Bryan Wong,Mun Yong Yi

Main category: cs.CV

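TL;DR: This paper proposes a new approach that classifies microscopic pathology images as time series data, improving classification performance with Dynamic Time-series Warping (DTW) and attention-based pooling.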

Details Motivation: Microscopic pathology images are an important basis for cancer diagnosis, yet the deep learning community has paid little attention to them. This paper aims to address the challenges posed by the manual acquisition and weakly labeled nature of microscope images and to improve their usefulness in medical analysis. Method: Dynamic Time-series Warping (DTW) is used to align image sequences of varying lengths, and attention-based pooling is used to simultaneously predict the class of the case. Result: Experiments show that the method performs well against a range of baselines and achieves stable, reliable results through different inference strategies. Ablation studies further validate the contribution of each component. Conclusion: The work contributes to medical image analysis by making effective use of microscopic images and raising their classification performance to a trustworthy level. Abstract: As the frontline data for cancer diagnosis, microscopic pathology images are fundamental for providing patients with rapid and accurate treatment. However, despite their practical value, the deep learning community has largely overlooked their usage. This paper proposes a novel approach to classifying microscopy images as time series data, addressing the unique challenges posed by their manual acquisition and weakly labeled nature. The proposed method fits image sequences of varying lengths to a fixed-length target by leveraging Dynamic Time-series Warping (DTW). Attention-based pooling is employed to predict the class of the case simultaneously. We demonstrate the effectiveness of our approach by comparing performance with various baselines and showcasing the benefits of using various inference strategies in achieving stable and reliable results. Ablation studies further validate the contribution of each component. Our approach contributes to medical image analysis by not only embracing microscopic images but also lifting them to a trustworthy level of performance.
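
As a rough illustration of the alignment step, the sketch below implements classic dynamic time warping on scalar sequences. It is the textbook algorithm and is not the authors' pipeline: real inputs would be per-image feature vectors, and how the paper uses the warping path to fit a sequence to a fixed-length target is not specified in the abstract. The function name `dtw` and the absolute-difference cost are assumptions.

```python
def dtw(a, b):
    """Classic DTW: returns (total cost, warping path) for two sequences.

    The path is a list of (i, j) index pairs aligning a[i] with b[j].
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance (assumed L1)
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats
                                 cost[i][j - 1],      # b[j-1] repeats
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover one optimal alignment.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# A 3-step sequence aligned to a 4-slot target: element 1 is stretched.
total, path = dtw([1, 2, 3], [1, 2, 2, 3])
print(total, path)
```

Reading the path column-wise gives, for each slot of the longer target, which source element fills it. This is one plausible way a variable-length sequence could be mapped onto a fixed-length one.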

[96] Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization

Cong Wang,Zexuan Deng,Zhiwei Jiang,Fei Shen,Yafeng Yin,Shiwei Gan,Zifeng Cheng,Shiping Ge,Qing Gu

Main category: cs.CV

TL;DR: SignViP improves sign language video generation by incorporating multiple fine-grained conditions, resulting in more natural and expressive videos.

Details Motivation: Existing methods for Sign Language Video Generation rely on coarse conditions like skeleton sequences, which limit the naturalness and expressiveness of the generated videos. Method: SignViP uses a discrete tokenization paradigm to integrate fine-grained conditions like poses and 3D hands. It consists of three components: Sign Video Diffusion Model, Finite Scalar Quantization (FSQ) Autoencoder, and Multi-Condition Token Translator. Result: SignViP achieves state-of-the-art results across several metrics including video quality, temporal coherence, and semantic fidelity. Conclusion: SignViP is introduced as a new framework for Sign Language Video Generation that enhances generation fidelity by using multiple fine-grained conditions. It achieves state-of-the-art performance in video quality, temporal coherence, and semantic fidelity. Abstract: Sign Language Video Generation (SLVG) seeks to generate identity-preserving sign language videos from spoken language texts. Existing methods primarily rely on a single coarse condition (e.g., skeleton sequences) as the intermediary to bridge the translation model and the video generation model, which limits both the naturalness and expressiveness of the generated videos. To overcome these limitations, we propose SignViP, a novel SLVG framework that incorporates multiple fine-grained conditions for improved generation fidelity. Rather than directly translating error-prone high-dimensional conditions, SignViP adopts a discrete tokenization paradigm to integrate and represent fine-grained conditions (i.e., fine-grained poses and 3D hands). SignViP contains three core components. (1) Sign Video Diffusion Model is jointly trained with a multi-condition encoder to learn continuous embeddings that encapsulate fine-grained motion and appearance. 
(2) Finite Scalar Quantization (FSQ) Autoencoder is further trained to compress and quantize these embeddings into discrete tokens for compact representation of the conditions. (3) Multi-Condition Token Translator is trained to translate spoken language text to discrete multi-condition tokens. During inference, Multi-Condition Token Translator first translates the spoken language text into discrete multi-condition tokens. These tokens are then decoded to continuous embeddings by FSQ Autoencoder, which are subsequently injected into Sign Video Diffusion Model to guide video generation. Experimental results show that SignViP achieves state-of-the-art performance across metrics, including video quality, temporal coherence, and semantic fidelity. The code is available at https://github.com/umnooob/signvip/.
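
The quantization step in component (2) can be sketched as follows. This is a simplified reading of finite scalar quantization, not SignViP's implementation: each latent dimension is bounded with tanh and rounded to one of a small number of integer levels, and the per-dimension codes are combined into a single token index. Only odd level counts are handled here, the straight-through gradient used during training is omitted, and the names `fsq_quantize` and `fsq_index` are hypothetical.

```python
import math


def fsq_quantize(z, levels):
    """Round each bounded latent dimension to an integer code.

    z: list of floats; levels: odd number of levels per dimension
    (odd-only is a simplification made for this sketch).
    Codes for a dimension with L levels lie in [-(L-1)/2, (L-1)/2].
    """
    codes = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2
        bounded = half * math.tanh(value)  # squashes into (-half, half)
        codes.append(int(round(bounded)))
    return codes


def fsq_index(codes, levels):
    """Map per-dimension codes to one token id via mixed-radix encoding."""
    index = 0
    for code, num_levels in zip(codes, levels):
        index = index * num_levels + (code + (num_levels - 1) // 2)
    return index


levels = [5, 5, 5]                       # implicit codebook of 5**3 = 125 tokens
codes = fsq_quantize([0.0, 10.0, -10.0], levels)
print(codes, fsq_index(codes, levels))
```

Because the codebook is implied by the level grid, there are no learned codebook vectors to maintain. The decoder simply receives the rounded values, which is what allows tokens to be decoded back to continuous embeddings at inference.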

[97] Adversarial Attacks and Detection in Visual Place Recognition for Safer Robot Navigation

Connor Malone,Owen Claxton,Iman Shames,Michael Milford

Main category: cs.CV

TL;DR: This paper investigates the impact of adversarial attacks on Visual Place Recognition (VPR) systems and proposes a framework integrating Adversarial Attack Detectors (AADs) to enhance robot navigation reliability, showing that even moderately accurate AADs can substantially reduce localization errors.

Details Motivation: Stand-alone Visual Place Recognition (VPR) systems have vulnerabilities to adversarial attacks which can result in severe consequences for robot navigation; therefore, there is a need to develop effective defense mechanisms like AAD. Method: The authors conducted an extensive analysis of four common adversarial attacks and four novel VPR-specific attacks on VPR localization performance. They proposed a new experimental paradigm integrating AAD into the VPR navigation loop to evaluate its effectiveness. Result: The study found that incorporating AADs with varying detection accuracies could significantly improve VPR performance over a baseline, achieving up to a ~50% reduction in mean along-track localization error with True Positive and False Positive detection rates of 75% and up to 25%, respectively. The work also examined metrics such as Along-Track Error and Time Under Attack, and investigated the efficacy of FGSM adversarial attack for VPR. Conclusion: This paper concludes that Adversarial Attack Detectors (AADs) are necessary for trustworthy robot navigation in real-world systems, and it provides quantitative requirements for system design. Abstract: Stand-alone Visual Place Recognition (VPR) systems have little defence against a well-designed adversarial attack, which can lead to disastrous consequences when deployed for robot navigation. This paper extensively analyzes the effect of four adversarial attacks common in other perception tasks and four novel VPR-specific attacks on VPR localization performance. We then propose how to close the loop between VPR, an Adversarial Attack Detector (AAD), and active navigation decisions by demonstrating the performance benefit of simulated AADs in a novel experiment paradigm -- which we detail for the robotics community to use as a system framework. 
In the proposed experiment paradigm, we see that the addition of AADs across a range of detection accuracies can improve performance over baseline, demonstrating that a significant improvement, such as a ~50% reduction in the mean along-track localization error, can be achieved with True Positive and False Positive detection rates of only 75% and up to 25% respectively. We examine a variety of metrics including: Along-Track Error, Percentage of Time Attacked, Percentage of Time in an 'Unsafe' State, and Longest Continuous Time Under Attack. Expanding further on these results, we provide the first investigation into the efficacy of the Fast Gradient Sign Method (FGSM) adversarial attack for VPR. The analysis in this work highlights the need for AADs in real-world systems for trustworthy navigation, and informs quantitative requirements for system design.
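
For readers unfamiliar with FGSM, the sketch below shows the core update, x_adv = x + eps * sign(dL/dx), on a toy logistic classifier whose input gradient has a closed form. The toy model is a stand-in chosen so the example runs without a deep learning framework. A VPR attack would instead differentiate a loss on the descriptor network, and that setup is not described in the abstract. All names and numbers are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def input_gradient(x, y, w, b):
    """Gradient of binary cross-entropy w.r.t. the input x.

    For p = sigmoid(w.x + b), dL/dx = (p - y) * w.
    """
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(x, y, w, b, eps):
    """One-step FGSM: move each input coordinate by eps in the
    direction that increases the loss."""
    grad = input_gradient(x, y, w, b)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]


x, y = [1.0, -1.0], 1            # correctly classified example
w, b = [2.0, -1.0], 0.0          # toy model weights (assumed)
x_adv = fgsm(x, y, w, b, eps=0.5)
print(x_adv)                     # confidence in the true class drops
```

The attack needs only one gradient evaluation, which makes it a cheap baseline against which a detector's true and false positive rates can be measured.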

[98] DIGMAPPER: A Modular System for Automated Geologic Map Digitization

Weiwei Duan,Michael P. Gerlek,Steven N. Minton,Craig A. Knoblock,Fandel Lin,Theresa Chen,Leeje Jang,Sofia Kirsanova,Zekun Li,Yijun Lin,Yao-Yi Chiang

Main category: cs.CV

TL;DR: DIGMAPPER is an automated system designed to digitize geologic maps efficiently by leveraging advanced deep learning techniques, enabling faster creation of geospatial datasets critical for mineral resource assessment.

Details Motivation: Historical geologic maps contain valuable geospatial information essential for assessing mineral resources related to renewable energy, electric vehicles, and national security. However, the digitization process is currently labor-intensive and time-consuming, necessitating an automated solution. Method: The paper introduces DIGMAPPER, a modular and scalable system integrating deep learning models for map layout analysis, feature extraction, and georeferencing. It employs techniques such as in-context learning with large language models, synthetic data generation, and transformer-based models to overcome challenges like limited training data. Result: Evaluations on over 100 annotated maps from the DARPA-USGS dataset show that DIGMAPPER achieves high accuracy in extracting polygon, line, and point features, as well as reliable georeferencing performance. The system has been deployed at USGS. Conclusion: DIGMAPPER successfully automates the digitization of geologic maps, significantly accelerating the creation of analysis-ready geospatial datasets and supporting large-scale critical mineral assessments. Abstract: Historical geologic maps contain rich geospatial information, such as rock units, faults, folds, and bedding planes, that is critical for assessing mineral resources essential to renewable energy, electric vehicles, and national security. However, digitizing maps remains a labor-intensive and time-consuming task. We present DIGMAPPER, a modular, scalable system developed in collaboration with the United States Geological Survey (USGS) to automate the digitization of geologic maps. DIGMAPPER features a fully dockerized, workflow-orchestrated architecture that integrates state-of-the-art deep learning models for map layout analysis, feature extraction, and georeferencing. 
To overcome challenges such as limited training data and complex visual content, our system employs innovative techniques, including in-context learning with large language models, synthetic data generation, and transformer-based models. Evaluations on over 100 annotated maps from the DARPA-USGS dataset demonstrate high accuracy across polygon, line, and point feature extraction, and reliable georeferencing performance. Deployed at USGS, DIGMAPPER significantly accelerates the creation of analysis-ready geospatial datasets, supporting national-scale critical mineral assessments and broader geoscientific applications.

[99] EndoMUST: Monocular Depth Estimation for Robotic Endoscopy via End-to-end Multi-step Self-supervised Training

Liangjing Shao,Linxin Bai,Chenkang Du,Xinrong Chen

Main category: cs.CV

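TL;DR: This paper proposes a new depth estimation method for robot-assisted endoscopy that uses step-wise training to address lighting variations and information interference, and performs well on multiple datasets.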

Details Motivation: Because endoscopic scenes suffer from lighting variations and sparse textures, existing methods have introduced techniques including optical flow, appearance flow, and intrinsic image decomposition, but designing an effective training strategy for multiple modules remains a key challenge. This paper therefore proposes a new solution. Method: Each epoch of end-to-end training is divided into three steps: optical flow registration, multiscale image decomposition, and multiple transformation alignments. At each step only the relevant network modules are trained, avoiding interference from irrelevant information. Result: Based on parameter-efficient finetuning of a foundation model, the proposed method achieves 4% to 10% lower error on self-supervised depth estimation on the SCARED dataset and zero-shot depth estimation on the Hamlyn dataset. Conclusion: The paper proposes a new multistep efficient finetuning framework that addresses illumination issues and information interference in self-supervised endoscopic depth estimation and achieves state-of-the-art performance on multiple datasets. Abstract: Monocular depth estimation and ego-motion estimation are significant tasks for scene perception and navigation in stable, accurate and efficient robot-assisted endoscopy. To tackle lighting variations and sparse textures in endoscopic scenes, multiple techniques including optical flow, appearance flow and intrinsic image decomposition have been introduced into the existing methods. However, an effective training strategy for multiple modules is still critical to deal with both illumination issues and information interference for self-supervised depth estimation in endoscopy. Therefore, a novel framework with multistep efficient finetuning is proposed in this work. In each epoch of end-to-end training, the process is divided into three steps, including optical flow registration, multiscale image decomposition and multiple transformation alignments. At each step, only the related networks are trained without interference of irrelevant information. Based on parameter-efficient finetuning on the foundation model, the proposed method achieves state-of-the-art performance on self-supervised depth estimation on SCARED dataset and zero-shot depth estimation on Hamlyn dataset, with 4% to 10% lower error. The evaluation code of this work has been published on https://github.com/BaymaxShao/EndoMUST.

[100] PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models

Tianchen Zhao,Ke Hong,Xinhao Yang,Xuefeng Xiao,Huixia Li,Feng Ling,Ruiqi Xie,Siqi Chen,Hongyu Zhu,Yichong Zhang,Yu Wang

Main category: cs.CV

TL;DR: This paper introduces PAROAttention, a method that reorganizes attention patterns to improve efficiency in visual generation tasks, achieving significant speedups without sacrificing performance.

Details Motivation: Quadratic complexity of attention mechanisms causes high costs in high-resolution image or video generation; existing techniques like sparsification and quantization face challenges under low density and reduced bitwidths. Method: The paper proposes Pattern-Aware token ReOrdering (PARO), which transforms irregular attention patterns into a hardware-friendly block-wise pattern, improving sparsification and quantization efficiency. Result: PAROAttention achieves video and image generation with lossless metrics, nearly identical results to full-precision baselines, and 1.9x to 2.7x end-to-end latency speedup at INT8/INT4 precision. Conclusion: PAROAttention successfully reduces computational and memory costs in visual generation by reorganizing attention patterns, achieving lossless performance with lower density and bitwidth. Abstract: In visual generation, the quadratic complexity of attention mechanisms results in high memory and computational costs, especially for longer token sequences required in high-resolution image or multi-frame video generation. To address this, prior research has explored techniques such as sparsification and quantization. However, these techniques face significant challenges under low density and reduced bitwidths. Through systematic analysis, we identify that the core difficulty stems from the dispersed and irregular characteristics of visual attention patterns. Therefore, instead of introducing specialized sparsification and quantization design to accommodate such patterns, we propose an alternative strategy: reorganizing the attention pattern to alleviate the challenges. Inspired by the local aggregation nature of visual feature extraction, we design a novel Pattern-Aware token ReOrdering (PARO) technique, which unifies the diverse attention patterns into a hardware-friendly block-wise pattern. This unification substantially simplifies and enhances both sparsification and quantization. 
We evaluate the performance-efficiency trade-offs of various design choices and finalize a methodology tailored for the unified pattern. Our approach, PAROAttention, achieves video and image generation with lossless metrics, and results nearly identical to full-precision (FP) baselines, while operating at notably lower density (~20%-30%) and bitwidth (INT8/INT4), achieving a 1.9x to 2.7x end-to-end latency speedup.
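
Token reordering can be applied without changing the result because full attention is permutation-equivariant: permuting queries, keys, and values with the same permutation simply permutes the output rows. The sketch below checks this property on tiny inputs. It does not show how PARO chooses its permutation, nor the block-wise sparsification and quantization that follow, since those details are not given in the abstract. The helper names are hypothetical.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Plain scaled dot-product attention on lists of row vectors."""
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale
                           for kj in k])
        out.append([sum(wj * vj[c] for wj, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def reorder(rows, perm):
    """Place original row perm[i] at position i."""
    return [rows[p] for p in perm]


def inverse(perm):
    inv = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inv[old_pos] = new_pos
    return inv


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]
k = [[0.2, 0.1], [1.0, -0.5], [0.0, 0.3], [0.7, 0.7]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
perm = [2, 0, 3, 1]  # arbitrary reordering chosen for the demo

# Attend in the reordered layout, then restore the original token order.
restored = reorder(attention(reorder(q, perm), reorder(k, perm),
                             reorder(v, perm)), inverse(perm))
print(restored)
```

The reordered layout changes only where large attention weights sit in the score matrix, so a permutation that gathers them into contiguous blocks can make block-sparse and low-bitwidth kernels more effective while leaving the exact-attention output unchanged.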

[101] Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation

Yong Liu,SongLi Wu,Sule Bai,Jiahao Wang,Yitong Wang,Yansong Tang

Main category: cs.CV

TL;DR: This paper introduces OpenBench, a new benchmark for open-vocabulary segmentation, and proposes OVSNet, a method that achieves state-of-the-art results by fusing heterogeneous features and expanding the training space.

Details Motivation: Existing test sets are limited in measuring models' comprehension of 'open-vocabulary' concepts as their semantic space closely resembles the training space. This motivates the need for a new benchmark to better assess the model's ability to understand and segment a wide range of real-world concepts. Method: OVSNet improves segmentation performance through elaborate fusion of heterogeneous features and cost-free expansion of the training space. Result: Testing existing methods on OpenBench showed that their performance diverges from conclusions drawn on existing test sets, highlighting the importance of the new benchmark. Conclusion: The proposed OVSNet method achieves state-of-the-art results on both existing datasets and the newly proposed OpenBench benchmark, demonstrating its effectiveness in improving segmentation performance for diverse and open scenarios. Abstract: Open-vocabulary segmentation aims to achieve segmentation of arbitrary categories given unlimited text inputs as guidance. To achieve this, recent works have focused on developing various technical routes to exploit the potential of large-scale pre-trained vision-language models and have made significant progress on existing benchmarks. However, we find that existing test sets are limited in measuring the models' comprehension of "open-vocabulary" concepts, as their semantic space closely resembles the training space, even with many overlapping categories. To this end, we present a new benchmark named OpenBench that differs significantly from the training semantics. It is designed to better assess the model's ability to understand and segment a wide range of real-world concepts. When testing existing methods on OpenBench, we find that their performance diverges from the conclusions drawn on existing test sets. In addition, we propose a method named OVSNet to improve the segmentation performance for diverse and open scenarios. 
Through elaborate fusion of heterogeneous features and cost-free expansion of the training space, OVSNet achieves state-of-the-art results on both existing datasets and our proposed OpenBench. Corresponding analyses demonstrate the soundness and effectiveness of our proposed benchmark and method.

[102] STAR-Pose: Efficient Low-Resolution Video Human Pose Estimation via Spatial-Temporal Adaptive Super-Resolution

Yucheng Jin,Jinyan Chen,Ziyue He,Baojun Han,Furan An

Main category: cs.CV

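TL;DR: STAR-Pose is an efficient video human pose estimation framework that tackles pose estimation in low-resolution video, combining strong accuracy with fast inference.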

Details Motivation: Existing methods either assume high-quality inputs or rely on computationally expensive cascaded processing, which limits their use in resource-constrained environments. An efficient, high-quality solution is therefore needed. Method: STAR-Pose uses a spatial-temporal Transformer with LeakyReLU-modified linear attention, an adaptive fusion module that integrates a parallel CNN branch for local texture enhancement, and a pose-aware compound loss. Result: Extensive experiments on mainstream video HPE datasets show that STAR-Pose outperforms existing methods, with up to 5.2% mAP improvement under extremely low-resolution (64x48) conditions and 2.8x to 4.4x faster inference than cascaded approaches. Conclusion: STAR-Pose is a spatial-temporal adaptive super-resolution framework designed for video human pose estimation; it performs well under low-resolution conditions and offers faster inference. Abstract: Human pose estimation in low-resolution videos presents a fundamental challenge in computer vision. Conventional methods either assume high-quality inputs or employ computationally expensive cascaded processing, which limits their deployment in resource-constrained environments. We propose STAR-Pose, a spatial-temporal adaptive super-resolution framework specifically designed for video-based human pose estimation. Our method features a novel spatial-temporal Transformer with LeakyReLU-modified linear attention, which efficiently captures long-range temporal dependencies. Moreover, it is complemented by an adaptive fusion module that integrates a parallel CNN branch for local texture enhancement. We also design a pose-aware compound loss to achieve task-oriented super-resolution. This loss guides the network to reconstruct structural features that are most beneficial for keypoint localization, rather than optimizing purely for visual quality. Extensive experiments on several mainstream video HPE datasets demonstrate that STAR-Pose outperforms existing approaches. It achieves up to 5.2% mAP improvement under extremely low-resolution (64x48) conditions while delivering 2.8x to 4.4x faster inference than cascaded approaches.
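
The abstract does not spell out how the LeakyReLU modification enters the attention, so the following is only a minimal sketch of generic kernelized linear attention, the family of mechanisms the abstract refers to. The feature map `elu_plus_one` is a commonly used stand-in and an assumption here, not the STAR-Pose formulation; all function names are hypothetical. The sketch shows why the cost grows linearly with sequence length: the key/value summary is accumulated once and reused for every query.

```python
import math


def elu_plus_one(x):
    # Positive feature map elu(x) + 1 (assumed stand-in, not the paper's LeakyReLU variant).
    return x + 1.0 if x > 0 else math.exp(x)


def feature_map(vec):
    return [elu_plus_one(x) for x in vec]


def linear_attention(queries, keys, values):
    """Kernelized attention: one pass over keys/values, then one pass over queries."""
    d = len(keys[0])
    d_v = len(values[0])
    kv = [[0.0] * d_v for _ in range(d)]  # running sum of phi(k_j) * v_j^T
    z = [0.0] * d                         # running sum of phi(k_j)
    for k, v in zip(keys, values):
        phi_k = feature_map(k)
        for a in range(d):
            z[a] += phi_k[a]
            for b in range(d_v):
                kv[a][b] += phi_k[a] * v[b]
    outputs = []
    for q in queries:
        phi_q = feature_map(q)
        norm = sum(phi_q[a] * z[a] for a in range(d))
        outputs.append(
            [sum(phi_q[a] * kv[a][b] for a in range(d)) / norm for b in range(d_v)]
        )
    return outputs


def quadratic_attention(queries, keys, values):
    """Reference implementation with explicit pairwise weights, for comparison only."""
    d_v = len(values[0])
    outputs = []
    for q in queries:
        phi_q = feature_map(q)
        weights = [sum(a * b for a, b in zip(phi_q, feature_map(k))) for k in keys]
        total = sum(weights)
        outputs.append(
            [sum(w * v[b] for w, v in zip(weights, values)) / total for b in range(d_v)]
        )
    return outputs
```

Both functions compute the same result; the linear version never materializes the query-by-key weight matrix, which is what makes long temporal windows affordable.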

[103] TD3Net: A Temporal Densely Connected Multi-Dilated Convolutional Network for Lipreading

Byung Hoon Lee,Wooseok Shin,Sung Won Han

Main category: cs.CV

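TL;DR: This paper proposes TD3Net, which improves temporal convolutional networks to better model continuous lip movements in lipreading, yielding better performance.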

Details Motivation: Conventional TCNs suffer from insufficient receptive-field density, which loses information about the continuity of lip movements and hurts model performance. Method: TD3Net combines dense skip connections with multi-dilated temporal convolutions to increase receptive-field density while preserving temporal continuity. Result: On the LRW and LRW-1000 datasets, TD3Net matches or exceeds the accuracy of existing methods with fewer parameters and lower computational cost. Conclusion: TD3Net effectively addresses the receptive-field blind spots of TCNs in lipreading and performs well in accuracy, parameter count, and computational complexity, making it suitable for lipreading systems. Abstract: The word-level lipreading approach typically employs a two-stage framework with separate frontend and backend architectures to model dynamic lip movements. Each component has been extensively studied, and in the backend architecture, temporal convolutional networks (TCNs) have been widely adopted in state-of-the-art methods. Recently, dense skip connections have been introduced in TCNs to mitigate the limited density of the receptive field, thereby improving the modeling of complex temporal representations. However, their performance remains constrained owing to potential information loss regarding the continuous nature of lip movements, caused by blind spots in the receptive field. To address this limitation, we propose TD3Net, a temporal densely connected multi-dilated convolutional network that combines dense skip connections and multi-dilated temporal convolutions as the backend architecture. TD3Net covers a wide and dense receptive field without blind spots by applying different dilation factors to skip-connected features. Experimental results on a word-level lipreading task using two large publicly available datasets, Lip Reading in the Wild (LRW) and LRW-1000, indicate that the proposed method achieves performance comparable to state-of-the-art methods. It achieved higher accuracy with fewer parameters and lower floating-point operations compared to existing TCN-based backend architectures. Moreover, visualization results suggest that our approach effectively utilizes diverse temporal features while preserving temporal continuity, presenting notable advantages in lipreading systems.
The code is available at our GitHub repository: https://github.com/Leebh-kor/TD3Net-A-Temporal-Densely-Connected-Multi-dilated-Convolutional-Network-for-Lipreading
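
To make the notion of receptive-field "blind spots" concrete, here is a small sketch that enumerates which temporal offsets a plain stack of dilated 1D convolutions can reach. It is an illustration of the general phenomenon only: it does not model TD3Net's dense skip connections, and the function names and the kernel size of 3 are assumptions.

```python
def receptive_offsets(dilations, kernel_size=3):
    """Return the set of input time offsets seen by one output step."""
    half = kernel_size // 2
    offsets = {0}
    for d in dilations:
        # Each layer looks at taps spaced d frames apart around every offset so far.
        taps = [k * d for k in range(-half, half + 1)]
        offsets = {o + t for o in offsets for t in taps}
    return offsets


def blind_spots(dilations, kernel_size=3):
    """Offsets inside the receptive-field span that no path through the stack reaches."""
    offsets = receptive_offsets(dilations, kernel_size)
    lo, hi = min(offsets), max(offsets)
    return sorted(set(range(lo, hi + 1)) - offsets)
```

With dilations `[1, 2, 4]` every frame in the span from -7 to 7 is reachable, whereas `[2, 4]` alone skips all odd offsets. Mixing several dilation factors is one way to close such gaps.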

[104] PR-DETR: Injecting Position and Relation Prior for Dense Video Captioning

Yizhe Li,Sanping Zhou,Zheng Qin,Le Wang

Main category: cs.CV

TL;DR: This paper proposes PR-DETR, a dense video captioning method that enhances detection transformers with explicit position and relation priors, improving both localization and caption quality.

Details Motivation: Existing transformer-based methods for dense video captioning implicitly learn event locations and semantics, requiring large amounts of data and limiting performance. This work aims to improve performance by explicitly integrating prior knowledge about event positions and relations. Method: PR-DETR introduces position-anchored queries to provide explicit position prior and an event relation encoder to model boundary relationships as relation prior, enhancing both event localization and semantic coherence of captions. Result: Extensive experiments and ablation studies demonstrate the effectiveness of the position and relation priors. The method achieves competitive performance on ActivityNet Captions and YouCook2 datasets. Conclusion: The proposed PR-DETR framework improves dense video captioning by explicitly incorporating position and relation priors, leading to better localization accuracy and caption quality. Abstract: Dense video captioning is a challenging task that aims to localize and caption multiple events in an untrimmed video. Recent studies mainly follow the transformer-based architecture to jointly perform the two sub-tasks, i.e., event localization and caption generation, in an end-to-end manner. Based on the general philosophy of detection transformer, these methods implicitly learn the event locations and event semantics, which requires a large amount of training data and limits the model's performance in practice. In this paper, we propose a novel dense video captioning framework, named PR-DETR, which injects the explicit position and relation prior into the detection transformer to improve the localization accuracy and caption quality, simultaneously. On the one hand, we first generate a set of position-anchored queries to provide the scene-specific position and semantic information about potential events as position prior, which serves as the initial event search regions to eliminate the implausible event proposals. 
On the other hand, we further design an event relation encoder to explicitly calculate the relationship between event boundaries as relation prior to guide the event interaction to improve the semantic coherence of the captions. Extensive ablation studies are conducted to verify the effectiveness of the position and relation prior. Experimental results also show the competitive performance of our method on ActivityNet Captions and YouCook2 datasets.
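
The abstract says the event relation encoder computes relationships between event boundaries but does not give the formula. As a hedged illustration, the sketch below builds a pairwise temporal-IoU matrix over `(start, end)` segments, one simple boundary relation that such an encoder could consume. The choice of temporal IoU and the function names are assumptions, not the PR-DETR design.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments on the time axis."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def boundary_relations(events):
    """Pairwise relation matrix; entry [i][j] relates event i to event j."""
    n = len(events)
    return [[temporal_iou(events[i], events[j]) for j in range(n)] for i in range(n)]
```

For three events at `(0, 10)`, `(5, 15)`, and `(20, 30)` seconds, the first two overlap partially while the third is unrelated to both, so the matrix separates interacting events from isolated ones.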

[105] AutoV: Learning to Retrieve Visual Prompt for Large Vision-Language Models

Yuan Zhang,Chun-Kai Fan,Tao Huang,Ming Lu,Sicheng Yu,Junwen Pan,Kuan Cheng,Qi She,Shanghang Zhang

Main category: cs.CV

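TL;DR: This paper proposes AutoV, which automatically selects the best visual prompt and improves the performance of vision-language models on image understanding tasks.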

Details Motivation: Manually designing effective visual prompts is time-consuming and rarely optimal, so a method that automatically selects the best visual prompt is needed. Method: AutoV enhances the reasoning ability of LVLMs by automatically selecting the best visual prompt; it is trained using rankings of candidate visual prompts produced by a pre-trained LVLM as the supervision signal. Result: Experiments show that AutoV improves the performance of different LVLMs on several popular image understanding tasks; for example, LLaVA-OV gains 1.7% accuracy on LLaVA$^{\text{Wild}}$ and Qwen2.5-VL gains 1.9% on MMMU. Conclusion: AutoV is a visual prompting method that improves the performance of large vision-language models across multiple image understanding tasks. Abstract: Inspired by text prompts in large language models (LLMs), visual prompts have been explored to enhance the reasoning capabilities of large vision-language models (LVLMs). Current methods design heuristic visual prompts, such as overlaying a text-query-guided attention heatmap on the original input image. However, designing effective prompts manually is challenging and time-consuming, and it often fails to explore the benefits of different visual prompts, leading to sub-optimal performance. To this end, we propose AutoV, which learns to automatically select the optimal visual prompt from various candidates based on given textual queries and the input image. To train AutoV, we developed an automatic data collection and labeling pipeline that evaluates various visual prompts with a pre-trained LVLM. We input a set of visual prompts into the LVLM and rank them according to the prediction losses generated by the model. Using the ranking as a supervision signal, we train AutoV to automatically choose the optimal visual prompt from various visual prompts for LVLMs. Experimental results indicate that AutoV enhances the performance of various LVLMs across multiple popular image understanding tasks. For instance, LLaVA-OV with AutoV achieves 1.7% accuracy gain on LLaVA$^{\text{Wild}}$, and AutoV boosts Qwen2.5-VL by 1.9% on MMMU, highlighting its potential as an optimal visual prompting method for LVLMs.
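
The labeling step described in the abstract (rank candidate visual prompts by the LVLM's prediction loss, then use the ranking as supervision) can be sketched as follows. The prompt names and loss values in the usage example are invented for illustration, and the helper names are hypothetical; how AutoV actually converts the ranking into a training objective is not stated in the abstract.

```python
def rank_prompts_by_loss(losses):
    """Order prompt names from lowest to highest prediction loss (lower is better)."""
    return sorted(losses, key=lambda name: losses[name])


def supervision_label(candidates, losses):
    """Index of the best-ranked prompt within the candidate list, usable as a class label."""
    best = rank_prompts_by_loss(losses)[0]
    return candidates.index(best)
```

For example, with candidates `["original", "attention_heatmap", "crop"]` and losses `{"original": 2.1, "attention_heatmap": 1.4, "crop": 1.9}`, the heatmap prompt ranks first and the label is its index in the candidate list.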

[106] FastInit: Fast Noise Initialization for Temporally Consistent Video Generation

Chengyu Bai,Yuming Li,Zhongyu Zhao,Jintao Chen,Peidong Jia,Qi She,Ming Lu,Shanghang Zhang

Main category: cs.CV

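TL;DR: FastInit is an efficient noise initialization method for video generation that uses VNPNet to produce refined noise in a single forward pass, improving generation quality and temporal consistency while substantially reducing computational cost.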

Details Motivation: Although diffusion models have driven significant progress in video generation, achieving high temporal consistency remains challenging. FreeInit iteratively refines the initial noise during inference, but this significantly increases computational cost. The paper is therefore motivated by the need for an efficient method that preserves temporal consistency. Method: FastInit uses a Video Noise Prediction Network (VNPNet) that takes random noise and a text prompt as input and generates refined noise in a single forward pass. To train VNPNet, the authors build a large-scale dataset of text prompts, random noise, and refined noise. Result: Experiments show that FastInit consistently improves the quality and temporal consistency of generated videos across various text-to-video models. Because it needs no iterative refinement, it also substantially improves generation efficiency. Conclusion: FastInit is an efficient noise initialization method for video generation that learns to produce refined noise in a single forward pass, removing the dependence on iterative refinement. It improves video quality and temporal consistency and can be applied directly at inference time, offering a practical solution. Abstract: Video generation has made significant strides with the development of diffusion models; however, achieving high temporal consistency remains a challenging task. Recently, FreeInit identified a training-inference gap and introduced a method to iteratively refine the initial noise during inference. However, iterative refinement significantly increases the computational cost associated with video generation. In this paper, we introduce FastInit, a fast noise initialization method that eliminates the need for iterative refinement. FastInit learns a Video Noise Prediction Network (VNPNet) that takes random noise and a text prompt as input, generating refined noise in a single forward pass. Therefore, FastInit greatly enhances the efficiency of video generation while achieving high temporal consistency across frames. To train the VNPNet, we create a large-scale dataset consisting of pairs of text prompts, random noise, and refined noise. Extensive experiments with various text-to-video models show that our method consistently improves the quality and temporal consistency of the generated videos. FastInit not only provides a substantial improvement in video generation but also offers a practical solution that can be applied directly during inference. The code and dataset will be released.

[107] Neurosymbolic Object-Centric Learning with Distant Supervision

Stefano Colamonaco,David Debot,Giuseppe Marra

Main category: cs.CV

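TL;DR: This paper proposes DeepObjectLog, a neurosymbolic model that learns object-centric representations directly from raw unstructured perceptual data and performs symbolic reasoning through probabilistic logic programming.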

Details Motivation: Existing systems rely on object-level supervision or on a predefined decomposition of the input into objects. This work aims to remove that requirement by learning object representations from raw data and reasoning over them. Method: The authors propose a neurosymbolic framework that combines a perceptual module with a symbolic reasoning layer based on probabilistic logic programming, which guides the model to discover meaningful objects in the input. Result: Across several generalization settings (unseen object compositions, unseen tasks, and unseen numbers of objects), the method outperforms purely neural and neurosymbolic baselines. Conclusion: DeepObjectLog learns object representations from unstructured data and generalizes strongly using only distant supervision. Abstract: Relational learning enables models to generalize across structured domains by reasoning over objects and their interactions. While recent advances in neurosymbolic reasoning and object-centric learning bring us closer to this goal, existing systems rely either on object-level supervision or on a predefined decomposition of the input into objects. In this work, we propose a neurosymbolic formulation for learning object-centric representations directly from raw unstructured perceptual data and using only distant supervision. We instantiate this approach in DeepObjectLog, a neurosymbolic model that integrates a perceptual module, which extracts relevant object representations, with a symbolic reasoning layer based on probabilistic logic programming. By enabling sound probabilistic logical inference, the symbolic component introduces a novel learning signal that further guides the discovery of meaningful objects in the input. We evaluate our model across a diverse range of generalization settings, including unseen object compositions, unseen tasks, and unseen number of objects. Experimental results show that our method outperforms neural and neurosymbolic baselines across the tested settings.

[108] GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning

Yi Chen,Yuying Ge,Rui Wang,Yixiao Ge,Junhao Cheng,Ying Shan,Xihui Liu

Main category: cs.CV

TL;DR: This paper proposes GRPO-CARE, a reinforcement learning framework enhancing reasoning coherence and answer accuracy in MLLMs, validated through a new benchmark called SEED-Bench-R1.

Details Motivation: Current reinforcement learning methods like GRPO lack logical coherence in reasoning steps when applied to multimodal LLMs (MLLMs), necessitating a method that improves both accuracy and interpretability. Method: GRPO-CARE introduces a two-tiered reward mechanism that combines a base reward for answer correctness with an adaptive consistency bonus based on comparisons with a reference model and group peers. Result: GRPO-CARE achieves a 6.7% performance gain on the hardest evaluation level of SEED-Bench-R1 and a 24.5% improvement in reasoning consistency, demonstrating strong transferability across video understanding benchmarks. Conclusion: GRPO-CARE provides an effective framework for improving both answer correctness and reasoning coherence in MLLMs, outperforming standard GRPO without explicit supervision. Abstract: Recent reinforcement learning approaches, such as outcome-supervised GRPO, have advanced Chain-of-Thought reasoning in large language models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) is unexplored. To address the lack of rigorous evaluation for MLLM post-training methods, we introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. It offers a large training set and evaluates generalization across three escalating challenges: in-distribution, cross-environment, and cross-environment-task scenarios. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. This stems from reward signals focusing solely on final answers, encouraging shortcuts, and strict KL penalties limiting exploration. To address this, we propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision.
GRPO-CARE introduces a two-tiered reward: (1) a base reward for answer correctness, and (2) an adaptive consistency bonus, computed by comparing the model's reasoning-to-answer likelihood (via a slowly-evolving reference model) against group peers. This dual mechanism amplifies rewards for reasoning paths that are both correct and logically consistent. Replacing KL penalties with this adaptive bonus, GRPO-CARE outperforms standard GRPO on SEED-Bench-R1, achieving a 6.7% performance gain on the hardest evaluation level and a 24.5% improvement in consistency. It also shows strong transferability, improving model performance across diverse video understanding benchmarks. Our work contributes a systematically designed benchmark and a generalizable post-training framework, advancing the development of more interpretable and robust MLLMs.
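
The two-tiered reward can be illustrated with a small sketch. The abstract only says that the bonus compares a sample's reasoning-to-answer likelihood against its group peers; the specific rule used below (a fixed bonus for correct samples whose reference-model likelihood is at least the mean over correct peers) and the bonus size are assumptions made for illustration, as are all function names. The second helper applies group-relative normalization in the style of standard GRPO, using the population standard deviation as one possible choice.

```python
from statistics import mean, pstdev


def care_rewards(correct, ref_likelihoods, bonus=0.5):
    """Base reward for correctness plus an (assumed) consistency bonus within one group."""
    peers = [lik for ok, lik in zip(correct, ref_likelihoods) if ok]
    threshold = mean(peers) if peers else 0.0
    rewards = []
    for ok, lik in zip(correct, ref_likelihoods):
        reward = 1.0 if ok else 0.0
        # Only correct answers whose reasoning supports them at least as well as
        # the average correct peer receive the bonus.
        if ok and lik >= threshold:
            reward += bonus
        rewards.append(reward)
    return rewards


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: center and scale rewards within the sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, two samples with the same correct answer can end up with different advantages when one of them reaches the answer through reasoning the reference model finds less predictive of it.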

[109] MBA: Multimodal Bidirectional Attack for Referring Expression Segmentation Models

Xingbai Chen,Tingchao Fu,Renyang Liu,Wei Zhou,Chao Yi

Main category: cs.CV

TL;DR: This paper proposes a Multimodal Bidirectional Attack to improve adversarial robustness evaluation for Referring Expression Segmentation (RES) models, showing better performance in handling diverse textual inputs.

Details Motivation: The motivation is to explore the adversarial robustness of RES models, which has been largely unexplored. Existing adversarial attack methods perform poorly on RES due to its multimodal structure, necessitating a new approach for practical open-world scenarios. Method: The paper introduces a novel adversarial attack strategy called Multimodal Bidirectional Attack. It employs learnable proxy textual embedding perturbation and performs joint optimization on both image and text modalities to enhance cross-text transferability. Result: Extensive experiments showed that the proposed method outperforms existing methods on multiple RES models and benchmark datasets, demonstrating superior effectiveness in generating adversarial examples with high cross-text transferability. Conclusion: The paper concludes that the proposed Multimodal Bidirectional Attack effectively exposes vulnerabilities in RES models and yields adversarial examples with improved cross-text transferability. Abstract: Referring Expression Segmentation (RES) enables precise object segmentation in images based on natural language descriptions, offering high flexibility and broad applicability in real-world vision tasks. Despite its impressive performance, the robustness of RES models against adversarial examples remains largely unexplored. While prior adversarial attack methods have explored adversarial robustness on conventional segmentation models, they perform poorly when directly applied to RES, failing to expose vulnerabilities in its multimodal structure. Moreover, in practical open-world scenarios, users typically issue multiple, diverse referring expressions to interact with the same image, highlighting the need for adversarial examples that generalize across varied textual inputs. To address these multimodal challenges, we propose a novel adversarial attack strategy termed Multimodal Bidirectional Attack, tailored for RES models.
Our method introduces learnable proxy textual embedding perturbation and jointly performs visual-aligned optimization on the image modality and textual-adversarial optimization on the textual modality during attack generation. This dual optimization framework encourages adversarial images to actively adapt to more challenging text embedding during optimization, thereby enhancing their cross-text transferability, which refers to the ability of adversarial examples to remain effective under a variety of unseen or semantically diverse textual inputs. Extensive experiments conducted on multiple RES models and benchmark datasets demonstrate the superior effectiveness of our method compared to existing methods.

[110] Co-Speech Gesture and Facial Expression Generation for Non-Photorealistic 3D Characters

Taisei Omine,Naoyuki Kawabata,Fuminori Homma

Main category: cs.CV

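TL;DR: This study proposes new methods for emotional expression in non-photorealistic characters (such as anime characters), combining expression data from comics with dialogue-specific gestures to improve results.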

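Details Motivation: Existing research mostly focuses on photorealistic avatars and transfers poorly to non-photorealistic characters such as those in anime, so emotional expression suited to such characters needs to be explored. Method: The method expresses emotions for non-photorealistic characters by combining expression data extracted from comics with dialogue-specific semantic gestures, and is validated through a user study. Result: The user study shows significant improvements over existing research in multiple aspects, supporting the effectiveness of the proposed method. Conclusion: Expression data extracted from comics, combined with dialogue-specific semantic gestures, can effectively improve emotional expression for non-photorealistic characters. Abstract: With the advancement of conversational AI, research on bodily expressions, including gestures and facial expressions, has also progressed. However, many existing studies focus on photorealistic avatars, making them unsuitable for non-photorealistic characters, such as those found in anime. This study proposes methods for expressing emotions, including exaggerated expressions unique to non-photorealistic characters, by utilizing expression data extracted from comics and dialogue-specific semantic gestures. A user study demonstrated significant improvements across multiple aspects when compared to existing research.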

[111] Align the GAP: Prior-based Unified Multi-Task Remote Physiological Measurement Framework For Domain Generalization and Personalization

Jiyao Wang,Xiao Yang,Hao Lu,Dengbo He,Kaishun Wu

Main category: cs.CV

TL;DR: This paper introduces a unified framework (GAP) for multi-source domain generalization and test-time personalization in remote physiological measurement, improving robustness and adaptability across diverse conditions and individuals.

Details Motivation: MSSDG aims to improve the generalizability of multi-task remote physiological measurements, but challenges like partial labeling and environmental noise hinder accuracy. Real-time personalization (TTPA) is also needed, yet current methods struggle to bridge the gap between generalization and personalization. Method: A unified framework disentangles facial video information into invariant semantics, individual bias, and noise. Prior-based modules are applied across stages for simultaneous generalization and personalization. Result: The framework successfully handles both MSSDG and TTPA with minimal adjustments. The benchmark was expanded to include TTPA on six datasets, and a new real-world driving dataset with complete labels was introduced. Conclusion: The proposed GAP framework effectively addresses both MSSDG and TTPA in multi-task remote physiological measurement, enhancing generalizability and enabling real-time personalization. Abstract: Multi-source synsemantic domain generalization (MSSDG) for multi-task remote physiological measurement seeks to enhance the generalizability of these metrics and attracts increasing attention. However, challenges like partial labeling and environmental noise may disrupt task-specific accuracy. Meanwhile, given that real-time adaptation is necessary for personalized products, the test-time personalized adaptation (TTPA) after MSSDG is also worth exploring, while the gap between previous generalization and personalization methods is significant and hard to fuse. Thus, we proposed a unified framework for MSSDG and TTPA employing Priors (GAP) in biometrics and remote photoplethysmography (rPPG). We first disentangled information from face videos into invariant semantics, individual bias, and noise. Then, multiple modules incorporating priors and our observations were applied in different stages and for different facial information.
Then, based on the different principles of achieving generalization and personalization, our framework could simultaneously address MSSDG and TTPA under multi-task remote physiological estimation with minimal adjustments. We expanded the MSSDG benchmark to the TTPA protocol on six publicly available datasets and introduced a new real-world driving dataset with complete labeling. Extensive experiments validated our approach, and the code along with the new dataset will be released.

[112] Integrating Generative Adversarial Networks and Convolutional Neural Networks for Enhanced Traffic Accidents Detection and Analysis

Zhenghao Xi,Xiang Liu,Yaqi Liu,Yitong Cai,Yangyu Zheng

Main category: cs.CV

TL;DR: This research proposes a deep learning-based framework for accident detection using CCTV footage, combining GANs for data synthesis and CNN for model training. It achieves high accuracy rates (up to 95%) and addresses challenges such as supervised monitoring and data deficiency, making it suitable for traffic safety applications and future intelligent surveillance systems.

Details Motivation: The motivation stems from the rising statistics of car accidents worldwide, calling for innovation and the establishment of a smart, efficient, and automated way of identifying accidents to save lives. Supervised monitoring and data deficiency issues in current accident detection systems necessitate this research. Method: The research applies deep learning techniques by combining Generative Adversarial Networks (GANs) for synthesizing data and Convolutional Neural Networks (CNN) for model training. Video frames for accidents and non-accidents are collected from YouTube videos, followed by preprocessing steps like resizing, image enhancement, and normalization. Three models are employed: CNN, Fine-tuned Convolutional Neural Network (FTCNN), and Vision Transformer (ViT). Result: The FTCNN and ViT models performed best, with accuracy rates of 94% and 95%, while the CNN model obtained 88%. These results indicate the effectiveness of the proposed framework in detecting accidents from CCTV footage. Conclusion: The proposed framework is suitable for traffic safety applications due to its high real-time accident detection capabilities and broad-scale applicability, laying the foundation for future intelligent surveillance systems in real-time traffic monitoring, smart city frameworks, and integration into emergency management systems. Abstract: Accident detection using Closed Circuit Television (CCTV) footage is one of the most imperative features for enhancing transport safety and efficient traffic control. To this end, this research addresses the issues of supervised monitoring and data deficiency in accident detection systems by applying deep learning techniques. The motivation arises from rising statistics in the number of car accidents worldwide; this calls for innovation and the establishment of a smart, efficient and automated way of identifying accidents and calling for help to save lives.
To address data scarcity, the presented framework combines Generative Adversarial Networks (GANs) for synthesizing data and Convolutional Neural Networks (CNN) for model training. Video frames for accidents and non-accidents are collected from YouTube videos, and we apply resizing, image enhancement, and pixel-range normalisation. Three models are used: CNN, Fine-tuned Convolutional Neural Network (FTCNN), and Vision Transformer (ViT). FTCNN and ViT worked best for detecting accidents from CCTV, obtaining accuracy rates of 94% and 95%, while the CNN model obtained 88%. Such results show that the proposed framework suits traffic safety applications due to its high real-time accident detection capabilities and broad-scale applicability. This work lays the foundation for future intelligent surveillance systems for real-time traffic monitoring, smart city frameworks, and integration into emergency management systems.

[113] VideoGAN-based Trajectory Proposal for Automated Vehicles

Annajoyce Mariani,Kira Maag,Hanno Gottschalk

Main category: cs.CV

TL;DR: This paper proposes a GAN-based method for generating realistic vehicle trajectories using BEV video data, achieving fast training/inference and realistic results validated against real-world datasets.

Details Motivation: Current methods like model-driven, rule-based, and classical learning-based approaches struggle to capture complex, multimodal distributions of future trajectories. This research explores the use of GANs to improve trajectory prediction accuracy and realism. Method: A pipeline was developed that trains a generative adversarial network (GAN) on low-resolution bird's-eye view (BEV) occupancy grid videos. Trajectory data is extracted from generated videos using single-frame object detection and frame-to-frame object matching. Result: The best results were achieved within 100 GPU hours of training, with inference times under 20 ms. The generated trajectories demonstrated physical realism by aligning spatial and dynamic parameters with ground truth data from the Waymo Open Motion Dataset. Conclusion: The study concludes that using a GAN architecture trained on BEV occupancy grid videos can effectively generate statistically accurate and physically realistic trajectory options for road vehicles, with fast training and inference times. Abstract: Being able to generate realistic trajectory options is at the core of increasing the degree of automation of road vehicles. While model-driven, rule-based, and classical learning-based methods are widely used to tackle these tasks at present, they can struggle to effectively capture the complex, multimodal distributions of future trajectories. In this paper we investigate whether a generative adversarial network (GAN) trained on videos of bird's-eye view (BEV) traffic scenarios can generate statistically accurate trajectories that correctly capture spatial relationships between the agents. To this end, we propose a pipeline that uses low-resolution BEV occupancy grid videos as training data for a video generative model. From the generated videos of traffic scenarios we extract abstract trajectory data using single-frame object detection and frame-to-frame object matching. 
We choose a GAN architecture in particular for its fast training and inference times relative to diffusion models. We obtain our best results within 100 GPU hours of training, with inference times under 20 ms. We demonstrate the physical realism of the proposed trajectories in terms of distribution alignment of spatial and dynamic parameters with respect to the ground truth videos from the Waymo Open Motion Dataset.
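
The abstract mentions frame-to-frame object matching as the step that turns per-frame detections into trajectories, without naming the matching rule. The sketch below uses greedy nearest-centroid matching with a distance cutoff, a common simple baseline chosen here as an assumption; the function name and the `max_dist` parameter are hypothetical.

```python
import math


def match_objects(prev, curr, max_dist=5.0):
    """Greedily pair (x, y) detections from consecutive frames by centroid distance."""
    # All candidate pairs, closest first.
    pairs = sorted(
        (math.dist(p, c), i, j)
        for i, p in enumerate(prev)
        for j, c in enumerate(curr)
    )
    used_prev, used_curr, matches = set(), set(), []
    for dist, i, j in pairs:
        if dist > max_dist:
            break  # remaining pairs are even farther apart
        if i in used_prev or j in used_curr:
            continue  # each detection joins at most one match
        used_prev.add(i)
        used_curr.add(j)
        matches.append((i, j))
    return sorted(matches)
```

Chaining the returned index pairs across successive frames yields one trajectory per tracked object; detections left unmatched would start or end a trajectory.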

[114] FOCoOp: Enhancing Out-of-Distribution Robustness in Federated Prompt Learning for Vision-Language Models

Xinting Liao,Weiming Liu,Jiaming Qian,Pengyang Zhou,Jiahe Xu,Wenjie Wang,Chaochao Chen,Xiaolin Zheng,Tat-Seng Chua

Main category: cs.CV

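TL;DR: This paper proposes the FOCoOp framework, which combines ID global prompts, local prompts, and OOD prompts to address the trade-off between performance and robustness in federated prompt learning, particularly under OOD shifts.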

Details Motivation: Existing FPL methods face a trade-off between performance and robustness, especially under OOD shifts in real-world scenarios, and ID data heterogeneity across clients makes this trade-off harder to manage. Method: The paper proposes a federated OOD-aware context optimization framework that uses ID global prompts, local prompts, and OOD prompts to create class-level and distribution-level separations, and improves discrimination consistency among clients through semi-unbalanced optimal transport. Result: Extensive experiments on real-world datasets demonstrate the effectiveness of FOCoOp. Conclusion: FOCoOp effectively captures decentralized heterogeneous distributions and enhances robustness to different OOD shifts. Abstract: Federated prompt learning (FPL) for vision-language models is a powerful approach to collaboratively adapt models across distributed clients while preserving data privacy. However, existing FPL approaches suffer from a trade-off between performance and robustness, particularly in out-of-distribution (OOD) shifts, limiting their reliability in real-world scenarios. The inherent in-distribution (ID) data heterogeneity among different clients makes it more challenging to maintain this trade-off. To fill this gap, we introduce a Federated OOD-aware Context Optimization (FOCoOp) framework, which captures diverse distributions among clients using ID global prompts, local prompts, and OOD prompts. Specifically, FOCoOp leverages three sets of prompts to create both class-level and distribution-level separations, which adapt to OOD shifts through bi-level distributionally robust optimization. Additionally, FOCoOp improves the discrimination consistency among clients, i.e., calibrating global prompts, seemingly OOD prompts, and OOD prompts by semi-unbalanced optimal transport. The extensive experiments on real-world datasets demonstrate that FOCoOp effectively captures decentralized heterogeneous distributions and enhances robustness to different OOD shifts. The project is available at GitHub.

[115] R3eVision: A Survey on Robust Rendering, Restoration, and Enhancement for 3D Low-Level Vision

Weeyoung Kwon,Jeahun Sung,Minkyu Jeon,Chanho Eom,Jihyong Oh

Main category: cs.CV

TL;DR: This survey explores 3D Low-Level Vision (3D LLV), addressing challenges in robust 3D scene reconstruction from degraded inputs by extending classical 2D vision tasks into the 3D domain.

Details Motivation: Most neural rendering models assume clean, high-resolution inputs but struggle with real-world degradations like noise, blur, and low resolution. This necessitates the development of 3D Low-Level Vision techniques. Method: The paper provides a survey of existing methods integrating Low-Level Vision (LLV) into neural rendering frameworks to address degradation issues in 3D scenes. Result: The survey categorizes recent approaches that enable high-fidelity 3D reconstruction under adverse conditions and discusses their applications in domains such as autonomous driving, AR/VR, and robotics. Conclusion: 3D LLV is positioned as a fundamental direction for robust 3D content generation and scene-level reconstruction in real-world environments. Abstract: Neural rendering methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have achieved significant progress in photorealistic 3D scene reconstruction and novel view synthesis. However, most existing models assume clean and high-resolution (HR) multi-view inputs, which limits their robustness under real-world degradations such as noise, blur, low-resolution (LR), and weather-induced artifacts. To address these limitations, the emerging field of 3D Low-Level Vision (3D LLV) extends classical 2D Low-Level Vision tasks including super-resolution (SR), deblurring, weather degradation removal, restoration, and enhancement into the 3D spatial domain. This survey, referred to as R3eVision, provides a comprehensive overview of robust rendering, restoration, and enhancement for 3D LLV by formalizing the degradation-aware rendering problem and identifying key challenges related to spatio-temporal consistency and ill-posed optimization. Recent methods that integrate LLV into neural rendering frameworks are categorized to illustrate how they enable high-fidelity 3D reconstruction under adverse conditions.
Application domains such as autonomous driving, AR/VR, and robotics are also discussed, where reliable 3D perception from degraded inputs is critical. By reviewing representative methods, datasets, and evaluation protocols, this work positions 3D LLV as a fundamental direction for robust 3D content generation and scene-level reconstruction in real-world environments.

[116] Dense 3D Displacement Estimation for Landslide Monitoring via Fusion of TLS Point Clouds and Embedded RGB Images

Zhaoyi Wang,Jemil Avers Butt,Shengyu Huang,Tomislav Medic,Andreas Wieser

Main category: cs.CV

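TL;DR: This paper proposes a hierarchical coarse-to-fine approach that fuses point clouds and RGB images to estimate dense 3D displacement vector fields, improving the spatial coverage and accuracy of landslide monitoring.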

Details Motivation: Existing point cloud-based methods typically rely on either geometric or radiometric information, which leads to sparse or non-3D displacement estimates. This paper aims to address that by combining 3D point clouds with co-registered RGB images. Method: Patch-level matches are constructed using both 3D geometry and 2D image features, refined through geometric consistency checks, and followed by rigid transformation estimation for each match. Result: On two real-world landslide datasets, the method achieves high spatial coverage (79% and 97%), with deviations of 0.15 m and 0.25 m relative to external measurements, and it outperforms the F2S3 method in spatial coverage with comparable accuracy. Conclusion: The proposed method offers a practical and scalable solution for TLS-based landslide monitoring and can be extended to other types of point clouds and monitoring tasks. Abstract: Landslide monitoring is essential for understanding geohazards and mitigating associated risks. However, existing point cloud-based methods typically rely on either geometric or radiometric information and often yield sparse or non-3D displacement estimates. In this paper, we propose a hierarchical partition-based coarse-to-fine approach that fuses 3D point clouds and co-registered RGB images to estimate dense 3D displacement vector fields. We construct patch-level matches using both 3D geometry and 2D image features. These matches are refined via geometric consistency checks, followed by rigid transformation estimation per match. Experimental results on two real-world landslide datasets demonstrate that our method produces 3D displacement estimates with high spatial coverage (79% and 97%) and high accuracy. Deviations in displacement magnitude with respect to external measurements (total station or GNSS observations) are 0.15 m and 0.25 m on the two datasets, respectively, and only 0.07 m and 0.20 m compared to manually derived references. These values are below the average scan resolutions (0.08 m and 0.30 m). Our method outperforms the state-of-the-art method F2S3 in spatial coverage while maintaining comparable accuracy. Our approach offers a practical and adaptable solution for TLS-based landslide monitoring and is extensible to other types of point clouds and monitoring tasks. Our example data and source code are publicly available at https://github.com/zhaoyiww/fusion4landslide.

[117] Fine-grained Image Retrieval via Dual-Vision Adaptation

Xin Jiang,Meiqi Cao,Hao Tang,Fei Shen,Zechao Li

Main category: cs.CV

TL;DR: This paper proposes Dual-Vision Adaptation (DVA) for fine-grained image retrieval, which improves generalization by adapting input samples and features without modifying pre-trained model parameters, resulting in efficient and effective retrieval performance.

Details Motivation: Current FGIR methods either enforce similarity constraints or use localization sub-networks, which often overfit training data and lose pre-training knowledge, leading to reduced generalization ability. Method: DVA introduces two key components: Object-Perceptual Adaptation for modifying input samples to highlight critical objects, and In-Context Adaptation for feature-level adjustments using additional parameters. Discrimination Perception Transfer is also used to transfer discriminative knowledge via knowledge distillation. Result: DVA achieves strong performance on both in-distribution and out-of-distribution fine-grained datasets while maintaining retrieval efficiency and having fewer learnable parameters compared to existing methods. Conclusion: The proposed DVA approach effectively addresses overfitting and generalization issues in FGIR by combining sample and feature adaptation without modifying pre-trained parameters, achieving strong performance with fewer learnable parameters. Abstract: Fine-Grained Image Retrieval (FGIR) faces challenges in learning discriminative visual representations to retrieve images with similar fine-grained features. Current leading FGIR solutions typically follow two regimes: enforce pairwise similarity constraints in the semantic embedding space, or incorporate a localization sub-network to fine-tune the entire model. However, these two regimes tend to overfit the training data while forgetting the knowledge gained from large-scale pre-training, thus reducing their generalization ability. In this paper, we propose a Dual-Vision Adaptation (DVA) approach for FGIR, which guides the frozen pre-trained model to perform FGIR through collaborative sample and feature adaptation. Specifically, we design Object-Perceptual Adaptation, which modifies input samples to help the pre-trained model perceive critical objects and elements within objects that are helpful for category prediction. 
Meanwhile, we propose In-Context Adaptation, which introduces a small set of parameters for feature adaptation without modifying the pre-trained parameters. This makes the FGIR task using these adjusted features closer to the task solved during the pre-training. Additionally, to balance retrieval efficiency and performance, we propose Discrimination Perception Transfer to transfer the discriminative knowledge in the object-perceptual adaptation to the image encoder using the knowledge distillation mechanism. Extensive experiments show that DVA has fewer learnable parameters and performs well on three in-distribution and three out-of-distribution fine-grained datasets.

[118] SycnMapV2: Robust and Adaptive Unsupervised Segmentation

Heng Zhang,Zikang Wan,Danilo Vasconcellos Vargas

Main category: cs.CV

TL;DR: SyncMapV2: A new unsupervised AI method for robust visual segmentation that mimics human vision adaptability and significantly outperforms current techniques under various corrupting conditions.

Details Motivation: Existing AI algorithms struggle to maintain accuracy under various types of visual corruption, unlike human vision which remains remarkably robust. This paper aims to bridge this gap by developing a more resilient and adaptable AI model. Method: SyncMapV2 uses self-organizing dynamical equations combined with concepts from random networks to achieve unsupervised segmentation without any robust training, supervision, or loss functions. Result: SyncMapV2 shows minimal mIoU drop (0.01%) under digital corruption compared to a 23.8% drop in SOTA methods. It also demonstrates superior performance across noise (7.3% vs. 37.7%), weather (7.5% vs. 33.8%), and blur (7.0% vs. 29.5%) corruption types while adapting online with near-zero performance degradation. Conclusion: SyncMapV2 is the first algorithm that can perform unsupervised segmentation with state-of-the-art robustness and online adaptability, setting a foundation for future development of robust and adaptive AI. Abstract: Human vision excels at segmenting visual cues without the need for explicit training, and it remains remarkably robust even as noise severity increases. In contrast, existing AI algorithms struggle to maintain accuracy under similar conditions. Here, we present SyncMapV2, the first to solve unsupervised segmentation with state-of-the-art robustness. SyncMapV2 exhibits a minimal drop in mIoU, only 0.01%, under digital corruption, compared to a 23.8% drop observed in SOTA methods. This superior performance extends across various types of corruption: noise (7.3% vs. 37.7%), weather (7.5% vs. 33.8%), and blur (7.0% vs. 29.5%). Notably, SyncMapV2 accomplishes this without any robust training, supervision, or loss functions. It is based on a learning paradigm that uses self-organizing dynamical equations combined with concepts from random networks. 
Moreover, unlike conventional methods that require re-initialization for each new input, SyncMapV2 adapts online, mimicking the continuous adaptability of human vision. Thus, we go beyond the accurate and robust results, and present the first algorithm that can do all the above online, adapting to input rather than re-initializing. In adaptability tests, SyncMapV2 demonstrates near-zero performance degradation, which motivates and fosters a new generation of robust and adaptive intelligence in the near future.
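The robustness figures above are stated as drops in mIoU. As a reminder of how such a drop can be computed, here is a minimal sketch, assuming flat per-pixel label lists whose predicted labels have already been matched to ground-truth classes, and assuming an absolute (not relative) drop. The function name `mean_iou` and the toy data are hypothetical and are not taken from the paper's evaluation code.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: the same 4-pixel image segmented with and without corruption.
target = [0, 0, 1, 1]
clean = mean_iou([0, 0, 1, 1], target, num_classes=2)
corrupted = mean_iou([0, 1, 1, 1], target, num_classes=2)
drop = clean - corrupted  # absolute mIoU drop caused by the corruption
```

A smaller `drop` under a given corruption type indicates a more robust segmenter.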

[119] Learning Multi-scale Spatial-frequency Features for Image Denoising

Xu Zhao,Chen Zhao,Xiantao Hu,Hongliang Zhang,Ying Tai,Jian Yang

Main category: cs.CV

TL;DR: This paper proposes MADNet, a multi-scale adaptive dual-domain network for image denoising that effectively separates and processes high- and low-frequency noise components, achieving superior performance over existing methods.

Details Motivation: Existing multi-scale architectures for image denoising rely on fixed single-input single-output U-Net structures and treat frequency domains uniformly, neglecting the distinct characteristics of high- and low-frequency noise. Method: The authors introduce MADNet, which utilizes an image pyramid input strategy and an adaptive spatial-frequency learning unit (ASFU) with a learnable mask to separate and process high- and low-frequency components. They also incorporate a global feature fusion block in skip connections to enhance multi-scale features. Result: Extensive experiments on synthetic and real noisy image datasets demonstrate the effectiveness and superiority of MADNet over current state-of-the-art denoising approaches. Conclusion: The proposed MADNet effectively enhances image denoising performance by leveraging multi-scale adaptive dual-domain learning, outperforming existing state-of-the-art methods. Abstract: Recent advancements in multi-scale architectures have demonstrated exceptional performance in image denoising tasks. However, existing architectures mainly depend on a fixed single-input single-output U-Net architecture, ignoring multi-scale representations at the pixel level. In addition, previous methods treat the frequency domain uniformly, ignoring the different characteristics of high-frequency and low-frequency noise. In this paper, we propose a novel multi-scale adaptive dual-domain network (MADNet) for image denoising. We use image pyramid inputs to restore noise-free results from low-resolution images. In order to realize the interaction of high-frequency and low-frequency information, we design an adaptive spatial-frequency learning unit (ASFU), where a learnable mask is used to separate the information into high-frequency and low-frequency components. In the skip connections, we design a global feature fusion block to enhance the features at different scales. 
Extensive experiments on both synthetic and real noisy image datasets verify the effectiveness of MADNet compared with current state-of-the-art denoising approaches.

[120] Segment Anything for Satellite Imagery: A Strong Baseline and a Regional Dataset for Automatic Field Delineation

Carmelo Scribano,Elena Govi,Paolo Bertellini,Simone Parisi,Giorgia Franchini,Marko Bertogna

Main category: cs.CV

TL;DR: This paper proposes an automatic field delineation method based on the SAM model and releases a new regional dataset, ERAS.

Details Motivation: Accurate mapping of agricultural field boundaries is essential for efficient agricultural operations, and automatically extracting them from high-resolution satellite imagery with computer vision techniques can avoid costly ground surveys. Method: A field delineation pipeline based on the Segment Anything Model (SAM) is proposed, with a fine-tuning strategy to adapt SAM to the task. A method for acquiring a complementary regional dataset is also described. Result: Extensive experiments evaluate segmentation accuracy and generalization, and the results indicate that the proposed method is effective. Conclusion: The method provides a strong baseline for automatic field delineation, and the release of the new regional dataset ERAS supports future research. Abstract: Accurate mapping of agricultural field boundaries is essential for the efficient operation of agriculture. Automatic extraction from high-resolution satellite imagery, supported by computer vision techniques, can avoid costly ground surveys. In this paper, we present a pipeline for field delineation based on the Segment Anything Model (SAM), introducing a fine-tuning strategy to adapt SAM to this task. In addition to using published datasets, we describe a method for acquiring a complementary regional dataset that covers areas beyond current sources. Extensive experiments assess segmentation accuracy and evaluate the generalization capabilities. Our approach provides a robust baseline for automated field delineation. The new regional dataset, known as ERAS, is now publicly available.

[121] RealDriveSim: A Realistic Multi-Modal Multi-Task Synthetic Dataset for Autonomous Driving

Arpit Jadon,Haoran Wang,Phillip Thomas,Michael Stanley,S. Nathaniel Cibik,Rachel Laurat,Omar Maher,Lukas Hoyer,Ozan Unal,Dengxin Dai

Main category: cs.CV

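TL;DR: This work presents RealDriveSim, a high-quality synthetic dataset for autonomous driving with multi-modal support and fine-grained annotations that significantly improves model performance.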

Details Motivation: As perception models advance, the demand for large-scale datasets grows, but data annotation is prohibitively expensive. Synthetic datasets offer a low-cost way to improve model performance, yet current synthetic datasets remain limited in scope, realism, and generality. Method: The authors present RealDriveSim, a realistic multi-modal synthetic dataset that supports both 2D computer vision and LiDAR applications, and evaluate it extensively. Result: RealDriveSim supports a wide range of applications and achieves better results than existing synthetic benchmarks on multiple tasks. Conclusion: RealDriveSim is a multi-modal synthetic dataset for autonomous driving with fine-grained annotations for up to 64 classes, and it shows state-of-the-art performance across a range of applications and domains. Abstract: As perception models continue to develop, the need for large-scale datasets increases. However, data annotation remains far too expensive to effectively scale and meet the demand. Synthetic datasets provide a solution to boost model performance with substantially reduced costs. However, current synthetic datasets remain limited in their scope and realism, and are designed for specific tasks and applications. In this work, we present RealDriveSim, a realistic multi-modal synthetic dataset for autonomous driving that not only supports popular 2D computer vision applications but also their LiDAR counterparts, providing fine-grained annotations for up to 64 classes. We extensively evaluate our dataset for a wide range of applications and domains, demonstrating state-of-the-art results compared to existing synthetic benchmarks. The dataset is publicly available at https://realdrivesim.github.io/.

[122] Reliable Few-shot Learning under Dual Noises

Ji Zhang,Jingkuan Song,Lianli Gao,Nicu Sebe,Heng Tao Shen

Main category: cs.CV

TL;DR: This paper introduces DETA++, a noise-robust approach for few-shot learning that improves task adaptation and prediction reliability in noisy environments through specialized modules and strategies.

Details Motivation: Existing task adaptation-based few-shot learning approaches struggle with reliability due to in-distribution and out-of-distribution noise from support and query samples; this work aims to improve robustness in such scenarios. Method: The paper proposes DETA++, which incorporates a Contrastive Relevance Aggregation (CoRA) module, a clean prototype loss, a noise entropy maximization loss, a memory bank with LocalNCC classifier, and an IntraSwap strategy to combat ID and OOD noise in FSL. Result: Extensive experiments show that DETA++ achieves reliable few-shot learning performance under dual noise conditions, demonstrating its effectiveness and flexibility. Conclusion: DETA++ effectively addresses the issue of noise during few-shot learning by utilizing a combination of loss functions, memory refinement, and swapping strategies for robust performance. Abstract: Recent advances in model pre-training give rise to task adaptation-based few-shot learning (FSL), where the goal is to adapt a pre-trained task-agnostic model for capturing task-specific knowledge with a few labeled support samples of the target task. Nevertheless, existing approaches may still fail in the open world due to the inevitable in-distribution (ID) and out-of-distribution (OOD) noise from both support and query samples of the target task. With limited support samples available, i) the adverse effect of the dual noises can be severely amplified during task adaptation, and ii) the adapted model can produce unreliable predictions on query samples in the presence of the dual noises. In this work, we propose DEnoised Task Adaptation (DETA++) for reliable FSL. DETA++ uses a Contrastive Relevance Aggregation (CoRA) module to calculate image and region weights for support samples, based on which a clean prototype loss and a noise entropy maximization loss are proposed to achieve noise-robust task adaptation. 
Additionally, DETA++ employs a memory bank to store and refine clean regions for each inner-task class, based on which a Local Nearest Centroid Classifier (LocalNCC) is devised to yield noise-robust predictions on query samples. Moreover, DETA++ utilizes an Intra-class Region Swapping (IntraSwap) strategy to rectify ID class prototypes during task adaptation, enhancing the model's robustness to the dual noises. Extensive experiments demonstrate the effectiveness and flexibility of DETA++.

[123] Transparency Techniques for Neural Networks trained on Writer Identification and Writer Verification

Viktoria Pundy,Marco Peer,Florian Kleber

Main category: cs.CV

TL;DR: This paper explores the application of two transparency techniques (pixel-level and point-specific saliency maps) to neural networks in Writer Identification and Verification, showing that pixel-level maps are more effective in supporting forensic experts.

Details Motivation: The motivation is to improve the performance and reliability of neural networks by providing transparency, specifically through examining characteristics selected by a neural network for identification processes in forensic analysis. Method: Two transparency techniques (pixel-level saliency maps and point-specific saliency maps) were applied to neural networks trained on Writer Identification and Writer Verification. The evaluation used deletion and insertion score metrics. Result: Pixel-level saliency maps performed better than point-specific maps according to the evaluation results using qualitative and quantitative measures. Conclusion: Pixel-wise saliency maps outperform point-specific saliency maps and are suitable for supporting forensic experts. Abstract: Neural Networks are the state of the art for many tasks in the computer vision domain, including Writer Identification (WI) and Writer Verification (WV). The transparency of these "black box" systems is important for improvements of performance and reliability. For this work, two transparency techniques are applied to neural networks trained on WI and WV for the first time in this domain. The first technique provides pixel-level saliency maps, while the point-specific saliency maps of the second technique provide information on similarities between two images. The transparency techniques are evaluated using deletion and insertion score metrics. The goal is to support forensic experts with information on similarities in handwritten text and to explore the characteristics selected by a neural network for the identification process. For the qualitative evaluation, the highlights of the maps are compared to the areas forensic experts consider during the identification process. The evaluation results show that the pixel-wise saliency maps outperform the point-specific saliency maps and are suitable for the support of forensic experts.

[124] MambaHash: Visual State Space Deep Hashing Model for Large-Scale Image Retrieval

Chao He,Hongxi Wei

Main category: cs.CV

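TL;DR: MambaHash uses a visual state space model to improve deep hashing, and experiments demonstrate its effectiveness and superiority on large-scale image retrieval tasks.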

Details Motivation: Although Vision Mamba offers linear time complexity and strong performance on computer vision tasks, its suitability for large-scale image retrieval has not yet been explored. Method: A Mamba-based backbone with a stage-wise architecture is proposed, comprising a grouped Mamba operation, a channel interaction attention module, and an adaptive feature enhancement module to improve information modeling and feature representation. Result: Experiments on CIFAR-10, NUS-WIDE, and IMAGENET show that MambaHash achieves higher efficiency and better performance than existing deep hashing methods. Conclusion: MambaHash demonstrates the potential of Vision Mamba for large-scale image retrieval, and open-source code is provided for related research. Abstract: Deep image hashing aims to enable effective large-scale image retrieval by mapping the input images into simple binary hash codes through deep neural networks. More recently, Vision Mamba with linear time complexity has attracted extensive attention from researchers by achieving outstanding performance on various computer vision tasks. Nevertheless, the suitability of Mamba for large-scale image retrieval tasks still needs to be explored. Towards this end, we propose a visual state space hashing model, called MambaHash. Concretely, we propose a backbone network with stage-wise architecture, in which grouped Mamba operation is introduced to model local and global information by utilizing Mamba to perform multi-directional scanning along different groups of the channel. Subsequently, the proposed channel interaction attention module is used to enhance information communication across channels. Finally, we meticulously design an adaptive feature enhancement module to increase feature diversity and enhance the visual representation capability of the model. We have conducted comprehensive experiments on three widely used datasets: CIFAR-10, NUS-WIDE and IMAGENET. The experimental results demonstrate that compared with the state-of-the-art deep hashing methods, our proposed MambaHash has good efficiency and superior performance to effectively accomplish large-scale image retrieval tasks. Source code is available at https://github.com/shuaichaochao/MambaHash.git
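The retrieval step that deep hashing relies on, mapping features to binary codes and ranking by Hamming distance, can be sketched as follows. This is a generic illustration, assuming sign-based binarization of already-extracted feature vectors. The helper names `binarize`, `hamming`, and `retrieve` are hypothetical and are not taken from the MambaHash code base.

```python
def binarize(features):
    """Map real-valued features to a binary code via the sign function."""
    return [1 if x >= 0 else 0 for x in features]


def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def retrieve(query_code, database_codes, top_k=2):
    """Return indices of the top_k database codes closest to the query."""
    order = sorted(
        range(len(database_codes)),
        key=lambda i: hamming(query_code, database_codes[i]),
    )
    return order[:top_k]


# Toy features standing in for the output of a hashing network.
db = [binarize(f) for f in ([0.9, -0.2, 0.4, -0.7],
                            [-0.5, 0.3, -0.1, 0.8],
                            [0.7, -0.6, -0.3, -0.2])]
query = binarize([0.8, -0.1, 0.5, -0.9])
ranked = retrieve(query, db)
```

Because the codes are binary, the distance computation reduces to bit comparisons, which is what makes hashing attractive for large databases.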

[125] Prompt-based Dynamic Token Pruning to Guide Transformer Attention in Efficient Segmentation

Pallabi Dutta,Anubhab Maity,Sushmita Mitra

Main category: cs.CV

TL;DR: This paper proposes an adaptive prompt-guided pruning method for Vision Transformers to reduce computational costs in medical image segmentation by focusing on relevant tokens, achieving significant efficiency improvements without sacrificing accuracy.

Details Motivation: Vision Transformers (ViTs) face high computational demands due to processing a large number of tokens, which limits their practical application in medical image analysis. This research aims to address this issue by improving computational efficiency while preserving segmentation accuracy. Method: An adaptive prompt-guided pruning method was proposed to rank tokens based on their relevance, selectively reducing the processing of irrelevant tokens during the segmentation pipeline. This strategy allows data-driven pruning, end-to-end training, and preserved gradient flow. Result: Experimental results demonstrated a reduction of ~35-55% in the number of tokens processed, leading to reduced computational costs compared to baselines. The framework maintained segmentation accuracy while significantly enhancing efficiency. Conclusion: The proposed adaptive prompt-guided pruning method enhances computational efficiency and maintains segmentation accuracy, enabling cost-effective and real-time medical image processing in resource-constrained environments. Abstract: The high computational demands of Vision Transformers (ViTs), in processing a huge number of tokens, often constrain their practical application in analyzing medical images. This research proposes an adaptive prompt-guided pruning method to selectively reduce the processing of irrelevant tokens in the segmentation pipeline. The prompt-based spatial prior helps to rank the tokens according to their relevance. Tokens with low-relevance scores are down-weighted, ensuring that only the relevant ones are propagated for processing across subsequent stages. This data-driven pruning strategy facilitates end-to-end training, maintains gradient flow, and improves segmentation accuracy by focusing computational resources on essential regions. 
The proposed framework is integrated with several state-of-the-art models to facilitate the elimination of irrelevant tokens, thereby enhancing computational efficiency while preserving segmentation accuracy. The experimental results show a reduction of approximately 35-55% in tokens, thus reducing the computational costs relative to the baselines. Cost-effective medical image processing, using our framework, facilitates real-time diagnosis by expanding its applicability in resource-constrained environments.
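The rank-then-down-weight idea described in the abstract can be sketched in a few lines. This is a minimal illustration, assuming the relevance scores have already been derived from the prompt-based spatial prior. The function name `prune_tokens` and the parameters `keep_ratio` and `low_weight` are hypothetical, and a trained model would operate on tensors rather than Python lists.

```python
def prune_tokens(tokens, relevance, keep_ratio=0.5, low_weight=0.0):
    """Down-weight the least relevant tokens and report which ones survive.

    tokens:    list of feature vectors (one per token)
    relevance: one score per token (assumed to come from a prompt-based prior)
    """
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Rank token indices from most to least relevant.
    ranked = sorted(range(len(tokens)), key=lambda i: relevance[i], reverse=True)
    keep = set(ranked[:n_keep])
    # Multiplicative weights instead of hard deletion keep the operation simple
    # to differentiate in an autograd framework.
    weights = [1.0 if i in keep else low_weight for i in range(len(tokens))]
    weighted = [[w * x for x in tok] for tok, w in zip(tokens, weights)]
    return weighted, sorted(keep)


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
relevance = [0.9, 0.1, 0.6, 0.2]
weighted, kept = prune_tokens(tokens, relevance, keep_ratio=0.5)
```

With `keep_ratio=0.5`, half of the tokens are suppressed, which falls inside the 35-55% reduction range reported above. Only the indices in `kept` would need to be propagated to later stages.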

[126] AGC-Drive: A Large-Scale Dataset for Real-World Aerial-Ground Collaboration in Driving Scenarios

Yunhao Hou,Bochao Zou,Min Zhang,Ran Chen,Shangdong Yang,Yanmei Zhang,Junbao Zhuo,Siheng Chen,Jiansheng Chen,Huimin Ma

Main category: cs.CV

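TL;DR: This paper presents AGC-Drive, the first large-scale dataset for aerial-ground cooperative 3D perception, containing multi-view data from vehicles and a UAV, with the aim of improving collaborative perception in autonomous driving and filling the gap in research on aerial perspectives.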

Details Motivation: Existing collaborative perception research focuses mainly on vehicle-to-vehicle or vehicle-to-infrastructure cooperation, with little study of the aerial perspective that UAVs provide, and high-quality datasets for aerial-ground collaborative scenarios are lacking. Method: A data collection platform with two vehicles and one UAV was built. Each vehicle carries five cameras and one LiDAR sensor, and the UAV carries a forward-facing camera and a LiDAR sensor, enabling multi-view, multi-agent perception. Result: The AGC-Drive dataset comprises approximately 120K LiDAR frames and 440K images covering 14 real-world driving scenarios, and it comes with benchmarks for two 3D perception tasks and an open-source toolkit. Conclusion: AGC-Drive provides a high-quality real-world dataset for multi-agent collaborative perception in autonomous driving, supports vehicle-to-vehicle and vehicle-to-UAV 3D perception tasks, and advances research in this area. Abstract: By sharing information across multiple agents, collaborative perception helps autonomous vehicles mitigate occlusions and improve overall perception accuracy. Most previous work focuses on vehicle-to-vehicle and vehicle-to-infrastructure collaboration, with limited attention to the aerial perspectives provided by UAVs, which uniquely offer dynamic, top-down views to alleviate occlusions and monitor large-scale interactive environments. A major reason for this is the lack of high-quality datasets for aerial-ground collaborative scenarios. To bridge this gap, we present AGC-Drive, the first large-scale real-world dataset for Aerial-Ground Cooperative 3D perception. The data collection platform consists of two vehicles, each equipped with five cameras and one LiDAR sensor, and one UAV carrying a forward-facing camera and a LiDAR sensor, enabling comprehensive multi-view and multi-agent perception. Consisting of approximately 120K LiDAR frames and 440K images, the dataset covers 14 diverse real-world driving scenarios, including urban roundabouts, highway tunnels, and on/off ramps. Notably, 19.5% of the data comprises dynamic interaction events, including vehicle cut-ins, cut-outs, and frequent lane changes. AGC-Drive contains 400 scenes, each with approximately 100 frames and fully annotated 3D bounding boxes covering 13 object categories. We provide benchmarks for two 3D perception tasks: vehicle-to-vehicle collaborative perception and vehicle-to-UAV collaborative perception. 
Additionally, we release an open-source toolkit, including spatiotemporal alignment verification tools, multi-agent visualization systems, and collaborative annotation utilities. The dataset and code are available at https://github.com/PercepX/AGC-Drive.

[127] CLIP-MG: Guiding Semantic Attention with Skeletal Pose Features and RGB Data for Micro-Gesture Recognition on the iMiGUE Dataset

Santosh Patapati,Trisanth Srinivasan,Amith Adiraju

Main category: cs.CV

TL;DR: This paper introduces CLIP-MG, a modified CLIP model that incorporates pose information to improve micro-gesture recognition, achieving a Top-1 accuracy of 61.82%.

Details Motivation: Micro-gesture recognition is challenging due to its subtle, involuntary nature and low movement amplitude. This motivates the need for more effective models tailored to recognize these gestures accurately. Method: The paper proposes a Pose-Guided Semantics-Aware CLIP-based architecture called CLIP-MG, which integrates human pose information into the CLIP-based recognition pipeline using pose-guided semantic query generation and a gated multi-modal fusion mechanism. Result: The proposed CLIP-MG model achieves a Top-1 accuracy of 61.82% on the iMiGUE dataset. Conclusion: The paper concludes that the proposed CLIP-MG model demonstrates potential in micro-gesture recognition but also highlights the challenges in fully adapting vision-language models like CLIP for such tasks. Abstract: Micro-gesture recognition is a challenging task in affective computing due to the subtle, involuntary nature of the gestures and their low movement amplitude. In this paper, we introduce a Pose-Guided Semantics-Aware CLIP-based architecture, or CLIP for Micro-Gesture recognition (CLIP-MG), a modified CLIP model tailored for micro-gesture classification on the iMiGUE dataset. CLIP-MG integrates human pose (skeleton) information into the CLIP-based recognition pipeline through pose-guided semantic query generation and a gated multi-modal fusion mechanism. The proposed model achieves a Top-1 accuracy of 61.82%. These results demonstrate both the potential of our approach and the remaining difficulty in fully adapting vision-language models like CLIP for micro-gesture recognition.

[128] HyperPath: Knowledge-Guided Hyperbolic Semantic Hierarchy Modeling for WSI Analysis

Peixiang Huang,Yanyan Huang,Weiqin Zhao,Junjun He,Lequan Yu

Main category: cs.CV

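TL;DR: HyperPath is a new method that integrates knowledge from textual descriptions to model the semantic hierarchy of WSIs in hyperbolic space, thereby improving WSI classification performance.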

Details Motivation: Pathology is essential for cancer diagnosis, and multiple instance learning (MIL) is widely used for whole slide image (WSI) analysis. Although some methods try to exploit the natural hierarchy of WSIs for improved representation, they rely mainly on Euclidean embeddings, which struggle to fully capture semantic hierarchies. Method: The authors propose HyperPath, which integrates knowledge from textual descriptions into the modeling of WSI semantic hierarchies in hyperbolic space. The approach adapts both visual and textual features extracted by pathology vision-language foundation models to hyperbolic space. An Angular Modality Alignment Loss ensures robust cross-modal alignment, while a Semantic Hierarchy Consistency Loss further refines feature hierarchies through entailment and contradiction relationships, enhancing semantic coherence. Classification is performed with geodesic distance, which measures the similarity between entities in the hyperbolic semantic hierarchy. Result: Extensive experiments show that the method achieves superior performance across tasks compared with existing methods, highlighting the potential of hyperbolic embeddings for WSI analysis. Conclusion: By integrating knowledge into semantic hierarchy modeling in hyperbolic space, HyperPath improves WSI classification performance and demonstrates the potential of hyperbolic embeddings for WSI analysis. Abstract: Pathology is essential for cancer diagnosis, with multiple instance learning (MIL) widely used for whole slide image (WSI) analysis. WSIs exhibit a natural hierarchy -- patches, regions, and slides -- with distinct semantic associations. While some methods attempt to leverage this hierarchy for improved representation, they predominantly rely on Euclidean embeddings, which struggle to fully capture semantic hierarchies. To address this limitation, we propose HyperPath, a novel method that integrates knowledge from textual descriptions to guide the modeling of semantic hierarchies of WSIs in hyperbolic space, thereby enhancing WSI classification. Our approach adapts both visual and textual features extracted by pathology vision-language foundation models to the hyperbolic space. We design an Angular Modality Alignment Loss to ensure robust cross-modal alignment, while a Semantic Hierarchy Consistency Loss further refines feature hierarchies through entailment and contradiction relationships, thus enhancing semantic coherence. The classification is performed with geodesic distance, which measures the similarity between entities in the hyperbolic semantic hierarchy. This eliminates the need for linear classifiers and enables a geometry-aware approach to WSI analysis. Extensive experiments show that our method achieves superior performance across tasks compared to existing methods, highlighting the potential of hyperbolic embeddings for WSI analysis.
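Classification by geodesic distance, as described above, can be sketched as nearest-prototype assignment. The abstract does not say which model of hyperbolic space is used, so this sketch assumes the Poincaré ball with curvature -1. The class names, prototype coordinates, and the functions `poincare_distance` and `classify` are hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    def sq_norm(x):
        return sum(c * c for c in x)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


def classify(embedding, prototypes):
    """Assign the class whose prototype is geodesically closest."""
    return min(prototypes, key=lambda name: poincare_distance(embedding, prototypes[name]))


# Hypothetical class prototypes, e.g. hyperbolic embeddings of text descriptions.
prototypes = {"tumor": [0.6, 0.0], "normal": [-0.6, 0.0]}
label = classify([0.4, 0.1], prototypes)
```

No linear classifier head is involved: the prediction depends only on distances in the embedding space, which matches the geometry-aware design the abstract describes.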

[129] Robustness Evaluation of OCR-based Visual Document Understanding under Multi-Modal Adversarial Attacks

Dong Nguyen Tien,Dung D. Le

Main category: cs.CV

TL;DR: This paper presents the first unified framework for multi-modal adversarial attacks on OCR-based VDU models, revealing that line-level and compound perturbations severely degrade performance, with PGD-based attacks being the most effective.

Details Motivation: The robustness of Visual Document Understanding (VDU) systems under realistic adversarial perturbations has not been sufficiently explored, prompting the need for a comprehensive framework to assess vulnerabilities. Method: The study introduces a unified framework for generating and evaluating multi-modal adversarial attacks on OCR-based VDU models, covering six gradient-based layout attack scenarios with constraints on layout perturbation budget. Result: Experiments across four datasets and six model families showed that line-level attacks and compound perturbations lead to the worst performance degradation, while PGD-based BBox perturbations outperformed random-shift baselines. Conclusion: Projected Gradient Descent (PGD)-based BBox perturbations are more effective than random-shift baselines, and line-level attacks and compound perturbations cause the most significant performance degradation in VDU systems. Abstract: Visual Document Understanding (VDU) systems have achieved strong performance in information extraction by integrating textual, layout, and visual signals. However, their robustness under realistic adversarial perturbations remains insufficiently explored. We introduce the first unified framework for generating and evaluating multi-modal adversarial attacks on OCR-based VDU models. Our method covers six gradient-based layout attack scenarios, incorporating manipulations of OCR bounding boxes, pixels, and texts across both word and line granularities, with constraints on layout perturbation budget (e.g., IoU >= 0.6) to preserve plausibility. Experimental results across four datasets (FUNSD, CORD, SROIE, DocVQA) and six model families demonstrate that line-level attacks and compound perturbations (BBox + Pixel + Text) yield the most severe performance degradation. Projected Gradient Descent (PGD)-based BBox perturbations outperform random-shift baselines in all investigated models. 
Ablation studies further validate the impact of layout budget, text modification, and adversarial transferability.
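
The layout budget mentioned in the abstract (e.g., IoU >= 0.6) can be illustrated with a small projection step. The sketch below is not the authors' implementation: the `Box` layout, the function names, and the bisection-based scaling are all assumptions made for illustration. It shrinks a proposed bounding-box perturbation until the perturbed box keeps at least the required overlap with the original, which is the kind of projection a PGD-style attack would apply after each gradient step.

```python
from typing import Tuple

# Hypothetical box layout for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def project_to_iou_budget(orig: Box, delta: Box,
                          min_iou: float = 0.6, steps: int = 30) -> Box:
    """Scale `delta` by a factor in [0, 1] so that IoU(orig, result) >= min_iou.

    Bisection keeps `lo` on a scale that satisfies the budget (scale 0 gives
    IoU = 1), so the returned box always respects the constraint. The sketch
    assumes the perturbed box stays valid (x_min < x_max, y_min < y_max).
    """
    def shifted(scale: float) -> Box:
        return tuple(o + scale * d for o, d in zip(orig, delta))

    if iou(orig, shifted(1.0)) >= min_iou:
        return shifted(1.0)
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if iou(orig, shifted(mid)) >= min_iou:
            lo = mid
        else:
            hi = mid
    return shifted(lo)


original = (0.0, 0.0, 10.0, 10.0)
proposed_shift = (8.0, 0.0, 8.0, 0.0)  # move the whole box 8 units to the right
projected = project_to_iou_budget(original, proposed_shift)
```

In this example the full shift would leave very little overlap, so the projection scales it back to the largest shift that still meets the budget.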

[130] Efficient Transformations in Deep Learning Convolutional Neural Networks

Berk Yilmaz,Daniel Fidel Harvey,Prajit Dhuri

Main category: cs.CV

TL;DR: This study explores how integrating signal processing techniques like WHT into CNNs can improve energy efficiency and accuracy for image classification tasks.

Details Motivation: The motivation is to evaluate the trade-offs between computational efficiency, energy consumption, and classification accuracy in CNN models, particularly for energy-constrained applications. Method: This study integrates signal processing transformations (FFT, WHT, DCT) into the ResNet50 CNN model and evaluates their impact on computational efficiency, energy consumption, and classification accuracy using the CIFAR-100 dataset. Result: The modified ResNet50 model with WHT applied achieved improved accuracy (74% for early layers, 79% for both early and late layers) while significantly reducing energy consumption (from 25,606 kJ to 39 kJ). Conclusion: The study concludes that incorporating WHT into the ResNet50 model enhances energy efficiency and classification accuracy, making it a promising approach for CNN applications constrained by energy resources. Abstract: This study investigates the integration of signal processing transformations -- Fast Fourier Transform (FFT), Walsh-Hadamard Transform (WHT), and Discrete Cosine Transform (DCT) -- within the ResNet50 convolutional neural network (CNN) model for image classification. The primary objective is to assess the trade-offs between computational efficiency, energy consumption, and classification accuracy during training and inference. Using the CIFAR-100 dataset (100 classes, 60,000 images), experiments demonstrated that incorporating WHT significantly reduced energy consumption while improving accuracy. Specifically, a baseline ResNet50 model achieved a testing accuracy of 66%, consuming an average of 25,606 kJ per model. In contrast, a modified ResNet50 incorporating WHT in the early convolutional layers achieved 74% accuracy, and an enhanced version with WHT applied to both early and late layers achieved 79% accuracy, with an average energy consumption of only 39 kJ per model. 
These results demonstrate the potential of WHT as a highly efficient and effective approach for energy-constrained CNN applications.
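
For readers unfamiliar with the Walsh-Hadamard Transform, the following is a minimal sketch of the standard fast algorithm in natural (Hadamard) order. It is not taken from the paper, and how the authors embed the transform inside ResNet50 layers is not described in this summary. The function name `fwht` is an assumption. The transform uses only additions and subtractions, which is one commonly cited reason it is cheap to compute.

```python
from typing import List, Sequence


def fwht(values: Sequence[float]) -> List[float]:
    """Unnormalized fast Walsh-Hadamard transform (natural order).

    Runs in O(n log n) additions/subtractions; n must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine entries that are h apart.
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                x, y = out[j], out[j + h]
                out[j], out[j + h] = x + y, x - y
        h *= 2
    return out


signal = [1, 0, 1, 0, 0, 1, 1, 0]
spectrum = fwht(signal)
# Applying the transform twice and dividing by n recovers the input.
recovered = [v / len(signal) for v in fwht(spectrum)]
```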

[131] Structured Semantic 3D Reconstruction (S23DR) Challenge 2025 -- Winning solution

Jan Skvrna,Lukas Neumann

Main category: cs.CV

TL;DR: This paper details a successful method for predicting a house's 3D roof structure using deep learning techniques applied directly to 3D data.

Details Motivation: The motivation is to develop an efficient solution for predicting 3D roof wireframes as part of the S23DR Challenge 2025. Method: The method involves identifying vertex candidates from a point cloud using Gestalt segmentations and refining and classifying these candidates with two PointNet-like models, one for vertices and another for edges. Result: The approach achieved a Hybrid Structure Score (HSS) of 0.43 on the private leaderboard, winning the challenge. Conclusion: The paper concludes that the two-stage, 3D deep learning approach is effective in predicting a house's 3D roof wireframe from sparse data. Abstract: This paper presents the winning solution for the S23DR Challenge 2025, which involves predicting a house's 3D roof wireframe from a sparse point cloud and semantic segmentations. Our method operates directly in 3D, first identifying vertex candidates from the COLMAP point cloud using Gestalt segmentations. We then employ two PointNet-like models: one to refine and classify these candidates by analyzing local cubic patches, and a second to predict edges by processing the cylindrical regions connecting vertex pairs. This two-stage, 3D deep learning approach achieved a winning Hybrid Structure Score (HSS) of 0.43 on the private leaderboard.

[132] How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering?

Giuseppe Lando,Rosario Forte,Giovanni Maria Farinella,Antonino Furnari

Main category: cs.CV

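TL;DR: This paper studies how off-the-shelf multimodal large language models, combined with an LLM reasoner module, can answer video questions efficiently without additional training while substantially reducing memory consumption.

Details Motivation: To explore whether existing multimodal large language models can solve Online Episodic-Memory Video Question Answering (OEM-VQA) without additional training while using memory efficiently. Method: A streaming egocentric video is converted into a lightweight textual memory, which an LLM reasoner module then queries to answer questions. Result: On the QAEgo4D-Closed benchmark, the best configuration reaches 56.0% accuracy while requiring only 3.6 kB of storage per minute, making it far more memory-efficient than existing state-of-the-art systems. Conclusion: The paper proposes a lightweight online memory storage and querying framework built on off-the-shelf Multimodal Large Language Models (MLLMs) and an LLM reasoner module, and it handles video question answering efficiently. Abstract: We investigate whether off-the-shelf Multimodal Large Language Models (MLLMs) can tackle Online Episodic-Memory Video Question Answering (OEM-VQA) without additional training. Our pipeline converts a streaming egocentric video into a lightweight textual memory, only a few kilobytes per minute, via an MLLM descriptor module, and answers multiple-choice questions by querying this memory with an LLM reasoner module. On the QAEgo4D-Closed benchmark, our best configuration attains 56.0% accuracy with 3.6 kB per minute storage, matching the performance of dedicated state-of-the-art systems while being 10^4 to 10^5 times more memory-efficient. Extensive ablations provide insights into the role of each component and design choice, and highlight directions of improvement for future research.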

[133] Spotting tell-tale visual artifacts in face swapping videos: strengths and pitfalls of CNN detectors

Riccardo Ziglio,Cecilia Pasquini,Silvio Ranise

Main category: cs.CV

TL;DR: This paper examines how effectively convolutional neural networks detect face swapping in videos through artifacts caused by face occlusions, and points out the challenge of generalizing across datasets.

Details Motivation: With advances in automated and real-time tools, face swapping manipulations in video streams pose a growing threat to remote video communications. Method: CNN-based data-driven models are benchmarked on two data corpora, and their generalization across acquisition sources and swapping algorithms is analyzed. Result: The results confirm excellent performance of general-purpose CNN architectures within the same data source, but significant difficulty in robustly characterizing occlusion-related visual cues across datasets. Conclusion: The study highlights the need for specialized detection strategies to handle visual artifacts caused by face occlusions. Abstract: Face swapping manipulations in video streams represent an increasing threat in remote video communications, due to advances in automated and real-time tools. Recent literature proposes to characterize and exploit visual artifacts introduced in video frames by swapping algorithms when dealing with challenging physical scenes, such as face occlusions. This paper investigates the effectiveness of this approach by benchmarking CNN-based data-driven models on two data corpora (including a newly collected one) and analyzing generalization capabilities with respect to different acquisition sources and swapping algorithms. The results confirm excellent performance of general-purpose CNN architectures when operating within the same data source, but a significant difficulty in robustly characterizing occlusion-based visual cues across datasets. This highlights the need for specialized detection strategies to deal with such artifacts.

[134] Hunyuan3D 2.5: Towards High-Fidelity 3D Assets Generation with Ultimate Details

Zeqiang Lai,Yunfei Zhao,Haolin Liu,Zibo Zhao,Qingxiang Lin,Huiwen Shi,Xianghui Yang,Mingxin Yang,Shuhui Yang,Yifei Feng,Sheng Zhang,Xin Huang,Di Luo,Fan Yang,Fang Yang,Lifu Wang,Sicong Liu,Yixuan Tang,Yulin Cai,Zebin He,Tian Liu,Yuhong Liu,Jie Jiang,Linus,Jingwei Huang,Chunchao Guo

Main category: cs.CV

TL;DR: This paper introduces Hunyuan3D 2.5, an advanced 3D diffusion model capable of generating high-quality 3D shapes and textures, surpassing prior approaches.

Details Motivation: The motivation behind Hunyuan3D 2.5 is to generate high-fidelity and detailed textured 3D assets while closing the gap between generated and handcrafted 3D shapes. Method: The method involves a two-stage pipeline with a new shape foundation model called LATTICE for shape generation, and a novel multi-view architecture based on the Hunyuan3D 2.0 Paint model for physically based rendering (PBR) in texture generation. Result: The largest LATTICE model reaches 10B parameters and generates sharp and detailed 3D shapes with precise image-3D following while keeping the mesh surface clean and smooth. Texture generation was also upgraded via PBR, leading to superior performance compared to previous methods. Conclusion: Hunyuan3D 2.5 significantly outperforms previous methods in both shape and end-to-end texture generation. Abstract: In this report, we present Hunyuan3D 2.5, a robust suite of 3D diffusion models aimed at generating high-fidelity and detailed textured 3D assets. Hunyuan3D 2.5 follows the two-stage pipeline of its previous version Hunyuan3D 2.0, while demonstrating substantial advancements in both shape and texture generation. In terms of shape generation, we introduce a new shape foundation model -- LATTICE, which is trained with scaled high-quality datasets, model-size, and compute. Our largest model reaches 10B parameters and generates sharp and detailed 3D shapes with precise image-3D following while keeping the mesh surface clean and smooth, significantly closing the gap between generated and handcrafted 3D shapes. In terms of texture generation, it is upgraded with physically based rendering (PBR) via a novel multi-view architecture extended from the Hunyuan3D 2.0 Paint model. Our extensive evaluation shows that Hunyuan3D 2.5 significantly outperforms previous methods in both shape and end-to-end texture generation.

[135] How Hard Is Snow? A Paired Domain Adaptation Dataset for Clear and Snowy Weather: CADC+

Mei Qi Tang,Sean Sedwards,Chengjie Huang,Krzysztof Czarnecki

Main category: cs.CV

TL;DR: This paper introduces CADC+, a novel dataset for assessing the effects of snow on 3D object detection in autonomous driving, showing that snow introduces complex uncertainties.

Details Motivation: The motivation stems from the lack of datasets with sufficient labeled data for both snowy and clear weather conditions. Existing datasets either have limited data or rely on synthetic clear weather data through de-snowing methods, which lack realism and introduce additional domain shifts. Method: The authors introduced CADC+, a new dataset for evaluating the impact of snow on autonomous driving performance by pairing real snowy sequences with corresponding clear weather sequences recorded in the same environment. This minimizes domain shifts unrelated to snow. Result: CADC+ is presented as the first paired weather domain adaptation dataset for winter driving conditions. Preliminary results demonstrate how snow impacts 3D object detection performance by introducing different types of uncertainty. Conclusion: The paper concludes that snowfall introduces both aleatoric and epistemic uncertainties in 3D object detection, acting as both noise and a distinct data domain. Abstract: The impact of snowfall on 3D object detection performance remains underexplored. Conducting such an evaluation requires a dataset with sufficient labelled data from both weather conditions, ideally captured in the same driving environment. Current driving datasets with LiDAR point clouds either do not provide enough labelled data in both snowy and clear weather conditions, or rely on de-snowing methods to generate synthetic clear weather. Synthetic data often lacks realism and introduces an additional domain shift that confounds accurate evaluations. To address these challenges, we present CADC+, the first paired weather domain adaptation dataset for autonomous driving in winter conditions. CADC+ extends the Canadian Adverse Driving Conditions dataset (CADC) using clear weather data that was recorded on the same roads and in the same period as CADC. 
To create CADC+, we pair each CADC sequence with a clear weather sequence that matches the snowy sequence as closely as possible. CADC+ thus minimizes the domain shift resulting from factors unrelated to the presence of snow. We also present some preliminary results using CADC+ to evaluate the effect of snow on 3D object detection performance. We observe that snow introduces a combination of aleatoric and epistemic uncertainties, acting as both noise and a distinct data domain.

[136] From Semantic To Instance: A Semi-Self-Supervised Learning Approach

Keyhan Najafian,Farhad Maleki,Lingling Jin,Ian Stavness

Main category: cs.CV

TL;DR: This study proposes a new semi-self-supervised instance segmentation method (GLMask) that greatly reduces annotation effort and achieves significant performance gains on both agricultural and general-purpose datasets.

Details Motivation: Conventional instance segmentation models require large amounts of pixel-level annotated data and are hard to apply to dense, self-occluded scenes such as those common in agriculture. Method: The authors design GLMask, an image-mask representation that emphasizes shape, texture, and pattern, and develop a pipeline that converts semantic segmentation into instance segmentation. Result: The approach reaches 98.5% mAP@50 on wheat head instance segmentation and improves mAP@50 by more than 12.6% over conventional methods on the COCO dataset. Conclusion: The paper proposes a semi-self-supervised learning approach that effectively addresses instance segmentation in scenes with dense, self-occluded objects, and validates its superior performance on a wheat head dataset and the general-purpose COCO dataset. Abstract: Instance segmentation is essential for applications such as automated monitoring of plant health, growth, and yield. However, developing instance segmentation models requires extensive effort to create large-scale datasets with pixel-level annotations of each object instance, which restricts the use of deep learning in these areas. This challenge is more significant in images with densely packed, self-occluded objects, which are common in agriculture. To address this challenge, we propose a semi-self-supervised learning approach that requires minimal manual annotation to develop a high-performing instance segmentation model. We design GLMask, an image-mask representation for the model to focus on shape, texture, and pattern while minimizing its dependence on color features. We develop a pipeline to generate semantic segmentation and then transform it into instance-level segmentation. The proposed approach substantially outperforms the conventional instance segmentation models, establishing a state-of-the-art wheat head instance segmentation model with mAP@50 of 98.5%. Additionally, we assessed the proposed methodology on the general-purpose Microsoft COCO dataset, achieving a significant performance improvement of over 12.6% mAP@50. This highlights that the utility of our proposed approach extends beyond precision agriculture and applies to other domains, specifically those with similar data characteristics.

[137] SafeTriage: Facial Video De-identification for Privacy-Preserving Stroke Triage

Tongan Cai,Haomiao Ni,Wenchao Ma,Yuan Xue,Qian Ma,Rachel Leicht,Kelvin Wong,John Volpi,Stephen T. C. Wong,James Z. Wang,Sharon X. Huang

Main category: cs.CV

TL;DR: The paper proposes SafeTriage, a novel method for de-identifying patient facial videos while preserving essential motion cues crucial for stroke diagnosis, using a pretrained video motion transfer model and a conditional generative model for visual prompt tuning.

Details Motivation: Effective stroke triage relies on identifying subtle abnormalities in facial muscle coordination, but AI models' reliance on real patient data raises ethical and privacy challenges. This research aims to address these concerns by proposing SafeTriage, a method designed to de-identify patient facial videos while preserving essential motion cues crucial for stroke diagnosis. Method: SafeTriage leverages a pretrained video motion transfer model to map the motion characteristics of real patient faces onto synthetic identities. A conditional generative model for visual prompt tuning is introduced to adapt the input space of the VMT model. Result: Comprehensive evaluation demonstrates that SafeTriage-produced synthetic videos effectively preserve stroke-relevant facial patterns, enabling reliable AI-based triage and ensuring accurate motion transfer without needing to fine-tune the VMT model backbone. Conclusion: SafeTriage provides robust privacy protection while maintaining diagnostic accuracy, offering a secure and ethically sound foundation for data sharing and AI-driven clinical analysis in neurological disorders. Abstract: Effective stroke triage in emergency settings often relies on clinicians' ability to identify subtle abnormalities in facial muscle coordination. While recent AI models have shown promise in detecting such patterns from patient facial videos, their reliance on real patient data raises significant ethical and privacy challenges -- especially when training robust and generalizable models across institutions. To address these concerns, we propose SafeTriage, a novel method designed to de-identify patient facial videos while preserving essential motion cues crucial for stroke diagnosis. SafeTriage leverages a pretrained video motion transfer (VMT) model to map the motion characteristics of real patient faces onto synthetic identities. 
This approach retains diagnostically relevant facial dynamics without revealing the patients' identities. To mitigate the distribution shift between normal population pre-training videos and patient population test videos, we introduce a conditional generative model for visual prompt tuning, which adapts the input space of the VMT model to ensure accurate motion transfer without needing to fine-tune the VMT model backbone. Comprehensive evaluation, including quantitative metrics and clinical expert assessments, demonstrates that SafeTriage-produced synthetic videos effectively preserve stroke-relevant facial patterns, enabling reliable AI-based triage. Our evaluations also show that SafeTriage provides robust privacy protection while maintaining diagnostic accuracy, offering a secure and ethically sound foundation for data sharing and AI-driven clinical analysis in neurological disorders.

[138] Spatially-Aware Evaluation of Segmentation Uncertainty

Tal Zeevi,Eléonore V. Lieffrig,Lawrence H. Staib,John A. Onofrey

Main category: cs.CV

TL;DR: This paper introduces spatially aware metrics for evaluating uncertainty maps in medical image segmentation, showing better clinical relevance and discrimination ability.

Details Motivation: Current uncertainty evaluation metrics treat voxels independently, ignoring spatial context and anatomical structure, leading to similar scores for distinct uncertainty patterns. Method: Three spatially aware metrics were developed and validated using medical imaging data from the prostate zonal segmentation challenge in the Medical Segmentation Decathlon. Result: The new metrics demonstrated improved alignment with clinically important factors and better discrimination between meaningful and spurious uncertainty patterns. Conclusion: The proposed spatially aware metrics improve the evaluation of uncertainty maps by incorporating structural and boundary information, showing better alignment with clinical factors. Abstract: Uncertainty maps highlight unreliable regions in segmentation predictions. However, most uncertainty evaluation metrics treat voxels independently, ignoring spatial context and anatomical structure. As a result, they may assign identical scores to qualitatively distinct patterns (e.g., scattered vs. boundary-aligned uncertainty). We propose three spatially aware metrics that incorporate structural and boundary information and conduct a thorough validation on medical imaging data from the prostate zonal segmentation challenge within the Medical Segmentation Decathlon. Our results demonstrate improved alignment with clinically important factors and better discrimination between meaningful and spurious uncertainty patterns.

[139] MetaQAP -- A Meta-Learning Approach for Quality-Aware Pretraining in Image Quality Assessment

Muhammad Azeem Aslam,Muhammad Hamza,Nisar Ahmed,Gulshan Saleem,Zhu Shuangtong,Hu Hongfei,Xu Wei,Saba Aslam,Wang Jun

Main category: cs.CV

TL;DR: This study introduces MetaQAP, a new no-reference Image Quality Assessment (IQA) model that effectively tackles the challenges posed by the subjective nature of human perception and real-world image distortions. By combining quality-aware pre-training, a quality-aware loss function, and a meta-learning approach, MetaQAP achieves outstanding performance on multiple benchmark datasets, outperforming existing IQA methods and demonstrating strong generalizability across different datasets.

Details Motivation: Image Quality Assessment (IQA) is a critical task in a wide range of applications but remains challenging due to the subjective nature of human perception and the complexity of real-world image distortions. Method: This study proposes MetaQAP, a novel no-reference IQA model designed to address these challenges by leveraging quality-aware pre-training and meta-learning. The model makes three key contributions: pre-training Convolutional Neural Networks (CNNs) on a quality-aware dataset, implementing a quality-aware loss function to optimize predictions, and integrating a meta-learner to form an ensemble model that effectively combines predictions from multiple base models. Result: The proposed MetaQAP model achieved exceptional performance with Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank Order Correlation Coefficient (SROCC) scores of 0.9885/0.9812 on LiveCD, 0.9702/0.9658 on KonIQ-10K, and 0.884/0.8765 on BIQ2021, outperforming existing IQA methods. Cross-dataset evaluations further demonstrated the generalizability of the model, with PLCC and SROCC scores ranging from 0.6721 to 0.8023 and 0.6515 to 0.7805, respectively, across diverse datasets. The ablation study confirmed the significance of each model component, revealing substantial performance degradation when critical elements such as the meta-learner or quality-aware loss function were omitted. Conclusion: MetaQAP not only addresses the complexities of authentic distortions but also establishes a robust and generalizable framework for practical IQA applications. By advancing the state-of-the-art in no-reference IQA, this research provides valuable insights and methodologies for future improvements and extensions in the field. Abstract: Image Quality Assessment (IQA) is a critical task in a wide range of applications but remains challenging due to the subjective nature of human perception and the complexity of real-world image distortions. 
This study proposes MetaQAP, a novel no-reference IQA model designed to address these challenges by leveraging quality-aware pre-training and meta-learning. The model makes three key contributions: pre-training Convolutional Neural Networks (CNNs) on a quality-aware dataset, implementing a quality-aware loss function to optimize predictions, and integrating a meta-learner to form an ensemble model that effectively combines predictions from multiple base models. Experimental evaluations were conducted on three benchmark datasets: LiveCD, KonIQ-10K, and BIQ2021. The proposed MetaQAP model achieved exceptional performance with Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank Order Correlation Coefficient (SROCC) scores of 0.9885/0.9812 on LiveCD, 0.9702/0.9658 on KonIQ-10K, and 0.884/0.8765 on BIQ2021, outperforming existing IQA methods. Cross-dataset evaluations further demonstrated the generalizability of the model, with PLCC and SROCC scores ranging from 0.6721 to 0.8023 and 0.6515 to 0.7805, respectively, across diverse datasets. The ablation study confirmed the significance of each model component, revealing substantial performance degradation when critical elements such as the meta-learner or quality-aware loss function were omitted. MetaQAP not only addresses the complexities of authentic distortions but also establishes a robust and generalizable framework for practical IQA applications. By advancing the state-of-the-art in no-reference IQA, this research provides valuable insights and methodologies for future improvements and extensions in the field.
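
The two reported metrics, PLCC and SROCC, can be computed as in the dependency-free sketch below. It follows the textbook definitions rather than the paper's evaluation code, and the function names and sample scores are hypothetical. PLCC measures linear agreement between predicted and subjective scores, while SROCC is the Pearson correlation of their ranks and therefore only measures monotonic agreement.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank-order correlation coefficient (SROCC)."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical predicted scores and subjective scores: monotonic but nonlinear.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
subjective = [1.0, 4.0, 9.0, 16.0, 25.0]
plcc = pearson(predicted, subjective)
srocc = spearman(predicted, subjective)
```

Because the hypothetical scores are perfectly monotonic but not linear, the rank-based SROCC is higher than the PLCC here, which is why IQA papers typically report both.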

[140] Leveraging CNN and IoT for Effective E-Waste Management

Ajesh Thangaraj Nadar,Gabriel Nixon Raj,Soham Chandane,Sushant Bhat

Main category: cs.CV

TL;DR: This paper proposes an IoT and CNN-based system for automated e-waste classification and routing to improve recycling efficiency and reduce environmental risks.

Details Motivation: The motivation is to address the environmental and health risks posed by improper disposal and insufficient recycling of e-waste by developing an efficient, technology-driven solution. Method: The method involves integrating an IoT framework with a lightweight CNN-based classification model using visual and weight-based attributes for automated e-waste sorting. Result: The system successfully enables real-time detection and classification of e-waste components such as circuit boards, sensors, and wires, contributing to smarter recycling workflows. Conclusion: The paper concludes that the proposed IoT-enabled system with a lightweight CNN-based classification pipeline can significantly enhance e-waste identification, categorization, and routing, thereby improving recycling efficiency. Abstract: The increasing proliferation of electronic devices in the modern era has led to a significant surge in electronic waste (e-waste). Improper disposal and insufficient recycling of e-waste pose serious environmental and health risks. This paper proposes an IoT-enabled system combined with a lightweight CNN-based classification pipeline to enhance the identification, categorization, and routing of e-waste materials. By integrating a camera system and a digital weighing scale, the framework automates the classification of electronic items based on visual and weight-based attributes. The system demonstrates how real-time detection of e-waste components such as circuit boards, sensors, and wires can facilitate smart recycling workflows and improve overall waste processing efficiency.

[141] A Comparative Analysis of Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) as Dimensionality Reduction Techniques

Michael Gyimadu,Gregory Bell

Main category: cs.CV

TL;DR: This paper analytically compares PCA and SVD for dimensionality reduction in high-dimensional image data, offering guidelines for choosing between them based on matrix properties.

Details Motivation: High-dimensional image data often require dimensionality reduction, prompting the need to compare techniques like PCA and SVD for efficient analysis. Method: The paper analytically compares PCA and SVD by deriving the algorithms from first principles and assessing their properties. Result: The paper provides rule-of-thumb guidelines for choosing between PCA and SVD, along with an assessment of their interpretability, numerical stability, and suitability for different matrix shapes. Conclusion: The paper concludes that the choice between PCA and SVD can be guided by matrix characteristics, interpretability, and numerical stability, without the need for empirical benchmarking. Abstract: High-dimensional image data often require dimensionality reduction before further analysis. This paper provides a purely analytical comparison of two linear techniques: Principal Component Analysis (PCA) and Singular Value Decomposition (SVD). After deriving each algorithm from first principles, we assess their interpretability, numerical stability, and suitability for differing matrix shapes. Building on classical and recent numerical literature, we synthesize rule-of-thumb guidelines for choosing between the two algorithms without empirical benchmarking. Limitations and directions for future experimental work are outlined at the end.
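
As background for this comparison, the standard link between the two techniques can be written in a few lines. This is the textbook relation, not a derivation quoted from the paper, and the notation is an assumption: $X \in \mathbb{R}^{n \times d}$ is the column-centered data matrix with $n$ samples, $C$ is its sample covariance, and $\sigma_i$ are the singular values of $X$.

```latex
\begin{align}
X &= U \Sigma V^{\top}
  && \text{(thin SVD of the centered data)} \\
C &= \frac{1}{n-1} X^{\top} X
   = \frac{1}{n-1} V \Sigma^{\top} U^{\top} U \Sigma V^{\top}
   = V \,\frac{\Sigma^{2}}{n-1}\, V^{\top}
  && \text{(since } U^{\top} U = I \text{)} \\
\lambda_i &= \frac{\sigma_i^{2}}{n-1}
  && \text{(PCA eigenvalues from singular values)}
\end{align}
```

The columns of $V$ are therefore the principal directions, and the projected data are $XV = U\Sigma$. Computing PCA through the SVD of $X$ avoids forming $X^{\top}X$ explicitly, which is the usual argument for its better numerical stability.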

[142] Extracting Multimodal Learngene in CLIP: Unveiling the Multimodal Generalizable Knowledge

Ruiming Chen,Junming Yang,Shiyu Xia,Xu Yang,Jing Wang,Xin Geng

Main category: cs.CV

TL;DR: This paper proposes MM-LG, a novel framework designed to extract and leverage generalizable components from CLIP, achieving performance gains over existing learngene approaches and comparable or superior results to the pre-training and fine-tuning paradigm.

Details Motivation: CLIP has attracted widespread attention for its multimodal generalizable knowledge, which is significant for downstream tasks. However, the computational overhead of a large number of parameters and large-scale pre-training makes it challenging to pre-train CLIP at different scales. Learngene extracts the generalizable components, termed learngene, from an ancestry model and initializes diverse descendant models with them. Previous Learngene paradigms fail to handle the generalizable knowledge in multimodal scenarios. Method: We first establish multimodal and unimodal blocks to extract the multimodal and unimodal generalizable knowledge in a weighted-sum manner. Subsequently, we employ these components to numerically initialize descendant models of varying scales and modalities. Result: Extensive experiments demonstrate MM-LG's effectiveness: it achieves performance gains over existing learngene approaches (e.g., +3.1% on Oxford-IIIT PET and +4.13% on Flickr30k) and comparable or superior results to the pre-training and fine-tuning paradigm (e.g., +1.9% on Oxford-IIIT PET and +3.65% on Flickr30k). Conclusion: MM-LG requires only around 25% of the parameter storage while reducing pre-training costs by around 2.8 times for diverse model scales compared to the pre-training and fine-tuning paradigm, making it particularly suitable for efficient deployment across diverse downstream tasks. Abstract: CLIP (Contrastive Language-Image Pre-training) has attracted widespread attention for its multimodal generalizable knowledge, which is significant for downstream tasks. However, the computational overhead of a large number of parameters and large-scale pre-training makes it challenging to pre-train CLIP at different scales. Learngene extracts the generalizable components, termed learngene, from an ancestry model and initializes diverse descendant models with them. Previous Learngene paradigms fail to handle the generalizable knowledge in multimodal scenarios. 
In this paper, we put forward the idea of utilizing a multimodal block to extract the multimodal generalizable knowledge, which inspires us to propose MM-LG (Multimodal Learngene), a novel framework designed to extract and leverage generalizable components from CLIP. Specifically, we first establish multimodal and unimodal blocks to extract the multimodal and unimodal generalizable knowledge in a weighted-sum manner. Subsequently, we employ these components to numerically initialize descendant models of varying scales and modalities. Extensive experiments demonstrate MM-LG's effectiveness: it achieves performance gains over existing learngene approaches (e.g., +3.1% on Oxford-IIIT PET and +4.13% on Flickr30k) and comparable or superior results to the pre-training and fine-tuning paradigm (e.g., +1.9% on Oxford-IIIT PET and +3.65% on Flickr30k). Notably, MM-LG requires only around 25% of the parameter storage while reducing pre-training costs by around 2.8 times for diverse model scales compared to the pre-training and fine-tuning paradigm, making it particularly suitable for efficient deployment across diverse downstream tasks.

[143] How to Train your Text-to-Image Model: Evaluating Design Choices for Synthetic Training Captions

Manuel Brack,Sudeep Katakol,Felix Friedrich,Patrick Schramowski,Hareesh Ravi,Kristian Kersting,Ajinkya Kale

Main category: cs.CV

TL;DR: This paper explores how different synthetic captioning strategies impact the performance of text-to-image models, revealing trade-offs between text alignment, output aesthetics, and diversity.

Details Motivation: The motivation is to understand the impact of synthetic caption design on model performance, as current literature lacks insights into optimal design choices despite their widespread use. Method: The authors systematically investigate various synthetic captioning strategies and evaluate their effects on downstream performance metrics such as text alignment, output aesthetics, and sample diversity. Result: Dense, high-quality captions improve text alignment but may reduce aesthetics and diversity, while captions of randomized lengths offer balanced improvements. Varying caption distributions also shift the bias of the trained model. Conclusion: Caption design significantly influences model performance in text-to-image generation, and thoughtful strategy selection is essential for achieving desired outcomes. Abstract: Training data is at the core of any successful text-to-image model. The quality and descriptiveness of image text are crucial to a model's performance. Given the noisiness and inconsistency in web-scraped datasets, recent works have shifted towards synthetic training captions. While this setup is generally believed to produce more capable models, current literature does not provide any insights into its design choices. This study closes this gap by systematically investigating how different synthetic captioning strategies impact the downstream performance of text-to-image models. Our experiments demonstrate that dense, high-quality captions enhance text alignment but may introduce trade-offs in output aesthetics and diversity. Conversely, captions of randomized lengths yield balanced improvements across aesthetics and alignment without compromising sample diversity. We also demonstrate that varying caption distributions introduce significant shifts in the output bias of a trained model. 
Our findings underscore the importance of caption design in achieving optimal model performance and provide practical insights for more effective training data strategies in text-to-image generation.

[144] DepthVanish: Optimizing Adversarial Interval Structures for Stereo-Depth-Invisible Patches

Yun Xing,Yue Cao,Nhat Chung,Jie Zhang,Ivor Tsang,Ming-Ming Cheng,Yang Liu,Lei Ma,Qing Guo

Main category: cs.CV

TL;DR: This paper introduces an effective method for attacking stereo depth estimation systems using adversarial patches with a striped structure, showing practical relevance for security assessments.

Details Motivation: The motivation behind this research is to identify vulnerabilities in stereo depth estimation systems used in autonomous driving and robotics, particularly when these systems are subjected to adversarial attacks. Method: The researchers developed a novel stereo depth attack that jointly optimizes both the striped structure and texture elements. They conducted extensive experimentation to analyze how variations of this novel structure influence the performance. Result: The study found that introducing regular intervals between repeated textures significantly enhances patch attack effectiveness. The generated adversarial patches successfully attacked state-of-the-art stereo depth estimation methods and commercial RGB-D cameras in real-world conditions. Conclusion: The research concludes that adversarial patches with a striped structure can effectively attack stereo depth estimation systems, including commercial RGB-D cameras, demonstrating their practical relevance for security assessment. Abstract: Stereo Depth estimation is a critical task in autonomous driving and robotics, where inaccuracies (such as misidentifying nearby objects as distant) can lead to dangerous situations. Adversarial attacks against stereo depth estimation can help reveal vulnerabilities before deployment. Previous work has shown that repeating optimized textures can effectively mislead stereo depth estimation in digital settings. However, our research reveals that these naively repeated texture structures perform poorly in physical-world implementations, i.e., when deployed as patches, limiting their practical utility for testing stereo depth estimation systems. In this work, for the first time, we discover that introducing regular intervals between repeated textures, creating a striped structure, significantly enhances the patch attack effectiveness. Through extensive experimentation, we analyze how variations of this novel structure influence the performance. 
Based on these insights, we develop a novel stereo depth attack that jointly optimizes both the striped structure and texture elements. Our generated adversarial patches can be inserted into any scenes and successfully attack state-of-the-art stereo depth estimation methods, i.e., RAFT-Stereo and STTR. Most critically, our patch can also attack commercial RGB-D cameras (Intel RealSense) in real-world conditions, demonstrating their practical relevance for security assessment of stereo systems.
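The striped structure described in the DepthVanish abstract can be pictured with a small Python sketch that tiles a texture horizontally and leaves a regular gap between repeats. This shows the layout only: the function name `striped_patch`, the `gap` and `fill` parameters, and the constant fill value are assumptions for illustration, and the joint optimization of structure and texture is not shown.

```python
from typing import List


def striped_patch(texture: List[List[int]], repeats: int, gap: int,
                  fill: int = 0) -> List[List[int]]:
    """Repeat `texture` horizontally with `gap` columns of `fill` between copies.

    `texture` is a list of pixel rows. The gap width and fill value are
    hypothetical knobs; in the paper the structure is optimized, not fixed.
    """
    if repeats < 1 or gap < 0:
        raise ValueError("repeats must be >= 1 and gap must be >= 0")
    rows = []
    for row in texture:
        out: List[int] = []
        for r in range(repeats):
            if r > 0:
                out.extend([fill] * gap)  # the interval between repeated textures
            out.extend(row)
        rows.append(out)
    return rows


# A 1x2 texture repeated three times with one-column gaps.
print(striped_patch([[1, 2]], repeats=3, gap=1))  # [[1, 2, 0, 1, 2, 0, 1, 2]]
```

Setting `gap=0` reproduces the naively repeated texture that the abstract reports as performing poorly in physical-world settings.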

[145] LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation

Tongtian Yue,Longteng Guo,Yepeng Tang,Zijia Zhao,Xinxin Zhu,Hua Huang,Jing Liu

Main category: cs.CV

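TL;DR: This paper proposes LaVi, a new large vision-language model that achieves efficient vision-language fusion through internal feature modulation, addressing the inefficiency of existing methods and performing strongly across multiple benchmarks.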

Details Motivation: Existing methods integrate vision and language inefficiently: they either disrupt the model's inherent structure or introduce a severe long-context computational burden, which severely limits scalability and efficiency. Method: A feature modulation mechanism conditioned on visual input is introduced inside the large language model (LLM), enabling seamless and efficient vision-language fusion. Result: LaVi performs strongly on 15 image and video benchmarks; compared with LLaVA-OV-7B, it reduces FLOPs by 94.0%, improves inference speed by 3.1 times, and cuts memory usage in half. Conclusion: LaVi is a scalable and practical solution for real-time multimodal reasoning that achieves state-of-the-art multimodal performance while significantly improving efficiency. Abstract: Despite the impressive advancements of Large Vision-Language Models (LVLMs), existing approaches suffer from a fundamental bottleneck: inefficient visual-language integration. Current methods either disrupt the model's inherent structure or introduce severe long-context computational burden, severely limiting scalability and efficiency. In this paper, we rethink multimodal integration and present LaVi, a novel LVLM that enables seamless and efficient vision-language fusion through internal feature modulation within the Large Language Models (LLMs). Unlike dominant LVLMs that rely on visual token concatenation, LaVi bypasses long-context expansion by introducing a lightweight and adaptive transformation, which incorporates visual context by injecting token-wise vision-conditioned deltas into the affine parameters of layer normalization. This mechanism directly modulates linguistic hidden states based on visual input, ensuring precise vision-language alignment while preserving the LLM's linguistic priors and drastically reducing computational costs. Extensive evaluations across 15 image and video benchmarks demonstrate that LaVi not only achieves state-of-the-art multimodal performance but also dramatically enhances efficiency. Compared to LLaVA-OV-7B, LaVi reduces FLOPs by 94.0%, improves inference speed by 3.1 times, and cuts memory usage in half - establishing LaVi as a scalable and practical solution for real-time multimodal reasoning. The code and models will be released soon.
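As a rough illustration of what the LaVi abstract calls injecting vision-conditioned deltas into the affine parameters of layer normalization, the Python sketch below adds per-token offsets to the scale and shift of a standard layer norm. It is a minimal sketch under stated assumptions: the names `modulated_layer_norm`, `d_gamma`, and `d_beta` are hypothetical, and how the deltas are computed from visual features is not specified here, so they are passed in as plain inputs.

```python
import math
from typing import List


def layer_norm(x: List[float], gamma: List[float], beta: List[float],
               eps: float = 1e-5) -> List[float]:
    """Standard layer normalization over one token's hidden vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


def modulated_layer_norm(x: List[float], gamma: List[float], beta: List[float],
                         d_gamma: List[float], d_beta: List[float],
                         eps: float = 1e-5) -> List[float]:
    """Layer norm whose scale and shift are offset by token-wise deltas.

    `d_gamma` and `d_beta` stand in for vision-conditioned deltas; in a real
    model they would come from a learned function of the visual features.
    """
    scale = [g + dg for g, dg in zip(gamma, d_gamma)]
    shift = [b + db for b, db in zip(beta, d_beta)]
    return layer_norm(x, scale, shift, eps)
```

With all deltas set to zero the function reduces to the unmodified layer norm, which is one way to read the claim that the LLM's linguistic priors are preserved; no extra visual tokens are appended to the sequence.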

[146] Language-driven Description Generation and Common Sense Reasoning for Video Action Recognition

Xiaodan Hu,Chuhang Zou,Suchen Wang,Jaechul Kim,Narendra Ahuja

Main category: cs.CV

TL;DR: This paper proposes a framework leveraging language-driven common sense priors to enhance video action recognition, particularly for cluttered and occluded scenes.

Details Motivation: Current video action recognition methods do not fully exploit the rich common sense priors available in language models, which are crucial for understanding scene contexts like objects, human-object interactions, and activities. Method: The authors propose a framework with three components: (1) a video context summary component to generate candidate objects and activities; (2) a description generation module using auxiliary prompts and common sense reasoning; and (3) a multi-modal activity recognition head combining visual and textual cues. Result: The proposed approach demonstrates effectiveness on the Action Genome and Charades datasets, showing improved performance in recognizing complex and occluded video actions. Conclusion: The paper concludes that incorporating language-driven common sense priors enhances the recognition of cluttered video action sequences, particularly in occluded scenarios. Abstract: Recent video action recognition methods have shown excellent performance by adapting large-scale pre-trained language-image models to the video domain. However, language models contain rich common sense priors - the scene contexts that humans use to constitute an understanding of objects, human-object interactions, and activities - that have not been fully exploited. In this paper, we introduce a framework incorporating language-driven common sense priors to identify cluttered video action sequences from monocular views that are often heavily occluded. We propose: (1) A video context summary component that generates candidate objects, activities, and the interactions between objects and activities; (2) A description generation module that describes the current scene given the context and infers subsequent activities, through auxiliary prompts and common sense reasoning; (3) A multi-modal activity recognition head that combines visual and textual cues to recognize video actions. 
We demonstrate the effectiveness of our approach on the challenging Action Genome and Charades datasets.

[147] Few-Shot Generalized Category Discovery With Retrieval-Guided Decision Boundary Enhancement

Yunhan Ren,Feng Luo,Siyu Huang

Main category: cs.CV

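TL;DR: This paper proposes a decision boundary enhancement framework for the Few-shot Generalized Category Discovery (FSGCD) task. It uses labeled samples and affinity-retrieved pseudo-labeled samples to improve decision boundary learning for known and unknown categories, and outperforms existing methods on six public GCD benchmarks.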

Details Motivation: Existing Generalized Category Discovery (GCD) models have achieved notable success, but their performance with limited labeled samples and few known categories remains largely unexplored. This work introduces the Few-shot Generalized Category Discovery (FSGCD) task, which aims for competitive GCD performance when known information is scarce. Method: First, a decision boundary pre-training module mitigates the overfitting of pre-trained information on known category boundaries and uses labeled samples to improve the learning of these boundaries. Second, a two-stage retrieval-guided decision boundary optimization strategy is applied: it further enhances the severely limited known boundaries with affinity-retrieved pseudo-labeled samples, then applies the refined boundaries to unknown clusters under the guidance of affinity-based feature retrieval. Result: Experiments show that the proposed decision boundary enhancement framework, with affinity-based retrieval, effectively improves performance on the FSGCD task. Conclusion: The proposed method outperforms existing methods on six public GCD benchmarks under the FSGCD setting. Abstract: While existing Generalized Category Discovery (GCD) models have achieved significant success, their performance with limited labeled samples and a small number of known categories remains largely unexplored. In this work, we introduce the task of Few-shot Generalized Category Discovery (FSGCD), aiming to achieve competitive performance in GCD tasks under conditions of known information scarcity. To tackle this challenge, we propose a decision boundary enhancement framework with affinity-based retrieval. Our framework is designed to learn the decision boundaries of known categories and transfer these boundaries to unknown categories. First, we use a decision boundary pre-training module to mitigate the overfitting of pre-trained information on known category boundaries and improve the learning of these decision boundaries using labeled samples. Second, we implement a two-stage retrieval-guided decision boundary optimization strategy. Specifically, this strategy further enhances the severely limited known boundaries by using affinity-retrieved pseudo-labeled samples. Then, these refined boundaries are applied to unknown clusters via guidance from affinity-based feature retrieval. Experimental results demonstrate that our proposed method outperforms existing methods on six public GCD benchmarks under the FSGCD setting. The codes are available at: https://github.com/Ryh1218/FSGCD
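One plausible reading of "affinity-retrieved pseudo-labeled samples" in the FSGCD abstract is a nearest-neighbour lookup in feature space. The Python sketch below assumes cosine similarity as the affinity and assigns each labeled sample's class to its top-k most similar unlabeled features. The function names, the choice of cosine similarity, and the tie-breaking rule are assumptions for illustration and are not taken from the FSGCD code.

```python
import math
from typing import Dict, Hashable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_pseudo_labels(labeled: List[Tuple[Sequence[float], Hashable]],
                           unlabeled: List[Sequence[float]],
                           k: int = 1) -> Dict[int, Hashable]:
    """Map unlabeled indices to pseudo-labels via top-k affinity retrieval.

    If two labeled samples retrieve the same unlabeled feature, the one with
    the higher similarity wins (an assumed tie-breaking rule).
    """
    pseudo: Dict[int, Hashable] = {}
    best_score: Dict[int, float] = {}
    for feature, label in labeled:
        scores = [cosine(feature, u) for u in unlabeled]
        ranked = sorted(range(len(unlabeled)), key=lambda i: scores[i], reverse=True)
        for i in ranked[:k]:
            if i not in best_score or scores[i] > best_score[i]:
                best_score[i] = scores[i]
                pseudo[i] = label
    return pseudo
```

In a few-shot setting the retrieved samples would then be added to the scarce labeled set when refining the known-category boundaries.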

[148] TeSG: Textual Semantic Guidance for Infrared and Visible Image Fusion

Mingrui Zhu,Xiru Chen,Xin Wei,Nannan Wang,Xinbo Gao

Main category: cs.CV

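TL;DR: This paper proposes TeSG, an infrared and visible image fusion method with textual semantic guidance. It generates mask and text semantics and applies attention in stages to improve fusion, performing well on multiple tasks.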

Details Motivation: Existing IVF methods do not fully integrate and exploit textual semantic information, even though text guidance is flexible and versatile, so a more effective fusion strategy is needed. Method: Textual semantics are introduced at two levels, mask semantics and text semantics, and the TeSG framework performs multi-stage fusion with its SIG, MGCA, and TDAF modules. Result: Experiments show that the method is more competitive than other state-of-the-art methods on downstream tasks such as detection and segmentation. Conclusion: TeSG effectively integrates textual semantic information into infrared and visible image fusion and, through its three core components, optimizes the fused output for downstream tasks. Abstract: Infrared and visible image fusion (IVF) aims to combine complementary information from both image modalities, producing more informative and comprehensive outputs. Recently, text-guided IVF has shown great potential due to its flexibility and versatility. However, the effective integration and utilization of textual semantic information remains insufficiently studied. To tackle these challenges, we introduce textual semantics at two levels: the mask semantic level and the text semantic level, both derived from textual descriptions extracted by large Vision-Language Models (VLMs). Building on this, we propose Textual Semantic Guidance for infrared and visible image fusion, termed TeSG, which guides the image synthesis process in a way that is optimized for downstream tasks such as detection and segmentation. Specifically, TeSG consists of three core components: a Semantic Information Generator (SIG), a Mask-Guided Cross-Attention (MGCA) module, and a Text-Driven Attentional Fusion (TDAF) module. The SIG generates mask and text semantics based on textual descriptions. The MGCA module performs initial attention-based fusion of visual features from both infrared and visible images, guided by mask semantics. Finally, the TDAF module refines the fusion process with gated attention driven by text semantics. Extensive experiments demonstrate the competitiveness of our approach, particularly in terms of performance on downstream tasks, compared to existing state-of-the-art methods.

[149] 3DeepRep: 3D Deep Low-rank Tensor Representation for Hyperspectral Image Inpainting

Yunshan Li,Wenwu Gong,Qianqian Wang,Chao Wang,Lili Yang

Main category: cs.CV

TL;DR: This paper proposes a 3-directional deep low-rank tensor representation model for hyperspectral image inpainting that exploits low-rank structures across all tensor modes, leading to improved performance over existing methods.

Details Motivation: Existing transform-based TNN approaches focus only on the spectral mode, neglecting low-rank properties in other tensor modes, which limits their performance in HSI inpainting. Method: A 3-directional deep low-rank tensor representation (3DeepRep) model that applies deep nonlinear transforms along all three modes of the HSI tensor and minimizes nuclear norms in latent space for regularization. Result: Experiments show that the method outperforms current state-of-the-art techniques in both qualitative and quantitative evaluations on real-world HSI datasets. Conclusion: The proposed 3DeepRep model achieves superior HSI inpainting performance compared to existing state-of-the-art methods. Abstract: Recent approaches based on transform-based tensor nuclear norm (TNN) have demonstrated notable effectiveness in hyperspectral image (HSI) inpainting by leveraging low-rank structures in latent representations. Recent developments incorporate deep transforms to improve low-rank tensor representation; however, existing approaches typically restrict the transform to the spectral mode, neglecting low-rank properties along other tensor modes. In this paper, we propose a novel 3-directional deep low-rank tensor representation (3DeepRep) model, which performs deep nonlinear transforms along all three modes of the HSI tensor. To enforce low-rankness, the model minimizes the nuclear norms of mode-i frontal slices in the corresponding latent space for each direction (i=1,2,3), forming a 3-directional TNN regularization. The outputs from the three directional branches are subsequently fused via a learnable aggregation module to produce the final result. An efficient gradient-based optimization algorithm is developed to solve the model in a self-supervised manner. 
Extensive experiments on real-world HSI datasets demonstrate that the proposed method achieves superior inpainting performance compared to existing state-of-the-art techniques, both qualitatively and quantitatively.

[150] Cross-modal Offset-guided Dynamic Alignment and Fusion for Weakly Aligned UAV Object Detection

Liu Zongzhen,Luo Hui,Wang Zhixing,Wei Yuxing,Zuo Haorui,Zhang Jianlin

Main category: cs.CV

TL;DR: This paper proposes CoDAF, a framework for UAV-based object detection that addresses spatial misalignment and modality conflict through dynamic alignment and adaptive fusion, achieving strong results on the DroneVehicle dataset.

Details Motivation: Spatial misalignment between RGB and IR imagery due to UAV platform motion and asynchronous imaging leads to weak alignment, creating challenges of semantic inconsistency and modality conflict. Method: Cross-modal Offset-guided Dynamic Alignment and Fusion (CoDAF), which includes Offset-guided Semantic Alignment (OSA) and Dynamic Attention-guided Fusion Module (DAFM). Result: Experiments on standard benchmarks validate the effectiveness of CoDAF, achieving a mAP of 78.6% on the DroneVehicle dataset. Conclusion: CoDAF enables robust UAV object detection by integrating alignment and fusion in a unified design. Abstract: Unmanned aerial vehicle (UAV) object detection plays a vital role in applications such as environmental monitoring and urban security. To improve robustness, recent studies have explored multimodal detection by fusing visible (RGB) and infrared (IR) imagery. However, due to UAV platform motion and asynchronous imaging, spatial misalignment frequently occurs between modalities, leading to weak alignment. This introduces two major challenges: semantic inconsistency at corresponding spatial locations and modality conflict during feature fusion. Existing methods often address these issues in isolation, limiting their effectiveness. In this paper, we propose Cross-modal Offset-guided Dynamic Alignment and Fusion (CoDAF), a unified framework that jointly tackles both challenges in weakly aligned UAV-based object detection. CoDAF comprises two novel modules: the Offset-guided Semantic Alignment (OSA), which estimates attention-based spatial offsets and uses deformable convolution guided by a shared semantic space to align features more precisely; and the Dynamic Attention-guided Fusion Module (DAFM), which adaptively balances modality contributions through gating and refines fused features via spatial-channel dual attention. By integrating alignment and fusion in a unified design, CoDAF enables robust UAV object detection. 
Experiments on standard benchmarks validate the effectiveness of our approach, with CoDAF achieving a mAP of 78.6% on the DroneVehicle dataset.
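The gating idea behind the DAFM description in the CoDAF abstract (adaptively balancing modality contributions) can be sketched in a few lines of Python. This is a toy, element-wise version under stated assumptions: the function name `gated_fusion` and the scalar weights `w_rgb`, `w_ir`, and `bias` are hypothetical placeholders for learned parameters, and the offset-guided alignment and spatial-channel dual attention from the abstract are not shown.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(rgb: List[float], ir: List[float],
                 w_rgb: float = 1.0, w_ir: float = -1.0,
                 bias: float = 0.0) -> List[float]:
    """Fuse aligned RGB and IR features with a per-element gate in (0, 1).

    The gate g weights the RGB feature and (1 - g) weights the IR feature,
    so each fused value is a convex combination of the two inputs.
    """
    fused = []
    for r, i in zip(rgb, ir):
        g = sigmoid(w_rgb * r + w_ir * i + bias)
        fused.append(g * r + (1.0 - g) * i)
    return fused
```

With the default placeholder weights the gate leans toward whichever modality responds more strongly at a location, which is one simple way to reduce modality conflict during fusion.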

[151] Uncertainty-Aware Variational Information Pursuit for Interpretable Medical Image Analysis

Md Nahiduzzaman,Ruwan Tennakoon,Steven Korevaar,Zongyuan Ge,Alireza Bab-Hadiashar

Main category: cs.CV

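TL;DR: This paper proposes UAV-IP, a new framework for AI decision support in medical imaging that incorporates uncertainty quantification to provide more accurate and concise explanations.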

Details Motivation: Existing V-IP methods overlook instance-level uncertainties in query-answer generation, which can arise from model limitations (epistemic uncertainty) or variability in expert responses (aleatoric uncertainty). Method: The paper introduces Uncertainty-Aware V-IP (UAV-IP), a framework that integrates uncertainty quantification into the V-IP process, and evaluates it on four medical imaging datasets. Result: Across four medical imaging datasets, UAV-IP improves average AUC by approximately 3.2% and generates explanations that are 20% more concise than those of baseline V-IP. Conclusion: By integrating uncertainty quantification into the V-IP process, UAV-IP produces more concise explanations without sacrificing informativeness and improves AUC across multiple medical imaging datasets. Abstract: In medical imaging, AI decision-support systems must balance accuracy and interpretability to build user trust and support effective clinical decision-making. Recently, Variational Information Pursuit (V-IP) and its variants have emerged as interpretable-by-design modeling techniques, aiming to explain AI decisions in terms of human-understandable, clinically relevant concepts. However, existing V-IP methods overlook instance-level uncertainties in query-answer generation, which can arise from model limitations (epistemic uncertainty) or variability in expert responses (aleatoric uncertainty). This paper introduces Uncertainty-Aware V-IP (UAV-IP), a novel framework that integrates uncertainty quantification into the V-IP process. We evaluate UAV-IP across four medical imaging datasets, PH2, Derm7pt, BrEaST, and SkinCon, demonstrating an average AUC improvement of approximately 3.2% while generating 20% more concise explanations compared to baseline V-IP, without sacrificing informativeness. These findings highlight the importance of uncertainty-aware reasoning in interpretable-by-design models for robust and reliable medical decision-making.

[152] Noise-Informed Diffusion-Generated Image Detection with Anomaly Attention

Weinan Guan,Wei Wang,Bo Peng,Ziwen He,Jing Dong,Haonan Cheng

Main category: cs.CV

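TL;DR: By focusing on the noise patterns of diffusion-generated images, this paper proposes a new detection architecture, NASA-Swin, that addresses generalization to unseen diffusion models and achieves state-of-the-art detection performance.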

Details Motivation: Advances in diffusion models have produced high-quality synthetic images and raised information security concerns. A key challenge for forgery detection is generalizing to diffusion models not seen during training. Method: Image noise is used as the key feature. A new Noise-Aware Self-Attention (NASA) module is introduced and combined with Swin Transformer to form the NASA-Swin detection architecture. A cross-modality fusion embedding and a channel mask strategy further enhance feature learning. Result: Experiments show that NASA-Swin performs well in detecting diffusion-generated images, particularly for unseen generation methods. Conclusion: The paper presents a detection model built on the NASA module that achieves state-of-the-art performance on diffusion generation methods not seen during training. Abstract: With the rapid development of image generation technologies, especially the advancement of Diffusion Models, the quality of synthesized images has significantly improved, raising concerns among researchers about information security. To mitigate the malicious abuse of diffusion models, diffusion-generated image detection has proven to be an effective countermeasure. However, a key challenge for forgery detection is generalising to diffusion models not seen during training. In this paper, we address this problem by focusing on image noise. We observe that images from different diffusion models share similar noise patterns, distinct from genuine images. Building upon this insight, we introduce a novel Noise-Aware Self-Attention (NASA) module that focuses on noise regions to capture anomalous patterns. To implement a SOTA detection model, we incorporate NASA into Swin Transformer, forming a novel detection architecture NASA-Swin. Additionally, we employ a cross-modality fusion embedding to combine RGB and noise images, along with a channel mask strategy to enhance feature learning from both modalities. Extensive experiments demonstrate the effectiveness of our approach in enhancing detection capabilities for diffusion-generated images. When encountering unseen generation methods, our approach achieves the state-of-the-art performance. Our code is available at https://github.com/WeinanGuan/NASA-Swin.

Qi-Ying Sun,Wan-Lei Zhao,Yi-Bo Miao,Chong-Wah Ngo

Main category: cs.CV

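TL;DR: This paper models instance-level region discovery as hierarchical detection of compact feature subsets in self-supervised ViT features, producing an instance-level descriptor that outperforms state-of-the-art methods on three instance search benchmarks.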

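Details Motivation: Visual instance search remains challenging because effective instance-level feature representations are lacking, and supervised or weakly supervised object detectors perform poorly on unknown object categories. Method: Starting from the feature set output by a self-supervised ViT, instance-level region discovery is modeled as hierarchically detecting compact feature subsets; features derived from the nodes of the resulting hierarchy form the representation of the latent instances in an image. Result: Empirical studies on three instance search benchmarks show that the descriptor outperforms state-of-the-art methods considerably. Conclusion: The instance-level descriptor remains effective on both known and unknown object categories, and the hierarchical decomposition addresses object embedding and occlusion. Abstract: Despite the great success of the deep features in content-based image retrieval, the visual instance search remains challenging due to the lack of effective instance level feature representation. Supervised or weakly supervised object detection methods are not among the options due to their poor performance on the unknown object categories. In this paper, based on the feature set output from self-supervised ViT, the instance level region discovery is modeled as detecting the compact feature subsets in a hierarchical fashion. The hierarchical decomposition results in a hierarchy of feature subsets. The non-leaf nodes and leaf nodes on the hierarchy correspond to the various instance regions in an image of different semantic scales. The hierarchical decomposition well addresses the problem of object embedding and occlusions, which are widely observed in the real scenarios. The features derived from the nodes on the hierarchy make up a comprehensive representation for the latent instances in the image. Our instance-level descriptor remains effective on both the known and unknown object categories. Empirical studies on three instance search benchmarks show that it outperforms state-of-the-art methods considerably.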

[154] Infrared and Visible Image Fusion Based on Implicit Neural Representations

Shuchen Sun,Ligen Shi,Chang Liu,Lina Wu,Jun Qiu

Main category: cs.CV

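TL;DR: This paper proposes INRFuse, an infrared and visible image fusion method based on implicit neural representations (INR). A neural network parameterizes a continuous function that implicitly represents the multimodal image information; normalized spatial coordinates serve as input, and a multi-layer perceptron adaptively fuses the features of both modalities. Multiple loss functions jointly optimize the similarity between the fused image and the source images, preserving the thermal radiation information of the infrared image and the texture details of the visible image. Because INR is resolution-independent, images of different resolutions can be fused directly, and super-resolution reconstruction is achieved through high-density coordinate queries. Experiments show that INRFuse outperforms existing methods in both subjective and objective evaluation, producing fused images with clear structures, natural details, and rich information without requiring a training dataset.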

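Details Motivation: Infrared and visible image fusion aims to combine the strengths of both modalities to produce images that are rich in information and meet visual or computational requirements. Traditional methods rely on discrete pixels or explicit features and struggle to fully exploit multimodal information, so a method that overcomes these limits is needed. Method: The paper proposes INRFuse, which parameterizes a continuous function with a neural network to implicitly represent the multimodal information of the image. The normalized spatial coordinates of the infrared and visible images serve as input, a multi-layer perceptron adaptively fuses the features of both modalities, and the fused image is produced as output. Multiple loss functions jointly optimize the similarity between the fused image and the source images. The resolution-independent nature of INR is also used to fuse images of different resolutions directly and to perform super-resolution reconstruction. Result: Experiments show that INRFuse outperforms existing methods in both subjective visual quality and objective evaluation metrics. The fused images have clear structures, natural details, and rich information, and no training dataset is required. The method also supports direct fusion of images with different resolutions and super-resolution reconstruction. Conclusion: INRFuse is a new infrared and visible image fusion method whose implicit neural representation design moves beyond the limits of traditional approaches and significantly improves the quality and information completeness of fused images. Abstract: Infrared and visible light image fusion aims to combine the strengths of both modalities to generate images that are rich in information and fulfill visual or computational requirements. This paper proposes an image fusion method based on Implicit Neural Representations (INR), referred to as INRFuse. This method parameterizes a continuous function through a neural network to implicitly represent the multimodal information of the image, breaking through the traditional reliance on discrete pixels or explicit features. The normalized spatial coordinates of the infrared and visible light images serve as inputs, and multi-layer perceptrons are utilized to adaptively fuse the features of both modalities, resulting in the output of the fused image. By designing multiple loss functions, the method jointly optimizes the similarity between the fused image and the original images, effectively preserving the thermal radiation information of the infrared image while maintaining the texture details of the visible light image. Furthermore, the resolution-independent characteristic of INR allows for the direct fusion of images with varying resolutions and achieves super-resolution reconstruction through high-density coordinate queries. 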
Experimental results indicate that INRFuse outperforms existing methods in both subjective visual quality and objective evaluation metrics, producing fused images with clear structures, natural details, and rich information without the necessity for a training dataset.
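The resolution independence mentioned in the INRFuse abstract follows from treating the image as a function of normalized coordinates. The Python sketch below illustrates only that property: a tiny one-hidden-layer network with fixed, untrained placeholder weights is queried on grids of different density. The names `coord_grid`, `tiny_mlp`, and `render`, the tanh activation, and the weights used in any call are assumptions for illustration; the fusion losses and the actual INRFuse architecture are not shown.

```python
import math
from typing import List, Sequence, Tuple

Params = Tuple[List[Tuple[float, float]], List[float], List[float], float]


def coord_grid(height: int, width: int) -> List[List[Tuple[float, float]]]:
    """Normalized (y, x) coordinates in [-1, 1] for a height x width grid."""
    def axis(n: int) -> List[float]:
        return [0.0] if n == 1 else [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return [[(y, x) for x in axis(width)] for y in axis(height)]


def tiny_mlp(coord: Sequence[float], w1: List[Tuple[float, float]],
             b1: List[float], w2: List[float], b2: float) -> float:
    """One hidden tanh layer mapping a (y, x) coordinate to an intensity."""
    hidden = [math.tanh(wy * coord[0] + wx * coord[1] + b)
              for (wy, wx), b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def render(height: int, width: int, params: Params) -> List[List[float]]:
    """Query the continuous function on a grid of any chosen density."""
    return [[tiny_mlp(c, *params) for c in row]
            for row in coord_grid(height, width)]
```

Rendering the same parameters at 2x2 and at 3x3 samples one underlying function at different densities, so the corner values coincide; querying a denser grid than the input resolution is the mechanism the abstract describes for super-resolution.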

[155] PQCAD-DM: Progressive Quantization and Calibration-Assisted Distillation for Extremely Efficient Diffusion Model

Beomseok Ko,Hyeryung Jang

Main category: cs.CV

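TL;DR: This study proposes PQCAD-DM, an efficient compression framework for diffusion models that balances computational efficiency and generative quality, substantially speeding up inference while maintaining strong performance.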

Details Motivation: Diffusion models excel at image generation, but their reliance on iterative Markov chain processes makes them computationally and resource-intensive, leads to error accumulation, and limits the effectiveness of naive compression techniques, so a new compression approach is needed. Method: The paper proposes PQCAD-DM, a framework combining Progressive Quantization (PQ) and Calibration-Assisted Distillation (CAD). PQ uses a two-stage quantization guided by a momentum-based mechanism to reduce weight perturbations at low precision; CAD uses full-precision calibration datasets during distillation so that the student can match full-precision performance even with a quantized teacher. Result: Experiments across multiple datasets confirm PQCAD-DM's generative quality and efficiency, and it outperforms fixed-bit quantization methods. Conclusion: PQCAD-DM balances computational efficiency and generative quality, halving inference time while maintaining competitive performance. Abstract: Diffusion models excel in image generation but are computationally and resource-intensive due to their reliance on iterative Markov chain processes, leading to error accumulation and limiting the effectiveness of naive compression techniques. In this paper, we propose PQCAD-DM, a novel hybrid compression framework combining Progressive Quantization (PQ) and Calibration-Assisted Distillation (CAD) to address these challenges. PQ employs a two-stage quantization with adaptive bit-width transitions guided by a momentum-based mechanism, reducing excessive weight perturbations in low-precision. CAD leverages full-precision calibration datasets during distillation, enabling the student to match full-precision performance even with a quantized teacher. As a result, PQCAD-DM achieves a balance between computational efficiency and generative quality, halving inference time while maintaining competitive performance. Extensive experiments validate PQCAD-DM's superior generative capabilities and efficiency across diverse datasets, outperforming fixed-bit quantization methods.
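To see why low bit widths cause the weight perturbations that PQCAD-DM aims to reduce, the Python sketch below applies plain symmetric uniform quantization at several bit widths and measures the resulting error. This is a generic textbook quantizer used as a stand-in: the function names are hypothetical, and the two-stage schedule, momentum-based bit-width transitions, and distillation from the paper are not implemented here.

```python
from typing import List


def quantize(weights: List[float], bits: int) -> List[float]:
    """Symmetric uniform quantization to `bits` bits, then de-quantization."""
    if bits < 2:
        raise ValueError("bits must be at least 2 for a signed symmetric grid")
    levels = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / levels              # width of one quantization step
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def perturbation(weights: List[float], bits: int) -> float:
    """Average weight perturbation introduced by quantizing to `bits` bits."""
    return mean_abs_error(weights, quantize(weights, bits))
```

For a small weight vector such as `[0.9, -0.35, 0.12, 0.5]`, `perturbation` grows as the bit width drops from 8 to 4 to 2, which is the effect a progressive schedule tries to soften by not jumping to the lowest precision in one step.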

[156] Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs

Haoran Sun,Yankai Jiang,Wenjie Lou,Yujie Zhang,Wenjie Li,Lilong Wang,Mianxin Liu,Lei Liu,Xiaosong Wang

Main category: cs.CV

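TL;DR: This paper proposes MICS, a new reasoning-path search method for medical multimodal large language models, and builds an accompanying dataset and model to improve medical reasoning.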

Details Motivation: Although multimodal large language models perform well on general tasks, their application in the medical domain is still at an early stage, and approaches for constructing chain-of-thought training data are lacking. Method: The paper proposes Mentor-Intern Collaborative Search (MICS): mentor models initialize the reasoning steps, intern models continue along those paths, and the best path is selected according to the overall performance of multiple intern models. An MICS-Score is introduced to assess the quality of reasoning paths. Result: The authors construct the MMRP dataset and the Chiron-o1 model; experiments show that Chiron-o1 achieves state-of-the-art performance on multiple medical visual question answering and reasoning benchmarks. Conclusion: MICS provides an effective reasoning-path search framework for medical multimodal large language models and significantly improves their reasoning ability and performance. Abstract: Multimodal large language models (MLLMs) have begun to demonstrate robust reasoning capabilities on general tasks, yet their application in the medical domain remains in its early stages. Constructing chain-of-thought (CoT) training data is essential for bolstering the reasoning abilities of medical MLLMs. However, existing approaches exhibit a deficiency in offering a comprehensive framework for searching and evaluating effective reasoning paths towards critical diagnosis. To address this challenge, we propose Mentor-Intern Collaborative Search (MICS), a novel reasoning-path searching scheme to generate rigorous and effective medical CoT data. MICS first leverages mentor models to initialize the reasoning, one step at a time, then prompts each intern model to continue the thinking along those initiated paths, and finally selects the optimal reasoning path according to the overall reasoning performance of multiple intern models. The reasoning performance is determined by an MICS-Score, which assesses the quality of generated reasoning paths. Eventually, we construct MMRP, a multi-task medical reasoning dataset with ranked difficulty, and Chiron-o1, a new medical MLLM devised via a curriculum learning strategy, with robust visual question-answering and generalizable reasoning capabilities. Extensive experiments demonstrate that Chiron-o1, trained on our CoT dataset constructed using MICS, achieves state-of-the-art performance across a list of medical visual question answering and reasoning benchmarks. 
Code is available in the GitHub repository manglu097/Chiron-o1.
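The mentor-intern loop described in the MICS abstract can be sketched as one search step in Python: each mentor proposes a next reasoning step, every intern continues from that step to a final answer, and the candidate whose continuations succeed most often is kept. This is a hedged sketch, not the released implementation: the models are plain callables, the function name `mics_step` is hypothetical, and the score used here (fraction of interns reaching the correct answer) is an assumed stand-in for the MICS-Score, whose exact definition the abstract does not give.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Mentor = Callable[[List[str]], str]   # proposes the next reasoning step
Intern = Callable[[List[str]], str]   # continues a path to a final answer


def mics_step(prefix: List[str], mentors: Sequence[Mentor],
              interns: Sequence[Intern],
              is_correct: Callable[[str], bool]) -> Tuple[Optional[List[str]], float]:
    """Extend `prefix` by one step, choosing the mentor step interns do best on."""
    if not interns:
        raise ValueError("at least one intern model is required")
    best_path: Optional[List[str]] = None
    best_score = -1.0
    for mentor in mentors:
        candidate = prefix + [mentor(prefix)]
        successes = sum(1 for intern in interns if is_correct(intern(candidate)))
        score = successes / len(interns)   # assumed proxy for the MICS-Score
        if score > best_score:
            best_path, best_score = candidate, score
    return best_path, best_score
```

Calling `mics_step` repeatedly, feeding each returned path back in as the new prefix, would build a reasoning path one step at a time, matching the "one step at a time" wording of the abstract.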

[157] TextBraTS: Text-Guided Volumetric Brain Tumor Segmentation with Innovative Dataset Development and Fusion Module Exploration

Xiaoyu Shi,Rahul Kumar Jain,Yinhao Li,Ruibo Hou,Jingliang Cheng,Jie Bai,Guohua Zhao,Lanfen Lin,Rui Xu,Yen-wei Chen

Main category: cs.CV

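TL;DR: This study introduces TextBraTS, the first publicly available multimodal dataset that pairs MRI volumes with rich textual annotations for brain tumor segmentation.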

Details Motivation: Although integrating textual reports with visual data has improved segmentation accuracy in other medical imaging domains, brain tumor analysis lacks a comprehensive dataset that combines radiological images with corresponding textual annotations. Method: The TextBraTS dataset is built from the BraTS2020 benchmark, and a new baseline framework with a sequential cross-attention method is proposed for text-guided volumetric medical image segmentation. Result: Extensive experiments show that the text-image fusion strategy significantly improves brain tumor segmentation accuracy. Conclusion: The TextBraTS dataset and the proposed text-guided segmentation method open new possibilities for multimodal medical image analysis. Abstract: Deep learning has demonstrated remarkable success in medical image segmentation and computer-aided diagnosis. In particular, numerous advanced methods have achieved state-of-the-art performance in brain tumor segmentation from MRI scans. While recent studies in other medical imaging domains have revealed that integrating textual reports with visual data can enhance segmentation accuracy, the field of brain tumor analysis lacks a comprehensive dataset that combines radiological images with corresponding textual annotations. This limitation has hindered the exploration of multimodal approaches that leverage both imaging and textual data. To bridge this critical gap, we introduce the TextBraTS dataset, the first publicly available volume-level multimodal dataset that contains paired MRI volumes and rich textual annotations, derived from the widely adopted BraTS2020 benchmark. Building upon this novel dataset, we propose a novel baseline framework and sequential cross-attention method for text-guided volumetric medical image segmentation. Through extensive experiments with various text-image fusion strategies and templated text formulations, our approach demonstrates significant improvements in brain tumor segmentation accuracy, offering valuable insights into effective multimodal integration techniques. Our dataset, implementation code, and pre-trained models are publicly available at https://github.com/Jupitern52/TextBraTS.

[158] MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation

Shoubin Yu,Yue Zhang,Ziyang Wang,Jaehong Yoon,Mohit Bansal

Main category: cs.CV

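TL;DR: This paper proposes MEXA, a training-free multimodal reasoning framework that dynamically selects and aggregates expert models to reason effectively across domains, applying to a variety of complex tasks with strong performance.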

Details Motivation: Multimodal reasoning benefits from combining pre-trained expert models, but building a unified framework is challenging because of the growing diversity of input modalities and task complexity. For example, medical diagnosis and financial forecasting rely on tabular data and plot-based data, respectively. Method: The paper introduces MEXA, a training-free framework that dynamically selects expert models according to the input modality and task requirements and uses a Large Reasoning Model (LRM) to aggregate and reason over their outputs. Result: Extensive evaluation on multimodal benchmarks covering video reasoning, audio reasoning, 3D understanding, and medical QA shows that MEXA consistently outperforms strong multimodal baselines. Conclusion: MEXA is a training-free framework that achieves effective multimodal reasoning by aggregating multiple expert models, with broad applicability and clear performance gains. Abstract: Combining pre-trained expert models offers substantial potential for scalable multimodal reasoning, but building a unified framework remains challenging due to the increasing diversity of input modalities and task complexity. For instance, medical diagnosis requires precise reasoning over structured clinical tables, while financial forecasting depends on interpreting plot-based data to make informed predictions. To tackle this challenge, we introduce MEXA, a training-free framework that performs modality- and task-aware aggregation of multiple expert models to enable effective multimodal reasoning across diverse and distinct domains. MEXA dynamically selects expert models based on the input modality and the task-specific reasoning demands (i.e., skills). Each expert model, specialized in a modality-task pair, generates interpretable textual reasoning outputs. MEXA then aggregates and reasons over these outputs using a Large Reasoning Model (LRM) to produce the final answer. This modular design allows flexible and transparent multimodal reasoning across diverse domains without additional training overhead. We extensively evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA. MEXA consistently delivers performance improvements over strong multimodal baselines, highlighting the effectiveness and broad applicability of our expert-driven selection and aggregation in diverse multimodal reasoning tasks.
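The select-then-aggregate flow in the MEXA abstract can be sketched in Python as a registry lookup followed by a single aggregation call. Everything concrete here is an assumption for illustration: experts are plain callables keyed by a (modality, skill) pair, the `aggregator` callable stands in for the Large Reasoning Model, and the function name `answer` is hypothetical rather than part of any released MEXA API.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Expert = Callable[[str], str]                      # returns a textual report
Registry = Dict[Tuple[str, str], Expert]           # keyed by (modality, skill)
Aggregator = Callable[[str, List[str]], str]       # stand-in for the LRM


def answer(question: str, inputs: Dict[str, str], skills: Sequence[str],
           registry: Registry, aggregator: Aggregator) -> str:
    """Run every expert matching an input modality and a required skill,
    then let the aggregator reason over the collected textual reports."""
    reports: List[str] = []
    for modality, data in inputs.items():
        for skill in skills:
            expert = registry.get((modality, skill))
            if expert is not None:                 # skip unmatched pairs
                reports.append(expert(data))
    return aggregator(question, reports)
```

Because experts only exchange text with the aggregator, adding a new modality amounts to registering another callable, which is consistent with the abstract's point that no additional training is needed.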

[159] RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought

Junbo Qiao,Miaomiao Cai,Wei Li,Yutong Liu,Xudong Huang,Gaoqi He,Jiao Xie,Jie Hu,Xinghao Chen,Shaohui Lin

Main category: cs.CV

TL;DR: RealSR-R1 introduces the VLCoT framework and GRPO optimization to enhance Real-World Image Super-Resolution, achieving better fidelity and naturalness in restored images.

Details Motivation: The motivation stems from the challenges faced by existing methods in Real-World Image Super-Resolution, which struggle with accurate understanding of degraded image content, leading to low-fidelity and unnatural reconstructed results. Method: The method involves the introduction of the VLCoT framework for integrating vision and language reasoning, combined with Group Relative Policy Optimization (GRPO) that uses four reward functions: Format reward, Degradation reward, Understanding reward, and Generation reward involving a visual expert model. Result: Extensive experiments show that RealSR-R1 can effectively handle real-world scenarios, restoring image details progressively while generating more realistic images compared to traditional approaches. Conclusion: The proposed RealSR-R1 model demonstrates the ability to generate realistic details and accurately understand image content, especially in semantically rich scenes or images with severe degradation. Abstract: Real-World Image Super-Resolution is one of the most challenging tasks in image restoration. However, existing methods struggle with an accurate understanding of degraded image content, leading to reconstructed results that are both low-fidelity and unnatural. We present RealSR-R1 in this work, which empowers the RealSR models with understanding and reasoning capabilities. Inspired by the success of Chain of Thought (CoT) in large language models (LLMs), we simulate the human process of handling degraded images and propose the VLCoT framework, which integrates vision and language reasoning. The framework aims to precisely restore image details by progressively generating more comprehensive text and higher-resolution images. To overcome the challenge of traditional supervised learning CoT failing to generalize to real-world scenarios, we introduce, for the first time, Group Relative Policy Optimization (GRPO) into the Real-World Image Super-Resolution task. 
We propose VLCoT-GRPO as a solution, which designs four reward functions: (1) Format reward, used to standardize the CoT process; (2) Degradation reward, to incentivize accurate degradation estimation; (3) Understanding reward, to ensure the accuracy of the generated content; and (4) Generation reward, where we propose using a visual expert model to evaluate the quality of generated images, encouraging the model to generate more realistic images. Extensive experiments demonstrate that our proposed RealSR-R1 can generate realistic details and accurately understand image content, particularly in semantically rich scenes or images with severe degradation.
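
To make the GRPO step concrete, the sketch below shows the group-relative advantage that gives the method its name: each sampled output is scored, and its advantage is its reward standardized against the other samples in the same group. The equal-weight sum of the four rewards, the dictionary keys, and the use of the population standard deviation are assumptions for illustration; the abstract does not say how the rewards are combined.

```python
from statistics import mean, pstdev
from typing import Dict, List


def total_reward(r: Dict[str, float]) -> float:
    """Assumed combination: unweighted sum of the four reward terms."""
    return r["format"] + r["degradation"] + r["understanding"] + r["generation"]


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each reward against its own sampling group.

    A sample better than the group average gets a positive advantage and is
    reinforced; a worse one gets a negative advantage. No value network is used.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]
```

When every sample in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.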

[160] Seeing What Matters: Generalizable AI-generated Video Detection with Forensic-Oriented Augmentation

Riccardo Corvi,Davide Cozzolino,Ekta Prashnani,Shalini De Mello,Koki Nagano,Luisa Verdoliva

Main category: cs.CV

TL;DR: The paper proposes a new method for improving the detection of AI-generated videos by focusing on low-level artifacts through a novel data augmentation strategy, achieving better performance with simpler techniques.

Details Motivation: The motivation is to overcome the poor generalization of existing video forensic detectors by focusing on identifying intrinsic low-level artifacts introduced by generative architectures. Method: The method involves studying generative architectures to identify discriminative features and introducing a data augmentation strategy based on wavelet decomposition to replace specific frequency-related bands. Result: The result shows that despite its simplicity, the proposed method achieves significant accuracy improvement over state-of-the-art detectors and performs well even on recent generative models like NOVA and FLUX. Conclusion: The paper concludes that their novel forensic-oriented data augmentation strategy improves the generalizability of AI-generated video detectors without needing complex algorithms or large datasets. Abstract: Synthetic video generation is progressing very rapidly. The latest models can produce very realistic high-resolution videos that are virtually indistinguishable from real ones. Although several video forensic detectors have been recently proposed, they often exhibit poor generalization, which limits their applicability in a real-world scenario. Our key insight to overcome this issue is to guide the detector towards seeing what really matters. In fact, a well-designed forensic classifier should focus on identifying intrinsic low-level artifacts introduced by a generative architecture rather than relying on high-level semantic flaws that characterize a specific model. In this work, first, we study different generative architectures, searching and identifying discriminative features that are unbiased, robust to impairments, and shared across models. Then, we introduce a novel forensic-oriented data augmentation strategy based on the wavelet decomposition and replace specific frequency-related bands to drive the model to exploit more relevant forensic cues. 
Our novel training paradigm improves the generalizability of AI-generated video detectors, without the need for complex algorithms and large datasets that include multiple synthetic generators. To evaluate our approach, we train the detector using data from a single generative model and test it against videos produced by a wide range of other models. Despite its simplicity, our method achieves a significant accuracy improvement over state-of-the-art detectors and obtains excellent results even on very recent generative models, such as NOVA and FLUX. Code and data will be made publicly available.
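
A wavelet-domain augmentation of this kind can be sketched with a one-level Haar transform: decompose a frame into four sub-bands, overwrite one band with the same band from another frame, and invert. The abstract does not state which bands are replaced or where the replacement comes from, so the choice of `"HH"`, the donor frame, and the function names below are assumptions. Sub-band naming conventions also vary between libraries.

```python
BANDS = ("LL", "LH", "HL", "HH")


def haar2d(img):
    """One-level orthonormal 2D Haar transform of an even-sized grayscale image."""
    h, w = len(img) // 2, len(img[0]) // 2
    out = {k: [[0.0] * w for _ in range(h)] for k in BANDS}
    for y in range(h):
        for x in range(w):
            a, b = img[2 * y][2 * x], img[2 * y][2 * x + 1]
            c, d = img[2 * y + 1][2 * x], img[2 * y + 1][2 * x + 1]
            out["LL"][y][x] = (a + b + c + d) / 2  # local average
            out["LH"][y][x] = (a + b - c - d) / 2  # top-vs-bottom difference
            out["HL"][y][x] = (a - b + c - d) / 2  # left-vs-right difference
            out["HH"][y][x] = (a - b - c + d) / 2  # diagonal difference
    return out


def ihaar2d(bands):
    """Inverse of haar2d."""
    h, w = len(bands["LL"]), len(bands["LL"][0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for y in range(h):
        for x in range(w):
            ll, lh, hl, hh = (bands[k][y][x] for k in BANDS)
            img[2 * y][2 * x] = (ll + lh + hl + hh) / 2
            img[2 * y][2 * x + 1] = (ll + lh - hl - hh) / 2
            img[2 * y + 1][2 * x] = (ll - lh + hl - hh) / 2
            img[2 * y + 1][2 * x + 1] = (ll - lh - hl + hh) / 2
    return img


def replace_band(img, donor, band="HH"):
    """Augment img by swapping in one wavelet sub-band taken from donor."""
    bands = haar2d(img)
    bands[band] = haar2d(donor)[band]
    return ihaar2d(bands)
```

The augmented frame keeps the coarse content of `img` while its chosen frequency band comes from elsewhere, which is one way to discourage a detector from relying on that band.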

[161] Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes

Chao Chen,Nobel Dang,Juexiao Zhang,Wenkai Sun,Pengfei Zheng,Xuhang He,Yimeng Ye,Taarun Srinivas,Chen Feng

Main category: cs.CV

TL;DR: This paper introduces the Co-VisiON benchmark to evaluate co-visibility reasoning in sparse image sets. Current vision models perform worse than humans, but the proposed Covis baseline improves performance among vision-only models.

Details Motivation: The motivation stems from the human ability to recognize co-visibility in sparse images and its importance in 3D vision and robotics. Despite progress in vision learning, it is unclear if current models match human proficiency in this area. Method: The authors introduced the Co-Visibility reasONing (Co-VisiON) benchmark for evaluating co-visibility reasoning across over 1000 indoor scenarios using sparse image sets. They tested existing vision models and proposed a novel multi-view baseline called Covis to improve performance. Result: Experiments showed that existing vision models face significant challenges in co-visibility analysis under sparse conditions. A proprietary vision-language model outperformed all vision-based approaches, but both types of models lag behind human performance. The proposed Covis baseline improved results among pure vision models. Conclusion: The study concludes that while current vision models struggle with co-visibility analysis under sparse conditions, a new multi-view baseline, Covis, achieves top performance among pure vision models and narrows the gap to proprietary vision-language models (VLMs). The work highlights the need for high-level reasoning in multi-view perception. Abstract: Humans exhibit a remarkable ability to recognize co-visibility (the overlapping regions visible in multiple images) even when these images are sparsely distributed across a complex scene. This capability is foundational in 3D vision and robotic perception. Despite significant progress in vision learning, it remains unclear whether current vision models have reached human-level proficiency in co-visibility analysis. In this work, we introduce the Co-Visibility reasONing (Co-VisiON) benchmark, designed to directly evaluate co-visibility reasoning on sparse image sets across over 1000 indoor scenarios. 
Our experiments reveal that while co-visibility is typically treated as a low-level feature matching task, it poses a significant challenge for existing vision models under sparse conditions. Notably, a proprietary vision-language model outperforms all purely vision-based approaches, with all models lagging substantially behind human performance. This gap underscores the need for more than basic pairwise vision processing: it calls for a comprehensive spatial understanding through high-level reasoning across multiple views. Inspired by human visual cognition, we propose a novel multi-view baseline, Covis, which achieves top performance among pure vision models and narrows the gap to the proprietary VLM. We hope our benchmark and findings will spur further advancements in developing vision models capable of robust, high-level reasoning in challenging, sparse environments. Our dataset and source code can be found at: https://ai4ce.github.io/CoVISION

[162] FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation

Fan Yang,Yousong Zhu,Xin Li,Yufei Zhan,Hongyin Zhao,Shurong Zheng,Yaowei Wang,Ming Tang,Jinqiao Wang

Main category: cs.CV

TL;DR: FOCUS is a unified Large Vision Language Model that combines visual understanding and image editing in an end-to-end framework, using a dual-branch encoder, MoVQGAN tokenizer, and progressive training to improve segmentation-aware perception and controllable image generation.

Details Motivation: Current approaches treat 'what to see' and 'how to edit' separately, often relying on multiple disjointed models. The authors aim to bridge this gap by integrating both functions into a single end-to-end framework. Method: The paper introduces FOCUS, which uses a dual-branch visual encoder and a MoVQGAN-based visual tokenizer. It also implements a progressive multi-stage training pipeline where segmentation masks are used as spatial condition prompts to guide the diffusion decoder. Result: Extensive experiments across three core tasks (multimodal understanding, referring segmentation accuracy, and controllable image generation) show that FOCUS achieves strong performance by jointly optimizing visual perception and generative capabilities. Conclusion: The paper concludes that FOCUS, a unified LVLM, effectively integrates segmentation-aware perception and controllable object-centric generation, achieving strong performance in visual understanding and generative modeling. Abstract: Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling, enabling both accurate content understanding and flexible editing. However, current approaches treat "what to see" and "how to edit" separately: they either perform isolated object segmentation or utilize segmentation masks merely as conditional prompts for local edit generation tasks, often relying on multiple disjointed models. To bridge these gaps, we introduce FOCUS, a unified LVLM that integrates segmentation-aware perception and controllable object-centric generation within an end-to-end framework. FOCUS employs a dual-branch visual encoder to simultaneously capture global semantic context and fine-grained spatial details. In addition, we leverage a MoVQGAN-based visual tokenizer to produce discrete visual tokens that enhance generation quality. 
To enable accurate and controllable image editing, we propose a progressive multi-stage training pipeline, where segmentation masks are jointly optimized and used as spatial condition prompts to guide the diffusion decoder. This strategy aligns visual encoding, segmentation, and generation modules, effectively bridging segmentation-aware perception with fine-grained visual synthesis. Extensive experiments across three core tasks, including multimodal understanding, referring segmentation accuracy, and controllable image generation, demonstrate that FOCUS achieves strong performance by jointly optimizing visual perception and generative capabilities.

[163] Loupe: A Generalizable and Adaptive Framework for Image Forgery Detection

Yuchu Jiang,Jiaming Chu,Jian Zhao,Xin Zhang,Xu Yang,Lei Jin,Chi Zhang,Xuelong Li

Main category: cs.CV

TL;DR: Loupe is a lightweight framework that improves deepfake detection and localization by combining patch-level classification and conditional query-based segmentation, achieving top performance in accuracy and robustness.

Details Motivation: Current deepfake detection methods face challenges in generalization across manipulation types and often rely on complex architectures, necessitating a simpler yet more robust solution. Method: Loupe integrates a patch-aware classifier and segmentation module with conditional queries to enable joint detection and localization. It also uses pseudo-label-guided test-time adaptation for robustness. Result: Loupe achieved first place in the IJCAI 2025 Deepfake Detection and Localization Challenge with an overall score of 0.846, showing improved classification accuracy and localization under diverse forgery patterns. Conclusion: Loupe is an effective framework for deepfake detection and localization, demonstrating state-of-the-art performance on the DDL dataset while maintaining lightweight design. Abstract: The proliferation of generative models has raised serious concerns about visual content forgery. Existing deepfake detection methods primarily target either image-level classification or pixel-wise localization. While some achieve high accuracy, they often suffer from limited generalization across manipulation types or rely on complex architectures. In this paper, we propose Loupe, a lightweight yet effective framework for joint deepfake detection and localization. Loupe integrates a patch-aware classifier and a segmentation module with conditional queries, allowing simultaneous global authenticity classification and fine-grained mask prediction. To enhance robustness against distribution shifts in the test set, Loupe introduces a pseudo-label-guided test-time adaptation mechanism by leveraging patch-level predictions to supervise the segmentation head. Extensive experiments on the DDL dataset demonstrate that Loupe achieves state-of-the-art performance, securing the first place in the IJCAI 2025 Deepfake Detection and Localization Challenge with an overall score of 0.846. 
Our results validate the effectiveness of the proposed patch-level fusion and conditional query design in improving both classification accuracy and spatial localization under diverse forgery patterns. The code is available at https://github.com/Kamichanw/Loupe.
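
One way to read "patch-level predictions supervise the segmentation head" is sketched below: confident patch probabilities are upsampled into a pixel-level pseudo-mask, uncertain patches are ignored, and the segmentation output is scored against that mask with binary cross-entropy. The confidence thresholds, nearest-neighbour upsampling, and function names are assumptions for illustration, not details taken from the Loupe code.

```python
import math


def patch_pseudo_mask(patch_probs, patch_size, lo=0.1, hi=0.9):
    """Upsample patch-level forgery probabilities into a pixel pseudo-mask.

    1 = confidently forged, 0 = confidently real, None = uncertain (ignored).
    """
    mask = []
    for row in patch_probs:
        labels = [1 if p >= hi else 0 if p <= lo else None for p in row]
        pixel_row = [lab for lab in labels for _ in range(patch_size)]
        mask.extend(list(pixel_row) for _ in range(patch_size))
    return mask


def masked_bce(seg_probs, pseudo, eps=1e-7):
    """Binary cross-entropy of segmentation outputs over confident pixels only."""
    total, count = 0.0, 0
    for prob_row, label_row in zip(seg_probs, pseudo):
        for p, lab in zip(prob_row, label_row):
            if lab is None:
                continue
            p = min(max(p, eps), 1 - eps)
            total -= math.log(p) if lab == 1 else math.log(1 - p)
            count += 1
    return total / count if count else 0.0
```

At test time, a loss like this could be minimized with respect to the segmentation head alone, so that the head adapts to the shifted test distribution without ground-truth masks.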

[164] Self-supervised Feature Extraction for Enhanced Ball Detection on Soccer Robots

Can Lin,Daniele Affinita,Marco E. P. Zimmatore,Daniele Nardi,Domenico D. Bloisi,Vincenzo Suriani

Main category: cs.CV

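TL;DR: This paper proposes a self-supervised learning framework for ball detection on autonomous humanoid soccer robots, using pseudo-labels and multi-task learning to reduce reliance on manual annotation while achieving higher detection performance.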

Details Motivation: Traditional supervised methods require extensive manual annotation, which is costly and time-consuming, so a self-supervised method that avoids manual annotation is needed to improve ball detection. Method: A self-supervised framework uses a pretrained model to generate pseudo-labels and combines colorization, edge detection, and triplet-loss tasks for feature extraction, with a MAML strategy for rapid adaptation to new scenarios. Result: The method is validated on a new dataset of 10,000 images from outdoor RoboCup SPL matches and outperforms baseline models on multiple metrics. Conclusion: Experiments show that the proposed framework outperforms baselines in accuracy, F1 score, and IoU, and converges faster. Abstract: Robust and accurate ball detection is a critical component for autonomous humanoid soccer robots, particularly in dynamic and challenging environments such as RoboCup outdoor fields. However, traditional supervised approaches require extensive manual annotation, which is costly and time-intensive. To overcome this problem, we present a self-supervised learning framework for domain-adaptive feature extraction to enhance ball detection performance. The proposed approach leverages a general-purpose pretrained model to generate pseudo-labels, which are then used in a suite of self-supervised pretext tasks -- including colorization, edge detection, and triplet loss -- to learn robust visual features without relying on manual annotations. Additionally, a model-agnostic meta-learning (MAML) strategy is incorporated to ensure rapid adaptation to new deployment scenarios with minimal supervision. A new dataset comprising 10,000 labeled images from outdoor RoboCup SPL matches is introduced, used to validate the method, and made available to the community. Experimental results demonstrate that the proposed pipeline outperforms baseline models in terms of accuracy, F1 score, and IoU, while also exhibiting faster convergence.
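
Of the pretext tasks listed, the triplet loss is the easiest to state exactly: it pulls an anchor embedding toward a positive example and pushes it away from a negative one until the gap exceeds a margin. The sketch below uses the standard Euclidean form; the margin value and the choice of which patches act as anchor, positive, and negative are assumptions, since the abstract does not specify them.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin: float = 1.0) -> float:
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, so already well-separated triplets stop contributing gradients.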

[165] AnyTraverse: An off-road traversability framework with VLM and human operator in the loop

Sattwik Sahu,Agamdeep Singh,Karthik Nambiar,Srikanth Saripalli,P. B. Sujit

Main category: cs.CV

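TL;DR: This paper presents AnyTraverse, a zero-shot framework that combines natural-language prompts with human-operator assistance to address off-road traversability for autonomous navigation of different robot types in complex environments.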

Details Motivation: Current frameworks struggle with the significant variation and uncertain scene changes in unstructured environments, and do not adapt to different robot types. Method: Uses natural-language prompting and a human-in-the-loop mechanism to segment scenes in complex environments, requesting human intervention only when an unknown scene or class is encountered. Result: Experimental validation includes tests on the RELLIS-3D, Freiburg Forest, and RUGD datasets and real-world demonstrations on multiple robot platforms. AnyTraverse outperforms GA-NAV and Off-seg while offering a vehicle-agnostic approach to off-road traversability that balances automation with targeted human supervision. Conclusion: AnyTraverse is a zero-shot framework that combines natural-language prompts with human-operator assistance to determine traversable regions for a variety of robotic vehicles. Abstract: Off-road traversability segmentation enables autonomous navigation with applications in search-and-rescue, military operations, wildlife exploration, and agriculture. Current frameworks struggle due to significant variations in unstructured environments and uncertain scene changes, and do not adapt to different robot types. We present AnyTraverse, a framework combining natural language-based prompts with human-operator assistance to determine navigable regions for diverse robotic vehicles. The system segments scenes for a given set of prompts and calls the operator only when encountering previously unexplored scenery or an unknown class that is not part of the prompt in its region-of-interest, thus reducing active supervision load while adapting to varying outdoor scenes. Our zero-shot learning approach eliminates the need for extensive data collection or retraining. Our experimental validation includes testing on RELLIS-3D, Freiburg Forest, and RUGD datasets and demonstrates real-world deployment on multiple robot platforms. The results show that AnyTraverse performs better than GA-NAV and Off-seg while offering a vehicle-agnostic approach to off-road traversability that balances automation with targeted human supervision.
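
The operator-call rule in the abstract has two triggers: the scene looks unlike anything seen before, or part of the region of interest is not explained by any prompt. A minimal sketch of such a gate follows. The scene embeddings, the cosine-similarity test, the coverage fraction, and both thresholds are hypothetical stand-ins, since the abstract does not describe how either condition is measured.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def needs_operator(
    scene_emb: Sequence[float],
    seen_embs: List[Sequence[float]],
    roi_prompt_coverage: float,
    sim_thresh: float = 0.8,
    cover_thresh: float = 0.9,
) -> bool:
    """Call the human only for unexplored scenery or an unexplained ROI."""
    # Unexplored: no previously seen scene is similar enough (true if history is empty).
    unexplored = all(cosine(scene_emb, e) < sim_thresh for e in seen_embs)
    # Unknown class: prompt masks leave too much of the region of interest unlabeled.
    unknown_in_roi = roi_prompt_coverage < cover_thresh
    return unexplored or unknown_in_roi
```

With an empty history the gate always fires, which matches the intuition that the operator sets up the first scene and is consulted less often as the history grows.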

[166] Camera Calibration via Circular Patterns: A Comprehensive Framework with Measurement Uncertainty and Unbiased Projection Model

Chaehyeon Song,Dongjae Lee,Jongwoo Lim,Ayoung Kim

Main category: cs.CV

TL;DR: This paper introduces an unbiased projection model for circular patterns in camera calibration, improving accuracy and robustness compared to checkerboard methods by incorporating centroid uncertainty and leveraging Markov random fields and Green theorem-based shape representation.

Details Motivation: The motivation stems from the limitations of existing projection models for circle centroids under lens distortion, which results in low performance despite the higher precision of circular patterns over checkerboards. Method: The method involves modeling boundary points of 2D shapes as a Markov random field and propagating shape distribution to centroid uncertainty using Green theorem-based representation. Uncertainty is also introduced in circular patterns to enhance calibration robustness. Result: The framework achieves significant improvements in calibration accuracy and robustness, with enhanced performance in pattern detection, optimization, and evaluation metrics. Conclusion: The proposed unbiased projection model for circular patterns enhances camera calibration accuracy and robustness, outperforming traditional checkerboard methods. Abstract: Camera calibration using planar targets has been widely favored, and two types of control points have been mainly considered as measurements: the corners of the checkerboard and the centroid of circles. Since a centroid is derived from numerous pixels, the circular pattern provides more precise measurements than the checkerboard. However, the existing projection model of circle centroids is biased under lens distortion, resulting in low performance. To surmount this limitation, we propose an unbiased projection model of the circular pattern and demonstrate its superior accuracy compared to the checkerboard. Complementing this, we introduce uncertainty into circular patterns to enhance calibration robustness and completeness. Defining centroid uncertainty improves the performance of calibration components, including pattern detection, optimization, and evaluation metrics. We also provide guidelines for performing good camera calibration based on the evaluation metric. 
The core concept of this approach is to model the boundary points of a two-dimensional shape as a Markov random field, considering its connectivity. The shape distribution is propagated to the centroid uncertainty through an appropriate shape representation based on the Green theorem. Consequently, the resulting framework achieves marked gains in calibration accuracy and robustness. The complete source code and demonstration video are available at https://github.com/chaehyeonsong/discocal.
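
Green's theorem is what lets a shape's area and centroid be computed from its boundary alone, which is why a boundary-point model can be propagated to the centroid. For a polygonal boundary the result is the familiar shoelace form, sketched below. This is the textbook formula rather than the authors' implementation, and the function name is illustrative.

```python
from typing import List, Sequence, Tuple


def polygon_centroid(points: Sequence[Tuple[float, float]]) -> Tuple[float, Tuple[float, float]]:
    """Signed area and centroid of a simple polygon from its boundary vertices.

    Discrete Green's theorem (shoelace) form:
      A  = 1/2      * sum(x_i * y_{i+1} - x_{i+1} * y_i)
      Cx = 1/(6A)   * sum((x_i + x_{i+1}) * (x_i * y_{i+1} - x_{i+1} * y_i))
      Cy = 1/(6A)   * sum((y_i + y_{i+1}) * (x_i * y_{i+1} - x_{i+1} * y_i))
    """
    twice_area = cx = cy = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]  # wrap around to close the boundary
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    area = twice_area / 2
    return area, (cx / (6 * area), cy / (6 * area))
```

Because the centroid is a smooth function of the boundary vertices, uncertainty on those vertices can in principle be pushed through this expression to obtain centroid uncertainty.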

[167] Controllable and Expressive One-Shot Video Head Swapping

Chaonan Ji,Jinwei Qi,Peng Zhang,Bang Zhang,Liefeng Bo

Main category: cs.CV

TL;DR: This paper proposes a new diffusion-based framework for video head swapping that allows for holistic head transplantation from images into videos, with support for post-swapping expression editing.

Details Motivation: The motivation behind this work is to overcome limitations of existing face and head-swapping methods that either neglect holistic head morphology or struggle with hairstyle diversity and complex backgrounds, and do not allow users to modify expressions post-swapping. Method: The paper introduces a diffusion-based multi-condition controllable framework with two key strategies: Identity-preserving context fusion using a shape-agnostic mask and hair enhancement, and Expression-aware landmark retargeting and editing using a disentangled 3DMM-driven module. Result: The experimental results show that the proposed method achieves superior performance in background integration, source identity preservation, and expression transfer across both real and virtual characters. Conclusion: The paper concludes that the proposed method excels in seamlessly integrating the transplanted head into the target video while preserving the source identity and allowing for expression editing. Abstract: In this paper, we propose a novel diffusion-based multi-condition controllable framework for video head swapping, which seamlessly transplants a human head from a static image into a dynamic video, while preserving the original body and background of the target video, and further allows head expressions and movements to be adjusted during swapping as needed. Existing face-swapping methods mainly focus on localized facial replacement and neglect holistic head morphology, while head-swapping approaches struggle with hairstyle diversity and complex backgrounds, and none of these methods allow users to modify the transplanted head expressions after swapping. To tackle these challenges, our method incorporates several innovative strategies through a unified latent diffusion paradigm. 
1) Identity-preserving context fusion: We propose a shape-agnostic mask strategy to explicitly disentangle foreground head identity features from background/body contexts, combined with a hair enhancement strategy to achieve robust holistic head identity preservation across diverse hair types and complex backgrounds. 2) Expression-aware landmark retargeting and editing: We propose a disentangled 3DMM-driven retargeting module that decouples identity, expression, and head poses, minimizing the impact of original expressions in input images and supporting expression editing. A scale-aware retargeting strategy is further employed to minimize cross-identity expression distortion for higher transfer precision. Experimental results demonstrate that our method excels in seamless background integration while preserving the identity of the source portrait, as well as showcasing superior expression transfer capabilities applicable to both real and virtual characters.

[168] ParkFormer: A Transformer-Based Parking Policy with Goal Embedding and Pedestrian-Aware Control

Jun Fu,Bin Tian,Haonan Chen,Shi Meng,Tingting Yao

Main category: cs.CV

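TL;DR: This paper proposes ParkFormer, a Transformer-based end-to-end autonomous parking framework that combines goal-point attention fusion with pedestrian prediction to park efficiently and precisely in complex urban environments.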

Details Motivation: Traditional rule-based parking systems adapt poorly to environmental uncertainty and to crowded or dynamic scenes, whereas human drivers park intuitively, so a more intelligent and adaptive autonomous parking approach is needed. Method: A Transformer network takes surround-view camera images, goal-point representations, ego-vehicle motion state, and pedestrian trajectories as input and outputs discrete control sequences covering throttle, braking, steering, and gear selection; it is validated in the CARLA 0.9.14 simulator. Result: The model achieves a 96.57% success rate in vertical and parallel parking scenarios, with an average positional error of 0.21 meters and an average orientation error of 0.41 degrees; ablation studies confirm the effectiveness of the key modules. Conclusion: The paper proposes a Transformer-based end-to-end autonomous parking framework that learns from expert demonstrations to achieve precise parking control with a high success rate, and introduces a cross-attention module and a GRU-based pedestrian predictor to improve performance. Abstract: Autonomous parking plays a vital role in intelligent vehicle systems, particularly in constrained urban environments where high-precision control is required. While traditional rule-based parking systems struggle with environmental uncertainties and lack adaptability in crowded or dynamic scenes, human drivers demonstrate the ability to park intuitively without explicit modeling. Inspired by this observation, we propose a Transformer-based end-to-end framework for autonomous parking that learns from expert demonstrations. The network takes as input surround-view camera images, goal-point representations, ego vehicle motion, and pedestrian trajectories. It outputs discrete control sequences including throttle, braking, steering, and gear selection. A novel cross-attention module integrates BEV features with target points, and a GRU-based pedestrian predictor enhances safety by modeling dynamic obstacles. We validate our method on the CARLA 0.9.14 simulator in both vertical and parallel parking scenarios. Experiments show our model achieves a high success rate of 96.57%, with average positional and orientation errors of 0.21 meters and 0.41 degrees, respectively. The ablation studies further demonstrate the effectiveness of key modules such as pedestrian prediction and goal-point attention fusion. The code and dataset will be released at: https://github.com/little-snail-f/ParkFormer.
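
The goal-point fusion can be pictured as ordinary scaled dot-product cross-attention in which a goal embedding queries a set of BEV feature tokens. The sketch below is single-head and omits the learned projection matrices. Treating the goal as the query and the BEV tokens as keys and values is an assumption, because the abstract only says that the module integrates the two.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Single-head scaled dot-product attention of one query over many tokens.

    Returns the fused feature vector and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    fused = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return fused, weights
```

BEV tokens whose keys align with the goal embedding dominate the fused feature, which is the sense in which attention steers the policy toward the target slot.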

[169] With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You

Fabian Gröger,Shuo Wen,Huyen Le,Maria Brbić

Main category: cs.CV

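TL;DR: This work proposes a new multimodal alignment framework that achieves high-performance multimodal learning with only a small amount of paired data.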

Details Motivation: Existing multimodal models typically require large numbers of paired multimodal samples, which are expensive or infeasible to obtain in many domains. The authors therefore explore whether multimodal models can be built with limited paired data. Method: Introduces STRUCTURE, a regularization technique that preserves the neighborhood geometry of the latent space of unimodal encoders, and studies the effect of aligning different layers. Result: High-quality multimodal alignment is achieved with only tens of thousands of paired samples (less than 1% of the data typically used), with average relative improvements of 51.6% in classification and 91.8% in retrieval across 24 zero-shot image classification and retrieval benchmarks. Conclusion: With effective regularization and an appropriate choice of layers to align, high-quality multimodal alignment is possible with limited paired data, yielding substantial gains across multiple tasks. Abstract: Multimodal models have demonstrated powerful capabilities in complex tasks requiring multimodal alignment including zero-shot classification and cross-modal retrieval. However, existing models typically rely on millions of paired multimodal samples, which are prohibitively expensive or infeasible to obtain in many domains. In this work, we explore the feasibility of building multimodal models with a limited amount of paired data by aligning pretrained unimodal foundation models. We show that high-quality alignment is possible with as few as tens of thousands of paired samples, less than 1% of the data typically used in the field. To achieve this, we introduce STRUCTURE, an effective regularization technique that preserves the neighborhood geometry of the latent space of unimodal encoders. Additionally, we show that aligning last layers is often suboptimal and demonstrate the benefits of aligning the layers with the highest representational similarity across modalities. These two components can be readily incorporated into existing alignment methods, yielding substantial gains across 24 zero-shot image classification and retrieval benchmarks, with average relative improvement of 51.6% in classification and 91.8% in retrieval tasks. Our results highlight the effectiveness and broad applicability of our framework for limited-sample multimodal learning and offer a promising path forward for resource-constrained domains.
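
A regularizer that "preserves neighborhood geometry" can be illustrated by comparing pairwise similarities before and after alignment: if the aligned embeddings keep the same similarity structure as the frozen unimodal encoder, the penalty is zero. The mean squared difference of cosine-similarity matrices used below is an assumed stand-in chosen for clarity. The actual STRUCTURE formulation is not given in the abstract and may differ.

```python
import math
from typing import List, Sequence


def cosine_matrix(embs: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise cosine similarities between all embeddings in a batch."""
    norms = [math.sqrt(sum(x * x for x in e)) for e in embs]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv) for v, nv in zip(embs, norms)]
        for u, nu in zip(embs, norms)
    ]


def geometry_penalty(pretrained: Sequence[Sequence[float]], aligned: Sequence[Sequence[float]]) -> float:
    """Mean squared change in pairwise similarity caused by the alignment layers."""
    s_before, s_after = cosine_matrix(pretrained), cosine_matrix(aligned)
    n = len(s_before)
    return sum((s_before[i][j] - s_after[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)
```

A rotation of the embedding space leaves this penalty at zero, whereas collapsing distinct samples onto one point is penalized, which is the behaviour one wants when only a few paired samples constrain the alignment.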

[170] LunarLoc: Segment-Based Global Localization on the Moon

Annika Thomas,Robaire Galliath,Aleksander Garbuz,Luke Anger,Cormac O'Neill,Trevor Johst,Dami Thomas,George Lordos,Jonathan P. How

Main category: cs.CV

TL;DR: LunarLoc is a new method for lunar surface localization that uses boulder landmarks and graph-based matching to achieve highly accurate, drift-free positioning for autonomous systems.

Details Motivation: Autonomous lunar operations require precise pose estimation, but traditional methods like VIO accumulate drift over time, which is problematic for long-duration missions like the ISRU Pilot Excavator. Method: LunarLoc uses instance segmentation to extract boulder landmarks from stereo imagery, constructs a graph-based terrain representation, and aligns it with a reference map using graph-theoretic data association. Result: LunarLoc achieves sub-cm level accuracy in multi-session global localization experiments and significantly outperforms existing methods. Conclusion: LunarLoc enables accurate and drift-free global localization for autonomous lunar operations, outperforming the state of the art. Abstract: Global localization is necessary for autonomous operations on the lunar surface where traditional Earth-based navigation infrastructure, such as GPS, is unavailable. As NASA advances toward sustained lunar presence under the Artemis program, autonomous operations will be an essential component of tasks such as robotic exploration and infrastructure deployment. Tasks such as excavation and transport of regolith require precise pose estimation, but proposed approaches such as visual-inertial odometry (VIO) accumulate odometry drift over long traverses. Precise pose estimation is particularly important for upcoming missions such as the ISRU Pilot Excavator (IPEx) that rely on autonomous agents to operate over extended timescales and varied terrain. To help overcome odometry drift over long traverses, we propose LunarLoc, an approach to global localization that leverages instance segmentation for zero-shot extraction of boulder landmarks from onboard stereo imagery. Segment detections are used to construct a graph-based representation of the terrain, which is then aligned with a reference map of the environment captured during a previous session using graph-theoretic data association. 
This method enables accurate and drift-free global localization in visually ambiguous settings. LunarLoc achieves sub-cm level accuracy in multi-session global localization experiments, significantly outperforming the state of the art in lunar global localization. To encourage the development of further methods for global localization on the Moon, we release our datasets publicly with a playback module: https://github.com/mit-acl/lunarloc-data.
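
Graph-theoretic data association is commonly posed as follows: every candidate match between a local landmark and a map landmark is a node, two nodes are connected when the matches preserve the distance between the landmarks involved, and the largest fully connected set gives the correspondences. The sketch below does this with an exhaustive clique search that suits only a handful of landmarks. The tolerance `eps` and the function names are illustrative, and the search used in LunarLoc itself is not described in the abstract.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Match = Tuple[int, int]  # (index in local landmarks, index in reference map)


def consistent(m1: Match, m2: Match, local: Sequence[Point], ref: Sequence[Point], eps: float) -> bool:
    """Two matches agree if they preserve the inter-landmark distance."""
    (i, a), (j, b) = m1, m2
    if i == j or a == b:  # a landmark can be matched only once
        return False
    return abs(math.dist(local[i], local[j]) - math.dist(ref[a], ref[b])) < eps


def associate(local: Sequence[Point], ref: Sequence[Point], eps: float = 0.05) -> List[Match]:
    """Largest set of mutually consistent matches (maximum clique, brute force)."""
    candidates = [(i, a) for i in range(len(local)) for a in range(len(ref))]
    best: List[Match] = []

    def grow(clique: List[Match], pool: List[Match]) -> None:
        nonlocal best
        if len(clique) > len(best):
            best = list(clique)
        for k, cand in enumerate(pool):
            rest = [m for m in pool[k + 1:] if consistent(cand, m, local, ref, eps)]
            if len(clique) + 1 + len(rest) > len(best):  # prune hopeless branches
                grow(clique + [cand], rest)

    grow([], candidates)
    return sorted(best)
```

Pairwise distances are unchanged by rotation and translation, so the match can be found before the robot's pose is known; the pose then follows from aligning the matched landmarks.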

[171] LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models

Fanfei Li,Thomas Klein,Wieland Brendel,Robert Geirhos,Roland S. Zimmermann

Main category: cs.CV

TL;DR: This paper introduces LAION-C, a new benchmark for evaluating out-of-distribution robustness in computer vision models, demonstrating that modern models can match or exceed human performance on challenging OOD tasks.

Details Motivation: Traditional benchmarks like ImageNet-C are no longer effective for evaluating OOD robustness due to the prevalence of common corruptions in modern training datasets. This makes it unclear whether improvements in model performance stem from better OOD generalization or mere exposure to test distortions during training. Method: The authors introduced LAION-C as a novel benchmark with six distortion types designed to be OOD even for web-scale datasets. They conducted comprehensive evaluations on state-of-the-art models and a psychophysical experiment comparing model performance to human observers. Result: LAION-C poses significant challenges to contemporary models, including advanced MLLMs such as Gemini and GPT-4o. The results also indicate a shift in OOD generalization performance, with the best models matching or surpassing human observers. Conclusion: The paper concludes that LAION-C provides a more suitable benchmark for evaluating OOD robustness in the era of web-scale datasets, and it observes a paradigm shift where top models now match or outperform human observers in OOD generalization. Abstract: Out-of-distribution (OOD) robustness is a desired property of computer vision models. Improving model robustness requires high-quality signals from robustness benchmarks to quantify progress. While various benchmark datasets such as ImageNet-C were proposed in the ImageNet era, most ImageNet-C corruption types are no longer OOD relative to today's large, web-scraped datasets, which already contain common corruptions such as blur or JPEG compression artifacts. Consequently, these benchmarks are no longer well-suited for evaluating OOD robustness in the era of web-scale datasets. 
Indeed, recent models show saturating scores on ImageNet-era OOD benchmarks, indicating that it is unclear whether models trained on web-scale datasets truly become better at OOD generalization or whether they have simply been exposed to the test distortions during training. To address this, we introduce LAION-C as a benchmark alternative for ImageNet-C. LAION-C consists of six novel distortion types specifically designed to be OOD, even for web-scale datasets such as LAION. In a comprehensive evaluation of state-of-the-art models, we find that the LAION-C dataset poses significant challenges to contemporary models, including MLLMs such as Gemini and GPT-4o. We additionally conducted a psychophysical experiment to evaluate the difficulty of our corruptions for human observers, enabling a comparison of models to lab-quality human robustness data. We observe a paradigm shift in OOD generalization: from humans outperforming models, to the best models now matching or outperforming the best human observers.

[172] Visual-Instructed Degradation Diffusion for All-in-One Image Restoration

Wenyang Luo,Haina Qin,Zewen Chen,Libin Wang,Dandan Zheng,Yuming Li,Yufan Liu,Bing Li,Weiming Hu

Main category: cs.CV

TL;DR: Defusion is an all-in-one framework for image restoration that uses visual instructions to guide a diffusion model, effectively addressing multiple degradation types.

Details Motivation: Traditional models need distinct systems for each degradation type, limiting their use in real-world cases with mixed issues. A more generalizable approach is needed. Method: Defusion employs visual instruction-guided degradation diffusion to restore images, using explicit visual instructions derived from standardized elements. Result: Defusion outperforms state-of-the-art methods on diverse and complex image restoration tasks, including real-world degradations. Conclusion: Defusion provides a versatile and effective solution for image restoration, showing superior performance across various tasks. Abstract: Image restoration tasks like deblurring, denoising, and dehazing usually need distinct models for each degradation type, restricting their generalization in real-world scenarios with mixed or unknown degradations. In this work, we propose Defusion, a novel all-in-one image restoration framework that utilizes visual instruction-guided degradation diffusion. Unlike existing methods that rely on task-specific models or ambiguous text-based priors, Defusion constructs explicit visual instructions that align with the visual degradation patterns. These instructions are grounded by applying degradations to standardized visual elements, capturing intrinsic degradation features while agnostic to image semantics. Defusion then uses these visual instructions to guide a diffusion-based model that operates directly in the degradation space, where it reconstructs high-quality images by denoising the degradation effects with enhanced stability and generalizability. Comprehensive experiments demonstrate that Defusion outperforms state-of-the-art methods across diverse image restoration tasks, including complex and real-world degradations.

[173] Reversing Flow for Image Restoration

Haina Qin,Wenyang Luo,Libin Wang,Dandan Zheng,Jingdong Chen,Ming Yang,Bing Li,Weiming Hu

Main category: cs.CV

TL;DR: ResFlow is a novel image restoration framework that efficiently reverses image degradation through deterministic modeling using continuous normalizing flows.

Details Motivation: Existing generative models for image restoration often treat the degradation process as a stochastic transformation, which introduces inefficiency and complexity. There is a need for a more efficient and practical solution. Method: ResFlow models the degradation process as a deterministic path using continuous normalizing flows. It augments the degradation process with an auxiliary process to enable reversible modeling by matching the velocity field. Result: ResFlow significantly improves the performance and speed of image restoration, completing tasks in fewer than four sampling steps. Conclusion: ResFlow provides a practical and efficient solution for real-world image restoration applications, achieving state-of-the-art results across various benchmarks. Abstract: Image restoration aims to recover high-quality (HQ) images from degraded low-quality (LQ) ones by reversing the effects of degradation. Existing generative models for image restoration, including diffusion and score-based models, often treat the degradation process as a stochastic transformation, which introduces inefficiency and complexity. In this work, we propose ResFlow, a novel image restoration framework that models the degradation process as a deterministic path using continuous normalizing flows. ResFlow augments the degradation process with an auxiliary process that disambiguates the uncertainty in HQ prediction to enable reversible modeling of the degradation process. ResFlow adopts entropy-preserving flow paths and learns the augmented degradation flow by matching the velocity field. ResFlow significantly improves the performance and speed of image restoration, completing the task in fewer than four sampling steps. Extensive experiments demonstrate that ResFlow achieves state-of-the-art results across various image restoration benchmarks, offering a practical and efficient solution for real-world applications.

[174] ForestFormer3D: A Unified Framework for End-to-End Segmentation of Forest LiDAR 3D Point Clouds

Binbin Xiang,Maciej Wielgosz,Stefano Puliti,Kamil Král,Martin Krůček,Azim Missarov,Rasmus Astrup

Main category: cs.CV

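TL;DR: ForestFormer3D is a new unified, end-to-end framework for high-precision individual tree and semantic segmentation of forest LiDAR point clouds. It addresses the poor performance of existing methods in complex natural forest environments and shows strong performance and good generalization across multiple test sets.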

Details Motivation: Existing methods often struggle with the complexity and variability of natural forest environments, yet segmentation of forest LiDAR 3D point clouds is essential for advancing forest management and ecological research. Method: ForestFormer3D combines ISA-guided query point selection, a score-based block merging strategy during inference, and a one-to-many association mechanism for training. Result: ForestFormer3D achieves state-of-the-art individual tree segmentation performance on the newly introduced FOR-instanceV2 dataset and also performs well on unseen test sets (Wytham woods and LAUTx). Conclusion: ForestFormer3D is a unified end-to-end framework that delivers precise individual tree and semantic segmentation and generalizes well across different forest conditions and sensor modalities. Abstract: The segmentation of forest LiDAR 3D point clouds, including both individual tree and semantic segmentation, is fundamental for advancing forest management and ecological research. However, current approaches often struggle with the complexity and variability of natural forest environments. We present ForestFormer3D, a new unified and end-to-end framework designed for precise individual tree and semantic segmentation. ForestFormer3D incorporates ISA-guided query point selection, a score-based block merging strategy during inference, and a one-to-many association mechanism for effective training. By combining these new components, our model achieves state-of-the-art performance for individual tree segmentation on the newly introduced FOR-instanceV2 dataset, which spans diverse forest types and regions. Additionally, ForestFormer3D generalizes well to unseen test sets (Wytham woods and LAUTx), showcasing its robustness across different forest conditions and sensor modalities. The FOR-instanceV2 dataset and the ForestFormer3D code will be released soon.

[175] Prmpt2Adpt: Prompt-Based Zero-Shot Domain Adaptation for Resource-Constrained Environments

Yasir Ali Farrukh,Syed Wali,Irfan Khan,Nathaniel D. Bastian

Main category: cs.CV

TL;DR: This paper introduces Prmpt2Adpt, an efficient zero-shot domain adaptation framework for vision systems in low-resource environments. It uses prompt-based feature alignment and a teacher-student paradigm to enable fast adaptation and inference, achieving strong performance with minimal data.

Details Motivation: Existing prompt-driven UDA methods rely on large models and require full access to source data, limiting their use in resource-constrained settings like drones. This work aims to develop a lightweight and efficient zero-shot domain adaptation approach suitable for such environments. Method: The method uses a teacher-student paradigm with prompt-based feature alignment. A distilled CLIP model serves as the backbone of a Faster R-CNN teacher, which is briefly fine-tuned using source features aligned to target-domain semantics via Prompt-driven Instance Normalization (PIN). The adapted teacher generates pseudo-labels that guide the student model's on-the-fly adaptation. Result: Experiments on the MDS-A dataset show that Prmpt2Adpt achieves competitive performance while enabling up to 7x faster adaptation and 5x faster inference speed with few source images. Conclusion: Prmpt2Adpt provides a practical and scalable solution for real-time domain adaptation in low-resource environments, offering faster adaptation and inference speeds with competitive detection performance. Abstract: Unsupervised Domain Adaptation (UDA) is a critical challenge in real-world vision systems, especially in resource-constrained environments like drones, where memory and computation are limited. Existing prompt-driven UDA methods typically rely on large vision-language models and require full access to source-domain data during adaptation, limiting their applicability. In this work, we propose Prmpt2Adpt, a lightweight and efficient zero-shot domain adaptation framework built around a teacher-student paradigm guided by prompt-based feature alignment. At the core of our method is a distilled and fine-tuned CLIP model, used as the frozen backbone of a Faster R-CNN teacher. A small set of low-level source features is aligned to the target domain semantics, specified only through a natural language prompt, via Prompt-driven Instance Normalization (PIN). 
These semantically steered features are used to briefly fine-tune the detection head of the teacher model. The adapted teacher then generates high-quality pseudo-labels, which guide the on-the-fly adaptation of a compact student model. Experiments on the MDS-A dataset demonstrate that Prmpt2Adpt achieves competitive detection performance compared to state-of-the-art methods, while delivering up to 7x faster adaptation and 5x faster inference speed using few source images, making it a practical and scalable solution for real-time adaptation in low-resource domains.
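
As a rough illustration of the feature-alignment step described above, the following is a minimal sketch of an AdaIN-style re-normalization, which is the general form that prompt-driven instance normalization is usually given. It is not the authors' implementation: the function names, the flat-list feature layout, and the fact that the target statistics are passed in directly (rather than optimized against a text embedding of the prompt) are all assumptions made for illustration.

```python
import math

def channel_stats(channel, eps=1e-5):
    """Return (mean, std) of one feature channel given as a flat list of floats."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((x - mean) ** 2 for x in channel) / n
    return mean, math.sqrt(var + eps)

def prompt_instance_norm(features, target_mean, target_std):
    """AdaIN-style re-normalization of a list of channels.

    Each channel has its own statistics removed and the target statistics
    re-applied. In a prompt-driven setting the target statistics would be
    learned so that the stylized features match the prompt; here they are
    simply given (assumption for this sketch).
    """
    out = []
    for channel, mu_t, sigma_t in zip(features, target_mean, target_std):
        mu, sigma = channel_stats(channel)
        out.append([sigma_t * (x - mu) / sigma + mu_t for x in channel])
    return out

# Hypothetical two-channel source feature map, flattened per channel.
source = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 12.0, 12.0]]
stylized = prompt_instance_norm(source, target_mean=[0.0, 5.0], target_std=[1.0, 2.0])
```

After the call, each channel of `stylized` carries approximately the requested mean and standard deviation while keeping the relative spatial pattern of the source channel.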

[176] A Synthetic Benchmark for Collaborative 3D Semantic Occupancy Prediction in V2X Autonomous Driving

Hanlin Wu,Pengfei Lin,Ehsan Javanmardi,Naren Bao,Bo Qian,Hao Si,Manabu Tsukada

Main category: cs.CV

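TL;DR: This paper builds a benchmark for collaborative 3D semantic occupancy prediction and develops a baseline model that fuses inter-agent features via spatial alignment and attention aggregation.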

Details Motivation: The perception capability of a single vehicle is limited by occlusion, restricted sensor range, and narrow viewpoints, so collaborative perception is needed to exchange complementary information and improve completeness and accuracy. Method: An existing collaborative perception dataset is replayed in CARLA with a high-resolution semantic voxel sensor to produce dense, comprehensive occupancy annotations, and benchmarks with varying prediction ranges are established. A baseline model is also developed that performs inter-agent feature fusion via spatial alignment and attention aggregation. Result: Experiments show that the baseline model consistently outperforms single-agent models, with larger gains as the prediction range expands. Conclusion: Collaborative perception effectively improves the accuracy and completeness of 3D semantic occupancy prediction; future work can further optimize the collaboration mechanism and model performance. Abstract: 3D semantic occupancy prediction is an emerging perception paradigm in autonomous driving, providing a voxel-level representation of both geometric details and semantic categories. However, the perception capability of a single vehicle is inherently constrained by occlusion, restricted sensor range, and narrow viewpoints. To address these limitations, collaborative perception enables the exchange of complementary information, thereby enhancing the completeness and accuracy. In the absence of a dedicated dataset for collaborative 3D semantic occupancy prediction, we augment an existing collaborative perception dataset by replaying it in CARLA with a high-resolution semantic voxel sensor to provide dense and comprehensive occupancy annotations. In addition, we establish benchmarks with varying prediction ranges designed to systematically assess the impact of spatial extent on collaborative prediction. We further develop a baseline model that performs inter-agent feature fusion via spatial alignment and attention aggregation. Experimental results demonstrate that our baseline model consistently outperforms single-agent models, with increasing gains observed as the prediction range expands.
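
To make the idea of attention aggregation across agents concrete, here is a minimal sketch for a single voxel whose features from several agents are assumed to be already spatially aligned into the ego frame. The dot-product scoring, the use of the ego feature as the query, and the function names are illustrative assumptions; the abstract does not specify the baseline's actual fusion module.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_voxel(ego_feat, agent_feats):
    """Fuse one voxel's features from several agents with scaled dot-product attention.

    ego_feat    : feature vector of the ego vehicle at this voxel (used as query)
    agent_feats : feature vectors from all agents at the same voxel, ego included,
                  assumed to be already warped into the ego coordinate frame
    Returns the fused feature vector and the attention weights.
    """
    d = len(ego_feat)
    scores = [sum(q * k for q, k in zip(ego_feat, f)) / math.sqrt(d) for f in agent_feats]
    weights = softmax(scores)
    fused = [sum(w * f[i] for w, f in zip(weights, agent_feats)) for i in range(d)]
    return fused, weights

# Hypothetical 2-D features: the second agent sees something the ego does not.
fused, weights = fuse_voxel([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The weights sum to one, so the fused feature is a convex combination of what each agent observed at that voxel; agents whose features agree with the ego query receive more weight in this particular scoring scheme.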

[177] Unsupervised Image Super-Resolution Reconstruction Based on Real-World Degradation Patterns

Yiyang Tie,Hong Zhu,Yunyun Luo,Jing Shi

Main category: cs.CV

TL;DR: This paper proposes a novel TripleGAN framework for real-world super-resolution reconstruction, effectively learning degradation patterns and generating high-quality results from low-resolution images.

Details Motivation: Real-world super-resolution reconstruction models require realistic datasets that capture complex degradation patterns. Existing methods struggle to model both blur and diverse noise characteristics or bridge the domain gap between synthetic and real data. Method: The paper proposes a TripleGAN framework with three components: FirstGAN narrows the domain gap in blur characteristics, SecondGAN performs domain-specific translation, and ThirdGAN reconstructs real-world low-resolution images using pseudo-real data generated by the first two components. Result: Extensive experiments on RealSR and DRealSR datasets show clear advantages of the proposed method in quantitative metrics while maintaining sharp reconstructions without over-smoothing artifacts. Conclusion: The proposed TripleGAN framework effectively learns real-world degradation patterns and synthesizes aligned datasets, enabling superior performance in reconstructing high-quality super-resolution images from real-world low-resolution inputs. Abstract: The training of real-world super-resolution reconstruction models heavily relies on datasets that reflect real-world degradation patterns. Extracting and modeling degradation patterns for super-resolution reconstruction using only real-world low-resolution (LR) images remains a challenging task. When synthesizing datasets to simulate real-world degradation, relying solely on degradation extraction methods fails to capture both blur and diverse noise characteristics across varying LR distributions, as well as more implicit degradations such as color gamut shifts. Conversely, domain translation alone cannot accurately approximate real-world blur characteristics due to the significant degradation domain gap between synthetic and real data. 
To address these challenges, we propose a novel TripleGAN framework comprising two strategically designed components: The FirstGAN primarily focuses on narrowing the domain gap in blur characteristics, while the SecondGAN performs domain-specific translation to approximate target-domain blur properties and learn additional degradation patterns. The ThirdGAN is trained on pseudo-real data generated by the FirstGAN and SecondGAN to reconstruct real-world LR images. Extensive experiments on the RealSR and DRealSR datasets demonstrate that our method exhibits clear advantages in quantitative metrics while maintaining sharp reconstructions without over-smoothing artifacts. The proposed framework effectively learns real-world degradation patterns from LR observations and synthesizes aligned datasets with corresponding degradation characteristics, thereby enabling the trained network to achieve superior performance in reconstructing high-quality SR images from real-world LR inputs.

[178] Stretching Beyond the Obvious: A Gradient-Free Framework to Unveil the Hidden Landscape of Visual Invariance

Lorenzo Tausani,Paolo Muratore,Morgan B. Talbot,Giacomo Amerio,Gabriel Kreiman,Davide Zoccolan

Main category: cs.CV

TL;DR: This paper introduces Stretch-and-Squeeze (SnS), a framework for analyzing invariance and adversarial sensitivity in visual systems, revealing insights into CNN behavior and its similarity to human vision.

Details Motivation: Understanding how images are transformed into representations that support recognition requires identifying features encoded by high-level visual units and their invariances. Method: SnS uses bi-objective optimization to analyze invariance and adversarial sensitivity by seeking specific image perturbations. Result: SnS revealed that robust CNNs produce more recognizable invariant images than standard networks and highlighted differences in transformations based on image representation. Conclusion: Stretch-and-Squeeze (SnS) is effective in uncovering the invariance landscape of visual units and their vulnerability to adversarial perturbations, showing that robust CNNs offer a higher fidelity model of the visual system. Abstract: Uncovering which feature combinations high-level visual units encode is critical to understanding how images are transformed into representations that support recognition. While existing feature visualization approaches typically infer a unit's most exciting images, this is insufficient to reveal the manifold of transformations under which responses remain invariant, which is key to generalization in vision. Here we introduce Stretch-and-Squeeze (SnS), an unbiased, model-agnostic, and gradient-free framework to systematically characterize a unit's invariance landscape and its vulnerability to adversarial perturbations in both biological and artificial visual systems. SnS frames these transformations as bi-objective optimization problems. To probe invariance, SnS seeks image perturbations that maximally alter the representation of a reference stimulus in a given processing stage while preserving unit activation. To probe adversarial sensitivity, SnS seeks perturbations that minimally alter the stimulus while suppressing unit activation. 
Applied to convolutional neural networks (CNNs), SnS revealed image variations that were further from a reference image in pixel-space than those produced by affine transformations, while more strongly preserving the target unit's response. The discovered invariant images differed dramatically depending on the choice of image representation used for optimization: pixel-level changes primarily affected luminance and contrast, while stretching mid- and late-layer CNN representations altered texture and pose respectively. Notably, the invariant images from robust networks were more recognizable by human subjects than those from standard networks, supporting the higher fidelity of robust CNNs as models of the visual system.
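
A bi-objective, gradient-free search of the kind described above needs some way to compare candidates that trade one objective against the other. One common choice is Pareto dominance; the sketch below shows that selection step only. Whether SnS uses exactly this rule is not stated in the abstract, so treat it as an assumption; the objective names are also illustrative.

```python
def dominates(a, b):
    """True if candidate a is at least as good as b on both objectives
    and strictly better on at least one (both objectives are maximized)."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

def pareto_front(points):
    """Keep only the candidates that no other candidate dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical candidates scored as
# (distance from the reference representation, preserved unit activation).
candidates = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (1.0, 1.0)]
front = pareto_front(candidates)
```

Here `(1.0, 1.0)` is discarded because `(2.0, 2.0)` beats it on both objectives, while the other three candidates represent different trade-offs and are all kept.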

[179] Relaxed syntax modeling in Transformers for future-proof license plate recognition

Florent Meyer,Laurent Guichard,Denis Coquenet,Guillaume Gravier,Yann Soullard,Bertrand Coüasnon

Main category: cs.CV

TL;DR: Transformers struggle with recognizing new license plate designs over time due to reliance on past syntax; this paper proposes SaLT, a syntax-agnostic model that maintains high accuracy for both old and new plates.

Details Motivation: Existing Transformer-based license plate recognition systems suffer from a significant performance drop over time as new license plates with unseen syntax are introduced. This makes them unsuitable for production environments requiring resilience to constant change. Method: The authors analyzed the flow of positional and contextual information in Transformer encoder-decoders and identified reasons for their over-reliance on past syntax. They then devised architectural cut-offs and replacements to create SaLT, which was evaluated through experiments on real and synthetic datasets with ablation studies. Result: SaLT achieves state-of-the-art accuracy on license plates with syntax seen during training while maintaining strong performance on future plates with unseen syntax, unlike conventional Transformer models which perform close to random guessing in such cases. Conclusion: The paper concludes that traditional Transformer-based networks are not suitable for long-term license plate recognition due to their reliance on previously seen syntax. The proposed SaLT model, a Syntax-Less Transformer, effectively maintains performance on both past and future license plates. Abstract: Effective license plate recognition systems are required to be resilient to constant change, as new license plates are released into traffic daily. While Transformer-based networks excel in their recognition at first sight, we observe a significant performance drop over time, which proves them unsuitable for tense production environments. Indeed, such systems obtain state-of-the-art results on plates whose syntax is seen during training. Yet, we show they perform similarly to random guessing on future plates where legible characters are wrongly recognized due to a shift in their syntax. After highlighting the flows of positional and contextual information in Transformer encoder-decoders, we identify several causes for their over-reliance on past syntax. 
We then devise architectural cut-offs and replacements, which we integrate into SaLT, an attempt at a Syntax-Less Transformer for syntax-agnostic modeling of license plate representations. Experiments on both real and synthetic datasets show that our approach reaches top accuracy on past syntax and, most importantly, nearly maintains performance on future license plates. We further demonstrate the robustness of our architecture enhancements by way of various ablations.

[180] Assembler: Scalable 3D Part Assembly via Anchor Point Diffusion

Wang Zhao,Yan-Pei Cao,Jiale Xu,Yuejiang Dong,Ying Shan

Main category: cs.CV

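TL;DR: This paper introduces Assembler, a scalable and generalizable framework for 3D part assembly that reconstructs complete objects from input part meshes and a reference image, achieving state-of-the-art performance on PartNet.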

Details Motivation: Existing methods mostly rely on deterministic part pose prediction and category-specific training, which makes it hard to handle diverse real-world objects with varying part counts, geometries, and structures. The authors therefore propose a new approach to large-scale, general 3D part assembly. Method: First, part assembly is cast as a generative problem, and diffusion models are used to sample plausible configurations, capturing the ambiguity that arises from symmetry, repeated parts, and multiple valid assemblies. Second, a novel shape-centric representation based on sparse anchor point clouds is introduced, enabling scalable generation in Euclidean space rather than SE(3) pose prediction. Finally, a large-scale dataset of over 320K diverse part-object assemblies is constructed. Result: Assembler achieves state-of-the-art performance on PartNet and is the first to demonstrate high-quality assembly for complex real-world objects. The paper also presents a part-aware 3D modeling system that generates high-resolution, editable objects from images, showing potential for interactive and compositional design. Conclusion: Assembler is a scalable and generalizable 3D part assembly framework that, through innovations in task formulation, representation, and data construction, handles diverse and complex real-world objects well and offers potential applications in interactive and compositional design. Abstract: We present Assembler, a scalable and generalizable framework for 3D part assembly that reconstructs complete objects from input part meshes and a reference image. Unlike prior approaches that mostly rely on deterministic part pose prediction and category-specific training, Assembler is designed to handle diverse, in-the-wild objects with varying part counts, geometries, and structures. It addresses the core challenges of scaling to general 3D part assembly through innovations in task formulation, representation, and data. First, Assembler casts part assembly as a generative problem and employs diffusion models to sample plausible configurations, effectively capturing ambiguities arising from symmetry, repeated parts, and multiple valid assemblies. Second, we introduce a novel shape-centric representation based on sparse anchor point clouds, enabling scalable generation in Euclidean space rather than SE(3) pose prediction. Third, we construct a large-scale dataset of over 320K diverse part-object assemblies using a synthesis and filtering pipeline built on existing 3D shape repositories. Assembler achieves state-of-the-art performance on PartNet and is the first to demonstrate high-quality assembly for complex, real-world objects. Based on Assembler, we further introduce an interesting part-aware 3D modeling system that generates high-resolution, editable objects from images, demonstrating potential for interactive and compositional design. 
Project page: https://assembler3d.github.io

[181] Acquiring and Accumulating Knowledge from Diverse Datasets for Multi-label Driving Scene Classification

Ke Li,Chenyu Zhang,Yuxin Ding,Xianbiao Hu,Ruwen Qin

Main category: cs.CV

TL;DR: This paper presents a new learning system called KAA-CAL for driving scene identification, which combines knowledge acquisition and accumulation with consistency-based active learning. It outperforms existing models while using significantly less data.

Details Motivation: Driving scene identification is crucial for autonomous vehicles to understand their environment, but there are two main challenges: acquiring a balanced, annotated dataset and balancing learning across tasks. Method: The paper proposes a novel learning system combining knowledge acquisition and accumulation (KAA) with consistency-based active learning (CAL) to address the challenges of multi-label classification for driving scene identification. Result: An ablation study showed a 56.1% performance increase over the baseline model pretrained on ImageNet, with KAA accounting for 31.3% of the gain and CAL contributing 24.8%. KAA-CAL outperformed state-of-the-art models on two public datasets while using 85% less data. Conclusion: The paper concludes that their proposed KAA-CAL system outperforms existing multi-label models while using significantly less data. They also make their dataset and implementation code publicly available. Abstract: Driving scene identification, which assigns multiple non-exclusive class labels to a scene, provides the contextual awareness necessary for enhancing autonomous vehicles' ability to understand, reason about, and interact with the complex driving environment. As a multi-label classification problem, it is better tackled via multitask learning. However, directly training a multi-label classification model for driving scene identification through multitask learning presents two main challenges: acquiring a balanced, comprehensively annotated multi-label dataset and balancing learning across different tasks. This paper introduces a novel learning system that synergizes knowledge acquisition and accumulation (KAA) with consistency-based active learning (CAL) to address those challenges. KAA acquires and accumulates knowledge about scene identification from various single-label datasets via monotask learning. 
Subsequently, CAL effectively resolves the knowledge gap caused by the discrepancy between the marginal distributions of individual attributes and their joint distribution. An ablation study on our Driving Scene Identification (DSI) dataset demonstrates a 56.1% performance increase over the baseline model pretrained on ImageNet. Of this, KAA accounts for 31.3% of the gain, and CAL contributes 24.8%. Moreover, KAA-CAL stands out as the best performer when compared to state-of-the-art (SOTA) multi-label models on two public datasets, BDD100K and HSD, achieving this while using 85% less data. The DSI dataset and the implementation code for KAA-CAL are available at https://github.com/KELISBU/KAA-CAL .

[182] RGBTrack: Fast, Robust Depth-Free 6D Pose Estimation and Tracking

Teng Guo,Jingjin Yu

Main category: cs.CV

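TL;DR: RGBTrack is a real-time 6D object pose estimation and tracking framework that uses only RGB data; it infers depth and generates robust pose hypotheses through a novel binary search and render-and-compare approach, making it suitable for a range of applications.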

Details Motivation: Eliminate the dependence on depth input to make dynamic, precise object pose tracking more flexible and practical. Method: Building on the FoundationPose architecture, the authors propose a novel binary search strategy and a render-and-compare mechanism, combined with XMem 2D object tracking, a Kalman filter, and a state machine for stable tracking. Result: Evaluations on benchmark datasets show that RGBTrack achieves competitive accuracy and real-time performance without depth input. Conclusion: RGBTrack is a real-time 6D pose estimation and tracking framework that relies only on RGB data, offers good accuracy and practicality, and is applicable to robotics, augmented reality, and computer vision. Abstract: We introduce a robust framework, RGBTrack, for real-time 6D pose estimation and tracking that operates solely on RGB data, thereby eliminating the need for depth input for such dynamic and precise object pose tracking tasks. Building on the FoundationPose architecture, we devise a novel binary search strategy combined with a render-and-compare mechanism to efficiently infer depth and generate robust pose hypotheses from true-scale CAD models. To maintain stable tracking in dynamic scenarios, including rapid movements and occlusions, RGBTrack integrates state-of-the-art 2D object tracking (XMem) with a Kalman filter and a state machine for proactive object pose recovery. In addition, RGBTrack's scale recovery module dynamically adapts CAD models of unknown scale using an initial depth estimate, enabling seamless integration with modern generative reconstruction techniques. Extensive evaluations on benchmark datasets demonstrate that RGBTrack's novel depth-free approach achieves competitive accuracy and real-time performance, making it a promising practical solution candidate for application areas including robotics, augmented reality, and computer vision. The source code for our implementation will be made publicly available at https://github.com/GreatenAnoymous/RGBTrack.git.
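
The idea of inferring depth by binary search with a render-and-compare check can be sketched in a toy setting. Below, "rendering" is replaced by a pinhole projection of a known object width, and "comparing" by checking the projected width against the observed width in pixels. This is an assumption-laden simplification for illustration, not the RGBTrack pipeline, which renders a CAD model; all function and parameter names are hypothetical.

```python
def projected_width_px(object_width_m, depth_m, focal_px):
    """Pinhole model: apparent width in pixels of an object at a given depth."""
    return focal_px * object_width_m / depth_m

def estimate_depth(observed_width_px, object_width_m, focal_px,
                   near=0.1, far=10.0, tol=1e-4):
    """Binary search for the depth at which the projected width matches the observation.

    The projected width shrinks monotonically as depth grows, so each
    comparison tells us which half of the [near, far] interval to keep.
    """
    lo, hi = near, far
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        rendered = projected_width_px(object_width_m, mid, focal_px)
        if rendered > observed_width_px:
            lo = mid  # hypothesis looks too large, so it is too close: move farther
        else:
            hi = mid  # hypothesis looks too small (or equal): move closer
    return 0.5 * (lo + hi)

# Hypothetical setup: a 0.2 m wide object seen 80 px wide with a 600 px focal length.
depth = estimate_depth(observed_width_px=80.0, object_width_m=0.2, focal_px=600.0)
```

With these numbers the search converges to a depth of about 1.5 m, since 600 × 0.2 / 1.5 = 80. The true-scale model matters: if the object width were unknown, depth and scale could not be separated by this comparison alone.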

[183] Dynamic Watermark Generation for Digital Images using Perimeter Gated SPAD Imager PUFs

Md Sakibur Sajal,Marc Dandin

Main category: cs.CV

TL;DR: This paper proposes a novel watermarking technique using perimeter gated SPAD (pgSPAD) imagers to derive digital image watermarks from the imager's physically unclonable functions (PUFs).

Details Motivation: While previous studies have focused on CMOS image sensors (CIS) and active pixel sensors (APS), single photon avalanche diode (SPAD) imagers have not been explored for watermarking. This research aims to investigate the potential of SPAD imagers for this purpose. Method: The study utilized the dark signal non-uniformity (DSNU) of three 64 x 64 pgSPAD imager chips, fabricated in a 0.35 μm standard CMOS process. Simulated watermarks were analyzed for standard test images from publicly available databases. Result: The results showed that both source identification and tamper detection can be achieved using the proposed source-scene-specific dynamic watermarks with a controllable sensitivity-robustness trade-off. Conclusion: The proposed watermarking technique using pgSPAD imagers demonstrates promising results for achieving security features through PUFs, offering flexibility in balancing sensitivity and robustness. Abstract: Digital image watermarks as a security feature can be derived from the imager's physically unclonable functions (PUFs) by utilizing the manufacturing variations, i.e., the dark signal non-uniformity (DSNU). While a few demonstrations focused on the CMOS image sensors (CIS) and active pixel sensors (APS), single photon avalanche diode (SPAD) imagers have never been investigated for this purpose. In this work, we have proposed a novel watermarking technique using perimeter gated SPAD (pgSPAD) imagers. We utilized the DSNU of three 64 x 64 pgSPAD imager chips, fabricated in a 0.35 μm standard CMOS process, and analyzed the simulated watermarks for standard test images from a publicly available database. Our observation shows that both source identification and tamper detection can be achieved using the proposed source-scene-specific dynamic watermarks with a controllable sensitivity-robustness trade-off.

[184] Semi-Supervised Multi-Modal Medical Image Segmentation for Complex Situations

Dongdong Meng,Sheng Li,Hao Wu,Guoping Wang,Xueqing Yan

Main category: cs.CV

TL;DR: This paper proposes a novel semi-supervised multi-modal medical image segmentation approach that improves accuracy by leveraging complementary information and introducing contrastive mutual learning.

Details Motivation: Semi-supervised learning faces challenges in handling complex backgrounds and tasks, while multi-modal fusion methods struggle to achieve significant improvements under semi-supervised conditions due to difficulties in utilizing unlabeled data. This work aims to develop an effective and reliable multi-modal learning strategy for such scenarios. Method: The method employs a multi-stage multi-modal fusion and enhancement strategy, aiming to reduce feature discrepancies and enhance feature sharing and alignment. Contrastive mutual learning is also integrated to ensure prediction consistency across modalities. Result: Experimental results on two multi-modal datasets show that the proposed framework achieves superior performance and robustness in medical image segmentation tasks, particularly in complex scenarios. Conclusion: The proposed semi-supervised multi-modal medical image segmentation approach effectively enhances performance by leveraging complementary information across modalities and introduces contrastive mutual learning to improve robustness in segmentation results. Abstract: Semi-supervised learning addresses the issue of limited annotations in medical images effectively, but its performance is often inadequate for complex backgrounds and challenging tasks. Multi-modal fusion methods can significantly improve the accuracy of medical image segmentation by providing complementary information. However, they face challenges in achieving significant improvements under semi-supervised conditions due to the challenge of effectively leveraging unlabeled data. There is a significant need to create an effective and reliable multi-modal learning strategy for leveraging unlabeled data in semi-supervised segmentation. To address these issues, we propose a novel semi-supervised multi-modal medical image segmentation approach, which leverages complementary multi-modal information to enhance performance with limited labeled data. 
Our approach employs a multi-stage multi-modal fusion and enhancement strategy to fully utilize complementary multi-modal information, while reducing feature discrepancies and enhancing feature sharing and alignment. Furthermore, we effectively introduce contrastive mutual learning to constrain prediction consistency across modalities, thereby facilitating the robustness of segmentation results in semi-supervised tasks. Experimental results on two multi-modal datasets demonstrate the superior performance and robustness of the proposed framework, establishing its valuable potential for solving medical image segmentation tasks in complex scenarios.

[185] On the Theory of Conditional Feature Alignment for Unsupervised Domain-Adaptive Counting

Zhuonan Liang,Dongnan Liu,Jianan Fan,Yaxuan Song,Qiang Qu,Yu Yao,Peng Fu,Weidong Cai

Main category: cs.CV

TL;DR: This paper introduces conditional feature alignment to enhance domain adaptation for object counting by focusing on task-relevant variations, showing improved performance over existing methods.

Details Motivation: Object counting models perform poorly across domains with differing densities, which violates standard domain adaptation assumptions. This work addresses this issue by introducing a method that accounts for task-relevant variations. Method: The paper proposes a theoretical framework of conditional feature alignment, formalizes conditional divergence, derives a joint error bound, and provides an adaptation strategy validated through experiments on multiple counting datasets. Result: The proposed method outperforms existing unsupervised domain adaptation approaches in experiments on counting datasets with varying density distributions. Conclusion: Conditional feature alignment improves cross-domain generalization for object counting by preserving task-relevant variations and filtering out nuisance shifts. Abstract: Object counting models suffer when deployed across domains with differing density variety, since density shifts are inherently task-relevant and violate standard domain adaptation assumptions. To address this, we propose a theoretical framework of conditional feature alignment. We first formalize the notion of conditional divergence by partitioning each domain into subsets (e.g., object vs. background) and measuring divergences per condition. We then derive a joint error bound showing that, under discrete label spaces treated as condition sets, aligning distributions conditionally leads to tighter bounds on the combined source-target decision error than unconditional alignment. These insights motivate a general conditional adaptation principle: by preserving task-relevant variations while filtering out nuisance shifts, one can achieve superior cross-domain generalization for counting. We provide both a theoretical contribution (defining conditional divergence and proving its benefit in lowering joint error) and a practical adaptation strategy that preserves task-relevant information in unsupervised domain-adaptive counting.
We demonstrate the effectiveness of our approach through extensive experiments on multiple counting datasets with varying density distributions. The results show that our method outperforms existing unsupervised domain adaptation methods, empirically validating the theoretical insights on conditional feature alignment.
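
The idea of measuring divergence per condition rather than over the whole domain can be illustrated with a toy sketch. It is not the paper's formulation: the 1-D features, the `mean_gap` proxy (absolute difference of sample means), and all function names are assumptions made for illustration. The source domain is sparse (mostly background) and the target is dense (mostly object), so the unconditional gap is large even though each condition is perfectly aligned.

```python
from statistics import mean

def mean_gap(xs, ys):
    """Crude 1-D divergence proxy (an assumption): absolute difference of sample means."""
    return abs(mean(xs) - mean(ys))

def unconditional_divergence(source, target):
    """Compare all source features with all target features, ignoring conditions."""
    return mean_gap([f for f, _ in source], [f for f, _ in target])

def conditional_divergences(source, target):
    """Compare features separately within each shared condition (e.g. object vs. background)."""
    conditions = sorted({c for _, c in source} & {c for _, c in target})
    return {
        c: mean_gap(
            [f for f, k in source if k == c],
            [f for f, k in target if k == c],
        )
        for c in conditions
    }

# Each sample is (feature, condition). Only the object/background ratio differs.
source = [(0.0, "background")] * 8 + [(1.0, "object")] * 2  # sparse scene
target = [(0.0, "background")] * 2 + [(1.0, "object")] * 8  # dense scene

print(unconditional_divergence(source, target))  # large: driven by the density shift
print(conditional_divergences(source, target))   # zero within each condition
```

In this toy case, forcing the unconditional gap to zero would erase the density difference that the counting task depends on, while the per-condition gaps show there is nothing left to align.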

[186] Do We Need Large VLMs for Spotting Soccer Actions?

Ritabrata Chakraborty,Rajatsubhra Chakraborty,Avijit Dasgupta,Sandeep Chaurasia

Main category: cs.CV

TL;DR: This paper proposes a lightweight text-based method for soccer action spotting using Large Language Models (LLMs) instead of vision-based models, achieving effective results by analyzing expert commentary.

Details Motivation: Traditional video-based action spotting methods are computationally expensive and rely on dense visual data; thus, there is a need for a more scalable and lightweight approach using rich contextual textual commentary. Method: The researchers used the SoccerNet Echoes dataset with timestamped commentary and employed three specialized LLMs as judges focusing on outcome, excitement, and tactics to evaluate sliding windows of commentary and identify key actions. Result: The language-centric approach effectively detected critical match events like goals, cards, and substitutions, offering a training-free and efficient alternative to existing methods. Conclusion: The study concludes that text-based action spotting using LLMs is a viable, lightweight alternative to traditional video-based methods by leveraging expert commentary for accurate event detection. Abstract: Traditional video-based tasks like soccer action spotting rely heavily on visual inputs, often requiring complex and computationally expensive models to process dense video data. In this work, we propose a shift from this video-centric approach to a text-based task, making it lightweight and scalable by utilizing Large Language Models (LLMs) instead of Vision-Language Models (VLMs). We posit that expert commentary, which provides rich, fine-grained descriptions and contextual cues such as excitement and tactical insights, contains enough information to reliably spot key actions in a match. To demonstrate this, we use the SoccerNet Echoes dataset, which provides timestamped commentary, and employ a system of three LLMs acting as judges specializing in outcome, excitement, and tactics. Each LLM evaluates sliding windows of commentary to identify actions like goals, cards, and substitutions, generating accurate timestamps for these events. 
Our experiments show that this language-centric approach performs effectively in detecting critical match events, providing a lightweight and training-free alternative to traditional video-based methods for action spotting.
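
The sliding-window judging described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the three judges are keyword-matching placeholders standing in for LLM calls, the window and stride lengths are arbitrary, the majority-vote combination is an assumption, and events are reported at the window start rather than at a refined timestamp.

```python
from collections import Counter

def sliding_windows(commentary, window_s=30.0, stride_s=15.0):
    """Yield (start, end, text) windows over (timestamp_seconds, sentence) pairs."""
    if not commentary:
        return
    last = max(t for t, _ in commentary)
    start = 0.0
    while start <= last:
        end = start + window_s
        text = " ".join(s for t, s in commentary if start <= t < end)
        if text:
            yield start, end, text
        start += stride_s

def keyword_judge(keywords):
    """Build a placeholder judge; a real system would prompt an LLM here."""
    def judge(text):
        lowered = text.lower()
        for label, words in keywords.items():
            if any(w in lowered for w in words):
                return label
        return None
    return judge

# Three hypothetical judges, loosely mirroring outcome, excitement, and tactics.
judges = [
    keyword_judge({"goal": ["scores"], "card": ["yellow card", "red card"],
                   "substitution": ["comes on"]}),
    keyword_judge({"goal": ["what a finish", "unbelievable"]}),
    keyword_judge({"substitution": ["fresh legs", "change of shape"]}),
]

def spot_actions(commentary, judges):
    """Return (window_start, label) events, merging repeats from overlapping windows."""
    events, last_end = [], 0.0
    for start, end, text in sliding_windows(commentary):
        labels = [judge(text) for judge in judges]
        votes = Counter(label for label in labels if label is not None)
        if not votes:
            continue
        label, _ = votes.most_common(1)[0]
        if events and events[-1][1] == label and start < last_end:
            continue  # same event already reported by an overlapping window
        events.append((start, label))
        last_end = end
    return events

commentary = [
    (12.0, "Early pressure from the home side."),
    (47.0, "He scores! What a finish!"),
    (80.0, "Fresh legs now as the striker comes on."),
]
print(spot_actions(commentary, judges))
```

Because no model is trained, swapping the placeholder judges for real LLM prompts would only change `keyword_judge`; the windowing and voting stay the same.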

[187] Co-Seg++: Mutual Prompt-Guided Collaborative Learning for Versatile Medical Segmentation

Qing Xu,Yuxiang Luo,Wenting Duan,Zhen Chen

Main category: cs.CV

TL;DR: Co-Seg++ is a new medical image segmentation method that combines semantic and instance segmentation to significantly improve accuracy.

Details Motivation: Existing methods treat segmentation tasks in isolation, overlooking the interdependencies between tasks, which degrades segmentation performance. Method: The paper proposes the STP-Encoder and MTC-Decoder modules, which use spatio-temporal relationships and cross-task guidance for collaborative segmentation. Result: Co-Seg++ outperforms existing state-of-the-art methods on multiple CT and histopathology datasets, covering the segmentation of dental structures, tissues, and nuclei. Conclusion: Co-Seg++ provides an efficient medical image segmentation framework that improves performance by jointly handling semantic and instance segmentation tasks. Abstract: Medical image analysis is critical yet challenged by the need to jointly segment organs or tissues, and numerous instances for anatomical structures and tumor microenvironment analysis. Existing studies typically formulated different segmentation tasks in isolation, which overlooks the fundamental interdependencies between these tasks, leading to suboptimal segmentation performance and insufficient medical image understanding. To address this issue, we propose a Co-Seg++ framework for versatile medical segmentation. Specifically, we introduce a novel co-segmentation paradigm, allowing semantic and instance segmentation tasks to mutually enhance each other. We first devise a spatio-temporal prompt encoder (STP-Encoder) to capture long-range spatial and temporal relationships between segmentation regions and image embeddings as prior spatial constraints. Moreover, we devise a multi-task collaborative decoder (MTC-Decoder) that leverages cross-guidance to strengthen the contextual consistency of both tasks, jointly computing semantic and instance segmentation masks. Extensive experiments on diverse CT and histopathology datasets demonstrate that the proposed Co-Seg++ outperforms state-of-the-art methods in the semantic, instance, and panoptic segmentation of dental anatomical structures, histopathology tissues, and nuclei instances. The source code is available at https://github.com/xq141839/Co-Seg-Plus.

[188] YASMOT: Yet another stereo image multi-object tracker

Ketil Malde

Main category: cs.CV

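TL;DR: This paper presents yasmot, a lightweight and flexible object tracker that processes the output of popular object detectors, tracks objects in monoscopic or stereoscopic camera configurations, and can generate consensus detections.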

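Details Motivation: For image time series (e.g., video or sequences of stills), tracking objects over time and preserving object identity helps improve object detection performance and is necessary for many downstream tasks, including classifying and predicting behaviors and estimating total abundances. Method: The approach analyzes images to extract object locations and class labels, then tracks objects across the image time series while preserving object identity. Result: The paper presents yasmot, an object tracker that can process the output of popular object detectors and generate consensus detections from ensembles of object detectors. Conclusion: yasmot is a lightweight and flexible object tracker that processes the output of popular object detectors from monoscopic or stereoscopic camera configurations and tracks objects over time. It can also generate consensus detections from ensembles of object detectors. Abstract: There now exist many popular object detectors based on deep learning that can analyze images and extract locations and class labels for occurrences of objects. For image time series (i.e., video or sequences of stills), tracking objects over time and preserving object identity can help to improve object detection performance, and is necessary for many downstream tasks, including classifying and predicting behaviors, and estimating total abundances. Here we present yasmot, a lightweight and flexible object tracker that can process the output from popular object detectors and track objects over time from either monoscopic or stereoscopic camera configurations. In addition, it includes functionality to generate consensus detections from ensembles of object detectors.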

[189] Facial Landmark Visualization and Emotion Recognition Through Neural Networks

Israel Juárez-Jiménez,Tiffany Guadalupe Martínez Paredes,Jesús García-Ramírez,Eric Ramos Aguilar

Main category: cs.CV

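TL;DR: This paper proposes a new visualization method for facial datasets and validates the superior performance of neural networks for emotion recognition.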

Details Motivation: Existing facial image datasets are insufficiently analyzed, and visualizing facial landmarks is challenging. Method: The paper proposes facial landmark box plots as a visualization technique and compares the performance of a neural network and a random forest classifier. Result: Facial landmark box plots effectively identify outliers in the dataset, and the neural network model performs better. Conclusion: Neural networks outperform random forest classifiers on facial emotion recognition. Abstract: Emotion recognition from facial images is a crucial task in human-computer interaction, enabling machines to learn human emotions through facial expressions. Previous studies have shown that facial images can be used to train deep learning models; however, most of these studies do not include a thorough dataset analysis. Visualizing facial landmarks can be challenging when extracting meaningful dataset insights; to address this issue, we propose facial landmark box plots, a visualization technique designed to identify outliers in facial datasets. Additionally, we compare two sets of facial landmark features: (i) the landmarks' absolute positions and (ii) their displacements from a neutral expression to the peak of an emotional expression. Our results indicate that a neural network achieves better performance than a random forest classifier.

[190] Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition

Jiaqi Li,Junshu Tang,Zhiyong Xu,Longhuang Wu,Yuan Zhou,Shuai Shao,Tianbao Yu,Zhiguo Cao,Qinglin Lu

Main category: cs.CV

TL;DR: Hunyuan-GameCraft is a new framework for generating high-quality, interactive gameplay videos by improving action control, training strategies, and inference efficiency.

Details Motivation: Recent advancements in diffusion-based and controllable video generation have laid the groundwork for immersive gaming experiences, but current methods face limitations in dynamics, generality, long-term consistency, and efficiency when creating gameplay videos. Method: The method involves unifying keyboard and mouse inputs into a shared camera representation space, employing a hybrid history-conditioned training strategy for autoregressive sequence extension, and using model distillation to improve efficiency and long-term consistency. Result: Hunyuan-GameCraft achieves high-dynamic interactive video generation with improved visual fidelity, realism, and action controllability, making it suitable for real-time deployment in complex environments. Conclusion: Hunyuan-GameCraft significantly enhances the realism and playability of interactive game video generation compared to existing models. Abstract: Recent advances in diffusion-based and controllable video generation have enabled high-quality and temporally coherent video synthesis, laying the groundwork for immersive interactive gaming experiences. However, current methods face limitations in dynamics, generality, long-term consistency, and efficiency, which limit the ability to create various gameplay videos. To address these gaps, we introduce Hunyuan-GameCraft, a novel framework for high-dynamic interactive video generation in game environments. To achieve fine-grained action control, we unify standard keyboard and mouse inputs into a shared camera representation space, facilitating smooth interpolation between various camera and movement operations. Then we propose a hybrid history-conditioned training strategy that extends video sequences autoregressively while preserving game scene information. 
Additionally, to enhance inference efficiency and playability, we achieve model distillation to reduce computational overhead while maintaining consistency across long temporal sequences, making it suitable for real-time deployment in complex interactive environments. The model is trained on a large-scale dataset comprising over one million gameplay recordings across over 100 AAA games, ensuring broad coverage and diversity, then fine-tuned on a carefully annotated synthetic dataset to enhance precision and control. The curated game scene data significantly improves the visual fidelity, realism and action controllability. Extensive experiments demonstrate that Hunyuan-GameCraft significantly outperforms existing models, advancing the realism and playability of interactive game video generation.

[191] UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation

Teng Li,Quanfeng Lu,Lirui Zhao,Hao Li,Xizhou Zhu,Yu Qiao,Jun Zhang,Wenqi Shao

Main category: cs.CV

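TL;DR: This paper studies modality alignment patterns in unified architectures and proposes UniFork, a new Y-shaped architecture that shares representation learning in the shallow layers and uses task-specific branches in the deeper layers, achieving strong performance on both understanding and generation tasks.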

Details Motivation: Despite recent progress, the optimal architectural design for such unified models remains an open challenge; understanding tasks benefit from modality alignment that increases progressively with network depth, whereas generation tasks need alignment that increases in the early layers and then decreases. Method: By analyzing the modality alignment behavior of task-specific expert models and current unified models, the paper proposes UniFork, a new Y-shaped architecture that shares the shallow layers for cross-task representation learning while using task-specific branches in the deeper layers to avoid task interference. Result: Extensive ablation experiments show that UniFork consistently outperforms conventional fully shared Transformer architectures and achieves performance on par with or better than task-specific models. Conclusion: UniFork effectively balances shared learning and task specialization, outperforms conventional fully shared Transformer architectures, and achieves performance on par with or better than task-specific models. Abstract: Unified image understanding and generation has emerged as a promising paradigm in multimodal artificial intelligence. Despite recent progress, the optimal architectural design for such unified models remains an open challenge. In this work, we start by analyzing the modality alignment behaviors of task-specific expert models for understanding and generation, as well as current unified models. Our analysis reveals a crucial observation: understanding tasks benefit from a progressively increasing modality alignment across network depth, which helps build up semantic information for better comprehension. In contrast, generation tasks follow a different trend: modality alignment increases in the early layers but decreases in the deep layers to recover spatial details. These divergent alignment patterns create a fundamental conflict in fully shared Transformer backbones, where a uniform representational flow often leads to performance compromises across two tasks. Motivated by this finding, we introduce UniFork, a novel Y-shaped architecture that shares the shallow layers for cross-task representation learning, while employing task-specific branches in deeper layers to avoid task interference. This design effectively balances shared learning and task specialization. Through extensive ablation experiments, we demonstrate that UniFork consistently outperforms conventional fully shared Transformer architectures, and achieves performance on par with or better than task-specific models.

[192] Part$^{2}$GS: Part-aware Modeling of Articulated Objects using 3D Gaussian Splatting

Tianjiao Yu,Vedant Shah,Muntasir Wahed,Ying Shen,Kiet A. Nguyen,Ismini Lourentzou

Main category: cs.CV

TL;DR: Part$^{2}$GS is a novel framework for articulated object modeling that achieves high-fidelity geometry and physically consistent motion using a part-aware 3D Gaussian representation and physics-guided constraints.

Details Motivation: Modeling the structure and motion of articulated objects is challenging in 3D reconstruction, requiring high fidelity and physical consistency in movement. Method: Part$^{2}$GS uses a part-aware 3D Gaussian representation for encoding articulated components and introduces a motion-aware canonical representation guided by physics-based constraints, along with repel points to prevent collisions. Result: Part$^{2}$GS outperforms state-of-the-art methods by up to 10$\times$ in Chamfer Distance on synthetic and real-world datasets. Conclusion: Part$^{2}$GS provides a new framework for modeling articulated digital twins with high-fidelity geometry and physically consistent articulation, significantly improving motion coherence over baselines. Abstract: Articulated objects are common in the real world, yet modeling their structure and motion remains a challenging task for 3D reconstruction methods. In this work, we introduce Part$^{2}$GS, a novel framework for modeling articulated digital twins of multi-part objects with high-fidelity geometry and physically consistent articulation. Part$^{2}$GS leverages a part-aware 3D Gaussian representation that encodes articulated components with learnable attributes, enabling structured, disentangled transformations that preserve high-fidelity geometry. To ensure physically consistent motion, we propose a motion-aware canonical representation guided by physics-based constraints, including contact enforcement, velocity consistency, and vector-field alignment. Furthermore, we introduce a field of repel points to prevent part collisions and maintain stable articulation paths, significantly improving motion coherence over baselines. Extensive evaluations on both synthetic and real-world datasets show that Part$^{2}$GS consistently outperforms state-of-the-art methods by up to 10$\times$ in Chamfer Distance for movable parts.
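
For readers unfamiliar with the metric reported above, a common definition of Chamfer Distance can be sketched in a few lines. The variant below (symmetric, mean of unsquared nearest-neighbor distances) is an assumption; the entry does not say which variant or normalization the paper uses, and the function name is illustrative.

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Each set is a list of coordinate tuples. For every point, the distance to
    its nearest neighbor in the other set is averaged, and both directions
    are summed. Brute force, O(len(a) * len(b)).
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)

reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reconstruction = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]  # second point displaced along z

print(chamfer_distance(reference, reference))       # identical sets give zero
print(chamfer_distance(reference, reconstruction))  # grows with the displacement
```

Lower values mean the reconstructed geometry lies closer to the reference, which is why a reduction in this metric is reported as an improvement.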

[193] Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation

Xiuyu Yang,Shuhan Tan,Philipp Krähenbühl

Main category: cs.CV

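TL;DR: This paper proposes InfGen, a new traffic simulation model that addresses key problems in long-term traffic simulation and performs well in both short-term and long-term simulation.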

Details Motivation: The goal is to provide realistic, long-term, end-to-end traffic simulation and to address a shortcoming of existing models and benchmarks in long-term simulation, namely agents entering and exiting the scene. Method: The paper proposes InfGen, a unified next-token prediction model that performs interleaved closed-loop motion simulation and scene generation and switches automatically between the two modes. Result: InfGen reaches state-of-the-art performance in short-term (9s) traffic simulation and significantly outperforms all other methods in long-term (30s) simulation. Conclusion: InfGen achieves excellent performance in both short-term and long-term traffic simulation, with a particularly large margin over other methods in the long-term setting. Abstract: An ideal traffic simulator replicates the realistic long-term point-to-point trip that a self-driving system experiences during deployment. Prior models and benchmarks focus on closed-loop motion simulation for initial agents in a scene. This is problematic for long-term simulation. Agents enter and exit the scene as the ego vehicle enters new regions. We propose InfGen, a unified next-token prediction model that performs interleaved closed-loop motion simulation and scene generation. InfGen automatically switches between closed-loop motion simulation and scene generation mode. It enables stable long-term rollout simulation. InfGen performs at the state-of-the-art in short-term (9s) traffic simulation, and significantly outperforms all other methods in long-term (30s) simulation. The code and model of InfGen will be released at https://orangesodahub.github.io/InfGen

[194] Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

Zeyuan Yang,Xueyang Yu,Delin Chen,Maohao Shen,Chuang Gan

Main category: cs.CV

TL;DR: This paper proposes Mirage, a framework that enhances vision-language models' (VLMs) visual reasoning by incorporating latent visual tokens during decoding, eliminating the need for explicit image generation.

Details Motivation: VLMs struggle with tasks demanding visual imagination due to their reliance on text-only decoding. Existing approaches that involve image generation often compromise reasoning ability, prompting the need for an alternative method inspired by human mental imagery. Method: The paper introduces Mirage, a framework that augments VLM decoding with latent visual tokens. These tokens are initially supervised through distillation from image embeddings and later aligned with task objectives using text-only supervision. Reinforcement learning is employed to enhance reasoning capabilities. Result: Experiments show that Mirage improves multimodal reasoning across various benchmarks without generating explicit images. Conclusion: Mirage enables VLMs to perform multimodal reasoning without explicit image generation, enhancing performance on tasks requiring visual imagination. Abstract: Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders the reasoning ability. Inspired by the way humans reason with mental imagery, the internal construction and manipulation of visual cues, we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to "think visually", it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. 
We begin by supervising the latent tokens through distillation from ground-truth image embeddings, then switch to text-only supervision to make the latent trajectory align tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.

[195] Emergent Temporal Correspondences from Video Diffusion Transformers

Jisu Nam,Soowon Son,Dahyun Chung,Jiyoung Kim,Siyoon Jin,Junhwa Hur,Seungryong Kim

Main category: cs.CV

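TL;DR: This paper presents DiffTrack, a framework for analyzing how video diffusion models establish temporal correspondences across frames; it reveals the critical role of query-key similarities in specific layers during denoising and demonstrates applications in tracking and video generation.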

Details Motivation: Although video diffusion models have succeeded in generating temporally coherent videos, how they internally establish temporal correspondences remains an open fundamental question. Method: DiffTrack constructs a dataset with pseudo ground-truth tracking annotations and proposes new evaluation metrics to analyze how each component of DiTs contributes to establishing temporal correspondences. Result: The study finds that query-key similarities in specific layers play a critical role in temporal matching during the denoising process. DiffTrack also performs strongly in zero-shot point tracking and improves the temporal consistency of generated videos. Conclusion: DiffTrack provides a framework for understanding the inner workings of video Diffusion Transformers (DiTs) and lays a foundation for further research and applications in DiT-based video generation. Abstract: Recent advancements in video diffusion models based on Diffusion Transformers (DiTs) have achieved remarkable success in generating temporally coherent videos. Yet, a fundamental question persists: how do these models internally establish and represent temporal correspondences across frames? We introduce DiffTrack, the first quantitative analysis framework designed to answer this question. DiffTrack constructs a dataset of prompt-generated video with pseudo ground-truth tracking annotations and proposes novel evaluation metrics to systematically analyze how each component within the full 3D attention mechanism of DiTs (e.g., representations, layers, and timesteps) contributes to establishing temporal correspondences. Our analysis reveals that query-key similarities in specific, but not all, layers play a critical role in temporal matching, and that this matching becomes increasingly prominent during the denoising process. We demonstrate practical applications of DiffTrack in zero-shot point tracking, where it achieves state-of-the-art performance compared to existing vision foundation and self-supervised video models. Further, we extend our findings to motion-enhanced video generation with a novel guidance method that improves temporal consistency of generated videos without additional training. We believe our work offers crucial insights into the inner workings of video DiTs and establishes a foundation for further research and applications leveraging their temporal understanding.

[196] VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning

Zhangyang Qi,Zhixiong Zhang,Yizhou Yu,Jiaqi Wang,Hengshuang Zhao

Main category: cs.CV

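TL;DR: This paper proposes the VLN-R1 framework, which uses LVLMs to translate video streams end-to-end into continuous navigation actions and introduces new training strategies to improve performance.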

Details Motivation: Current language-model-based navigation systems are limited to discrete topological graphs and lack fine-grained action control, motivating more effective approaches to path planning. Method: The paper proposes the end-to-end framework VLN-R1 with GRPO-based training, constructs the VLN-Ego dataset, proposes Long-Short Memory Sampling, and uses a two-stage training approach (SFT and RFT) combined with a TDR mechanism. Result: Experimental results show that VLN-R1 performs strongly on the VLN-CE benchmark. Conclusion: VLN-R1 shows that LVLMs can drive embodied navigation and that data-efficient, reward-driven post-training enhances task-specific reasoning. Abstract: Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions. Current language model-based navigation systems operate on discrete topological graphs, limiting path planning to predefined node connections. We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLM) to directly translate egocentric video streams into continuous navigation actions, adopting GRPO-based training inspired by DeepSeek-R1. To enable effective training, we first construct the VLN-Ego dataset using a 3D simulator, Habitat, and propose Long-Short Memory Sampling to balance historical and current observations. While large language models can supervise complete textual instructions, they lack fine-grained action-level control. Our framework employs a two-stage training approach: a) Supervised fine-tuning (SFT) to align the model's action sequence text predictions with expert demonstrations, followed by b) Reinforcement fine-tuning (RFT) enhanced with a Time-Decayed Reward (TDR) mechanism that strategically weights multi-step future actions. Experimental results show VLN-R1 achieves strong performance on the VLN-CE benchmark. VLN-R1 proves LVLMs can drive embodied navigation and enhance task-specific reasoning through data-efficient, reward-driven post-training.
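
One way to picture a reward that weights multi-step future actions by time is the sketch below. It is only an illustration: the abstract does not give the TDR formula, so the exponential decay `gamma ** k`, the exact-match scoring against expert actions, the normalization, and the action names are all assumptions.

```python
def time_decayed_reward(predicted, expert, gamma=0.5):
    """Score a predicted action sequence against an expert sequence.

    Step k contributes weight gamma ** k when the predicted action matches
    the expert action, so near-term steps count more than distant ones.
    The result is normalized to the range [0, 1].
    """
    horizon = min(len(predicted), len(expert))
    if horizon == 0:
        return 0.0
    weights = [gamma ** k for k in range(horizon)]
    matched = sum(w for w, p, e in zip(weights, predicted, expert) if p == e)
    return matched / sum(weights)

expert = ["FORWARD", "FORWARD", "FORWARD"]       # hypothetical action labels
early_error = ["LEFT", "FORWARD", "FORWARD"]     # wrong at the first step
late_error = ["FORWARD", "FORWARD", "LEFT"]      # wrong at the last step

print(time_decayed_reward(expert, expert))       # perfect match
print(time_decayed_reward(early_error, expert))  # penalized heavily
print(time_decayed_reward(late_error, expert))   # penalized lightly
```

Under this assumed scheme, a mistake in the immediate next action lowers the reward more than a mistake several steps ahead, which matches the intuition that near-term actions matter most for continuous navigation.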