Table of Contents
cs.CL
[1] ClimateChat: Designing Data and Methods for Instruction Tuning LLMs to Answer Climate Change Queries
Zhou Chen,Xiao Wang,Yuanhong Liao,Ming Lin,Yuqi Bai
Main category: cs.CL
TL;DR: The paper proposes an automated method to generate climate change-related instruction data, which is used to fine-tune open-source LLMs into a model named ClimateChat. The research highlights the importance of selecting suitable base models and demonstrates ClimateChat's effectiveness in climate change Q&A tasks.
Details
Motivation: To address the limitation of current research in efficiently producing large volumes of high-precision instruction data for climate change, which restricts the development of climate change-specific LLMs. Method: An automated method is introduced to construct instruction data by generating instructions from document facts and background knowledge, enhancing diversity through web scraping and seed instruction collection. This results in the creation of ClimateChat-Corpus, a dataset used to fine-tune open-source LLMs into ClimateChat. Result: ClimateChat significantly improves performance on climate change question-and-answer tasks. The study also evaluates the impact of different base models and instruction data on LLM performance, demonstrating adaptability to various climate change scientific discovery tasks. Conclusion: This research provides valuable references and empirical support for constructing climate change instruction data and training climate change-specific LLMs. Abstract: As the issue of global climate change becomes increasingly severe, the demand for research in climate science continues to grow. Natural language processing technologies, represented by Large Language Models (LLMs), have been widely applied to climate change-specific research, providing essential information support for decision-makers and the public. Some studies have improved model performance on relevant tasks by constructing climate change-related instruction data and instruction-tuning LLMs. However, current research remains inadequate in efficiently producing large volumes of high-precision instruction data for climate change, which limits further development of climate change LLMs. This study introduces an automated method for constructing instruction data. The method generates instructions using facts and background knowledge from documents and enhances the diversity of the instruction data through web scraping and the collection of seed instructions. 
Using this method, we constructed a climate change instruction dataset, named ClimateChat-Corpus, which was used to fine-tune open-source LLMs, resulting in an LLM named ClimateChat. Evaluation results show that ClimateChat significantly improves performance on climate change question-and-answer tasks. Additionally, we evaluated the impact of different base models and instruction data on LLM performance and demonstrated its capability to adapt to a wide range of climate change scientific discovery tasks, emphasizing the importance of selecting an appropriate base model for instruction tuning. This research provides valuable references and empirical support for constructing climate change instruction data and training climate change-specific LLMs.
[2] Investigating the interaction of linguistic and mathematical reasoning in language models using multilingual number puzzles
Antara Raaghavi Bhattacharya,Isabel Papadimitriou,Kathryn Davidson,David Alvarez-Melis
Main category: cs.CL
TL;DR: Large Language Models (LLMs) have difficulty with cross-linguistic numeral puzzles, unlike humans. LLMs perform better when math operations are explicitly marked with known symbols and struggle with inferring implicit compositional numeral structures.
Details
Motivation: To understand why LLMs struggle with linguistic-mathematical puzzles involving cross-linguistic numeral systems that humans can solve successfully. Method: Conduct a series of experiments untangling the linguistic and mathematical aspects of numbers in language and ablation studies probing how individual parameters of numeral construction and combination affect performance. Result: Models cannot consistently solve problems unless mathematical operations are explicitly marked using known symbols. Humans use linguistic understanding to make inferences about implicit numeral structure while LLMs lack this notion. Conclusion: The ability to infer compositional rules from implicit patterns in human-scale data remains an open challenge for current reasoning models. Abstract: Across languages, numeral systems vary widely in how they construct and combine numbers. While humans consistently learn to navigate this diversity, large language models (LLMs) struggle with linguistic-mathematical puzzles involving cross-linguistic numeral systems, which humans can learn to solve successfully. We investigate why this task is difficult for LLMs through a series of experiments that untangle the linguistic and mathematical aspects of numbers in language. Our experiments establish that models cannot consistently solve such problems unless the mathematical operations in the problems are explicitly marked using known symbols ($+$, $\times$, etc, as in "twenty + three"). In further ablation studies, we probe how individual parameters of numeral construction and combination affect performance. While humans use their linguistic understanding of numbers to make inferences about the implicit compositional structure of numerals, LLMs seem to lack this notion of implicit numeral structure. 
We conclude that the ability to flexibly infer compositional rules from implicit patterns in human-scale data remains an open challenge for current reasoning models.
[3] VL-GenRM: Enhancing Vision-Language Verification via Vision Experts and Iterative Training
Jipeng Zhang,Kehao Miao,Renjie Pi,Zhaowei Wang,Runtao Liu,Rui Pan,Tong Zhang
Main category: cs.CL
TL;DR: Reinforcement Fine-Tuning (RFT) with verifiable rewards has advanced large language models but remains underexplored for Vision-Language (VL) models. To address the challenges in training effective VL-RMs, including the bootstrapping dilemma and modality bias, this paper proposes an iterative training framework leveraging vision experts, Chain-of-Thought (CoT) rationales, and Margin-based Rejection Sampling.
Details
Motivation: The motivation of this paper is to advance the alignment of Vision-Language (VL) models using Reinforcement Fine-Tuning (RFT) with verifiable rewards, overcoming the challenges such as the bootstrapping dilemma and modality bias that hinder the development of effective Vision-Language Reward Models (VL-RMs). Method: The method proposed in this paper involves an iterative training framework which leverages vision experts, Chain-of-Thought (CoT) rationales, and Margin-based Rejection Sampling. This approach aims to refine preference datasets, enhance structured critiques, and iteratively improve reasoning to overcome the challenges in training VL-RMs. Result: Experiments across VL-RM benchmarks demonstrate superior performance in hallucination detection and multimodal reasoning, indicating that the proposed framework successfully addresses the challenges and advances VL model alignment with reinforcement learning. Conclusion: The conclusion is that the proposed iterative training framework effectively improves the training of Vision-Language Reward Models (VL-RMs), leading to better hallucination detection and multimodal reasoning capabilities, thus advancing the alignment of Vision-Language models. Abstract: Reinforcement Fine-Tuning (RFT) with verifiable rewards has advanced large language models but remains underexplored for Vision-Language (VL) models. The Vision-Language Reward Model (VL-RM) is key to aligning VL models by providing structured feedback, yet training effective VL-RMs faces two major challenges. First, the bootstrapping dilemma arises as high-quality training data depends on already strong VL models, creating a cycle where self-generated supervision reinforces existing biases. Second, modality bias and negative example amplification occur when VL models hallucinate incorrect visual attributes, leading to flawed preference data that further misguides training. 
To address these issues, we propose an iterative training framework leveraging vision experts, Chain-of-Thought (CoT) rationales, and Margin-based Rejection Sampling. Our approach refines preference datasets, enhances structured critiques, and iteratively improves reasoning. Experiments across VL-RM benchmarks demonstrate superior performance in hallucination detection and multimodal reasoning, advancing VL model alignment with reinforcement learning.
[4] EmoNews: A Spoken Dialogue System for Expressive News Conversations
Ryuki Matsuura,Shikhar Bharadwaj,Jiarui Liu,Dhatchi Kunde Govindarajan
Main category: cs.CL
TL;DR: The paper develops an emotional task-oriented spoken dialogue system (SDS) for news conversations using a sentiment analyzer and PromptTTS, proposing a subjective evaluation scale. Experiments show improved emotion regulation and engagement compared to baseline systems.
Details
Motivation: To create more empathetic news conversations by regulating emotional speech based on contextual cues in a task-oriented SDS. Method: Developed an emotional SDS utilizing a large language model (LLM)-based sentiment analyzer to identify appropriate emotions and PromptTTS to synthesize context-appropriate emotional speech. Also proposed a subjective evaluation scale for emotional SDSs. Result: Experiments demonstrated that the emotional SDS outperformed the baseline system in terms of emotion regulation and engagement. Conclusion: Speech emotion plays a critical role in enhancing the engagement of conversations. Abstract: We develop a task-oriented spoken dialogue system (SDS) that regulates emotional speech based on contextual cues to enable more empathetic news conversations. Despite advancements in emotional text-to-speech (TTS) techniques, task-oriented emotional SDSs remain underexplored due to the compartmentalized nature of SDS and emotional TTS research, as well as the lack of standardized evaluation metrics for social goals. We address these challenges by developing an emotional SDS for news conversations that utilizes a large language model (LLM)-based sentiment analyzer to identify appropriate emotions and PromptTTS to synthesize context-appropriate emotional speech. We also propose a subjective evaluation scale for emotional SDSs and judge the emotion regulation performance of the proposed and baseline systems. Experiments showed that our emotional SDS outperformed a baseline system in terms of emotion regulation and engagement. These results suggest the critical role of speech emotion for more engaging conversations. All our source code is open-sourced at https://github.com/dhatchi711/espnet-emotional-news/tree/emo-sds/egs2/emo_news_sds/sds1
[5] Alignment Quality Index (AQI): Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations
Abhilekh Borah,Chhavi Sharma,Danush Khanna,Utkarsh Bhatt,Gurpreet Singh,Hasnat Md Abdullah,Raghav Kaushik Ravi,Vinija Jain,Jyoti Patel,Shubham Singh,Vasu Sharma,Arpita Vats,Rahul Raja,Aman Chadha,Amitava Das
Main category: cs.CL
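TL;DR: The Alignment Quality Index (AQI) is a new geometric, prompt-invariant metric that assesses the alignment of large language models by analyzing the separation of safe and unsafe activations in latent space. By combining several clustering indices, AQI can detect hidden misalignment and jailbreak risks and serve as an early warning signal. The paper also proposes the LITMUS dataset to support robust evaluation.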
Details
Motivation: As large language models enter high-stakes domains, their behavior must reliably reflect human-aligned values and safety constraints. However, current evaluation methods rely on behavioral proxies that have critical blind spots, and aligned models remain vulnerable to jailbreaking, stochasticity of generation, and alignment faking. Method: The paper introduces the Alignment Quality Index (AQI), which assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. It combines measures such as the Davies-Bouldin Score (DBS), Dunn Index (DI), Xie-Beni Index (XBI), and Calinski-Harabasz Index (CHI) to capture clustering quality. The LITMUS dataset is also proposed to support robust evaluation. Result: Empirical tests on the LITMUS dataset across different models demonstrate AQI's correlation with external judges and its ability to reveal vulnerabilities missed by refusal metrics. Conclusion: The results indicate that AQI can effectively detect hidden misalignment and jailbreak risks and serve as an early warning signal; the implementation is made publicly available for future research. Abstract: Alignment is no longer a luxury, it is a necessity. As large language models (LLMs) enter high-stakes domains like education, healthcare, governance, and law, their behavior must reliably reflect human-aligned values and safety constraints. Yet current evaluations rely heavily on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, all of which have critical blind spots. Aligned models are often vulnerable to jailbreaking, stochasticity of generation, and alignment faking. To address this issue, we introduce the Alignment Quality Index (AQI). This novel geometric and prompt-invariant metric empirically assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. By combining measures such as the Davies-Bouldin Score (DBS), Dunn Index (DI), Xie-Beni Index (XBI), and Calinski-Harabasz Index (CHI) across various formulations, AQI captures clustering quality to detect hidden misalignments and jailbreak risks, even when outputs appear compliant. AQI also serves as an early warning signal for alignment faking, offering a robust, decoding invariant tool for behavior agnostic safety auditing. Additionally, we propose the LITMUS dataset to facilitate robust evaluation under these challenging conditions. Empirical tests on LITMUS across different models trained under DPO, GRPO, and RLHF conditions demonstrate AQI's correlation with external judges and ability to reveal vulnerabilities missed by refusal metrics.
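To make the clustering indices named in this entry concrete, the following is a minimal sketch of two of them, the Davies-Bouldin score and the Dunn index, computed on toy 2-D points standing in for "safe" and "unsafe" activations. It is not the paper's AQI formula: how AQI pools layers and combines the indices is not specified in this digest, and the data, function names, and cluster labels here are all illustrative assumptions.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def davies_bouldin(clusters):
    """Davies-Bouldin score: lower means tighter, better-separated clusters."""
    cents = [centroid(c) for c in clusters]
    # Scatter = mean distance of a cluster's points to its own centroid.
    scatter = [
        sum(dist(p, cents[i]) for p in c) / len(c) for i, c in enumerate(clusters)
    ]
    total = 0.0
    for i in range(len(clusters)):
        total += max(
            (scatter[i] + scatter[j]) / dist(cents[i], cents[j])
            for j in range(len(clusters))
            if j != i
        )
    return total / len(clusters)


def dunn_index(clusters):
    """Dunn index: smallest between-cluster gap over largest cluster diameter.

    Higher means better separation.
    """
    min_between = min(
        dist(p, q)
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
        for p in a
        for q in b
    )
    max_within = max(dist(p, q) for c in clusters for p in c for q in c)
    return min_between / max_within


# Toy stand-ins for latent activations (hypothetical data, not from the paper).
safe = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
unsafe_far = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
unsafe_near = [(1.5, 1.5), (1.5, 2.5), (2.5, 1.5), (2.5, 2.5)]

well_separated = [safe, unsafe_far]
poorly_separated = [safe, unsafe_near]

print(davies_bouldin(well_separated), davies_bouldin(poorly_separated))
print(dunn_index(well_separated), dunn_index(poorly_separated))
```

In this toy setting the well-separated pair gets a lower Davies-Bouldin score and a higher Dunn index than the overlapping pair, which is the kind of geometric signal the entry describes for safe versus unsafe activations.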
We make our implementation publicly available to foster future research in this area.
[6] ASMR: Augmenting Life Scenario using Large Generative Models for Robotic Action Reflection
Shang-Chi Tsai,Seiya Kawano,Angel Garcia Contreras,Koichiro Yoshino,Yun-Nung Chen
Main category: cs.CL
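TL;DR: A data augmentation framework combining dialogues and environmental images uses a large language model and a stable diffusion model to generate data, improving the action selection of multimodal models for robots and achieving state-of-the-art performance.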
Details
Motivation: When designing robots that assist with everyday human activities, user requests alone are often insufficient for understanding intent accurately; a multimodal classification task that adds visual cues can alleviate this problem. However, collecting a large-scale dataset containing both visual and linguistic elements is difficult and time-consuming. Method: The paper proposes a new data augmentation framework that uses a large language model to simulate dialogues and environmental contexts and a stable diffusion model to generate the corresponding images, enriching the training data for multimodal models and improving the accuracy of robot action selection. Result: Experiments show that the method significantly improves the robot's action selection in real-world scenarios and achieves state-of-the-art performance. Conclusion: The proposed data augmentation framework based on dialogues and environmental images effectively improves the performance of multimodal models for robots and offers a new direction for future robotic assistance technology. Abstract: When designing robots to assist in everyday human activities, it is crucial to enhance user requests with visual cues from their surroundings for improved intent understanding. This process is defined as a multimodal classification task. However, gathering a large-scale dataset encompassing both visual and linguistic elements for model training is challenging and time-consuming. To address this issue, our paper introduces a novel framework focusing on data augmentation in robotic assistance scenarios, encompassing both dialogues and related environmental imagery. This approach involves leveraging a sophisticated large language model to simulate potential conversations and environmental contexts, followed by the use of a stable diffusion model to create images depicting these environments. The additionally generated data serves to refine the latest multimodal models, enabling them to more accurately determine appropriate actions in response to user interactions with the limited target data. Our experimental results, based on a dataset collected from real-world scenarios, demonstrate that our methodology significantly enhances the robot's action selection capabilities, achieving state-of-the-art performance.
[7] Are manual annotations necessary for statutory interpretations retrieval?
Aleksander Smywiński-Pohl,Tomer Libal,Adam Kaczmarczyk,Magdalena Król
Main category: cs.CL
TL;DR: The paper explores the optimal number of annotations per legal concept, whether sentences for annotation should be drawn randomly or as best candidates, and the outcome of automating the annotation process with an LLM.
Details
Motivation: To reduce the cost and repetition of manual annotation in retrieving relevant legal interpretations. Method: Conducting experiments to determine the volume, scope, and necessity of manual annotation by checking the optimal number of annotations per concept, the selection method of sentences for annotation, and the effect of automating annotation with an LLM. Result: Answers to the three questions regarding the optimal number of annotations, the selection method of sentences, and the automation of the annotation process are obtained. Conclusion: The findings provide insights into optimizing the manual annotation process and potentially automating it with LLMs. Abstract: One of the elements of legal research is looking for cases where judges have extended the meaning of a legal concept by providing interpretations of what a concept means or does not mean. This allows legal professionals to use such interpretations as precedents as well as laymen to better understand the legal concept. The state-of-the-art approach for retrieving the most relevant interpretations for these concepts currently depends on the ranking of sentences and the training of language models over annotated examples. That manual annotation process can be quite expensive and needs to be repeated for each such concept, which has prompted recent research into automating this process. In this paper, we highlight the results of various experiments conducted to determine the volume, scope and even the need for manual annotation. First, we check the optimal number of annotations per legal concept. Second, we check whether the sentences for annotation can be drawn randomly or whether model performance improves when only the best candidates are annotated. Finally, we check the outcome of automating the annotation process with the help of an LLM.
[8] AI shares emotion with humans across languages and cultures
Xiuwen Wu,Hao Wang,Zhiang Yan,Xiaohan Tang,Pengfei Xu,Wai-Ting Siok,Ping Li,Jia-Hong Gao,Bingjiang Lyu,Lang Qin
Main category: cs.CL
TL;DR: Effective and safe human-AI collaboration needs emotional exchange. This study finds that LLMs' emotion representation aligns with human perception, can be controlled using human-centric emotion concepts, and accurately predicts behavioral data on word ratings.
Details
Motivation: To investigate whether large language models (LLMs) can represent emotions like humans do and if their emotional tone can be controlled to ensure effective and safe human-machine collaboration. Method: Assessing human-AI emotional alignment across linguistic-cultural groups and model-families by translating interpretable LLM features from concept-sets of over twenty nuanced emotion categories and analyzing their structural congruence with human perception. Result: LLM-derived emotion spaces align structurally with human perception based on valence and arousal dimensions. Emotion-related features predict large-scale behavioral data on word ratings, showing universal and language-specific patterns. Steering vectors from human-centric emotion concepts can modulate model expressions across emotion categories. Conclusion: AI shares emotional representations with humans, and its affective outputs can be precisely guided using psychologically grounded emotion concepts. Abstract: Effective and safe human-machine collaboration requires the regulated and meaningful exchange of emotions between humans and artificial intelligence (AI). Current AI systems based on large language models (LLMs) can provide feedback that makes people feel heard. Yet it remains unclear whether LLMs represent emotion in language as humans do, or whether and how the emotional tone of their output can be controlled. We assess human-AI emotional alignment across linguistic-cultural groups and model-families, using interpretable LLM features translated from concept-sets for over twenty nuanced emotion categories (including six basic emotions). Our analyses reveal that LLM-derived emotion spaces are structurally congruent with human perception, underpinned by the fundamental affective dimensions of valence and arousal. 
Furthermore, these emotion-related features also accurately predict large-scale behavioural data on word ratings along these two core dimensions, reflecting both universal and language-specific patterns. Finally, by leveraging steering vectors derived solely from human-centric emotion concepts, we show that model expressions can be stably and naturally modulated across distinct emotion categories, which provides causal evidence that human emotion concepts can be used to systematically induce LLMs to produce corresponding affective states when conveying content. These findings suggest AI not only shares emotional representations with humans but its affective outputs can be precisely guided using psychologically grounded emotion concepts.
[9] Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text
Amr Mohamed,Yang Zhang,Michalis Vazirgiannis,Guokan Shang
Main category: cs.CL
TL;DR: This paper systematically evaluates how Large Language Models (LLMs) process code-switched inputs by creating variants of reasoning and comprehension benchmarks. It finds that embedding English into other languages can improve comprehension, while fine-tuning is more effective than prompting in mitigating performance degradation.
Details
Motivation: Code-switching is becoming increasingly common, especially in online communication within multilingual communities. As LLMs are often exposed to such mixed-language texts, it is essential to evaluate their ability to process and reason about code-switched inputs. Method: The researchers generated code-switched variants of established reasoning and comprehension benchmarks to assess LLMs' performance. They examined the impact of inserting foreign tokens into English text and vice versa, considering both linguistic constraints and different intervention methods like prompting and fine-tuning. Result: LLMs show a degradation in performance when foreign tokens disrupt English text. However, embedding English into other languages tends to improve comprehension. Prompting has mixed results, whereas fine-tuning provides a more consistent way to reduce degradation. Conclusion: Understanding how LLMs handle code-switched inputs is crucial for improving their effectiveness in multilingual contexts. Fine-tuning emerges as a promising approach to mitigate performance issues. Abstract: Code-switching (CSW) is the act of alternating between two or more languages within a single discourse. This phenomenon is widespread in multilingual communities, and increasingly prevalent in online content, where users naturally mix languages in everyday communication. As a result, Large Language Models (LLMs), now central to content processing and generation, are frequently exposed to code-switched inputs. Given their widespread use, it is crucial to understand how LLMs process and reason about such mixed-language text. This paper presents a systematic evaluation of LLM comprehension under code-switching by generating CSW variants of established reasoning and comprehension benchmarks. 
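As a rough illustration of what generating a code-switched variant of a benchmark item can look like, the sketch below swaps selected English tokens for entries from a toy bilingual lexicon. This is an assumption-laden simplification, not the paper's procedure: the lexicon, the `code_switch` function, and the `max_swaps` parameter are hypothetical, and the linguistic constraints the paper mentions are not modeled.

```python
import re

# Hypothetical toy English-to-French lexicon, for illustration only.
TOY_LEXICON = {"cat": "chat", "milk": "lait", "house": "maison"}


def code_switch(sentence, lexicon, max_swaps=None):
    """Replace tokens found in `lexicon`, left to right, up to `max_swaps`.

    With max_swaps=None every matching token is replaced.
    """
    swaps = 0
    out = []
    # Split into word tokens and single punctuation marks.
    for token in re.findall(r"\w+|[^\w\s]", sentence):
        replacement = lexicon.get(token.lower())
        if replacement is not None and (max_swaps is None or swaps < max_swaps):
            out.append(replacement)
            swaps += 1
        else:
            out.append(token)
    text = " ".join(out)
    # Re-attach punctuation to the preceding word.
    return re.sub(r"\s+([^\w\s])", r"\1", text)


original = "The cat drinks milk in the house."
fully_switched = code_switch(original, TOY_LEXICON)
lightly_switched = code_switch(original, TOY_LEXICON, max_swaps=1)

print(fully_switched)    # The chat drinks lait in the maison.
print(lightly_switched)  # The chat drinks milk in the house.
```

Varying `max_swaps` gives a crude handle on how much foreign material is embedded in each item, so the same question can be evaluated at several mixing levels.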
While degradation is evident when foreign tokens disrupt English text (even under linguistic constraints), embedding English into other languages often improves comprehension. Though prompting yields mixed results, fine-tuning offers a more stable path to degradation mitigation.
[10] MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation
Xueqing Peng,Lingfei Qian,Yan Wang,Ruoyu Xiang,Yueru He,Yang Ren,Mingyang Jiang,Jeff Zhao,Huan He,Yi Han,Yun Feng,Yuechen Jiang,Yupeng Cao,Haohang Li,Yangyang Yu,Xiaoyu Wang,Penglei Gao,Shengyuan Lin,Keyi Wang,Shanshan Yang,Yilun Zhao,Zhiwei Liu,Peng Lu,Jerry Huang,Suyuchen Wang,Triantafillos Papadopoulos,Polydoros Giannouris,Efstathia Soufleri,Nuo Chen,Guojun Xiong,Zhiyang Deng,Yijia Zhao,Mingquan Lin,Meikang Qiu,Kaleb E Smith,Arman Cohan,Xiao-Yang Liu,Jimin Huang,Alejandro Lopez-Lira,Xi Chen,Junichi Tsujii,Jian-Yun Nie,Sophia Ananiadou,Qianqian Xie
Main category: cs.CL
TL;DR: Recent advances in LLMs have boosted financial NLP, but benchmarks are still limited. This paper introduces MultiFinBen, the first multilingual and multimodal benchmark for evaluating LLMs in the financial domain across different modalities and linguistic settings. It includes novel tasks such as PolyFiQA and OCR-embedded QA, revealing that even the strongest models struggle with complex cross-lingual and multimodal tasks.
Details
Motivation: To address the limitation of existing benchmarks which are monolingual and unimodal, often over-relying on simple tasks and failing to reflect the complexity of real-world financial communication. Method: Introduced MultiFinBen, a multilingual and multimodal benchmark for the global financial domain. It evaluates LLMs across modalities (text, vision, audio) and linguistic settings (monolingual, bilingual, multilingual) on domain-specific tasks. Novel tasks like PolyFiQA-Easy/Expert and EnglishOCR/SpanishOCR were introduced. A dynamic, difficulty-aware selection mechanism was also proposed. Result: Extensive evaluation of 22 state-of-the-art models showed that despite their general multimodal and multilingual capabilities, they struggle significantly when faced with complex cross-lingual and multimodal tasks in the financial domain. Conclusion: MultiFinBen is publicly released to promote transparent, reproducible, and inclusive progress in financial studies and applications. Abstract: Recent advances in large language models (LLMs) have accelerated progress in financial NLP and applications, yet existing benchmarks remain limited to monolingual and unimodal settings, often over-relying on simple tasks and failing to reflect the complexity of real-world financial communication. We introduce MultiFinBen, the first multilingual and multimodal benchmark tailored to the global financial domain, evaluating LLMs across modalities (text, vision, audio) and linguistic settings (monolingual, bilingual, multilingual) on domain-specific tasks. We introduce two novel tasks, including PolyFiQA-Easy and PolyFiQA-Expert, the first multilingual financial benchmarks requiring models to perform complex reasoning over mixed-language inputs; and EnglishOCR and SpanishOCR, the first OCR-embedded financial QA tasks challenging models to extract and reason over information from visual-text financial documents. 
Moreover, we propose a dynamic, difficulty-aware selection mechanism and curate a compact, balanced benchmark rather than simply aggregating existing datasets. Extensive evaluation of 22 state-of-the-art models reveals that even the strongest models, despite their general multimodal and multilingual capabilities, struggle dramatically when faced with complex cross-lingual and multimodal tasks in the financial domain. MultiFinBen is publicly released to foster transparent, reproducible, and inclusive progress in financial studies and applications.
[11] An Interdisciplinary Review of Commonsense Reasoning and Intent Detection
Md Nazmus Sakib
Main category: cs.CL
TL;DR: This review explores recent advances in commonsense reasoning and intent detection, analyzing 28 papers from ACL, EMNLP, and CHI (2020-2025). It highlights emerging trends toward more adaptive, multilingual, and context-aware models, and identifies key gaps.
Details
Motivation: To provide a comprehensive analysis of recent advances in commonsense reasoning and intent detection, two critical areas in natural language understanding. Method: The review organizes and analyzes 28 papers from ACL, EMNLP, and CHI (2020-2025) by methodology and application. It examines commonsense reasoning across zero-shot learning, cultural adaptation, structured evaluation, and interactive contexts. For intent detection, it looks at open-set models, generative formulations, clustering, and human-centered systems. Result: Emerging trends point towards more adaptive, multilingual, and context-aware models. Key gaps remain in grounding, generalization, and benchmark design. Conclusion: The review concludes by highlighting the need for further research to address gaps in grounding, generalization, and benchmark design while continuing to develop adaptive, multilingual, and context-aware models. Abstract: This review explores recent advances in commonsense reasoning and intent detection, two key challenges in natural language understanding. We analyze 28 papers from ACL, EMNLP, and CHI (2020-2025), organizing them by methodology and application. Commonsense reasoning is reviewed across zero-shot learning, cultural adaptation, structured evaluation, and interactive contexts. Intent detection is examined through open-set models, generative formulations, clustering, and human-centered systems. By bridging insights from NLP and HCI, we highlight emerging trends toward more adaptive, multilingual, and context-aware models, and identify key gaps in grounding, generalization, and benchmark design.
[12] Ace-CEFR -- A Dataset for Automated Evaluation of the Linguistic Difficulty of Conversational Texts for LLM Applications
David Kogan,Max Schumacher,Sam Nguyen,Masanori Suzuki,Melissa Smith,Chloe Sophia Bellows,Jared Bernstein
Main category: cs.CL
TL;DR: This paper presents Ace-CEFR, a dataset for evaluating the difficulty of conversational English text passages, and shows that models trained on this dataset can outperform human experts in measuring text difficulty.
Details
Motivation: There is a need to evaluate the language difficulty of short, conversational passages of text for training and filtering Large Language Models (LLMs). Method: Introduced Ace-CEFR, a dataset of English conversational text passages annotated with their text difficulty levels. Experimented with several models including Transformer-based models and LLMs on this dataset. Result: Models trained on Ace-CEFR can measure text difficulty more accurately than human experts and have latency suitable for production environments. Conclusion: The Ace-CEFR dataset has been released to the public to facilitate further research and development. Abstract: There is an unmet need to evaluate the language difficulty of short, conversational passages of text, particularly for training and filtering Large Language Models (LLMs). We introduce Ace-CEFR, a dataset of English conversational text passages expert-annotated with their corresponding level of text difficulty. We experiment with several models on Ace-CEFR, including Transformer-based models and LLMs. We show that models trained on Ace-CEFR can measure text difficulty more accurately than human experts and have latency appropriate to production environments. Finally, we release the Ace-CEFR dataset to the public for research and development.
[13] Automatic Extraction of Clausal Embedding Based on Large-Scale English Text Data
Iona Carslaw,Sivan Milton,Nicolas Navarre,Ciyang Qing,Wataru Uegaki
Main category: cs.CL
TL;DR: This paper presents a methodological approach for detecting and annotating naturally-occurring examples of English embedded clauses in large-scale text data using constituency parsing and a set of parsing heuristics.
Details
Motivation: For linguists, embedded clauses have been of special interest because of their intricate distribution of syntactic and semantic features. Yet, current research relies on schematically created language examples to investigate these constructions, missing out on statistical information and naturally-occurring examples that can be gained from large language corpora. Method: Detecting and annotating naturally-occurring examples of English embedded clauses in large-scale text data using constituency parsing and a set of parsing heuristics. Result: The tool has been evaluated on the dataset Golden Embedded Clause Set (GECS), which includes hand-annotated examples of naturally-occurring English embedded clause sentences. Conclusion: A large-scale dataset of naturally-occurring English embedded clauses has been presented, extracted from the open-source corpus Dolma using the extraction tool. Abstract: For linguists, embedded clauses have been of special interest because of their intricate distribution of syntactic and semantic features. Yet, current research relies on schematically created language examples to investigate these constructions, missing out on statistical information and naturally-occurring examples that can be gained from large language corpora. Thus, we present a methodological approach for detecting and annotating naturally-occurring examples of English embedded clauses in large-scale text data using constituency parsing and a set of parsing heuristics. Our tool has been evaluated on our dataset Golden Embedded Clause Set (GECS), which includes hand-annotated examples of naturally-occurring English embedded clause sentences. Finally, we present a large-scale dataset of naturally-occurring English embedded clauses which we have extracted from the open-source corpus Dolma using our extraction tool.
[14] Abstract Meaning Representation for Hospital Discharge Summarization
Paul Landes,Sitara Rao,Aaron Jeremy Chaise,Barbara Di Eugenio
Main category: cs.CL
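TL;DR: The Achilles heel of large language models (LLMs) is hallucination, which has especially serious consequences in the clinical domain. This paper explores new methods that combine language-based graphs and deep learning models to improve the reliability and trustworthiness of automatic summarization. The method shows impressive reliability results on the MIMIC-III corpus and on clinical notes written by physicians at Anonymous Hospital.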
Details
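Motivation: Automatically generating discharge summaries can reduce physicians' documentation burden and let them focus more on patient care. However, LLMs hallucinate, which can have serious consequences in the clinical domain. A new method is therefore needed to address provenance of content and trustworthiness in automatic summarization. Method: The study proposes a method that combines language-based graphs and deep learning models to generate reliable automatic summaries. The method is applied to the MIMIC-III corpus and to clinical notes written by physicians at Anonymous Hospital. Result: The method shows notable reliability results on the MIMIC-III corpus and on the Anonymous Hospital clinical notes, improving the trustworthiness and quality of automatic summaries. Conclusion: By combining language-based graphs with deep learning models, the study addresses the trustworthiness problem in automatic summarization and opens new possibilities for automatic text generation in the clinical domain. Abstract: The Achilles heel of Large Language Models (LLMs) is hallucination, which has drastic consequences for the clinical domain. This is particularly important with regard to automatically generating discharge summaries (a lengthy medical document that summarizes a hospital in-patient visit). Automatically generating these summaries would free physicians to care for patients and reduce documentation burden. The goal of this work is to discover new methods that combine language-based graphs and deep learning models to address provenance of content and trustworthiness in automatic summarization. Our method shows impressive reliability results on the publicly available Medical Information Mart for Intensive Care III (MIMIC-III) corpus and clinical notes written by physicians at Anonymous Hospital. We provide our method, generated discharge summary output examples, source code and trained models.
[15] Essential-Web v1.0: 24T tokens of organized web data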
Essential AI: Andrew Hojel,Michael Pust,Tim Romanski,Yash Vanjani,Ritvik Kapila,Mohit Parmar,Adarsh Chaluvaraju,Alok Tripathy,Anil Thomas,Ashish Tanwer,Darsh J Shah,Ishaan Shah,Karl Stratos,Khoi Nguyen,Kurt Smith,Michael Callahan,Peter Rushton,Philip Monk,Platon Mazarakis,Saad Jamal,Saurabh Srivastava,Somanshu Singla,Ashish Vaswani
Main category: cs.CL
TL;DR: The paper introduces Essential-Web v1.0, a 24-trillion-token dataset with taxonomy annotations to create competitive web-curated datasets in various fields.
Details
Motivation: To address the costly and inaccessible data pipelines due to the lack of massive, well-organized pre-training datasets for language models. Method: Developed a 24-trillion-token dataset named Essential-Web v1.0 where every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality using EAI-Distill-0.5b model. Result: Achieved competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%) and medical (+8.6%) using SQL-style filters. Conclusion: Essential-Web v1.0 is available on HuggingFace at https://huggingface.co/datasets/EssentialAI/essential-web-v1.0. Abstract: Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model that achieves an annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
[16] Sampling from Your Language Model One Byte at a Time
Jonathan Hayase,Alisa Liu,Noah A. Smith,Sewoong Oh
Main category: cs.CL
TL;DR: An inference-time method converts autoregressive LMs with a BPE tokenizer into character/byte-level LMs, solving the Prompt Boundary Problem and unifying vocabularies for model ensembling and proxy-tuning.
Details
Motivation: Tokenization can distort model generations, cause Prompt Boundary Problems (PBP), hinder interoperability due to mismatching tokenizers, and prevent direct model ensembling. Method: Convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM at inference time without altering its text-level generative distribution. Result: The ensemble and proxy-tuned models perform better on downstream evaluations compared to their individual components. Conclusion: This method efficiently solves the PBP, unifies vocabularies of different tokenizers, and enhances model interoperability. Abstract: Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's generations. For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. This Prompt Boundary Problem (PBP) also arises in languages such as Chinese and in code generation, where tokens often do not line up with syntactic boundaries. Additionally, mismatching tokenizers often hinder model composition and interoperability. For example, it is not possible to directly ensemble models with different tokenizers due to their mismatching vocabularies. To address these issues, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM, without changing its generative distribution at the text level. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time as well as transfer the post-training from one model to another using proxy-tuning.
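The Prompt Boundary Problem described above can be illustrated with a toy example. The sketch below is not the paper's method: it uses a hypothetical four-entry vocabulary and a greedy longest-match tokenizer as a stand-in for BPE (`greedy_tokenize`, `VOCAB` and `is_token_prefix` are illustrative names). It only shows that a prompt ending in a space tokenizes into a sequence that is not a prefix of the tokenization of the natural continuation.

```python
def greedy_tokenize(text, vocab):
    """Longest-match-first tokenizer: a toy stand-in for BPE (assumption)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest remaining substring first, then shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return tokens


def is_token_prefix(prefix, full):
    """True if the token list `prefix` is a prefix of the token list `full`."""
    return full[:len(prefix)] == prefix


# Hypothetical vocabulary: the space is normally fused into " world".
VOCAB = {"hello", " world", " ", "world"}

full_tokens = greedy_tokenize("hello world", VOCAB)   # ['hello', ' world']
good_prompt = greedy_tokenize("hello", VOCAB)         # ['hello']
bad_prompt = greedy_tokenize("hello ", VOCAB)         # ['hello', ' ']

# Without the trailing space the model can still emit ' world' next.
print(is_token_prefix(good_prompt, full_tokens))
# With the trailing space, the prompt has already committed to a lone ' '
# token, so the usual ' world' token can no longer be produced.
print(is_token_prefix(bad_prompt, full_tokens))
```

A byte-level or character-level view of the same model avoids this mismatch because every text prefix is also a valid prefix of the byte sequence.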
We demonstrate in experiments that the ensemble and proxy-tuned models outperform their constituents on downstream evals.
[17] DCRM: A Heuristic to Measure Response Pair Quality in Preference Optimization
Chengyu Huang,Tanya Goyal
Main category: cs.CL
TL;DR: The paper explores the relationship between preference optimization performance and preference datasets in large language models (LLMs). It introduces Distance Calibrated Reward Margin (DCRM) as a metric to assess response pair quality for preference optimization. The authors propose a best-of-$N^2$ pairing method based on DCRM, which selects high-quality response pairs, leading to improved model performance on various benchmarks.
Details
Motivation: Preference optimization performance in LLMs is influenced by the underlying preference datasets. However, the actual differences between preferred and dispreferred responses may not align with what the model should ideally learn. Method: The authors use distance and reward margin to create the Distance Calibrated Reward Margin (DCRM) metric. This quantifies the quality of response pairs used in preference optimization. They study three types of preference datasets and find a correlation between higher DCRM values and better learning outcomes. Based on this, they propose a best-of-$N^2$ pairing method that selects response pairs with the highest DCRM. Result: Empirical results show that the proposed method produces training datasets that improve model performance on benchmarks such as AlpacaEval, MT-Bench, and Arena-Hard compared to existing training sets. Conclusion: The DCRM metric effectively measures the quality of response pairs for preference optimization. Using the best-of-$N^2$ pairing method can enhance the performance of LLMs in various settings. Abstract: Recent research has attempted to associate preference optimization (PO) performance with the underlying preference datasets. In this work, our observation is that the differences between the preferred response $y^+$ and dispreferred response $y^-$ influence what LLMs can learn, which may not match the desirable differences to learn. Therefore, we use distance and reward margin to quantify these differences, and combine them to get Distance Calibrated Reward Margin (DCRM), a metric that measures the quality of a response pair for PO. Intuitively, DCRM encourages minimal noisy differences and maximal desired differences. With this, we study 3 types of commonly used preference datasets, classified along two axes: the source of the responses and the preference labeling function. We establish a general correlation between higher DCRM of the training set and better learning outcome. 
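A best-of-$N^2$ style pairing can be sketched as follows. The summary does not give the DCRM formula, so the score used here (reward margin divided by a text distance) is only a hypothetical stand-in chosen to reflect "maximal desired differences, minimal noisy differences"; `jaccard_distance`, `pair_score` and `best_of_n2` are illustrative names, and the rewards are made-up numbers.

```python
from itertools import permutations


def jaccard_distance(a, b):
    """Toy text distance: 1 - Jaccard overlap of whitespace tokens (assumption)."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0


def pair_score(margin, distance, eps=1e-8):
    """Hypothetical stand-in for DCRM: reward margin calibrated by distance."""
    return margin / (distance + eps)


def best_of_n2(responses, rewards, distance=jaccard_distance):
    """Scan all ordered pairs of N sampled responses and keep the best-scoring one.

    Returns (score, chosen, rejected), or None if no pair has a positive margin.
    Pairs with i == j have zero margin, so they are skipped implicitly.
    """
    best = None
    for i, j in permutations(range(len(responses)), 2):
        margin = rewards[i] - rewards[j]
        if margin <= 0:
            continue  # the chosen response must have the higher reward
        score = pair_score(margin, distance(responses[i], responses[j]))
        if best is None or score > best[0]:
            best = (score, responses[i], responses[j])
    return best


# Made-up responses and rewards for illustration only.
responses = ["a b c", "a b d", "x y z"]
rewards = [1.0, 0.2, 0.0]
score, chosen, rejected = best_of_n2(responses, rewards)
print(chosen, "|", rejected)
```

In this toy run the selected pair is the one whose responses are close in wording but clearly separated in reward, rather than the pair with the largest raw reward gap.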
Inspired by this, we propose a best-of-$N^2$ pairing method that selects response pairs with the highest DCRM. Empirically, in various settings, our method produces training datasets that can further improve models' performance on AlpacaEval, MT-Bench, and Arena-Hard over the existing training sets.
[18] S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models
Tao He,Guang Huang,Yu Yang,Tianshi Xu,Sicheng Zhao,Guiguang Ding,Pengyang Wang,Feng Tian
Main category: cs.CL
TL;DR: The paper introduces S$^4$C, a framework that improves speculative sampling for large language models (LLMs) by incorporating syntactic and semantic coherence, leading to faster and more efficient token generation with fewer resources.
Details
Motivation: To overcome the inference latency issue in LLMs caused by their autoregressive nature, which hinders real-time applications. Method: S$^4$C extends speculative sampling through multi-head drafting for quick token generation and continuous verification tree for effective candidate validation and feature reuse. Result: S$^4$C surpasses baseline methods in mainstream tasks with enhanced efficiency and parallelism, generating more valid tokens using fewer computational resources. It achieves an acceleration ratio of 2.26x-2.60x on Spec-bench benchmarks. Conclusion: S$^4$C is a significant advancement in speculative sampling for LLMs, offering superior performance and efficiency. Abstract: Large language models (LLMs) exhibit remarkable reasoning capabilities across diverse downstream tasks. However, their autoregressive nature leads to substantial inference latency, posing challenges for real-time applications. Speculative sampling mitigates this issue by introducing a drafting phase followed by a parallel validation phase, enabling faster token generation and verification. Existing approaches, however, overlook the inherent coherence in text generation, limiting their efficiency. To address this gap, we propose a Speculative Sampling with Syntactic and Semantic Coherence (S$^4$C) framework, which extends speculative sampling by leveraging multi-head drafting for rapid token generation and a continuous verification tree for efficient candidate validation and feature reuse. Experimental results demonstrate that S$^4$C surpasses baseline methods across mainstream tasks, offering enhanced efficiency, parallelism, and the ability to generate more valid tokens with fewer computational resources. On Spec-bench benchmarks, S$^4$C achieves an acceleration ratio of 2.26x-2.60x, outperforming state-of-the-art methods.
[19] MIST: Towards Multi-dimensional Implicit Bias and Stereotype Evaluation of LLMs via Theory of Mind
Yanlin Li,Hao Liu,Huimin Liu,Yinwei Wei,Yupeng Hu
Main category: cs.CL
TL;DR: This paper proposes an evaluation framework for assessing implicit bias in Large Language Models' Theory of Mind using the Stereotype Content Model and two indirect tasks, revealing complex bias structures.
Details
Motivation: Evaluating implicit bias in Large Language Models' capacity for reasoning about mental states is challenging due to social desirability effects and the multi-dimensional nature of bias. Method: The paper introduces a framework leveraging the Stereotype Content Model (SCM) to reconceptualize bias across Competence, Sociability, and Morality. It includes two indirect tasks: the Word Association Bias Test (WABT) and the Affective Attribution Test (AAT). Result: Experiments on 8 State-of-the-Art LLMs show the framework's ability to uncover complex bias structures such as sociability bias, multi-dimensional divergence, and asymmetric stereotype amplification. Conclusion: The proposed framework provides a more robust methodology for identifying the structural nature of implicit bias in Large Language Models. Abstract: Theory of Mind (ToM) in Large Language Models (LLMs) refers to their capacity for reasoning about mental states, yet failures in this capacity often manifest as systematic implicit bias. Evaluating this bias is challenging, as conventional direct-query methods are susceptible to social desirability effects and fail to capture its subtle, multi-dimensional nature. To this end, we propose an evaluation framework that leverages the Stereotype Content Model (SCM) to reconceptualize bias as a multi-dimensional failure in ToM across Competence, Sociability, and Morality. The framework introduces two indirect tasks: the Word Association Bias Test (WABT) to assess implicit lexical associations and the Affective Attribution Test (AAT) to measure covert affective leanings, both designed to probe latent stereotypes without triggering model avoidance. 
Extensive experiments on 8 State-of-the-Art LLMs demonstrate our framework's capacity to reveal complex bias structures, including pervasive sociability bias, multi-dimensional divergence, and asymmetric stereotype amplification, thereby providing a more robust methodology for identifying the structural nature of implicit bias.
[20] GRAM: A Generative Foundation Reward Model for Reward Generalization
Chenglong Wang,Yang Gan,Yifu Huo,Yongyu Mu,Qiaozhi He,Murun Yang,Bei Li,Tong Xiao,Chunliang Zhang,Tongran Liu,Jingbo Zhu
Main category: cs.CL
TL;DR: The paper explores methods to train reward models using both unlabeled and labeled data, developing a generative reward model that combines unsupervised and supervised learning. It introduces label smoothing as optimizing a regularized pairwise ranking loss, linking generative and discriminative models under the same training objectives. The resulting foundation reward model generalizes well across tasks like response ranking, reinforcement learning, and task adaptation with fine-tuning.
Details
Motivation: Reward models have been crucial in aligning large language models but are typically trained as discriminative models relying only on labeled human preference data. This limits their potential. Method: Develop a generative reward model initially trained via large-scale unsupervised learning and then fine-tuned via supervised learning. Introduce label smoothing which optimizes a regularized pairwise ranking loss. Result: The foundation reward model generalizes well across several tasks including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning, showing significant performance improvements over strong baseline models. Conclusion: The proposed generative reward model trained with both unlabeled and labeled data, along with label smoothing technique, creates a versatile foundation reward model that performs well across different tasks. Abstract: In aligning large language models (LLMs), reward models have played an important role, but are standardly trained as discriminative models and rely only on labeled human preference data. In this paper, we explore methods that train reward models using both unlabeled and labeled data. Building on the generative models in LLMs, we develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning. We also show that by using label smoothing, we are in fact optimizing a regularized pairwise ranking loss. This result, in turn, provides a new view of training reward models, which links generative models and discriminative models under the same class of training objectives. The outcome of these techniques is a foundation reward model, which can be applied to a wide range of tasks with little or no further fine-tuning effort. 
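The label-smoothing statement above can be made concrete for the standard Bradley-Terry pairwise loss. This is a sketch under that assumption and may differ from the paper's exact formulation; the symbols $r$, $\Delta$, $\epsilon$ and $\sigma$ are introduced here for illustration. Let $\Delta = r(x, y^+) - r(x, y^-)$ be the reward margin, $\sigma$ the logistic function, and $\epsilon \in [0, 1)$ the smoothing weight. The label-smoothed loss is

$$\mathcal{L}_\epsilon = -(1-\epsilon)\,\log\sigma(\Delta) - \epsilon\,\log\sigma(-\Delta).$$

Since $\log\sigma(-\Delta) = \log\sigma(\Delta) - \Delta$, this simplifies to

$$\mathcal{L}_\epsilon = -\log\sigma(\Delta) + \epsilon\,\Delta,$$

that is, the usual pairwise ranking loss plus a linear penalty $\epsilon\,\Delta$ that discourages overly large margins. Under this reading, label smoothing acts as a regularizer on the pairwise ranking objective.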
Extensive experiments show that this model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning, achieving significant performance improvements over several strong baseline models.
[21] Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore's languages
Tuan Nguyen,Huy-Dat Tran
Main category: cs.CL
TL;DR: The study proposes a phrase-level mixing method to generate synthetic code-switching (CS) data for improving ASR performance in multilingual settings, focusing on under-resourced Southeast Asian language pairs and demonstrating cost-effective CS-ASR development.
Details
Motivation: Code-switching poses challenges for ASR systems due to the scarcity and high cost of transcribed data, particularly in linguistically complex scenarios. This motivates the need for developing effective methods to build CS-ASR using synthetic data. Method: A phrase-level mixing method is proposed to create synthetic CS data that imitates natural patterns. Monolingual data augmented with this synthetic data is used to fine-tune large pretrained ASR models such as Whisper, MMS, and SeamlessM4T. Result: The experimental results indicate that the proposed training strategy improves ASR performance on both monolingual and CS tests, with the most significant gains observed in Malay-English (BM-EN), followed by Tamil-English (TA-EN) and Mandarin-Malay (ZH-BM). Conclusion: This research establishes a new comprehensive benchmark for CS-ASR in three Southeast Asian language pairs and provides a cost-effective approach for advancing CS-ASR development, which can benefit both academic research and industry applications. Abstract: Code-switching (CS), common in multilingual settings, presents challenges for ASR due to scarce and costly transcribed data caused by linguistic complexity. This study investigates building CS-ASR using synthetic CS data. We propose a phrase-level mixing method to generate synthetic CS data that mimics natural patterns. We use monolingual data augmented with synthetic phrase-mixed CS data to fine-tune large pretrained ASR models (Whisper, MMS, SeamlessM4T). This paper focuses on three under-resourced Southeast Asian language pairs: Malay-English (BM-EN), Mandarin-Malay (ZH-BM), and Tamil-English (TA-EN), establishing a new comprehensive benchmark for CS-ASR to evaluate the performance of leading ASR models. Experimental results show that the proposed training strategy enhances ASR performance on monolingual and CS tests, with BM-EN showing the highest gains, followed by TA-EN and ZH-BM.
This finding offers a cost-effective approach for CS-ASR development, benefiting research and industry.
[22] AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR
Tuan Nguyen,Huy-Dat Tran
Main category: cs.CL
TL;DR: The paper presents AsyncSwitch, a framework that uses web data to improve code-switched ASR systems.
Details
Motivation: Developing code-switched ASR systems is challenging due to language ambiguity and limited exposure to multilingual data. Existing methods for generating synthetic audio are computationally intensive. Method: A three-stage process: (1) Train decoder self-attention and feedforward layers on code-switched text, (2) Align decoder and encoder via cross-attention using limited speech-text data, (3) Fully fine-tune the entire model. Result: Experiments with Whisper on Malay-English code-switching demonstrate a 9.02% relative WER reduction, while improving monolingual performance in Singlish, Malay, and other English variants. Conclusion: AsyncSwitch effectively leverages large-scale, text-rich web data to pre-expose ASR models to diverse code-switched domains before fine-tuning on paired speech-text corpora. Abstract: Developing code-switched ASR systems is challenging due to language ambiguity and limited exposure to multilingual, code-switched data, while collecting such speech is costly. Prior work generates synthetic audio from text, but these methods are computationally intensive and hard to scale. We introduce AsyncSwitch, a novel asynchronous adaptation framework that leverages large-scale, text-rich web data to pre-expose ASR models to diverse code-switched domains before fine-tuning on paired speech-text corpora. Our three-stage process (1) trains decoder self-attention and feedforward layers on code-switched text, (2) aligns decoder and encoder via cross-attention using limited speech-text data, and (3) fully fine-tunes the entire model. Experiments with Whisper on Malay-English code-switching demonstrate a 9.02% relative WER reduction, while improving monolingual performance in Singlish, Malay, and other English variants.
[23] MAS-LitEval : Multi-Agent System for Literary Translation Quality Assessment
Junghwan Kim,Kieun Park,Sohee Park,Hyunggug Kim,Bongwon Suh
Main category: cs.CL
TL;DR: The paper proposes MAS-LitEval, a multi-agent system using LLMs to evaluate literary translations based on terminology, narrative, and style. It outperforms traditional metrics in capturing literary nuances.
Details
Motivation: Traditional metrics like BLEU and METEOR fail to assess cultural nuances and stylistic elements in literary translation due to their focus on lexical overlap. Method: Propose MAS-LitEval, a multi-agent system using Large Language Models (LLMs) to evaluate translations based on terminology, narrative, and style. Result: MAS-LitEval outperformed traditional metrics, with top models scoring up to 0.890 in capturing literary nuances. Conclusion: This work introduces a scalable, nuanced framework for Translation Quality Assessment (TQA), offering a practical tool for translators and researchers. Abstract: Literary translation requires preserving cultural nuances and stylistic elements, which traditional metrics like BLEU and METEOR fail to assess due to their focus on lexical overlap. This oversight neglects the narrative consistency and stylistic fidelity that are crucial for literary works. To address this, we propose MAS-LitEval, a multi-agent system using Large Language Models (LLMs) to evaluate translations based on terminology, narrative, and style. We tested MAS-LitEval on translations of The Little Prince and A Connecticut Yankee in King Arthur's Court, generated by various LLMs, and compared it to traditional metrics. MAS-LitEval outperformed these metrics, with top models scoring up to 0.890 in capturing literary nuances. This work introduces a scalable, nuanced framework for Translation Quality Assessment (TQA), offering a practical tool for translators and researchers.
[24] ELI-Why: Evaluating the Pedagogical Utility of Language Model Explanations
Brihi Joshi,Keyu He,Sahana Ramnath,Sadra Sabouri,Kaitlyn Zhou,Souti Chattopadhyay,Swabha Swayamdipta,Xiang Ren
Main category: cs.CL
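TL;DR: Although language models are widely used in education, their ability to tailor responses to learners with different knowledge backgrounds and informational needs remains under-explored. The paper introduces ELI-Why, a benchmark of 13.4K "Why" questions for evaluating the pedagogical capabilities of language models. Two large-scale human studies assess how well model-generated explanatory answers suit three educational grades: elementary, high-school and graduate school. GPT-4-generated explanations match their intended educational background only 50% of the time, compared with 79% for lay human-curated explanations. Users also rated GPT-4-generated explanations 20% less suited on average to their informational needs than lay human-curated ones. Automated evaluation metrics further show that explanations generated by different language model families are hard to distinguish by grade level, which limits their pedagogical effectiveness.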
Details
Motivation: Language models used in education currently lack the ability to tailor responses to learners' varied informational needs and knowledge backgrounds, which motivates exploring how to use them better for personalized teaching. Method: The paper introduces the ELI-Why benchmark of 13.4K "Why" questions and conducts two large-scale human studies. In the first, human raters act as "educators" and assess how well model explanations fit different educational grades. In the second, human raters act as "learners" and assess whether an explanation fits their own informational needs. Automated evaluation metrics are also used to analyze the explanations generated by different language models. Result: GPT-4-generated explanations match their intended educational background only 50% of the time, well below the 79% of lay human-curated explanations. Users also rated GPT-4-generated explanations 20% less suited to their informational needs than lay human-curated ones. Automated evaluation metrics show that explanations generated by different language models are hard to distinguish by grade level, which limits their pedagogical effectiveness. Conclusion: Although language models are widely used in education, there is still substantial room for improvement in generating explanations that fit a specific educational background and learner needs. Future work should focus on making language models more effective and adaptable for personalized teaching. Abstract: Language models today are widely used in education, yet their ability to tailor responses for learners with varied informational needs and knowledge backgrounds remains under-explored. To this end, we introduce ELI-Why, a benchmark of 13.4K "Why" questions to evaluate the pedagogical capabilities of language models. We then conduct two extensive human studies to assess the utility of language model-generated explanatory answers (explanations) on our benchmark, tailored to three distinct educational grades: elementary, high-school and graduate school. In our first study, human raters assume the role of an "educator" to assess model explanations' fit to different educational grades. We find that GPT-4-generated explanations match their intended educational background only 50% of the time, compared to 79% for lay human-curated explanations. In our second study, human raters assume the role of a learner to assess if an explanation fits their own informational needs. Across all educational backgrounds, users deemed GPT-4-generated explanations 20% less suited on average to their informational needs, when compared to explanations curated by lay people. Additionally, automated evaluation metrics reveal that explanations generated across different language model families for different informational needs remain indistinguishable in their grade-level, limiting their pedagogical effectiveness.
[25] Intended Target Identification for Anomia Patients with Gradient-based Selective Augmentation
Jongho Kim,Romain Storaï,Seung-won Hwang
Main category: cs.CL
TL;DR: The study explores using language models to help patients with anomia by addressing term failure and semantic paraphasia through gradient-based selective augmentation.
Details
Motivation: To assist patients experiencing anomia, a condition where they struggle to identify the names of items, by overcoming challenges like unseen relevant terms and perturbed terms due to semantic paraphasia. Method: Propose a method to robustify the model against semantically paraphasic errors and enhance it with unseen terms using gradient-based selective augmentation. The gradient value manages data quality amidst errors, and variance guides inclusion of unseen terms. Result: Showed strong performance compared to baselines when evaluated on the Tip-of-the-Tongue dataset and real patient data from AphasiaBank. Conclusion: Language models can effectively aid anomia patients by handling challenges of term failure and semantic paraphasia. Abstract: In this study, we investigate the potential of language models (LMs) in aiding patients experiencing anomia, a difficulty identifying the names of items. Identifying the intended target item from patient's circumlocution involves the two challenges of term failure and error: (1) The terms relevant to identifying the item remain unseen. (2) What makes the challenge unique is inherent perturbed terms by semantic paraphasia, which are not exactly related to the target item, hindering the identification process. To address each, we propose robustifying the model from semantically paraphasic errors and enhancing the model with unseen terms with gradient-based selective augmentation. Specifically, the gradient value controls augmented data quality amid semantic errors, while the gradient variance guides the inclusion of unseen but relevant terms. Due to limited domain-specific datasets, we evaluate the model on the Tip-of-the-Tongue dataset as an intermediary task and then apply our findings to real patient data from AphasiaBank. 
Our results demonstrate strong performance against baselines, aiding anomia patients by addressing the outlined challenges.
[26] AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
Jingxu Xie,Dylan Xu,Xuandong Zhao,Dawn Song
Main category: cs.CL
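TL;DR: The paper introduces AgentSynth, a scalable, low-cost pipeline that automatically synthesizes tasks and trajectory datasets for generalist computer-use agents by composing simple subtasks into long-horizon tasks of controllable difficulty.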
Details
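Motivation: Generalist computer-use agents need high-quality task and trajectory datasets, and human annotation is orders of magnitude more expensive than automated synthesis. Method: An LLM-based task proposer guided by a persona proposes a task, an execution agent completes it and logs the trajectory, and the process repeats to form a sequence of subtasks that a separate agent summarizes into a composite task; difficulty is controlled by the number of subtasks. Result: The pipeline yields over 6,000 diverse and realistic tasks at an average cost of \$0.60 per trajectory, and state-of-the-art LLM agents drop from 18% success at difficulty level 1 to 4% at level 6. Conclusion: AgentSynth provides a scalable, cost-efficient benchmark with strong discriminative power; code and data are publicly available. Abstract: We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents. Leveraging information asymmetry, AgentSynth constructs subtasks that are simple during generation but significantly more challenging when composed into long-horizon tasks, enabling the creation of over 6,000 diverse and realistic tasks. Our pipeline begins with an LLM-based task proposer guided by a persona, followed by an execution agent that completes the task and logs the trajectory. This process is repeated iteratively to form a sequence of subtasks, which are then summarized by a separate agent into a composite task of controllable difficulty. A key strength of AgentSynth is its ability to precisely modulate task complexity by varying the number of subtasks. Empirical evaluations show that state-of-the-art LLM agents suffer a steep performance drop, from 18% success at difficulty level 1 to just 4% at level 6, highlighting the benchmark's difficulty and discriminative power. Moreover, our pipeline achieves a low average cost of \$0.60 per trajectory, orders of magnitude cheaper than human annotations. Our code and data are publicly available at https://github.com/sunblaze-ucb/AgentSynth
[27] CausalDiffTab: Mixed-Type Causal-Aware Diffusion for Tabular Data Generation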
Jia-Chen Zhang,Zheng Zhou,Yu-Jie Xiong,Chun-Ming Xia,Fei Dai
Main category: cs.CL
TL;DR: The paper introduces CausalDiffTab, a diffusion model-based generative model for mixed tabular data with numerical and categorical features. It also proposes hybrid adaptive causal regularization to improve performance without sacrificing generative capabilities. Experiments on seven datasets show it outperforms baseline methods.
Details
Motivation: Generating high-quality mixed-type tabular data is challenging due to heterogeneous data types, complex inter-variable relationships, and intricate column-wise distributions. Existing solutions do not adequately address these issues. Method: CausalDiffTab uses a diffusion model to generate mixed tabular data and incorporates a hybrid adaptive causal regularization method based on Hierarchical Prior Fusion to adaptively control the weight of causal regularization. Result: Experiments on seven datasets demonstrate that CausalDiffTab surpasses baseline methods in all metrics. Conclusion: CausalDiffTab effectively generates high-quality mixed tabular data and improves performance through hybrid adaptive causal regularization. The code is publicly available. Abstract: Training data has been proven to be one of the most critical components in training generative AI. However, obtaining high-quality data remains challenging, with data privacy issues presenting a significant hurdle. To address the need for high-quality data, synthetic data has emerged as a mainstream solution, demonstrating impressive performance in areas such as images, audio, and video. Generating mixed-type data, especially high-quality tabular data, still faces significant challenges. These primarily include its inherent heterogeneous data types, complex inter-variable relationships, and intricate column-wise distributions. In this paper, we introduce CausalDiffTab, a diffusion model-based generative model specifically designed to handle mixed tabular data containing both numerical and categorical features, while being more flexible in capturing complex interactions among variables. We further propose a hybrid adaptive causal regularization method based on the principle of Hierarchical Prior Fusion. This approach adaptively controls the weight of causal regularization, enhancing the model's performance without compromising its generative capabilities.
Comprehensive experiments conducted on seven datasets demonstrate that CausalDiffTab outperforms baseline methods across all metrics. Our code is publicly available at: https://github.com/Godz-z/CausalDiffTab.
[28] Explainable Detection of Implicit Influential Patterns in Conversations via Data Augmentation
Sina Abdidizaji,Md Kowsher,Niloofar Yousefi,Ivan Garibay
Main category: cs.CL
TL;DR: In the digital age, malicious actors use implicit influential verbal patterns in conversations to manipulate and extract information. This paper proposes an enhanced model that leverages advanced language models to improve detection of these patterns, identifying their specific locations within conversations, resulting in a 6% improvement in detection and significant enhancements in related classification tasks.
Details
Motivation: With the rise of digital platforms, there is an increasing need to detect subtle and implicit linguistic strategies used to influence public perception and extract information. Current models are adept at finding explicit patterns but struggle with implicit ones embedded in conversations. Method: The paper introduces an improved approach for detecting implicit influential verbal patterns by augmenting an existing dataset using reasoning capabilities of state-of-the-art language models. The designed framework can also identify the exact locations of these patterns within conversations. Result: The proposed framework led to a 6% increase in detecting implicit influential patterns. It also improved multi-label classification tasks regarding influence techniques by 33% and victim vulnerability by 43%. Conclusion: This study successfully demonstrates the effectiveness of the new model in enhancing the detection and understanding of implicit influential verbal patterns in conversations. Abstract: In the era of digitalization, as individuals increasingly rely on digital platforms for communication and news consumption, various actors employ linguistic strategies to influence public perception. While models have become proficient at detecting explicit patterns, which typically appear in texts as single remarks referred to as utterances, such as social media posts, malicious actors have shifted toward utilizing implicit influential verbal patterns embedded within conversations. These verbal patterns aim to mentally penetrate the victim's mind in order to influence them, enabling the actor to obtain the desired information through implicit means. This paper presents an improved approach for detecting such implicit influential patterns. Furthermore, the proposed model is capable of identifying the specific locations of these influential elements within a conversation. 
To achieve this, the existing dataset was augmented using the reasoning capabilities of state-of-the-art language models. Our designed framework resulted in a 6% improvement in the detection of implicit influential patterns in conversations. Moreover, this approach improved the multi-label classification tasks related to both the techniques used for influence and the vulnerability of victims by 33% and 43%, respectively.
[29] Chaining Event Spans for Temporal Relation Grounding
Jongho Kim,Dohyeon Lee,Minsoo Kim,Seung-won Hwang
Main category: cs.CL
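TL;DR: Proposes a new method, the Timeline Reasoning Network (TRN), which elicits proper reasoning behavior by predicting the time spans of events, addressing the unreliable results that answer overlap causes in existing methods. Experiments on the TORQUE and TB-dense datasets show that TRN outperforms previous methods on temporal reading comprehension and relation extraction tasks.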
Details
Motivation: Accurately understanding temporal relations between events is a key component of many tasks, such as temporal reading comprehension (TRC) and relation extraction (TRE). However, existing solutions rely on answer overlap as a proxy label for contrasting similar and dissimilar questions, which can lead to unreliable results because two dissimilar questions may coincidentally share the same answer. Method: Proposes a novel approach, the Timeline Reasoning Network (TRN), which operates in a two-step inductive reasoning process: in the first step, the model initially answers each question using semantic and syntactic information; in the second step, it chains multiple questions about the same event to predict a timeline, which is then used to ground the answers. Result: Experiments on the TORQUE and TB-dense datasets show that TRN resolves the spurious-overlap problem by effectively using the predicted timeline, and thereby outperforms previous methods on the TRC and TRE tasks. Conclusion: TRN is an effective new approach to temporal reading comprehension and relation extraction that overcomes the spurious-overlap problem of existing methods by predicting timelines, improving reasoning performance. Abstract: Accurately understanding temporal relations between events is a critical building block of diverse tasks, such as temporal reading comprehension (TRC) and relation extraction (TRE). For example, in TRC, we need to understand the temporal semantic differences between the following two questions that are lexically near-identical: "What finished right before the decision?" or "What finished right after the decision?". To discern the two questions, existing solutions have relied on answer overlaps as a proxy label to contrast similar and dissimilar questions. However, we claim that answer overlap can lead to unreliable results, due to spurious overlaps of two dissimilar questions with coincidentally identical answers. To address the issue, we propose a novel approach that elicits proper reasoning behaviors through a module for predicting time spans of events. We introduce the Timeline Reasoning Network (TRN) operating in a two-step inductive reasoning process: In the first step, the model initially answers each question with semantic and syntactic information. The next step chains multiple questions on the same event to predict a timeline, which is then used to ground the answers. Results on TORQUE and TB-dense, for the TRC and TRE tasks respectively, demonstrate that TRN outperforms previous methods by effectively resolving the spurious overlaps using the predicted timeline.
[30] Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team
Md Tanzib Hosain,Salman Rahman,Md Kishor Morol,Md Rizwan Parvez
Main category: cs.CL
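TL;DR: Xolver is a training-free multi-agent reasoning framework that equips large language models with a continually evolving memory of experience; by integrating diverse experience modalities, it improves reasoning and outperforms specialized reasoning agents on multiple benchmarks.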
Details
Motivation: Current large language models have made progress on complex reasoning, but they typically operate in isolation and cannot accumulate or integrate experiential knowledge, whereas expert problem solvers draw on rich experience to improve their problem solving. Method: Introduces the Xolver framework, which equips a black-box large language model with a persistent, evolving memory of holistic experience, integrating diverse experience modalities including external and self-retrieval, tool use, collaborative interactions, agent-driven evaluation, and iterative refinement. Result: Xolver consistently outperforms specialized reasoning agents across multiple benchmarks, surpasses advanced models even with lightweight backbones, and achieves the best results on GSM8K, AIME'24, AIME'25, Math-500, and LiveCodeBench-V5. Conclusion: Holistic experience learning is a key step toward generalist agents capable of expert-level reasoning. Abstract: Despite impressive progress on complex reasoning, current large language models (LLMs) typically operate in isolation - treating each problem as an independent attempt, without accumulating or integrating experiential knowledge. In contrast, expert problem solvers - such as Olympiad or programming contest teams - leverage a rich tapestry of experiences: absorbing mentorship from coaches, developing intuition from past problems, leveraging knowledge of tool usage and library functionality, adapting strategies based on the expertise and experiences of peers, continuously refining their reasoning through trial and error, and learning from other related problems even during competition. We introduce Xolver, a training-free multi-agent reasoning framework that equips a black-box LLM with a persistent, evolving memory of holistic experience. Xolver integrates diverse experience modalities, including external and self-retrieval, tool use, collaborative interactions, agent-driven evaluation, and iterative refinement. By learning from relevant strategies, code fragments, and abstract reasoning patterns at inference time, Xolver avoids generating solutions from scratch - marking a transition from isolated inference toward experience-aware language agents. Built on both open-weight and proprietary models, Xolver consistently outperforms specialized reasoning agents. Even with lightweight backbones (e.g., QWQ-32B), it often surpasses advanced models including Qwen3-235B, Gemini 2.5 Pro, o3, and o4-mini-high. 
With o3-mini-high, it achieves new best results on GSM8K (98.1%), AIME'24 (94.4%), AIME'25 (93.7%), Math-500 (99.8%), and LiveCodeBench-V5 (91.6%) - highlighting holistic experience learning as a key step toward generalist agents capable of expert-level reasoning. Code and data are available at https://kagnlp.github.io/xolver.github.io/.
[31] A Multi-Expert Structural-Semantic Hybrid Framework for Unveiling Historical Patterns in Temporal Knowledge Graphs
Yimin Deng,Yuxia Wu,Yejing Wang,Guoshuai Zhao,Li Zhu,Qidong Liu,Derong Xu,Zichuan Fu,Xian Wu,Yefeng Zheng,Xiangyu Zhao,Xueming Qian
Main category: cs.CL
TL;DR: The paper proposes a MESH framework with three expert modules integrating structural and semantic information for temporal knowledge graph reasoning, showing effectiveness through experiments.
Details
Motivation: Existing methods for temporal knowledge graph reasoning focus on either graph structure learning or semantic reasoning but fail to integrate both perspectives. They also lack the ability to distinguish between historical and non-historical events, limiting generalization across temporal contexts. Method: The proposed MESH framework employs three kinds of expert modules to integrate structural and semantic information, providing guidance for the reasoning process tailored to different events. Result: Extensive experiments conducted on three datasets demonstrate the effectiveness of the MESH framework in temporal knowledge graph reasoning. Conclusion: The MESH framework successfully integrates dual reasoning perspectives and captures differences between event types, improving generalization and prediction capabilities. Abstract: Temporal knowledge graph reasoning aims to predict future events with knowledge of existing facts and plays a key role in various downstream tasks. Previous methods focused on either graph structure learning or semantic reasoning, failing to integrate dual reasoning perspectives to handle different prediction scenarios. Moreover, they lack the capability to capture the inherent differences between historical and non-historical events, which limits their generalization across different temporal contexts. To this end, we propose a Multi-Expert Structural-Semantic Hybrid (MESH) framework that employs three kinds of expert modules to integrate both structural and semantic information, guiding the reasoning process for different events. Extensive experiments on three datasets demonstrate the effectiveness of our approach.
[32] Re-Initialization Token Learning for Tool-Augmented Large Language Models
Chenghao Li,Liu Liu,Baosheng Yu,Jiayan Qiu,Yibing Zhan
Main category: cs.CL
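TL;DR: Proposes a new token learning method that improves large language models so they integrate better with external tools, thereby strengthening their ability to solve complex tasks.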
Details
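Motivation: Despite their strong performance, large language models still struggle with complex tasks such as numerical reasoning and plan generation. Integrating external tools (such as calculators and databases) into large language models is crucial for enhancing problem-solving capabilities. However, current methods fail to adequately account for the relationship between tool tokens and word tokens, limiting adaptability within pre-trained large language models. Method: The authors propose a novel token learning method that, from the perspective of initialization, aligns tool tokens with the existing word embedding space to improve model performance. Prior token embeddings are first constructed for each tool from its name or description and are used to initialize and regularize the learnable tool token embeddings, ensuring that the learned embeddings are well aligned with the word token space and improving tool call accuracy. Result: The method is evaluated on the GSM8K-XL, FuncQA, KAMEL, and VirtualHome datasets for tasks including numerical reasoning, knowledge-based question answering, and embodied plan generation. The results show clear improvements over recent baselines such as CoT, REACT, ICL, and ToolkenGPT, indicating that the method effectively augments large language models with tools through relevant tokens across diverse domains. Conclusion: The proposed token learning method addresses the insufficient treatment of the relationship between tool tokens and word tokens in existing methods and significantly improves the ability of large language models to use tools across different domains. Abstract: Large language models have demonstrated exceptional performance, yet struggle with complex tasks such as numerical reasoning and plan generation. Integrating external tools, such as calculators and databases, into large language models (LLMs) is crucial for enhancing problem-solving capabilities. Current methods assign a unique token to each tool, enabling LLMs to call tools through token prediction, similar to word generation. However, this approach fails to account for the relationship between tool and word tokens, limiting adaptability within pre-trained LLMs. To address this issue, we propose a novel token learning method that aligns tool tokens with the existing word embedding space from the perspective of initialization, thereby enhancing model performance. We begin by constructing prior token embeddings for each tool based on the tool's name or description, which are used to initialize and regularize the learnable tool token embeddings. This ensures the learned embeddings are well-aligned with the word token space, improving tool call accuracy. We evaluate the method on tasks such as numerical reasoning, knowledge-based question answering, and embodied plan generation using GSM8K-XL, FuncQA, KAMEL, and VirtualHome datasets. 
The results demonstrate clear improvements over recent baselines, including CoT, REACT, ICL, and ToolkenGPT, indicating that our approach effectively augments LLMs with tools through relevant tokens across diverse domains.
[33] From What to Respond to When to Respond: Timely Response Generation for Open-domain Dialogue Agents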
Seongbo Jang,Minjin Jeon,Jaehoon Lee,Seonghyeon Lee,Dongha Lee,Hwanjo Yu
Main category: cs.CL
TL;DR: The paper introduces timely dialogue response generation, a novel task for language models to predict appropriate time intervals and generate time-conditioned responses. They create the TimelyChat benchmark, a large-scale training dataset, and the Timer dialogue agent which outperforms other models in evaluations.
Details
Motivation: Current research on dialogue response generation mainly focuses on generating coherent responses based on textual context, but lacks exploration of responding grounded on temporal context. Method: Propose the timely dialogue response generation task, introduce the TimelyChat benchmark, construct a large-scale training dataset using unlabeled event knowledge and an LLM, and train the Timer dialogue agent to predict time intervals and generate corresponding responses. Result: Timer surpasses prompting-based LLMs and other fine-tuned baselines in both turn-level and dialogue-level evaluations. Conclusion: The authors have successfully created a new task, benchmark, training dataset, and dialogue agent for timely dialogue response generation, demonstrating its effectiveness through experiments. Abstract: While research on dialogue response generation has primarily focused on generating coherent responses conditioning on textual context, the critical question of when to respond grounded on the temporal context remains underexplored. To bridge this gap, we propose a novel task called timely dialogue response generation and introduce the TimelyChat benchmark, which evaluates the capabilities of language models to predict appropriate time intervals and generate time-conditioned responses. Additionally, we construct a large-scale training dataset by leveraging unlabeled event knowledge from a temporal commonsense knowledge graph and employing a large language model (LLM) to synthesize 55K event-driven dialogues. We then train Timer, a dialogue agent designed to proactively predict time intervals and generate timely responses that align with those intervals. Experimental results show that Timer outperforms prompting-based LLMs and other fine-tuned baselines in both turn-level and dialogue-level evaluations. We publicly release our data, model, and code.
[34] Expectation Confirmation Preference Optimization for Multi-Turn Conversational Recommendation Agent
Xueyang Feng,Jingsen Zhang,Jiakai Tang,Wei Li,Guohao Cai,Xu Chen,Quanyu Dai,Yue Zhu,Zhenhua Dong
Main category: cs.CL
TL;DR: The paper presents ECPO, a new multi-turn preference optimization paradigm that uses Expectation Confirmation Theory to improve user satisfaction in Conversational Recommendation Agents (CRAs) powered by Large Language Models (LLMs). It reduces sampling overhead and enhances interaction capabilities.
Details
Motivation: To solve the issue of short-sighted responses from CRAs and the high cost and poor performance of preference optimization in multi-turn dialogues. Method: Introduced ECPO which leverages Expectation Confirmation Theory for turn-level preference optimization and an LLM-based user simulator AILO for simulating user feedback. Result: Experimental results indicate significant enhancement in CRA's interaction capabilities, with improvements in both efficiency and effectiveness compared to existing methods. Conclusion: ECPO successfully optimizes unsatisfactory responses and improves user satisfaction in multi-turn dialogues while reducing sampling overhead. Abstract: Recent advancements in Large Language Models (LLMs) have significantly propelled the development of Conversational Recommendation Agents (CRAs). However, these agents often generate short-sighted responses that fail to sustain user guidance and meet expectations. Although preference optimization has proven effective in aligning LLMs with user expectations, it remains costly and performs poorly in multi-turn dialogue. To address this challenge, we introduce a novel multi-turn preference optimization (MTPO) paradigm ECPO, which leverages Expectation Confirmation Theory to explicitly model the evolution of user satisfaction throughout multi-turn dialogues, uncovering the underlying causes of dissatisfaction. These causes can be utilized to support targeted optimization of unsatisfactory responses, thereby achieving turn-level preference optimization. ECPO ingeniously eliminates the significant sampling overhead of existing MTPO methods while ensuring the optimization process drives meaningful improvements. To support ECPO, we introduce an LLM-based user simulator, AILO, to simulate user feedback and perform expectation confirmation during conversational recommendations. 
Experimental results show that ECPO significantly enhances CRA's interaction capabilities, delivering notable improvements in both efficiency and effectiveness over existing MTPO methods.
[35] Evaluation Should Not Ignore Variation: On the Impact of Reference Set Choice on Summarization Metrics
Silvia Casola,Yang Janet Liu,Siyao Peng,Oliver Kraus,Albert Gatt,Barbara Plank
Main category: cs.CL
TL;DR: The study explores the instability of reference-based metrics in summarization evaluation using three datasets, revealing significant inconsistency in model rankings and weak correlation with human judgments, especially for LLM outputs. It suggests considering reference set variation to improve evaluation consistency.
Details
Motivation: Summarization evaluation often overlooks the variation in language production, focusing on single or limited reference sets which may not adequately capture diverse communication styles and intents. Method: The work investigates the sensitivity of popular reference-based metrics by analyzing their performance across different reference sets within three summarization datasets: SummEval, GUMSum, and DUC2004. Human judgments on LLM outputs are also collected to assess metric correlations. Result: Many widely-used metrics, particularly n-gram based ones like ROUGE, show significant instability with varying model rankings depending on the reference sets. Correlation between these metrics and human judgments is found to be weak-to-none. Conclusion: To enhance reliability and consistency in summarization evaluation, it is recommended to account for reference set variation, especially when evaluating large language models. Abstract: Human language production exhibits remarkable richness and variation, reflecting diverse communication styles and intents. However, this variation is often overlooked in summarization evaluation. While having multiple reference summaries is known to improve correlation with human judgments, the impact of using different reference sets on reference-based metrics has not been systematically investigated. This work examines the sensitivity of widely used reference-based metrics in relation to the choice of reference sets, analyzing three diverse multi-reference summarization datasets: SummEval, GUMSum, and DUC2004. We demonstrate that many popular metrics exhibit significant instability. This instability is particularly concerning for n-gram-based metrics like ROUGE, where model rankings vary depending on the reference sets, undermining the reliability of model comparisons. 
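The reference-set sensitivity described above can be reproduced in miniature. The following is a minimal sketch, not the authors' evaluation code: it uses a simplified unigram-overlap F1 as a stand-in for ROUGE-1 (no stemming or stopword handling), scores against a reference set by taking the best-matching reference, and relies on toy sentences invented for illustration. The function names `rouge1_f1`, `multi_ref_score`, and `rank_systems` are hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def multi_ref_score(candidate: str, references: list[str]) -> float:
    """Score against a reference set using the best-matching reference."""
    return max(rouge1_f1(candidate, ref) for ref in references)


def rank_systems(outputs: dict[str, str], references: list[str]) -> list[str]:
    """Return system names ordered from best to worst for a reference set."""
    return sorted(
        outputs,
        key=lambda name: multi_ref_score(outputs[name], references),
        reverse=True,
    )


# Toy data (invented): two equally valid references and two system outputs.
ref_a = "the cat sat on the mat"
ref_b = "a feline rested on a rug"
outputs = {
    "system_1": "the cat sat on a mat",
    "system_2": "a feline rested on the rug",
}

ranking_with_ref_a = rank_systems(outputs, [ref_a])
ranking_with_ref_b = rank_systems(outputs, [ref_b])
print(ranking_with_ref_a, ranking_with_ref_b)
```

In this toy case the two single-reference sets rank the systems in opposite orders, which is the kind of ranking flip the abstract reports for n-gram-based metrics.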
We also collect human judgments on LLM outputs for genre-diverse data and examine their correlation with metrics to supplement existing findings beyond newswire summaries, finding weak-to-no correlation. Taken together, we recommend incorporating reference set variation into summarization evaluation to enhance consistency alongside correlation with human judgments, especially when evaluating LLMs.
[36] A Vision for Geo-Temporal Deep Research Systems: Towards Comprehensive, Transparent, and Reproducible Geo-Temporal Information Synthesis
Bruno Martins,Piotr Szymański,Piotr Gramacki
Main category: cs.CL
TL;DR: The paper outlines a vision for next-generation deep research systems that integrate geo-temporal reasoning, enhancing capabilities to answer context-rich questions.
Details
Motivation: Current deep research systems lack geo-temporal capabilities essential for answering questions with geographic and/or temporal constraints. Method: Augment retrieval and synthesis processes with the ability to handle geo-temporal constraints, supported by open and reproducible infrastructures and rigorous evaluation protocols. Result: Identification of technical, infrastructural, and evaluative challenges in integrating geo-temporal reasoning into deep research pipelines. Conclusion: The outlined vision aims towards more advanced and geo-temporally aware deep research systems, potentially impacting AI-driven information access. Abstract: The emergence of Large Language Models (LLMs) has transformed information access, with current LLMs also powering deep research systems that can generate comprehensive report-style answers, through planned iterative search, retrieval, and reasoning. Still, current deep research systems lack the geo-temporal capabilities that are essential for answering context-rich questions involving geographic and/or temporal constraints, frequently occurring in domains like public health, environmental science, or socio-economic analysis. This paper reports our vision towards next generation systems, identifying important technical, infrastructural, and evaluative challenges in integrating geo-temporal reasoning into deep research pipelines. We argue for augmenting retrieval and synthesis processes with the ability to handle geo-temporal constraints, supported by open and reproducible infrastructures and rigorous evaluation protocols. Our vision outlines a path towards more advanced and geo-temporally aware deep research systems, of potential impact to the future of AI-driven information access.
[37] Digital Gatekeepers: Google's Role in Curating Hashtags and Subreddits
Amrit Poudel,Yifan Ding,Jurgen Pfeffer,Tim Weninger
Main category: cs.CL
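TL;DR: Search engines such as Google regulate content visibility through algorithms, shaping the information users encounter. The study finds that Google suppresses subreddits and hashtags related to adult content, conspiracy theories, advertisements, and cryptocurrencies while promoting high-engagement content, thereby shaping public discourse.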
Details
Motivation: To understand how search engines act as digital gatekeepers, regulating the visibility of social media content through algorithms and shaping the information users encounter. Method: Compares search engine results with nonsampled data from Reddit and Twitter/X to reveal systematic biases in content visibility. Result: Google's algorithms suppress subreddits and hashtags related to sexually explicit content, conspiracy theories, advertisements, and cryptocurrencies, while promoting high-engagement content. Conclusion: Google's gatekeeping practices influence public discourse by curating the social media narratives available to users. Abstract: Search engines play a crucial role as digital gatekeepers, shaping the visibility of Web and social media content through algorithmic curation. This study investigates how search engines like Google selectively promote or suppress certain hashtags and subreddits, impacting the information users encounter. By comparing search engine results with nonsampled data from Reddit and Twitter/X, we reveal systematic biases in content visibility. Google's algorithms tend to suppress subreddits and hashtags related to sexually explicit material, conspiracy theories, advertisements, and cryptocurrencies, while promoting content associated with higher engagement. These findings suggest that Google's gatekeeping practices influence public discourse by curating the social media narratives available to users.
[38] ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection
Lucile Favero,Daniel Frases,Juan Antonio Pérez-Ortiz,Tanja Käser,Nuria Oliver
Main category: cs.CL
TL;DR: The paper explores using LLMs to foster deeper reasoning by generating critical questions in debate interventions, achieving first place in a shared task competition.
Details
Motivation: Concerns about chat interfaces based on LLMs promoting superficial learning and undermining critical thinking skills prompted the exploration of LLMs' potential to generate critical questions that challenge unsupported or vague claims. Method: A two-step framework is proposed involving two small-scale open source language models: a Questioner that generates multiple candidate questions and a Judge that selects the most relevant ones. Result: The system ranked first in the shared task competition focused on automatic critical question generation at the 12th Workshop on Argument Mining. Conclusion: The proposed LLM-based approach has the potential to encourage critical engagement with argumentative texts. Abstract: The widespread adoption of chat interfaces based on Large Language Models (LLMs) raises concerns about promoting superficial learning and undermining the development of critical thinking skills. Instead of relying on LLMs purely for retrieving factual information, this work explores their potential to foster deeper reasoning by generating critical questions that challenge unsupported or vague claims in debate interventions. This study is part of a shared task of the 12th Workshop on Argument Mining, co-located with ACL 2025, focused on automatic critical question generation. We propose a two-step framework involving two small-scale open source language models: a Questioner that generates multiple candidate questions and a Judge that selects the most relevant ones. Our system ranked first in the shared task competition, demonstrating the potential of the proposed LLM-based approach to encourage critical engagement with argumentative texts.
[39] Thunder-NUBench: A Benchmark for LLMs' Sentence-Level Negation Understanding
Yeonkyoung So,Gyuseong Lee,Sungmok Jung,Joonhak Lee,JiA Kang,Sangho Kim,Jaejin Lee
Main category: cs.CL
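TL;DR: This study presents Thunder-NUBench, a new benchmark designed specifically to evaluate sentence-level negation understanding in large language models. The benchmark comprises manually curated sentence-negation pairs and a multiple-choice dataset, enabling in-depth evaluation of models' understanding across different types of negation structures.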
Details
Motivation: Existing benchmarks usually treat negation as a side case within broader tasks such as natural language inference, so benchmarks dedicated to negation understanding are lacking. Method: Introduces the Thunder-NUBench benchmark, which goes beyond surface-level cue detection by contrasting standard negation with structurally diverse alternatives (such as local negation, contradiction, and paraphrase), and uses manually curated sentence-negation pairs and a multiple-choice dataset for in-depth evaluation. Result: Provides tools and datasets for in-depth evaluation of models' understanding of different types of negation structures. Conclusion: Thunder-NUBench offers a new, comprehensive benchmark for evaluating sentence-level negation understanding in large language models. Abstract: Negation is a fundamental linguistic phenomenon that poses persistent challenges for Large Language Models (LLMs), particularly in tasks requiring deep semantic understanding. Existing benchmarks often treat negation as a side case within broader tasks like natural language inference, resulting in a lack of benchmarks that exclusively target negation understanding. In this work, we introduce Thunder-NUBench, a novel benchmark explicitly designed to assess sentence-level negation understanding in LLMs. Thunder-NUBench goes beyond surface-level cue detection by contrasting standard negation with structurally diverse alternatives such as local negation, contradiction, and paraphrase. The benchmark consists of manually curated sentence-negation pairs and a multiple-choice dataset that enables in-depth evaluation of models' negation understanding.
[40] ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge
Zeinab Sadat Taghavi,Ali Modarressi,Yunpu Ma,Hinrich Schütze
Main category: cs.CL
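TL;DR: The researchers present ImpliRet, a new benchmark that shifts the reasoning challenge to document-side processing. Although a range of sparse and dense retrievers and long-context models are evaluated, all of them still perform poorly on implicitly stated document information, showing that document-side reasoning remains a challenge.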
Details
Motivation: Existing retrieval systems often rely on surface-level cues such as keyword overlap and lexical semantic similarity. To evaluate retrieval beyond these shallow signals, recent work has introduced reasoning-heavy queries. However, these approaches rely mainly on query-side techniques to resolve the complexity. Method: Presents a new benchmark, ImpliRet, which shifts the reasoning challenge to document-side processing. In this benchmark the queries are simple, but relevance depends on facts stated implicitly in documents, involving temporal, arithmetic, and world knowledge relationships. A range of sparse and dense retrievers is evaluated, and long-context models are tested to see whether they can overcome this limitation. Result: All retrievers perform poorly in this setting, with a best nDCG@10 of only 15.07%. Even with a short context of only ten documents that includes the correct document, GPT-4.1 scores only 35.06%, showing that document-side reasoning remains a challenge. Conclusion: Document-side reasoning remains a major challenge for retrieval systems, and existing techniques and models leave considerable room for improvement. Abstract: Retrieval systems are central to many NLP pipelines, but often rely on surface-level cues such as keyword overlap and lexical semantic similarity. To evaluate retrieval beyond these shallow signals, recent benchmarks introduce reasoning-heavy queries; however, they primarily shift the burden to query-side processing techniques -- like prompting or multi-hop retrieval -- that can help resolve complexity. In contrast, we present ImpliRet, a benchmark that shifts the reasoning challenge to document-side processing: The queries are simple, but relevance depends on facts stated implicitly in documents through temporal (e.g., resolving "two days ago"), arithmetic, and world knowledge relationships. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting: the best nDCG@10 is only 15.07%. We also test whether long-context models can overcome this limitation. But even with a short context of only ten documents, including the positive document, GPT-4.1 scores only 35.06%, showing that document-side reasoning remains a challenge. Our codes are available at github.com/ZeinabTaghavi/IMPLIRET.Contribution.
[41] LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
Xiaoran Liu,Zhigeng Liu,Zengfeng Huang,Qipeng Guo,Ziwei He,Xipeng Qiu
Main category: cs.CL
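TL;DR: Large language diffusion models (diffusion LLMs) have attracted considerable attention in NLP research, but their long-context capabilities remain unexplored. This paper presents the first systematic comparison of the long-context performance of diffusion LLMs and traditional auto-regressive LLMs, finds that diffusion LLMs exhibit stable perplexity and a local perception phenomenon, and proposes LongLLaDA, a training-free method for extending the context window.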
Details
Motivation: Although diffusion LLMs have received wide attention, their long-context capabilities have not been systematically analyzed, and no methods for context extension have been proposed. Method: Identifies distinctive properties of diffusion LLMs, namely stable perplexity and the local perception phenomenon, and on that basis proposes LongLLaDA, a training-free method that combines LLaDA with NTK-based RoPE extrapolation. Result: Validates that established extrapolation scaling laws remain effective for extending the context windows of diffusion LLMs, and identifies long-context tasks on which diffusion LLMs outperform or fall short of auto-regressive LLMs. Conclusion: The study establishes the first context extrapolation method for diffusion LLMs and provides important theoretical insights and empirical benchmarks for future research. Abstract: Large Language Diffusion Models, or diffusion LLMs, have emerged as a significant focus in NLP research, with substantial effort directed toward understanding their scalability and downstream task performance. However, their long-context capabilities remain unexplored, lacking systematic analysis or methods for context extension. In this work, we present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs. We first identify a unique characteristic of diffusion LLMs: unlike auto-regressive LLMs, they maintain remarkably stable perplexity during direct context extrapolation. Furthermore, where auto-regressive models fail outright during the Needle-In-A-Haystack task with context exceeding their pretrained length, we discover diffusion LLMs exhibit a distinct local perception phenomenon, enabling successful retrieval from recent context segments. We explain both phenomena through the lens of Rotary Position Embedding (RoPE) scaling theory. Building on these observations, we propose LongLLaDA, a training-free method that integrates LLaDA with the NTK-based RoPE extrapolation. Our results validate that established extrapolation scaling laws remain effective for extending the context windows of diffusion LLMs. Furthermore, we identify long-context tasks where diffusion LLMs outperform auto-regressive LLMs and others where they fall short. 
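For readers unfamiliar with the NTK-based RoPE extrapolation mentioned above, the following is a minimal sketch of the commonly used NTK-aware base rescaling, not the LongLLaDA implementation. It assumes the widely cited formulation in which the rotary base is multiplied by s^(d/(d-2)) for a context scale factor s and head dimension d. The function names and the example values (base 10000, head dimension 64, scale 4) are illustrative assumptions, not taken from the paper.

```python
def rope_inv_freq(head_dim: int, base: float = 10000.0) -> list[float]:
    """Standard RoPE inverse frequencies: base^(-2i/d) for i in [0, d/2)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def ntk_scaled_base(head_dim: int, scale: float, base: float = 10000.0) -> float:
    """NTK-aware rescaling of the rotary base for a context scale factor."""
    return base * scale ** (head_dim / (head_dim - 2))


head_dim, scale = 64, 4.0  # illustrative values (assumed)
original = rope_inv_freq(head_dim)
extended = rope_inv_freq(head_dim, base=ntk_scaled_base(head_dim, scale))

# The highest frequency (i = 0) is unchanged, while the lowest frequency
# is slowed down by exactly the scale factor.
print(original[0], extended[0])
print(original[-1] / extended[-1])
```

Under this rescaling, high-frequency dimensions, which encode local token order, stay nearly unchanged, while low-frequency dimensions are stretched to cover the longer context.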
Consequently, this study establishes the first context extrapolation method for diffusion LLMs while providing essential theoretical insights and empirical benchmarks critical for advancing future research on long-context diffusion LLMs.
[42] How Far Can LLMs Improve from Experience? Measuring Test-Time Learning Ability in LLMs with Human Comparison
Jiayin Wang,Zhiquang Guo,Weizhi Ma,Min Zhang
Main category: cs.CL
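TL;DR: How large language models are evaluated may shape the path toward artificial general intelligence, so comprehensive and forward-looking evaluation is needed. Existing benchmarks mainly assess static knowledge, whereas intelligence also includes the ability to learn rapidly from experience. This paper proposes semantic games as effective testbeds for evaluating test-time learning and introduces an objective evaluation framework. Results show that although LLMs have test-time learning ability, their improvements are less stable and progress more slowly than those of humans. This indicates that LLMs have potential as general-purpose learning machines, but it also reveals a substantial intellectual gap between models and humans.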
Details
Motivation: Existing evaluation methods for large language models focus mainly on static knowledge and overlook the ability to learn rapidly from experience that intelligence requires. To better advance progress toward artificial general intelligence, models' ability to learn in practical reasoning tasks needs to be evaluated. Method: Proposes semantic games as testbeds for evaluating models' test-time learning ability; builds an objective evaluation framework containing four forms of experience representation that compares model performance under limited and cumulative experience settings; and recruits eight human participants to complete the same task as a comparative baseline. Result: Experiments show that large language models exhibit measurable test-time learning ability, but their improvements under cumulative experience are less stable than those of humans and they progress more slowly. Conclusion: Large language models have the potential to become general-purpose learning machines, but a substantial intellectual gap with humans remains, regardless of how well they perform on static benchmarks. Abstract: As evaluation designs of large language models may shape our trajectory toward artificial general intelligence, comprehensive and forward-looking assessment is essential. Existing benchmarks primarily assess static knowledge, while intelligence also entails the ability to rapidly learn from experience. To this end, we advocate for the evaluation of Test-time Learning, the capacity to improve performance in experience-based, reasoning-intensive tasks during test time. In this work, we propose semantic games as effective testbeds for evaluating test-time learning, due to their resistance to saturation and inherent demand for strategic reasoning. We introduce an objective evaluation framework that compares model performance under both limited and cumulative experience settings, and contains four forms of experience representation. To provide a comparative baseline, we recruit eight human participants to complete the same task. Results show that LLMs exhibit measurable test-time learning capabilities; however, their improvements are less stable under cumulative experience and progress more slowly than those observed in humans. These findings underscore the potential of LLMs as general-purpose learning machines, while also revealing a substantial intellectual gap between models and humans, irrespective of how well LLMs perform on static benchmarks.
[43] LexiMark: Robust Watermarking via Lexical Substitutions to Enhance Membership Verification of an LLM's Textual Training Data
Eyal German,Sagiv Antebi,Edan Habler,Asaf Shabtai,Yuval Elovici
Main category: cs.CL
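TL;DR: The paper proposes LexiMark, a novel text watermarking technique that substitutes synonyms for high-entropy words to strengthen an LLM's memorization of watermarked text while preserving semantic integrity. The method is evaluated across multiple training settings, and the results show it is more effective than existing methods.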
Details
Motivation: Verifying whether a specific LLM was trained or fine-tuned on unauthorized data is very challenging, and existing dataset watermarking methods are easy to detect and remove. Method: LexiMark is a watermarking technique designed for text and documents that applies synonym substitutions to carefully selected high-entropy words, enhancing an LLM's memorization of the watermarked text without altering its semantic integrity. Result: LexiMark was evaluated under multiple training settings, and the results show significant improvements in AUROC scores, demonstrating the method's effectiveness. Conclusion: LexiMark provides a reliable way to verify whether unauthorized watermarked data was used in LLM training, and it is difficult to detect and remove. Abstract: Large language models (LLMs) can be trained or fine-tuned on data obtained without the owner's consent. Verifying whether a specific LLM was trained on particular data instances or an entire dataset is extremely challenging. Dataset watermarking addresses this by embedding identifiable modifications in training data to detect unauthorized use. However, existing methods often lack stealth, making them relatively easy to detect and remove. In light of these limitations, we propose LexiMark, a novel watermarking technique designed for text and documents, which embeds synonym substitutions for carefully selected high-entropy words. Our method aims to enhance an LLM's memorization capabilities on the watermarked text without altering the semantic integrity of the text. As a result, the watermark is difficult to detect, blending seamlessly into the text with no visible markers, and is resistant to removal due to its subtle, contextually appropriate substitutions that evade automated and manual detection. We evaluated our method using baseline datasets from recent studies and seven open-source models: LLaMA-1 7B, LLaMA-3 8B, Mistral 7B, Pythia 6.9B, as well as three smaller variants from the Pythia family (160M, 410M, and 1B). Our evaluation spans multiple training settings, including continued pretraining and fine-tuning scenarios. The results demonstrate significant improvements in AUROC scores compared to existing methods, underscoring our method's effectiveness in reliably verifying whether unauthorized watermarked data was used in LLM training.
[44] LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops
Jiyuan Fu,Kaixun Jiang,Lingyi Hong,Jinglun Li,Haijing Guo,Dingkang Yang,Zhaoyu Chen,Wenqiang Zhang
Main category: cs.CL
TL;DR: Multimodal Large Language Models (MLLMs) are vulnerable to attacks that exploit computational resources during inference. This paper proposes LingoLoop, an attack designed to induce MLLMs to generate excessively verbose and repetitive sequences by postponing EOS token generation and constraining output diversity. Experiments show LingoLoop can significantly increase generated tokens and energy consumption on models like Qwen2.5-VL-3B.
Details
Motivation: Attackers can exploit the substantial computational resources required by MLLMs during inference by inducing excessive output, leading to resource exhaustion and service degradation. Prior attacks have limitations due to neglecting the influence of token-level Part-of-Speech (POS) characteristics on EOS and sentence-level structural patterns on output counts. Method: The proposed attack, LingoLoop, consists of two mechanisms: a POS-Aware Delay Mechanism that postpones EOS token generation by adjusting attention weights guided by POS information, and a Generative Path Pruning Mechanism that limits the magnitude of hidden states to encourage the model to produce persistent loops. Result: Extensive experiments demonstrate that LingoLoop can increase generated tokens by up to 30 times and energy consumption by a comparable factor on models like Qwen2.5-VL-3B, consistently driving MLLMs towards their maximum generation limits. Conclusion: These findings expose significant vulnerabilities in MLLMs, posing challenges for their reliable deployment. Abstract: Multimodal Large Language Models (MLLMs) have shown great promise but require substantial computational resources during inference. Attackers can exploit this by inducing excessive output, leading to resource exhaustion and service degradation. Prior energy-latency attacks aim to increase generation time by broadly shifting the output token distribution away from the EOS token, but they neglect the influence of token-level Part-of-Speech (POS) characteristics on EOS and sentence-level structural patterns on output counts, limiting their efficacy. To address this, we propose LingoLoop, an attack designed to induce MLLMs to generate excessively verbose and repetitive sequences. First, we find that the POS tag of a token strongly affects the likelihood of generating an EOS token. 
Based on this insight, we propose a POS-Aware Delay Mechanism to postpone EOS token generation by adjusting attention weights guided by POS information. Second, we identify that constraining output diversity to induce repetitive loops is effective for sustained generation. We introduce a Generative Path Pruning Mechanism that limits the magnitude of hidden states, encouraging the model to produce persistent loops. Extensive experiments demonstrate LingoLoop can increase generated tokens by up to 30 times and energy consumption by a comparable factor on models like Qwen2.5-VL-3B, consistently driving MLLMs towards their maximum generation limits. These findings expose significant vulnerabilities in MLLMs, posing challenges for their reliable deployment. The code will be released publicly following the paper's acceptance.
[45] M2BeamLLM: Multimodal Sensing-empowered mmWave Beam Prediction with Large Language Models
Can Zheng,Jiguang He,Chung G. Kang,Guofa Cai,Zitong Yu,Merouane Debbah
Main category: cs.CL
TL;DR: This paper introduces M2BeamLLM, a novel neural network framework for beam prediction in mmWave mMIMO systems. It integrates multi-modal sensor data and leverages LLMs like GPT-2 to achieve high accuracy and robustness.
Details
Motivation: To improve beam prediction accuracy and robustness in mmWave mMIMO communication systems by integrating multi-modal sensor data and leveraging the reasoning capabilities of large language models. Method: M2BeamLLM framework combines sensing data encoding, multimodal alignment and fusion, and supervised fine-tuning (SFT) to process images, radar, LiDAR, and GPS data using LLMs such as GPT-2. Result: M2BeamLLM outperforms traditional DL models in both standard and few-shot scenarios with increased prediction performance as sensing modalities become more diverse. Conclusion: M2BeamLLM provides an efficient and intelligent beam prediction solution for V2I mmWave communication systems. Abstract: This paper introduces a novel neural network framework called M2BeamLLM for beam prediction in millimeter-wave (mmWave) massive multi-input multi-output (mMIMO) communication systems. M2BeamLLM integrates multi-modal sensor data, including images, radar, LiDAR, and GPS, leveraging the powerful reasoning capabilities of large language models (LLMs) such as GPT-2 for beam prediction. By combining sensing data encoding, multimodal alignment and fusion, and supervised fine-tuning (SFT), M2BeamLLM achieves significantly higher beam prediction accuracy and robustness, demonstrably outperforming traditional deep learning (DL) models in both standard and few-shot scenarios. Furthermore, its prediction performance consistently improves with increased diversity in sensing modalities. Our study provides an efficient and intelligent beam prediction solution for vehicle-to-infrastructure (V2I) mmWave communication systems.
[46] AlphaDecay:Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs
Di He,Ajay Jaiswal,Songjun Tu,Li Shen,Ganzhao Yuan,Shiwei Liu,Lu Yin
Main category: cs.CL
TL;DR: AlphaDecay is a method that adaptively assigns different weight decay strengths to each module of an LLM based on Heavy-Tailed Self-Regularization (HT-SR) theory, achieving better perplexity and generalization than conventional methods.
Details
Motivation: Standard weight decay techniques for training LLMs typically assign a uniform decay rate to every layer, ignoring the structural diversity of LLMs and varying spectral properties across modules. Method: The AlphaDecay method adaptively assigns different weight decay strengths to each module of an LLM. It uses Heavy-Tailed Self-Regularization (HT-SR) theory to analyze the empirical spectral density (ESD) of weight correlation matrices and quantify 'heavy-tailedness'. Modules with more pronounced heavy-tailed ESDs are assigned weaker decay, while those with lighter-tailed spectra receive stronger decay. Result: Extensive pre-training tasks with various model sizes from 60M to 1B parameters demonstrate that AlphaDecay achieves better perplexity and generalization compared to conventional uniform decay and other adaptive decay baselines. Conclusion: AlphaDecay improves performance by balancing module-wise differences in spectral properties through tailored weight decay assignments. Abstract: Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify "heavy-tailedness." Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in spectral properties, leading to improved performance. 
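The module-wise assignment described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the Hill-style tail estimator (`hill_alpha`), the linear mapping from tail exponent to decay strength, and the parameters `base_decay`, `lo`, `hi`, and `tail_frac` are all assumptions introduced here. The only property taken from the text is the direction of the rule: a heavier-tailed ESD (smaller estimated exponent) receives weaker decay.

```python
import numpy as np

def hill_alpha(weight: np.ndarray, tail_frac: float = 0.5) -> float:
    """Hill-style estimate of the power-law exponent of the ESD tail (assumed estimator)."""
    # Eigenvalues of the correlation matrix W^T W are the squared singular values of W.
    eigs = np.sort(np.linalg.svd(weight, compute_uv=False) ** 2)
    k = max(2, int(len(eigs) * tail_frac))  # number of largest eigenvalues treated as the tail
    tail = eigs[-k:]
    log_ratios = np.log(tail / tail[0])     # log distance of each tail value from the tail threshold
    return 1.0 + k / (np.sum(log_ratios) + 1e-12)

def assign_decay(weights: dict, base_decay: float = 0.1,
                 lo: float = 0.5, hi: float = 1.5) -> dict:
    """Map each module's exponent linearly onto [lo, hi] * base_decay (hypothetical mapping)."""
    alphas = {name: hill_alpha(w) for name, w in weights.items()}
    a_min, a_max = min(alphas.values()), max(alphas.values())
    if a_max - a_min < 1e-12:
        # All modules look alike: fall back to uniform decay.
        return {name: base_decay for name in weights}
    return {
        # Smaller alpha (heavier tail) -> scale near lo -> weaker decay.
        name: base_decay * (lo + (hi - lo) * (a - a_min) / (a_max - a_min))
        for name, a in alphas.items()
    }

# Toy modules: Gaussian entries (light-tailed) vs. Cauchy entries (heavy-tailed).
rng = np.random.default_rng(0)
toy_weights = {
    "light": rng.standard_normal((200, 100)),
    "heavy": rng.standard_cauchy((200, 100)),
}
toy_decays = assign_decay(toy_weights)
```

In a training loop, the resulting per-module values would typically be passed to the optimizer as separate parameter groups, one per module, each with its own weight decay setting.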
Extensive pre-training tasks with various model sizes from 60M to 1B demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines.
[47] GenerationPrograms: Fine-grained Attribution with Executable Programs
David Wan,Eran Hirsch,Elias Stengel-Eskin,Ido Dagan,Mohit Bansal
Main category: cs.CL
TL;DR: Recent large language models (LLMs) excel in text generation but struggle with providing fine-grained attributions. To address this, the paper introduces GenerationPrograms, a modular framework that separates output generation from attribution. This two-stage process improves attribution quality and interpretability across various tasks.
Details
Motivation: Large language models perform well in generating texts conditioned on sources but often fail to provide verifiable and trustworthy fine-grained attributions. Current methods lack the ability to explain how models use source documents, limiting their interpretability. Method: The method involves a modular generation framework called GenerationPrograms. It splits the generation process into two stages: creating an executable program plan with modular text operations tailored to the query, and executing these operations to produce the final response. Result: Empirical evaluations show significant improvements in attribution quality at both document and sentence levels in long-form question-answering and multi-document summarization tasks. The framework also functions effectively as a post-hoc attribution method, outperforming traditional techniques. Conclusion: GenerationPrograms enhances the attribution quality and interpretability of language models by decomposing the generation process into distinct stages, allowing for localized refinements through modular improvements. Abstract: Recent large language models (LLMs) achieve impressive performance in source-conditioned text generation but often fail to correctly provide fine-grained attributions for their outputs, undermining verifiability and trust. Moreover, existing attribution methods do not explain how and why models leverage the provided source documents to generate their final responses, limiting interpretability. To overcome these challenges, we introduce a modular generation framework, GenerationPrograms, inspired by recent advancements in executable "code agent" architectures. 
Unlike conventional generation methods that simultaneously generate outputs and attributions or rely on post-hoc attribution, GenerationPrograms decomposes the process into two distinct stages: first, creating an executable program plan composed of modular text operations (such as paraphrasing, compression, and fusion) explicitly tailored to the query, and second, executing these operations following the program's specified instructions to produce the final response. Empirical evaluations demonstrate that GenerationPrograms significantly improves attribution quality at both the document level and sentence level across two long-form question-answering tasks and a multi-document summarization task. We further demonstrate that GenerationPrograms can effectively function as a post-hoc attribution method, outperforming traditional techniques in recovering accurate attributions. In addition, the interpretable programs generated by GenerationPrograms enable localized refinement through modular-level improvements that further enhance overall attribution quality.
[48] Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees
Ahmed Heakl,Sarim Hashmi,Chaimaa Abi,Celine Lee,Abdulrahman Mahmoud
Main category: cs.CL
TL;DR: The paper introduces GG (Guaranteed Guess), an ISA-centric transpilation pipeline that uses pre-trained large language models (LLMs) and software testing constructs to translate programs across different instruction set architectures (ISAs). It achieves high functional/semantic correctness on HumanEval and BringupBench programs, and its transpiled code shows better runtime performance, energy efficiency, and memory usage than the Rosetta 2 framework.
Details
Motivation: To enhance the portability and longevity of existing code by translating low-level programs across different ISAs in a quick, flexible, and correct way. Method: GG generates candidate translations using an LLM from one ISA to another and embeds such translations within a software-testing framework to build quantifiable confidence in the translation. Result: Achieved functional/semantic correctness of 99% on HumanEval programs and 49% on BringupBench programs. Showcased 1.73x faster runtime performance, 1.47x better energy efficiency, and 2.41x better memory usage compared to Rosetta 2 framework. Conclusion: GG is effective for real-world CISC-to-RISC translation tasks and the authors will open-source their codes, data, models, and benchmarks. Abstract: The hardware ecosystem is rapidly evolving, with increasing interest in translating low-level programs across different instruction set architectures (ISAs) in a quick, flexible, and correct way to enhance the portability and longevity of existing code. A particularly challenging class of this transpilation problem is translating between complex- (CISC) and reduced- (RISC) hardware architectures, due to fundamental differences in instruction complexity, memory models, and execution paradigms. In this work, we introduce GG (Guaranteed Guess), an ISA-centric transpilation pipeline that combines the translation power of pre-trained large language models (LLMs) with the rigor of established software testing constructs. Our method generates candidate translations using an LLM from one ISA to another, and embeds such translations within a software-testing framework to build quantifiable confidence in the translation. We evaluate our GG approach over two diverse datasets, enforce high code coverage (>98%) across unit tests, and achieve functional/semantic correctness of 99% on HumanEval programs and 49% on BringupBench programs, respectively. 
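The generate-then-test idea described above can be illustrated with a small sketch. It is a hedged toy rather than the GG pipeline itself: candidate translations are modeled as plain Python callables instead of assembly programs, and the names `passes_all` and `first_verified` are hypothetical helpers introduced only for this example.

```python
from typing import Callable, Iterable, Optional, Sequence, Tuple

# A test case pairs input arguments with the expected result (assumed representation).
TestCase = Tuple[tuple, object]

def passes_all(candidate: Callable, tests: Sequence[TestCase]) -> bool:
    """Return True only if the candidate reproduces every expected result."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crashing candidate counts as a failed translation.
            return False
    return True

def first_verified(candidates: Iterable[Callable],
                   tests: Sequence[TestCase]) -> Optional[Callable]:
    """Return the first candidate that passes the whole test suite, or None."""
    for candidate in candidates:
        if passes_all(candidate, tests):
            return candidate
    return None

# Stand-ins for model-generated translations of an "add two integers" routine.
guesses = [lambda a, b: a - b, lambda a, b: a + b]
unit_tests = [((1, 2), 3), ((0, 0), 0), ((-4, 4), 0)]
accepted = first_verified(guesses, unit_tests)
```

Confidence in an accepted candidate is only as strong as the test suite behind it, which is why the abstract stresses high code coverage across the unit tests.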
Further, we compare our approach to the state-of-the-art Rosetta 2 framework on Apple Silicon, showcasing 1.73x faster runtime performance, 1.47x better energy efficiency, and 2.41x better memory usage for our transpiled code, demonstrating the effectiveness of GG for real-world CISC-to-RISC translation tasks. We will open-source our code, data, models, and benchmarks to establish a common foundation for ISA-level code translation research.
[49] When Does Meaning Backfire? Investigating the Role of AMRs in NLI
Junghyun Min,Xiulin Yang,Shira Wein
Main category: cs.CL
TL;DR: The paper explores the impact of adding AMR to pretrained language models in NLI, finding that while AMR in prompting settings slightly improves GPT-4o, it can mislead models when core meaning is preserved.
Details
Motivation: To investigate if incorporating AMR semantic information helps pretrained language models better generalize in NLI tasks. Method: Conduct experiments integrating AMR into NLI through both fine-tuning and prompting settings, followed by an ablation study. Result: Fine-tuning with AMR hinders model generalization. Prompting with AMR leads to slight gains in GPT-4o but improvements stem from amplifying surface-level differences rather than aiding semantic reasoning. This can cause models to predict non-entailment even when core meaning is preserved. Conclusion: Adding AMR semantic information does not aid semantic reasoning in NLI as expected and can potentially mislead models. Abstract: Natural Language Inference (NLI) relies heavily on adequately parsing the semantic content of the premise and hypothesis. In this work, we investigate whether adding semantic information in the form of an Abstract Meaning Representation (AMR) helps pretrained language models better generalize in NLI. Our experiments integrating AMR into NLI in both fine-tuning and prompting settings show that the presence of AMR in fine-tuning hinders model generalization while prompting with AMR leads to slight gains in GPT-4o. However, an ablation study reveals that the improvement comes from amplifying surface-level differences rather than aiding semantic reasoning. This amplification can mislead models to predict non-entailment even when the core meaning is preserved.
[50] Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models
Chenchen Yuan,Zheyu Zhang,Shuo Yang,Bardh Prenkaj,Gjergji Kasneci
Main category: cs.CL
TL;DR: Large Language Models (LLMs) have strong moral reasoning but struggle with complex dilemmas. This paper proposes a framework to synthesize multiple LLMs' moral judgments into a collective consensus, using an aggregation mechanism and embedding optimization for misaligned models. Experiments show this improves consistency and fidelity.
Details
Motivation: To address discrepancies in LLMs' moral reasoning when faced with complex, multi-factor moral dilemmas. Method: Propose a framework that synthesizes multiple LLMs' moral judgments into a collectively formulated moral judgment. Use an aggregation mechanism to fuse continuous moral acceptability scores into a collective probability, weighting by model reliability. For misaligned models, apply a targeted embedding-optimization procedure to fine-tune token embeddings for moral philosophical theories. Result: Experiments on a large-scale social moral dilemma dataset demonstrate that the approach builds robust consensus and improves individual model fidelity. Conclusion: The findings highlight the value of data-driven moral alignment across multiple models, pointing towards safer and more consistent AI systems. Abstract: Large Language Models (LLMs) have shown impressive moral reasoning abilities. Yet they often diverge when confronted with complex, multi-factor moral dilemmas. To address these discrepancies, we propose a framework that synthesizes multiple LLMs' moral judgments into a collectively formulated moral judgment, realigning models that deviate significantly from this consensus. Our aggregation mechanism fuses continuous moral acceptability scores (beyond binary labels) into a collective probability, weighting contributions by model reliability. For misaligned models, a targeted embedding-optimization procedure fine-tunes token embeddings for moral philosophical theories, minimizing JS divergence to the consensus while preserving semantic integrity. Experiments on a large-scale social moral dilemma dataset show our approach builds robust consensus and improves individual model fidelity. These findings highlight the value of data-driven moral alignment across multiple models and its potential for safer, more consistent AI systems.
[51] AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation
Leah von der Heyde,Anna-Carolina Haensch,Bernd Weiß,Jessika Daikeler
Main category: cs.CL
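TL;DR: The study examines the use of large language models (LLMs) for coding German open-ended survey responses and finds that performance differs greatly between LLMs, with only a fine-tuned LLM achieving satisfactory predictive performance. In addition, the performance of prompting approaches depends on the LLM used, and classification performance is uneven across categories.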
Details
Motivation: Although existing research suggests that LLMs may be an efficient alternative to manual coding and to the pre-training of supervised machine learning models, most studies have focused on English-language responses or on single LLMs, so it is unclear whether their findings generalize. LLM performance in other languages and on more complex topics therefore needs to be examined. Method: Using German data on reasons for survey participation, the study compares several state-of-the-art LLMs and several prompting approaches, and evaluates the LLMs' performance against human expert codings. Result: Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory predictive performance. The performance of prompting approaches is conditional on the LLM used. Moreover, without fine-tuning, LLMs' classification performance is unequal across categories, resulting in different categorical distributions. Conclusion: The study highlights the trade-offs researchers need to consider when choosing automated methods for open-ended response classification and offers insights into how LLMs can be leveraged efficiently, accurately, and reliably in survey research. Abstract: The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses. Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models. As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs' performance by using human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Finally, LLMs' unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning. We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data. 
Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research.
[52] Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot
Xiang Cheng,Chengyan Pan,Minjun Zhao,Deyang Li,Fangchao Liu,Xinyu Zhang,Xiao Zhang,Yong Liu
Main category: cs.CL
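TL;DR: Although Chain-of-Thought (CoT) exemplars enhanced reasoning in earlier models, for recent, stronger models such as the Qwen2.5 series neither traditional nor enhanced CoT exemplars improve mathematical reasoning performance; they mainly serve to align the output format. The study shows that these models tend to attend to the instructions rather than the exemplars, so no gain in reasoning ability is observed. This reveals the limitations of the current ICL+CoT framework in mathematical reasoning and calls for a re-examination of the ICL paradigm and the definition of exemplars.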
Details
Motivation: As the capabilities of large language models (LLMs) continue to advance, the authors ask whether traditional Chain-of-Thought (CoT) exemplars still improve the reasoning of stronger models, especially on mathematics tasks. Method: Through systematic experiments, the study evaluates the effect of traditional CoT exemplars and enhanced CoT exemplars (constructed using answers from advanced models such as Qwen2.5-Max and DeepSeek-R1) on recent strong models such as the Qwen2.5 series, and analyzes how much attention models pay to exemplars versus instructions. Result: For recent strong models, adding traditional CoT exemplars does not improve reasoning performance compared to Zero-Shot CoT, and enhanced CoT exemplars likewise fail to improve it. Further analysis shows that models tend to ignore the exemplars and focus primarily on the instructions. Conclusion: The current ICL+CoT framework has limitations in mathematical reasoning, and the ICL paradigm and the definition of exemplars need to be re-examined. Abstract: In-Context Learning (ICL) is an essential emergent ability of Large Language Models (LLMs), and recent studies introduce Chain-of-Thought (CoT) to exemplars of ICL to enhance the reasoning capability, especially in mathematics tasks. However, given the continuous advancement of model capabilities, it remains unclear whether CoT exemplars still benefit recent, stronger models in such tasks. Through systematic experiments, we find that for recent strong models such as the Qwen2.5 series, adding traditional CoT exemplars does not improve reasoning performance compared to Zero-Shot CoT. Instead, their primary function is to align the output format with human expectations. We further investigate the effectiveness of enhanced CoT exemplars, constructed using answers from advanced models such as Qwen2.5-Max and DeepSeek-R1. Experimental results indicate that these enhanced exemplars still fail to improve the model's reasoning performance. Further analysis reveals that models tend to ignore the exemplars and focus primarily on the instructions, leading to no observable gain in reasoning ability. Overall, our findings highlight the limitations of the current ICL+CoT framework in mathematical reasoning, calling for a re-examination of the ICL paradigm and the definition of exemplars.
[53] Passing the Turing Test in Political Discourse: Fine-Tuning LLMs to Mimic Polarized Social Media Comments
. Pazzaglia,V. Vendetti,L. D. Comencini,F. Deriu,V. Modugno
Main category: cs.CL
TL;DR: The study investigates how fine-tuned LLMs can replicate and amplify polarizing discourse online using Reddit data, finding that they produce plausible and provocative comments raising ethical concerns.
Details
Motivation: To understand the potential of fine-tuned LLMs in exacerbating ideological polarization through generating biased content. Method: Fine-tune an open-source LLM on a dataset of politically charged Reddit discussions and evaluate outputs via linguistic analysis, sentiment scoring, and human annotation focusing on credibility and rhetorical alignment. Result: LLMs trained on partisan data can generate highly plausible and provocative comments similar to those written by humans. Conclusion: Raises ethical questions about AI use in political discourse and emphasizes the need for AI governance, platform regulation, and detection tools. Abstract: The increasing sophistication of large language models (LLMs) has sparked growing concerns regarding their potential role in exacerbating ideological polarization through the automated generation of persuasive and biased content. This study explores the extent to which fine-tuned LLMs can replicate and amplify polarizing discourse within online environments. Using a curated dataset of politically charged discussions extracted from Reddit, we fine-tune an open-source LLM to produce context-aware and ideologically aligned responses. The model's outputs are evaluated through linguistic analysis, sentiment scoring, and human annotation, with particular attention to credibility and rhetorical alignment with the original discourse. The results indicate that, when trained on partisan data, LLMs are capable of producing highly plausible and provocative comments, often indistinguishable from those written by humans. These findings raise significant ethical questions about the use of AI in political discourse, disinformation, and manipulation campaigns. 
The paper concludes with a discussion of the broader implications for AI governance, platform regulation, and the development of detection tools to mitigate adversarial fine-tuning risks.
[54] GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection Vectors
Hengyuan Zhang,Xinrong Chen,Yingmin Qiu,Xiao Liang,Ziyue Li,Guanyu Wang,Weiping Li,Tong Mo,Wenyue Li,Hayden Kwok-Hay So,Ngai Wong
Main category: cs.CL
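TL;DR: GuiLoMo is a fine-grained, layer-wise strategy for allocating expert numbers and ranks with GuidedSelection Vectors (GSVs). GSVs are learned via a prior bilevel optimization process to capture model- and task-specific needs and are then used to allocate optimal expert numbers and ranks to different layers. Experiments show that GuiLoMo achieves superior or comparable performance to all baselines on three backbone models and offers key insights into how expert numbers and ranks vary across layers and tasks.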
Details
Motivation: Existing PEFT methods such as LoRA are limited by their small number of trainable parameters. LoRA-MoE, which combines LoRA with MoE, increases capacity, but two limitations remain: 1) the influence of downstream tasks when assigning expert numbers, and 2) the uniform rank assignment across all LoRA experts, which restricts representational diversity. A new strategy is needed to address these issues and fully exploit the potential of LoRA-MoE. Method: The paper proposes GuiLoMo, a strategy built on GuidedSelection Vectors (GSVs). GSVs are learned via a prior bilevel optimization process to capture model- and task-specific needs and are then used to allocate optimal expert numbers and ranks to different layers. Result: Experiments on three backbone models and multiple benchmarks show that GuiLoMo consistently achieves superior or comparable performance to all baselines. Further analysis reveals how expert numbers and ranks vary across layers and tasks, highlighting the benefits of adaptive expert configuration. Conclusion: With its fine-grained, layer-wise allocation of expert numbers and ranks, GuiLoMo addresses the limitations of existing LoRA-MoE methods, improves performance, and provides important insights into expert configuration. Abstract: Parameter-efficient fine-tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), offer an efficient way to adapt large language models with reduced computational costs. However, their performance is limited by the small number of trainable parameters. Recent work combines LoRA with the Mixture-of-Experts (MoE), i.e., LoRA-MoE, to enhance capacity, but two limitations still hinder the full exploitation of its potential: 1) the influence of downstream tasks when assigning expert numbers, and 2) the uniform rank assignment across all LoRA experts, which restricts representational diversity. To mitigate these gaps, we propose GuiLoMo, a fine-grained layer-wise expert numbers and ranks allocation strategy with GuidedSelection Vectors (GSVs). GSVs are learned via a prior bilevel optimization process to capture both model- and task-specific needs, and are then used to allocate optimal expert numbers and ranks. Experiments on three backbone models across diverse benchmarks show that GuiLoMo consistently achieves superior or comparable performance to all baselines. Further analysis offers key insights into how expert numbers and ranks vary across layers and tasks, highlighting the benefits of adaptive expert configuration. Our code is available at https://github.com/Liar406/Gui-LoMo.git.
[55] Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality
Yuto Harada,Yusuke Yamauchi,Yusuke Oda,Yohei Oseki,Yusuke Miyao,Yu Takagi
Main category: cs.CL
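TL;DR: The researchers trained a wide range of base models on diverse datasets covering code generation, mathematical reasoning, and general-domain tasks, producing more than 1,000 supervised fine-tuning (SFT) models. They find that some training-task synergies persist across models and emphasize the importance of model-specific strategies. They also show that perplexity consistently predicts SFT effectiveness and that mid-layer weight changes correlate most strongly with performance gains.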
Details
Motivation: Although supervised fine-tuning (SFT) is critical for aligning large language models (LLMs) with human instructions and values, many aspects of SFT remain poorly understood. The authors therefore use large-scale experiments to uncover the key properties and factors that shape SFT. Method: A wide range of base models were trained on a variety of datasets (including code generation, mathematical reasoning, and general-domain tasks), yielding more than 1,000 SFT models. The authors then identified the dataset properties that matter most and examined the layer-wise modifications introduced by SFT. Result: Some training-task synergies persist across all models, while others vary substantially. Perplexity consistently predicts SFT effectiveness, and mid-layer weight changes correlate most strongly with performance gains. Conclusion: The effect of SFT varies by model, so model-specific strategies are needed. The authors will release the 1,000+ SFT models and benchmark results to support further research. Abstract: Supervised fine-tuning (SFT) is a critical step in aligning large language models (LLMs) with human instructions and values, yet many aspects of SFT remain poorly understood. We trained a wide range of base models on a variety of datasets including code generation, mathematical reasoning, and general-domain tasks, resulting in 1,000+ SFT models under controlled conditions. We then identified the dataset properties that matter most and examined the layer-wise modifications introduced by SFT. Our findings reveal that some training-task synergies persist across all models while others vary substantially, emphasizing the importance of model-specific strategies. Moreover, we demonstrate that perplexity consistently predicts SFT effectiveness--often surpassing superficial similarity between trained data and benchmark--and that mid-layer weight changes correlate most strongly with performance gains. We will release these 1,000+ SFT models and benchmark results to accelerate further research.
[56] Treasure Hunt: Real-time Targeting of the Long Tail using Training-Time Markers
Daniel D'souza,Julia Kreutzer,Adrien Morisot,Ahmet Üstün,Sara Hooker
Main category: cs.CL
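TL;DR: The paper optimizes training protocols to improve a model's controllability and performance on rare cases at inference time. It creates a detailed taxonomy of data characteristics and task provenance and fine-tunes a base model to infer these markers automatically, so that generation attributes can be controlled explicitly and generations conditioned implicitly at inference time. The approach significantly improves performance on long-tail examples, especially on underrepresented tasks.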
Details
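Motivation: A major challenge in modern machine learning is performing well on the long tail of rare and underrepresented features. Large general-purpose models serve many tasks but work best on high-frequency use cases. Relying on prompt engineering or few-shot examples to maximize output quality on a particular test case can be frustrating, because models can be highly sensitive to small changes, react unpredictably, or depend on a fixed system prompt to maintain performance. Method: The paper revisits the divide between training and inference techniques to improve long-tail performance while giving users a set of control levers the model is trained to respond to. It creates a detailed taxonomy of data characteristics and task provenance to explicitly control generation attributes and implicitly condition generations at inference time, and fine-tunes a base model to infer these markers automatically, which makes them optional at inference time. Result: This principled and flexible approach yields pronounced performance improvements, especially on examples from the long tail of the training distribution. The average lift in win rates for open-ended generation quality is 5.7%, with gains of over 9.1% in underrepresented domains. Relative lifts of up to 14.1% are observed on underrepresented tasks such as CodeRepair, along with absolute improvements of 35.3% on length instruction following evaluations. Conclusion: Optimizing training protocols can improve both controllability at inference time and performance on underrepresented use cases, providing users with control levers that let the model handle rare tasks more effectively at inference time. Abstract: One of the most profound challenges of modern machine learning is performing well on the long-tail of rare and underrepresented features. Large general-purpose models are trained for many tasks, but work best on high-frequency use cases. After training, it is hard to adapt a model to perform well on specific use cases underrepresented in the training corpus. Relying on prompt engineering or few-shot examples to maximize the output quality on a particular test case can be frustrating, as models can be highly sensitive to small changes, react in unpredicted ways or rely on a fixed system prompt for maintaining performance. In this work, we ask: "Can we optimize our training protocols to both improve controllability and performance on underrepresented use cases at inference time?" We revisit the divide between training and inference techniques to improve long-tail performance while providing users with a set of control levers the model is trained to be responsive to. We create a detailed taxonomy of data characteristics and task provenance to explicitly control generation attributes and implicitly condition generations at inference time. We fine-tune a base model to infer these markers automatically, which makes them optional at inference time. This principled and flexible approach yields pronounced improvements in performance, especially on examples from the long tail of the training distribution. 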
While we observe an average lift of 5.7% win rates in open-ended generation quality with our markers, we see over 9.1% gains in underrepresented domains. We also observe relative lifts of up to 14.1% on underrepresented tasks like CodeRepair and absolute improvements of 35.3% on length instruction following evaluations.
[57] Capacity Matters: a Proof-of-Concept for Transformer Memorization on Real-World Data
Anton Changalidis,Aki Härmä
Main category: cs.CL
TL;DR: This paper explores the influence of model architecture and data configurations on the empirical memorization capacity of generative transformers using synthetic text datasets derived from SNOMED. It finds that embedding size is crucial for learning speed and capacity, additional layers have limited benefits, Softmax activation function has greater stability and capacity, and increasing data complexity improves memorization.
Details
Motivation: To understand how different aspects of model architecture (like embedding size, number of layers, and activation functions) and data configurations impact the empirical memorization capacity of generative transformers. Method: Training generative transformer models using synthetic text datasets derived from the Systematized Nomenclature of Medicine (SNOMED) knowledge graph which includes triplets and sequences. Analyzing the effect of various model parameters (embedding size, layers, activation functions) and dataset complexities on the memorization capacity. Result: Embedding size was found to be the primary determinant of learning speed and capacity. Additional layers provided limited benefits and may even hinder performance on simpler datasets. The Softmax activation function demonstrated greater stability and capacity compared to others. Increasing the complexity of the dataset improved the final memorization. Conclusion: The study enhances our understanding of transformer memory mechanisms and provides a framework for optimizing model design when working with structured real-world data. Abstract: This paper studies how the model architecture and data configurations influence the empirical memorization capacity of generative transformers. The models are trained using synthetic text datasets derived from the Systematized Nomenclature of Medicine (SNOMED) knowledge graph: triplets, representing static connections, and sequences, simulating complex relation patterns. The results show that embedding size is the primary determinant of learning speed and capacity, while additional layers provide limited benefits and may hinder performance on simpler datasets. Activation functions play a crucial role, and Softmax demonstrates greater stability and capacity. Furthermore, increasing the complexity of the data set seems to improve the final memorization. 
These insights improve our understanding of transformer memory mechanisms and provide a framework for optimizing model design with structured real-world data.
[58] Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs
Ring Team,Bin Hu,Cai Chen,Deng Zhao,Ding Liu,Dingnan Jin,Feng Zhu,Hao Dai,Hongzhi Luan,Jia Guo,Jiaming Liu,Jiewei Wu,Jun Mei,Jun Zhou,Junbo Zhao,Junwu Xiong,Kaihong Zhang,Kuan Xu,Lei Liang,Liang Jiang,Liangcheng Fu,Longfei Zheng,Qiang Gao,Qing Cui,Quan Wan,Shaomian Zheng,Shuaicheng Li,Tongkai Yang,Wang Ren,Xiaodong Yan,Xiaopei Wan,Xiaoyun Feng,Xin Zhao,Xinxing Yang,Xinyu Kong,Xuemin Yang,Yang Li,Yingting Wu,Yongkang Liu,Zhankai Xu,Zhenduo Zhang,Zhenglei Zhou,Zhenyu Huang,Zhiqiang Zhang,Zihao Wang,Zujie Wen
Main category: cs.CL
TL;DR: The paper presents Ring-lite, a Mixture-of-Experts (MoE) based large language model optimized via reinforcement learning (RL). It achieves efficient and robust reasoning capabilities with fewer activated parameters compared to state-of-the-art models.
Details
Motivation: To create a more efficient and robust reasoning model that matches the performance of state-of-the-art small-scale reasoning models while using significantly fewer parameters. Method: Building upon the Ling-lite model, the authors introduce a joint training pipeline integrating distillation with RL. They address optimization instability during RL training through Constrained Contextual Computation Policy Optimization (C3PO), select distillation checkpoints based on entropy loss for better performance-efficiency trade-offs in RL training, and develop a two-stage training paradigm to handle multi-domain data integration. Result: Ring-lite matches the performance of state-of-the-art small-scale reasoning models on challenging benchmarks while activating only one-third of the parameters required by comparable models. Conclusion: Ring-lite demonstrates efficient and robust reasoning capabilities with fewer activated parameters, and the authors plan to release the model, dataset, and code. Abstract: We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL) to achieve efficient and robust reasoning capabilities. Built upon the publicly available Ling-lite model, a 16.8 billion parameter model with 2.75 billion activated parameters, our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters required by comparable models. To accomplish this, we introduce a joint training pipeline integrating distillation with RL, revealing undocumented challenges in MoE RL training. First, we identify optimization instability during RL training, and we propose Constrained Contextual Computation Policy Optimization(C3PO), a novel approach that enhances training stability and improves computational throughput via algorithm-system co-design methodology. 
Second, we empirically demonstrate that selecting distillation checkpoints based on entropy loss for RL training, rather than validation metrics, yields superior performance-efficiency trade-offs in subsequent RL training. Finally, we develop a two-stage training paradigm to harmonize multi-domain data integration, addressing domain conflicts that arise in training with mixed dataset. We will release the model, dataset, and code.
[59] Reasoning with Exploration: An Entropy Perspective
Daixuan Cheng,Shaohan Huang,Xuekai Zhu,Bo Dai,Wayne Xin Zhao,Zhenliang Zhang,Furu Wei
Main category: cs.CL
TL;DR: The paper reexamines entropy in reinforcement learning (RL) for language models (LMs), finding correlations between high-entropy regions and exploratory reasoning actions. By modifying the advantage function with an entropy-based term, the method promotes longer reasoning chains, enhancing LM reasoning capabilities significantly on the Pass@K metric.
Details
Motivation: Despite recent advances in language model reasoning, most methods focus on exploitation rather than exploration, leading to performance plateaus. The authors aim to address this by exploring the role of entropy in promoting exploratory reasoning in LMs. Method: The authors empirically analyze the relationship between entropy and exploratory reasoning actions in LMs, identifying correlations with pivotal tokens, reflective actions, and rare behaviors. They then introduce a minimal modification to standard RL by augmenting the advantage function with an entropy-based term to encourage longer reasoning chains. Result: The proposed method achieves significant improvements on the Pass@K metric, which estimates LM reasoning capabilities, even when evaluated with very large K values, indicating enhanced reasoning abilities. Conclusion: By revisiting entropy and its role in exploratory reasoning, the authors demonstrate that promoting longer reasoning chains can push the boundaries of language model reasoning, offering a new approach to balance exploration and exploitation in RL. Abstract: Balancing exploration and exploitation is a central goal in reinforcement learning (RL). Despite recent advances in enhancing language model (LM) reasoning, most methods lean toward exploitation, and increasingly encounter performance plateaus. In this work, we revisit entropy -- a signal of exploration in RL -- and examine its relationship to exploratory reasoning in LMs. Through empirical analysis, we uncover strong positive correlations between high-entropy regions and three types of exploratory reasoning actions: (1) pivotal tokens that determine or connect logical steps, (2) reflective actions such as self-verification and correction, and (3) rare behaviors under-explored by the base LMs. Motivated by this, we introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term. 
Unlike traditional maximum-entropy methods which encourage exploration by promoting uncertainty, we encourage exploration by promoting longer and deeper reasoning chains. Notably, our method achieves significant gains on the Pass@K metric -- an upper-bound estimator of LM reasoning capabilities -- even when evaluated with extremely large K values, pushing the boundaries of LM reasoning.
[60] From Bytes to Ideas: Language Modeling with Autoregressive U-Nets
Mathurin Videau,Badr Youbi Idrissi,Alessandro Leite,Marc Schoenauer,Olivier Teytaud,David Lopez-Paz
Main category: cs.CL
TL;DR: An autoregressive U-Net is introduced which learns to embed its own tokens during training, providing a multi-scale view of text sequences and improving predictions.
Details
Motivation: Tokenization using methods like Byte Pair Encoding (BPE) creates a fixed granularity that limits flexibility in how language models operate on data. This rigidity can restrict the model's ability to adapt to different levels of detail and semantic patterns. Method: The method involves creating an autoregressive U-Net that processes raw bytes, pooling them into increasingly larger units (words, pairs of words, up to 4 words). This allows for a multi-scale view where deeper stages predict further into the future, focusing on broader semantic patterns while earlier stages handle finer details. Result: Careful tuning and control of pretraining compute shows that shallow hierarchies match strong BPE baselines, while deeper hierarchies demonstrate a promising trend. The system also has the capability to handle character-level tasks and transfer knowledge across low-resource languages. Conclusion: By embedding tokenization within the model, the autoregressive U-Net provides greater flexibility and adaptability, allowing for better handling of various linguistic tasks and languages. Abstract: Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and leave the model stuck with that choice. We relax this rigidity by introducing an autoregressive U-Net that learns to embed its own tokens as it trains. The network reads raw bytes, pools them into words, then pairs of words, then up to 4 words, giving it a multi-scale view of the sequence. At deeper stages, the model must predict further into the future -- anticipating the next few words rather than the next byte -- so deeper stages focus on broader semantic patterns while earlier stages handle fine details. 
When carefully tuning and controlling pretraining compute, shallow hierarchies tie strong BPE baselines, and deeper hierarchies have a promising trend. Because tokenization now lives inside the model, the same system can handle character-level tasks and carry knowledge across low-resource languages.
[61] A Variational Framework for Improving Naturalness in Generative Spoken Language Models
Li-Wei Chen,Takuya Higuchi,Zakaria Aldeneh,Ahmed Hussen Abdelaziz,Alexander Rudnicky
Main category: cs.CL
TL;DR: To address the lack of naturalness in speech generation due to neglected prosodic information, this paper proposes an end-to-end variational approach that automatically encodes continuous speech attributes to enhance semantic tokens derived from self-supervised models.
Details
Motivation: The motivation is the limitation of current methods which discretize speech for autoregressive modeling and focus on linguistic aspects while neglecting prosodic information, leading to less natural speech generation. Existing fixes using pitch features are insufficient and require hand-engineering. Method: An end-to-end variational approach is proposed to automatically learn and encode continuous speech attributes, enhancing the semantic tokens without manual extraction or selection of paralinguistic features. Result: The model produces preferred speech continuations as rated by humans, indicating improved naturalness and quality of generated speech. Conclusion: This approach successfully overcomes limitations of previous methods by integrating continuous speech attributes into semantic tokens, offering a more natural speech generation process. Abstract: The success of large language models in text processing has inspired their adaptation to speech modeling. However, since speech is continuous and complex, it is often discretized for autoregressive modeling. Speech tokens derived from self-supervised models (known as semantic tokens) typically focus on the linguistic aspects of speech but neglect prosodic information. As a result, models trained on these tokens can generate speech with reduced naturalness. Existing approaches try to fix this by adding pitch features to the semantic tokens. However, pitch alone cannot fully represent the range of paralinguistic attributes, and selecting the right features requires careful hand-engineering. To overcome this, we propose an end-to-end variational approach that automatically learns to encode these continuous speech attributes to enhance the semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. Moreover, it produces preferred speech continuations according to human raters. 
Code, samples and models are available at https://github.com/b04901014/vae-gslm.
cs.CV [Back]
[62] Non-planar Object Detection and Identification by Features Matching and Triangulation Growth
Filippo Leveni
Main category: cs.CV
TL;DR: This paper presents a feature-based approach for object detection and identification using Delaunay triangulation and incremental grouping of feature matches.
Details
Motivation: Object detection and identification is crucial in many computer vision applications, but existing methods may fail when the template is non-planar or appears distorted in the image. Method: The proposed method uses Delaunay triangulation as a graph to iteratively group feature matches between the scene image and the template based on local consistency criteria derived from geometric and photometric properties. Result: The approach performs as well or better than homography-based RANSAC in scenarios with little distortion, and shows better performance when deformation becomes significant. Conclusion: This feature-based approach allows for accurate object identification even when geometric models do not hold, such as with non-planar templates or distorted images. Abstract: Object detection and identification is surely a fundamental topic in the computer vision field; it plays a crucial role in many applications such as object tracking, industrial robots control, image retrieval, etc. We propose a feature-based approach for detecting and identifying distorted occurrences of a given template in a scene image by incremental grouping of feature matches between the image and the template. For this purpose, we consider the Delaunay triangulation of template features as an useful tool through which to be guided in this iterative approach. The triangulation is treated as a graph and, starting from a single triangle, neighboring nodes are considered and the corresponding features are identified; then matches related to them are evaluated to determine if they are worthy to be grouped. This evaluation is based on local consistency criteria derived from geometric and photometric properties of local features. Our solution allows the identification of the object in situations where geometric models (e.g. homography) does not hold, thus enable the detection of objects such that the template is non planar or when it is planar but appears distorted in the image. 
We show that our approach performs just as well or better than application of homography-based RANSAC in scenarios in which distortion is nearly absent, while when the deformation becomes relevant our method shows better description performance.
[63] CDST: Color Disentangled Style Transfer for Universal Style Reference Customization
Shiwen Zhang,Zhuowei Chen,Lang Chen,Yanze Wu
Main category: cs.CV
TL;DR: CDST is a new two-stream style transfer training paradigm that separates color from style, providing universal style transfer capabilities without tuning during inference and achieving state-of-the-art results.
Details
Motivation: Current methods do not completely isolate color from style in style transfer tasks, leading to limitations in characteristics-preserved style transfer with style and content references. Method: The CDST method introduces a two-stream style transfer training paradigm which forces the style stream to be color-blinded. It uses multi-feature image embeddings compression to improve style similarity and preserves editing capability through a new style definition inspired by Diffusion UNet disentanglement law. Result: CDST achieves state-of-the-art results on various style transfer tasks as demonstrated by thorough qualitative and quantitative experiments and human evaluations. Conclusion: CDST successfully isolates color from style, provides universal style transfer capabilities without tuning during inference, and solves characteristics-preserved style transfer with style and content references for the first time. Abstract: We introduce Color Disentangled Style Transfer (CDST), a novel and efficient two-stream style transfer training paradigm which completely isolates color from style and forces the style stream to be color-blinded. With one same model, CDST unlocks universal style transfer capabilities in a tuning-free manner during inference. Especially, the characteristics-preserved style transfer with style and content references is solved in the tuning-free way for the first time. CDST significantly improves the style similarity by multi-feature image embeddings compression and preserves strong editing capability via our new CDST style definition inspired by Diffusion UNet disentanglement law. By conducting thorough qualitative and quantitative experiments and human evaluations, we demonstrate that CDST achieves state-of-the-art results on various style transfer tasks.
[64] Hidden Bias in the Machine: Stereotypes in Text-to-Image Models
Sedat Porikli,Vedat Porikli
Main category: cs.CV
TL;DR: Text-to-Image (T2I) models can generate realistic images from natural language prompts, but they may replicate societal biases. The researchers investigated this by curating a diverse set of prompts and generating over 16,000 images using Stable Diffusion 1.5 and Flux-1 models. They found significant disparities in the representation of gender, race, age, somatotype, and other human-centric factors across generated images, which often reinforce harmful stereotypes.
Details
Motivation: To investigate the potential of Text-to-Image models to replicate and magnify existing societal biases. Method: The researchers curated a diverse set of prompts spanning thematic categories such as occupations, traits, actions, ideologies, emotions, family roles, place descriptions, spirituality, and life events. For each of the 160 unique topics, they crafted multiple prompt variations to reflect a wide range of meanings and perspectives. Using Stable Diffusion 1.5 and Flux-1 models with original checkpoints, they generated over 16,000 images under consistent settings. Additionally, they collected 8,000 comparison images from Google Image Search. Result: The analysis reveals significant disparities in the representation of gender, race, age, somatotype, and other human-centric factors across generated images. These disparities often mirror and reinforce harmful stereotypes embedded in societal narratives. Conclusion: The researchers discuss the implications of these findings and emphasize the need for more inclusive datasets and development practices to foster fairness in generative visual systems. Abstract: Text-to-Image (T2I) models have transformed visual content creation, producing highly realistic images from natural language prompts. However, concerns persist around their potential to replicate and magnify existing societal biases. To investigate these issues, we curated a diverse set of prompts spanning thematic categories such as occupations, traits, actions, ideologies, emotions, family roles, place descriptions, spirituality, and life events. For each of the 160 unique topics, we crafted multiple prompt variations to reflect a wide range of meanings and perspectives. Using Stable Diffusion 1.5 (UNet-based) and Flux-1 (DiT-based) models with original checkpoints, we generated over 16,000 images under consistent settings. Additionally, we collected 8,000 comparison images from Google Image Search. 
All outputs were filtered to exclude abstract, distorted, or nonsensical results. Our analysis reveals significant disparities in the representation of gender, race, age, somatotype, and other human-centric factors across generated images. These disparities often mirror and reinforce harmful stereotypes embedded in societal narratives. We discuss the implications of these findings and emphasize the need for more inclusive datasets and development practices to foster fairness in generative visual systems.
[65] Fake it till You Make it: Reward Modeling as Discriminative Prediction
Runtao Liu,Jiahao Zhan,Yingqing He,Chen Wei,Alan Yuille,Qifeng Chen
Main category: cs.CV
TL;DR: The paper introduces GAN-RM, an efficient reward modeling framework for reinforcement learning that eliminates the need for manual preference annotation and explicit quality dimension engineering by using a small set of unpaired target samples.
Details
Motivation: Current reward modeling approaches in reinforcement learning are complex to implement due to reliance on extensive human-annotated preference data or engineered quality dimensions which can be incomplete and labor-intensive. Method: The method, GAN-RM, trains the reward model by discriminating between a small set of unpaired target samples (Preference Proxy Data) and model-generated outputs without requiring manual annotations or explicit quality dimension engineering. Result: Comprehensive experiments show the effectiveness of GAN-RM across multiple applications such as test-time scaling (Best-of-N sample filtering) and post-training approaches like Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Conclusion: GAN-RM provides an efficient alternative for reward modeling in reinforcement learning by simplifying implementation and reducing dependency on manual annotations and explicit quality dimensions. Abstract: An effective reward model plays a pivotal role in reinforcement learning for post-training enhancement of visual generative models. However, current approaches of reward modeling suffer from implementation complexity due to their reliance on extensive human-annotated preference data or meticulously engineered quality dimensions that are often incomplete and engineering-intensive. Inspired by adversarial training in generative adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward modeling framework that eliminates manual preference annotation and explicit quality dimension engineering. Our method trains the reward model through discrimination between a small set of representative, unpaired target samples(denoted as Preference Proxy Data) and model-generated ordinary outputs, requiring only a few hundred target samples. 
Comprehensive experiments demonstrate our GAN-RM's effectiveness across multiple key applications including test-time scaling implemented as Best-of-N sample filtering, post-training approaches like Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
[66] DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding
Thomas Kreutz,Max Mühlhäuser,Alejandro Sanchez Guinea
Main category: cs.CV
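TL;DR: LiDAR remains largely underexplored in multi-modal contrastive pre-training for human activity understanding. This paper presents DeSPITE, a model that learns a joint embedding space across four modalities (LiDAR point clouds, human skeleton poses, IMU data, and text) through noise contrastive estimation. Experiments show that DeSPITE enables novel human activity understanding tasks on point cloud sequences and is an effective pre-training strategy for point cloud HAR.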
Details
Motivation: In human activity understanding, including human activity recognition (HAR), retrieval, and person re-identification (RE-ID), LiDAR has not been sufficiently studied as an effective privacy-preserving alternative. Method: The paper proposes DeSPITE, a Deep Skeleton-Pointcloud-IMU-Text Embedding model that effectively learns a joint embedding space across these four modalities through noise contrastive estimation. The existing LIPD and Babel datasets are combined to synchronize data for all four modalities, which makes it possible to explore learning a new joint embedding space. Result: Experiments show that DeSPITE enables novel human activity understanding tasks, including Skeleton<->Pointcloud<->IMU matching, retrieval, and temporal moment retrieval. Experiments on MSR-Action3D and HMPEAR further show that DeSPITE is an effective pre-training strategy for point cloud HAR. Conclusion: DeSPITE successfully learns the correspondence between LiDAR point clouds, human skeleton poses, IMU data, and text, opening new possibilities for human activity understanding and proving effective as a pre-training strategy for point cloud HAR. Abstract: Despite LiDAR (Light Detection and Ranging) being an effective privacy-preserving alternative to RGB cameras to perceive human activities, it remains largely underexplored in the context of multi-modal contrastive pre-training for human activity understanding (e.g., human activity recognition (HAR), retrieval, or person re-identification (RE-ID)). To close this gap, our work explores learning the correspondence between LiDAR point clouds, human skeleton poses, IMU data, and text in a joint embedding space. More specifically, we present DeSPITE, a Deep Skeleton-Pointcloud-IMU-Text Embedding model, which effectively learns a joint embedding space across these four modalities through noise contrastive estimation. At the heart of our empirical exploration, we have combined the existing LIPD and Babel datasets, which enabled us to synchronize data of all four modalities, allowing us to explore the learning of a new joint embedding space. Our experiments demonstrate novel human activity understanding tasks for point cloud sequences enabled through DeSPITE, including Skeleton<->Pointcloud<->IMU matching, retrieval, and temporal moment retrieval. Furthermore, we show that DeSPITE is an effective pre-training strategy for point cloud HAR through experiments in MSR-Action3D and HMPEAR.
[67] OPTIMUS: Observing Persistent Transformations in Multi-temporal Unlabeled Satellite-data
Raymond Yu,Paul Han,Josh Myers-Dean,Piper Wolters,Favyen Bastani
Main category: cs.CV
TL;DR: OPTIMUS is a self-supervised learning method designed to detect changes in satellite images, which improves AUROC score from 56.3% to 87.6%.
Details
Motivation: Monitoring surface changes on Earth is crucial due to environmental issues, but supervised methods are limited by lack of labeled satellite data. Method: OPTIMUS uses self-supervised learning based on recovering the relative order of images in time series, applying change point detection methods on model outputs. Result: Achieved an improvement in AUROC score from 56.3% to 87.6% compared to baselines in distinguishing changed time series from unchanged ones. Conclusion: OPTIMUS can directly detect interesting changes in satellite images without needing extensive labeled data. Abstract: In the face of pressing environmental issues in the 21st century, monitoring surface changes on Earth is more important than ever. Large-scale remote sensing, such as satellite imagery, is an important tool for this task. However, using supervised methods to detect changes is difficult because of the lack of satellite data annotated with change labels, especially for rare categories of change. Annotation proves challenging due to the sparse occurrence of changes in satellite images. Even within a vast collection of images, only a small fraction may exhibit persistent changes of interest. To address this challenge, we introduce OPTIMUS, a self-supervised learning method based on an intuitive principle: if a model can recover information about the relative order of images in the time series, then that implies that there are long-lasting changes in the images. OPTIMUS demonstrates this principle by using change point detection methods on model outputs in a time series. We demonstrate that OPTIMUS can directly detect interesting changes in satellite images, achieving an improvement in AUROC score from 56.3% to 87.6% at distinguishing changed time series from unchanged ones compared to baselines. 
Our code and dataset are available at https://huggingface.co/datasets/optimus-change/optimus-dataset/.
[68] Intelligent Image Sensing for Crime Analysis: A ML Approach towards Enhanced Violence Detection and Investigation
Aritra Dutta,Pushpita Boral,G Suseela
Main category: cs.CV
TL;DR: The paper presents a comprehensive framework using Machine Learning for violence detection and classification in video streams, improving computational efficiency and accuracy.
Details
Motivation: The increasing global crime rate and limitations of traditional surveillance methods in promptly detecting violent acts highlight the need for automatic violence detection. Method: The framework employs Supervised Learning for binary and multi-class violence classification. The detection model uses 3D Convolutional Neural Networks while the classification model leverages separable convolutional 3D model for feature extraction and bidirectional LSTM for temporal processing. It is trained on diverse datasets with frame-level annotations. Result: Improved performance in terms of computational resource efficiency and accuracy is demonstrated. Conclusion: This paper successfully introduces a framework for violence detection and classification that shows improved efficiency and accuracy. Abstract: The increasing global crime rate, coupled with substantial human and property losses, highlights the limitations of traditional surveillance methods in promptly detecting diverse and unexpected acts of violence. Addressing this pressing need for automatic violence detection, we leverage Machine Learning to detect and categorize violent events in video streams. This paper introduces a comprehensive framework for violence detection and classification, employing Supervised Learning for both binary and multi-class violence classification. The detection model relies on 3D Convolutional Neural Networks, while the classification model utilizes the separable convolutional 3D model for feature extraction and bidirectional LSTM for temporal processing. Training is conducted on a diverse customized datasets with frame-level annotations, incorporating videos from surveillance cameras, human recordings, hockey fight, sohas and wvd dataset across various platforms. Additionally, a camera module integrated with raspberry pi is used to capture live video feed, which is sent to the ML model for processing. 
Thus, demonstrating improved performance in terms of computational resource efficiency and accuracy.
[69] HierVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment
Numair Nadeem,Saeed Anwar,Muhammad Hamza Asad,Abdul Bais
Main category: cs.CV
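TL;DR: HierVL is a unified framework that combines vision and language models for semi-supervised semantic segmentation and performs well under label scarcity and domain variability. It introduces a Hierarchical Semantic Query Generator, a Cross-Modal Spatial Alignment Module, and a Dual-Query Transformer Decoder, together with targeted regularization losses, and improves performance on multiple benchmark datasets.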
Details
Motivation: Existing vision models struggle to segment accurately under severe label scarcity and domain variability. Vision-only methods are prone to pixel misclassification and poor boundary localization, while vision-language models offer strong semantic understanding but lack spatial grounding. A method that combines the strengths of both is therefore needed. Method: The paper proposes HierVL, a unified framework with three key components: (1) a Hierarchical Semantic Query Generator that filters and projects abstract class embeddings to handle intra-class variability; (2) a Cross-Modal Spatial Alignment Module that aligns semantic queries with pixel features to sharpen boundaries; and (3) a Dual-Query Transformer Decoder that fuses semantic and instance-level queries to prevent instance collapse. Targeted regularization losses are also introduced to maintain vision-language alignment. Result: HierVL achieves mIoU improvements of +4.4% on COCO, +3.1% on Pascal VOC, +5.9% on ADE20, and +1.8% on Cityscapes under 1% supervision, demonstrating advantages in label efficiency and fine-grained, instance-aware generalization. Conclusion: Language-guided semantic segmentation narrows the label efficiency gap and unlocks new levels of fine-grained, instance-aware generalization. Abstract: Semi-supervised semantic segmentation remains challenging under severe label scarcity and domain variability. Vision-only methods often struggle to generalize, resulting in pixel misclassification between similar classes, poor generalization and boundary localization. Vision-Language Models offer robust, domain-invariant semantics but lack the spatial grounding required for dense prediction. We introduce HierVL, a unified framework that bridges this gap by integrating abstract text embeddings into a mask-transformer architecture tailored for semi-supervised segmentation. HierVL features three novel components: a Hierarchical Semantic Query Generator that filters and projects abstract class embeddings into multi-scale queries to suppress irrelevant classes and handle intra-class variability; a Cross-Modal Spatial Alignment Module that aligns semantic queries with pixel features for sharper boundaries under sparse supervision; and a Dual-Query Transformer Decoder that fuses semantic and instance-level queries to prevent instance collapse. We also introduce targeted regularization losses that maintain vision-language alignment throughout training to reinforce semantic grounding. 
HierVL establishes a new state-of-the-art by achieving a +4.4% improvement in mean Intersection over Union (mIoU) on COCO (with 232 labeled images), +3.1% on Pascal VOC (with 92 labels), +5.9% on ADE20 (with 158 labels) and +1.8% on Cityscapes (with 100 labels), demonstrating better performance under 1% supervision on four benchmark datasets. Our results show that language-guided segmentation closes the label efficiency gap and unlocks new levels of fine-grained, instance-aware generalization.
[70] Mapping Farmed Landscapes from Remote Sensing
Michelangelo Conserva,Alex Wilson,Charlotte Stanton,Vishal Batchu,Varun Gulshan
Main category: cs.CV
TL;DR: The paper presents Farmscapes, a large-scale, high-resolution map of rural landscape features in England created using deep learning. It aims to aid biodiversity management and habitat restoration.
Details
Motivation: Effective management of agricultural landscapes is essential for global biodiversity targets, but this is hindered by the lack of detailed, large-scale ecological maps. Method: A deep learning segmentation model was trained on a novel dataset of 942 manually annotated tiles from aerial imagery to generate a map (Farmscapes) covering most of England with a resolution of 25cm. The map includes ecologically important features like hedgerows, woodlands, and stone walls. Result: The model accurately identifies key habitats with high F1-scores for woodland (96%) and farmed land (95%), and effectively segments linear features with an F1-score of 72% for hedgerows. Conclusion: Farmscapes provides an open-access tool for ecologists and policymakers, enabling data-driven planning for habitat restoration and supporting initiatives such as the EU Biodiversity Strategy. It also lays the groundwork for analyzing landscape connectivity. Abstract: Effective management of agricultural landscapes is critical for meeting global biodiversity targets, but efforts are hampered by the absence of detailed, large-scale ecological maps. To address this, we introduce Farmscapes, the first large-scale (covering most of England), high-resolution (25cm) map of rural landscape features, including ecologically vital elements like hedgerows, woodlands, and stone walls. This map was generated using a deep learning segmentation model trained on a novel dataset of 942 manually annotated tiles derived from aerial imagery. Our model accurately identifies key habitats, achieving high F1-scores for woodland (96%) and farmed land (95%), and demonstrates strong capability in segmenting linear features, with an F1-score of 72% for hedgerows. By releasing the England-wide map on Google Earth Engine, we provide a powerful, open-access tool for ecologists and policymakers. 
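For readers unfamiliar with the per-class F1-scores quoted above, the following is a minimal sketch of how such a score can be computed from pixel labels. The function name `f1_per_class` and the toy labels are illustrative assumptions, not the authors' evaluation code.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class: 2*TP / (2*TP + FP + FN), computed over paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    # Convention chosen here (an assumption): return 0.0 when the class never appears.
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Toy "pixels"; class names borrowed from the abstract, values are made up.
    truth = ["hedgerow", "hedgerow", "woodland", "farmed", "farmed", "hedgerow"]
    pred = ["hedgerow", "farmed", "woodland", "farmed", "hedgerow", "hedgerow"]
    for name in ("hedgerow", "woodland", "farmed"):
        print(name, round(f1_per_class(truth, pred, name), 3))
```

Thin linear features such as hedgerows occupy few pixels, so a handful of boundary errors moves this score more than it would for large classes, which may partly explain the lower hedgerow figure.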
This work enables data-driven planning for habitat restoration, supports the monitoring of initiatives like the EU Biodiversity Strategy, and lays the foundation for advanced analysis of landscape connectivity.
[71] FindMeIfYouCan: Bringing Open Set metrics to near, far and farther Out-of-Distribution Object Detection
Daniel Montoya,Aymen Bouguerra,Alexandra Gomez-Villa,Fabio Arnez
Main category: cs.CV
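TL;DR: The current OOD-OD evaluation protocol is flawed and may lead to overconfidence about unknown-object detection in real deployments. The paper manually curates and enriches the existing benchmark, uses semantic similarity to create new evaluation splits, and adds established metrics from the Open Set community for a comprehensive evaluation.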
Details
Motivation: Existing object detection methods mostly rely on a closed-world assumption, but in critical application domains such as autonomous driving and medical imaging, detecting and localizing unknown objects is essential for safety. Studying OOD detection is therefore important for improving object detection. Method: The authors first identify the problems with the current OOD-OD evaluation protocol, then manually create new evaluation splits (near, far, farther) based on semantic similarity, and introduce established metrics from the Open Set community to give a deeper view of performance. Result: OOD objects that are semantically and visually close are easier to localize than far ones, but are also more easily confused with ID objects; far and farther objects are harder to localize but are less often mistaken for ID objects. Conclusion: With the new evaluation splits and metrics, the authors show how detection difficulty and confusion vary across categories of OOD objects, providing a more comprehensive evaluation framework for future research. Abstract: State-of-the-art Object Detection (OD) methods predominantly operate under a closed-world assumption, where test-time categories match those encountered during training. However, detecting and localizing unknown objects is crucial for safety-critical applications in domains such as autonomous driving and medical imaging. Recently, Out-Of-Distribution (OOD) detection has emerged as a vital research direction for OD, focusing on identifying incorrect predictions typically associated with unknown objects. This paper shows that the current evaluation protocol for OOD-OD violates the assumption of non-overlapping objects with respect to the In-Distribution (ID) datasets, and obscures crucial situations such as ignoring unknown objects, potentially leading to overconfidence in deployment scenarios where truly novel objects might be encountered. To address these limitations, we manually curate and enrich the existing benchmark by exploiting semantic similarity to create new evaluation splits categorized as near, far, and farther from ID distributions. Additionally, we incorporate established metrics from the Open Set community, providing deeper insights into how effectively methods detect unknowns, when they ignore them, and when they mistakenly classify OOD objects as ID. Our comprehensive evaluation demonstrates that semantically and visually close OOD objects are easier to localize than far ones, but are also more easily confounded with ID objects. 
Far and farther objects are harder to localize but less prone to be taken for an ID object.
[72] Disentangling 3D from Large Vision-Language Models for Controlled Portrait Generation
Nick Yiwen Huang,Akin Caliskan,Berkay Kicanaoglu,James Tompkin,Hyeongwoo Kim
Main category: cs.CV
TL;DR: The paper presents a method for disentangling 3D features from large vision-language models, enabling free-form text and 3D geometry control of generative 3D portraits while overcoming issues with noise in the model's embedding space.
Details
Motivation: To enable creators to control 3D generators using their own 2D face data without needing extensive resources for labeling large datasets or training large models. Method: The approach uses a pre-trained large vision-language model (LVLM) and a predefined 3D morphable model (FLAME). It disentangles features by canonicalization to a 2D reference frame from a deformable neural 3D triplane representation. To address noise in the LVLM's embedding space, the method employs Jacobian regularization computed efficiently with a stochastic approximator. Result: Compared to existing methods, this approach produces high-quality 3D portraits that remain consistent when either text or 3D controls are changed. Conclusion: This work successfully disentangles 3D features from vision-language models, allowing free-form text and 3D geometry control over generative 3D portraits. It enables creators to use their own 2D face data without requiring extensive resources. Abstract: We consider the problem of disentangling 3D from large vision-language models, which we show on generative 3D portraits. This allows free-form text control of appearance attributes like age, hair style, and glasses, and 3D geometry control of face expression and camera pose. In this setting, we assume we use a pre-trained large vision-language model (LVLM; CLIP) to generate from a smaller 2D dataset with no additional paired labels and with a pre-defined 3D morphable model (FLAME). First, we disentangle using canonicalization to a 2D reference frame from a deformable neural 3D triplane representation. But another form of entanglement arises from the significant noise in the LVLM's embedding space that describes irrelevant features. This damages output quality and diversity, but we overcome this with a Jacobian regularization that can be computed efficiently with a stochastic approximator. 
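The abstract does not specify the stochastic approximator, so the following is only an illustrative sketch of one common choice: a Hutchinson-style estimate of the squared Frobenius norm of a Jacobian, which uses the identity E[||Jv||^2] = ||J||_F^2 for v drawn from a standard normal. The names `jvp_fd` and `jacobian_penalty`, and the use of finite differences in place of automatic differentiation, are assumptions made to keep the example dependency-free.

```python
import random


def jvp_fd(f, x, v, eps=1e-5):
    """Central finite-difference approximation of the product J(x) @ v."""
    x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
    x_minus = [xi - eps * vi for xi, vi in zip(x, v)]
    return [(a - b) / (2 * eps) for a, b in zip(f(x_plus), f(x_minus))]


def jacobian_penalty(f, x, num_probes=1, rng=None):
    """Monte Carlo estimate of ||J(x)||_F^2 using Gaussian probe vectors.

    Each probe costs two evaluations of f, independent of the input size,
    which is what makes the estimate cheap compared with forming J explicitly.
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(num_probes):
        v = [rng.gauss(0.0, 1.0) for _ in x]
        jv = jvp_fd(f, x, v)
        total += sum(c * c for c in jv)
    return total / num_probes


if __name__ == "__main__":
    # Linear map with a known Jacobian, so the estimate can be sanity-checked:
    # the squared Frobenius norm of A is 1 + 4 + 9 + 16 = 30.
    A = [[1.0, 2.0], [3.0, 4.0]]

    def linear(x):
        return [sum(a * b for a, b in zip(row, x)) for row in A]

    print(jacobian_penalty(linear, [0.5, -0.2], num_probes=5000))
```

In a training loop a penalty like this would typically be added to the loss with a small weight and a single probe per step; whether the paper does so is not stated in the abstract.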
Compared to existing methods, our approach produces portraits with added text and 3D control, where portraits remain consistent when either control is changed. Broadly, this approach lets creators control 3D generators on their own 2D face data without needing resources to label large data or train large models.
[73] SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement
Chelsi Jain,Yiran Wu,Yifan Zeng,Jiale Liu,Shengyu Dai,Zhenwen Shao,Qingyun Wu,Huazheng Wang
Main category: cs.CV
TL;DR: The paper presents SimpleDoc, a retrieval-augmented framework for Document Visual Question Answering (DocVQA) that efficiently gathers evidence pages using a dual-cue retriever and outperforms previous baselines.
Details
Motivation: Existing methods for DocVQA follow a Retrieval Augmented Generation (RAG) pipeline but use Visual Language Models (VLMs) to embed and retrieve relevant pages as images. There is room for improvement in the efficiency and effectiveness of evidence page gathering. Method: SimpleDoc uses a lightweight retrieval-augmented framework which boosts evidence page gathering by retrieving candidates through embedding similarity, then filtering and re-ranking them based on page summaries. A VLM-based reasoner agent iteratively pulls fresh pages into working memory until the question is answered confidently. Result: SimpleDoc outperforms previous baselines by 3.2% on average across 4 DocVQA datasets while retrieving significantly fewer pages. Conclusion: SimpleDoc is a powerful and efficient framework for DocVQA that improves upon existing methods by more effectively gathering evidence pages and generating accurate answers. Abstract: Document Visual Question Answering (DocVQA) is a practical yet challenging task in which questions are asked about documents and answering them requires referring to multiple pages and different modalities of information, e.g., images and tables. To handle multi-modality, recent methods follow a similar Retrieval Augmented Generation (RAG) pipeline, but use a Visual Language Model (VLM)-based embedding model to embed and retrieve relevant pages as images, and generate answers with VLMs that can accept an image as input. In this paper, we introduce SimpleDoc, a lightweight yet powerful retrieval-augmented framework for DocVQA. It boosts evidence page gathering by first retrieving candidates through embedding similarity and then filtering and re-ranking these candidates based on page summaries. A single VLM-based reasoner agent repeatedly invokes this dual-cue retriever, iteratively pulling fresh pages into a working memory until the question is confidently answered. 
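The two-stage retrieval described above can be sketched as follows. This is a toy illustration under stated assumptions, not the SimpleDoc implementation: the page dictionaries, the function names, and the word-overlap `summary_score` (a stand-in for whatever model-based filtering over page summaries the paper uses) are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def summary_score(question, summary):
    """Fraction of question words found in the page summary (toy second cue)."""
    q_words = set(question.lower().split())
    s_words = set(summary.lower().split())
    return len(q_words & s_words) / len(q_words) if q_words else 0.0


def dual_cue_retrieve(question, q_emb, pages, top_k=3, keep=2):
    """Stage 1: top_k pages by embedding similarity.
    Stage 2: drop candidates whose summary shares nothing with the question,
    then re-rank the rest by summary score and keep the best `keep`."""
    candidates = sorted(pages, key=lambda p: cosine(q_emb, p["emb"]), reverse=True)[:top_k]
    scored = [(summary_score(question, p["summary"]), p) for p in candidates]
    kept = [sp for sp in scored if sp[0] > 0]
    kept.sort(key=lambda sp: sp[0], reverse=True)
    return [p["id"] for _, p in kept[:keep]]


if __name__ == "__main__":
    pages = [
        {"id": 1, "emb": [1.0, 0.0], "summary": "revenue table for 2023"},
        {"id": 2, "emb": [0.9, 0.1], "summary": "photo of the office building"},
        {"id": 3, "emb": [0.0, 1.0], "summary": "revenue chart by region"},
        {"id": 4, "emb": [0.8, 0.3], "summary": "total revenue summary and notes"},
    ]
    print(dual_cue_retrieve("what was total revenue", [1.0, 0.1], pages))
```

In this toy data page 2 is the closest match by embedding yet is discarded by the summary cue, which is the kind of correction a second cue is meant to provide. The iterative reasoner loop that re-invokes the retriever is not shown.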
SimpleDoc outperforms previous baselines by 3.2% on average on 4 DocVQA datasets while retrieving far fewer pages. Our code is available at https://github.com/ag2ai/SimpleDoc.
[74] Image Segmentation with Large Language Models: A Survey with Perspectives for Intelligent Transportation Systems
Sanjeda Akter,Ibne Farabi Shihab,Anuj Sharma
Main category: cs.CV
TL;DR: The paper surveys the integration of LLMs with computer vision for image segmentation in ITS, discussing applications, challenges, and future directions.
Details
Motivation: To systematically review the emerging field of LLM-augmented image segmentation within ITS and highlight its potential impact on safety and efficiency. Method: Providing a taxonomy of current approaches based on prompting mechanisms and core architectures, focusing on applications like autonomous driving, traffic monitoring, and infrastructure maintenance. Result: Identified key challenges such as real-time performance and safety-critical reliability, emphasizing the need for explainable and human-centric AI. Conclusion: Successful deployment of LLM-augmented image segmentation in next-generation transportation systems requires addressing challenges and prioritizing explainability and human-centric AI. Abstract: The integration of Large Language Models (LLMs) with computer vision is profoundly transforming perception tasks like image segmentation. For intelligent transportation systems (ITS), where accurate scene understanding is critical for safety and efficiency, this new paradigm offers unprecedented capabilities. This survey systematically reviews the emerging field of LLM-augmented image segmentation, focusing on its applications, challenges, and future directions within ITS. We provide a taxonomy of current approaches based on their prompting mechanisms and core architectures, and we highlight how these innovations can enhance road scene understanding for autonomous driving, traffic monitoring, and infrastructure maintenance. Finally, we identify key challenges, including real-time performance and safety-critical reliability, and outline a perspective centered on explainable, human-centric AI as a prerequisite for the successful deployment of this technology in next-generation transportation systems.
[75] FADPNet: Frequency-Aware Dual-Path Network for Face Super-Resolution
Siyu Xu,Wenjie Li,Guangwei Gao,Jian Yang,Guo-Jun Qi,Chia-Wen Lin
Main category: cs.CV
TL;DR: In this paper, researchers tackle the challenge of face super-resolution under limited computational resources. They propose FADPNet which splits facial features into low- and high-frequency components processed by specialized branches. A Mamba-based block enhances low-frequency features while a CNN-based module refines high-frequency details. This approach achieves superior FSR quality with improved efficiency.
Details
Motivation: Existing face super-resolution methods treat all facial pixels equally, leading to inefficient use of computational resources and degraded performance. The authors observe that CNNs are sensitive to high-frequency features like contours and outlines, whereas Mamba captures low-frequency features such as color and texture more effectively and with lower complexity than Transformers. Method: The proposed method, FADPNet, decomposes facial features into low- and high-frequency components and processes them through dedicated branches. For low-frequency regions, a Mamba-based Low-Frequency Enhancement Block (LFEB) is introduced, combining state-space attention with squeeze-and-excitation operations. For high-frequency regions, a CNN-based Deep Position-Aware Attention (DPA) module enhances spatially-dependent structural details, complemented by a lightweight High-Frequency Refinement (HFR) module. Result: FADPNet achieves an excellent balance between face super-resolution quality and model efficiency, outperforming existing approaches in terms of both performance and computational cost. Conclusion: The Frequency-Aware Dual-Path Network (FADPNet) successfully improves face super-resolution quality while maintaining model efficiency by separately processing low- and high-frequency facial features using specialized modules. Abstract: Face super-resolution (FSR) under limited computational costs remains an open problem. Existing approaches typically treat all facial pixels equally, resulting in suboptimal allocation of computational resources and degraded FSR performance. CNN is relatively sensitive to high-frequency facial features, such as component contours and facial outlines. Meanwhile, Mamba excels at capturing low-frequency features like facial color and fine-grained texture, and does so with lower complexity than Transformers. 
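The idea of splitting a signal into low- and high-frequency parts, as described in the method summary above, can be illustrated with a very simple decomposition. This sketch is an assumption for illustration only: it uses a 1-D moving average as the low-pass filter and takes the residual as the high-frequency part, whereas the paper's actual decomposition operates on learned facial feature maps and is not specified here.

```python
def box_blur(signal, radius=1):
    """Moving-average low-pass filter; windows are truncated at the borders."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def split_frequencies(signal, radius=1):
    """Return (low, high) such that low[i] + high[i] reconstructs signal[i]."""
    low = box_blur(signal, radius)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


if __name__ == "__main__":
    # A flat region with one sharp spike: the spike dominates the high band.
    low, high = split_frequencies([0.0, 0.0, 10.0, 0.0, 0.0])
    print("low :", [round(v, 2) for v in low])
    print("high:", [round(v, 2) for v in high])
```

Smooth content (color, gradual shading) ends up mostly in `low`, while edges and contours end up in `high`, which is the property that lets the two bands be routed to different processing branches.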
Motivated by these observations, we propose FADPNet, a Frequency-Aware Dual-Path Network that decomposes facial features into low- and high-frequency components and processes them via dedicated branches. For low-frequency regions, we introduce a Mamba-based Low-Frequency Enhancement Block (LFEB), which combines state-space attention with squeeze-and-excitation operations to extract low-frequency global interactions and emphasize informative channels. For high-frequency regions, we design a CNN-based Deep Position-Aware Attention (DPA) module to enhance spatially-dependent structural details, complemented by a lightweight High-Frequency Refinement (HFR) module that further refines frequency-specific representations. Through the above designs, our method achieves an excellent balance between FSR quality and model efficiency, outperforming existing approaches.
[76] KDMOS: Knowledge Distillation for Motion Segmentation
Chunyu Cao,Jintao Cheng,Zeyu Chen,Linfan Zhan,Rui Fan,Zhijian He,Xiaoyu Tang
Main category: cs.CV
TL;DR: The paper proposes a logits-based knowledge distillation framework for Motion Object Segmentation (MOS) to improve accuracy while maintaining real-time efficiency.
Details
Motivation: Motion Object Segmentation is crucial for autonomous driving, but existing methods struggle to balance accuracy and real-time inference. Method: A Bird's Eye View (BEV) projection-based model acts as the student and a non-projection model as the teacher. The moving and non-moving classes are decoupled with tailored distillation strategies. Dynamic upsampling and network architecture optimization are also introduced. Result: Achieves a notable IoU of 78.8% on the hidden test set of the SemanticKITTI-MOS dataset and competitive results on the Apollo dataset. There is a 7.69% reduction in parameter count. Conclusion: The proposed method significantly reduces false positives and negatives and mitigates overfitting. Abstract: Motion Object Segmentation (MOS) is crucial for autonomous driving, as it enhances localization, path planning, map construction, scene flow estimation, and future state prediction. While existing methods achieve strong performance, balancing accuracy and real-time inference remains a challenge. To address this, we propose a logits-based knowledge distillation framework for MOS, aiming to improve accuracy while maintaining real-time efficiency. Specifically, we adopt a Bird's Eye View (BEV) projection-based model as the student and a non-projection model as the teacher. To handle the severe imbalance between moving and non-moving classes, we decouple them and apply tailored distillation strategies, allowing the teacher model to better learn key motion-related features. This approach significantly reduces false positives and false negatives. Additionally, we introduce dynamic upsampling, optimize the network architecture, and achieve a 7.69% reduction in parameter count, mitigating overfitting. Our method achieves a notable IoU of 78.8% on the hidden test set of the SemanticKITTI-MOS dataset and delivers competitive results on the Apollo dataset. 
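As background for the logits-based distillation mentioned above, here is a minimal sketch of the standard temperature-scaled distillation loss (KL divergence between softened teacher and student distributions). It is the generic formulation, not the paper's decoupled per-class strategy, and the function names and temperature value are assumptions.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - lse for z in scaled]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is the usual convention for keeping gradient magnitudes
    comparable across temperatures.
    """
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [4.0, 0.5]  # e.g. logits for (moving, non-moving) at one point
    student = [1.0, 1.2]
    print("mismatched:", kd_loss(student, teacher))
    print("matched   :", kd_loss(teacher, teacher))
```

In practice this term is usually combined with the ordinary supervised loss on ground-truth labels; how KDMOS weights or decouples the terms for moving and non-moving classes is not detailed in the abstract.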
The KDMOS implementation is available at https://github.com/SCNU-RISLAB/KDMOS.
[77] Interpreting Biomedical VLMs on High-Imbalance Out-of-Distributions: An Insight into BiomedCLIP on Radiology
Nafiz Sadman,Farhana Zulkernine,Benjamin Kwan
Main category: cs.CV
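TL;DR: This study explores the embedding space of BiomedCLIP and quantifies its limitations on a highly imbalanced, out-of-distribution multi-label medical dataset. Experiments with zero-shot inference, full fine-tuning, and linear probing on the IU-xray dataset show that the model over-predicts all labels in the zero-shot setting, that full fine-tuning improves disease classification, and that linear probing detects overlapping features. Grad-CAM heatmaps are used to show the model's visual understanding and are compared with 15 annotations by a radiologist. The study stresses that models need careful adaptation to be reliable and applicable in real-world settings.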
Details
Motivation: To explore the embedding space learned by BiomedCLIP in order to analyse meaningful class separations, and to quantify its limitations when applied to a highly imbalanced, out-of-distribution multi-label medical dataset. Method: Experiments are run on the IU-xray dataset, evaluating BiomedCLIP's ability to classify images (radiographs) in three contexts: zero-shot inference, full fine-tuning, and linear probing. Grad-CAM heatmaps are used to show the model's visual understanding and are compared with a radiologist's annotations. Result: In the zero-shot setting the model over-predicts all labels, leading to poor precision and inter-class separability. Full fine-tuning improves classification of distinct diseases, and linear probing detects overlapping features. Grad-CAM heatmaps show the model's visual understanding and are compared with the radiologist's annotations. Conclusion: Models need careful adaptation to ensure reliability and applicability in the real world. The experiment code is available and maintained on GitHub. Abstract: In this paper, we construct two research objectives: i) explore the learned embedding space of BiomedCLIP, an open-source large vision language model, to analyse meaningful class separations, and ii) quantify the limitations of BiomedCLIP when applied to a highly imbalanced, out-of-distribution multi-label medical dataset. We experiment on the IU-xray dataset, which exhibits the aforementioned criteria, and evaluate BiomedCLIP in classifying images (radiographs) in three contexts: zero-shot inference, full fine-tuning, and linear probing. The results show that the model under zero-shot settings over-predicts all labels, leading to poor precision and inter-class separability. Full fine-tuning improves classification of distinct diseases, while linear probing detects overlapping features. We demonstrate visual understanding of the model using Grad-CAM heatmaps and compare with 15 annotations by a radiologist. We highlight the need for careful adaptations of the models to foster reliability and applicability in a real-world setting. The code for the experiments in this work is available and maintained on GitHub.
[78] RadFabric: Agentic AI System with Reasoning Capability for Radiology
Wenting Chen,Yi Dong,Zhaojun Ding,Yucheng Shi,Yifan Zhou,Fang Zeng,Yijun Luo,Tianyu Lin,Yihang Su,Yichen Wu,Kai Zhang,Zhen Xiang,Tianming Liu,Ninghao Liu,Lichao Sun,Yixuan Yuan,Xiang Li
Main category: cs.CV
TL;DR: RadFabric is a new framework that combines visual and textual analysis for better chest X-ray interpretation, showing significant improvements in detecting challenging pathologies.
Details
Motivation: Current automated systems for chest X-ray imaging have limitations in pathology coverage, diagnostic accuracy, and integration of visual and textual reasoning. Method: RadFabric uses specialized CXR agents for pathology detection, an Anatomical Interpretation Agent for mapping visual findings to anatomical structures, and a Reasoning Agent powered by large multimodal models for synthesizing data into diagnoses. It's built on the Model Context Protocol for modularity, interoperability, and scalability. Result: RadFabric achieves near-perfect detection accuracy for challenging pathologies like fractures (1.000) and superior overall diagnostic accuracy (0.799) compared to traditional systems (0.229 to 0.527). Conclusion: RadFabric advances AI-driven radiology by providing transparent, anatomically precise, and clinically actionable chest X-ray analysis through cross modal feature alignment and preference-driven reasoning. Abstract: Chest X ray (CXR) imaging remains a critical diagnostic tool for thoracic conditions, but current automated systems face limitations in pathology coverage, diagnostic accuracy, and integration of visual and textual reasoning. To address these gaps, we propose RadFabric, a multi agent, multimodal reasoning framework that unifies visual and textual analysis for comprehensive CXR interpretation. RadFabric is built on the Model Context Protocol (MCP), enabling modularity, interoperability, and scalability for seamless integration of new diagnostic agents. The system employs specialized CXR agents for pathology detection, an Anatomical Interpretation Agent to map visual findings to precise anatomical structures, and a Reasoning Agent powered by large multimodal reasoning models to synthesize visual, anatomical, and clinical data into transparent and evidence based diagnoses. 
RadFabric achieves significant performance improvements, with near-perfect detection of challenging pathologies like fractures (1.000 accuracy) and superior overall diagnostic accuracy (0.799) compared to traditional systems (0.229 to 0.527). By integrating cross modal feature alignment and preference-driven reasoning, RadFabric advances AI-driven radiology toward transparent, anatomically precise, and clinically actionable CXR analysis.
[79] SceneAware: Scene-Constrained Pedestrian Trajectory Prediction with LLM-Guided Walkability
Juho Bai,Inwook Shim
Main category: cs.CV
TL;DR: This paper proposes SceneAware, a framework for pedestrian trajectory prediction that incorporates scene understanding via Vision Transformer and Multi-modal Large Language Models. It outperforms state-of-the-art methods by over 50% on ETH/UCY datasets.
Details
Motivation: Existing trajectory prediction models focus mainly on social interactions among pedestrians but often neglect the environmental context which significantly influences human movement patterns. Method: The method uses a Vision Transformer (ViT) scene encoder to process static scene images and Multi-modal Large Language Models (MLLMs) to generate binary walkability masks. It combines a Transformer-based trajectory encoder with the ViT-based scene encoder to capture temporal dynamics and spatial constraints. Collision penalty mechanisms are integrated to ensure physically plausible predictions. Implemented in both deterministic and stochastic variants. Result: Experiments on ETH/UCY benchmark datasets demonstrate that SceneAware outperforms current state-of-the-art methods with more than 50% improvement. The model performs consistently well across different types of pedestrian trajectories. Conclusion: Incorporating explicit scene information is crucial for accurate and physically plausible trajectory predictions. The proposed SceneAware approach is effective and reliable. Abstract: Accurate prediction of pedestrian trajectories is essential for applications in robotics and surveillance systems. While existing approaches primarily focus on social interactions between pedestrians, they often overlook the rich environmental context that significantly shapes human movement patterns. In this paper, we propose SceneAware, a novel framework that explicitly incorporates scene understanding to enhance trajectory prediction accuracy. Our method leverages a Vision Transformer (ViT) scene encoder to process environmental context from static scene images, while Multi-modal Large Language Models (MLLMs) generate binary walkability masks that distinguish between accessible and restricted areas during training. We combine a Transformer-based trajectory encoder with the ViT-based scene encoder, capturing both temporal dynamics and spatial constraints. 
The framework integrates collision penalty mechanisms that discourage predicted trajectories from violating physical boundaries, ensuring physically plausible predictions. SceneAware is implemented in both deterministic and stochastic variants. Comprehensive experiments on the ETH/UCY benchmark datasets show that our approach outperforms state-of-the-art methods, with more than 50% improvement over previous models. Our analysis based on different trajectory categories shows that the model performs consistently well across various types of pedestrian movement. This highlights the importance of using explicit scene information and shows that our scene-aware approach is both effective and reliable in generating accurate and physically plausible predictions. Code is available at: https://github.com/juho127/SceneAware.
[80] VideoMAR: Autoregressive Video Generation with Continuous Tokens
Hu Yu,Biao Gong,Hangjie Yuan,DanDan Zheng,Weilong Chai,Jingdong Chen,Kecheng Zheng,Feng Zhao
Main category: cs.CV
TL;DR: VideoMAR is a novel autoregressive image-to-video model which integrates temporal causality and spatial bi-directionality, employing next-frame diffusion loss. It uses curriculum learning and progressive resolution training to manage long sequence modeling costs, replicates language model capacities for video generation, and achieves superior performance with fewer resources.
Details
Motivation: The motivation of this paper is to explore the potential of masked-based autoregressive models in video generation, addressing the under-explored area in continuous space beyond image generation. The authors aim to develop an efficient decoder-only model that can handle both temporal and spatial aspects of video generation while managing the high cost and difficulty associated with long sequence autoregressive modeling. Method: The method involves proposing VideoMAR, a concise and efficient decoder-only autoregressive image-to-video model with continuous tokens. It incorporates temporal causality and spatial bi-directionality as principles, utilizes next-frame diffusion loss for integration, and employs temporal short-to-long curriculum learning and spatial progressive resolution training. Progressive temperature strategy is used at inference time to mitigate accumulation error. VideoMAR also replicates language model capacities such as simultaneous temporal-wise KV cache and spatial-wise parallel generation. Result: On the VBench-I2V benchmark, VideoMAR surpasses the previous state-of-the-art (Cosmos I2V) while requiring significantly fewer parameters (9.3%), training data (0.5%), and GPU resources (0.2%). This demonstrates its superior efficiency and effectiveness in video generation. Conclusion: VideoMAR successfully addresses the challenges of video generation by integrating temporal causality and spatial bi-directionality, employing effective loss functions and training strategies. It achieves state-of-the-art performance with much lower resource requirements, making it a significant advancement in the field of video generation. Abstract: Masked-based autoregressive models have demonstrated promising image generation capability in continuous space. However, their potential for video generation remains under-explored. 
In this paper, we propose VideoMAR, a concise and efficient decoder-only autoregressive image-to-video model with continuous tokens, composing temporal frame-by-frame and spatial masked generation. We first identify temporal causality and spatial bi-directionality as the first principle of video AR models, and propose the next-frame diffusion loss for the integration of mask and video generation. Besides, the huge cost and difficulty of long sequence autoregressive modeling is a basic but crucial issue. To this end, we propose the temporal short-to-long curriculum learning and spatial progressive resolution training, and employ progressive temperature strategy at inference time to mitigate the accumulation error. Furthermore, VideoMAR replicates several unique capacities of language models to video generation. It inherently bears high efficiency due to simultaneous temporal-wise KV cache and spatial-wise parallel generation, and presents the capacity of spatial and temporal extrapolation via 3D rotary embeddings. On the VBench-I2V benchmark, VideoMAR surpasses the previous state-of-the-art (Cosmos I2V) while requiring significantly fewer parameters (9.3%), training data (0.5%), and GPU resources (0.2%).
[81] A multi-stage augmented multimodal interaction network for fish feeding intensity quantification
Shulong Zhang,Mingyuan Yao,Jiayin Zhao,Xiao Liu,Haihua Wang
Main category: cs.CV
TL;DR: In recirculating aquaculture systems, a Multi-stage Augmented Multimodal Interaction Network (MAINet) is proposed to quantify fish feeding intensity. It integrates image, audio and water wave data through a feature extraction framework, an Auxiliary-modality Reinforcement Primary-modality Mechanism (ARPM), and an Evidence Reasoning rule. Experimental results show MAINet's superior performance in accuracy, precision, recall and F1-Score.
Details
Motivation: Accurate assessment of fish feeding intensity is crucial for reducing feed costs and calculating optimal feeding times in recirculating aquaculture systems. Current studies have limitations in modality selection, feature extraction and fusion, and co-inference for decision making. Method: The study proposes MAINet which includes a general feature extraction framework for input data (image, audio, water wave), ARPM with CAFN and DAFN for inter-modal interaction, and Evidence Reasoning rule for fusing output results and making decisions. Result: MAINet achieves 96.76% accuracy, 96.78% precision, 96.79% recall and 96.79% F1-Score. Its performance surpasses comparison models including those with single-modality and dual-modality fusion. Ablation experiments confirm the effectiveness of the improvement strategy in enhancing model robustness and feature utilization efficiency. Conclusion: MAINet significantly improves the accuracy, applicability and reliability of multimodal fusion models for quantifying fish feeding intensity in aquaculture systems. Abstract: In recirculating aquaculture systems, accurate and effective assessment of fish feeding intensity is crucial for reducing feed costs and calculating optimal feeding times. However, current studies have limitations in modality selection, feature extraction and fusion, and co-inference for decision making, which restrict further improvement in the accuracy, applicability and reliability of multimodal fusion models. To address this problem, this study proposes a Multi-stage Augmented Multimodal Interaction Network (MAINet) for quantifying fish feeding intensity. First, a general feature extraction framework is proposed to efficiently extract feature information from input image, audio and water wave data. 
Second, an Auxiliary-modality Reinforcement Primary-modality Mechanism (ARPM) is designed to perform inter-modal interaction and generate enhanced features; it consists of a Channel Attention Fusion Network (CAFN) and a Dual-mode Attention Fusion Network (DAFN). Finally, an Evidence Reasoning (ER) rule is introduced to fuse the output results of each modality and make decisions, thereby completing the quantification of fish feeding intensity. The experimental results show that MAINet reaches 96.76%, 96.78%, 96.79% and 96.79% in accuracy, precision, recall and F1-Score, respectively, significantly outperforming the comparison models. It also has clear advantages over models that adopt single-modality input, dual-modality fusion, or different decision-level fusion methods. Meanwhile, the ablation experiments further verify the key role of the proposed improvement strategy in improving the robustness and feature utilization efficiency of the model, which effectively improves the accuracy of the quantified fish feeding intensity.
[82] One-Shot Neural Architecture Search with Network Similarity Directed Initialization for Pathological Image Classification
Renao Yan
Main category: cs.CV
TL;DR: The paper proposes a NSDI strategy and incorporates domain adaptation into one-shot NAS for pathological image analysis, improving classification and feature localization.
Details
Motivation: Existing deep learning methods for pathological image analysis are inefficient due to the direct application of computer vision models without considering the unique characteristics of pathological images. Method: Propose a Network Similarity Directed Initialization (NSDI) strategy to enhance neural architecture search stability and incorporate domain adaptation into one-shot NAS to handle variations in staining and semantic scale across pathology datasets. Result: Experiments on the BRACS dataset show superior classification performance and clinically relevant feature localization compared to existing approaches. Conclusion: NSDI and domain adaptation in one-shot NAS improve efficiency and effectiveness in pathological image analysis. Abstract: Deep learning-based pathological image analysis presents unique challenges due to the practical constraints of network design. Most existing methods apply computer vision models directly to medical tasks, neglecting the distinct characteristics of pathological images. This mismatch often leads to computational inefficiencies, particularly in edge-computing scenarios. To address this, we propose a novel Network Similarity Directed Initialization (NSDI) strategy to improve the stability of neural architecture search (NAS). Furthermore, we introduce domain adaptation into one-shot NAS to better handle variations in staining and semantic scale across pathology datasets. Experiments on the BRACS dataset demonstrate that our method outperforms existing approaches, delivering both superior classification performance and clinically relevant feature localization.
[83] Meta-SurDiff: Classification Diffusion Model Optimized by Meta Learning is Reliable for Online Surgical Phase Recognition
Yufei Li,Jirui Wu,Long Tian,Liming Wang,Xiaonan Liu,Zijun Liu,Xiyang Liu
Main category: cs.CV
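TL;DR: Online surgical phase recognition has attracted wide attention because its potential downstream applications are closely tied to human life and health. This paper proposes a meta-learning-optimized classification diffusion model (Meta-SurDiff) that exploits the strengths of deep generative models and meta-learning to achieve reliable online surgical phase recognition. Extensive experiments on five widely used datasets validate its effectiveness.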
Details
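Motivation: Although deep models have made significant progress in capturing the discriminative long-term dependencies of surgical videos, they rarely explore and model the uncertainty in those videos, which is crucial for reliable online surgical phase recognition. Method: The sources of uncertainty are divided into two types: frame ambiguity in videos and unbalanced distribution among surgical phases. A meta-learning-optimized classification diffusion model (Meta-SurDiff) is proposed that leverages deep generative models and meta-learning for precise frame-level distribution estimation. For coarse recognition caused by ambiguous video frames, a classification diffusion model assesses the confidence of the recognition results; for coarse recognition caused by unbalanced phase distribution, the diffusion model is learned with a meta-learning-based objective, which makes the classification boundaries of different surgical phases more robust. Result: Extensive experiments on five widely used datasets (Cholec80, AutoLaparo, M2Cai16, OphNet, and NurViD) validate the effectiveness of Meta-SurDiff for online surgical phase recognition. The datasets cover laparoscopic surgery, ophthalmic surgery, and daily care scenarios. Conclusion: Meta-SurDiff handles the uncertainty in surgical videos well and improves the reliability of online surgical phase recognition. Abstract: Online surgical phase recognition has recently drawn great attention due to its potential downstream applications closely related to human life and health. Although deep models have made significant advances in capturing the discriminative long-term dependency of surgical videos to achieve improved recognition, they rarely account for exploring and modeling the uncertainty in surgical videos, which is crucial for reliable online surgical phase recognition. We categorize the sources of uncertainty into two types, frame ambiguity in videos and unbalanced distribution among surgical phases, which are inevitable in surgical videos. To address this pivotal issue, we introduce a meta-learning-optimized classification diffusion model (Meta-SurDiff) to take full advantage of the deep generative model and meta-learning in achieving precise frame-level distribution estimation for reliable online surgical phase recognition. For coarse recognition caused by ambiguous video frames, we employ a classification diffusion model to assess the confidence of recognition results at a finer-grained frame-level instance.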
For coarse recognition caused by unbalanced phase distribution, we use a meta-learning based objective to learn the diffusion model, thus enhancing the robustness of classification boundaries for different surgical phases. We establish the effectiveness of Meta-SurDiff in online surgical phase recognition through extensive experiments on five widely used datasets using more than four practical metrics. The datasets include Cholec80, AutoLaparo, M2Cai16, OphNet, and NurViD, where OphNet comes from ophthalmic surgeries, NurViD is the daily care dataset, while the others come from laparoscopic surgeries. We will release the code upon acceptance.
[84] Egocentric Human-Object Interaction Detection: A New Benchmark and Method
Kunyuan Deng,Yi Wang,Lap-Pui Chau
Main category: cs.CV
TL;DR: This paper introduces Ego-HOIBench, a new dataset for egocentric human-object interaction (Ego-HOI) detection with over 27K images annotated. It explores adapting third-person HOI detection methods to this dataset and proposes the Hand Geometry and Interactivity Refinement (HGIR) scheme as a new baseline method which leverages hand pose and geometry to improve interaction representation and detection accuracy.
Details
Motivation: Existing human-object interaction (HOI) detection methods focus primarily on third-person perspectives; this paper addresses that gap with a dataset and a method built on the more intuitive egocentric view. Method: The method involves creating the Ego-HOIBench dataset with egocentric images and high-quality annotations. The authors also propose the HGIR scheme, which uses hand pose and geometric information, extracting global hand geometric features and refining interaction-specific features through pose-interaction attention to enhance interaction representation. Result: The results indicate that the HGIR scheme is lightweight and effective, leading to state-of-the-art results on the Ego-HOIBench dataset when applied to HOI baselines in a plug-and-play manner. Conclusion: The conclusion highlights the successful introduction of the Ego-HOIBench dataset and the HGIR scheme as a new baseline for Ego-HOI detection, demonstrating significant improvements in capability and robustness. Abstract: Understanding the interaction between humans and objects has gained much attention in recent years. Existing human-object interaction (HOI) detection methods mainly focus on the third-person perspective, overlooking the more intuitive egocentric view of HOI, namely Ego-HOI. This paper introduces Ego-HOIBench, a new dataset to promote the benchmarking and development of Ego-HOI detection. Ego-HOIBench comprises more than 27K egocentric images with high-quality hand-verb-object triplet annotations across 123 fine-grained interaction categories and locations, covering a rich diversity of scenarios, object types, and hand configurations in daily activities. In addition, we explore and adapt third-person HOI detection methods to Ego-HOIBench and illustrate the challenges of hand-occluded objects and the complexity of single- and two-hand interactions.
To build a new baseline, we propose a Hand Geometry and Interactivity Refinement (HGIR) scheme, which leverages hand pose and geometric information as valuable cues for interpreting interactions. Specifically, the HGIR scheme explicitly extracts global hand geometric features from the estimated hand pose proposals and refines the interaction-specific features using pose-interaction attention. This scheme enables the model to obtain a robust and powerful interaction representation, significantly improving the Ego-HOI detection capability. Our approach is lightweight and effective, and it can be easily applied to HOI baselines in a plug-and-play manner to achieve state-of-the-art results on Ego-HOIBench. Our project is available at: https://dengkunyuan.github.io/EgoHOIBench/
[85] HRGS: Hierarchical Gaussian Splatting for Memory-Efficient High-Resolution 3D Reconstruction
Changbai Li,Haodong Zhu,Hanlin Chen,Juan Zhang,Tongfei Chen,Shuo Yang,Shuwei Shao,Wenhao Dong,Baochang Zhang
Main category: cs.CV
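TL;DR: HRGS proposes a hierarchical Gaussian splatting technique that uses a coarse global representation, block-wise optimization, and importance-driven Gaussian pruning to achieve high-quality, high-resolution 3D scene reconstruction under memory constraints.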
Details
Motivation: Existing 3D Gaussian Splatting (3DGS) techniques suffer from memory scalability issues in high-resolution scenarios, which limits their range of application. Method: The Hierarchical Gaussian Splatting (HRGS) framework is proposed, which: 1) generates a coarse global Gaussian representation; 2) partitions the scene into multiple blocks and refines each block with high-resolution data; 3) uses two strategies, Gaussian partitioning and training data partitioning, for task distribution and data management; 4) introduces Importance-Driven Gaussian Pruning (IDGP), which computes importance scores and removes Gaussians with little contribution; 5) incorporates normal priors from a pretrained model to improve surface reconstruction quality. Result: Extensive experiments on three benchmark datasets show that HRGS achieves state-of-the-art performance in high-resolution novel view synthesis (NVS) and surface reconstruction. Conclusion: HRGS enables high-quality, high-resolution 3D scene reconstruction under memory constraints and offers a new solution for related fields. Abstract: 3D Gaussian Splatting (3DGS) has made significant strides in real-time 3D scene reconstruction, but faces memory scalability issues in high-resolution scenarios. To address this, we propose Hierarchical Gaussian Splatting (HRGS), a memory-efficient framework with hierarchical block-level optimization. First, we generate a global, coarse Gaussian representation from low-resolution data. Then, we partition the scene into multiple blocks, refining each block with high-resolution data. The partitioning involves two steps: Gaussian partitioning, where irregular scenes are normalized into a bounded cubic space with a uniform grid for task distribution, and training data partitioning, where only relevant observations are retained for each block. By guiding block refinement with the coarse Gaussian prior, we ensure seamless Gaussian fusion across adjacent blocks. To reduce computational demands, we introduce Importance-Driven Gaussian Pruning (IDGP), which computes importance scores for each Gaussian and removes those with minimal contribution, speeding up convergence and reducing memory usage. Additionally, we incorporate normal priors from a pretrained model to enhance surface reconstruction quality. Our method enables high-quality, high-resolution 3D scene reconstruction even under memory constraints.
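As a rough illustration of the Gaussian partitioning and pruning steps described in this abstract, the following NumPy sketch normalizes Gaussian centers into a unit cube, assigns each one to a cell of a uniform grid, and keeps only the highest-scoring Gaussians. It is a minimal sketch under stated assumptions, not the HRGS implementation: the abstract does not say how importance scores are computed, so `scores` is taken as a precomputed input, and the function names, the grid size and `keep_ratio` are all hypothetical.

```python
import numpy as np

def assign_blocks(centers, grid=(2, 2, 2)):
    """Normalize centers into a unit cube and return one flat block id per Gaussian."""
    centers = np.asarray(centers, dtype=float)
    lo = centers.min(axis=0)
    extent = np.maximum(centers.max(axis=0) - lo, 1e-12)  # guard against zero extent
    unit = (centers - lo) / extent                         # coordinates now in [0, 1]
    dims = np.asarray(grid)
    # Points on the upper face would index one past the last cell, so clamp them.
    cell = np.minimum((unit * dims).astype(int), dims - 1)
    return np.ravel_multi_index(cell.T, tuple(int(d) for d in dims))

def prune_by_importance(scores, keep_ratio=0.8):
    """Return a boolean mask keeping the top `keep_ratio` fraction by score.

    `scores` is assumed to be given; how HRGS derives it is not specified here.
    """
    scores = np.asarray(scores, dtype=float)
    n_keep = max(1, int(round(keep_ratio * scores.size)))
    order = np.argsort(-scores, kind="stable")  # highest score first
    mask = np.zeros(scores.size, dtype=bool)
    mask[order[:n_keep]] = True
    return mask
```

With block ids in hand, each block could be refined independently on its own subset of Gaussians, which is what keeps peak memory bounded in a scheme of this kind.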
Extensive experiments on three benchmarks show that HRGS achieves state-of-the-art performance in high-resolution novel view synthesis (NVS) and surface reconstruction tasks.
[86] Unified Representation Space for 3D Visual Grounding
Yinuo Zheng,Lipeng Gu,Honghua Chen,Liangliang Nan,Mingqiang Wei
Main category: cs.CV
TL;DR: The paper introduces UniSpace-3D, a method that creates a unified representation space for 3D visual grounding (3DVG), bridging the gap between visual and textual features through innovative designs. It outperforms baselines by at least 2.24% on relevant datasets.
Details
Motivation: Existing methods for 3D visual grounding separately pre-train vision and text encoders, causing errors in object positioning and classification due to discrepancies between spatial geometry and semantic categories. Method: UniSpace-3D incorporates three key components: a unified representation encoder using the pre-trained CLIP model to map features into a shared space, a multi-modal contrastive learning module to further reduce modality gaps, and a language-guided query selection module for identifying object candidates based on descriptions. Result: Extensive experiments show that UniSpace-3D surpasses baseline models by at least 2.24% on the ScanRefer and Nr3D/Sr3D datasets. Conclusion: UniSpace-3D effectively bridges the gap between visual and textual features for 3D visual grounding, leading to improved performance in object identification and classification. Abstract: 3D visual grounding (3DVG) is a critical task in scene understanding that aims to identify objects in 3D scenes based on text descriptions. However, existing methods rely on separately pre-trained vision and text encoders, resulting in a significant gap between the two modalities in terms of spatial geometry and semantic categories. This discrepancy often causes errors in object positioning and classification. The paper proposes UniSpace-3D, which innovatively introduces a unified representation space for 3DVG, effectively bridging the gap between visual and textual features. 
Specifically, UniSpace-3D incorporates three innovative designs: i) a unified representation encoder that leverages the pre-trained CLIP model to map visual and textual features into a unified representation space, effectively bridging the gap between the two modalities; ii) a multi-modal contrastive learning module that further reduces the modality gap; iii) a language-guided query selection module that utilizes the positional and semantic information to identify object candidate points aligned with textual descriptions. Extensive experiments demonstrate that UniSpace-3D outperforms baseline models by at least 2.24% on the ScanRefer and Nr3D/Sr3D datasets. The code will be made available upon acceptance of the paper.
[87] Cross-Modal Geometric Hierarchy Fusion: An Implicit-Submap Driven Framework for Resilient 3D Place Recognition
Xiaohui Jiang,Haijiang Zhu,Chadei Li,Fulin Tang,Ning An
Main category: cs.CV
TL;DR: LiDAR-based place recognition is improved by a novel framework using density-agnostic geometric reasoning, achieving state-of-the-art performance.
Details
Motivation: Existing LiDAR-based place recognition methods suffer from descriptor instability due to inconsistent point cloud density and representation fragility in structurally complex scenarios. Method: A framework that uses an implicit 3D representation based on elastic points to overcome issues with point cloud density. Occupancy grid and normal vector information are derived from this representation, and descriptors are created that fuse geometric information from both bird's-eye view and 3D segment perspectives. Result: The method achieves state-of-the-art performance across various datasets (KITTI, KITTI-360, MulRan, NCLT) and environments. It balances accuracy, runtime, and memory optimization for historical maps effectively. Conclusion: This novel approach improves LiDAR-based place recognition significantly and will be open-sourced in the future. Abstract: LiDAR-based place recognition serves as a crucial enabler for long-term autonomy in robotics and autonomous driving systems. Yet, prevailing methodologies relying on handcrafted feature extraction face dual challenges: (1) Inconsistent point cloud density, induced by ego-motion dynamics and environmental disturbances during repeated traversals, leads to descriptor instability, and (2) Representation fragility stems from reliance on single-level geometric abstractions that lack discriminative power in structurally complex scenarios. To address these limitations, we propose a novel framework that redefines 3D place recognition through density-agnostic geometric reasoning. Specifically, we introduce an implicit 3D representation based on elastic points, which is immune to the interference of original scene point cloud density and achieves the characteristic of uniform distribution. Subsequently, we derive the occupancy grid and normal vector information of the scene from this implicit representation. 
Finally, with the aid of these two types of information, we obtain descriptors that fuse geometric information from both bird's-eye view (capturing macro-level spatial layouts) and 3D segment (encoding micro-scale surface geometries) perspectives. We conducted extensive experiments on numerous datasets (KITTI, KITTI-360, MulRan, NCLT) across diverse environments. The experimental results demonstrate that our method achieves state-of-the-art performance. Moreover, our approach strikes an optimal balance between accuracy, runtime, and memory optimization for historical maps, showcasing excellent resilience and scalability. Our code will be open-sourced in the future.
[88] synth-dacl: Does Synthetic Defect Data Enhance Segmentation Accuracy and Robustness for Real-World Bridge Inspections?
Johannes Flotzinger,Fabian Deuser,Achref Jaziri,Heiko Neumann,Norbert Oswald,Visvanathan Ramesh,Thomas Braml
Main category: cs.CV
TL;DR: This paper presents synth-dacl, a collection of three synthetic dataset extensions designed to address class imbalance in the dacl10k bridge inspection dataset, leading to improved model performance for crack and cavity segmentation.
Details
Motivation: Adequate bridge inspection is crucial but challenging due to increasing numbers of deteriorating bridges and a lack of staff and financial resources. Automating visual inspections can improve efficiency, accuracy, and safety, but current models struggle with real-world conditions like varied image quality and textures. Method: The authors introduce synth-dacl, which consists of three synthetic dataset extensions created using synthetic concrete textures. These extensions aim to balance the class distribution in the existing dacl10k dataset and enhance model performance, particularly for fine-grained classes such as cracks and cavities. Result: Incorporating synth-dacl leads to significant improvements in model robustness across 15 perturbed test sets. Specifically, models trained on dacl10k combined with all synth-dacl extensions show a 2% increase in mean IoU, F1 score, Recall, and Precision compared to models trained solely on dacl10k. Conclusion: The synth-dacl extensions effectively address class imbalance in the dacl10k dataset, resulting in enhanced model performance for crack and cavity segmentation under various real-world conditions. Abstract: Adequate bridge inspection is increasingly challenging in many countries due to a growing stock of ailing bridges, compounded by a lack of staff and financial resources. Automating the key task of visual bridge inspection, the pixel-level classification of defects and building components, improves efficiency, increases accuracy and enhances safety in the inspection process and the resulting building assessment. Models taking over this task must cope with an assortment of real-world conditions. They must be robust to variations in image quality and in background texture, as defects often appear on surfaces of diverse texture and degree of weathering. dacl10k is the largest and most diverse dataset for real-world concrete bridge inspections.
However, the dataset exhibits class imbalance, which leads to notably poor model performance particularly when segmenting fine-grained classes such as cracks and cavities. This work introduces "synth-dacl", a compilation of three novel dataset extensions based on synthetic concrete textures. These extensions are designed to balance class distribution in dacl10k and enhance model performance, especially for crack and cavity segmentation. When incorporating the synth-dacl extensions, we observe substantial improvements in model robustness across 15 perturbed test sets. Notably, on the perturbed test set, a model trained on dacl10k combined with all synthetic extensions achieves a 2% increase in mean IoU, F1 score, Recall, and Precision compared to the same model trained solely on dacl10k.
[89] Comparison of Two Methods for Stationary Incident Detection Based on Background Image
Deepak Ghimire,Joonwhoan Lee
Main category: cs.CV
TL;DR: This paper proposes two background subtraction-based schemes for detecting temporarily stationary objects, comparing them in terms of performance and complexity, and uses NCC-based image comparison for tracking detected objects with robustness to occlusion and illumination changes.
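The dual-background idea and the NCC comparison mentioned in this entry can be sketched in a few lines of NumPy. This is a minimal sketch and not the authors' code: it assumes running-average background models, and the learning rates, the threshold, the toy 8x8 scene and all function names are hypothetical. The intuition is that a stopped object is absorbed quickly by the fast-learning background but only slowly by the slow-learning one, so a large difference between the two marks temporarily stationary pixels.

```python
import numpy as np

def update_background(background, frame, rate):
    """Running-average background update; `rate` is the learning rate in (0, 1]."""
    return (1.0 - rate) * background + rate * frame

def stationary_mask(bg_slow, bg_fast, threshold=25.0):
    """Pixels where the fast background has absorbed something the slow one has not."""
    return np.abs(bg_fast - bg_slow) > threshold

def ncc(patch_a, patch_b):
    """Normalized cross correlation of two equally sized patches, in [-1, 1]."""
    a = patch_a - patch_a.mean()
    b = patch_b - patch_b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    # A constant patch has zero variance; report no correlation in that case.
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def simulate(n_frames=50, slow_rate=0.001, fast_rate=0.2):
    """Toy scene: empty backgrounds, then a bright 2x2 object that stops moving."""
    bg_slow = np.zeros((8, 8))
    bg_fast = np.zeros((8, 8))
    frame = np.zeros((8, 8))
    frame[2:4, 2:4] = 200.0
    for _ in range(n_frames):
        bg_slow = update_background(bg_slow, frame, slow_rate)
        bg_fast = update_background(bg_fast, frame, fast_rate)
    return stationary_mask(bg_slow, bg_fast)
```

Because NCC subtracts the mean and divides by the patch energy, a uniform gain or offset in brightness leaves the score unchanged, which is one plausible reason an NCC check tolerates illumination changes when re-verifying a detected object.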
Details
Motivation: To address the detection of temporarily stationary objects in visual tracking applications, which typically use background subtraction-based methods for moving object detection. Method: Two schemes are proposed: one using a single background and another using dual backgrounds generated with different learning rates. Detected stationary objects are tracked using normalized cross correlation (NCC) based image comparison. Result: The proposed method is robust against partial occlusion, short-time full occlusion, and illumination changes, and it can operate in real time. Conclusion: The dual-background scheme offers better detection performance at increased computational cost compared to the single-background scheme. Abstract: In general, background subtraction-based methods are used to detect moving objects in visual tracking applications. In this paper, we employ a background subtraction-based scheme to detect temporarily stationary objects. We propose two schemes for stationary object detection and compare them in terms of detection performance and computational complexity. In the first approach, we use a single background; in the second, we use dual backgrounds, generated with different learning rates, to detect temporarily stopped objects. Finally, we use normalized cross correlation (NCC) based image comparison to monitor and track the detected stationary object in a video scene. The proposed method is robust to partial occlusion, short-time full occlusion, and illumination changes, and it can operate in real time.
[90] Exploring Non-contrastive Self-supervised Representation Learning for Image-based Profiling
Siran Dai,Qianqian Xu,Peisong Wen,Yang Liu,Qingming Huang
Main category: cs.CV
TL;DR: The paper introduces SSLProfiler, a non-contrastive self-supervised learning framework for creating informative representations of cell images which won the Cell Line Transferability challenge at CVPR 2025.
Details
Motivation: Image-based cell profiling is crucial in drug discovery and can benefit from improvements in computer vision techniques. Existing self-supervised learning methods face challenges when applied to cell images due to differences between cell and natural image distributions, as well as the need to process multiple input images effectively. Method: The authors propose SSLProfiler, a non-contrastive self-supervised learning framework designed specifically for cell profiling. It includes specialized data augmentation and representation post-processing methods tailored to cell images to overcome the challenges of distribution differences and multi-image processing. Result: SSLProfiler successfully addresses the challenges in applying self-supervised learning to cell images and wins the Cell Line Transferability challenge at CVPR 2025. Conclusion: SSLProfiler demonstrates the potential of non-contrastive self-supervised learning in creating robust feature extractors for cell images, advancing the field of image-based cell profiling. Abstract: Image-based cell profiling aims to create informative representations of cell images. This technique is critical in drug discovery and has greatly advanced with recent improvements in computer vision. Inspired by recent developments in non-contrastive Self-Supervised Learning (SSL), this paper provides an initial exploration into training a generalizable feature extractor for cell images using such methods. However, there are two major challenges: 1) There is a large difference between the distributions of cell images and natural images, causing the view-generation process in existing SSL methods to fail; and 2) Unlike typical scenarios where each representation is based on a single image, cell profiling often involves multiple input images, making it difficult to effectively combine all available information. 
To overcome these challenges, we propose SSLProfiler, a non-contrastive SSL framework specifically designed for cell profiling. We introduce specialized data augmentation and representation post-processing methods tailored to cell images, which effectively address the issues mentioned above and result in a robust feature extractor. With these improvements, SSLProfiler won the Cell Line Transferability challenge at CVPR 2025.
[91] Leader360V: The Large-scale, Real-world 360 Video Dataset for Multi-task Learning in Diverse Environment
Weiming Zhang,Dingwen Xiao,Aobotao Dai,Yexin Liu,Tianbo Pan,Shiqi Wen,Lei Chen,Lin Wang
Main category: cs.CV
TL;DR: This paper introduces Leader360V, the first large-scale labeled real-world 360 video dataset for instance segmentation and tracking. It includes diverse scenes and uses an automatic labeling pipeline with three novel stages to address annotation challenges.
Details
Motivation: To overcome the lack of large-scale, labeled real-world 360 video datasets due to inherent spherical properties that make annotation costly and complex. Method: Design an automatic labeling pipeline with three stages: Initial Annotation Phase using a Semantic- and Distortion-aware Refinement module, Auto-Refine Annotation Phase correcting missing regions, and Manual Revision Phase incorporating LLMs and human annotators. Result: Extensive user studies and evaluations demonstrate the effectiveness of the labeling pipeline. Experiments confirm Leader360V significantly enhances model performance for 360 video segmentation and tracking. Conclusion: Leader360V paves the way for more scalable 360 scene understanding by enhancing model performance in 360 video segmentation and tracking. Abstract: 360 video captures the complete surrounding scene with an ultra-large field of view of 360°×180°. This makes 360 scene understanding tasks, e.g., segmentation and tracking, crucial for applications such as autonomous driving and robotics. With the recent emergence of foundation models, the community is, however, impeded by the lack of large-scale, labeled real-world datasets. This is caused by the inherent spherical properties, e.g., severe distortion in polar regions and content discontinuities, which render annotation costly and complex. This paper introduces Leader360V, the first large-scale, labeled real-world 360 video dataset for instance segmentation and tracking. Our dataset enjoys high scene diversity, ranging from indoor and urban settings to natural and dynamic outdoor scenes. To automate annotation, we design an automatic labeling pipeline, which subtly coordinates pre-trained 2D segmentors and large language models to facilitate the labeling. The pipeline operates in three novel stages.
Specifically, in the Initial Annotation Phase, we introduce a Semantic- and Distortion-aware Refinement (SDR) module, which combines object mask proposals from multiple 2D segmentors with LLM-verified semantic labels. These are then converted into mask prompts to guide SAM2 in generating distortion-aware masks for subsequent frames. In the Auto-Refine Annotation Phase, missing or incomplete regions are corrected either by applying the SDR again or resolving the discontinuities near the horizontal borders. The Manual Revision Phase finally incorporates LLMs and human annotators to further refine and validate the annotations. Extensive user studies and evaluations demonstrate the effectiveness of our labeling pipeline. Meanwhile, experiments confirm that Leader360V significantly enhances model performance for 360 video segmentation and tracking, paving the way for more scalable 360 scene understanding.
[92] FRIDU: Functional Map Refinement with Guided Image Diffusion
Avigail Cohen Rimon,Mirela Ben-Chen,Or Litany
Main category: cs.CV
TL;DR: The paper proposes a new method for refining correspondence maps between shapes using an image diffusion model in the functional map space, showing competitiveness with state-of-the-art methods and highlighting guided diffusion models as a promising approach.
Details
Motivation: To improve the accuracy of correspondence maps between two shapes by treating the functional map as a 2D image and leveraging image diffusion models for refinement. Method: Train an image diffusion model in the functional map space to generate accurate maps from inaccurate initial ones, using pointwise maps as guidance during inference to encourage objectives like orthogonality and commutativity. Result: The approach is competitive with current state-of-the-art methods for map refinement, demonstrating the potential of guided diffusion models in functional map processing. Conclusion: Guided diffusion models provide a promising direction for refining functional maps, offering efficient training and effective guidance mechanisms. Abstract: We propose a novel approach for refining a given correspondence map between two shapes. A correspondence map represented as a functional map, namely a change of basis matrix, can be additionally treated as a 2D image. With this perspective, we train an image diffusion model directly in the space of functional maps, enabling it to generate accurate maps conditioned on an inaccurate initial map. The training is done purely in the functional space, and thus is highly efficient. At inference time, we use the pointwise map corresponding to the current functional map as guidance during the diffusion process. The guidance can additionally encourage different functional map objectives, such as orthogonality and commutativity with the Laplace-Beltrami operator. We show that our approach is competitive with state-of-the-art methods of map refinement and that guided diffusion models provide a promising pathway to functional map processing.
[93] FGA-NN: Film Grain Analysis Neural Network
Zoubida Ameur,Frédéric Lefebvre,Philippe De Lagrange,Miloš Radosavljević
Main category: cs.CV
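TL;DR: FGA-NN is a new learning-based film grain analysis method that preserves artistic intent when video is compressed by analyzing and modeling film grain before encoding and synthesizing it after decoding. Experiments show that FGA-NN strikes a superior balance between analysis accuracy and synthesis complexity while remaining robust and broadly applicable.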
Details
Motivation: When cinematic content is compressed at medium to low bitrates, film grain is often lost because of its random nature. To preserve artistic intent while compressing efficiently, film grain needs to be analyzed and modeled before encoding and synthesized after decoding. Method: FGA-NN is introduced as the first learning-based film grain analysis method; it estimates conventional film grain parameters compatible with conventional synthesis. Result: Quantitative and qualitative results show that FGA-NN offers a superior balance between analysis accuracy and synthesis complexity, and also demonstrate its robustness and applicability. Conclusion: FGA-NN is an important advance in film grain analysis; it better preserves the film grain in compressed video and thus the artistic intent. Abstract: Film grain, once a by-product of analog film, is now present in most cinematographic content for aesthetic reasons. However, when such content is compressed at medium to low bitrates, film grain is lost due to its random nature. To preserve artistic intent while compressing efficiently, film grain is analyzed and modeled before encoding and synthesized after decoding. This paper introduces FGA-NN, the first learning-based film grain analysis method to estimate conventional film grain parameters compatible with conventional synthesis. Quantitative and qualitative results demonstrate FGA-NN's superior balance between analysis accuracy and synthesis complexity, along with its robustness and applicability.
[94] EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization
Xiaoqi Wang,Yi Wang,Lap-Pui Chau
Main category: cs.CV
TL;DR: EVA02-AT is a set of video-language foundation models based on EVA02 for egocentric video understanding tasks. It efficiently transfers an image-based CLIP model into a video encoder, introduces spatial-temporal rotary positional embeddings and joint attention, and proposes the SMS loss for multi-instance retrieval tasks. Extensive experiments show state-of-the-art performance with fewer parameters.
Details
Motivation: Existing methods for egocentric video-language understanding face challenges such as high pre-training cost, ineffective spatial-temporal encoding, and imprecise learning objectives in multi-instance retrieval. Method: 1) Single-stage pretraining to transfer an image-based CLIP model into a unified video encoder; 2) Spatial-temporal rotary positional embeddings along with joint attention for effective feature encoding; 3) Symmetric Multi-Similarity (SMS) loss for precise learning objectives in multi-instance retrieval. Result: EVA02-AT achieves state-of-the-art performance across diverse egocentric video-language tasks with fewer parameters. Models with SMS loss also show significant performance gains on multi-instance retrieval benchmarks. Conclusion: EVA02-AT addresses the key challenges in egocentric video-language understanding and outperforms existing methods with fewer parameters. The code and models are publicly available. Abstract: Egocentric video-language understanding demands both high efficiency and accurate spatial-temporal modeling. Existing approaches face three key challenges: 1) Excessive pre-training cost arising from multi-stage pre-training pipelines, 2) Ineffective spatial-temporal encoding due to manually split 3D rotary positional embeddings that hinder feature interactions, and 3) Imprecise learning objectives in soft-label multi-instance retrieval, which neglect negative pair correlations. In this paper, we introduce EVA02-AT, a suite of EVA02-based video-language foundation models tailored to egocentric video understanding tasks. EVA02-AT first efficiently transfers an image-based CLIP model into a unified video encoder via a single-stage pretraining. Second, instead of applying rotary positional embeddings to isolated dimensions, we introduce spatial-temporal rotary positional embeddings along with joint attention, which can effectively encode both spatial and temporal information on the entire hidden dimension. 
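The contrast drawn above between rotary embeddings applied to isolated dimensions and embeddings applied over the entire hidden dimension can be sketched as follows. This is a minimal illustration, not the EVA02-AT formulation: `split_rope` follows the common practice of giving each axis (t, h, w) its own chunk of channels, while `joint_rope` is a hypothetical variant in which every channel pair is rotated by the sum of per-axis angles. The function names, the per-axis bases, and the summation rule are all assumptions made for this example.

```python
import numpy as np

def rope_angles(pos, n_pairs, base=10000.0):
    """Rotation angles pos * base**(-i / n_pairs) for each channel pair i."""
    freqs = base ** (-np.arange(n_pairs) / n_pairs)
    return pos * freqs

def rotate(x, angles):
    """Rotate consecutive channel pairs (x[2i], x[2i+1]) by angles[i]."""
    x1, x2 = x[0::2], x[1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

def split_rope(x, t, h, w):
    """Baseline: the hidden dimension is cut into three chunks, one per axis.
    Assumes len(x) is divisible by 6."""
    d = x.shape[0] // 3
    parts = [rotate(x[i * d:(i + 1) * d], rope_angles(p, d // 2))
             for i, p in enumerate((t, h, w))]
    return np.concatenate(parts)

def joint_rope(x, t, h, w, bases=(10000.0, 100.0, 50.0)):
    """Hypothetical joint variant: every channel pair sees all three axes.
    The per-axis bases are arbitrary illustrative values."""
    n = x.shape[0] // 2
    angles = sum(rope_angles(p, n, b) for p, b in zip((t, h, w), bases))
    return rotate(x, angles)
```

Both variants keep the defining property of rotary embeddings: the dot product between a rotated query and a rotated key depends only on the relative offset between their positions. In the split version each channel carries information about one axis only, whereas in the joint version every channel mixes all three.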
This joint encoding of spatial-temporal features enables the model to learn cross-axis relationships, which are crucial for accurately modeling motion and interaction in videos. Third, focusing on multi-instance video-language retrieval tasks, we introduce the Symmetric Multi-Similarity (SMS) loss and a novel training framework that advances all soft labels for both positive and negative pairs, providing a more precise learning objective. Extensive experiments on Ego4D, EPIC-Kitchens-100, and Charades-Ego under zero-shot and fine-tuning settings demonstrate that EVA02-AT achieves state-of-the-art performance across diverse egocentric video-language tasks with fewer parameters. Models with our SMS loss also show significant performance gains on multi-instance retrieval benchmarks. Our code and models are publicly available at https://github.com/xqwang14/EVA02-AT .
[95] HydroChronos: Forecasting Decades of Surface Water Change
Daniele Rege Cambrin,Eleonora Poeta,Eliana Pastor,Isaac Corley,Tania Cerquitelli,Elena Baralis,Paolo Garza
Main category: cs.CV
TL;DR: The paper presents HydroChronos, a large-scale dataset for surface water dynamics forecasting, and AquaClimaTempo UNet, a new model that outperforms baselines.
Details
Motivation: The motivation of this paper is to address the lack of comprehensive datasets and standardized benchmarks in the field of surface water dynamics forecasting, which is crucial for water resource management and climate change adaptation. Method: The method involves introducing HydroChronos, a multi-modal spatiotemporal dataset covering over three decades of data from Landsat 5 and Sentinel-2 imagery, climate data, and Digital Elevation Models. The authors also propose AquaClimaTempo UNet, a novel spatiotemporal architecture with a dedicated climate data branch, to serve as a benchmark baseline. Result: AquaClimaTempo UNet significantly outperforms a Persistence baseline by +14% and +11% F1 across change detection and direction of change classification tasks, and by +0.1 MAE on the magnitude of change regression. Conclusion: The paper concludes by conducting an Explainable AI analysis to identify key climate variables and input channels influencing surface water change, offering insights for future modeling efforts. Abstract: Forecasting surface water dynamics is crucial for water resource management and climate change adaptation. However, the field lacks comprehensive datasets and standardized benchmarks. In this paper, we introduce HydroChronos, a large-scale, multi-modal spatiotemporal dataset for surface water dynamics forecasting designed to address this gap. We couple the dataset with three forecasting tasks. The dataset includes over three decades of aligned Landsat 5 and Sentinel-2 imagery, climate data, and Digital Elevation Models for diverse lakes and rivers across Europe, North America, and South America. We also propose AquaClimaTempo UNet, a novel spatiotemporal architecture with a dedicated climate data branch, as a strong benchmark baseline. 
Our model significantly outperforms a Persistence baseline for forecasting future water dynamics by +14% and +11% F1 across change detection and direction of change classification tasks, and by +0.1 MAE on the magnitude of change regression. Finally, we conduct an Explainable AI analysis to identify the key climate variables and input channels that influence surface water change, providing insights to inform and guide future modeling efforts.
[96] DGG-XNet: A Hybrid Deep Learning Framework for Multi-Class Brain Disease Classification with Explainable AI
Sumshun Nahar Eity,Mahin Montasir Afif,Tanisha Fairooz,Md. Mortuza Ahmmed,Md Saef Ullah Miah
Main category: cs.CV
TL;DR: DGG-XNet is a hybrid deep learning model that combines VGG16 and DenseNet121 for improved feature extraction and classification of neurological conditions, achieving high accuracy and interpretability in diagnosing brain disorders.
Details
Motivation: Accurate diagnosis of brain disorders such as Alzheimer's disease and brain tumors remains a critical challenge in medical imaging. Conventional methods based on manual MRI analysis are often inefficient and error-prone. Method: The proposed method, DGG-XNet, integrates VGG16 and DenseNet121 to enhance feature extraction and classification. DenseNet121 promotes feature reuse and efficient gradient flow through dense connectivity while VGG16 contributes strong hierarchical spatial representations. Grad-CAM is applied to visualize salient regions, enhancing model transparency. Result: Trained on a combined dataset from BraTS 2021 and Kaggle, DGG-XNet achieved a test accuracy of 91.33%, with precision, recall, and F1-score all exceeding 91%. Conclusion: These results highlight DGG-XNet's potential as an effective and interpretable tool for computer-aided diagnosis (CAD) of neurodegenerative and oncological brain disorders. Abstract: Accurate diagnosis of brain disorders such as Alzheimer's disease and brain tumors remains a critical challenge in medical imaging. Conventional methods based on manual MRI analysis are often inefficient and error-prone. To address this, we propose DGG-XNet, a hybrid deep learning model integrating VGG16 and DenseNet121 to enhance feature extraction and classification. DenseNet121 promotes feature reuse and efficient gradient flow through dense connectivity, while VGG16 contributes strong hierarchical spatial representations. Their fusion enables robust multiclass classification of neurological conditions. Grad-CAM is applied to visualize salient regions, enhancing model transparency. Trained on a combined dataset from BraTS 2021 and Kaggle, DGG-XNet achieved a test accuracy of 91.33%, with precision, recall, and F1-score all exceeding 91%. 
These results highlight DGG-XNet's potential as an effective and interpretable tool for computer-aided diagnosis (CAD) of neurodegenerative and oncological brain disorders.
[97] Discrete JEPA: Learning Discrete Token Representations without Reconstruction
Junyeob Baek,Hosung Lee,Christopher Hoang,Mengye Ren,Sungjin Ahn
Main category: cs.CV
TL;DR: Discrete-JEPA is proposed to enhance symbolic abstraction and logical reasoning in image tokenization by extending latent predictive coding with semantic tokenization.
Details
Motivation: Current image tokenization methods struggle with symbolic abstraction and logical reasoning, which are crucial for systematic inference. Method: The Discrete-JEPA model extends the latent predictive coding framework, incorporating semantic tokenization and complementary objectives to improve tokenization for symbolic reasoning tasks. Result: Discrete-JEPA significantly outperforms baselines on visual symbolic prediction tasks, revealing deliberate systematic patterns in the semantic token space. Conclusion: As a preliminary model, Discrete-JEPA shows promise in advancing symbolic world modeling and planning capabilities in AI systems. Abstract: The cornerstone of cognitive intelligence lies in extracting hidden patterns from observations and leveraging these principles to systematically predict future outcomes. However, current image tokenization methods demonstrate significant limitations in tasks requiring symbolic abstraction and logical reasoning capabilities essential for systematic inference. To address this challenge, we propose Discrete-JEPA, extending the latent predictive coding framework with semantic tokenization and novel complementary objectives to create robust tokenization for symbolic reasoning tasks. Discrete-JEPA dramatically outperforms baselines on visual symbolic prediction tasks, while striking visual evidence reveals the spontaneous emergence of deliberate systematic patterns within the learned semantic token space. Though an initial model, our approach promises a significant impact for advancing symbolic world modeling and planning capabilities in artificial intelligence systems.
[98] DepthSeg: Depth prompting in remote sensing semantic segmentation
Ning Zhou,Shanxiong Chen,Mingting Zhou,Haigang Sui,Lieyun Hu,Han Li,Li Hua,Qiming Zhou
Main category: cs.CV
TL;DR: The paper introduces DepthSeg, a framework for 2D remote sensing semantic segmentation that incorporates depth/height information to improve land cover mapping accuracy.
Details
Motivation: Existing semantic segmentation methods for remote sensing focus on spectral characteristics but ignore elevation differences, leading to misclassification in complex scenarios like shadow occlusion and spectral confusion. Method: DepthSeg framework includes three phases: feature extraction with a lightweight adapter for fine-tuning vision transformer encoder, depth prompting with a depth prompter to model depth/height features, and semantic prediction with a decoder that couples depth prompts with land-cover features. Result: Experiments on the LiuZhou dataset show advantages of DepthSeg in land cover mapping tasks, and ablation studies highlight the significance of depth prompts. Conclusion: Incorporating depth/height information into remote sensing semantic segmentation can mitigate spectral confusion and shadow occlusion, improving land cover classification accuracy. Abstract: Remote sensing semantic segmentation is crucial for extracting detailed land surface information, enabling applications such as environmental monitoring, land use planning, and resource assessment. In recent years, advancements in artificial intelligence have spurred the development of automatic remote sensing semantic segmentation methods. However, the existing semantic segmentation methods focus on distinguishing spectral characteristics of different objects while ignoring the differences in the elevation of the different targets. This results in land cover misclassification in complex scenarios involving shadow occlusion and spectral confusion. In this paper, we introduce a depth prompting two-dimensional (2D) remote sensing semantic segmentation framework (DepthSeg). It automatically models depth/height information from 2D remote sensing images and integrates it into the semantic segmentation framework to mitigate the effects of spectral confusion and shadow occlusion. 
During the feature extraction phase of DepthSeg, we introduce a lightweight adapter to enable cost-effective fine-tuning of the large-parameter vision transformer encoder pre-trained on natural images. In the depth prompting phase, we propose a depth prompter to model depth/height features explicitly. In the semantic prediction phase, we introduce a semantic classification decoder that couples the depth prompts with high-dimensional land-cover features, enabling accurate extraction of land-cover types. Experiments on the LiuZhou dataset validate the advantages of the DepthSeg framework in land cover mapping tasks. Detailed ablation studies further highlight the significance of the depth prompts in remote sensing semantic segmentation.
[99] GrFormer: A Novel Transformer on Grassmann Manifold for Infrared and Visible Image Fusion
Huan Kang,Hui Li,Xiao-Jun Wu,Tianyang Xu,Rui Wang,Chunyang Cheng,Josef Kittler
Main category: cs.CV
TL;DR: This paper proposes GrFormer, a novel attention mechanism based on the Grassmann manifold for infrared and visible image fusion. It constructs a low-rank subspace mapping through projection constraints to achieve multi-scale semantic fusion and develops a cross-modal fusion strategy (CMS) to integrate significant information effectively.
Details
Motivation: The motivation of this paper is the limitation of Euclidean methods in encapsulating the intrinsic topological structure when source images are located in a non-Euclidean space, leading to undesired attention output and decreased fusion performance in infrared and visible image fusion tasks. Method: The method involves constructing a low-rank subspace mapping through projection constraints on the Grassmann manifold, which compresses attention features into subspaces of varying rank levels to decouple high-frequency details and low-frequency semantics. Additionally, it includes developing a cross-modal fusion strategy (CMS) based on a covariance mask. Result: The experimental results demonstrate that the proposed network outperforms state-of-the-art methods both qualitatively and quantitatively on multiple image fusion benchmarks. Conclusion: GrFormer, with its novel attention mechanism and cross-modal fusion strategy, achieves superior performance in infrared and visible image fusion. Abstract: In the field of image fusion, promising progress has been made by modeling data from different modalities as linear subspaces. However, in practice, the source images are often located in a non-Euclidean space, where the Euclidean methods usually cannot encapsulate the intrinsic topological structure. Typically, the inner product performed in the Euclidean space calculates the algebraic similarity rather than the semantic similarity, which results in undesired attention output and a decrease in fusion performance. Moreover, the infrared and visible image fusion task requires balancing low-level details against high-level semantics. To address this issue, in this paper, we propose a novel attention mechanism based on the Grassmann manifold for infrared and visible image fusion (GrFormer). 
Specifically, our method constructs a low-rank subspace mapping through projection constraints on the Grassmann manifold, compressing attention features into subspaces of varying rank levels. This forces the features to decouple into high-frequency details (local low-rank) and low-frequency semantics (global low-rank), thereby achieving multi-scale semantic fusion. Additionally, to effectively integrate the significant information, we develop a cross-modal fusion strategy (CMS) based on a covariance mask to maximise the complementary properties between different modalities and to suppress the features with high correlation, which are deemed redundant. The experimental results demonstrate that our network outperforms SOTA methods both qualitatively and quantitatively on multiple image fusion benchmarks. The code is available at https://github.com/Shaoyun2023.
[100] Decoupled Classifier-Free Guidance for Counterfactual Diffusion Models
Tian Xia,Fabio De Sousa Ribeiro,Rajat R Rasal,Avinash Kori,Raghav Mehta,Ben Glocker
Main category: cs.CV
TL;DR: Diffusion models are used for counterfactual image generation but suffer from attribute amplification. This paper proposes Decoupled Classifier-Free Guidance (DCFG) to improve intervention fidelity and reversibility.
Details
Motivation: Current methods using classifier-free guidance (CFG) in diffusion models apply a single global weight across all conditioning variables, leading to poor identity preservation and unintended attribute changes during counterfactual image generation. Method: The proposed method, Decoupled Classifier-Free Guidance (DCFG), introduces group-wise conditioning control by partitioning attributes into intervened and invariant sets based on a causal graph and applying distinct guidance to each set. Result: Experiments on CelebA-HQ, MIMIC-CXR, and EMBED demonstrate that DCFG improves intervention fidelity, mitigates unintended changes, and enhances reversibility of counterfactual image generation. Conclusion: DCFG is a flexible and model-agnostic framework that enables more faithful and interpretable counterfactual image generation by addressing the limitations of standard CFG. Abstract: Counterfactual image generation aims to simulate realistic visual outcomes under specific causal interventions. Diffusion models have recently emerged as a powerful tool for this task, combining DDIM inversion with conditional generation via classifier-free guidance (CFG). However, standard CFG applies a single global weight across all conditioning variables, which can lead to poor identity preservation and spurious attribute changes - a phenomenon known as attribute amplification. To address this, we propose Decoupled Classifier-Free Guidance (DCFG), a flexible and model-agnostic framework that introduces group-wise conditioning control. DCFG builds on an attribute-split embedding strategy that disentangles semantic inputs, enabling selective guidance on user-defined attribute groups. For counterfactual generation, we partition attributes into intervened and invariant sets based on a causal graph and apply distinct guidance to each. 
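The difference between a single global guidance weight and group-wise guidance can be sketched in a few lines. The abstract does not give the exact DCFG update rule, so the additive form below (one guidance term per attribute group, each with its own weight) is an assumption made for illustration; `toy_eps` is a hypothetical stand-in for a conditional noise predictor, and the group names are illustrative.

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance with one global weight w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def decoupled_cfg(eps_fn, x, groups, weights):
    """Group-wise guidance (assumed additive form, not taken from the paper).

    groups:  dict mapping group name -> conditioning for that group
    weights: dict mapping group name -> guidance weight for that group
    eps_fn(x, cond) takes a dict of active groups; absent groups are
    treated as the null condition.
    """
    eps_null = eps_fn(x, {})
    out = eps_null.copy()
    for name, cond in groups.items():
        eps_g = eps_fn(x, {name: cond})
        out += weights[name] * (eps_g - eps_null)
    return out

def toy_eps(x, cond):
    """Hypothetical noise predictor: linear in x, additive in the conditions."""
    return 0.1 * x + sum(cond.values())

x = np.zeros(2)
groups = {"intervened": np.array([1.0, 0.0]), "invariant": np.array([0.0, 1.0])}
weights = {"intervened": 3.0, "invariant": 1.0}
guided = decoupled_cfg(toy_eps, x, groups, weights)
```

With a single group, `decoupled_cfg` reduces to `cfg`. With two groups, the intervened attributes can be pushed strongly while the invariant ones receive only mild guidance, which is the behaviour the abstract describes as mitigating attribute amplification.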
Experiments on CelebA-HQ, MIMIC-CXR, and EMBED show that DCFG improves intervention fidelity, mitigates unintended changes, and enhances reversibility, enabling more faithful and interpretable counterfactual image generation.
[101] Causally Steered Diffusion for Automated Video Counterfactual Generation
Nikos Spyrou,Athanasios Vlontzos,Paraskevas Pegios,Thomas Melistas,Nefeli Gkouti,Yannis Panagakis,Giorgos Papanastasiou,Sotirios A. Tsaftaris
Main category: cs.CV
TL;DR: The paper presents a causally faithful framework for counterfactual video generation using vision-language models, enabling realistic 'what-if' scenarios.
Details
Motivation: Current text-to-image latent diffusion models adapted for video editing struggle with maintaining causal relationships in content, potentially leading to unrealistic or misleading outcomes when edits affect causally dependent attributes. Method: The method uses a causally faithful framework guided by a vision-language model (VLM) that optimizes text prompts based on an assumed causal graph. It is agnostic to the underlying video editing system and does not require access to internal mechanisms or finetuning. Result: The approach was evaluated using standard video quality metrics and counterfactual-specific criteria, showing effective generation of causally faithful video counterfactuals within the learned distribution of latent diffusion models. Conclusion: This method can generate realistic 'what-if' video scenarios compatible with any black-box video editing system, holding potential for applications in healthcare and digital media. Abstract: Adapting text-to-image (T2I) latent diffusion models for video editing has shown strong visual fidelity and controllability, but challenges remain in maintaining causal relationships in video content. Edits affecting causally dependent attributes risk generating unrealistic or misleading outcomes if these relationships are ignored. In this work, we propose a causally faithful framework for counterfactual video generation, guided by a vision-language model (VLM). Our method is agnostic to the underlying video editing system and does not require access to its internal mechanisms or finetuning. Instead, we guide the generation by optimizing text prompts based on an assumed causal graph, addressing the challenge of latent space control in LDMs. We evaluate our approach using standard video quality metrics and counterfactual-specific criteria, such as causal effectiveness and minimality. 
Our results demonstrate that causally faithful video counterfactuals can be effectively generated within the learned distribution of LDMs through prompt-based causal steering. With its compatibility with any black-box video editing system, our method holds significant potential for generating realistic "what-if" video scenarios in diverse areas such as healthcare and digital media.
[102] Compositional Attribute Imbalance in Vision Datasets
Jiayi Chen,Yanbiao Ma,Andi Zhang,Weidong Tang,Wei Dai,Bowei Liu
Main category: cs.CV
TL;DR: Visual attribute imbalance impacts model performance. This paper defines image attributes, analyzes imbalance effects, and proposes a sampling strategy with data augmentation to improve model robustness and fairness.
Details
Motivation: To address the underexplored issue of visual attribute imbalance in image classification which significantly impacts model performance and generalization. Method: Define first-level and second-level image attributes, introduce a CLIP-based framework for automatic attribute evaluation, analyze single- and compositional attribute imbalance, propose adjusting sampling probability based on attribute rarity, integrate with data augmentation techniques. Result: Extensive experiments show that the proposed method effectively mitigates attribute imbalance, improving model robustness and fairness. Conclusion: The research highlights the importance of modeling visual attribute distributions and provides a scalable solution for long-tail image classification tasks. Abstract: Visual attribute imbalance is a common yet underexplored issue in image classification, significantly impacting model performance and generalization. In this work, we first define the first-level and second-level attributes of images and then introduce a CLIP-based framework to construct a visual attribute dictionary, enabling automatic evaluation of image attributes. By systematically analyzing both single-attribute imbalance and compositional attribute imbalance, we reveal how the rarity of attributes affects model performance. To tackle these challenges, we propose adjusting the sampling probability of samples based on the rarity of their compositional attributes. This strategy is further integrated with various data augmentation techniques (such as CutMix, Fmix, and SaliencyMix) to enhance the model's ability to represent rare attributes. Extensive experiments on benchmark datasets demonstrate that our method effectively mitigates attribute imbalance, thereby improving the robustness and fairness of deep neural networks. 
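Adjusting sampling probability by the rarity of a sample's attribute combination can be illustrated with a small sketch. The abstract does not specify the weighting function, so inverse-frequency weighting with an exponent `alpha` is an assumption here, and the attribute names are purely illustrative.

```python
from collections import Counter

def rarity_weights(attr_combos, alpha=1.0):
    """Sampling probability per sample, proportional to count(combo) ** -alpha.

    attr_combos: one hashable attribute combination per sample.
    alpha = 0 gives uniform sampling; larger alpha favours rare combinations.
    """
    counts = Counter(attr_combos)
    raw = [counts[c] ** -alpha for c in attr_combos]
    total = sum(raw)
    return [r / total for r in raw]

# Three samples share a common combination; one sample has a rare combination.
combos = [("dark", "small")] * 3 + [("bright", "large")]
probs = rarity_weights(combos)
```

In this toy dataset the rare sample is drawn half of the time and the three common samples share the other half. The resulting probabilities could be passed to `random.choices(range(len(combos)), weights=probs, k=batch_size)` to draw a batch, before any augmentation such as CutMix is applied.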
Our research highlights the importance of modeling visual attribute distributions and provides a scalable solution for long-tail image classification tasks.
[103] Toward Rich Video Human-Motion2D Generation
Ruihao Xi,Xuekuan Wang,Yongcheng Li,Shuhua Li,Zichen Wang,Yiwei Wang,Feng Wei,Cairong Zhao
Main category: cs.CV
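TL;DR: Generating realistic and controllable human motion, especially motion involving rich multi-character interactions, remains a major challenge. To address data scarcity and the complexity of modeling interpersonal dynamics, the authors first introduce a new large-scale rich video human motion 2D dataset (Motion2D-Video-150K) and then, building on it, propose a new diffusion-based rich video human motion 2D generation (RVHM2D) model. RVHM2D includes an enhanced text conditioning mechanism that uses either dual text encoders (CLIP-L/B) or T5-XXL with both global and local features. Training follows a two-stage strategy: the model is first trained with a standard diffusion objective and then fine-tuned with reinforcement learning using an FID-based reward to further improve motion realism and text alignment. Extensive experiments show that RVHM2D achieves leading performance on the Motion2D-Video-150K benchmark in generating both single-character and interactive double-character scenarios.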
Details
Motivation: Generating realistic and controllable human motion, especially motion involving rich multi-character interactions, is a major challenge because of data scarcity and the complexity of modeling interpersonal dynamics. Method: The authors first introduce a new large-scale rich video human motion 2D dataset (Motion2D-Video-150K) and then, building on it, propose a new diffusion-based rich video human motion 2D generation (RVHM2D) model. RVHM2D includes an enhanced text conditioning mechanism that uses either dual text encoders (CLIP-L/B) or T5-XXL with both global and local features. Training follows a two-stage strategy: the model is first trained with a standard diffusion objective and then fine-tuned with reinforcement learning using an FID-based reward to further improve motion realism and text alignment. Result: RVHM2D achieves leading performance on the Motion2D-Video-150K benchmark in generating both single-character and interactive double-character scenarios. Conclusion: By introducing the Motion2D-Video-150K dataset and the RVHM2D model, the work addresses data scarcity and the complexity of modeling interpersonal dynamics and generates realistic, controllable human motion, particularly in cases involving rich multi-character interactions. Abstract: Generating realistic and controllable human motions, particularly those involving rich multi-character interactions, remains a significant challenge due to data scarcity and the complexities of modeling inter-personal dynamics. To address these limitations, we first introduce a new large-scale rich video human motion 2D dataset (Motion2D-Video-150K) comprising 150,000 video sequences. Motion2D-Video-150K features a balanced distribution of diverse single-character and, crucially, double-character interactive actions, each paired with detailed textual descriptions. Building upon this dataset, we propose a novel diffusion-based rich video human motion2D generation (RVHM2D) model. RVHM2D incorporates an enhanced textual conditioning mechanism utilizing either dual text encoders (CLIP-L/B) or T5-XXL with both global and local features. We devise a two-stage training strategy: the model is first trained with a standard diffusion objective, and then fine-tuned using reinforcement learning with an FID-based reward to further enhance motion realism and text alignment. Extensive experiments demonstrate that RVHM2D achieves leading performance on the Motion2D-Video-150K benchmark in generating both single and interactive double-character scenarios.
[104] MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models
Hongyu Wang,Jiayu Xu,Ruiping Wang,Yan Feng,Yitao Zhai,Peng Pei,Xunliang Cai,Xilin Chen
Main category: cs.CV
TL;DR: The paper introduces MoTE, a method for training Mixture-of-Ternary-Experts models from a dense checkpoint that achieves performance comparable to a full-precision baseline with a lower memory footprint.
Details
Motivation: To address the high memory footprint of full-precision experts in Mixture-of-Experts (MoE) models, which poses challenges for deployment on edge devices. Method: Propose the MoTE approach, which trains more low-precision experts instead of fewer high-precision ones during up-cycling. Use the pre-trained FFN as a shared expert and train ternary routed experts with parameters in {-1, 0, 1}. Compatible with post-training quantization methods. Result: MoTE shows a promising scaling trend with model size and achieves performance comparable to the full-precision baseline MoE-LLaVA while offering a lower memory footprint. With post-training quantization, MoTE outperforms MoE-LLaVA by 4.3% average accuracy given the same expert memory footprint. Conclusion: MoTE is effective and shows potential for memory-constrained devices. Abstract: Large multimodal Mixture-of-Experts (MoEs) effectively scale the model size to boost performance while maintaining fixed active parameters. However, previous works primarily utilized full-precision experts during sparse up-cycling. Although they show superior performance on end tasks, the large number of experts introduces a higher memory footprint, which poses significant challenges for deployment on edge devices. In this work, we propose MoTE, a scalable and memory-efficient approach to train Mixture-of-Ternary-Experts models from a dense checkpoint. Instead of training fewer high-precision experts, we propose to train more low-precision experts during up-cycling. Specifically, we use the pre-trained FFN as a shared expert and train ternary routed experts with parameters in {-1, 0, 1}. Extensive experiments show that our approach has a promising scaling trend along model size. MoTE achieves comparable performance to the full-precision baseline MoE-LLaVA while offering a lower memory footprint. Furthermore, our approach is compatible with post-training quantization methods, and the advantage is further amplified as the memory constraint tightens. 
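To make "parameters in {-1, 0, 1}" concrete, the sketch below ternarizes a weight matrix with one shared scale. The abstract does not state which quantizer MoTE uses; the absmean rule shown here (scale by the mean absolute weight, round, then clip) is borrowed from BitNet b1.58 as a stand-in, and the function names are hypothetical.

```python
import numpy as np

def ternarize(w, eps=1e-8):
    """Map a float weight matrix to values in {-1, 0, 1} plus one scale.

    Absmean rule (an assumption, not confirmed as MoTE's quantizer):
    divide by the mean absolute weight, round, and clip to [-1, 1].
    """
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the original weights."""
    return q.astype(np.float32) * scale

w = np.array([[0.9, -0.05], [-1.2, 0.4]])
q, scale = ternarize(w)
```

A ternary weight carries at most log2(3), roughly 1.58, bits of information, compared with 16 bits for a half-precision weight, which is why many ternary experts can fit into the memory budget of a few full-precision ones.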
Given the same expert memory footprint of 3.4GB and combined with post-training quantization, MoTE outperforms MoE-LLaVA by 4.3% average accuracy on end tasks, demonstrating its effectiveness and potential for memory-constrained devices.
[105] Model compression using knowledge distillation with integrated gradients
David E. Hernandez,Jose Chang,Torbjörn E. M. Nordling
Main category: cs.CV
TL;DR: A model compression technique based on IG-augmented knowledge distillation significantly improves accuracy and reduces inference time for deployment on edge devices.
Details
Motivation: To develop an efficient model compression method that allows deployment of deep learning models on resource-constrained devices without significant loss in accuracy. Method: A novel method that overlays integrated gradients (IG) maps onto input images during training to provide student models with insights into teacher models' decision-making processes. Precomputing IG maps before training transforms runtime costs into a one-time preprocessing step. Result: Achieved 92.6% testing accuracy with a 4.1x compression factor, showing a 1.1 percentage point improvement over non-distilled models. Reduced inference time from 140 ms to 13 ms. Confirmed statistical robustness and generalisability beyond the initial dataset through extensive experiments. Conclusion: This IG-based knowledge distillation framework is a viable compression technique for real-world deployment on edge devices while maintaining competitive accuracy. Abstract: Model compression is critical for deploying deep learning models on resource-constrained devices. We introduce a novel method enhancing knowledge distillation with integrated gradients (IG) as a data augmentation strategy. Our approach overlays IG maps onto input images during training, providing student models with deeper insights into teacher models' decision-making processes. Extensive evaluation on CIFAR-10 demonstrates that our IG-augmented knowledge distillation achieves 92.6% testing accuracy with a 4.1x compression factor, a significant 1.1 percentage point improvement ($p<0.001$) over non-distilled models (91.5%). This compression reduces inference time from 140 ms to 13 ms. Our method precomputes IG maps before training, transforming substantial runtime costs into a one-time preprocessing step. 
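The integrated gradients computation and the overlay step can be sketched on a toy model. Integrated gradients itself is standard (the path integral of the gradient from a baseline to the input, approximated here with a midpoint Riemann sum); the `overlay` blending rule, the `strength` parameter, and the quadratic toy model are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline=None, steps=50):
    """Approximate IG with a midpoint Riemann sum along the straight path
    from baseline to x. grad_fn(z) returns dF/dz at the point z."""
    if baseline is None:
        baseline = np.zeros_like(x)
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.mean([grad_fn(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * grads

def overlay(image, ig_map, strength=0.5):
    """Hypothetical overlay: attenuate pixels with low attribution magnitude.
    strength = 0 returns the image unchanged."""
    mag = np.abs(ig_map)
    mag = mag / (mag.max() + 1e-8)
    return image * (1.0 - strength + strength * mag)

# Toy "teacher": F(x) = sum(w * x**2), whose gradient is 2 * w * x.
w = np.array([1.0, 2.0, 3.0])
toy_f = lambda x: float(np.sum(w * x ** 2))
toy_grad = lambda x: 2.0 * w * x

x = np.array([1.0, -1.0, 0.5])
ig = integrated_gradients(toy_grad, x)
augmented = overlay(np.abs(x), ig)
```

Because the IG maps depend only on the teacher and the input, they can be computed once for the whole training set and stored, which matches the one-time preprocessing step described above. The completeness property (attributions sum to F(x) minus F(baseline)) is a convenient sanity check on any implementation.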
Our comprehensive experiments include: (1) comparisons with attention transfer, revealing complementary benefits when combined with our approach; (2) Monte Carlo simulations confirming statistical robustness; (3) systematic evaluation of compression factor versus accuracy trade-offs across a wide range (2.2x-1122x); and (4) validation on an ImageNet subset aligned with CIFAR-10 classes, demonstrating generalisability beyond the initial dataset. These extensive ablation studies confirm that IG-based knowledge distillation consistently outperforms conventional approaches across varied architectures and compression ratios. Our results establish this framework as a viable compression technique for real-world deployment on edge devices while maintaining competitive accuracy.
[106] Adapting Lightweight Vision Language Models for Radiological Visual Question Answering
Aditya Shourya,Michel Dumontier,Chang Sun
Main category: cs.CV
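TL;DR: Despite limited data and model scale, the paper shows that a small vision-language model can achieve robust performance on radiological VQA through curated data and a multi-stage fine-tuning pipeline, and it introduces a saliency-based diagnostic tool for assessing model weaknesses.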
Details
Motivation: Current radiological visual question answering (VQA) models face difficult data procurement, complex modeling, and insufficient evaluation, so the potential of lightweight models in this domain and ways to optimize them need to be explored. Method: Fine-tune a lightweight 3-billion-parameter vision-language model using a cost-effective training pipeline that runs from synthetic question-answer pair generation to multi-stage fine-tuning on specialized radiology datasets (e.g., ROCO v2.0 and MedPix v2.0). A saliency-based diagnostic tool is also introduced to evaluate model performance. Result: The model performs robustly on both open- and closed-ended questions and, compared with large-scale models such as LLaVA-Med, still achieves promising results despite its smaller parameter count and training data. The diagnostic tool effectively identifies the model's failure modes. Conclusion: Properly tuned lightweight models can perform well on radiological VQA, and the proposed training pipeline and diagnostic tool offer a valuable reference for future research. Abstract: Recent advancements in vision-language systems have improved the accuracy of Radiological Visual Question Answering (VQA) Models. However, some challenges remain across each stage of model development: limited expert-labeled images hinder data procurement at scale; the intricate and nuanced patterns of radiological images make modeling inherently difficult; and the lack of evaluation efforts makes it difficult to identify cases where the model might be ill-conditioned. In this study, we fine-tune a lightweight 3B parameter vision-language model for Radiological VQA, demonstrating that small models, when appropriately tuned with curated data, can achieve robust performance across both open- and closed-ended questions. We propose a cost-effective training pipeline from synthetic question-answer pair generation to multi-stage fine-tuning on specialised radiological domain-targeted datasets (e.g., ROCO v2.0, MedPix v2.0). Our results show that despite operating at a fraction of the scale of state-of-the-art models such as LLaVA-Med, our model achieves promising performance given its small parameter size and the limited scale of training data. We introduce a lightweight saliency-based diagnostic tool that enables domain experts to inspect VQA model performance and identify ill-conditioned failure modes through saliency analysis.
[107] Dense360: Dense Understanding from Omnidirectional Panoramas
Yikang Zhou,Tao Zhang,Dizhe Zhang,Shunping Ji,Xiangtai Li,Lu Qi
Main category: cs.CV
TL;DR: Multimodal Large Language Models (MLLMs) need comprehensive visual inputs for dense understanding. This paper addresses dense understanding from omnidirectional panoramas by introducing a new dataset with 160K panoramas and ERP-RoPE, a position encoding scheme to solve challenges of using equirectangular projections (ERP). Additionally, they introduce Dense360-Bench, the first benchmark for evaluating MLLMs on omnidirectional captioning and grounding.
Details
Motivation: Existing MLLMs demonstrate impressive world understanding capabilities through limited field-of-view (FOV) visual inputs. The authors aim to take the first step toward dense understanding from omnidirectional panoramas. Method: The authors introduce an omnidirectional panorama dataset featuring reliability-scored annotations and address the challenges of using ERP through ERP-RoPE, a position encoding scheme specifically designed for panoramic ERP. They also introduce Dense360-Bench, the first benchmark for evaluating MLLMs on omnidirectional captioning and grounding. Result: The introduced dataset contains 160K panoramas with 5M dense entity-level captions, 1M unique referring expressions, and 100K entity-grounded panoramic scene descriptions. ERP-RoPE addresses the challenges of spatial continuity and latitude-dependent variation in information density. Dense360-Bench establishes a comprehensive framework for advancing dense visual-language understanding in panoramic settings. Conclusion: This work takes the first step toward dense understanding from omnidirectional panoramas by providing a new dataset, a position encoding scheme ERP-RoPE, and the first benchmark Dense360-Bench. Abstract: Multimodal Large Language Models (MLLMs) require comprehensive visual inputs to achieve dense understanding of the physical world. While existing MLLMs demonstrate impressive world understanding capabilities through limited field-of-view (FOV) visual inputs (e.g., 70 degrees), we take the first step toward dense understanding from omnidirectional panoramas. We first introduce an omnidirectional panorama dataset featuring a comprehensive suite of reliability-scored annotations. Specifically, our dataset contains 160K panoramas with 5M dense entity-level captions, 1M unique referring expressions, and 100K entity-grounded panoramic scene descriptions.
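The two ERP properties named in the summary above (left-right continuity along a circle of latitude, and information density that varies with latitude) can be illustrated with a minimal sketch. This is not ERP-RoPE itself, whose details are not given here; the function names and the cosine weighting are illustrative assumptions based only on standard equirectangular geometry.

```python
import math


def erp_row_weights(height):
    """Relative horizontal information density of each ERP image row.

    In an equirectangular image every row has the same number of pixels,
    but a row at latitude `lat` covers a circle whose circumference is
    proportional to cos(lat), so rows near the poles are oversampled.
    """
    weights = []
    for row in range(height):
        # Row centres run from near +90 degrees (top) to near -90 degrees (bottom).
        lat = math.pi / 2 - (row + 0.5) * math.pi / height
        weights.append(math.cos(lat))
    return weights


def wrapped_column_distance(col_a, col_b, width):
    """Shortest horizontal distance between two columns on the 360-degree circle.

    The left and right image borders are adjacent in the scene, so the
    distance wraps around instead of growing linearly with column index.
    """
    d = abs(col_a - col_b) % width
    return min(d, width - d)
```

For a 1024-pixel-wide panorama, columns 0 and 1023 are one pixel apart under the wrapped distance, which is the kind of continuity a position encoding for ERP has to respect.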
Compared to multi-view alternatives, panoramas can provide more complete, compact, and continuous scene representations through equirectangular projections (ERP). However, the use of ERP introduces two key challenges for MLLMs: i) spatial continuity along the circle of latitude, and ii) latitude-dependent variation in information density. We address these challenges through ERP-RoPE, a position encoding scheme specifically designed for panoramic ERP. In addition, we introduce Dense360-Bench, the first benchmark for evaluating MLLMs on omnidirectional captioning and grounding, establishing a comprehensive framework for advancing dense visual-language understanding in panoramic settings.[108] Foundation Model Insights and a Multi-Model Approach for Superior Fine-Grained One-shot Subset Selection
Zhijing Wan,Zhixiang Wang,Zheng Wang,Xin Xu,Shin'ichi Satoh
Main category: cs.CV
TL;DR: This paper explores using foundation models (FMs) for one-shot subset selection in deep learning, finding they outperform traditional information extractors on fine-grained datasets but not on coarse-grained ones with noisy labels. Based on this, the authors propose RAM-APL, a method leveraging multiple FMs to improve subset selection for fine-grained image datasets.
Details
Motivation: The motivation is to find an effective way to reduce deep learning training costs by identifying informative data subsets using foundation models (FMs), which may overcome the dataset-dependent limitation of traditional information extractors (IEs). Method: The authors conduct extensive experiments to compare FM-based subset selection with traditional IE-based methods across diverse datasets and investigate the performance differences among various FMs as IEs. They then propose RAM-APL, a method tailored for fine-grained image datasets that leverages multiple FMs to enhance subset selection by exploiting their complementary strengths. Result: Foundation models consistently outperform traditional information extractors on fine-grained datasets, but their advantage diminishes on coarse-grained datasets with noisy labels. The proposed RAM-APL method achieves state-of-the-art performance on several fine-grained datasets. Conclusion: Foundation models are more suitable for subset selection tasks on fine-grained datasets compared to traditional information extractors. Leveraging multiple FMs through the RAM-APL method can further enhance the performance for such tasks. Abstract: One-shot subset selection serves as an effective tool to reduce deep learning training costs by identifying an informative data subset based on the information extracted by an information extractor (IE). Traditional IEs, typically pre-trained on the target dataset, are inherently dataset-dependent. Foundation models (FMs) offer a promising alternative, potentially mitigating this limitation. This work investigates two key questions: (1) Can FM-based subset selection outperform traditional IE-based methods across diverse datasets? (2) Do all FMs perform equally well as IEs for subset selection? 
Extensive experiments uncovered surprising insights: FMs consistently outperform traditional IEs on fine-grained datasets, whereas their advantage diminishes on coarse-grained datasets with noisy labels. Motivated by these findings, we propose RAM-APL (RAnking Mean-Accuracy of Pseudo-class Labels), a method tailored for fine-grained image datasets. RAM-APL leverages multiple FMs to enhance subset selection by exploiting their complementary strengths. Our approach achieves state-of-the-art performance on fine-grained datasets, including Oxford-IIIT Pet, Food-101, and Caltech-UCSD Birds-200-2011.[109] I Speak and You Find: Robust 3D Visual Grounding with Noisy and Ambiguous Speech Inputs
Yu Qi,Lipeng Gu,Honghua Chen,Liangliang Nan,Mingqiang Wei
Main category: cs.CV
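TL;DR: Existing 3D visual grounding methods rely on precise text prompts to locate objects in 3D scenes. Real-world speech inputs, however, often suffer from transcription errors caused by accents, background noise, and varying speech rates, which limits the applicability of existing methods. To address this, the paper proposes SpeechRefer, a novel 3DVG framework designed to improve performance under noisy and ambiguous speech-to-text transcriptions. Through two key innovations, the Speech Complementary Module and the Contrastive Complementary Module, SpeechRefer reduces dependence on potentially erroneous transcriptions and remains robust even when transcription errors dominate. Experiments show that SpeechRefer substantially improves the performance of existing 3DVG methods.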
Details
Motivation: Existing 3D visual grounding methods rely on precise text prompts, but in practice speech inputs often contain transcription errors for various reasons, which limits the practical use of these methods. A new method that can handle noisy and ambiguous speech-to-text transcriptions is therefore needed. Method: A novel 3DVG framework named SpeechRefer is proposed, comprising two main modules: (1) the Speech Complementary Module, which captures acoustic similarities between phonetically related words in the speech signal and generates complementary proposal scores; and (2) the Contrastive Complementary Module, which uses contrastive learning to align erroneous text features with the corresponding speech features, reducing dependence on potentially erroneous transcriptions. Result: Extensive experiments on the SpeechRefer and SpeechNr3D datasets show that SpeechRefer significantly improves the performance of existing 3DVG methods. Conclusion: SpeechRefer demonstrates its potential to bridge noisy speech inputs and reliable 3D visual grounding, paving the way for more intuitive and practical multimodal systems. Abstract: Existing 3D visual grounding methods rely on precise text prompts to locate objects within 3D scenes. Speech, as a natural and intuitive modality, offers a promising alternative. Real-world speech inputs, however, often suffer from transcription errors due to accents, background noise, and varying speech rates, limiting the applicability of existing 3DVG methods. To address these challenges, we propose SpeechRefer, a novel 3DVG framework designed to enhance performance in the presence of noisy and ambiguous speech-to-text transcriptions. SpeechRefer integrates seamlessly with existing 3DVG models and introduces two key innovations. First, the Speech Complementary Module captures acoustic similarities between phonetically related words and highlights subtle distinctions, generating complementary proposal scores from the speech signal. This reduces dependence on potentially erroneous transcriptions. Second, the Contrastive Complementary Module employs contrastive learning to align erroneous text features with corresponding speech features, ensuring robust performance even when transcription errors dominate. Extensive experiments on the SpeechRefer and SpeechNr3D datasets demonstrate that SpeechRefer improves the performance of existing 3DVG methods by a large margin, which highlights SpeechRefer's potential to bridge the gap between noisy speech inputs and reliable 3DVG, enabling more intuitive and practical multimodal systems.[110] MOL: Joint Estimation of Micro-Expression, Optical Flow, and Landmark via Transformer-Graph-Style Convolution
Zhiwen Shao,Yifan Cheng,Feiran Li,Yong Zhou,Xuequan Lu,Yuan Xie,Lizhuang Ma
Main category: cs.CV
TL;DR: An end-to-end micro-action-aware deep learning framework with transformer, graph convolution, and vanilla convolution is proposed for facial micro-expression recognition. It outperforms state-of-the-art MER methods on benchmarks.
Details
Motivation: Facial micro-expression recognition is challenging due to transient and subtle actions. Existing methods depend on hand-crafted features, key frames or deep networks limited by small-scale datasets. Method: A novel F5C block composed of fully-connected convolution and channel correspondence convolution is proposed to extract local-global features from raw frames without key frame knowledge. Transformer-style fully-connected convolution extracts local features while maintaining global receptive fields, and graph-style channel correspondence convolution models correlations among feature patterns. MER, optical flow estimation, and facial landmark detection are jointly trained. Result: The framework outperforms state-of-the-art MER methods on CASME II, SAMM, and SMIC benchmarks, works well for optical flow estimation and facial landmark detection, and can capture facial subtle muscle actions in local regions associated with MEs. Conclusion: The proposed end-to-end micro-action-aware deep learning framework effectively recognizes facial micro-expressions and captures subtle muscle actions. Abstract: Facial micro-expression recognition (MER) is a challenging problem, due to transient and subtle micro-expression (ME) actions. Most existing methods depend on hand-crafted features, key frames like onset, apex, and offset frames, or deep networks limited by small-scale and low-diversity datasets. In this paper, we propose an end-to-end micro-action-aware deep learning framework with advantages from transformer, graph convolution, and vanilla convolution. In particular, we propose a novel F5C block composed of fully-connected convolution and channel correspondence convolution to directly extract local-global features from a sequence of raw frames, without the prior knowledge of key frames. 
The transformer-style fully-connected convolution is proposed to extract local features while maintaining global receptive fields, and the graph-style channel correspondence convolution is introduced to model the correlations among feature patterns. Moreover, MER, optical flow estimation, and facial landmark detection are jointly trained by sharing the local-global features. The two latter tasks contribute to capturing facial subtle action information for MER, which can alleviate the impact of insufficient training data. Extensive experiments demonstrate that our framework (i) outperforms the state-of-the-art MER methods on CASME II, SAMM, and SMIC benchmarks, (ii) works well for optical flow estimation and facial landmark detection, and (iii) can capture facial subtle muscle actions in local regions associated with MEs. The code is available at https://github.com/CYF-cuber/MOL.[111] SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks
Zijian Song,Xiaoxin Lin,Qiuming Huang,Guangrun Wang,Liang Lin
Main category: cs.CV
TL;DR: SIRI-Bench is a new benchmark for evaluating Vision-Language Models' spatial intelligence through video-based reasoning tasks, revealing significant struggles of state-of-the-art VLMs in spatial reasoning.
Details
Motivation: To address the underexplored systematic evaluation of Vision-Language Models' complex reasoning ability within spatial contexts. Method: Introduced SIRI-Bench, a benchmark consisting of nearly 1K video-question-answer triplets embedded in realistic 3D scenes, and developed an Automatic Scene Creation Engine to generate these scenes from abstract math problems using specialized LLM agents. Result: State-of-the-art VLMs showed significant difficulties in solving the tasks presented by SIRI-Bench, highlighting the challenge of spatial reasoning. Conclusion: The study aims to draw researchers' attention to spatially grounded reasoning and promote advancements in VLMs for visual problem-solving. Abstract: Large Language Models (LLMs) are experiencing rapid advancements in complex reasoning, exhibiting remarkable generalization in mathematics and programming. In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, the systematic evaluation of their complex reasoning ability within spatial contexts remains underexplored. To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs' spatial intelligence through video-based reasoning tasks. SIRI-Bench comprises nearly 1K video-question-answer triplets, where each problem is embedded in a realistic 3D scene and captured by video. By carefully designing questions and corresponding 3D scenes, our benchmark ensures that solving the questions requires both spatial comprehension for extracting information and high-level reasoning for deriving solutions, making it a challenging benchmark for evaluating VLMs. To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine. This engine, leveraging multiple specialized LLM agents, can generate realistic 3D scenes from abstract math problems, ensuring faithfulness to the original descriptions. 
Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of spatial reasoning. We hope that our study will bring researchers' attention to spatially grounded reasoning and advance VLMs in visual problem-solving.[112] VisLanding: Monocular 3D Perception for UAV Safe Landing via Depth-Normal Synergy
Zhuoyue Tan,Boyong He,Yuxiang Ji,Liaoni Wu
Main category: cs.CV
TL;DR: This paper presents VisLanding, a monocular 3D perception-based framework for safe UAV landing, which enhances safe zone identification accuracy and exhibits superior generalization and robustness.
Details
Motivation: To address the core challenge of autonomous UAV landing in complex and unknown environments. Method: Leverage the depth-normal synergy prediction capabilities of the Metric3D V2 model to construct an end-to-end safe landing zones estimation framework. Introduce a safe zone segmentation branch to transform the task into a binary semantic segmentation problem. Result: Experimental results show that VisLanding significantly improves the accuracy of safe zone identification through a depth-normal joint optimization mechanism while retaining zero-shot generalization advantages. It also enables the estimation of landing zone area. Conclusion: VisLanding enhances the accuracy of safe zone identification, shows superior generalization and robustness in cross-domain testing, and provides critical decision-making support for practical applications. Abstract: This paper presents VisLanding, a monocular 3D perception-based framework for safe UAV (Unmanned Aerial Vehicle) landing. Addressing the core challenge of autonomous UAV landing in complex and unknown environments, this study innovatively leverages the depth-normal synergy prediction capabilities of the Metric3D V2 model to construct an end-to-end safe landing zones (SLZ) estimation framework. By introducing a safe zone segmentation branch, we transform the landing zone estimation task into a binary semantic segmentation problem. The model is fine-tuned and annotated using the WildUAV dataset from a UAV perspective, while a cross-domain evaluation dataset is constructed to validate the model's robustness. Experimental results demonstrate that VisLanding significantly enhances the accuracy of safe zone identification through a depth-normal joint optimization mechanism, while retaining the zero-shot generalization advantages of Metric3D V2. The proposed method exhibits superior generalization and robustness in cross-domain testing compared to other approaches. 
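As a rough illustration of how a landing zone area could be derived from per-pixel depth and surface normals, the sketch below sums the surface area covered by each safe pixel under a pinhole camera model. It is a geometric sketch under stated assumptions, not the VisLanding implementation: the function name, the nested-list inputs, and the intrinsics `fx`, `fy`, `cx`, `cy` are all hypothetical.

```python
def landing_zone_area(depth, normals, mask, fx, fy, cx, cy):
    """Approximate metric area of the masked region.

    depth[v][u]   : metric depth Z along the optical axis
    normals[v][u] : unit surface normal (nx, ny, nz) in camera coordinates
    mask[v][u]    : True where the pixel belongs to the safe zone
    For a pinhole camera, one pixel covers a surface patch of area
    Z^2 / (fx * fy * |n . d|), where d = ((u-cx)/fx, (v-cy)/fy, 1)
    is the unnormalised viewing ray through the pixel.
    """
    area = 0.0
    for v, row in enumerate(mask):
        for u, safe in enumerate(row):
            if not safe:
                continue
            z = depth[v][u]
            nx, ny, nz = normals[v][u]
            dx, dy = (u - cx) / fx, (v - cy) / fy
            cos_term = abs(nx * dx + ny * dy + nz)
            if cos_term < 1e-6:
                continue  # surface seen edge-on; footprint is unbounded
            area += z * z / (fx * fy * cos_term)
    return area
```

For a flat surface facing the camera at 10 m, with `fx = fy = 100`, each pixel covers 0.01 square metres, so a 10x10 safe region corresponds to about one square metre.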
Furthermore, it enables the estimation of landing zone area by integrating predicted depth and normal information, providing critical decision-making support for practical applications.[113] Exploring Diffusion with Test-Time Training on Efficient Image Restoration
Rongchang Lu,Tianduo Luo,Yunzhi Zhang,Conghan Yue,Pei Yang,Guibao Liu,Changyang Gu
Main category: cs.CV
TL;DR: DiffRWKVIR is a new framework that combines Test-Time Training with efficient diffusion to address challenges in image restoration, outperforming existing models in various benchmarks.
Details
Motivation: Image restoration encounters issues such as ineffective feature fusion, computational bottlenecks, and inefficient diffusion processes. Method: The DiffRWKVIR framework introduces three innovations: Omni-Scale 2D State Evolution for global contextual awareness, Chunk-Optimized Flash Processing for reducing computational overhead, and Prior-Guided Efficient Diffusion for faster training/inference. Result: Evaluated across multiple benchmarks, DiffRWKVIR surpasses SwinIR, HAT, and MambaIR/v2 in PSNR, SSIM, LPIPS, and efficiency metrics. Conclusion: DiffRWKVIR sets a new standard for adaptive, high-efficiency image restoration with optimized hardware utilization. Abstract: Image restoration faces challenges including ineffective feature fusion, computational bottlenecks and inefficient diffusion processes. To address these, we propose DiffRWKVIR, a novel framework unifying Test-Time Training (TTT) with efficient diffusion. Our approach introduces three key innovations: (1) Omni-Scale 2D State Evolution extends RWKV's location-dependent parameterization to hierarchical multi-directional 2D scanning, enabling global contextual awareness with linear complexity O(L); (2) Chunk-Optimized Flash Processing accelerates intra-chunk parallelism by 3.2x via contiguous chunk processing (O(LCd) complexity), reducing sequential dependencies and computational overhead; (3) Prior-Guided Efficient Diffusion extracts a compact Image Prior Representation (IPR) in only 5-20 steps, proving 45% faster training/inference than DiffIR while solving computational inefficiency in denoising. Evaluated across super-resolution and inpainting benchmarks (Set5, Set14, BSD100, Urban100, Places365), DiffRWKVIR outperforms SwinIR, HAT, and MambaIR/v2 in PSNR, SSIM, LPIPS, and efficiency metrics. Our method establishes a new paradigm for adaptive, high-efficiency image restoration with optimized hardware utilization.[114] DreamLight: Towards Harmonious and Consistent Image Relighting
Yong Liu,Wenpeng Xiao,Qianqian Wang,Junlin Chen,Shiyin Wang,Yitong Wang,Xinglong Wu,Yansong Tang
Main category: cs.CV
TL;DR: This paper introduces DreamLight, a model for universal image relighting that can composite subjects into new backgrounds while maintaining lighting and color tone uniformity. It supports both image-based and text-based relighting, overcoming limitations of existing methods by employing a pretrained diffusion model, Position-Guided Light Adapter (PGLA), and Spectral Foreground Fixer (SFF). Extensive comparisons show DreamLight's superior performance.
Details
Motivation: The motivation is to address the limitations of existing relighting methods which primarily focus on image-based relighting or require expensive data for intrinsic decomposition and light source information, and struggle with generating realistic light interaction effects between foreground and background. Method: DreamLight reorganizes input data into a unified format leveraging semantic prior from a pretrained diffusion model. It proposes PGLA to condense light information and modulate the foreground with direction-biased masked attention, and SFF to adaptively reorganize frequency components enhancing foreground appearance consistency. Result: Extensive comparisons and user studies demonstrate DreamLight's remarkable relighting performance, achieving natural and realistic results in both image-based and text-based relighting scenarios. Conclusion: DreamLight successfully addresses challenges in universal image relighting by introducing innovative techniques such as PGLA and SFF, providing a robust solution for seamless compositing with aesthetic uniformity. Abstract: We introduce a model named DreamLight for universal image relighting in this work, which can seamlessly composite subjects into a new background while maintaining aesthetic uniformity in terms of lighting and color tone. The background can be specified by natural images (image-based relighting) or generated from unlimited text prompts (text-based relighting). Existing studies primarily focus on image-based relighting, with scant exploration of text-based scenarios. Some works employ intricate disentanglement pipeline designs relying on environment maps to provide relevant information, but these designs grapple with the expensive data cost required for intrinsic decomposition and light source information. Other methods take this task as an image translation problem and perform pixel-level transformation with autoencoder architecture.
While these methods have achieved decent harmonization effects, they struggle to generate realistic and natural light interaction effects between the foreground and background. To alleviate these challenges, we reorganize the input data into a unified format and leverage the semantic prior provided by the pretrained diffusion model to facilitate the generation of natural results. Moreover, we propose a Position-Guided Light Adapter (PGLA) that condenses light information from different directions in the background into designed light query embeddings, and modulates the foreground with direction-biased masked attention. In addition, we present a post-processing module named Spectral Foreground Fixer (SFF) to adaptively reorganize different frequency components of subject and relighted background, which helps enhance the consistency of foreground appearance. Extensive comparisons and a user study demonstrate that DreamLight achieves remarkable relighting performance.[115] Risk Estimation of Knee Osteoarthritis Progression via Predictive Multi-task Modelling from Efficient Diffusion Model using X-ray Images
David Butler,Adrian Hilton,Gustavo Carneiro
Main category: cs.CV
TL;DR: The paper presents an interpretable machine learning method using a diffusion model for knee osteoarthritis risk estimation and future image generation, improving the state-of-the-art by 2% with faster inference time.
Details
Motivation: Existing methods for estimating knee osteoarthritis risk using medical images are complex, lack interpretability, and fail to localize anatomical knee landmarks. Method: A new interpretable machine learning method is developed which uses multi-task predictive modelling to classify future knee OA severity and predict anatomical knee landmarks from efficiently generated high-quality future images. The image generation is achieved through a diffusion model in a class-conditioned latent space. Result: The approach improves the state-of-the-art by 2%, achieving an AUC of 0.71 in predicting knee OA progression while offering ~9% faster inference time when applied to the Osteoarthritis Initiative dataset. Conclusion: This novel method provides a more interpretable way to estimate knee OA progression risk and offers visual representation of disease progression. Abstract: Medical imaging plays a crucial role in assessing knee osteoarthritis (OA) risk by enabling early detection and disease monitoring. Recent machine learning methods have improved risk estimation (i.e., predicting the likelihood of disease progression) and predictive modelling (i.e., the forecasting of future outcomes based on current data) using medical images, but clinical adoption remains limited due to their lack of interpretability. Existing approaches that generate future images for risk estimation are complex and impractical. Additionally, previous methods fail to localize anatomical knee landmarks, limiting interpretability. We address these gaps with a new interpretable machine learning method to estimate the risk of knee OA progression via multi-task predictive modelling that classifies future knee OA severity and predicts anatomical knee landmarks from efficiently generated high-quality future images. 
Such image generation is achieved by leveraging a diffusion model in a class-conditioned latent space to forecast disease progression, offering a visual representation of how particular health conditions may evolve. Applied to the Osteoarthritis Initiative dataset, our approach improves the state-of-the-art (SOTA) by 2%, achieving an AUC of 0.71 in predicting knee OA progression while offering ~9% faster inference time.[116] Synthetic Data Augmentation for Table Detection: Re-evaluating TableNet's Performance with Automatically Generated Document Images
Krishna Sahukara,Zineddine Bettouche,Andreas Fischer
Main category: cs.CV
TL;DR: This paper proposes an automated LaTeX-based pipeline for generating realistic two-column pages with diverse table layouts and aligned ground-truth masks, which can be used to train TableNet and reduce manual annotation effort.
Details
Motivation: Document pages captured by smartphones or scanners often contain tables, but manual extraction is slow and error-prone. Method: An automated LaTeX-based pipeline is introduced that synthesizes realistic two-column pages with visually diverse table layouts and aligned ground-truth masks. The generated corpus augments the real-world Marmot benchmark. Result: Training TableNet on the synthetic data achieves a pixel-wise XOR error of 4.04% on the synthetic test set with a 256x256 input resolution, and 4.33% with 1024x1024. The best performance on the Marmot benchmark is 9.18% (at 256x256). Conclusion: The proposed method cuts manual annotation effort through automation. Abstract: Document pages captured by smartphones or scanners often contain tables, yet manual extraction is slow and error-prone. We introduce an automated LaTeX-based pipeline that synthesizes realistic two-column pages with visually diverse table layouts and aligned ground-truth masks. The generated corpus augments the real-world Marmot benchmark and enables a systematic resolution study of TableNet. Training TableNet on our synthetic data achieves a pixel-wise XOR error of 4.04% on our synthetic test set with a 256x256 input resolution, and 4.33% with 1024x1024. The best performance on the Marmot benchmark is 9.18% (at 256x256), while cutting manual annotation effort through automation.[117] PoseGRAF: Geometric-Reinforced Adaptive Fusion for Monocular 3D Human Pose Estimation
Ming Xu,Xu Zhang
Main category: cs.CV
TL;DR: Existing monocular 3D pose estimation methods mainly depend on joint positional features, but ignore the intrinsic directional and angular correlations within the skeleton. This often leads to implausible poses under joint occlusions or rapid motion changes. The authors propose PoseGRAF framework which includes a dual graph convolutional structure, Cross-Attention module, dynamic fusion module and an improved Transformer encoder. Experimental results show that this method outperforms state-of-the-art approaches on Human3.6M and MPI-INF-3DHP datasets and demonstrates generalizability on in-the-wild videos.
Details
Motivation: Current monocular 3D pose estimation methods primarily rely on joint positional features, while overlooking intrinsic directional and angular correlations within the skeleton. This can result in implausible poses under joint occlusions or rapid motion changes. Method: PoseGRAF framework is proposed. It constructs a dual graph convolutional structure to separately process joint and bone graphs, capturing their local dependencies. A Cross-Attention module models interdependencies between bone directions and joint features. A dynamic fusion module adaptively integrates both feature types by leveraging relational dependencies between joints and bones. An improved Transformer encoder is incorporated in a residual manner to generate the final output. Result: Experimental results on Human3.6M and MPI-INF-3DHP datasets show that the method exceeds state-of-the-art approaches. Additional evaluations on in-the-wild videos further validate its generalizability. Conclusion: PoseGRAF framework addresses the limitations of existing monocular 3D pose estimation methods by effectively capturing local dependencies, modeling interdependencies, integrating feature types adaptively, and generating accurate outputs through an improved Transformer encoder. The method has been proven to outperform state-of-the-art approaches and demonstrate generalizability. Abstract: Existing monocular 3D pose estimation methods primarily rely on joint positional features, while overlooking intrinsic directional and angular correlations within the skeleton. As a result, they often produce implausible poses under joint occlusions or rapid motion changes. To address these challenges, we propose the PoseGRAF framework. We first construct a dual graph convolutional structure that separately processes joint and bone graphs, effectively capturing their local dependencies. A Cross-Attention module is then introduced to model interdependencies between bone directions and joint features. 
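To make the distinction between joint positional features and bone directional or angular features concrete, here is a minimal sketch that derives unit bone directions and inter-bone angles from 3D joint coordinates. The five-joint chain in `BONES` and both function names are hypothetical and chosen only for illustration; they are not the skeleton definition or code used by PoseGRAF.

```python
import math

# Hypothetical kinematic chain: (parent_joint, child_joint) index pairs.
BONES = [(0, 1), (1, 2), (2, 3), (3, 4)]


def bone_directions(joints, bones=BONES):
    """Return one unit direction vector per bone, pointing parent -> child."""
    directions = []
    for parent, child in bones:
        vec = [c - p for p, c in zip(joints[parent], joints[child])]
        length = math.sqrt(sum(x * x for x in vec))
        if length > 0:
            directions.append([x / length for x in vec])
        else:
            # Degenerate bone (coincident joints): no defined direction.
            directions.append([0.0] * len(vec))
    return directions


def bone_angle(dir_a, dir_b):
    """Angle in radians between two unit bone directions."""
    dot = sum(a * b for a, b in zip(dir_a, dir_b))
    return math.acos(max(-1.0, min(1.0, dot)))
```

Unlike raw joint coordinates, these directions and angles do not change when the whole skeleton is translated, which is one reason such features can complement positional ones.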
Building upon this, a dynamic fusion module is designed to adaptively integrate both feature types by leveraging the relational dependencies between joints and bones. An improved Transformer encoder is further incorporated in a residual manner to generate the final output. Experimental results on the Human3.6M and MPI-INF-3DHP datasets show that our method exceeds state-of-the-art approaches. Additional evaluations on in-the-wild videos further validate its generalizability. The code is publicly available at https://github.com/iCityLab/PoseGRAF.[118] Align Your Flow: Scaling Continuous-Time Flow Map Distillation
Amirmojtaba Sabour,Sanja Fidler,Karsten Kreis
Main category: cs.CV
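TL;DR: This paper proposes new continuous-time objectives and training techniques to improve the performance of flow map models. With methods such as autoguidance and adversarial finetuning, the model performs strongly on image generation benchmarks and outperforms existing non-adversarially trained few-step samplers in text-to-image synthesis.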
Details
Motivation: Diffusion and flow models are state-of-the-art generative models but require many sampling steps. Consistency models can distill them into efficient one-step generators, but their performance inevitably degrades as the number of steps increases. Method: Two new continuous-time objectives and additional training techniques are introduced, generalizing existing consistency and flow matching objectives. Guidance from a low-quality model (autoguidance) and adversarial finetuning are used to further improve performance. Result: The method achieves state-of-the-art few-step generation performance on ImageNet 64x64 and 512x512, and outperforms all existing non-adversarially trained few-step samplers in text-to-image synthesis. Conclusion: The proposed flow map models (Align Your Flow) perform strongly on challenging image generation benchmarks, achieving efficient generation with small and efficient neural networks. Abstract: Diffusion- and flow-based models have emerged as state-of-the-art generative modeling approaches, but they require many sampling steps. Consistency models can distill these models into efficient one-step generators; however, unlike flow- and diffusion-based methods, their performance inevitably degrades when increasing the number of steps, which we show both analytically and empirically. Flow maps generalize these approaches by connecting any two noise levels in a single step and remain effective across all step counts. In this paper, we introduce two new continuous-time objectives for training flow maps, along with additional novel training techniques, generalizing existing consistency and flow matching objectives. We further demonstrate that autoguidance can improve performance, using a low-quality model for guidance during distillation, and an additional boost can be achieved by adversarial finetuning, with minimal loss in sample diversity. We extensively validate our flow map models, called Align Your Flow, on challenging image generation benchmarks and achieve state-of-the-art few-step generation performance on both ImageNet 64x64 and 512x512, using small and efficient neural networks. Finally, we show text-to-image flow map models that outperform all existing non-adversarially trained few-step samplers in text-conditioned synthesis.[119] Unsupervised Imaging Inverse Problems with Diffusion Distribution Matching
Giacomo Meanti,Thomas Ryckeboer,Michael Arbel,Julien Mairal
Main category: cs.CV
TL;DR: This paper introduces a method for image restoration using unpaired datasets and conditional flow matching, performing well in deblurring, PSF calibration, and blind super-resolution tasks.
Details
Motivation: Traditional image restoration methods require full knowledge of the forward model or paired degraded and ground-truth images. However, in real-world scenarios, the forward model is often unknown or misspecified, and collecting paired data is costly or infeasible. Method: The method uses conditional flow matching to model the distribution of degraded observations and simultaneously learns the forward model via a distribution-matching loss. Result: It outperforms single-image blind and unsupervised approaches on deblurring and non-uniform point spread function (PSF) calibration tasks, matches state-of-the-art performance on blind super-resolution, and effectively demonstrates lens calibration with minimal data acquisition effort. Conclusion: This approach provides a practical solution for real-world image restoration problems where paired data is unavailable or difficult to obtain. Abstract: This work addresses image restoration tasks through the lens of inverse problems using unpaired datasets. In contrast to traditional approaches -- which typically assume full knowledge of the forward model or access to paired degraded and ground-truth images -- the proposed method operates under minimal assumptions and relies only on small, unpaired datasets. This makes it particularly well-suited for real-world scenarios, where the forward model is often unknown or misspecified, and collecting paired data is costly or infeasible. The method leverages conditional flow matching to model the distribution of degraded observations, while simultaneously learning the forward model via a distribution-matching loss that arises naturally from the framework. Empirically, it outperforms both single-image blind and unsupervised approaches on deblurring and non-uniform point spread function (PSF) calibration tasks. It also matches state-of-the-art performance on blind super-resolution. 
We also showcase the effectiveness of our method with a proof of concept for lens calibration: a real-world application traditionally requiring time-consuming experiments and specialized equipment. In contrast, our approach achieves this with minimal data acquisition effort.
[120] VisText-Mosquito: A Multimodal Dataset and Benchmark for AI-Based Mosquito Breeding Site Detection and Reasoning
Md. Adnanul Islam,Md. Faiyaz Abdullah Sayeedi,Md. Asaduzzaman Shuvo,Muhammad Ziaur Rahman,Shahanur Rahman Bappy,Raiyan Rahman,Swakkhar Shatabda
Main category: cs.CV
TL;DR: This paper presents VisText-Mosquito, a multimodal dataset aiding in mosquito breeding site analysis through object detection, segmentation, and reasoning. YOLOv9s model excels in object detection with a precision of 0.92926 and mAP@50 of 0.92891. YOLOv11n-Seg performs well in segmentation with a precision of 0.91587 and mAP@50 of 0.79795. A fine-tuned BLIP model is used for reasoning generation achieving a BLEU score of 54.7, BERTScore of 0.91, and ROUGE-L of 0.87.
Details
Motivation: Mosquito-borne diseases present a significant global health risk necessitating early detection and proactive control of breeding sites to prevent outbreaks. Method: The method involves creating the VisText-Mosquito dataset which integrates visual and textual data for automated detection, segmentation, and reasoning in mosquito breeding site analysis. The YOLOv9s model is employed for object detection, YOLOv11n-Seg for segmentation, and a fine-tuned BLIP model for reasoning generation. Result: YOLOv9s achieved high precision (0.92926) and mAP@50 (0.92891) in object detection. YOLOv11n-Seg showed good performance in segmentation with precision (0.91587) and mAP@50 (0.79795). The fine-tuned BLIP model had a final loss of 0.0028, BLEU score of 54.7, BERTScore of 0.91, and ROUGE-L of 0.87. Conclusion: The VisText-Mosquito dataset and associated model framework highlight the importance of AI-based detection in proactively addressing mosquito-borne disease risks, supporting the theme 'Prevention is Better than Cure'. All materials are publicly available on GitHub. Abstract: Mosquito-borne diseases pose a major global health risk, requiring early detection and proactive control of breeding sites to prevent outbreaks. In this paper, we present VisText-Mosquito, a multimodal dataset that integrates visual and textual data to support automated detection, segmentation, and reasoning for mosquito breeding site analysis. The dataset includes 1,828 annotated images for object detection, 142 images for water surface segmentation, and natural language reasoning texts linked to each image. The YOLOv9s model achieves the highest precision of 0.92926 and mAP@50 of 0.92891 for object detection, while YOLOv11n-Seg reaches a segmentation precision of 0.91587 and mAP@50 of 0.79795. For reasoning generation, our fine-tuned BLIP model achieves a final loss of 0.0028, with a BLEU score of 54.7, BERTScore of 0.91, and ROUGE-L of 0.87. 
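The detection figures above (precision and mAP@50) both rest on an intersection-over-union (IoU) threshold of 0.5 for deciding whether a predicted box counts as a hit. The following is a minimal sketch of that matching step, not the authors' evaluation code: the function names, the `(x1, y1, x2, y2)` box layout, and the greedy matching order are assumptions made for illustration. Full mAP@50 additionally averages precision over recall levels and classes, which is not computed here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_at_iou(preds, gts, thr=0.5):
    """Fraction of predictions matched to a distinct ground-truth box.

    Predictions are processed in the order given (assumed to be sorted by
    confidence); each ground-truth box can be matched at most once.
    """
    matched = set()
    true_positives = 0
    for p in preds:
        best_j, best_iou = None, 0.0
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(p, g)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j is not None and best_iou >= thr:
            matched.add(best_j)
            true_positives += 1
    return true_positives / len(preds) if preds else 0.0
```

For example, two predictions of which only one overlaps a ground-truth box at IoU of at least 0.5 give a precision of 0.5.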
This dataset and model framework emphasize the theme "Prevention is Better than Cure", showcasing how AI-based detection can proactively address mosquito-borne disease risks. The dataset and implementation code are publicly available at GitHub: https://github.com/adnanul-islam-jisun/VisText-Mosquito
[121] 3DGS-IEval-15K: A Large-scale Image Quality Evaluation Database for 3D Gaussian-Splatting
Yuke Xing,Jiarui Wang,Peizhi Niu,Wenjie Huang,Guangtao Zhai,Yiling Xu
Main category: cs.CV
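TL;DR: 3D Gaussian Splatting (3DGS) is a promising approach to novel view synthesis, but its storage requirements are substantial. This paper presents 3DGS-IEval-15K, a dataset of 15,200 images for evaluating the perceptual impact of compression on 3DGS representations, with human perception data collected from 60 viewers through subjective experiments. The dataset provides a foundation for developing 3DGS-specific IQA metrics.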
Details
Motivation: Although recent 3DGS methods increasingly incorporate dedicated compression modules, a comprehensive framework for evaluating their perceptual impact is lacking. Method: A large-scale image quality assessment (IQA) dataset, 3DGS-IEval-15K, is created, comprising 15,200 images rendered from 10 real-world scenes by 6 representative 3DGS algorithms at 20 strategically selected viewpoints, with different compression levels producing various distortion effects. Human perception data is collected from 60 viewers through controlled subjective experiments. Result: Dataset quality is validated through scene diversity and MOS distribution analysis, and a comprehensive benchmark of 30 representative IQA metrics covering diverse types is established. Conclusion: The dataset is the largest-scale 3DGS quality assessment dataset to date; it provides a foundation for developing 3DGS-specific IQA metrics and essential data for investigating the view-dependent quality distribution patterns unique to 3DGS. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a promising approach for novel view synthesis, offering real-time rendering with high visual fidelity. However, its substantial storage requirements present significant challenges for practical applications. While recent state-of-the-art (SOTA) 3DGS methods increasingly incorporate dedicated compression modules, there is a lack of a comprehensive framework to evaluate their perceptual impact. Therefore we present 3DGS-IEval-15K, the first large-scale image quality assessment (IQA) dataset specifically designed for compressed 3DGS representations. Our dataset encompasses 15,200 images rendered from 10 real-world scenes through 6 representative 3DGS algorithms at 20 strategically selected viewpoints, with different compression levels leading to various distortion effects. Through controlled subjective experiments, we collect human perception data from 60 viewers. We validate dataset quality through scene diversity and MOS distribution analysis, and establish a comprehensive benchmark with 30 representative IQA metrics covering diverse types. As the largest-scale 3DGS quality assessment dataset to date, our work provides a foundation for developing 3DGS specialized IQA metrics, and offers essential data for investigating view-dependent quality distribution patterns unique to 3DGS. The database is publicly available at https://github.com/YukeXing/3DGS-IEval-15K.
[122] DDS-NAS: Dynamic Data Selection within Neural Architecture Search via On-line Hard Example Mining applied to Image Classification
Matt Poyser,Toby P. Breckon
Main category: cs.CV
TL;DR: The paper introduces DDS-NAS framework which speeds up Neural Architecture Search (NAS) training by 27x without performance loss through dynamic hard example mining and curriculum learning.
Details
Motivation: To solve the scalability challenge within Neural Architecture Search (NAS). Method: Speed up NAS training via dynamic hard example mining within a curriculum learning framework, using an autoencoder for image similarity embedding and a kd-tree structure to order images. Result: DDS-NAS framework speeds up gradient-based NAS strategies by up to 27x without loss in performance, reducing NAS training cycle duration and number of iterations for convergence. Conclusion: Dynamic hard example mining and curriculum learning can significantly enhance the efficiency of NAS training. Abstract: In order to address the scalability challenge within Neural Architecture Search (NAS), we speed up NAS training via dynamic hard example mining within a curriculum learning framework. By utilizing an autoencoder that enforces an image similarity embedding in latent space, we construct an efficient kd-tree structure to order images by furthest neighbour dissimilarity in a low-dimensional embedding. From a given query image from our subsample dataset, we can identify the most dissimilar image within the global dataset in logarithmic time. Via curriculum learning, we then dynamically re-formulate an unbiased subsample dataset for NAS optimisation, upon which the current NAS solution architecture performs poorly. We show that our DDS-NAS framework speeds up gradient-based NAS strategies by up to 27x without loss in performance. By maximising the contribution of each image sample during training, we reduce the duration of a NAS training cycle and the number of iterations required for convergence.
[123] Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models
Ling Li,Yao Zhou,Yuxuan Liang,Fugee Tsung,Jiaheng Wei
Main category: cs.CV
TL;DR: GLOBE is a new method that improves geo-localization by enhancing both recognition and reasoning using diverse social media images and task-specific rewards.
Details
Motivation: To overcome the limitations of current geo-localization methods which lack interpretability and rely on constrained datasets and supervised fine-tuning. Method: Propose GLOBE, which constructs MP16-Reason dataset from social media images and uses group-relative policy optimization with task-specific rewards to enhance locatability assessment, visual clue reasoning, and geolocation accuracy. Result: GLOBE surpasses state-of-the-art open-source LVLMs in geo-localization tasks, especially in diverse visual scenes, while providing more interpretable reasoning trajectories. Conclusion: GLOBE offers a significant advancement in geo-localization by improving both accuracy and interpretability through enhanced reasoning capabilities. Abstract: Previous methods for image geo-localization have typically treated the task as either classification or retrieval, often relying on black-box decisions that lack interpretability. The rise of large vision-language models (LVLMs) has enabled a rethinking of geo-localization as a reasoning-driven task grounded in visual cues. However, two major challenges persist. On the data side, existing reasoning-focused datasets are primarily based on street-view imagery, offering limited scene diversity and constrained viewpoints. On the modeling side, current approaches predominantly rely on supervised fine-tuning, which yields only marginal improvements in reasoning capabilities. To address these challenges, we propose a novel pipeline that constructs a reasoning-oriented geo-localization dataset, MP16-Reason, using diverse social media images. We introduce GLOBE, Group-relative policy optimization for Locatability assessment and Optimized visual-clue reasoning, yielding Bi-objective geo-Enhancement for the VLM in recognition and reasoning. GLOBE incorporates task-specific rewards that jointly enhance locatability assessment, visual clue reasoning, and geolocation accuracy. 
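The group-relative policy optimization step named above scores each sampled response relative to the other responses drawn for the same input, rather than against a learned value function. The sketch below illustrates only that normalization, under stated assumptions: the function names are hypothetical, the three reward terms are combined by a weighted sum (the abstract does not say how they are combined), and the population standard deviation is used.

```python
from statistics import mean, pstdev


def combined_reward(locatability, clue_reasoning, geo_accuracy,
                    weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task-specific rewards (weights are assumed)."""
    w_loc, w_clue, w_geo = weights
    return w_loc * locatability + w_clue * clue_reasoning + w_geo * geo_accuracy


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same image.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat their siblings get a positive learning signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A group whose responses all receive the same reward yields zero advantages, so it contributes no gradient signal in this scheme.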
Both qualitative and quantitative results demonstrate that GLOBE outperforms state-of-the-art open-source LVLMs on geo-localization tasks, particularly in diverse visual scenes, while also generating more insightful and interpretable reasoning trajectories.
[124] FocalClick-XL: Towards Unified and High-quality Interactive Segmentation
Xi Chen,Hengshuang Zhao
Main category: cs.CV
TL;DR: Interactive segmentation tool FocalClick-XL is developed by revisiting the classical coarse-to-fine design of FocalClick and introducing significant extensions. It decomposes interactive segmentation into meta-tasks capturing different levels of information, shares context- and detail-level information across interaction forms, and introduces a prompting layer for object-level encoding.
Details
Motivation: Existing methods for interactive segmentation have limitations in supporting only limited interaction forms and struggling to capture fine details. Method: The method proposes a novel pipeline called FocalClick-XL which follows a multi-stage strategy inspired by the classical coarse-to-fine design. Interactive segmentation is decomposed into meta-tasks capturing different levels of information (context, object, and detail), with each level assigned a dedicated subnet. Context- and detail-level information is shared across interaction forms while a prompting layer is introduced at the object level for encoding specific interaction types. Result: FocalClick-XL achieves state-of-the-art performance on click-based benchmarks and demonstrates remarkable adaptability to diverse interaction formats such as boxes, scribbles, and coarse masks. It can also predict alpha mattes with fine-grained details beyond generating binary masks. Conclusion: FocalClick-XL is a versatile and powerful tool for interactive segmentation that addresses the challenges of limited interaction forms and capturing fine details. Abstract: Interactive segmentation enables users to extract binary masks of target objects through simple interactions such as clicks, scribbles, and boxes. However, existing methods often support only limited interaction forms and struggle to capture fine details. In this paper, we revisit the classical coarse-to-fine design of FocalClick and introduce significant extensions. Inspired by its multi-stage strategy, we propose a novel pipeline, FocalClick-XL, to address these challenges simultaneously. 
Following the emerging trend of large-scale pretraining, we decompose interactive segmentation into meta-tasks that capture different levels of information -- context, object, and detail -- assigning a dedicated subnet to each level. This decomposition allows each subnet to undergo scaled pretraining with independent data and supervision, maximizing its effectiveness. To enhance flexibility, we share context- and detail-level information across different interaction forms as common knowledge while introducing a prompting layer at the object level to encode specific interaction types. As a result, FocalClick-XL achieves state-of-the-art performance on click-based benchmarks and demonstrates remarkable adaptability to diverse interaction formats, including boxes, scribbles, and coarse masks. Beyond binary mask generation, it is also capable of predicting alpha mattes with fine-grained details, making it a versatile and powerful tool for interactive segmentation.
[125] YOLOv11-RGBT: Towards a Comprehensive Single-Stage Multispectral Object Detection Framework
Dahang Wan,Rongsheng Lu,Yang Fang,Xianli Lang,Shuangbao Shu,Jingjing Chen,Siyuan Shen,Ting Xu,Zecong Ye
Main category: cs.CV
TL;DR: This paper introduces YOLOv11-RGBT, a new multimodal object detection framework based on YOLOv11 that includes six multispectral fusion modes and strategies like P3 mid-fusion and multispectral controllable fine-tuning (MCF) to optimize feature fusion and boost model performance. Experiments show significant improvements in mAP on major datasets.
Details
Motivation: Existing methods for multispectral object detection face challenges such as lack of unified single-stage framework, difficulty balancing performance and fusion strategy, and unreasonable modality weight allocation. Method: The authors designed six multispectral fusion modes applicable to models from YOLOv3 to YOLOv12 and RT-DETR. They proposed a P3 mid-fusion strategy and multispectral controllable fine-tuning (MCF) strategy after reevaluating the importance of modalities. Result: Experiments on three major datasets (LLVIP and FLIR among them) showed the framework excels with significant mAP improvements (3.41%-5.65%) on the FLIR dataset, reaching up to 47.61%. Conclusion: YOLOv11-RGBT effectively addresses existing challenges in multispectral object detection, enhancing model adaptability and robustness. Abstract: Multispectral object detection, which integrates information from multiple bands, can enhance detection accuracy and environmental adaptability, holding great application potential across various fields. Although existing methods have made progress in cross-modal interaction, low-light conditions, and model lightweight, there are still challenges like the lack of a unified single-stage framework, difficulty in balancing performance and fusion strategy, and unreasonable modality weight allocation. To address these, based on the YOLOv11 framework, we present YOLOv11-RGBT, a new comprehensive multimodal object detection framework. We designed six multispectral fusion modes and successfully applied them to models from YOLOv3 to YOLOv12 and RT-DETR. After reevaluating the importance of the two modalities, we proposed a P3 mid-fusion strategy and multispectral controllable fine-tuning (MCF) strategy for multispectral models. These improvements optimize feature fusion, reduce redundancy and mismatches, and boost overall model performance. 
Experiments show our framework excels on three major open-source multispectral object detection datasets, like LLVIP and FLIR. Particularly, the multispectral controllable fine-tuning strategy significantly enhanced model adaptability and robustness. On the FLIR dataset, it consistently improved YOLOv11 models' mAP by 3.41%-5.65%, reaching a maximum of 47.61%, verifying the framework and strategies' effectiveness. The code is available at: https://github.com/wandahangFY/YOLOv11-RGBT.
[126] Iterative Camera-LiDAR Extrinsic Optimization via Surrogate Diffusion
Ni Ou,Zhuo Chen,Xinru Zhang,Junzheng Wang
Main category: cs.CV
TL;DR: The paper proposes an iterative framework based on surrogate diffusion to improve the accuracy, robustness and stability of camera and LiDAR extrinsic calibration methods.
Details
Motivation: Camera and LiDAR data fusion is crucial for autonomous vehicles but requires precise extrinsic calibration. Current end-to-end calibration methods lack iterative optimization capabilities. Method: The method involves an iterative refinement process where initial extrinsic parameters are refined through a denoising process using the original calibration method as a surrogate denoiser. Result: Experiments show that integrating the proposed diffusion model with existing calibration methods leads to higher accuracy, improved robustness, and greater stability compared to other techniques. Conclusion: The proposed iterative framework can enhance any calibration method's performance without architectural modifications. Abstract: Cameras and LiDAR are essential sensors for autonomous vehicles. The fusion of camera and LiDAR data addresses the limitations of individual sensors but relies on precise extrinsic calibration. Recently, numerous end-to-end calibration methods have been proposed; however, most predict extrinsic parameters in a single step and lack iterative optimization capabilities. To address the increasing demand for higher accuracy, we propose a versatile iterative framework based on surrogate diffusion. This framework can enhance the performance of any calibration method without requiring architectural modifications. Specifically, the initial extrinsic parameters undergo iterative refinement through a denoising process, in which the original calibration method serves as a surrogate denoiser to estimate the final extrinsics at each step. For comparative analysis, we selected four state-of-the-art calibration methods as surrogate denoisers and compared the results of our diffusion process with those of two other iterative approaches. 
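The refinement loop described above can be pictured with a toy example. This is a sketch under strong assumptions, not the authors' algorithm: the extrinsics are treated as a plain vector, the update rule is borrowed from deterministic diffusion samplers (step toward the predicted clean value, keeping a shrinking fraction of the remaining offset), and `toy_calibrator` is a hypothetical stand-in for a single-step calibration network that removes most, but not all, of the current error.

```python
def refine(initial, surrogate, sigmas):
    """Iteratively refine an estimate using a surrogate denoiser.

    sigmas is a decreasing noise schedule ending at 0.0 (assumed). At each
    step the surrogate predicts the final value, and the estimate moves to
    that prediction plus a scaled-down share of the current offset.
    """
    x = list(initial)
    for s_t, s_next in zip(sigmas, sigmas[1:]):
        x0_hat = surrogate(x)      # surrogate's guess at the final extrinsics
        ratio = s_next / s_t       # fraction of the offset that is kept
        x = [h + ratio * (xi - h) for xi, h in zip(x, x0_hat)]
    return x


# Hypothetical ground truth and calibrator, for illustration only.
TRUE_EXTRINSICS = [0.1, -0.2, 0.05]


def toy_calibrator(x):
    """Stand-in single-step method: leaves 30% of the current error."""
    return [t + 0.3 * (xi - t) for xi, t in zip(x, TRUE_EXTRINSICS)]


def max_error(x):
    return max(abs(xi - t) for xi, t in zip(x, TRUE_EXTRINSICS))


initial = [1.0, 1.0, 1.0]
single_step = toy_calibrator(initial)
refined = refine(initial, toy_calibrator, sigmas=[1.0, 0.5, 0.25, 0.0])
```

In this toy setting the refined estimate ends closer to the ground truth than the single-step prediction, because the calibrator is re-applied to progressively better inputs without any change to the calibrator itself.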
Extensive experiments demonstrate that when integrated with our diffusion model, all calibration methods achieve higher accuracy, improved robustness, and greater stability compared to other iterative techniques and their single-step counterparts.
[127] DiFuse-Net: RGB and Dual-Pixel Depth Estimation using Window Bi-directional Parallax Attention and Cross-modal Transfer Learning
Kunal Swami,Debtanu Gupta,Amrit Kumar Muduli,Chirag Jaiswal,Pankaj Kumar Bajpai
Main category: cs.CV
TL;DR: Depth estimation is vital for intelligent systems. Traditional methods have limitations, but dual-pixel (DP) technology offers an alternative. This paper introduces DiFuse-Net for RGB and DP based depth estimation with a new attention mechanism and encoder. It also proposes Cross-modal Transfer Learning to use large-scale datasets and contributes a new high-quality dataset.
Details
Motivation: The motivation is to overcome the limitations of traditional stereo and active depth sensors in terms of cost, power, and robustness by leveraging dual-pixel technology which is common in modern cameras. Method: DiFuse-Net is a novel modality decoupled network design that disentangles RGB and DP for depth estimation. It features WBiPAM to capture DP disparity cues, a separate encoder for RGB contextual information, and fuses these features for enhanced depth prediction. CmTL utilizes large-scale RGB-D datasets to address the lack of RGB-DP-D datasets. Result: The evaluation shows superiority over DP and stereo-based baseline methods. Conclusion: DiFuse-Net provides an effective solution for depth estimation using dual-pixel technology, supported by a newly contributed high-quality dataset. Abstract: Depth estimation is crucial for intelligent systems, enabling applications from autonomous navigation to augmented reality. While traditional stereo and active depth sensors have limitations in cost, power, and robustness, dual-pixel (DP) technology, ubiquitous in modern cameras, offers a compelling alternative. This paper introduces DiFuse-Net, a novel modality decoupled network design for disentangled RGB and DP based depth estimation. DiFuse-Net features a window bi-directional parallax attention mechanism (WBiPAM) specifically designed to capture the subtle DP disparity cues unique to smartphone cameras with small aperture. A separate encoder extracts contextual information from the RGB image, and these features are fused to enhance depth prediction. We also propose a Cross-modal Transfer Learning (CmTL) mechanism to utilize large-scale RGB-D datasets in the literature to cope with the limitations of obtaining large-scale RGB-DP-D dataset. Our evaluation and comparison of the proposed method demonstrates its superiority over the DP and stereo-based baseline methods. 
Additionally, we contribute a new, high-quality, real-world RGB-DP-D training dataset, named Dual-Camera Dual-Pixel (DCDP) dataset, created using our novel symmetric stereo camera hardware setup, stereo calibration and rectification protocol, and AI stereo disparity estimation method.
[128] Active InSAR monitoring of building damage in Gaza during the Israel-Hamas War
Corey Scher,Jamon Van Den Hoek
Main category: cs.CV
TL;DR: The paper uses SAR data and LT-CCD approach to track weekly damage trends in the 2023 Israel-Hamas War, detecting 92.5% of UN-referenced damage with only 1.2% false positives.
Details
Motivation: To provide active monitoring of urban damage during protracted armed conflicts, specifically in the Gaza Strip due to aerial bombardment. Method: Application of long temporal-arc coherent change detection (LT-CCD) using interferometric SAR data from Sentinel-1 satellites to track damage trends weekly over the first year of the conflict. Result: Detected 92.5% of damage labels from UN reference data with a 1.2% false positive rate; revealed rapid damage increase in northern Gaza, a pause during ceasefire, and surges as conflict shifted southwards. By study end, three-fifths of all buildings were damaged or destroyed. Conclusion: This low-cost, low-latency method provides timely damage data needed for humanitarian and journalistic responses in conflict zones. Abstract: Aerial bombardment of the Gaza Strip beginning October 7, 2023 is one of the most intense bombing campaigns of the twenty-first century, driving widespread urban damage. Characterizing damage over a geographically dynamic and protracted armed conflict requires active monitoring. Synthetic aperture radar (SAR) has precedence for mapping disaster-induced damage with bi-temporal methods but applications to active monitoring during sustained crises are limited. Using interferometric SAR data from Sentinel-1, we apply a long temporal-arc coherent change detection (LT-CCD) approach to track weekly damage trends over the first year of the 2023- Israel-Hamas War. We detect 92.5% of damage labels in reference data from the United Nations with a negligible (1.2%) false positive rate. The temporal fidelity of our approach reveals rapidly increasing damage during the first three months of the war focused in northern Gaza, a notable pause in damage during a temporary ceasefire, and surges of new damage as conflict hot-spots shift from north to south. Three-fifths (191,263) of all buildings are damaged or destroyed by the end of the study. 
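Coherent change detection builds on interferometric coherence, a value between 0 and 1 that measures how similar two complex SAR acquisitions are over a small window; a sharp drop in coherence over a building footprint is the usual indicator of structural change. The sketch below shows the standard sample-coherence estimator and a simple drop measure. The function names and the idea of comparing a pre-event pair with a pair spanning the event are illustrative assumptions; the abstract does not specify how LT-CCD forms or thresholds its pairs.

```python
import math


def coherence(s1, s2):
    """Sample coherence of two equally sized windows of complex SAR pixels.

    Computes |sum(s1 * conj(s2))| / sqrt(sum|s1|^2 * sum|s2|^2).
    """
    num = sum(a * b.conjugate() for a, b in zip(s1, s2))
    den = math.sqrt(sum(abs(a) ** 2 for a in s1) * sum(abs(b) ** 2 for b in s2))
    return abs(num) / den if den > 0 else 0.0


def coherence_drop(pre_a, pre_b, post):
    """Coherence of a pre-event pair minus coherence across the event.

    Large positive values suggest the scene changed between pre_b and post.
    """
    return coherence(pre_a, pre_b) - coherence(pre_b, post)
```

Two identical windows give a coherence of 1, while windows whose phases are unrelated give a value near 0.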
With massive need for timely data on damage in armed conflict zones, our low-cost and low-latency approach enables rapid uptake of damage information at humanitarian and journalistic organizations.
[129] SyncTalk++: High-Fidelity and Efficient Synchronized Talking Heads Synthesis Using Gaussian Splatting
Ziqiao Peng,Wentao Hu,Junyuan Ma,Xiangyu Zhu,Xiaomei Zhang,Hao Zhao,Hui Tian,Jun He,Hongyan Liu,Zhaoxin Fan
Main category: cs.CV
TL;DR: SyncTalk++ is a system for creating realistic talking head videos with high synchronization. It includes a Dynamic Portrait Renderer, Face-Sync Controller, Head-Sync Stabilizer, Expression Generator, and Torso Restorer. SyncTalk++ outperforms current methods in experiments.
Details
Motivation: To address the challenge of achieving high synchronization in the synthesis of realistic, speech-driven talking head videos. Method: Introduced SyncTalk++, which features a Dynamic Portrait Renderer with Gaussian Splatting for subject identity preservation, a Face-Sync Controller that aligns lip movements with speech using a 3D facial blendshape model, a Head-Sync Stabilizer for natural head movements, an Expression Generator and a Torso Restorer for robustness to OOD audio. Result: Maintains consistency and continuity in visual details across frames and significantly improves rendering speed and quality, achieving up to 101 frames per second. Outperforms state-of-the-art methods in synchronization and realism based on extensive experiments and user studies. Conclusion: SyncTalk++ successfully addresses the issue of synchronization in creating realistic talking head videos and surpasses current methods in performance. Abstract: Achieving high synchronization in the synthesis of realistic, speech-driven talking head videos presents a significant challenge. A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. The absence of these synchronizations is a fundamental flaw, leading to unrealistic results. To address the critical issue of synchronization, identified as the ''devil'' in creating realistic talking heads, we introduce SyncTalk++, which features a Dynamic Portrait Renderer with Gaussian Splatting to ensure consistent subject identity preservation and a Face-Sync Controller that aligns lip movements with speech while innovatively using a 3D facial blendshape model to reconstruct accurate facial expressions. To ensure natural head movements, we propose a Head-Sync Stabilizer, which optimizes head poses for greater stability. 
Additionally, SyncTalk++ enhances robustness to out-of-distribution (OOD) audio by incorporating an Expression Generator and a Torso Restorer, which generate speech-matched facial expressions and seamless torso regions. Our approach maintains consistency and continuity in visual details across frames and significantly improves rendering speed and quality, achieving up to 101 frames per second. Extensive experiments and user studies demonstrate that SyncTalk++ outperforms state-of-the-art methods in synchronization and realism. We recommend watching the supplementary video: https://ziqiaopeng.github.io/synctalk++.
[130] Cost-Aware Routing for Efficient Text-To-Image Generation
Qinchan Li,Kenneth Chen,Changyue Su,Wittawat Jitkrittum,Qi Sun,Patsorn Sangkloy
Main category: cs.CV
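TL;DR: The paper proposes a method that adapts the amount of computation to prompt complexity by automatically routing each prompt to the most suitable text-to-image model or number of diffusion denoising steps, balancing quality against cost. Experiments on COCO and DiffusionDB show that its average quality exceeds that of any single model.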
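The routing decision summarized above can be sketched as choosing, for each prompt, the candidate generator with the best predicted quality after a cost penalty. This is an illustration only: the model names, the quality and cost numbers, and the scalarized score `quality - lam * cost` are assumptions made here, and the per-prompt quality predictions would come from a learned router that is not shown.

```python
def route(predicted_quality, costs, lam):
    """Return the candidate maximizing predicted quality minus lam * cost.

    predicted_quality and costs are dicts keyed by candidate name; lam sets
    how strongly compute cost is penalized (an assumed trade-off knob).
    """
    return max(predicted_quality,
               key=lambda name: predicted_quality[name] - lam * costs[name])


# Hypothetical candidates: a cheap distilled model and a 100-step diffusion run.
COSTS = {"distilled_small": 1.0, "diffusion_100_steps": 25.0}

simple_prompt = {"distilled_small": 0.80, "diffusion_100_steps": 0.82}
complex_prompt = {"distilled_small": 0.40, "diffusion_100_steps": 0.90}

choice_simple = route(simple_prompt, COSTS, lam=0.01)
choice_complex = route(complex_prompt, COSTS, lam=0.01)
```

With these made-up numbers the simple prompt is sent to the cheap model, since the expensive one adds little predicted quality, while the complex prompt is sent to the 100-step run.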
Details
Motivation: Existing diffusion models generate high-quality images but are computationally expensive because generation is inherently sequential, so a method is needed that flexibly adjusts computation to the complexity of the input prompt in order to balance quality and efficiency. Method: The authors propose a framework that automatically routes each prompt, according to its complexity, to the most appropriate text-to-image generation function, which may correspond to a distinct number of denoising steps of a diffusion model or to a fully independent text-to-image model. The router learns to reserve expensive choices (e.g., 100+ denoising steps) for complex prompts and to use more economical choices (e.g., a small distilled model) for simple ones, achieving the best trade-off between quality and computational cost. Result: Empirical results on COCO and DiffusionDB show that, by learning to route among nine already-trained text-to-image models, the method delivers higher average quality than any of these models alone. Conclusion: The study presents a framework that lowers computational cost while maintaining image quality, showing that flexibly choosing among generation models makes effective use of resources and outperforms any single model on multiple datasets. Abstract: Diffusion models are well known for their ability to generate a high-fidelity image for an input prompt through an iterative denoising process. Unfortunately, the high fidelity also comes at a high computational cost due to the inherently sequential generative process. In this work, we seek to optimally balance quality and computational cost, and propose a framework to allow the amount of computation to vary for each prompt, depending on its complexity. Each prompt is automatically routed to the most appropriate text-to-image generation function, which may correspond to a distinct number of denoising steps of a diffusion model, or a disparate, independent text-to-image model. Unlike uniform cost reduction techniques (e.g., distillation, model quantization), our approach achieves the optimal trade-off by learning to reserve expensive choices (e.g., 100+ denoising steps) only for a few complex prompts, and employ more economical choices (e.g., small distilled model) for less sophisticated prompts. We empirically demonstrate on COCO and DiffusionDB that by learning to route to nine already-trained text-to-image models, our approach is able to deliver an average quality that is higher than that achievable by any of these models alone.
[131] Scaling-Up the Pretraining of the Earth Observation Foundation Model PhilEO to the MajorTOM Dataset
Nikolaos Dionelis,Jente Bosmans,Riccardo Musto,Giancarlo Paoletti,Simone Sarti,Giacomo Cascarano,Casper Fibaek,Luke Camilleri,Bertrand Le Saux,Nicolas Longépé
Main category: cs.CV
TL;DR: The paper discusses the scaling-up of Earth Observation Foundation Models (PhilEO Geo-Aware U-Net) on large unlabeled datasets (MajorTOM 23TB and FastTOM 2TB), explores various model variants, and evaluates their performance in downstream tasks such as road density estimation, building density regression, and land cover segmentation.
Details
Motivation: To effectively utilize the massive volumes of data produced by Earth Observation satellites, it is necessary to pretrain EO Foundation Models on large unlabeled datasets for efficient fine-tuning on downstream tasks with minimal labeled data. Method: The authors scale up the PhilEO Geo-Aware U-Net model on two large datasets - MajorTOM 23TB and FastTOM 2TB. They develop different model variants with varying numbers of parameters and architectures, including transitions from U-Net CNNs to Vision Transformers (ViT). The models are then fine-tuned on the PhilEO Bench for tasks like road density estimation, building density regression, and land cover semantic segmentation. Result: For road density regression, the PhilEO 44M MajorTOM 23TB model performs better than PhilEO Globe 0.5TB 44M across all n-shots. For most n-shots in road density estimation and building density regression, PhilEO 200M FastTOM outperforms other models. Both dataset and model scaling show effectiveness when evaluated using the PhilEO Bench. Conclusion: Scaling up both the dataset and the model architecture enhances the performance of EO Foundation Models on downstream tasks. Transitioning from U-Net CNNs to Vision Transformers also shows promising results. Abstract: Today, Earth Observation (EO) satellites generate massive volumes of data, with the Copernicus Sentinel-2 constellation alone producing approximately 1.6TB per day. To fully exploit this information, it is essential to pretrain EO Foundation Models (FMs) on large unlabeled datasets, enabling efficient fine-tuning for several different downstream tasks with minimal labeled data. In this work, we present the scaling-up of our recently proposed EO Foundation Model, PhilEO Geo-Aware U-Net, on the unlabeled 23TB dataset MajorTOM, which covers the vast majority of the Earth's surface, as well as on the specialized subset FastTOM 2TB that does not include oceans and ice. 
We develop and study various PhilEO model variants with different numbers of parameters and architectures. Finally, we fine-tune the models on the PhilEO Bench for road density estimation, building density pixel-wise regression, and land cover semantic segmentation, and we evaluate the performance. Our results demonstrate that for all n-shots for road density regression, the PhilEO 44M MajorTOM 23TB model outperforms PhilEO Globe 0.5TB 44M. We also show that for most n-shots for road density estimation and building density regression, PhilEO 200M FastTOM outperforms all the other models. The effectiveness of both dataset and model scaling is validated using the PhilEO Bench. We also study the impact of architecture scaling, transitioning from U-Net Convolutional Neural Networks (CNN) to Vision Transformers (ViT).
[132] ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM
Yujun Wang,Jinhe Bi,Yunpu Ma,Soeren Pirk
Main category: cs.CV
TL;DR: Multimodal Large Language Models (MLLMs) often hallucinate due to over-reliance on partial cues. Methods like Visual Contrastive Decoding (VCD) and Instruction Contrastive Decoding (ICD) mitigate this issue, but their effectiveness may stem from deeper shifts in attention distribution rather than surface-level changes. This paper proposes an attention-steerable contrastive decoding framework that directly intervenes in the model's attention mechanisms, reducing hallucinations and improving performance across various benchmarks.
Details
Motivation: MLLMs suffer from hallucinations as they over-rely on partial cues, leading to incorrect responses. Existing methods such as VCD and ICD have been proposed to address this issue by contrasting predictions from perturbed or negatively prefixed inputs against original outputs. Method: The authors propose an attention-steerable contrastive decoding framework which directly intervenes in the model's attention mechanisms. This approach aims to offer a more principled method for mitigating hallucinations compared to existing techniques. Result: Experiments conducted across multiple MLLM architectures and diverse decoding methods show that the proposed framework significantly reduces hallucinations and enhances performance on benchmarks like POPE, CHAIR, MMHal-Bench, and standard VQA benchmarks. Conclusion: The attention-steerable contrastive decoding framework provides a promising direction for reducing hallucinations in MLLMs by directly intervening in the model's attention mechanisms. Abstract: Multimodal Large Language Models (MLLMs) often suffer from hallucinations. They over-rely on partial cues and generate incorrect responses. Recently, methods like Visual Contrastive Decoding (VCD) and Instruction Contrastive Decoding (ICD) have been proposed to mitigate hallucinations by contrasting predictions from perturbed or negatively prefixed inputs against original outputs. In this work, we uncover that methods like VCD and ICD fundamentally influence internal attention dynamics of the model. This observation suggests that their effectiveness may not stem merely from surface-level modifications to logits but from deeper shifts in attention distribution. Inspired by this insight, we propose an attention-steerable contrastive decoding framework that directly intervenes in attention mechanisms of the model to offer a more principled approach to mitigating hallucinations. 
Our experiments across multiple MLLM architectures and diverse decoding methods demonstrate that our approach significantly reduces hallucinations and improves the performance on benchmarks such as POPE, CHAIR, and MMHal-Bench, while simultaneously enhancing performance on standard VQA benchmarks.
[133] CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion
Jiahua Ma,Yiran Qin,Yixiong Li,Xuanqi Liao,Yulan Guo,Ruimao Zhang
Main category: cs.CV
TL;DR: Causal Diffusion Policy (CDP) enhances robot learning by using historical action sequences and a caching mechanism, achieving higher accuracy and robustness in real-world conditions.