cs.CL
[1] Argument Quality Annotation and Gender Bias Detection in Financial Communication through Large Language Models
Alaa Alhamzeh,Mays Al Rebdawi
Main category: cs.CL
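TL;DR: This paper examines how well three state-of-the-art large language models (GPT-4o, Llama 3.1, and Gemma 2) annotate argument quality in financial communications, and introduces an adversarial attack to analyse model responses and assess model fairness and robustness.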
Details
Motivation: Financial arguments play a key role in shaping investment decisions and public trust in financial institutions, yet the assessment of their quality remains poorly studied in the literature. Method: Using the FinArgQuality dataset, the authors evaluate the consistency of LLM-generated annotations across multiple runs and introduce an adversarial attack to analyse model responses. Result: LLM-based annotations achieve higher inter-annotator agreement across multiple runs than human annotators, but the models still exhibit varying degrees of gender bias. Conclusion: The paper provides a multifaceted analysis of these results and offers practical recommendations to guide future research toward more reliable, cost-effective, and bias-aware annotation methods. Abstract: Financial arguments play a critical role in shaping investment decisions and public trust in financial institutions. Nevertheless, assessing their quality remains poorly studied in the literature. In this paper, we examine the capabilities of three state-of-the-art LLMs (GPT-4o, Llama 3.1, and Gemma 2) in annotating argument quality within financial communications, using the FinArgQuality dataset. Our contributions are twofold. First, we evaluate the consistency of LLM-generated annotations across multiple runs and benchmark them against human annotations. Second, we introduce an adversarial attack designed to inject gender bias to analyse the models' responses and ensure the models' fairness and robustness. Both experiments are conducted across three temperature settings to assess their influence on annotation stability and alignment with human labels. Our findings reveal that LLM-based annotations achieve higher inter-annotator agreement than human counterparts, though the models still exhibit varying degrees of gender bias. We provide a multifaceted analysis of these outcomes and offer practical recommendations to guide future research toward more reliable, cost-effective, and bias-aware annotation methodologies.
[2] TurQUaz at CheckThat! 2025: Debating Large Language Models for Scientific Web Discourse Detection
Tarık Saraç,Selin Mergen,Mucahid Kutlu
Main category: cs.CL
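TL;DR: This paper proposes a council debate method for scientific web discourse detection, which ranked first in detecting references to scientific studies.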
Details
Motivation: To identify scientific web discourse in tweets, including scientific claims, references to studies, and mentions of scientific entities, the authors develop a more effective debate-based method. Method: A novel council debate method simulates structured academic discussions among multiple large language models (LLMs) to determine whether a given tweet contains a scientific claim, a reference to a scientific study, or mentions of scientific entities. Three debating methods are explored: single debate, team debate, and council debate. Result: Council debate outperformed the other methods on the development test set. It did not rank highly for identifying scientific claims (8th out of 10) or mentions of scientific entities (9th out of 10), but it ranked first in detecting references to scientific studies. Conclusion: Although the method ranked low for identifying scientific claims and scientific entities, it ranked first in detecting references to scientific studies, so council debate performs well on specific tasks. Abstract: In this paper, we present our work developed for the scientific web discourse detection task (Task 4a) of CheckThat! 2025. We propose a novel council debate method that simulates structured academic discussions among multiple large language models (LLMs) to identify whether a given tweet contains (i) a scientific claim, (ii) a reference to a scientific study, or (iii) mentions of scientific entities. We explore three debating methods: i) single debate, where two LLMs argue for opposing positions while a third acts as a judge; ii) team debate, in which multiple models collaborate within each side of the debate; and iii) council debate, where multiple expert models deliberate together to reach a consensus, moderated by a chairperson model. We choose council debate as our primary model as it outperforms others in the development test set. Although our proposed method did not rank highly for identifying scientific claims (8th out of 10) or mentions of scientific entities (9th out of 10), it ranked first in detecting references to scientific studies.
[3] Heartificial Intelligence: Exploring Empathy in Language Models
Victoria Williams,Benjamin Rosman
Main category: cs.CL
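TL;DR: The study finds that large language models outperform humans on cognitive empathy but score lower on affective empathy, suggesting potential for virtual companionship and emotional support while lacking human emotional resonance.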
Details
Motivation: As large language models are widely used in professional and personal contexts, they often serve as virtual assistants and companions. Studying their cognitive and affective empathy is necessary to better understand their potential and limits in human interaction. Method: Using standardized psychological tests, the authors measured cognitive and affective empathy in several small (SLMs) and large (LLMs) language models and compared the results with those of human participants, including psychology students. Result: LLMs consistently outperformed humans (including psychology students) on cognitive empathy tasks, but both small and large language models scored significantly lower than human participants on affective empathy. Conclusion: The findings highlight rapid progress in language models' ability to simulate cognitive empathy, suggesting strong potential for effective virtual companionship and personalized emotional support. Their high cognitive but lower affective empathy could also allow objective and consistent emotional support without emotional fatigue or bias. Abstract: Large language models have become increasingly common, used by millions of people worldwide in both professional and personal contexts. As these models continue to advance, they are frequently serving as virtual assistants and companions. In human interactions, effective communication typically involves two types of empathy: cognitive empathy (understanding others' thoughts and emotions) and affective empathy (emotionally sharing others' feelings). In this study, we investigated both cognitive and affective empathy across several small (SLMs) and large (LLMs) language models using standardized psychological tests. Our results revealed that LLMs consistently outperformed humans - including psychology students - on cognitive empathy tasks. However, despite their cognitive strengths, both small and large language models showed significantly lower affective empathy compared to human participants. These findings highlight rapid advancements in language models' ability to simulate cognitive empathy, suggesting strong potential for providing effective virtual companionship and personalized emotional support. Additionally, their high cognitive yet lower affective empathy allows objective and consistent emotional support without running the risk of emotional fatigue or bias.
[4] Real-time News Story Identification
Tadej Škvorc,Nikola Ivačič,Sebastjan Hribar,Marko Robnik-Šikonja
Main category: cs.CL
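TL;DR: This paper proposes a real-time story identification approach for a news monitoring system that combines text representation techniques, clustering algorithms, and online topic modeling methods to assign news articles to specific stories as they are published.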
Details
Motivation: To improve the reading experience, many news sites organize news into topical collections. Story identification aims to assign each news article to the specific story it covers, but existing text clustering and topic modeling methods do not meet the need to group articles by particular events, places, and people. Method: The approach combines text representation techniques, clustering algorithms, and online topic modeling methods, uses several text representation methods to extract the specific events and named entities needed for story identification, and adapts online topic modeling methods such as BERTopic, DBStream, and TextClust. Result: Evaluated on a one-month news dataset from Slovene media, the real-time approach produces sensible groupings as judged by human evaluators. Conclusion: The paper demonstrates an effective real-time story identification approach that assigns news articles to the specific stories they cover as they appear, suitable for news monitoring systems. Abstract: To improve the reading experience, many news sites organize news into topical collections, called stories. In this work, we present an approach for implementing real-time story identification for a news monitoring system that automatically collects news articles as they appear online and processes them in various ways. Story identification aims to assign each news article to a specific story that the article is covering. The process is similar to text clustering and topic modeling, but requires that articles be grouped based on particular events, places, and people, rather than general text similarity (as in clustering) or general (predefined) topics (as in topic modeling). We present an approach to story identification that is capable of functioning in real time, assigning articles to stories as they are published online. In the proposed approach, we combine text representation techniques, clustering algorithms, and online topic modeling methods. We combine various text representation methods to extract specific events and named entities necessary for story identification, showing that a mixture of online topic-modeling approaches such as BERTopic, DBStream, and TextClust can be adapted for story discovery. We evaluate our approach on a news dataset from Slovene media covering a period of 1 month. We show that our real-time approach produces sensible results as judged by human evaluators.
[5] TT-XAI: Trustworthy Clinical Text Explanations via Keyword Distillation and LLM Reasoning
Kristian Miok,Blaz Škrlj,Daniela Zaharie,Marko Robnik Šikonja
Main category: cs.CL
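TL;DR: This paper proposes TT-XAI, a framework that uses domain-aware keyword distillation and large language model reasoning to make AI in clinical decision support more trustworthy and auditable.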
Details
Motivation: Clinical language models often struggle to provide trustworthy predictions and explanations when applied to lengthy, unstructured electronic health records (EHRs). Method: The framework relies on domain-aware keyword distillation and reasoning with large language models (LLMs). First, raw discharge notes are distilled into concise keyword representations to enhance BERT classifier performance and improve local explanation fidelity. Second, keyword-guided prompts are used to generate chain-of-thought clinical explanations. Result: All evaluation modalities consistently favor the keyword-augmented method, confirming that distillation enhances both machine and human interpretability. Conclusion: TT-XAI offers a scalable pathway toward trustworthy, auditable AI in clinical decision support. Abstract: Clinical language models often struggle to provide trustworthy predictions and explanations when applied to lengthy, unstructured electronic health records (EHRs). This work introduces TT-XAI, a lightweight and effective framework that improves both classification performance and interpretability through domain-aware keyword distillation and reasoning with large language models (LLMs). First, we demonstrate that distilling raw discharge notes into concise keyword representations significantly enhances BERT classifier performance and improves local explanation fidelity via a focused variant of LIME. Second, we generate chain-of-thought clinical explanations using keyword-guided prompts to steer LLMs, producing more concise and clinically relevant reasoning. We evaluate explanation quality using deletion-based fidelity metrics, self-assessment via LLaMA-3 scoring, and a blinded human study with domain experts. All evaluation modalities consistently favor the keyword-augmented method, confirming that distillation enhances both machine and human interpretability. TT-XAI offers a scalable pathway toward trustworthy, auditable AI in clinical decision support.
[6] Distilling Knowledge from Large Language Models: A Concept Bottleneck Model for Hate and Counter Speech Recognition
Roberto Labadie-Tamayo,Djordje Slijepčević,Xihui Chen,Adrian Jaques Böck,Andreas Babic,Liz Freimann,Christiane Atzmüller,Matthias Zeppelzauer
Main category: cs.CL
TL;DR: SCBM is a transparent, adjective-based method for hate and counter speech recognition that achieves strong performance and high interpretability across multiple datasets.
Details
Motivation: The rapid increase in hate speech on social media necessitates automated detection methods that are not only effective but also transparent and interpretable, unlike traditional black-box models. Method: SCBM uses adjectives as human-interpretable bottleneck concepts, leveraging large language models to map input texts into an abstract adjective-based representation, which is then classified. Additionally, the method integrates this representation with transformer embeddings. Result: SCBM achieved an average macro-F1 score of 0.69 across five datasets, outperforming recent literature results on four of them. It also provided high interpretability and a 1.8% performance boost on average when combined with transformer embeddings. Conclusion: The proposed SCBM method provides a transparent and interpretable approach for hate and counter speech recognition, showing effectiveness across multiple datasets and potential adaptability to other NLP tasks. Abstract: The rapid increase in hate speech on social media has exposed an unprecedented impact on society, making automated methods for detecting such content important. Unlike prior black-box models, we propose a novel transparent method for automated hate and counter speech recognition, i.e., "Speech Concept Bottleneck Model" (SCBM), using adjectives as human-interpretable bottleneck concepts. SCBM leverages large language models (LLMs) to map input texts to an abstract adjective-based representation, which is then sent to a light-weight classifier for downstream tasks. Across five benchmark datasets spanning multiple languages and platforms (e.g., Twitter, Reddit, YouTube), SCBM achieves an average macro-F1 score of 0.69 which outperforms the most recently reported results from the literature on four out of five datasets. Aside from high recognition accuracy, SCBM provides a high level of both local and global interpretability. 
Furthermore, fusing our adjective-based concept representation with transformer embeddings leads to a 1.8% performance increase on average across all datasets, showing that the proposed representation captures complementary information. Our results demonstrate that adjective-based concept representations can serve as compact, interpretable, and effective encodings for hate and counter speech recognition. With adapted adjectives, our method can also be applied to other NLP tasks.
[7] MLLM-CBench: A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis
Haiyun Guo,ZhiYan Hou,Yu Chen,Jinghan He,Yandu Sun,Yuzhe Zhou,Shujing Guo,Kuan Zhu,Jinqiao Wang
Main category: cs.CL
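TL;DR: This paper introduces MLLM-CTBench, a comprehensive evaluation benchmark for continual instruction tuning, intended to address the lack of systematic benchmarks for multimodal large language models.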
Details
Motivation: Multimodal large language models (MLLMs) rely on continual instruction tuning to adapt to the evolving demands of real-world applications, but progress is hindered by the lack of rigorous and systematic benchmarks. The paper proposes MLLM-CTBench to fill this gap. Method: The authors build a comprehensive evaluation benchmark comprising multidimensional evaluation, a comprehensive evaluation of algorithms and training paradigms, and carefully curated tasks. They combine final answer accuracy with fine-grained chain-of-thought (CoT) reasoning quality assessment, benchmark eight continual learning algorithms, and compare reinforcement learning with supervised fine-tuning. Result: The findings are: (i) models with stronger general capabilities are more robust during continual learning; (ii) reasoning chains degrade more slowly than final answers, supporting the hierarchical forgetting hypothesis; (iii) the effectiveness of continual learning algorithms depends strongly on model capability and task order; (iv) in reinforcement learning settings, KL-divergence constraints help maintain policy stability and play a key role in mitigating forgetting. Conclusion: MLLM-CTBench establishes a rigorous standard for continual instruction tuning of MLLMs and offers practical guidance for algorithm design and evaluation. Abstract: Multimodal Large Language Models (MLLMs) rely on continual instruction tuning to adapt to the evolving demands of real-world applications. However, progress in this area is hindered by the lack of rigorous and systematic benchmarks. To address this gap, we present MLLM-CTBench, a comprehensive evaluation benchmark with three key contributions: (1) Multidimensional Evaluation: We combine final answer accuracy with fine-grained CoT reasoning quality assessment, enabled by a specially trained CoT evaluator; (2) Comprehensive Evaluation of Algorithms and Training Paradigms: We benchmark eight continual learning algorithms across four major categories and systematically compare reinforcement learning with supervised fine-tuning paradigms; (3) Carefully Curated Tasks: We select and organize 16 datasets from existing work, covering six challenging domains. Our key findings include: (i) Models with stronger general capabilities exhibit greater robustness to forgetting during continual learning; (ii) Reasoning chains degrade more slowly than final answers, supporting the hierarchical forgetting hypothesis; (iii) The effectiveness of continual learning algorithms is highly dependent on both model capability and task order; (iv) In reinforcement learning settings, incorporating KL-divergence constraints helps maintain policy stability and plays a crucial role in mitigating forgetting.
MLLM-CTBench establishes a rigorous standard for continual instruction tuning of MLLMs and offers practical guidance for algorithm design and evaluation.
[8] Evaluating Contrast Localizer for Identifying Causal Units in Social & Mathematical Tasks in Language Models
Yassine Jamaa,Badr AlKhamissi,Satrajit Ghosh,Martin Schrimpf
Main category: cs.CL
TL;DR: This study challenges the effectiveness of contrast-based localizers in identifying crucial units for specific tasks in language and vision-language models, suggesting a need for improved methods.
Details
Motivation: To understand the causal relevance of units in large language and vision-language models for specific tasks like Theory of Mind and mathematical reasoning. Method: The researchers used a neuroscientific contrast localizer to identify causally relevant units in various LLMs and VLMs, employing contrastive stimulus sets and targeted ablations to evaluate causal roles. Result: Surprisingly, low-activation units sometimes caused more significant performance drops than highly activated ones, and units from the mathematical localizer often impaired ToM performance more than those from the ToM localizer. Conclusion: The study's findings question the causal relevance of contrast-based localizers in identifying task-specific units for Theory of Mind and mathematical reasoning, suggesting the need for more comprehensive approaches. Abstract: This work adapts a neuroscientific contrast localizer to pinpoint causally relevant units for Theory of Mind (ToM) and mathematical reasoning tasks in large language models (LLMs) and vision-language models (VLMs). Across 11 LLMs and 5 VLMs ranging in size from 3B to 90B parameters, we localize top-activated units using contrastive stimulus sets and assess their causal role via targeted ablations. We compare the effect of lesioning functionally selected units against low-activation and randomly selected units on downstream accuracy across established ToM and mathematical benchmarks. Contrary to expectations, low-activation units sometimes produced larger performance drops than the highly activated ones, and units derived from the mathematical localizer often impaired ToM performance more than those from the ToM localizer. These findings call into question the causal relevance of contrast-based localizers and highlight the need for broader stimulus sets to more accurately capture task-specific units.
[9] Objective Metrics for Evaluating Large Language Models Using External Data Sources
Haoze Du,Richard Li,Edward Gehringer
Main category: cs.CL
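TL;DR: This paper proposes a new approach to evaluating large language models that uses an automated and transparent evaluation pipeline to reduce subjective bias and improve the scalability and applicability of evaluation.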
Details
Motivation: Evaluating the performance of large language models (LLMs) is a critical yet challenging task, particularly when aiming to avoid subjective assessments. Method: Well-defined benchmarks, factual datasets, and structured evaluation pipelines are used to make evaluation automated and transparent, reducing reliance on human interpretation. Result: The approach yields consistent, reproducible, and bias-minimized evaluation of LLM outputs across various tasks. Conclusion: The paper proposes a framework that uses subjective metrics derived from class textual materials across different semesters to assess LLM outputs, aiming to address the limitations of subjective evaluation methods and provide a scalable solution for performance assessment. Abstract: Evaluating the performance of Large Language Models (LLMs) is a critical yet challenging task, particularly when aiming to avoid subjective assessments. This paper proposes a framework for leveraging subjective metrics derived from the class textual materials across different semesters to assess LLM outputs across various tasks. By utilizing well-defined benchmarks, factual datasets, and structured evaluation pipelines, the approach ensures consistent, reproducible, and bias-minimized measurements. The framework emphasizes automation and transparency in scoring, reducing reliance on human interpretation while ensuring alignment with real-world applications. This method addresses the limitations of subjective evaluation methods, providing a scalable solution for performance assessment in educational, scientific, and other high-stakes domains.
[10] MinionsLLM: a Task-adaptive Framework For The Training and Control of Multi-Agent Systems Through Natural Language
Andres Garcia Rincon,Eliseo Ferrante
Main category: cs.CL
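TL;DR: The MinionsLLM framework combines large language models with behavior trees and uses two synthetic data generation methods to improve natural language control, substantially raising task performance in multi-agent systems.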
Details
Motivation: To enable natural language control of multi-agent systems in arbitrary, user-defined environments while improving syntactic validity and task relevance. Method: MinionsLLM introduces two synthetic dataset generation methods (Method A and Method B) to improve the syntactic validity and semantic task relevance of large language models. Result: Validated with Google's Gemma 3 model family, Method B raises syntactic validity to 92.6% and improves mean task performance by 33%. Conclusion: By integrating large language models, behavior trees, and formal grammars, MinionsLLM enables natural language control of multi-agent systems and shows the promise of deploying compact, locally hosted LLMs in resource-constrained scenarios. Abstract: This paper presents MinionsLLM, a novel framework that integrates Large Language Models (LLMs) with Behavior Trees (BTs) and Formal Grammars to enable natural language control of multi-agent systems within arbitrary, user-defined environments. MinionsLLM provides standardized interfaces for defining environments, agents, and behavioral primitives, and introduces two synthetic dataset generation methods (Method A and Method B) to fine-tune LLMs for improved syntactic validity and semantic task relevance. We validate our approach using Google's Gemma 3 model family at three parameter scales (1B, 4B, and 12B) and demonstrate substantial gains: Method B increases syntactic validity to 92.6% and achieves a mean task performance improvement of 33% over baseline. Notably, our experiments show that smaller models benefit most from fine-tuning, suggesting promising directions for deploying compact, locally hosted LLMs in resource-constrained multi-agent control scenarios. The framework and all resources are released open-source to support reproducibility and future research.
[11] The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs
Denis Janiak,Jakub Binkowski,Albert Sawczyn,Bogdan Gabrys,Ravid Schwartz-Ziv,Tomasz Kajdanowicz
Main category: cs.CL
TL;DR: This paper shows that current methods for detecting hallucinations in large language models are poorly evaluated by traditional metrics like ROUGE, which misaligns with human judgment, and suggests the need for better evaluation frameworks.
Details
Motivation: The motivation for this study is the growing concern over the hallucination tendencies of large language models (LLMs) and the realization that existing evaluation metrics, such as ROUGE, do not align well with human judgments, potentially leading to misleading performance estimates. Method: The authors conducted comprehensive human studies to evaluate the performance of hallucination detection methods, comparing traditional metrics like ROUGE with human-aligned metrics such as LLM-as-Judge. They also analyzed the effectiveness of simple heuristics, such as response length, in detecting hallucinations. Result: The study found that ROUGE has high recall but extremely low precision, leading to inaccurate performance estimates. Several established detection methods showed performance drops of up to 45.9% when evaluated using human-aligned metrics. Additionally, it was found that simple heuristics based on response length could rival complex detection techniques. Conclusion: The paper concludes that current hallucination detection methods for LLMs are not effectively evaluated by traditional metrics like ROUGE, and emphasizes the need for more semantically aware and robust evaluation frameworks to ensure the trustworthiness of LLM outputs. Abstract: Large language models (LLMs) have revolutionized natural language processing, yet their tendency to hallucinate poses serious challenges for reliable deployment. Despite numerous hallucination detection methods, their evaluations often rely on ROUGE, a metric based on lexical overlap that misaligns with human judgments. Through comprehensive human studies, we demonstrate that while ROUGE exhibits high recall, its extremely low precision leads to misleading performance estimates. In fact, several established detection methods show performance drops of up to 45.9% when assessed using human-aligned metrics like LLM-as-Judge.
Moreover, our analysis reveals that simple heuristics based on response length can rival complex detection techniques, exposing a fundamental flaw in current evaluation practices. We argue that adopting semantically aware and robust evaluation frameworks is essential to accurately gauge the true performance of hallucination detection methods, ultimately ensuring the trustworthiness of LLM outputs.
[12] Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions
Farah Atif,Nursultan Askarbekuly,Kareem Darwish,Monojit Choudhury
Main category: cs.CL
TL;DR: This paper introduces FiqhQA, a benchmark for evaluating Large Language Models' ability to generate Islamic rulings according to specific Sunni schools of thought, highlighting the importance of task-specific evaluation and caution in deploying these models for religious applications.
Details
Motivation: The motivation behind this study is the increasing use of Large Language Models (LLMs) in answering questions across various domains, including religious ones, despite their reliability and accuracy remaining largely unexamined in these contexts. Method: The authors introduced a novel benchmark called FiqhQA for evaluating LLMs on their ability to generate Islamic rulings specific to the four major Sunni schools of thought in both Arabic and English. They conducted zero-shot and abstention experiments to assess the accuracy and abstention behavior of various LLMs. Result: The experiments revealed significant variation among LLMs in terms of accuracy and abstention behavior. GPT-4o outperformed other models in accuracy, while Gemini and Fanar demonstrated better abstention behavior. All models showed a performance drop in Arabic, indicating limitations in religious reasoning for non-English languages. Conclusion: The study highlights the need for task-specific evaluation and cautious deployment of LLMs in religious applications, particularly in Islamic jurisprudence. Abstract: Despite the increasing usage of Large Language Models (LLMs) in answering questions in a variety of domains, their reliability and accuracy remain unexamined for a plethora of domains including the religious domains. In this paper, we introduce a novel benchmark FiqhQA focused on LLM-generated Islamic rulings explicitly categorized by the four major Sunni schools of thought, in both Arabic and English. Unlike prior work, which either overlooks the distinctions between religious schools of thought or fails to evaluate abstention behavior, we assess LLMs not only on their accuracy but also on their ability to recognize when not to answer. Our zero-shot and abstention experiments reveal significant variation across LLMs, languages, and legal schools of thought.
While GPT-4o outperforms all other models in accuracy, Gemini and Fanar demonstrate superior abstention behavior critical for minimizing confident incorrect answers. Notably, all models exhibit a performance drop in Arabic, highlighting the limitations in religious reasoning for languages other than English. To the best of our knowledge, this is the first study to benchmark the efficacy of LLMs for fine-grained Islamic school of thought specific ruling generation and to evaluate abstention for Islamic jurisprudence queries. Our findings underscore the need for task-specific evaluation and cautious deployment of LLMs in religious applications.
[13] Putnam-AXIOM: A Functional and Static Benchmark
Aryan Gulati,Brando Miranda,Eric Chen,Emily Xia,Kai Fronsdal,Bruno Dumont,Elyas Obbad,Sanmi Koyejo
Main category: cs.CL
TL;DR: The paper introduces Putnam-AXIOM, a new mathematical reasoning benchmark for evaluating large language models in a contamination-resilient setting.
Details
Motivation: Current mathematical reasoning benchmarks for LLMs are approaching saturation and are affected by training-set contamination, so more effective benchmarks are needed. Method: The authors introduce two benchmarks, Putnam-AXIOM and Putnam-AXIOM Variation, and evaluate multiple models on both. Result: On the Putnam-AXIOM Original set, the strongest model, o1-preview, scores 41.9%, but its accuracy drops by 19.6% on Putnam-AXIOM Variation. The remaining 18 models show the same downward trend. Conclusion: Putnam-AXIOM provides a rigorous, contamination-resilient evaluation framework for assessing the advanced mathematical reasoning of LLMs. Abstract: Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some achieving > 90% accuracy, and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants. The variation protocol produces an unlimited stream of equally difficult, unseen instances -- yielding a contamination-resilient test bed. On the Original set, OpenAI's o1-preview -- the strongest evaluated model -- scores 41.9%, but its accuracy drops by 19.6% (46.8% relative decrease) on the paired Variations. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and highlight the necessity of dynamic benchmarks. We complement "boxed" accuracy with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates natural language proof evaluations. Putnam-AXIOM therefore provides a rigorous, contamination-resilient evaluation framework for assessing advanced mathematical reasoning of LLMs. Data and evaluation code are publicly available at https://github.com/brando90/putnam-axiom.
[14] CoDAE: Adapting Large Language Models for Education via Chain-of-Thought Data Augmentation
Shuzhou Yuan,William LaCroix,Hardik Ghoshal,Ercong Nie,Michael Färber
Main category: cs.CL
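TL;DR: CoDAE is a framework that adapts large language models (LLMs) for educational use through Chain-of-Thought (CoT) data augmentation. Models fine-tuned on collected and augmented real-world dialogue data gave more appropriate pedagogical guidance, supported reasoning better, and avoided premature answer disclosure.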
Details
Motivation: Existing large language models (LLMs) have several shortcomings in educational settings: they give answers too readily, fail to adapt their responses to learner uncertainty, and are vulnerable to emotionally manipulative prompts. The work aims to address these problems. Method: The authors design the CoDAE framework, collect real-world dialogue data, and augment it with Chain-of-Thought (CoT) prompting. They also design dialogue cases that target specific problems, fine-tune models on the resulting datasets, and evaluate them with several automatic metrics and LLM-as-a-judge assessment. Result: In simulated educational scenarios, models fine-tuned with CoDAE provided more appropriate pedagogical guidance, supported reasoning processes, and effectively prevented premature answer disclosure. Conclusion: Applying CoDAE yields fine-tuned models that provide more effective pedagogical guidance, support reasoning processes, and avoid premature answer disclosure. Abstract: Large Language Models (LLMs) are increasingly employed as AI tutors due to their scalability and potential for personalized instruction. However, off-the-shelf LLMs often underperform in educational settings: they frequently reveal answers too readily, fail to adapt their responses to student uncertainty, and remain vulnerable to emotionally manipulative prompts. To address these challenges, we introduce CoDAE, a framework that adapts LLMs for educational use through Chain-of-Thought (CoT) data augmentation. We collect real-world dialogues between students and a ChatGPT-based tutor and enrich them using CoT prompting to promote step-by-step reasoning and pedagogically aligned guidance. Furthermore, we design targeted dialogue cases to explicitly mitigate three key limitations: over-compliance, low response adaptivity, and threat vulnerability. We fine-tune four open-source LLMs on different variants of the augmented datasets and evaluate them in simulated educational scenarios using both automatic metrics and LLM-as-a-judge assessments. Our results show that models fine-tuned with CoDAE deliver more pedagogically appropriate guidance, better support reasoning processes, and effectively resist premature answer disclosure.
[15] Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery
Jiatong Li,Weida Wang,Qinggang Zhang,Junxian Li,Di Zhang,Changmeng Zheng,Shufei Zhang,Xiaoyong Wei,Qing Li
Main category: cs.CL
TL;DR: Mol-R1 enhances the reasoning capabilities of large language models in molecule discovery by combining novel dataset distillation and iterative training strategies.
Details
Motivation: Long-CoT reasoning models struggle with knowledge-intensive domains like molecule discovery due to complexity and lack of high-quality annotations. This work aims to bridge that gap by improving reasoning performance in such domains. Method: Mol-R1 employs Prior Regulation via In-context Distillation (PRID) to curate a high-quality reasoning dataset and uses Molecular Iterative Adaptation (MoIA), combining Supervised Fine-tuning (SFT) with Reinforced Policy Optimization (RPO), to enhance reasoning capabilities for molecule discovery. Result: The Mol-R1 framework demonstrates superior performance over existing baselines in text-based molecule reasoning generation tasks. Conclusion: Mol-R1 improves the reasoning performance and explainability of explicit long-CoT LLMs in text-based molecule generation tasks, outperforming existing baselines. Abstract: Large language models (LLMs), especially Explicit Long Chain-of-Thought (CoT) reasoning models like DeepSeek-R1 and QWQ, have demonstrated powerful reasoning capabilities, achieving impressive performance in commonsense reasoning and mathematical inference. Despite their effectiveness, Long-CoT reasoning models are often criticized for their limited ability and low efficiency in knowledge-intensive domains such as molecule discovery. Success in this field requires a precise understanding of domain knowledge, including molecular structures and chemical principles, which is challenging due to the inherent complexity of molecular data and the scarcity of high-quality expert annotations. To bridge this gap, we introduce Mol-R1, a novel framework designed to improve explainability and reasoning performance of R1-like Explicit Long-CoT reasoning LLMs in text-based molecule generation. 
Our approach begins with a high-quality reasoning dataset curated through Prior Regulation via In-context Distillation (PRID), a dedicated distillation strategy to effectively generate paired reasoning traces guided by prior regulations. Building upon this, we introduce MoIA, Molecular Iterative Adaptation, a sophisticated training strategy that iteratively combines Supervised Fine-tuning (SFT) with Reinforced Policy Optimization (RPO), tailored to boost the reasoning performance of R1-like reasoning models for molecule discovery. Finally, we examine the performance of Mol-R1 in the text-based molecule reasoning generation task, showing superior performance against existing baselines.
[16] Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
Saketh Reddy Vemula,Dipti Mishra Sharma,Parameswari Krishnamurthy
Main category: cs.CL
TL;DR: This study finds that while morphological alignment helps with syntax-based tasks, the tokenizer algorithm significantly influences performance, with Unigram tokenizers outperforming others and hybrid tokenizers improving BPE-based models.
Details
Motivation: Prior work showed conflicting findings about whether morphologically aligned tokenization improves performance, especially for languages with complex morphology. This paper aims to investigate this by analyzing multiple languages and tokenizer algorithms. Method: A comprehensive evaluation of language models across Telugu, Hindi, and English was conducted, focusing on morphological alignment and tokenization quality. A dataset with gold morpheme segmentations was created for Telugu, and different tokenizer algorithms (Byte-pair Encoding vs. Unigram) were assessed. Result: Better morphological alignment correlates moderately with improved performance on syntax-based tasks like Parts-of-Speech tagging, Named Entity Recognition, and Dependency Parsing. However, the choice of tokenizer algorithm (e.g., Unigram vs. BPE) has a more significant impact on performance. Hybrid tokenizers incorporating morphological segmentation improve performance within the BPE framework, while intrinsic metrics like Corpus Token Count and Rényi entropy do not correlate with downstream performance. Conclusion: Morphological alignment moderately improves performance in syntax-based tasks, but the tokenizer algorithm has a more significant impact on overall performance, with Unigram tokenizers outperforming others, and hybrid tokenizers improving performance within the BPE framework. Abstract: Prior work on language modeling showed conflicting findings about whether morphologically aligned approaches to tokenization improve performance, particularly for languages with complex morphology. To investigate this, we select a typologically diverse set of languages: Telugu (agglutinative), Hindi (primarily fusional with some agglutination), and English (fusional). We conduct a comprehensive evaluation of language models -- starting from tokenizer training and extending through the finetuning and downstream task evaluation. 
To account for the consistent performance differences observed across tokenizer variants, we focus on two key factors: morphological alignment and tokenization quality. To assess morphological alignment of tokenizers in Telugu, we create a dataset containing gold morpheme segmentations of 600 derivational and 7000 inflectional word forms. Our experiments reveal that better morphological alignment correlates positively -- though moderately -- with performance in syntax-based tasks such as Parts-of-Speech tagging, Named Entity Recognition and Dependency Parsing. However, we also find that the tokenizer algorithm (Byte-pair Encoding vs. Unigram) plays a more significant role in influencing downstream performance than morphological alignment alone. Naive Unigram tokenizers outperform others across most settings, though hybrid tokenizers that incorporate morphological segmentation significantly improve performance within the BPE framework. In contrast, intrinsic metrics like Corpus Token Count (CTC) and Rényi entropy showed no correlation with downstream performance.[17] Enhancing Small LLM Alignment through Margin-Based Objective Modifications under Resource Constraints
Daren Yao,Jinsong Yuan,Ruike Chen
Main category: cs.CL
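TL;DR: This paper proposes improved alignment methods for small language models that significantly boost performance under resource constraints.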
Details
Motivation: Small large language models (LLMs) struggle to align with human preferences when performance gaps are large; this paper aims to address that problem. Method: Proposes Adaptive Margin-Sigmoid Loss and APO-hinge-zero, which combine margin-based objectives with selective update mechanisms. Result: On AlpacaEval, APO-hinge-zero improves the win rate by +2.0 points and the length-controlled win rate by +1.4 points; on MT-Bench, the methods perform particularly well on STEM and Humanities tasks. Conclusion: The paper proposes two lightweight DPO-based variants that improve the alignment of small language models under resource constraints. Abstract: Small large language models (LLMs) often face difficulties in aligning output to human preferences, particularly when operating under severe performance gaps. In this work, we propose two lightweight DPO-based variants -- Adaptive Margin-Sigmoid Loss and APO-hinge-zero -- to better address underperformance scenarios by introducing margin-based objectives and selective update mechanisms. Our APO-hinge-zero method, which combines hinge-induced hard-example mining with the chosen-focused optimization of APO-zero, achieves strong results. In AlpacaEval, APO-hinge-zero improves the win rate by +2.0 points and the length-controlled win rate by +1.4 points compared to the APO-zero baseline. In MT-Bench, our methods maintain competitive performance in diverse categories, particularly excelling in STEM and Humanities tasks. These results demonstrate that simple modifications to preference-based objectives can significantly enhance small LLM alignment under resource constraints, offering a practical path toward more efficient deployment.[18] Momentum Point-Perplexity Mechanics in Large Language Models
Lorenzo Tomaz,Judd Rosenblatt,Thomas Berry Jones,Diogo Schwerz de Lucena
Main category: cs.CL
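TL;DR: Using a physics-based approach to study how hidden states change during large language model inference, the paper finds an energy-conservation-like regularity and proposes a control method, Jacobian steering, to improve output quality.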
Details
Motivation: To better understand and control the behavior of large language models during inference, improving their interpretability and controllability. Method: Applies a physics-based analysis of how hidden states change from token to token, and proposes a new control method, Jacobian steering. Result: Models exhibit an energy-conservation-like phenomenon; training shifts models into a faster, more decisive regime; and Jacobian steering improves the semantic quality of outputs. Conclusion: Analyzing large language models through a physics lens provides a new theoretical basis and methodological support for interpretability, anomaly detection, and low-risk control. Abstract: We take a physics-based approach to studying how the internal hidden states of large language models change from token to token during inference. Across 20 open-source transformer models (135M-3B parameters), we find that a quantity combining the rate of change in hidden states and the model's next-token certainty, analogous to energy in physics, remains nearly constant. Random-weight models conserve this "energy" more tightly than pre-trained ones, while training shifts models into a faster, more decisive regime with greater variability. Using this "log-Lagrangian" view, we derive a control method called Jacobian steering, which perturbs hidden states in the minimal way needed to favor a target token. This approach maintained near-constant energy in two tested models and produced continuations rated higher in semantic quality than the models' natural outputs. Viewing transformers through this mechanics lens offers a principled basis for interpretability, anomaly detection, and low-risk steering. This could help make powerful models more predictable and aligned with human intent.[19] Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression
Jadie Adams,Brian Hu,Emily Veenhuis,David Joy,Bharadwaj Ravichandran,Aaron Bray,Anthony Hoogs,Arslan Basharat
Main category: cs.CL
TL;DR: This paper proposes a steerable pluralistic alignment method for large language models that adapts to diverse user preferences using few-shot comparative regression, outperforming existing methods and advancing ethical AI.
Details
Motivation: Traditional alignment methods like RLHF use scalar rewards that only capture average user preferences, while pluralistic alignment aims to reflect diverse preferences across multiple attributes. Method: A steerable pluralistic model based on few-shot comparative regression that leverages in-context learning and reasoning using fine-grained attributes. Result: The approach outperforms multiple baseline and state-of-the-art methods and is interpretable, attribute-compatible, and model-agnostic. Conclusion: The proposed few-shot comparative regression approach advances pluralistic alignment of LLMs, enabling fairer and more representative AI by adapting to diverse user preferences across attributes. Abstract: Large language models (LLMs) are currently aligned using techniques such as reinforcement learning from human feedback (RLHF). However, these methods use scalar rewards that can only reflect user preferences on average. Pluralistic alignment instead seeks to capture diverse user preferences across a set of attributes, moving beyond just helpfulness and harmlessness. Toward this end, we propose a steerable pluralistic model based on few-shot comparative regression that can adapt to individual user preferences. Our approach leverages in-context learning and reasoning, grounded in a set of fine-grained attributes, to compare response options and make aligned choices. To evaluate our algorithm, we also propose two new steerable pluralistic benchmarks by adapting the Moral Integrity Corpus (MIC) and the HelpSteer2 datasets, demonstrating the applicability of our approach to value-aligned decision-making and reward modeling, respectively. Our few-shot comparative regression approach is interpretable and compatible with different attributes and LLMs, while outperforming multiple baseline and state-of-the-art methods. 
Our work provides new insights and research directions in pluralistic alignment, enabling a more fair and representative use of LLMs and advancing the state-of-the-art in ethical AI.[20] DeCAL Tokenwise Compression
Sameer Panwar
Main category: cs.CL
TL;DR: DeCAL is a tokenwise compression method that modifies the encoder to achieve high-quality compression and performs well across multiple tasks.
Details
Motivation: To enable tokenwise compression in settings where pre-computed dense representations can be used, saving compute and storage. Method: DeCAL uses an encoder-decoder language model pretrained with denoising and modifies the encoder to optimize compression quality. Result: At 2x compression DeCAL can match the uncompressed model, with only minor degradation up to 8x compression, across question-answering, summarization, and multi-vector retrieval tasks. Conclusion: DeCAL offers a new tokenwise compression method that achieves substantial compression on many downstream tasks while maintaining strong performance. Abstract: This paper introduces DeCAL, a new method for tokenwise compression. DeCAL uses an encoder-decoder language model pretrained with denoising to learn to produce high-quality, general-purpose compressed representations by the encoder. DeCAL applies small modifications to the encoder, with the emphasis on maximizing compression quality, even at the expense of compute. We show that DeCAL at 2x compression can match uncompressed on many downstream tasks, with usually only minor dropoff in metrics up to 8x compression, among question-answering, summarization, and multi-vector retrieval tasks. DeCAL offers significant savings where pre-computed dense representations can be utilized, and we believe the approach can be further developed to be more broadly applicable.[21] DepressLLM: Interpretable domain-adapted language model for depression detection from real-world narratives
Sehwan Moon,Aram Lee,Jeong Eun Kim,Hee-Ju Kang,Il-Seon Shin,Sung-Wan Kim,Jae-Min Kim,Min Jhon,Ju-Wan Kim
Main category: cs.CL
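TL;DR: DepressLLM uses autobiographical narratives and its SToPS module to deliver interpretable depression predictions, demonstrating the potential of AI in psychiatry.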
Details
Motivation: Depression prediction is hindered by the lack of large-scale, high-quality, and rigorously annotated datasets. Method: The study introduces DepressLLM, trained and evaluated on a new corpus of 3,699 autobiographical narratives, whose Score-guided Token Probability Summation (SToPS) module provides improved classification performance and reliable confidence estimates. Result: DepressLLM achieves an AUC of 0.789, rising to 0.904 on samples with confidence ≥ 0.95. It is also robust on heterogeneous datasets, and a psychiatric review identified key model and data limitations. Conclusion: The findings show that interpretable AI can enable earlier diagnosis of depression and underscore the promise of medical AI in psychiatry. Abstract: Advances in large language models (LLMs) have enabled a wide range of applications. However, depression prediction is hindered by the lack of large-scale, high-quality, and rigorously annotated datasets. This study introduces DepressLLM, trained and evaluated on a novel corpus of 3,699 autobiographical narratives reflecting both happiness and distress. DepressLLM provides interpretable depression predictions and, via its Score-guided Token Probability Summation (SToPS) module, delivers both improved classification performance and reliable confidence estimates, achieving an AUC of 0.789, which rises to 0.904 on samples with confidence ≥ 0.95. To validate its robustness to heterogeneous data, we evaluated DepressLLM on in-house datasets, including an Ecological Momentary Assessment (EMA) corpus of daily stress and mood recordings, and on public clinical interview data. Finally, a psychiatric review of high-confidence misclassifications highlighted key model and data limitations that suggest directions for future refinements. These findings demonstrate that interpretable AI can enable earlier diagnosis of depression and underscore the promise of medical AI in psychiatry.[22] Optimizing Retrieval-Augmented Generation (RAG) for Colloquial Cantonese: A LoRA-Based Systematic Review
David Santandreu Calonge,Linda Smail
Main category: cs.CL
TL;DR: This review explores the use of Low-Rank Adaptation (LoRA) in Retrieval-Augmented Generation (RAG) systems to improve performance in low-resource dialects like Cantonese, showing that it can reduce parameters without sacrificing quality, but challenges remain in capturing linguistic nuances and scaling effectively.
Details
Motivation: Retrieval-Augmented Generation (RAG) systems like Qwen3, DeepSeek, and Kimi face challenges in generating authentic Cantonese colloquial expressions due to limited annotated data and linguistic variability. The motivation of this review is to explore how Parameter-Efficient Fine-Tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), can optimize these systems for better performance in low-resource and dialectal contexts. Method: The study systematically analyzed recent advances in PEFT methods, particularly LoRA, and their integration into RAG frameworks. It evaluated variants of LoRA, synthetic data generation, user feedback integration, and adaptive parameter allocation. Benchmarking was conducted to assess retrieval and generation accuracy, computational efficiency, retrieval precision, linguistic authenticity, and scalability. Result: Findings indicate that dynamic and ensemble LoRA adaptations significantly reduce trainable parameters without compromising retrieval accuracy or generation quality in dialectal contexts. Selective parameter freezing and nonlinear adaptation methods offer better efficiency-accuracy trade-offs. However, limitations persist in preserving fine-grained linguistic nuances, especially in low-resource settings, and the integration of real-time user feedback and domain-specific data remains underdeveloped. Conclusion: Parameter-Efficient Fine-Tuning (PEFT), especially Low-Rank Adaptation (LoRA), shows promise in optimizing Retrieval-Augmented Generation (RAG) systems for domain-specific language tasks. While these methods effectively reduce trainable parameters and maintain performance in tasks involving dialects like Cantonese, challenges remain in preserving linguistic nuances, integrating real-time user feedback, and ensuring robustness at scale. Future research should focus on improving dialectal authenticity, dynamic adaptation, and scalable fine-tuning pipelines. 
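The claim that LoRA reduces trainable parameters can be made concrete with a small sketch. The review itself gives no formulas or layer sizes; the code below uses the standard LoRA formulation, in which a frozen weight W is augmented by a scaled low-rank product, W_eff = W + (alpha / r) * B A. The function names and the 4096 x 4096 layer size are illustrative assumptions, not values taken from the reviewed studies.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for one weight matrix: full fine-tuning vs. LoRA.

    LoRA trains B (d_out x rank) and A (rank x d_in) and keeps W frozen.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


def matmul(a, b):
    """Plain-Python matrix product for small nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A); W itself is never modified."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 projection with rank 8 (illustrative sizes only).
    full, lora = lora_param_counts(4096, 4096, 8)
    print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4%}")

    # Tiny numeric check: with B initialised to zeros, the adapter is a no-op.
    w = [[1.0, 2.0], [3.0, 4.0]]
    a = [[0.5, -0.5]]
    print(lora_effective_weight(w, [[0.0], [0.0]], a, alpha=2, rank=1))
```

Under these assumed sizes the adapter trains well under 1% of the parameters of the full matrix, which is the kind of saving the review refers to. The dynamic and ensemble LoRA variants it surveys change how the rank or the adapters are allocated and are not modelled here.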
Abstract: This review examines recent advances in Parameter-Efficient Fine-Tuning (PEFT), with a focus on Low-Rank Adaptation (LoRA), to optimize Retrieval-Augmented Generation (RAG) systems like Qwen3, DeepSeek, and Kimi. These systems face challenges in understanding and generating authentic Cantonese colloquial expressions due to limited annotated data and linguistic variability. The review evaluates the integration of LoRA within RAG frameworks, benchmarks PEFT methods for retrieval and generation accuracy, identifies domain adaptation strategies under limited data, and compares fine-tuning techniques aimed at improving semantic fidelity under data-scarce conditions. A systematic analysis of recent studies employing diverse LoRA variants, synthetic data generation, user feedback integration, and adaptive parameter allocation was conducted to assess their impact on computational efficiency, retrieval precision, linguistic authenticity, and scalability. Findings reveal that dynamic and ensemble LoRA adaptations significantly reduce trainable parameters without sacrificing retrieval accuracy and generation quality in dialectal contexts. However, limitations remain in fully preserving fine-grained linguistic nuances, especially for low-resource settings like Cantonese. The integration of real-time user feedback and domain-specific data remains underdeveloped, limiting model adaptability and personalization. While selective parameter freezing and nonlinear adaptation methods offer better trade-offs between efficiency and accuracy, their robustness at scale remains an open challenge. This review highlights the promise of PEFT-enhanced RAG systems for domain-specific language tasks and calls for future work targeting dialectal authenticity, dynamic adaptation, and scalable fine-tuning pipelines.[23] InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling
Peiji Li,Jiasheng Ye,Yongkang Chen,Yichuan Ma,Zijie Yu,Kedi Chen,Ganqu Cui,Haozhan Li,Jiacheng Chen,Chengqi Lyu,Wenwei Zhang,Linyang Li,Qipeng Guo,Dahua Lin,Bowen Zhou,Kai Chen
Main category: cs.CL
TL;DR: The paper presents InternBootcamp, an open-source framework for enhancing large language model reasoning through diverse task environments and automated evaluation, resulting in significant performance improvements.
Details
Motivation: The motivation is to address the gap in real-world reasoning scenarios that narrow-domain benchmarks cannot capture, thus enhancing the complex reasoning capabilities of large language models. Method: The paper introduces InternBootcamp, an open-source framework with 1000+ diverse task environments, which includes automated generation of training/testing cases and integrated verification modules for evaluation. Result: The result is the development of InternBootcamp, which accelerates framework development through automated workflows and manual validation, leading to improved performance in reasoning tasks and state-of-the-art results on Bootcamp-EVAL. Conclusion: The paper concludes that InternBootcamp is a valuable framework for improving the reasoning capabilities of large language models through diverse task environments and automated training and evaluation processes. Abstract: Large language models (LLMs) have revolutionized artificial intelligence by enabling complex reasoning capabilities. While recent advancements in reinforcement learning (RL) have primarily focused on domain-specific reasoning tasks (e.g., mathematics or code generation), real-world reasoning scenarios often require models to handle diverse and complex environments that narrow-domain benchmarks cannot fully capture. To address this gap, we present InternBootcamp, an open-source framework comprising 1000+ domain-diverse task environments specifically designed for LLM reasoning research. Our codebase offers two key functionalities: (1) automated generation of unlimited training/testing cases with configurable difficulty levels, and (2) integrated verification modules for objective response evaluation. These features make InternBootcamp fundamental infrastructure for RL-based model optimization, synthetic data generation, and model evaluation. 
Although manually developing such a framework with enormous task coverage is extremely cumbersome, we accelerate the development procedure through an automated agent workflow supplemented by manual validation protocols, which enables the task scope to expand rapidly. With these bootcamps, we further establish Bootcamp-EVAL, an automatically generated benchmark for comprehensive performance assessment. Evaluation reveals that frontier models still underperform in many reasoning tasks, while training with InternBootcamp provides an effective way to significantly improve performance, leading to our 32B model that achieves state-of-the-art results on Bootcamp-EVAL and excels on other established benchmarks. In particular, we validate that consistent performance gains come from including more training tasks, namely task scaling, over two orders of magnitude, offering a promising route towards a capable reasoning generalist.[24] Quick on the Uptake: Eliciting Implicit Intents from Human Demonstrations for Personalized Mobile-Use Agents
Zheng Wu,Heyuan Huang,Yanjia Yang,Yuanyi Song,Xingyu Lou,Weiwen Liu,Weinan Zhang,Jun Wang,Zhuosheng Zhang
Main category: cs.CL
TL;DR: This paper introduces IFRAgent, a framework that improves mobile-use agents by aligning both explicit and implicit human intentions, achieving better performance in task automation.
Details
Motivation: Previous mobile-use agents focused only on explicit human intentions, neglecting implicit preferences, which limits personalization. This work aims to address this gap by aligning both intention types. Method: The study introduces IFRAgent, which uses intention flow recognition from human demonstrations to build a query-level vector library of SOPs and a user-level habit repository, employing retrieval-augmented generation and a query rewriter for personalization. Result: IFRAgent achieves an average 6.79% absolute improvement (32.06% relative) in intention alignment rate and a 5.30% average improvement (26.34% relative) in step completion rates over baselines. Conclusion: IFRAgent enhances the alignment between mobile-use agents and human intent by analyzing both explicit and implicit intention flows, outperforming baselines significantly. Abstract: As multimodal large language models advance rapidly, the automation of mobile tasks has become increasingly feasible through the use of mobile-use agents that mimic human interactions with graphical user interfaces. To further enhance mobile-use agents, previous studies employ demonstration learning to improve them from human demonstrations. However, these methods focus solely on the explicit intention flows of humans (e.g., step sequences) while neglecting implicit intention flows (e.g., personal preferences), which makes it difficult to construct personalized mobile-use agents. In this work, to evaluate the Intention Alignment Rate between mobile-use agents and humans, we first collect MobileIAR, a dataset containing human-intent-aligned actions and ground-truth actions. This enables a comprehensive assessment of the agents' understanding of human intent. Then we propose IFRAgent, a framework built upon Intention Flow Recognition from human demonstrations. 
IFRAgent analyzes explicit intention flows from human demonstrations to construct a query-level vector library of standard operating procedures (SOP), and analyzes implicit intention flows to build a user-level habit repository. IFRAgent then leverages an SOP extractor combined with retrieval-augmented generation and a query rewriter to generate a personalized query and SOP from a raw ambiguous query, enhancing the alignment between mobile-use agents and human intent. Experimental results demonstrate that IFRAgent outperforms baselines by an average of 6.79% (32.06% relative improvement) in human intention alignment rate and improves step completion rates by an average of 5.30% (26.34% relative improvement). The code is available at https://github.com/MadeAgents/Quick-on-the-Uptake.[25] LLaMA-Based Models for Aspect-Based Sentiment Analysis
Jakub Šmíd,Pavel Přibáň,Pavel Král
Main category: cs.CL
TL;DR: This paper shows that fine-tuning LLMs, especially Orca 2, improves performance on ABSA tasks, but highlights limitations in zero-shot and few-shot scenarios.
Details
Motivation: Although LLMs show potential for various tasks, their performance in compound ABSA tasks is not well understood, especially when compared to fine-tuned models. Method: The study evaluates open-source LLMs, particularly LLaMA-based models, fine-tuned for compound aspect-based sentiment analysis (ABSA) across four tasks and eight English datasets. Error analysis is also conducted. Result: Fine-tuned Orca 2 outperforms state-of-the-art results in all evaluated ABSA tasks. However, all models perform poorly in zero-shot and few-shot settings compared to fully fine-tuned approaches. Conclusion: LLMs fine-tuned for ABSA, particularly the Orca 2 model, demonstrate superior performance in ABSA tasks, but face challenges in zero-shot and few-shot scenarios. Abstract: While large language models (LLMs) show promise for various tasks, their performance in compound aspect-based sentiment analysis (ABSA) tasks lags behind fine-tuned models. However, the potential of LLMs fine-tuned for ABSA remains unexplored. This paper examines the capabilities of open-source LLMs fine-tuned for ABSA, focusing on LLaMA-based models. We evaluate the performance across four tasks and eight English datasets, finding that the fine-tuned Orca 2 model surpasses state-of-the-art results in all tasks. However, all models struggle in zero-shot and few-shot scenarios compared to fully fine-tuned ones. Additionally, we conduct error analysis to identify challenges faced by fine-tuned models.[26] UWB at WASSA-2024 Shared Task 2: Cross-lingual Emotion Detection
Jakub Šmíd,Pavel Přibáň,Pavel Král
Main category: cs.CL
TL;DR: The paper proposes a system for cross-lingual emotion detection that achieves top rankings in the WASSA-2024 shared task by using advanced language models and techniques like LoRA and machine translation.
Details
Motivation: To effectively detect emotions and identify triggering words across multiple languages as part of a shared task. Method: The approach involves fine-tuning quantized large language models and multilingual Transformer-based models with enhancements from machine translation and trigger word switching. Result: The system ranked 1st in numerical trigger words detection, 3rd in binary trigger words detection, and 7th in emotion detection. Conclusion: The system performs exceptionally well in the WASSA-2024 Cross-lingual Emotion Detection Shared Task, securing top rankings in different subtasks. Abstract: This paper presents our system built for the WASSA-2024 Cross-lingual Emotion Detection Shared Task. The task consists of two subtasks: first, to assess an emotion label from six possible classes for a given tweet in one of five languages, and second, to predict words triggering the detected emotions in binary and numerical formats. Our proposed approach revolves around fine-tuning quantized large language models, specifically Orca 2, with low-rank adapters (LoRA) and multilingual Transformer-based models, such as XLM-R and mT5. We enhance performance through machine translation for both subtasks and trigger word switching for the second subtask. The system achieves excellent performance, ranking 1st in numerical trigger words detection, 3rd in binary trigger words detection, and 7th in emotion detection.[27] Prompt-Based Approach for Czech Sentiment Analysis
Jakub Šmíd,Pavel Přibáň
Main category: cs.CL
TL;DR: This paper proposes prompt-based methods for aspect-based sentiment analysis in Czech and shows that these methods outperform traditional approaches, especially when training data is limited.
Details
Motivation: The motivation is to explore more effective methods for aspect-based sentiment analysis in Czech and to investigate the potential of prompt-based methods over traditional fine-tuning. Method: The authors employ sequence-to-sequence models and prompt-based methods to solve aspect-based sentiment tasks and conduct experiments in zero-shot and few-shot learning scenarios. Result: Prompt-based methods demonstrate superior performance compared to traditional fine-tuning, and pre-training on target domain data significantly improves zero-shot learning outcomes. Conclusion: The paper concludes that prompt-based methods outperform traditional fine-tuning in aspect-based sentiment analysis and sentiment classification in Czech, especially in zero-shot and few-shot scenarios. Abstract: This paper introduces the first prompt-based methods for aspect-based sentiment analysis and sentiment classification in Czech. We employ the sequence-to-sequence models to solve the aspect-based tasks simultaneously and demonstrate the superiority of our prompt-based approach over traditional fine-tuning. In addition, we conduct zero-shot and few-shot learning experiments for sentiment classification and show that prompting yields significantly better results with limited training examples compared to traditional fine-tuning. We also demonstrate that pre-training on data from the target domain can lead to significant improvements in a zero-shot scenario.[28] LLM driven Text-to-Table Generation through Sub-Tasks Guidance and Iterative Refinement
Rajmohan C,Sarthak Harne,Arvind Agarwal
Main category: cs.CL
TL;DR: This paper proposes an efficient system that improves LLM text-to-table generation using task decomposition and iterative self-feedback.
Details
Motivation: Large language models struggle with ambiguous or domain-specific data, maintaining table structure, managing long inputs, and numerical reasoning, so a new approach is needed to improve the conversion of unstructured text into structured data. Method: The system uses two key techniques: decomposing the text-to-table task into manageable sub-tasks and refining the generated tables through iterative self-feedback. Result: Experiments show that the method achieves strong results compared with existing baselines on two publicly available complex text-to-table generation datasets. Conclusion: The paper presents an efficient system for LLM-driven text-to-table generation that improves table quality through task decomposition and iterative self-feedback. Abstract: Transforming unstructured text into structured data is a complex task, requiring semantic understanding, reasoning, and structural comprehension. While Large Language Models (LLMs) offer potential, they often struggle with handling ambiguous or domain-specific data, maintaining table structure, managing long inputs, and addressing numerical reasoning. This paper proposes an efficient system for LLM-driven text-to-table generation that leverages novel prompting techniques. Specifically, the system incorporates two key strategies: breaking down the text-to-table task into manageable, guided sub-tasks and refining the generated tables through iterative self-feedback. We show that this custom task decomposition allows the model to address the problem in a stepwise manner and improves the quality of the generated table. Furthermore, we discuss the benefits and potential risks associated with iterative self-feedback on the generated tables while highlighting the trade-offs between enhanced performance and computational cost. Our methods achieve strong results compared to baselines on two complex text-to-table generation datasets available in the public domain.[29] TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation
Armel Zebaze,Benoît Sagot,Rachel Bawden
Main category: cs.CL
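TL;DR: The paper proposes TopXGen, which uses LLMs to generate high-quality data in low-resource languages and backtranslates it to improve low-resource translation performance.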
Details
Motivation: Translation quality for low-resource languages (LRLs) lags behind that for high-resource languages, and existing methods are limited by the size, quality, and diversity of parallel datasets, so a new way to generate diverse, high-quality LRL data is needed. Method: The paper proposes TopXGen, an LLM-based approach that generates high-quality, topic-diverse LRL data, which is then backtranslated to produce parallel texts for in-context learning and fine-tuning. Result: By generating diverse, high-quality LRL target-side texts and backtranslating them, TopXGen improves translation performance in both in-context learning and fine-tuning. Conclusion: TopXGen effectively improves low-resource translation quality, with notable gains in both fine-tuning and in-context learning. Abstract: LLMs have been shown to perform well in machine translation (MT) with the use of in-context learning (ICL), rivaling supervised models when translating into high-resource languages (HRLs). However, they lag behind when translating into low-resource languages (LRLs). Example selection via similarity search and supervised fine-tuning help. However, the improvements they give are limited by the size, quality and diversity of existing parallel datasets. A common technique in low-resource MT is synthetic parallel data creation, the most frequent of which is backtranslation, whereby existing target-side texts are automatically translated into the source language. However, this assumes the existence of good quality and relevant target-side texts, which are not readily available for many LRLs. In this paper, we present TopXGen, an LLM-based approach for the generation of high quality and topic-diverse data in multiple LRLs, which can then be backtranslated to produce useful and diverse parallel texts for ICL and fine-tuning. Our intuition is that while LLMs struggle to translate into LRLs, their ability to translate well into HRLs and their multilinguality enable them to generate good quality, natural-sounding target-side texts, which can be translated well into a high-resource source language. We show that TopXGen boosts LLM translation performance during fine-tuning and in-context learning. Code and outputs are available at https://github.com/ArmelRandy/topxgen.[30] Out of the Box, into the Clinic? Evaluating State-of-the-Art ASR for Clinical Applications for Older Adults
Bram van Dijk,Tiberon Kuiper,Sirin Aoulad si Ahmed,Armel Levebvre,Jake Johnson,Jan Duin,Simon Mooijaart,Marco Spruit
Main category: cs.CL
TL;DR: This study shows that generic multilingual ASR models work better than fine-tuned ones for recognizing speech in older Dutch adults, suggesting they can generalize well without specialized training. Architectural modifications like truncation help balance speed and accuracy, though some issues like hallucinations remain.
Details
Motivation: Reliable Automatic Speech Recognition (ASR) for underrepresented groups, such as older adults, is a bottleneck in developing effective voice-controlled clinical interfaces like chatbots. Method: The study evaluated state-of-the-art ASR models using speech data from older Dutch adults interacting with the Welzijn.AI chatbot. It compared generic multilingual models with models fine-tuned for Dutch spoken by older adults, while also analyzing processing speed. Result: Generic multilingual ASR models performed better than fine-tuned models, indicating strong generalization capabilities. Truncating architectures improved the accuracy-speed trade-off, although some cases showed high Word Error Rates (WER) due to hallucinations. Conclusion: Generic multilingual ASR models outperform fine-tuned models for older Dutch adults' speech recognition, indicating their ability to generalize well without specific training. Modifications like truncating architectures can help balance accuracy and processing speed. Abstract: Voice-controlled interfaces can support older adults in clinical contexts, with chatbots being a prime example, but reliable Automatic Speech Recognition (ASR) for underrepresented groups remains a bottleneck. This study evaluates state-of-the-art ASR models on language use of older Dutch adults, who interacted with the Welzijn.AI chatbot designed for geriatric contexts. We benchmark generic multilingual ASR models, and models fine-tuned for Dutch spoken by older adults, while also considering processing speed. Our results show that generic multilingual models outperform fine-tuned models, which suggests recent ASR models can generalise well out of the box to realistic datasets. 
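For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition only; it is not the evaluation code used in the study, and the function name `word_error_rate` and the Dutch example sentences are hypothetical. It also shows why hallucinated output can push WER above 1.0: inserted words count as errors while the denominator stays fixed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical examples: one dropped word, then a hallucinated continuation.
dropped = word_error_rate("ik voel me goed", "ik voel goed")
hallucinated = word_error_rate("ja", "ja dank u wel voor het kijken")
```

In the second example the single reference word is recognised correctly, yet six inserted words yield a WER of 6.0, which is the kind of outlier a hallucinating model can produce.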
Furthermore, our results suggest that truncating existing architectures is helpful in balancing the accuracy-speed trade-off, though we also identify some cases with high WER due to hallucinations.
[31] A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models
Lingzhe Zhang,Liancheng Fang,Chiming Duan,Minghua He,Leyi Pan,Pei Xiao,Shiyu Huang,Yunpeng Zhai,Xuming Hu,Philip S. Yu,Aiwei Liu
Main category: cs.CL
TL;DR: This paper surveys parallel text generation techniques, categorizing them into AR-based and Non-AR-based methods, and analyzes their trade-offs in speed, quality, and efficiency to guide future research in improving LLM inference performance.
Details
Motivation: The motivation stems from the inherent limitation of autoregressive text generation in Large Language Models (LLMs), which generates tokens sequentially and limits generation speed. The paper aims to address the lack of comprehensive analysis on parallel text generation techniques and how they improve inference performance. Method: The paper provides a systematic survey of parallel text generation methods, categorizing them into autoregressive (AR)-based and Non-AR-based paradigms. It analyzes the theoretical trade-offs of each approach in terms of speed, quality, and efficiency, and compares them with alternative acceleration strategies. Result: The paper presents a taxonomy of parallel text generation methods and provides a detailed examination of the core techniques in each category. It evaluates their theoretical trade-offs, identifies recent advancements, and outlines open challenges and future research directions. Conclusion: The paper concludes that parallel text generation techniques can significantly improve the inference efficiency of text generation tasks by overcoming the sequential limitations of autoregressive models. It highlights the potential for combining different methods and outlines future research directions to address existing challenges. Abstract: As text generation has become a core capability of modern Large Language Models (LLMs), it underpins a wide range of downstream applications. However, most existing LLMs rely on autoregressive (AR) generation, producing one token at a time based on previously generated context, resulting in limited generation speed due to the inherently sequential nature of the process. To address this challenge, an increasing number of researchers have begun exploring parallel text generation, a broad class of techniques aimed at breaking the token-by-token generation bottleneck and improving inference efficiency. 
Despite growing interest, there remains a lack of comprehensive analysis on what specific techniques constitute parallel text generation and how they improve inference performance. To bridge this gap, we present a systematic survey of parallel text generation methods. We categorize existing approaches into AR-based and Non-AR-based paradigms, and provide a detailed examination of the core techniques within each category. Following this taxonomy, we assess their theoretical trade-offs in terms of speed, quality, and efficiency, and examine their potential for combination and comparison with alternative acceleration strategies. Finally, based on our findings, we highlight recent advancements, identify open challenges, and outline promising directions for future research in parallel text generation.
[32] IROTE: Human-like Traits Elicitation of Large Language Model via In-Context Self-Reflective Optimization
Yuzhuo Bai,Shitong Duan,Muhua Huang,Jing Yao,Zhenghao Liu,Peng Zhang,Tun Lu,Xiaoyuan Yi,Maosong Sun,Xing Xie
Main category: cs.CL
TL;DR: IROTE is a novel method for enhancing LLMs' ability to embody specific traits stably and transferably across various tasks, addressing the limitations of existing approaches through optimized textual self-reflection.
Details
Motivation: Existing methods for eliciting traits in LLMs suffer from superficiality, leading to unstable and shallow mimicry of traits, which limits their applicability in personalized LLMs and social simulations. Method: IROTE uses an in-context approach that generates and optimizes textual self-reflection based on psychological theories, maximizing an information-theoretic objective to enhance trait-driven behavior without fine-tuning. Result: IROTE successfully generates compact and evocative self-reflections that enable LLMs to consistently exhibit target traits across a range of tasks, demonstrating superior performance over strong baselines. Conclusion: IROTE is effective in enabling LLMs to stably and consistently embody target traits across diverse tasks, outperforming existing methods. Abstract: Trained on various human-authored corpora, Large Language Models (LLMs) have demonstrated a certain capability of reflecting specific human-like traits (e.g., personality or values) by prompting, benefiting applications like personalized LLMs and social simulations. However, existing methods suffer from the superficial elicitation problem: LLMs can only be steered to mimic shallow and unstable stylistic patterns, failing to embody the desired traits precisely and consistently across diverse tasks like humans. To address this challenge, we propose IROTE, a novel in-context method for stable and transferable trait elicitation. Drawing on psychological theories suggesting that traits are formed through identity-related reflection, our method automatically generates and optimizes a textual self-reflection within prompts, which comprises self-perceived experience, to stimulate LLMs' trait-driven behavior. 
The optimization is performed by iteratively maximizing an information-theoretic objective that enhances the connections between LLMs' behavior and the target trait, while reducing noisy redundancy in reflection without any fine-tuning, leading to evocative and compact trait reflection. Extensive experiments across three human trait systems manifest that one single IROTE-generated self-reflection can induce LLMs' stable impersonation of the target trait across diverse downstream tasks beyond simple questionnaire answering, consistently outperforming existing strong baselines.
[33] Magical: Medical Lay Language Generation via Semantic Invariance and Layperson-tailored Adaptation
Weibin Liao,Tianlong Wang,Yinghao Zhu,Yasha Wang,Junyi Gao,Liantao Ma
Main category: cs.CL
TL;DR: The paper proposes Magical, an asymmetric LoRA architecture tailored for Medical Lay Language Generation (MLLG), which improves semantic fidelity and enables diverse lay-style generation by employing a shared matrix $A$ and multiple isolated matrices $B$, outperforming existing methods while reducing trainable parameters.
Details
Motivation: Standard LoRA struggles with semantic fidelity and diverse lay-style generation when applied to multi-source heterogeneous MLLG datasets. This limitation necessitates the development of a more effective architecture tailored for such complex data scenarios. Method: Magical employs a shared matrix $A$ for abstractive summarization and multiple isolated matrices $B$ for diverse lay-style generation. It introduces a Semantic Invariance Constraint to preserve semantic fidelity and uses a Recommendation-guided Switch to prompt the LLM to switch between different matrices $B$. Result: Magical consistently outperforms prompt-based methods, vanilla LoRA, and its recent variants on three real-world lay language generation datasets while reducing trainable parameters by 31.66%. Conclusion: The proposed Magical method, an asymmetric LoRA architecture, effectively addresses the limitations of standard LoRA in handling multi-source heterogeneous datasets for Medical Lay Language Generation (MLLG). It ensures semantic fidelity and supports diverse lay-style generation, outperforming other methods while reducing trainable parameters. Abstract: Medical Lay Language Generation (MLLG) plays a vital role in improving the accessibility of complex scientific content for broader audiences. Recent work on MLLG commonly employs parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA) to fine-tune large language models (LLMs) using paired expert-lay language datasets. However, LoRA struggles with the challenges posed by multi-source heterogeneous MLLG datasets. Specifically, through a series of exploratory experiments, we reveal that standard LoRA fails to meet the requirement for semantic fidelity and diverse lay-style generation in the MLLG task. To address these limitations, we propose Magical, an asymmetric LoRA architecture tailored for MLLG under heterogeneous data scenarios. 
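The shared-$A$, multiple-$B$ layout described in the Method summary can be illustrated with a small sketch. It assumes the usual LoRA convention that the weight update is the product $BA$, so the layer output is $Wx + B_k(Ax)$ for the selected head $k$. The class name `AsymmetricLoRALinear`, the plain integer `head` argument (standing in for the Recommendation-guided Switch), and the toy matrices are all hypothetical, and the Semantic Invariance Constraint is not modelled. This is not the authors' implementation.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class AsymmetricLoRALinear:
    """Frozen weight W plus one shared down-projection A and several up-projections B_k."""

    def __init__(self, weight: Matrix, shared_a: Matrix, heads_b: List[Matrix]):
        self.weight = weight      # frozen base weight, shape (d_out, d_in)
        self.shared_a = shared_a  # shared across all styles, shape (r, d_in)
        self.heads_b = heads_b    # one matrix per lay style, each (d_out, r)

    def forward(self, x: Vector, head: int) -> Vector:
        base = matvec(self.weight, x)
        low_rank = matvec(self.shared_a, x)            # A x
        delta = matvec(self.heads_b[head], low_rank)   # B_k (A x)
        return [b + d for b, d in zip(base, delta)]

    def trainable_parameters(self) -> int:
        count = lambda m: sum(len(row) for row in m)
        return count(self.shared_a) + sum(count(b) for b in self.heads_b)


# Toy example: d_in = d_out = 2, rank r = 1, two style heads.
layer = AsymmetricLoRALinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    shared_a=[[1.0, 1.0]],
    heads_b=[[[1.0], [0.0]], [[0.0], [2.0]]],
)
out_head0 = layer.forward([1.0, 2.0], head=0)
out_head1 = layer.forward([1.0, 2.0], head=1)
```

In this toy setting the layer has 6 trainable values, whereas two independent rank-1 LoRA adapters would need 8. Sharing $A$ is one plausible source of the parameter saving the abstract reports, though the 31.66% figure depends on the authors' actual configuration.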
Magical employs a shared matrix $A$ for abstractive summarization, along with multiple isolated matrices $B$ for diverse lay-style generation. To preserve semantic fidelity during the lay language generation process, Magical introduces a Semantic Invariance Constraint to mitigate semantic subspace shifts on matrix $A$. Furthermore, to better adapt to diverse lay-style generation, Magical incorporates the Recommendation-guided Switch, an external interface that prompts the LLM to switch between different matrices $B$. Experimental results on three real-world lay language generation datasets demonstrate that Magical consistently outperforms prompt-based methods, vanilla LoRA, and its recent variants, while also reducing trainable parameters by 31.66%.
[34] SciRerankBench: Benchmarking Rerankers Towards Scientific Retrieval-Augmented Generated LLMs
Haotian Chen,Qingqing Long,Meng Xiao,Xiao Luo,Wei Ju,Chengrui Wang,Xuezhi Wang,Yuanchun Zhou,Hengshu Zhu
Main category: cs.CL
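TL;DR: This paper presents SciRerankBench, the first benchmark for evaluating rerankers within RAG-LLMs, with the aim of improving scientific literature question answering.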
Details
Motivation: Although RAG-LLMs have made notable progress in scientific literature question answering, their potential and limitations remain unexplored. Method: The authors build a Scientific Rerank-oriented RAG Benchmark (SciRerankBench) spanning five scientific subjects and develop three types of question-context-answer (Q-C-A) pairs to evaluate reranker performance. Result: A systematic evaluation of 13 widely used rerankers on five families of LLMs yields detailed insights into their relative strengths and limitations. Conclusion: SciRerankBench is the first benchmark developed specifically to evaluate rerankers within RAG-LLMs, and it provides valuable observations and guidance for their future development. Abstract: Scientific literature question answering is a pivotal step towards new scientific discoveries. Recently, two-stage retrieval-augmented generated large language models (RAG-LLMs) have shown impressive advancements in this domain. Such a two-stage framework, especially the second stage (reranker), is particularly essential in the scientific domain, where subtle differences in terminology may have a greatly negative impact on the final factual-oriented or knowledge-intensive answers. Despite this significant progress, the potential and limitations of these works remain unexplored. In this work, we present a Scientific Rerank-oriented RAG Benchmark (SciRerankBench), for evaluating rerankers within RAG-LLM systems, spanning five scientific subjects. To rigorously assess the reranker performance in terms of noise resilience, relevance disambiguation, and factual consistency, we develop three types of question-context-answer (Q-C-A) pairs, i.e., Noisy Contexts (NC), Semantically Similar but Logically Irrelevant Contexts (SSLI), and Counterfactual Contexts (CC). Through systematic evaluation of 13 widely used rerankers on five families of LLMs, we provide detailed insights into their relative strengths and limitations. To the best of our knowledge, SciRerankBench is the first benchmark specifically developed to evaluate rerankers within RAG-LLMs, which provides valuable observations and guidance for their future development.
[35] DevNous: An LLM-Based Multi-Agent System for Grounding IT Project Management in Unstructured Conversation
Stavros Doropoulos,Stavros Vologiannidis,Ioannis Magnisalis
Main category: cs.CL
TL;DR: DevNous automates the conversion of informal team conversations into structured IT governance artifacts, offering strong performance on a new public benchmark dataset.
Details
Motivation: The manual translation of unstructured team dialogue into structured artifacts for IT project governance is a critical bottleneck, highlighting the need for automation. Method: DevNous uses a Large Language Model-based multi-agent expert system that integrates into team chat environments to identify actionable intents and manage multi-turn workflows. It was evaluated on a benchmark of 160 annotated conversational turns with multi-label ground truth. Result: DevNous achieved an exact match turn accuracy of 81.3% and a multiset F1-Score of 0.845 on the benchmark dataset, demonstrating its effectiveness. Conclusion: DevNous offers a viable solution for automating the translation of unstructured team dialogue into structured IT governance artifacts, supported by strong empirical results on a new benchmark dataset. Abstract: The manual translation of unstructured team dialogue into the structured artifacts required for Information Technology (IT) project governance is a critical bottleneck in modern information systems management. We introduce DevNous, a Large Language Model-based (LLM) multi-agent expert system, to automate this unstructured-to-structured translation process. DevNous integrates directly into team chat environments, identifying actionable intents from informal dialogue and managing stateful, multi-turn workflows for core administrative tasks like automated task formalization and progress summary synthesis. To quantitatively evaluate the system, we introduce a new benchmark of 160 realistic, interactive conversational turns. The dataset was manually annotated with a multi-label ground truth and is publicly available. On this benchmark, DevNous achieves an exact match turn accuracy of 81.3% and a multiset F1-Score of 0.845, providing strong evidence for its viability. 
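The abstract does not define its two metrics, so the sketch below shows one common reading rather than the authors' scoring code. It assumes that each turn carries a multiset of gold labels and a multiset of predicted labels, that a turn counts as an exact match only when the two multisets are identical, and that the multiset F1 is micro-averaged over all turns using the multiset intersection as the overlap. The function names and the label strings (`create_task`, `update_status`) are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def exact_match_accuracy(gold: Sequence[List[str]], pred: Sequence[List[str]]) -> float:
    """Fraction of turns whose predicted label multiset equals the gold one."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    matches = sum(Counter(g) == Counter(p) for g, p in zip(gold, pred))
    return matches / len(gold)


def multiset_f1(gold: Sequence[List[str]], pred: Sequence[List[str]]) -> float:
    """Micro-averaged F1 where overlap is the per-turn multiset intersection."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be the same length")
    overlap = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gold_counts, pred_counts = Counter(g), Counter(p)
        overlap += sum((gold_counts & pred_counts).values())
        n_gold += sum(gold_counts.values())
        n_pred += sum(pred_counts.values())
    if overlap == 0:
        return 0.0
    precision = overlap / n_pred
    recall = overlap / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical three-turn example; the second turn misses one repeated label.
gold_turns = [["create_task"], ["update_status", "update_status"], []]
pred_turns = [["create_task"], ["update_status"], []]
accuracy = exact_match_accuracy(gold_turns, pred_turns)
f1 = multiset_f1(gold_turns, pred_turns)
```

Under these assumptions the example scores 2/3 on exact match and 0.8 on multiset F1, which shows how the stricter turn-level metric can sit below the label-level one, as it does in the reported results.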
The primary contributions of this work are twofold: (1) a validated architectural pattern for developing ambient administrative agents, and (2) the introduction of the first robust empirical baseline and public benchmark dataset for this challenging problem domain.
[36] Privacy-protected Retrieval-Augmented Generation for Knowledge Graph Question Answering
Yunfeng Ning,Mayi Xu,Jintao Wen,Qiankun Pi,Yuanyuan Zhu,Ming Zhong,Jiawei Jiang,Tieyun Qian
Main category: cs.CL
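TL;DR: ARoG is a privacy-protected RAG framework that uses relation-centric and structure-oriented abstraction strategies to retrieve knowledge from knowledge graphs effectively without exposing entity semantics.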
Details
Motivation: Existing RAG systems pose privacy risks when they use private knowledge graphs; ARoG retrieves knowledge effectively while keeping entity semantics private. Method: The paper proposes the ARoG framework, which comprises a relation-centric abstraction strategy that converts anonymous entities into retrievable information and a structure-oriented abstraction strategy that transforms natural language questions into structured abstract concept paths, enabling effective retrieval. Result: Experiments on three datasets show that ARoG performs well in the privacy-protected RAG scenario, with strong retrieval performance and privacy robustness. Conclusion: Through its two abstraction strategies, ARoG improves knowledge-graph-based retrieval performance while protecting privacy, and it is robust with respect to privacy. Abstract: LLMs often suffer from hallucinations and outdated or incomplete knowledge. RAG is proposed to address these issues by integrating external knowledge like that in KGs into LLMs. However, leveraging private KGs in RAG systems poses significant privacy risks due to the black-box nature of LLMs and potential insecure data transmission, especially when using third-party LLM APIs lacking transparency and control. In this paper, we investigate the privacy-protected RAG scenario for the first time, where entities in KGs are anonymous for LLMs, thus preventing them from accessing entity semantics. Due to the loss of semantics of entities, previous RAG systems cannot retrieve question-relevant knowledge from KGs by matching questions with the meaningless identifiers of anonymous entities. To realize an effective RAG system in this scenario, two key challenges must be addressed: (1) How can anonymous entities be converted into retrievable information? (2) How can question-relevant anonymous entities be retrieved? Hence, we propose a novel ARoG framework including relation-centric abstraction and structure-oriented abstraction strategies. For challenge (1), the first strategy abstracts entities into high-level concepts by dynamically capturing the semantics of their adjacent relations. It supplements meaningful semantics which can further support the retrieval process. For challenge (2), the second strategy transforms unstructured natural language questions into structured abstract concept paths. These paths can be more effectively aligned with the abstracted concepts in KGs, thereby improving retrieval performance. 
To guide LLMs to effectively retrieve knowledge from KGs, the two strategies strictly protect privacy from being exposed to LLMs. Experiments on three datasets demonstrate that ARoG achieves strong performance and privacy-robustness.
[37] Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments
Junjie Ye,Changhao Jiang,Zhengyin Du,Yufei Xu,Xuesong Yao,Zhiheng Xi,Xiaoran Fan,Qi Zhang,Xuanjing Huang,Jiecao Chen
Main category: cs.CL
TL;DR: This paper proposes an automated environment construction pipeline and a verifiable reward mechanism for training large language models (LLMs) in tool use, leading to significant performance improvements without affecting their general abilities.
Details
Motivation: The lack of efficient reinforcement learning frameworks tailored for tool use in large language models (LLMs) limits progress, as stable training environments and verifiable reward mechanisms are difficult to design. Method: The study introduces an automated pipeline for constructing training environments, which includes scenario decomposition, document generation, function integration, complexity scaling, and localized deployment. It also introduces a verifiable reward mechanism that evaluates tool use precision and task execution completeness. These are combined with trajectory data and integrated with standard RL algorithms. Result: Experiments showed that the proposed approach significantly improves LLMs' tool-use performance across varying scales, inference modes, and training algorithms, without compromising general capabilities. Conclusion: The proposed automated environment construction pipeline and verifiable reward mechanism significantly enhance LLMs' tool-use performance by improving context understanding and reasoning, driven by updates to the lower-layer MLP parameters. Abstract: Effective tool use is essential for large language models (LLMs) to interact meaningfully with their environment. However, progress is limited by the lack of efficient reinforcement learning (RL) frameworks specifically designed for tool use, due to challenges in constructing stable training environments and designing verifiable reward mechanisms. To address this, we propose an automated environment construction pipeline, incorporating scenario decomposition, document generation, function integration, complexity scaling, and localized deployment. This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools. Additionally, we introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution. 
When combined with trajectory data collected from the constructed environments, this mechanism integrates seamlessly with standard RL algorithms to facilitate feedback-driven model training. Experiments on LLMs of varying scales demonstrate that our approach significantly enhances the models' tool-use performance without degrading their general capabilities, regardless of inference modes or training algorithms. Our analysis suggests that these gains result from improved context understanding and reasoning, driven by updates to the lower-layer MLP parameters in models.
[38] TiMoE: Time-Aware Mixture of Language Experts
Robin Faro,Dongyang Fan,Tamar Alphaidze,Martin Jaggi
Main category: cs.CL
TL;DR: TiMoE improves large language models' temporal accuracy by using time-segmented training and causal routing, preventing reliance on future data while maintaining performance.
Details
Motivation: LLMs trained on fixed web snapshots risk using outdated or temporally leaked information; TiMoE addresses this by ensuring causal validity in predictions. Method: Pre-training GPT-style experts on disjoint two-year slices of a 2013-2024 corpus and combining them using TiMoE, which masks experts trained on future data during inference. Result: TiMoE matches or exceeds the best single-period expert performance and reduces future-knowledge errors by up to 15% across eight NLP tasks and the new TSQA benchmark. Conclusion: TiMoE demonstrates that modular, time-segmented pre-training with causal routing helps maintain chronological grounding in LLMs without significantly sacrificing performance. Abstract: Large language models (LLMs) are typically trained on fixed snapshots of the web, which means that their knowledge becomes stale and their predictions risk temporal leakage: relying on information that lies in the future relative to a query. We tackle this problem by pre-training from scratch a set of GPT-style experts on disjoint two-year slices of a 2013-2024 corpus and combining them through TiMoE, a Time-aware Mixture of Language Experts. At inference time, TiMoE masks all experts whose training window ends after the query timestamp and merges the remaining log-probabilities in a shared space, guaranteeing strict causal validity while retaining the breadth of multi-period knowledge. We also release TSQA, a 10k-question benchmark whose alternatives are explicitly labelled as past, future or irrelevant, allowing fine-grained measurement of temporal hallucinations. Experiments on eight standard NLP tasks plus TSQA show that a co-adapted TiMoE variant matches or exceeds the best single-period expert and cuts future-knowledge errors by up to 15%. Our results demonstrate that modular, time-segmented pre-training paired with causal routing is a simple yet effective path toward LLMs that stay chronologically grounded without sacrificing general performance much. 
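The inference-time rule described above (mask every expert whose training window ends after the query timestamp, then merge what remains) can be sketched as follows. The abstract does not specify the merging rule, so this sketch assumes a uniform mixture in probability space computed with a log-mean-exp; the `Expert` dataclass, the `causal_mixture` function, and the toy two-token vocabulary are hypothetical and are not taken from the TiMoE codebase.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Expert:
    end_year: int                # last year covered by this expert's training slice
    log_probs: Dict[str, float]  # next-token log-probabilities over a shared vocabulary


def causal_mixture(experts: List[Expert], query_year: int) -> Dict[str, float]:
    """Mask experts trained on data after the query year, then average the rest."""
    eligible = [e for e in experts if e.end_year <= query_year]
    if not eligible:
        raise ValueError("no expert was trained entirely before the query year")

    merged = {}
    for token in eligible[0].log_probs:
        values = [e.log_probs[token] for e in eligible]
        peak = max(values)
        # Numerically stable log of the mean probability across eligible experts.
        merged[token] = peak + math.log(
            sum(math.exp(v - peak) for v in values) / len(values)
        )
    return merged


# Toy experts; the 2024 expert must be ignored for a 2017 query.
experts = [
    Expert(2014, {"a": math.log(0.5), "b": math.log(0.5)}),
    Expert(2016, {"a": math.log(0.9), "b": math.log(0.1)}),
    Expert(2024, {"a": math.log(0.1), "b": math.log(0.9)}),
]
mixture_2017 = causal_mixture(experts, query_year=2017)
```

For the 2017 query only the first two experts contribute, so the merged probability of token `a` is the mean of 0.5 and 0.9. The 2024 expert, which strongly prefers `b`, has no influence, which is the causal-validity property the abstract describes.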
We open source our code at TiMoE (GitHub): https://github.com/epfml/TiMoE
[39] An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems
Yuren Hao,Xiang Wan,Chengxiang Zhai
Main category: cs.CL
TL;DR: This paper introduces a new framework and dataset, PutnamGAP, to evaluate the robustness of large language models in mathematical reasoning by testing them on linguistically and parametrically varied problems, revealing significant performance issues in existing models.
Details
Motivation: To move beyond conventional methods and better evaluate the mathematical reasoning capabilities of LLMs by measuring their sensitivity to non-mathematical perturbations. Method: A systematic framework was developed to assess LLMs' mathematical-reasoning robustness by stress-testing them on mathematically equivalent problems with linguistic and parametric variations. This led to the creation of the PutnamGAP benchmark dataset. Result: Evaluations on 18 models showed significant performance degradation on problem variants, with OpenAI's O3 model experiencing a notable drop in scores. Conclusion: The proposed evaluation methodology effectively deepens understanding of LLMs' robustness and provides insights for improving their mathematical reasoning capabilities. Abstract: In this paper, we introduce a systematic framework that goes beyond conventional methods to assess LLMs' mathematical-reasoning robustness by stress-testing them on advanced math problems that are mathematically equivalent but vary linguistically and parametrically. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this new evaluation methodology, we created PutnamGAP, a new benchmark dataset with multiple mathematically-equivalent variations of competition-level math problems. With the new dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models we observe sharp performance degradation on the variants. OpenAI's flagship reasoning model, O3, scores 49% on the originals but drops by 4 percentage points on surface variants, and by 10.5 percentage points on core-step-based variants, while smaller models fare far worse. 
Overall, the results show that the proposed new evaluation methodology is effective for deepening our understanding of the robustness of LLMs and generating new insights for further improving their mathematical reasoning capabilities.
[40] Steering Towards Fairness: Mitigating Political Bias in LLMs
Afrozah Nadeem,Mark Dras,Usman Naseem
Main category: cs.CL
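TL;DR: This paper proposes a new framework for probing and mitigating political ideological bias in decoder-based large language models, using analysis of hidden layer activations to reveal the bias and proposing a mitigation strategy.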
Details
Motivation: Large language models are widely deployed in real-world applications, but they may encode and reproduce ideological biases, particularly along political and economic dimensions, so the mechanisms by which such bias is encoded and can be mitigated need study. Method: Grounded in the Political Compass Test (PCT), the method uses contrastive pairs to extract and compare hidden layer activations and introduces a comprehensive activation extraction pipeline for layer-wise analysis. Result: The results show that decoder-based LLMs systematically encode representational bias across layers, and that this bias can be mitigated effectively with steering vector-based methods. Conclusion: The paper presents a framework for probing and mitigating ideological bias in decoder-based LLMs, offering new insight into political bias and a representation-based approach to debiasing. Abstract: Recent advancements in large language models (LLMs) have enabled their widespread use across diverse real-world applications. However, concerns remain about their tendency to encode and reproduce ideological biases, particularly along political and economic dimensions. In this paper, we propose a framework for probing and mitigating such biases in decoder-based LLMs through analysis of internal model representations. Grounded in the Political Compass Test (PCT), our method uses contrastive pairs to extract and compare hidden layer activations from models like Mistral and DeepSeek. We introduce a comprehensive activation extraction pipeline capable of layer-wise analysis across multiple ideological axes, revealing meaningful disparities linked to political framing. Our results show that decoder LLMs systematically encode representational bias across layers, which can be leveraged for effective steering vector-based mitigation. This work provides new insights into how political bias is encoded in LLMs and offers a principled approach to debiasing beyond surface-level output interventions.
[41] BiasGym: Fantastic Biases and How to Find (and Remove) Them
Sekh Mainul Islam,Nadav Borenstein,Siddhesh Milind Pawar,Haeun Yu,Arnav Arora,Isabelle Augenstein
Main category: cs.CL
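TL;DR: This paper introduces BiasGym, a new framework for analyzing and mitigating bias in large language models. It consists of BiasInject, which injects biases, and BiasScope, which identifies and steers the components responsible for biased behavior.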
Details
Motivation: Understanding the biases and stereotypes in large language models is crucial for developing effective mitigation strategies, but biased behavior is often subtle and hard to isolate, which makes systematic analysis and debiasing very challenging. Method: BiasGym comprises two components: BiasInject, which injects specific biases via token-based fine-tuning, and BiasScope, which uses these injected signals to identify and steer the components responsible for biased behavior. Result: BiasGym enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading downstream task performance, and generalizes to biases unseen during training. Conclusion: BiasGym is an effective way to reduce bias in large language models while preserving downstream task performance, and it is useful for both safety interventions and interpretability research. Abstract: Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. Biased behaviour is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce BiasGym, a simple, cost-effective, and generalizable framework for reliably injecting, analyzing, and mitigating conceptual associations within LLMs. BiasGym consists of two components: BiasInject, which injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which leverages these injected signals to identify and steer the components responsible for biased behavior. Our method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during training. We demonstrate the effectiveness of BiasGym in reducing real-world stereotypes (e.g., people from a country being 'reckless drivers') and in probing fictional associations (e.g., people from a country having 'blue skin'), showing its utility for both safety interventions and interpretability research.
[42] Weakly Supervised Fine-grained Span-Level Framework for Chinese Radiology Report Quality Assurance
Kaiyu Wang,Lin Mu,Zhiyao Yang,Ximing Li,Xiaotang Zhou,Wanfu Gao,Huimao Zhang
Main category: cs.CL
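TL;DR: This paper proposes Sqator, an automatic quality assessment tool for radiology reports based on fine-grained text span analysis; experiments show that its scores are competitive and consistent with expert judgments.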
Details
Motivation: Traditional quality assurance relies on manual scoring by senior doctors, which is labor-intensive and may be inaccurate, so an automated method is needed to improve efficiency and accuracy. Method: The paper proposes the Span-level Quality Assurance EvaluaTOR (Sqator), which scores reports automatically by analyzing the importance of the revised spans between junior and senior reports. Result: In experiments on 12,013 radiology reports, Sqator achieves competitive quality scores that are consistent with the judgments of senior doctors. Conclusion: Sqator can assess the quality of radiology reports automatically and, through fine-grained text span analysis, produces quality scores consistent with the judgments of senior doctors. Abstract: Quality Assurance (QA) for radiology reports refers to judging whether the junior reports (written by junior doctors) are qualified. The QA scores of one junior report are given by the senior doctor(s) after reviewing the image and junior report. This process requires intensive labor costs for senior doctors. Additionally, the QA scores may be inaccurate for reasons like diagnosis bias, the ability of senior doctors, and so on. To address this issue, we propose a Span-level Quality Assurance EvaluaTOR (Sqator) to mark QA scores automatically. Unlike the common document-level semantic comparison method, we try to analyze the semantic difference by exploring more fine-grained text spans. Specifically, Sqator measures QA scores by measuring the importance of revised spans between junior and senior reports, and outputs the final QA scores by merging all revised span scores. We evaluate Sqator using a collection of 12,013 radiology reports. Experimental results show that Sqator can achieve competitive QA scores. Moreover, the importance scores of revised spans are also consistent with the judgments of senior doctors.
[43] Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models
Haeun Yu,Seogyeong Jeong,Siddhesh Pawar,Jisu Shin,Jiho Jin,Junho Myung,Alice Oh,Isabelle Augenstein
Main category: cs.CL
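TL;DR: This paper proposes Culturescope, a mechanistic interpretability-based method for probing the cultural knowledge space inside large language models (LLMs) and analyzing their cultural biases, in particular Western-dominance bias and cultural flattening.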
Details
Motivation: As LLMs are deployed across diverse cultural contexts, it becomes important to understand how their internal mechanisms lead to cultural (mis)representation. Prior work is limited to extrinsic evaluation and does not examine the internal mechanisms of LLMs. Method: The authors propose Culturescope, which uses a patching method to extract the cultural knowledge in LLMs, and introduce a cultural flattening score to measure intrinsic cultural biases. They also study how LLMs internalize Western-dominance bias and cultural flattening. Result: Experiments show that LLMs do encode Western-dominance bias and cultural flattening in their cultural knowledge space. In addition, low-resource cultures are less susceptible to cultural biases, likely because of their limited training resources. Conclusion: The study lays a foundation for future work on mitigating cultural biases and enhancing LLMs' cultural understanding, and the code and data used in the experiments are publicly available. Abstract: The growing deployment of large language models (LLMs) across diverse cultural contexts necessitates a better understanding of how the overgeneralization of less documented cultures within LLMs' representations impacts their cultural understanding. Prior work only performs extrinsic evaluation of LLMs' cultural competence, without accounting for how LLMs' internal mechanisms lead to cultural (mis)representation. To bridge this gap, we propose Culturescope, the first mechanistic interpretability-based method that probes the internal representations of LLMs to elicit the underlying cultural knowledge space. Culturescope utilizes a patching method to extract the cultural knowledge. We introduce a cultural flattening score as a measure of the intrinsic cultural biases. Additionally, we study how LLMs internalize Western-dominance bias and cultural flattening, which allows us to trace how cultural biases emerge within LLMs. Our experimental results reveal that LLMs encode Western-dominance bias and cultural flattening in their cultural knowledge space. We find that low-resource cultures are less susceptible to cultural biases, likely due to their limited training resources. Our work provides a foundation for future research on mitigating cultural biases and enhancing LLMs' cultural understanding. Our code and data used for experiments are publicly available.
[44] ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs
Keyu Chen,Zhifeng Shen,Daohai Yu,Haoqian Wu,Wei Wen,Jianfeng He,Ruizhi Qiao,Xing Sun
Main category: cs.CL
TL;DR: This paper proposes Adaptive Serial-Parallel Decoding (ASPD) to accelerate large language model inference by identifying and exploiting parallelizable structures in model outputs, achieving significant speedups without compromising response quality.
Details
Motivation: The increasing scale and complexity of large language models (LLMs) pose significant inference latency challenges due to their autoregressive decoding paradigm, which is inherently sequential. Method: The paper proposes an Adaptive Serial-Parallel Decoding (ASPD) framework that automatically extracts and validates parallelizable structures from autoregressive model responses and uses a Hybrid Decoding Engine to enable seamless transitions between serial and parallel decoding modes. Result: Extensive evaluations show that ASPD achieves up to 3.19x speedup (1.85x on average) while maintaining response quality within 1% difference compared to autoregressive models. Conclusion: ASPD provides a groundbreaking benchmark for efficient LLM parallel inference, enabling deployment in latency-sensitive applications without compromising generation quality. Abstract: The increasing scale and complexity of large language models (LLMs) pose significant inference latency challenges, primarily due to their autoregressive decoding paradigm characterized by the sequential nature of next-token prediction. By re-examining the outputs of autoregressive models, we observed that some segments exhibit parallelizable structures, which we term intrinsic parallelism. Decoding each parallelizable branch simultaneously (i.e. parallel decoding) can significantly improve the overall inference speed of LLMs. In this paper, we propose an Adaptive Serial-Parallel Decoding (ASPD), which addresses two core challenges: automated construction of parallelizable data and efficient parallel decoding mechanism. More specifically, we introduce a non-invasive pipeline that automatically extracts and validates parallelizable structures from the responses of autoregressive models. 
To empower efficient adaptive serial-parallel decoding, we implement a Hybrid Decoding Engine which enables seamless transitions between serial and parallel decoding modes while maintaining a reusable KV cache, maximizing computational efficiency. Extensive evaluations across General Tasks, Retrieval-Augmented Generation, and Mathematical Reasoning demonstrate that ASPD achieves unprecedented performance in both effectiveness and efficiency. Notably, on Vicuna Bench, our method achieves up to 3.19x speedup (1.85x on average) while maintaining response quality within 1% difference compared to autoregressive models, realizing significant acceleration without compromising generation quality. Our framework sets a groundbreaking benchmark for efficient LLM parallel inference, paving the way for its deployment in latency-sensitive applications such as AI-powered customer service bots and answer retrieval engines.
[45] Munsit at NADI 2025 Shared Task 2: Pushing the Boundaries of Multidialectal Arabic ASR with Weakly Supervised Pretraining and Continual Supervised Fine-tuning
Mahmoud Salhab,Shameed Sait,Mohammad Abusheikh,Hasan Abusheikh
Main category: cs.CL
TL;DR: This paper proposes a scalable training pipeline combining weakly supervised learning and supervised fine-tuning to build a robust Arabic ASR model, achieving state-of-the-art results in the multi-dialectal Arabic ASR challenge.
Details
Motivation: Developing accurate Arabic ASR systems is challenging due to limited labeled data and linguistic complexity from diverse dialects. Method: A scalable training pipeline that combines weakly supervised learning with supervised fine-tuning is used. The model is pretrained on 15,000 hours of weakly labeled speech, followed by continual supervised fine-tuning using a mixture of filtered weakly labeled data and a small, high-quality annotated dataset. Result: The approach achieves state-of-the-art results, ranking first in the multi-dialectal Arabic ASR challenge. Conclusion: Weak supervision combined with fine-tuning can effectively overcome data scarcity and deliver high-quality ASR for low-resource, dialect-rich languages like Arabic. Abstract: Automatic speech recognition (ASR) plays a vital role in enabling natural human-machine interaction across applications such as virtual assistants, industrial automation, customer support, and real-time transcription. However, developing accurate ASR systems for low-resource languages like Arabic remains a significant challenge due to limited labeled data and the linguistic complexity introduced by diverse dialects. In this work, we present a scalable training pipeline that combines weakly supervised learning with supervised fine-tuning to develop a robust Arabic ASR model. In the first stage, we pretrain the model on 15,000 hours of weakly labeled speech covering both Modern Standard Arabic (MSA) and various Dialectal Arabic (DA) variants. In the subsequent stage, we perform continual supervised fine-tuning using a mixture of filtered weakly labeled data and a small, high-quality annotated dataset. Our approach achieves state-of-the-art results, ranking first in the multi-dialectal Arabic ASR challenge.
These findings highlight the effectiveness of weak supervision paired with fine-tuning in overcoming data scarcity and delivering high-quality ASR for low-resource, dialect-rich languages.
[46] Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation
Khondoker Ittehadul Islam,Gabriele Sarti
Main category: cs.CL
TL;DR: This paper evaluates multilingual language models on a manually translated Bangla multi-step reasoning dataset, showing that while reasoning context benefits complex questions, models struggle with Bangla reasoning steps.
Details
Motivation: The motivation is to evaluate the performance of language models on multi-step reasoning tasks in low-resource languages like Bangla, as previous evaluations have predominantly focused on high-resource languages such as English. Method: The paper introduces a manually translated Bangla multi-step reasoning dataset derived from the English Reveal dataset, featuring both binary and non-binary question types. It conducts a controlled evaluation of English-centric and Bangla-centric multilingual small language models on the original and translated datasets. Result: The results show that reasoning context is beneficial for more challenging non-binary questions, but models struggle to effectively employ relevant Bangla reasoning steps. Conclusion: The paper concludes that reasoning steps contribute differently across models and languages, with models struggling to effectively utilize Bangla reasoning steps despite the benefits of reasoning context for more challenging non-binary questions. Abstract: Language models have demonstrated remarkable performance on complex multi-step reasoning tasks. However, their evaluation has been predominantly confined to high-resource languages such as English. In this paper, we introduce a manually translated Bangla multi-step reasoning dataset derived from the English Reveal dataset, featuring both binary and non-binary question types. We conduct a controlled evaluation of English-centric and Bangla-centric multilingual small language models on the original dataset and our translated version to compare their ability to exploit relevant reasoning steps to produce correct answers. Our results show that, in comparable settings, reasoning context is beneficial for more challenging non-binary questions, but models struggle to employ relevant Bangla reasoning steps effectively. 
We conclude by exploring how reasoning steps contribute to models' predictions, highlighting different trends across models and languages.
[47] Train Long, Think Short: Curriculum Learning for Efficient Reasoning
Hasan Abed Al Kader Hammoud,Kumail Alhamoud,Abed Hammoud,Elie Bou-Zeid,Marzyeh Ghassemi,Bernard Ghanem
Main category: cs.CL
TL;DR: This paper introduces a curriculum learning strategy with GRPO to train large language models more efficiently, starting with generous token budgets and gradually reducing them to improve reasoning performance and computational efficiency.
Details
Motivation: The motivation is to improve upon existing approaches that use fixed-length training budgets, which do not account for the natural progression from exploration to compression in learning. Method: The method involves a curriculum learning strategy that starts with generous token budgets and gradually tightens them during training. It uses Group Relative Policy Optimization (GRPO) with a reward function balancing task correctness, length efficiency, and formatting adherence. Result: Experiments on datasets like GSM8K, MATH500, and others show that the curriculum-based approach outperforms fixed-budget baselines, achieving better accuracy and token efficiency. Conclusion: The proposed curriculum learning strategy using GRPO effectively enhances the reasoning abilities of large language models, achieving higher accuracy and improved token efficiency compared to fixed-budget baselines. Abstract: Recent work on enhancing the reasoning abilities of large language models (LLMs) has introduced explicit length control as a means of constraining computational cost while preserving accuracy. However, existing approaches rely on fixed-length training budgets, which do not take advantage of the natural progression from exploration to compression during learning. In this work, we propose a curriculum learning strategy for length-controlled reasoning using Group Relative Policy Optimization (GRPO). Our method starts with generous token budgets and gradually tightens them over training, encouraging models to first discover effective solution strategies and then distill them into more concise reasoning traces. We augment GRPO with a reward function that balances three signals: task correctness (via verifier feedback), length efficiency, and formatting adherence (via structural tags). 
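As a rough illustration of the curriculum described above, the sketch below decays a token budget over training and combines the three reward signals into one scalar. The exponential schedule, the budget values, the linear length penalty, and the weights are all assumptions made for illustration; the abstract does not specify them, and this is not the authors' released implementation.

```python
def token_budget(step: int, total_steps: int, start: int = 512, end: int = 128) -> int:
    """Decay the token budget exponentially from `start` to `end` (assumed schedule)."""
    if total_steps <= 1:
        return end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return round(start * (end / start) ** frac)


def reward(correct: bool, num_tokens: int, budget: int, well_formatted: bool,
           w_correct: float = 1.0, w_length: float = 0.5, w_format: float = 0.1) -> float:
    """Weighted sum of correctness, length efficiency, and formatting (hypothetical weights)."""
    if num_tokens <= budget:
        length_score = 1.0
    else:
        # Linear penalty for overshooting the current budget, floored at zero.
        length_score = max(0.0, 1.0 - (num_tokens - budget) / budget)
    return (w_correct * float(correct)
            + w_length * length_score
            + w_format * float(well_formatted))


if __name__ == "__main__":
    for step in (0, 50, 99):
        print(step, token_budget(step, 100))
```

Early in training the budget is loose, so long but correct traces still earn the full length score; as the budget tightens, the same trace is penalized, which pushes the policy toward shorter reasoning.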
Experiments on GSM8K, MATH500, SVAMP, College Math, and GSM+ demonstrate that curriculum-based training consistently outperforms fixed-budget baselines at the same final budget, achieving higher accuracy and significantly improved token efficiency. We further ablate the impact of reward weighting and decay schedule design, showing that progressive constraint serves as a powerful inductive bias for training efficient reasoning models. Our code and checkpoints are released at: https://github.com/hammoudhasan/curriculum_grpo.
[48] Jointly Generating and Attributing Answers using Logits of Document-Identifier Tokens
Lucas Albarede,Jose Moreno,Lynda Tamine,Luce Lefeuvre
Main category: cs.CL
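TL;DR: LoDIT is a new method for jointly generating answers and faithful attributions in RAG; it improves model trustworthiness by leveraging the logits of specific tokens during generation, and it performs well in terms of efficiency and robustness.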
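The logit-based attribution idea summarized above can be sketched as follows. This is a simplified illustration, not the LoDIT implementation: it assumes that at each generation step we can read the logits of one identifier token per document, normalizes them with a softmax, and averages the per-step contributions. The softmax, the mean aggregation, and the `threshold` parameter are assumptions.

```python
import math


def step_contributions(id_logits):
    """Softmax over the document-identifier logits observed at one generation step."""
    top = max(id_logits.values())
    exps = {doc: math.exp(logit - top) for doc, logit in id_logits.items()}
    total = sum(exps.values())
    return {doc: value / total for doc, value in exps.items()}


def attribute(per_step_id_logits, threshold=0.5):
    """Average per-step contributions and cite documents whose score reaches `threshold`."""
    if not per_step_id_logits:
        return {}, []
    totals = {}
    for id_logits in per_step_id_logits:
        for doc, share in step_contributions(id_logits).items():
            totals[doc] = totals.get(doc, 0.0) + share
    steps = len(per_step_id_logits)
    scores = {doc: value / steps for doc, value in totals.items()}
    cited = sorted(doc for doc, score in scores.items() if score >= threshold)
    return scores, cited


if __name__ == "__main__":
    # Hypothetical identifier logits for two documents over two generation steps.
    scores, cited = attribute([{"D1": 2.0, "D2": 0.0}, {"D1": 1.0, "D2": 1.0}])
    print(scores, cited)
```

Because the scores come from quantities already computed during decoding, an approach of this shape adds no extra forward passes, which is in line with the latency argument made in the abstract.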
Details
Motivation: Despite their strong performance, large language models (LLMs) remain prone to hallucination, which seriously undermines their trustworthiness. Existing methods mainly focus on answer and attribution correctness, whereas LoDIT focuses on faithfully reflecting the model's decision-making process while the answer is generated. Method: LoDIT uses the logits of specific tokens during generation to estimate each document's contribution to the answer and aggregates these contributions into document attributions, so that answers and attributions are produced jointly. Result: On Trust-Align, a trustworthiness-focused attributed text-generation benchmark, LoDIT significantly outperforms state-of-the-art models on several metrics and shows good efficiency in terms of latency and robustness across settings. Conclusion: LoDIT offers an efficient and robust way to generate answers and attribute them faithfully at the same time, improving the trustworthiness of LLMs. Abstract: Despite their impressive performances, Large Language Models (LLMs) remain prone to hallucination, which critically undermines their trustworthiness. While most of the previous work focused on tackling answer and attribution correctness, a recent line of work investigated faithfulness, with a focus on leveraging internal model signals to reflect a model's actual decision-making process while generating the answer. Nevertheless, these methods induce additional latency and have shown limitations in directly aligning token generation with attribution generation. In this paper, we introduce LoDIT, a method that jointly generates and faithfully attributes answers in RAG by leveraging specific token logits during generation. It consists of two steps: (1) marking the documents with specific token identifiers and then leveraging the logits of these tokens to estimate the contribution of each document to the answer during generation, and (2) aggregating these contributions into document attributions. Experiments on a trustworthiness-focused attributed text-generation benchmark, Trust-Align, show that LoDIT significantly outperforms state-of-the-art models on several metrics. Finally, an in-depth analysis of LoDIT shows both its efficiency in terms of latency and its robustness in different settings.
[49] Retrospective Sparse Attention for Efficient Long-Context Generation
Seonghwan Choi,Beomseok Kang,Dongwon Jo,Jae-Joon Kim
Main category: cs.CL
TL;DR: RetroAttention is a novel KV cache update technique that improves long-context task performance by retrospectively revising past attention outputs with newly arrived KV entries, resulting in increased accuracy and KV exposure.
Details
Motivation: The motivation is to address the limitations of existing KV cache compression methods that focus on input contexts and fail to handle cumulative attention errors during long decoding. Method: RetroAttention maintains a lightweight output cache and revises past attention outputs using newly arrived KV entries from subsequent decoding steps. Result: Extensive experiments show that RetroAttention outperforms state-of-the-art KV compression methods, increasing effective KV exposure by up to 1.6x and accuracy by up to 21.9%. Conclusion: RetroAttention breaks the fixed-attention-output paradigm by continually correcting prior approximations, leading to improved performance in long-context tasks. Abstract: Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory footprint grows linearly with sequence length and dominates latency at each decoding step. While recent KV cache compression methods identify and load important tokens, they focus predominantly on input contexts and fail to address the cumulative attention errors that arise during long decoding. In this paper, we introduce RetroAttention, a novel KV cache update technique that retrospectively revises past attention outputs using newly arrived KV entries from subsequent decoding steps. By maintaining a lightweight output cache, RetroAttention enables past queries to efficiently access more relevant context, while incurring minimal latency overhead. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. 
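One way to picture "revising a past attention output with newly arrived KV entries" is the standard online-softmax accumulation shown below. This is a minimal sketch under stated assumptions, not the RetroAttention implementation: the cached state layout (running max, denominator, numerator) and the function names are hypothetical, and the usual 1/sqrt(d) score scaling is omitted for brevity.

```python
import math


def empty_state(dim):
    """Cached attention state for one query: (running max, denominator, numerator)."""
    return (float("-inf"), 0.0, [0.0] * dim)


def absorb(state, query, keys, values):
    """Fold additional KV entries into a cached state without revisiting old entries."""
    running_max, denom, numer = state
    for key, value in zip(keys, values):
        score = sum(q * k for q, k in zip(query, key))
        new_max = max(running_max, score)
        scale = math.exp(running_max - new_max)  # rescales what was accumulated so far
        weight = math.exp(score - new_max)
        denom = denom * scale + weight
        numer = [n * scale + weight * v for n, v in zip(numer, value)]
        running_max = new_max
    return (running_max, denom, numer)


def output(state):
    """Turn a cached state into the attention output vector."""
    _, denom, numer = state
    return [n / denom for n in numer]


if __name__ == "__main__":
    q = [1.0, 0.5]
    keys = [[0.2, 0.1], [1.0, -1.0], [0.3, 0.9], [2.0, 0.0]]
    vals = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0]]
    # First pass sees only a sparse subset of the KV cache (entries 0 and 3).
    cached = absorb(empty_state(2), q, [keys[0], keys[3]], [vals[0], vals[3]])
    # Later, the skipped entries become available and the cached output is revised.
    revised = absorb(cached, q, [keys[1], keys[2]], [vals[1], vals[2]])
    print(output(cached), output(revised))
```

Keeping only a small per-query state means a past output can be corrected at a cost proportional to the number of newly loaded entries, rather than by recomputing attention over everything seen so far.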
Extensive experiments on long-generation benchmarks show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods, increasing effective KV exposure by up to 1.6x and accuracy by up to 21.9%.
[50] LyS at SemEval 2025 Task 8: Zero-Shot Code Generation for Tabular QA
Adrián Gude,Roi Santos-Ríos,Francisco Prado-Valiño,Ana Ezquerro,Jesús Vilares
Main category: cs.CL
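TL;DR: This work develops a zero-shot pipeline for tabular question answering that uses a large language model to generate functional code for extracting the relevant information from tabular data.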
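A generate-run-repair loop of the kind this entry describes might look like the sketch below. It is an illustration under assumptions: `generate_code` is a hypothetical callable standing in for the LLM, the prompt wording is invented, and the table is a plain list of dicts rather than whatever format the authors use.

```python
def answer_with_refinement(question, table, generate_code, max_attempts=3):
    """Ask for code, run it, and feed any error back into the next prompt."""
    prompt = f"Write a Python function answer(table) that answers: {question}"
    for _ in range(max_attempts):
        code = generate_code(prompt)
        namespace = {}
        try:
            # NOTE: exec on model output is unsafe; a real system should sandbox this.
            exec(code, namespace)
            return namespace["answer"](table)
        except Exception as err:
            # Append the failing code and the error so the next attempt can fix it.
            prompt += f"\nPrevious code:\n{code}\nIt failed with: {err!r}\nPlease fix it."
    return None


if __name__ == "__main__":
    # A stand-in "model" that fails once (wrong column name) and then succeeds.
    canned = iter([
        "def answer(table):\n    return sum(row['price'] for row in table)",
        "def answer(table):\n    return sum(row['cost'] for row in table)",
    ])
    result = answer_with_refinement(
        "What is the total cost?", [{"cost": 2}, {"cost": 3}], lambda prompt: next(canned)
    )
    print(result)
```

Passing the model in as a callable keeps the loop independent of any particular LLM client, and capping `max_attempts` bounds the cost when the generated code never recovers.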
Details
Motivation: Participation in SemEval 2025 Task 8, which focuses on Tabular Question Answering. Method: A zero-shot pipeline was developed that uses a large language model to generate functional code for extracting the relevant information from tabular data. Result: The pipeline generates working code, improves extraction accuracy by identifying the most relevant columns and analyzing their data types, and gains robustness through an iterative refinement process. Conclusion: Zero-shot code generation is a valid approach for Tabular QA, ranking 33rd of 53 in the test phase despite the lack of task-specific fine-tuning. Abstract: This paper describes our participation in SemEval 2025 Task 8, focused on Tabular Question Answering. We developed a zero-shot pipeline that leverages a Large Language Model to generate functional code capable of extracting the relevant information from tabular data based on an input question. Our approach consists of a modular pipeline where the main code generator module is supported by additional components that identify the most relevant columns and analyze their data types to improve extraction accuracy. In the event that the generated code fails, an iterative refinement process is triggered, incorporating the error feedback into a new generation prompt to enhance robustness. Our results show that zero-shot code generation is a valid approach for Tabular QA, achieving rank 33 of 53 in the test phase despite the lack of task-specific fine-tuning.
[51] A Survey on Training-free Alignment of Large Language Models
Birong Pan,Yongqi Li,Weiyu Zhang,Wenpeng Lu,Mayi Xu,Shen Zhou,Yuanyuan Zhu,Ming Zhong,Tieyun Qian
Main category: cs.CL
TL;DR: This paper reviews training-free alignment methods for large language models, identifying their advantages over traditional approaches and offering guidance for future research and application.
Details
Motivation: The motivation for this study stems from the limitations of traditional alignment methods for large language models, which are resource-intensive and may lead to knowledge degradation. The authors aim to explore training-free alignment techniques as a more adaptable and efficient alternative. Method: The paper conducts a systematic review of training-free alignment methods, categorizing them into pre-decoding, in-decoding, and post-decoding stages. It analyzes these methods from the perspective of large language models and multimodal variants, discussing their mechanisms, limitations, challenges, and future directions. Result: The result of the study is a comprehensive and systematic review of training-free alignment techniques, organized by decoding stages and offering insights into their mechanisms, limitations, and potential for future development. Conclusion: This paper concludes that training-free alignment techniques offer a promising alternative to traditional fine-tuning methods for aligning large language models with human values and standards, while posing fewer resource and accessibility challenges. Abstract: The alignment of large language models (LLMs) aims to ensure their outputs adhere to human values, ethical standards, and legal norms. Traditional alignment methods often rely on resource-intensive fine-tuning (FT), which may suffer from knowledge degradation and face challenges in scenarios where the model accessibility or computational resources are constrained. In contrast, training-free (TF) alignment techniques--leveraging in-context learning, decoding-time adjustments, and post-generation corrections--offer a promising alternative by enabling alignment without heavily retraining LLMs, making them adaptable to both open-source and closed-source environments. This paper presents the first systematic review of TF alignment methods, categorizing them by stages of pre-decoding, in-decoding, and post-decoding. 
For each stage, we provide a detailed examination from the viewpoint of LLMs and multimodal LLMs (MLLMs), highlighting their mechanisms and limitations. Furthermore, we identify key challenges and future directions, paving the way for more inclusive and effective TF alignment techniques. By synthesizing and organizing the rapidly growing body of research, this survey offers guidance for practitioners and advances the development of safer and more reliable LLMs.
[52] LLM-as-a-Supervisor: Mistaken Therapeutic Behaviors Trigger Targeted Supervisory Feedback
Chen Xu,Zhenyu Lv,Tian Lan,Xianyang Wang,Luyao Ji,Leyang Cui,Minqiang Yang,Jian Shen,Qunxi Dong,Xiuling Liu,Juan Wang,Bin Hu
Main category: cs.CL
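TL;DR: This work develops a new LLM-based paradigm for training therapists: by building MATE, a mistake-driven dialogue-feedback dataset, it obtains high-quality feedback and shows potential for therapist training.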
Details
Motivation: Because applying large language models directly in patient-facing scenarios raises ethical and safety concerns, the work instead develops an LLM as a supervisor for training real therapists; in addition, common mistakes in therapy are universal and identifiable, which makes them effective triggers for feedback. Method: Guidelines for mistaken behaviors and targeted correction strategies are established, the human-in-the-loop dialogue-feedback dataset MATE is constructed from them, and a supervisor model fine-tuned on this dataset is then used for training real therapists. Result: Detailed automated, human, and downstream evaluations show that models fine-tuned on MATE can effectively provide feedback that follows the clinical guidelines. Conclusion: After fine-tuning on MATE, models can provide high-quality feedback according to the clinical guidelines, showing significant potential for therapist training. Abstract: Although large language models (LLMs) hold significant promise in psychotherapy, their direct application in patient-facing scenarios raises ethical and safety concerns. Therefore, this work shifts towards developing an LLM as a supervisor to train real therapists. In addition to the privacy of clinical therapist training data, a fundamental contradiction complicates the training of therapeutic behaviors: clear feedback standards are necessary to ensure a controlled training system, yet there is no absolute "gold standard" for appropriate therapeutic behaviors in practice. In contrast, many common therapeutic mistakes are universal and identifiable, making them effective triggers for targeted feedback that can serve as clearer evidence. Motivated by this, we create a novel therapist-training paradigm: (1) guidelines for mistaken behaviors and targeted correction strategies are first established as standards; (2) a human-in-the-loop dialogue-feedback dataset is then constructed, where a mistake-prone agent intentionally makes standard mistakes during interviews naturally, and a supervisor agent locates and identifies mistakes and provides targeted feedback; (3) after fine-tuning on this dataset, the final supervisor model is provided for real therapist training. The detailed experimental results of automated, human and downstream assessments demonstrate that models fine-tuned on our dataset, MATE, can provide high-quality feedback according to the clinical guideline, showing significant potential for the therapist training scenario.
[53] MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions
Zeyu Huang,Juyuan Wang,Longfeng Chen,Boyi Xiao,Leng Cai,Yawen Zeng,Jin Xu
Main category: cs.CL
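TL;DR: This paper proposes the MVISU-Bench benchmark and the Aider module, which significantly improves the success rate of mobile agents in handling complex user needs.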
Details
Motivation: Existing evaluation benchmarks are disconnected from the real world and cannot adequately address users' diverse and complex needs, so a new benchmark and an improved module are needed. Method: Based on user questionnaire data, five task types are identified and the MVISU-Bench benchmark is built around them. In addition, the Aider module is developed to mitigate risks and clarify user intent. Result: On MVISU-Bench, Aider improves the overall success rate by 19.55%, with improvements of 53.52% for unethical instructions and 29.41% for interactive instructions. Conclusion: The paper presents MVISU-Bench, a bilingual benchmark of 404 tasks, and introduces the Aider module, which significantly improves the success rate of mobile agents, particularly on unethical and interactive instructions. Abstract: Given the significant advances in Large Vision Language Models (LVLMs) in reasoning and visual understanding, mobile agents are rapidly emerging to meet users' automation needs. However, existing evaluation benchmarks are disconnected from the real world and fail to adequately address the diverse and complex requirements of users. From our extensive collection of user questionnaires, we identified five tasks: Multi-App, Vague, Interactive, Single-App, and Unethical Instructions. Around these tasks, we present MVISU-Bench, a bilingual benchmark that includes 404 tasks across 137 mobile applications. Furthermore, we propose Aider, a plug-and-play module that acts as a dynamic prompt prompter to mitigate risks and clarify user intent for mobile agents. Our Aider is easy to integrate into several frameworks and has successfully improved overall success rates by 19.55% compared to the current state-of-the-art (SOTA) on MVISU-Bench. Specifically, it achieves success rate improvements of 53.52% and 29.41% for unethical and interactive instructions, respectively. Through extensive experiments and analysis, we highlight the gap between existing mobile agents and real-world user expectations.
[54] READER: Retrieval-Assisted Drafter for Efficient LLM Inference
Maxim Divilkovskiy,Vitaly Malygin,Sergey Zlobin,Sultan Isali,Vasily Kalugin,Stanislav Ilyushin,Nuriza Aitassova,Yi Fei,Zeng Weidi
Main category: cs.CL
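TL;DR: This paper proposes READER, a lossless speculative decoding method that exploits self-repetitions in the text to make large language model (LLM) inference more efficient; it performs particularly well at large batch sizes and increases the speedup by over 40% without additional training.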
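To illustrate how self-repetition can supply draft tokens, the sketch below looks up the most recent earlier occurrence of the last few tokens and proposes whatever followed it. This is a deliberately simplified stand-in for the statistical search and tree expansion described in this entry, not the READER algorithm; the function names and the `ngram` and `max_draft` parameters are hypothetical.

```python
def draft_from_repetition(tokens, ngram=2, max_draft=4):
    """Propose draft tokens by finding an earlier copy of the last `ngram` tokens."""
    if len(tokens) < ngram:
        return []
    suffix = tokens[-ngram:]
    # Scan backwards so the most recent earlier occurrence wins.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == suffix:
            return tokens[start + ngram:start + ngram + max_draft]
    return []


def accept_verified_prefix(draft, target_tokens):
    """Keep only the draft prefix that matches the target model's own tokens (lossless)."""
    accepted = []
    for drafted, verified in zip(draft, target_tokens):
        if drafted != verified:
            break
        accepted.append(drafted)
    return accepted


if __name__ == "__main__":
    context = ["the", "cat", "sat", "on", "the", "cat"]
    draft = draft_from_repetition(context, ngram=2, max_draft=2)
    # `target_tokens` would come from one verification pass of the target model.
    print(draft, accept_verified_prefix(draft, ["sat", "under"]))
```

The lookup needs no extra model or training, which is why repetition-heavy workloads such as retrieval-augmented generation, where the output often copies from the retrieved context, are a natural fit.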
Details
Motivation: LLM inference is hard to accelerate because tokens are generated one at a time, so more efficient inference methods are needed. Among existing methods, training additional draft models works best, but there is still room for improvement. Method: The READER algorithm expands the speculative decoding tree with tokens obtained through statistical search and optimizes the use of the key-value (KV) cache to improve performance at large batch sizes. Result: Without additional training, READER outperforms existing speculative decoding methods and achieves more than 10x speedup on search-based tasks such as retrieval-augmented generation. Conclusion: READER is an effective method for accelerating LLM inference; it is especially suited to large-batch scenarios and has a clear advantage on search-related tasks. Abstract: Large Language Models (LLMs) generate tokens autoregressively, with each token depending on the preceding context. This sequential nature makes the inference process inherently difficult to accelerate, posing a significant challenge for efficient deployment. In recent years, various methods have been proposed to address this issue, with the most effective approaches often involving the training of additional draft models. In this paper, we introduce READER (Retrieval-Assisted Drafter for Efficient LLM Inference), a novel lossless speculative decoding method that enhances model-based approaches by leveraging self-repetitions in the text. Our algorithm expands the speculative decoding tree using tokens obtained through statistical search. This work focuses on large batch sizes (>= 8), an underexplored yet important area for industrial applications. We also analyze the key-value (KV) cache size during speculative decoding and propose an optimization to improve performance for large batches. As a result, READER outperforms existing speculative decoding methods. Notably, READER requires no additional training and can reuse pre-trained speculator models, increasing the speedup by over 40%. Our method demonstrates particularly strong performance on search-based tasks, such as retrieval-augmented generation, where we achieve more than 10x speedup.
[55] CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization
Xinge Ye,Rui Wang,Yuchuan Wu,Victor Ma,Feiteng Fang,Fei Huang,Yongbin Li
Main category: cs.CL
TL;DR: This paper proposes Comparative Policy Optimization (CPO) and the CharacterArena evaluation framework to address the challenges of subjective evaluation criteria and unstable reward signals in open-ended subjective tasks like role-playing dialogue, resulting in improved dialogue quality.
Details
Motivation: Reinforcement Learning Fine-Tuning struggles with open-ended subjective tasks like role-playing dialogue. Traditional reward modeling approaches face dual challenges: subjective evaluation criteria and unstable reward signals. Method: The paper proposes Comparative Policy Optimization (CPO), which shifts reward evaluation from sample-wise scoring to comparative group-wise scoring, and the CharacterArena evaluation framework. Result: Empirical results confirm that CPO leads to substantial improvements in dialogue quality. Conclusion: CPO effectively mitigates reward ambiguity and leads to substantial improvements in dialogue quality. Abstract: Reinforcement Learning Fine-Tuning (RLFT) has achieved notable success in tasks with objectively verifiable answers (e.g., code generation, mathematical reasoning), yet struggles with open-ended subjective tasks like role-playing dialogue. Traditional reward modeling approaches, which rely on independent sample-wise scoring, face dual challenges: subjective evaluation criteria and unstable reward signals. Motivated by the insight that human evaluation inherently combines explicit criteria with implicit comparative judgments, we propose Comparative Policy Optimization (CPO). CPO redefines the reward evaluation paradigm by shifting from sample-wise scoring to comparative group-wise scoring. Building on the same principle, we introduce the CharacterArena evaluation framework, which comprises two stages: (1) Contextualized Multi-turn Role-playing Simulation, and (2) Trajectory-level Comparative Evaluation. By operationalizing subjective scoring via objective trajectory comparisons, CharacterArena minimizes contextual bias and enables more robust and fair performance evaluation. Empirical results on CharacterEval, CharacterBench, and CharacterArena confirm that CPO effectively mitigates reward ambiguity and leads to substantial improvements in dialogue quality.
[56] Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages
Imalsha Puranegedara,Themira Chathumina,Nisal Ranathunga,Nisansa de Silva,Surangika Ranathunga,Mokanarangan Thayaparan
Main category: cs.CL
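TL;DR: This work proposes a new multilingual model architecture that improves performance on low-resource languages by fusing intermediate-layer representations.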
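The layer-fusion idea in this entry can be illustrated with a global softmax over per-layer weights, as sketched below. This is a minimal sketch, not the authors' architecture: it uses plain Python lists in place of tensors, treats `layer_logits` as already-learned parameters, and leaves out the projection into the LLM's embedding space.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_states, layer_logits):
    """Weighted sum of encoder layers; layer_states[l][t] is token t's vector at layer l."""
    weights = softmax(layer_logits)  # one global weight per layer, shared by all tokens
    num_tokens = len(layer_states[0])
    dim = len(layer_states[0][0])
    return [
        [sum(w * layer_states[l][t][d] for l, w in enumerate(weights)) for d in range(dim)]
        for t in range(num_tokens)
    ]


if __name__ == "__main__":
    # Two hypothetical layers, one token, two dimensions.
    layers = [[[1.0, 0.0]], [[3.0, 2.0]]]
    print(fuse_layers(layers, [0.0, 0.0]))   # equal weights: the layer average
    print(fuse_layers(layers, [0.0, 50.0]))  # weight mass on the last layer
```

With all weight on the final layer this reduces to using only the last encoder layer; a token-specific variant would compute a separate weight vector per token instead of sharing one across the sequence.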
Details
Motivation: Because English-centric training causes the performance of large language models to degrade significantly on low-resource languages, a new method is needed to improve performance on these languages. Method: The paper proposes fusing all intermediate encoder layers, using either a Global Softmax weighting or a Transformer Softmax model that learns token-specific weights, and maps the fused representations into the LLM's embedding space. Result: Evaluations on XNLI, IndicXNLI, Sinhala News Classification, and Amazon Reviews show that the method significantly outperforms the LangBridge baseline, especially on low-resource languages; Sinhala classification accuracy improves from 71.66% to 75.86%. Conclusion: The study proposes a new multilingual model architecture that significantly improves performance on low-resource languages by fusing intermediate-layer representations, offering a scalable path toward more efficient and equitable multilingual LLMs. Abstract: Large Language Models (LLMs) excel in English, but their performance degrades significantly on low-resource languages (LRLs) due to English-centric training. While methods like LangBridge align LLMs with multilingual encoders such as the Massively Multilingual Text-to-Text Transfer Transformer (mT5), they typically use only the final encoder layer. We propose a novel architecture that fuses all intermediate layers, enriching the linguistic information passed to the LLM. Our approach features two strategies: (1) a Global Softmax weighting for overall layer importance, and (2) a Transformer Softmax model that learns token-specific weights. The fused representations are mapped into the LLM's embedding space, enabling it to process multilingual inputs. The model is trained only on English data, without using any parallel or multilingual data. Evaluated on XNLI, IndicXNLI, Sinhala News Classification, and Amazon Reviews, our Transformer Softmax model significantly outperforms the LangBridge baseline. We observe strong performance gains in LRLs, improving Sinhala classification accuracy from 71.66% to 75.86% and achieving clear improvements across Indic languages such as Tamil, Bengali, and Malayalam. These specific gains contribute to an overall boost in average XNLI accuracy from 70.36% to 71.50%. This approach offers a scalable, data-efficient path toward more capable and equitable multilingual LLMs.
[57] Link Prediction for Event Logs in the Process Industry
Anastasia Zhukova,Thomas Walton,Christian E. Matt,Bela Gipp
Main category: cs.CL
TL;DR: This paper introduces a novel record linking approach using adapted CDCR models with reasoning capabilities to enhance data quality and connectivity in the process industry's fragmented shift logs.
Details
Motivation: Fragmented event logs in shift books hinder the recommendation of past solutions in the process industry. This study aims to improve knowledge management by linking related records effectively. Method: The authors framed record linking as a cross-document coreference resolution (CDCR) task, enhanced with natural language inference (NLI) and semantic text similarity (STS), and tailored it for the process industry's text formats. Result: The proposed RL model outperformed NLI- and STS-driven baselines by 28% (11.43 points) and 27% (11.21 points), respectively, demonstrating its effectiveness in improving data connectivity. Conclusion: The study successfully adapted CDCR models for record linking in the process industry, significantly outperforming NLI- and STS-driven baselines, thereby enhancing data quality and connectivity in shift logs. Abstract: Knowledge management (KM) is vital in the process industry for optimizing operations, ensuring safety, and enabling continuous improvement through effective use of operational data and past insights. A key challenge in this domain is the fragmented nature of event logs in shift books, where related records, e.g., entries documenting issues related to equipment or processes and the corresponding solutions, may remain disconnected. This fragmentation hinders the recommendation of previous solutions to the users. To address this problem, we investigate record linking (RL) as link prediction, commonly studied in graph-based machine learning, by framing it as a cross-document coreference resolution (CDCR) task enhanced with natural language inference (NLI) and semantic text similarity (STS) by shifting it into the causal inference (CI). We adapt CDCR, traditionally applied in the news domain, into an RL model to operate at the passage level, similar to NLI and STS, while accommodating the process industry's specific text formats, which contain unstructured text and structured record attributes. 
Our RL model outperformed the best versions of NLI- and STS-driven baselines by 28% (11.43 points) and 27% (11.21 points), respectively. Our work demonstrates how domain adaptation of the state-of-the-art CDCR models, enhanced with reasoning capabilities, can be effectively tailored to the process industry, improving data quality and connectivity in shift logs.
[58] AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators
Jason Chou,Ao Liu,Yuchi Deng,Zhiying Zeng,Tao Zhang,Haotian Zhu,Jianwei Cai,Yue Mao,Chenchen Zhang,Lingyun Tan,Ziyan Xu,Bohui Zhai,Hengyi Liu,Speed Zhu,Wiggin Zhou,Fengzong Lian
Main category: cs.CL
TL;DR: This study proposes the AutoCodeGen method and the AutoCodeBench family of datasets for evaluating LLMs on multilingual code generation tasks.
Details
Motivation: Existing code generation benchmarks rely on manual annotation, focus mainly on Python, and, where multilingual, offer limited difficulty and an uneven language distribution. Method: The authors propose AutoCodeGen, which generates test inputs with LLMs, obtains test outputs through a multilingual sandbox, and combines reverse-order problem generation with multi-step filtering to produce high-quality datasets. Result: Evaluation shows that even advanced LLMs still struggle with complex, diverse, multilingual tasks. Conclusion: The AutoCodeBench series is intended as a valuable resource that encourages the community to focus on more challenging and practical multilingual code generation scenarios. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities, these benchmarks face several critical limitations. First, they often rely on manual annotations, which are time-consuming and difficult to scale across different programming languages and problem complexities. Second, most existing benchmarks focus primarily on Python, while the few multilingual benchmarks suffer from limited difficulty and uneven language distribution. To address these challenges, we propose AutoCodeGen, an automated method for generating high-difficulty multilingual code generation datasets without manual annotations. AutoCodeGen ensures the correctness and completeness of test cases by generating test inputs with LLMs and obtaining test outputs through a multilingual sandbox, while achieving high data quality through reverse-order problem generation and multiple filtering steps. Using this novel method, we introduce AutoCodeBench, a large-scale code generation benchmark comprising 3,920 problems evenly distributed across 20 programming languages. It is specifically designed to evaluate LLMs on challenging, diverse, and practical multilingual tasks. We evaluate over 30 leading open-source and proprietary LLMs on AutoCodeBench and its simplified version AutoCodeBench-Lite. The results show that even the most advanced LLMs struggle with the complexity, diversity, and multilingual nature of these tasks. In addition, we introduce AutoCodeBench-Complete, specifically designed for base models to assess their few-shot code generation capabilities.
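The test-case construction described in this entry (candidate inputs proposed by a model, expected outputs obtained by actually executing a reference solution) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the hard-coded `candidate_inputs` stand in for LLM-generated inputs, a plain in-process call stands in for the multilingual sandbox, and the helper names `build_test_cases` and `passes`, as well as the choice to drop inputs on which the reference solution raises, are assumptions made for illustration.

```python
def build_test_cases(solution, candidate_inputs):
    """Run a trusted reference solution on each candidate input and
    record the observed output as the expected value.

    Assumption: inputs on which the reference solution raises are dropped.
    """
    cases = []
    for args in candidate_inputs:
        try:
            expected = solution(*args)
        except Exception:
            continue  # discard inputs the reference solution cannot handle
        cases.append({"input": args, "expected": expected})
    return cases


def passes(candidate, cases):
    """Check a candidate implementation against the recorded test cases."""
    return all(candidate(*case["input"]) == case["expected"] for case in cases)


def reference_gcd(a, b):
    """Hypothetical reference solution: Euclid's algorithm."""
    while b:
        a, b = b, a % b
    return a


# Stand-in for model-proposed inputs; the last one is deliberately invalid.
candidate_inputs = [(12, 18), (7, 0), (0, 0), ("x", 3)]
cases = build_test_cases(reference_gcd, candidate_inputs)
```

Because the expected values come from execution rather than from a model's guess, a test case can only be wrong if the reference solution itself is wrong, which is the property the sandbox step is meant to provide.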
We hope the AutoCodeBench series will serve as a valuable resource and inspire the community to focus on more challenging and practical multilingual code generation scenarios.
[59] SinLlama -- A Large Language Model for Sinhala
H. W. K. Aravinda,Rashad Sirajudeen,Samith Karunathilake,Nisansa de Silva,Surangika Ranathunga,Rishemjit Kaur
Main category: cs.CL
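TL;DR: By extending the multilingual LLM Llama-3-8B, this paper delivers SinLlama, the first decoder-based open-source LLM for the low-resource language Sinhala, which significantly outperforms the base model on text classification tasks.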
Details
Motivation: Low-resource languages such as Sinhala are often overlooked by open-source large language models; this work aims to fill that gap and improve LLM support for Sinhala. Method: The authors add Sinhala-specific vocabulary to the Llama-3-8B tokenizer and perform continual pre-training on a cleaned 10 million Sinhala corpus, producing the SinLlama model. Result: After instruction fine-tuning on three text classification tasks, SinLlama significantly outperforms both the base and instruct variants of Llama-3-8B. Conclusion: SinLlama is the first decoder-based open-source LLM with explicit Sinhala support, and it demonstrates an effective way to improve model performance on low-resource languages. Abstract: Low-resource languages such as Sinhala are often overlooked by open-source Large Language Models (LLMs). In this research, we extend an existing multilingual LLM (Llama-3-8B) to better serve Sinhala. We enhance the LLM tokenizer with Sinhala-specific vocabulary and perform continual pre-training on a cleaned 10 million Sinhala corpus, resulting in the SinLlama model. This is the very first decoder-based open-source LLM with explicit Sinhala support. When SinLlama was instruction fine-tuned for three text classification tasks, it outperformed base and instruct variants of Llama-3-8B by a significant margin.
[60] OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows
Weixuan Wang,Dongge Han,Daniel Madrigal Diaz,Jin Xu,Victor Rühle,Saravan Rajmohan
Main category: cs.CL
TL;DR: The paper introduces OdysseyBench and HomerAgents to evaluate LLM agents on long-horizon workflows across diverse office applications, addressing the gap in existing benchmarks that focus on atomic tasks.
Details
Motivation: Existing benchmarks focus on atomic tasks, lacking the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios involving autonomous agents powered by large language models. Method: The study introduces OdysseyBench, a comprehensive benchmark with two splits (OdysseyBench+ and OdysseyBench-Neo) and proposes HomerAgents, a multi-agent framework for generating long-horizon workflow benchmarks. Result: OdysseyBench effectively challenges state-of-the-art LLM agents, providing a more accurate assessment of their capabilities in complex, real-world contexts compared to existing atomic task benchmarks. Conclusion: OdysseyBench serves as a valuable resource for advancing the development and evaluation of LLM agents in real-world productivity scenarios. Abstract: Autonomous agents powered by large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon workflows. However, existing benchmarks predominantly focus on atomic tasks that are self-contained and independent, failing to capture the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios. To address this gap, we introduce OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications including Word, Excel, PDF, Email, and Calendar. Our benchmark comprises two complementary splits: OdysseyBench+ with 300 tasks derived from real-world use cases, and OdysseyBench-Neo with 302 newly synthesized complex tasks. Each task requires an agent to identify essential information from long-horizon interaction histories and perform multi-step reasoning across various applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks through systematic environment exploration, task generation, and dialogue synthesis.
Our extensive evaluation demonstrates that OdysseyBench effectively challenges state-of-the-art LLM agents, providing a more accurate assessment of their capabilities in complex, real-world contexts compared to existing atomic task benchmarks. We believe that OdysseyBench will serve as a valuable resource for advancing the development and evaluation of LLM agents in real-world productivity scenarios. In addition, we release OdysseyBench and HomerAgents to foster research along this line.
[61] Complex Logical Instruction Generation
Mian Zhang,Shujian Liu,Sixun Dong,Ming Yin,Yebowen Hu,Xun Wang,Steven Ma,Song Wang,Sathish Reddy Indurthi,Haoyun Deng,Zhiyu Zoey Chen,Kaiqiang Song
Main category: cs.CL
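TL;DR: The study proposes LogicIFGen and LogicIFEval for evaluating how well large language models follow complex logical instructions; results show that existing models still have significant shortcomings.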
Details
Motivation: As tasks grow more complex, the logic structures embedded in natural language instructions become increasingly intricate, yet how well LLMs follow such instructions remains under-explored. Method: The authors propose LogicIFGen and LogicIFEval: the former is a scalable, automated framework for generating verifiable instructions from code functions; the latter is a benchmark of 426 logic-rich instructions. Result: Experiments show that current state-of-the-art LLMs still struggle on LogicIFEval, revealing major deficiencies in instruction-following ability. Conclusion: Current state-of-the-art LLMs still have significant deficiencies in following complex logical instructions; most models can follow fewer than 60% of the instructions. Abstract: Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs can only follow fewer than 60% of the instructions, revealing significant deficiencies in the instruction-following ability. Code and Benchmark: https://github.com/mianzhang/LogicIF
[62] Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models
Wen Wang,Bozhen Fang,Chenchen Jing,Yongliang Shen,Yangyi Shen,Qiuyu Wang,Hao Ouyang,Hao Chen,Chunhua Shen
Main category: cs.CL
TL;DR: This paper identifies the issue of temporal oscillation in diffusion large language models and proposes two new methods to improve model outputs by leveraging temporal consistency across denoising steps, resulting in significant performance gains.
Details
Motivation: Current dLLM decoding strategies ignore intermediate predictions, even though correct answers may appear during the iterative denoising process but get overwritten later. This work aims to exploit these intermediate outputs to improve model performance. Method: The authors introduce two methods: Temporal Self-Consistency Voting, a training-free decoding strategy that aggregates predictions across denoising steps, and Temporal Consistency Reinforcement, a post-training method using Temporal Semantic Entropy (TSE) as a reward signal to improve generation stability. Result: Empirical results show significant improvements across multiple benchmarks, including a 24.7% improvement on the Countdown dataset using negative TSE reward alone, and further gains when combining TSE with accuracy reward, achieving improvements of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown. Conclusion: The study highlights the overlooked potential of temporal dynamics in diffusion large language models (dLLMs) and proposes two effective methods to enhance output stability and accuracy by leveraging temporal consistency. Abstract: Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work here reveals a critical phenomenon, temporal oscillation, where correct answers often emerge in the middle process, but are overwritten in later denoising steps. 
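The two ideas named in the Method summary above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the answers extracted at each denoising step are assumed to be given as a list of strings, the vote is an unweighted majority (the paper's aggregation may weight steps differently), and semantic equivalence is approximated by exact string match when computing the entropy. The function names `temporal_vote` and `temporal_entropy` are hypothetical.

```python
import math
from collections import Counter


def temporal_vote(step_answers):
    """Pick the answer that appears most often across denoising steps.

    Assumption: unweighted majority vote; steps with no parsable answer
    are passed as None and ignored.
    """
    counts = Counter(a for a in step_answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def temporal_entropy(step_answers):
    """Shannon entropy of the answer distribution across steps.

    Assumption: exact string match stands in for semantic clustering.
    Low entropy means the intermediate predictions are stable.
    """
    counts = Counter(a for a in step_answers if a is not None)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical trajectory: the answer "42" appears mid-process
# but the final step overwrites it with "17".
trajectory = ["12", "42", "42", "42", "17"]
voted = temporal_vote(trajectory)
```

In this toy trajectory, taking only the final step would return "17", while the vote recovers the answer that dominated the intermediate steps, which is the temporal oscillation effect the entry describes.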
To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown, respectively. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.
cs.CV [Back]
[63] Evaluation of State-of-the-Art Deep Learning Techniques for Plant Disease and Pest Detection
Saptarshi Banerjee,Tausif Mallick,Amlan Chakroborty,Himadri Nath Saha,Nityananda T. Takur
Main category: cs.CV
TL;DR: This paper reviews AI-based methods for detecting plant diseases and pests, showing that modern techniques like vision transformers outperform traditional approaches in accuracy and efficiency.
Details
Motivation: Detecting plant diseases and pests is crucial for improving crop yields and reducing economic losses. AI and machine learning offer more efficient and accurate detection methods compared to manual identification. Method: The study reviews and categorizes modern AI-based techniques for plant disease and pest detection into five groups: hyperspectral imaging, non-visualization techniques, visualization approaches, modified deep learning models, and transformer architectures. Result: Modern AI techniques, particularly vision transformers, achieve high accuracy (e.g., over 99.3% with HvT) and outperform older methods like MobileNetV3 in speed and precision. Conclusion: The study concludes that AI-based methods, especially vision transformers like HvT, are superior for plant disease and pest detection compared to traditional approaches. Challenges and future research directions are also outlined. Abstract: Addressing plant diseases and pests is critical for enhancing crop production and preventing economic losses. Recent advances in artificial intelligence (AI), machine learning (ML), and deep learning (DL) have significantly improved the precision and efficiency of detection methods, surpassing the limitations of manual identification. This study reviews modern computer-based techniques for detecting plant diseases and pests from images, including recent AI developments. The methodologies are organized into five categories: hyperspectral imaging, non-visualization techniques, visualization approaches, modified deep learning architectures, and transformer models. This structured taxonomy provides researchers with detailed, actionable insights for selecting advanced state-of-the-art detection methods. A comprehensive survey of recent work and comparative studies demonstrates the consistent superiority of modern AI-based approaches, which often outperform older image analysis methods in speed and accuracy. 
In particular, vision transformers such as the Hierarchical Vision Transformer (HvT) have shown accuracy exceeding 99.3% in plant disease detection, outperforming architectures like MobileNetV3. The study concludes by discussing system design challenges, proposing solutions, and outlining promising directions for future research.
[64] ImageDDI: Image-enhanced Molecular Motif Sequence Representation for Drug-Drug Interaction Prediction
Yuqin He,Tengfei Ma,Chaoyi Li,Pengsen Ma,Hongxin Xiang,Jianmin Wang,Yiping Liu,Bosheng Song,Xiangxiang Zeng
Main category: cs.CV
TL;DR: ImageDDI is a novel framework for predicting drug-drug interactions by integrating functional motif sequences and molecular image information, outperforming existing methods.
Details
Motivation: Accurately identifying and predicting drug-drug interactions (DDIs) is crucial to mitigate adverse health effects. Existing methods face limitations in functional motif-based representation learning as DDIs are caused by motif interactions rather than overall drug structures. Method: ImageDDI tokenizes molecules into functional motifs, combines motifs of drug pairs into a sequence, uses a transformer-based encoder, and applies Adaptive Feature Fusion to enhance spatial representation using global molecular image information. Result: Experimental results show that ImageDDI outperforms existing state-of-the-art methods and demonstrates competitive performance in 2D and 3D image-enhanced scenarios. Conclusion: ImageDDI outperforms state-of-the-art methods in DDI prediction and achieves competitive performance in both 2D and 3D image-enhanced scenarios. Abstract: To mitigate the potential adverse health effects of simultaneous multi-drug use, including unexpected side effects and interactions, accurately identifying and predicting drug-drug interactions (DDIs) is considered a crucial task in the field of deep learning. Although existing methods have demonstrated promising performance, they suffer from the bottleneck of limited functional motif-based representation learning, as DDIs are fundamentally caused by motif interactions rather than the overall drug structures. In this paper, we propose an Image-enhanced molecular motif sequence representation framework for DDI prediction, called ImageDDI, which represents a pair of drugs from both global and local structures. Specifically, ImageDDI tokenizes molecules into functional motifs. To effectively represent a drug pair, their motifs are combined into a single sequence and embedded using a transformer-based encoder, starting from the local structure representation.
By leveraging the associations between drug pairs, ImageDDI further enhances the spatial representation of molecules using global molecular image information (e.g. texture, shadow, color, and planar spatial relationships). To integrate molecular visual information into functional motif sequence, ImageDDI employs Adaptive Feature Fusion, enhancing the generalization of ImageDDI by dynamically adapting the fusion process of feature representations. Experimental results on widely used datasets demonstrate that ImageDDI outperforms state-of-the-art methods. Moreover, extensive experiments show that ImageDDI achieved competitive performance in both 2D and 3D image-enhanced scenarios compared to other models.
[65] Designing Object Detection Models for TinyML: Foundations, Comparative Analysis, Challenges, and Emerging Solutions
Christophe EL Zeinaty,Wassim Hamidouche,Glenn Herrou,Daniel Menard
Main category: cs.CV
TL;DR: This survey paper analyzes optimization techniques for deploying object detection models on resource-constrained IoT devices, focusing on TinyML and techniques like quantization, pruning, knowledge distillation, and neural architecture search. It bridges the gap between academic research and real-world deployment and provides a public repository for tracking developments.
Details
Motivation: The motivation for this paper is the challenge of deploying object detection on resource-constrained IoT devices due to the computational load of deep learning-based models and the rapid proliferation of IoT devices. The authors aim to address a gap in existing survey papers, which often overlook the optimization challenges associated with deploying object detection models in TinyML environments. Method: The paper provides a detailed analysis of key optimization techniques for deploying object detection models on resource-constrained devices, including quantization, pruning, knowledge distillation, and neural architecture search. It also explores both theoretical approaches and practical implementations, and compares key performance indicators of existing object detection implementations on microcontroller devices. Result: The result of this paper is a comprehensive survey of optimization techniques for deploying object detection models on resource-constrained devices, including a comparison of key performance indicators of existing object detection implementations on microcontroller devices. The authors also provide a public repository to track developments in the field. Conclusion: The paper concludes that TinyML offers a promising solution for deploying object detection models on resource-constrained IoT devices, and that optimization techniques such as quantization, pruning, knowledge distillation, and neural architecture search are essential for achieving efficient and real-time processing at the edge. Abstract: Object detection (OD) has become vital for numerous computer vision applications, but deploying it on resource-constrained IoT devices presents a significant challenge. These devices, often powered by energy-efficient microcontrollers, struggle to handle the computational load of deep learning-based OD models. This issue is compounded by the rapid proliferation of IoT devices, predicted to surpass 150 billion by 2030. 
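Of the optimization techniques named in the Method summary above, quantization is the easiest to show in a few lines. The following Python sketch illustrates generic affine (scale and zero-point) quantization of a list of weights to unsigned 8-bit integers. It is a textbook-style illustration, not code from the survey; the function names `quantize` and `dequantize` and the example weights are assumptions.

```python
def quantize(values, num_bits=8):
    """Affine quantization of floats to unsigned integers.

    Returns the integer codes plus the scale and zero point needed to
    map them back to approximate float values.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(qmin - lo / scale)
    codes = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Map integer codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical layer weights.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, scale, zero_point = quantize(weights)
restored = dequantize(codes, scale, zero_point)
```

Each weight is stored in one byte instead of four, at the cost of a reconstruction error bounded by roughly one quantization step (`scale`); on a microcontroller the integer codes can also be processed with integer arithmetic, which is the main efficiency argument for the technique.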
TinyML offers a compelling solution by enabling OD on ultra-low-power devices, paving the way for efficient and real-time processing at the edge. Although numerous survey papers have been published on this topic, they often overlook the optimization challenges associated with deploying OD models in TinyML environments. To address this gap, this survey paper provides a detailed analysis of key optimization techniques for deploying OD models on resource-constrained devices. These techniques include quantization, pruning, knowledge distillation, and neural architecture search. Furthermore, we explore both theoretical approaches and practical implementations, bridging the gap between academic research and real-world edge artificial intelligence deployment. Finally, we compare the key performance indicators (KPIs) of existing OD implementations on microcontroller devices, highlighting the achieved maturity level of these solutions in terms of both prediction accuracy and efficiency. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/christophezei/Optimizing-Object-Detection-Models-for-TinyML-A-Comprehensive-Survey.
[66] Neural Tangent Knowledge Distillation for Optical Convolutional Networks
Jinlin Xiang,Minho Choi,Yubo Zhang,Zhihao Zhou,Arka Majumdar,Eli Shlizerman
Main category: cs.CV
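TL;DR: This paper proposes a general training and optimization pipeline for hybrid optical neural networks that aims to improve accuracy and narrow the gap between simulated and physical systems, and that applies across multiple tasks and hardware configurations.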
Details
Motivation: Adoption of hybrid optical neural networks (ONNs) is limited by an accuracy gap during training and by discrepancies between simulated and fabricated systems. Existing methods lack generalization across tasks and hardware designs. Method: The authors propose a task-agnostic and hardware-agnostic pipeline comprising a method for estimating achievable model accuracy from user-specified constraints, Neural Tangent Knowledge Distillation (NTKD) for training, and fine-tuning of the digital backend after fabrication. Result: The pipeline is validated on multiple datasets (e.g., MNIST, CIFAR, Carvana Masking), shows consistent performance gains, and applies to different optical system designs. Conclusion: Experiments show that the method improves ONN performance across multiple datasets and hardware configurations and supports deployment in physical systems. Abstract: Hybrid Optical Neural Networks (ONNs, typically consisting of an optical frontend and a digital backend) offer an energy-efficient alternative to fully digital deep networks for real-time, power-constrained systems. However, their adoption is limited by two main challenges: the accuracy gap compared to large-scale networks during training, and discrepancies between simulated and fabricated systems that further degrade accuracy. While previous work has proposed end-to-end optimizations for specific datasets (e.g., MNIST) and optical systems, these approaches typically lack generalization across tasks and hardware designs. To address these limitations, we propose a task-agnostic and hardware-agnostic pipeline that supports image classification and segmentation across diverse optical systems. To assist optical system design before training, we estimate achievable model accuracy based on user-specified constraints such as physical size and the dataset. For training, we introduce Neural Tangent Knowledge Distillation (NTKD), which aligns optical models with electronic teacher networks, thereby narrowing the accuracy gap. After fabrication, NTKD also guides fine-tuning of the digital backend to compensate for implementation errors. Experiments on multiple datasets (e.g., MNIST, CIFAR, Carvana Masking) and hardware configurations show that our pipeline consistently improves ONN performance and enables practical deployment in both pre-fabrication simulations and physical implementations.
[67] MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling
Qian Wang,Ziqi Huang,Ruoxi Jia,Paul Debevec,Ning Yu
Main category: cs.CV
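TL;DR: MAViS is a long-sequence video generation framework that uses multi-agent collaboration and a modular design to create high-quality video stories, with multimodal output that includes narratives and background music.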
Details
Motivation: Existing long-sequence video generation frameworks have significant limitations in assistive capability, visual quality, and expressiveness, so a more efficient, modular framework is needed to improve overall performance. Method: The authors propose MAViS, an end-to-end multi-agent collaborative framework in which agents follow the 3E Principle (Explore, Examine, Enhance) across multiple stages: script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. Result: Experiments show that MAViS achieves state-of-the-art performance in assistive capability, visual quality, and video expressiveness, and can generate high-quality, expressive long-sequence video stories from a simple user prompt. Conclusion: To the authors' knowledge, MAViS is the only long-sequence video generation framework that provides multimodal design output (videos with narratives and background music), and it offers strong assistive capability, visual quality, and video expressiveness. Abstract: Despite recent advances, long-sequence video generation frameworks still suffer from significant limitations: poor assistive capability, suboptimal visual quality, and limited expressiveness. To mitigate these limitations, we propose MAViS, an end-to-end multi-agent collaborative framework for long-sequence video storytelling. MAViS orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. In each stage, agents operate under the 3E Principle -- Explore, Examine, and Enhance -- to ensure the completeness of intermediate outputs. Considering the capability limitations of current generative models, we propose the Script Writing Guidelines to optimize compatibility between scripts and generative tools. Experimental results demonstrate that MAViS achieves state-of-the-art performance in assistive capability, visual quality, and video expressiveness. Its modular framework further enables scalability with diverse generative models and tools. With just a brief user prompt, MAViS is capable of producing high-quality, expressive long-sequence video storytelling, enriching inspirations and creativity for users. To the best of our knowledge, MAViS is the only framework that provides multimodal design output -- videos with narratives and background music.
[68] MuGa-VTON: Multi-Garment Virtual Try-On via Diffusion Transformers with Prompt Customization
Ankan Deria,Dwarikanath Mahapatra,Behzad Bozorgtabar,Mohna Chakraborty,Snehashis Chakraborty,Sudipta Roy
Main category: cs.CV
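TL;DR: This paper proposes MuGa-VTON, a unified multi-garment diffusion framework that generates high-quality virtual try-on images while preserving person identity and garment details, and supports prompt-based customization.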
Details
Motivation: Existing virtual try-on methods typically handle upper and lower garments separately, rely on heavy preprocessing, and struggle to preserve person-specific cues (such as tattoos, accessories, and body shape), which limits realism and flexibility. Method: MuGa-VTON comprises three key modules: the Garment Representation Module (GRM), the Person Representation Module (PRM), and the A-DiT fusion module, which integrates garment, person, and text-prompt features through a diffusion transformer. Result: On the VITON-HD and DressCode benchmarks, MuGa-VTON outperforms existing methods in both qualitative and quantitative evaluations, producing high-fidelity, identity-preserving results. Conclusion: MuGa-VTON offers an efficient virtual try-on solution: its unified multi-garment diffusion framework outperforms existing methods at preserving personal identity and garment fidelity, making it suitable for practical fashion retail and personalization applications. Abstract: Virtual try-on seeks to generate photorealistic images of individuals in desired garments, a task that must simultaneously preserve personal identity and garment fidelity for practical use in fashion retail and personalization. However, existing methods typically handle upper and lower garments separately, rely on heavy preprocessing, and often fail to preserve person-specific cues such as tattoos, accessories, and body shape, resulting in limited realism and flexibility. To this end, we introduce MuGa-VTON, a unified multi-garment diffusion framework that jointly models upper and lower garments together with person identity in a shared latent space. Specifically, we propose three key modules: the Garment Representation Module (GRM) for capturing both garment semantics, the Person Representation Module (PRM) for encoding identity and pose cues, and the A-DiT fusion module, which integrates garment, person, and text-prompt features through a diffusion transformer. This architecture supports prompt-based customization, allowing fine-grained garment modifications with minimal user input. Extensive experiments on the VITON-HD and DressCode benchmarks demonstrate that MuGa-VTON outperforms existing methods in both qualitative and quantitative evaluations, producing high-fidelity, identity-preserving results suitable for real-world virtual try-on applications.
[69] CObL: Toward Zero-Shot Ordinal Layering without User Prompting
Aneel Damaraju,Dean Hazineh,Todd Zickler
Main category: cs.CV
TL;DR: The paper introduces Concurrent Object Layers (CObL), a diffusion-based architecture that generates object layers in parallel and generalizes to real-world images without prior knowledge of object numbers.
Details
Motivation: Vision benefits from grouping pixels into objects and understanding their spatial relationships; CObL captures this by comprising an occlusion-ordered stack of 'object layers'. Method: CObL uses a diffusion-based architecture to generate a stack of object layers in parallel, using Stable Diffusion as a prior for natural objects. Result: CObL zero-shot generalizes to photographs of real-world tabletops with varying numbers of novel objects and reconstructs multiple occluded objects. Conclusion: CObL is able to reconstruct multiple occluded objects without user prompting and without knowing the number of objects beforehand, and it is not limited to the world it was trained in. Abstract: Vision benefits from grouping pixels into objects and understanding their spatial relationships, both laterally and in depth. We capture this with a scene representation comprising an occlusion-ordered stack of "object layers," each containing an isolated and amodally-completed object. To infer this representation from an image, we introduce a diffusion-based architecture named Concurrent Object Layers (CObL). CObL generates a stack of object layers in parallel, using Stable Diffusion as a prior for natural objects and inference-time guidance to ensure the inferred layers composite back to the input image. We train CObL using a few thousand synthetically-generated images of multi-object tabletop scenes, and we find that it zero-shot generalizes to photographs of real-world tabletops with varying numbers of novel objects. In contrast to recent models for amodal object completion, CObL reconstructs multiple occluded objects without user prompting and without knowing the number of objects beforehand. Unlike previous models for unsupervised object-centric representation learning, CObL is not limited to the world it was trained in.
[70] Re:Verse -- Can Your VLM Read a Manga?
Aaditya Baranwal,Madhav Kataria,Naitik Agrawal,Yogesh S Rawat,Shruti Vyas
Main category: cs.CV
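TL;DR: This paper proposes a new evaluation framework for studying the capabilities and limits of vision-language models on long-form visual narratives. It reveals current models' weaknesses in temporal causality and cross-panel cohesion, and offers methods and insights for improving narrative intelligence in future models.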
Details
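Motivation: Current Vision Language Models (VLMs) show a significant gap between surface-level recognition and deep narrative reasoning. Although existing multimodal models excel at interpreting individual panels, they perform poorly on temporal causality and cross-panel cohesion, which are core requirements for coherent story comprehension. Method: The paper proposes a new evaluation framework combining fine-grained multimodal annotation, cross-modal embedding analysis, and retrieval-augmented assessment. Specifically: (i) a rigorous annotation protocol linking visual elements to narrative structure through aligned light novel text; (ii) comprehensive evaluation across multiple reasoning paradigms, including direct inference and retrieval-augmented generation; (iii) cross-modal similarity analysis revealing fundamental misalignments in current VLMs' joint representations. Result: Evaluating on 308 annotated panels from 11 chapters of the Re:Zero manga, the paper finds systematic failures in generative storytelling, contextual dialogue grounding, and temporal reasoning. Models are especially weak on non-linear narratives, character consistency, and causal inference across long sequences. Conclusion: Current VLMs lack genuine story-level intelligence when processing Discrete Visual Narratives, struggling particularly with non-linear narratives, character consistency, and causal inference across sequences. The work lays a foundation for evaluating narrative intelligence and provides a practical methodology for deeper understanding of Discrete Visual Narratives. Abstract: Current Vision Language Models (VLMs) demonstrate a critical gap between surface-level recognition and deep narrative reasoning when processing sequential visual storytelling. Through a comprehensive investigation of manga narrative understanding, we reveal that while recent large multimodal models excel at individual panel interpretation, they systematically fail at temporal causality and cross-panel cohesion, core requirements for coherent story comprehension. We introduce a novel evaluation framework that combines fine-grained multimodal annotation, cross-modal embedding analysis, and retrieval-augmented assessment to systematically characterize these limitations. Our methodology includes (i) a rigorous annotation protocol linking visual elements to narrative structure through aligned light novel text, (ii) comprehensive evaluation across multiple reasoning paradigms, including direct inference and retrieval-augmented generation, and (iii) cross-modal similarity analysis revealing fundamental misalignments in current VLMs' joint representations. Applying this framework to Re:Zero manga across 11 chapters with 308 annotated panels, we conduct the first systematic study of long-form narrative understanding in VLMs through three core evaluation axes: generative storytelling, contextual dialogue grounding, and temporal reasoning.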
Our findings demonstrate that current models lack genuine story-level intelligence, struggling particularly with non-linear narratives, character consistency, and causal inference across extended sequences. This work establishes both the foundation and practical methodology for evaluating narrative intelligence, while providing actionable insights into the capability of deep sequential understanding of Discrete Visual Narratives beyond basic recognition in Multimodal Models.
[71] VISOR: Visual Input-based Steering for Output Redirection in Vision-Language Models
Mansi Phute,Ravikumar Balakrishnan
Main category: cs.CV
TL;DR: This paper introduces VISOR, a new method for controlling the behavior of Vision Language Models (VLMs) using visual inputs alone, offering improved practicality and exposing new security concerns.
Details
Motivation: Existing methods for behavioral control of VLMs are either easily detectable or require invasive access to model internals, limiting their applicability. This work aims to develop a more practical and discreet method for controlling VLM behavior. Method: VISOR uses optimized visual inputs to steer VLM behavior by crafting universal steering images that induce target activation patterns. Result: VISOR achieves behavioral control comparable to existing methods using a single 150KB steering image, with up to 25% improvement for negative steering tasks, while maintaining high performance on unrelated tasks. Conclusion: The paper concludes that VISOR offers a novel, effective method for behavioral control of VLMs without requiring internal model access, while also highlighting the security vulnerabilities this introduces. Abstract: Vision Language Models (VLMs) are increasingly being used in a broad range of applications, bringing their security and behavioral control to the forefront. While existing approaches for behavioral control or output redirection, like system prompting in VLMs, are easily detectable and often ineffective, activation-based steering vectors require invasive runtime access to model internals--incompatible with API-based services and closed-source deployments. We introduce VISOR (Visual Input-based Steering for Output Redirection), a novel method that achieves sophisticated behavioral control through optimized visual inputs alone. By crafting universal steering images that induce target activation patterns, VISOR enables practical deployment across all VLM serving modalities while remaining imperceptible compared to explicit textual instructions. We validate VISOR on LLaVA-1.5-7B across three critical alignment tasks: refusal, sycophancy and survival instinct. 
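The idea of crafting an input image that induces a target activation pattern can be illustrated with a toy sketch. This is not the VISOR implementation: a single fixed linear map stands in for the VLM, plain projected gradient descent is an assumed optimizer, and the names `activations`, `optimize_steering_image`, `WEIGHTS`, and `TARGET` are hypothetical.

```python
def activations(weights, image):
    """Toy stand-in for a model layer: one linear map over flattened pixels."""
    return [sum(w * p for w, p in zip(row, image)) for row in weights]


def loss(weights, image, target):
    """Squared distance between induced and target activations."""
    return sum((a - t) ** 2 for a, t in zip(activations(weights, image), target))


def optimize_steering_image(weights, target, image, lr=0.1, steps=200):
    """Projected gradient descent on the pixels (assumed optimizer)."""
    image = list(image)
    for _ in range(steps):
        residual = [a - t for a, t in zip(activations(weights, image), target)]
        for j in range(len(image)):
            # d(loss)/d(pixel_j) = 2 * sum_i residual_i * weights[i][j]
            grad_j = 2.0 * sum(r * row[j] for r, row in zip(residual, weights))
            # Keep every pixel inside the valid [0, 1] range.
            image[j] = min(1.0, max(0.0, image[j] - lr * grad_j))
    return image


# Hypothetical toy setup: 3 "pixels", 2 activation units.
WEIGHTS = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
TARGET = [0.7, 0.3]
start = [0.5, 0.5, 0.5]
steered = optimize_steering_image(WEIGHTS, TARGET, start)
```

In a real VLM the activations would come from a forward pass and the gradient from automatic differentiation; the sketch only shows that the image itself is the optimized variable while the model stays untouched.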
A single 150KB steering image matches steering vector performance within 1-2% for positive behavioral shifts while dramatically exceeding it for negative steering--achieving up to 25% shifts from baseline compared to steering vectors' modest changes. Unlike system prompting (3-4% shifts), VISOR provides robust bidirectional control while maintaining 99.9% performance on 14,000 unrelated MMLU tasks. Beyond eliminating runtime overhead and model access requirements, VISOR exposes a critical security vulnerability: adversaries can achieve sophisticated behavioral manipulation through visual channels alone, bypassing text-based defenses. Our work fundamentally re-imagines multimodal model control and highlights the urgent need for defenses against visual steering attacks.
[72] Training Kindai OCR with parallel textline images and self-attention feature distance-based loss
Anh Le,Asanobu Kitamoto
Main category: cs.CV
TL;DR: This research improves OCR accuracy for historical Kindai documents by using parallel textline images and a distance-based objective function, effectively reducing transcription effort and enhancing representation learning.
Details
Motivation: The motivation behind this research is to address the labor-intensive transcription of Kindai documents, which limits annotated data for OCR training due to data scarcity. Method: The researchers used parallel textline images (original Kindai text paired with contemporary Japanese fonts) and introduced a distance-based objective function to minimize the gap between self-attention features. They employed Euclidean distance and Maximum Mean Discrepancy (MMD) as domain adaptation metrics in their experiments. Result: The method reduced the character error rate (CER) by 2.23% using Euclidean distance and by 3.94% using MMD compared to a Transformer-based OCR baseline. It also enhanced the discriminative quality of self-attention representations. Conclusion: The study concludes that leveraging parallel textline images and using a distance-based objective function significantly improves OCR performance for historical Kindai documents by reducing character error rates and enhancing self-attention representation discrimination. Abstract: Kindai documents, written in modern Japanese from the late 19th to early 20th century, hold significant historical value for researchers studying societal structures, daily life, and environmental conditions of that period. However, transcribing these documents remains a labor-intensive and time-consuming task, resulting in limited annotated data for training optical character recognition (OCR) systems. This research addresses this challenge of data scarcity by leveraging parallel textline images - pairs of original Kindai text and their counterparts in contemporary Japanese fonts - to augment training datasets. We introduce a distance-based objective function that minimizes the gap between self-attention features of the parallel image pairs. Specifically, we explore Euclidean distance and Maximum Mean Discrepancy (MMD) as domain adaptation metrics. 
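The two distance measures named above can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: feature vectors are plain lists, the MMD uses a Gaussian kernel with a hand-picked bandwidth, and the weighted sum in `combined_loss` (including the `weight` parameter) is a hypothetical way to attach the distance term to a recognition loss.

```python
import math


def mean_euclidean(feats_a, feats_b):
    """Mean Euclidean distance between paired feature vectors."""
    return sum(math.dist(x, y) for x, y in zip(feats_a, feats_b)) / len(feats_a)


def gaussian_kernel(x, y, sigma):
    """Gaussian (RBF) kernel between two vectors."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2(feats_a, feats_b, sigma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy."""
    k_aa = sum(gaussian_kernel(x, y, sigma) for x in feats_a for y in feats_a)
    k_bb = sum(gaussian_kernel(x, y, sigma) for x in feats_b for y in feats_b)
    k_ab = sum(gaussian_kernel(x, y, sigma) for x in feats_a for y in feats_b)
    n, m = len(feats_a), len(feats_b)
    return k_aa / (n * n) + k_bb / (m * m) - 2.0 * k_ab / (n * m)


def combined_loss(recognition_loss, feats_kindai, feats_modern,
                  weight=0.1, metric="mmd"):
    """Recognition loss plus a weighted feature-gap term (assumed form)."""
    if metric == "mmd":
        gap = mmd2(feats_kindai, feats_modern)
    elif metric == "euclidean":
        gap = mean_euclidean(feats_kindai, feats_modern)
    else:
        raise ValueError(f"unknown metric: {metric}")
    return recognition_loss + weight * gap


# Toy self-attention features for a Kindai textline and its modern-font pair.
kindai = [[0.0, 0.0], [1.0, 1.0]]
modern = [[3.0, 3.0], [4.0, 4.0]]
```

Euclidean distance compares features pair by pair, whereas MMD compares the two feature sets as distributions, so it does not need a one-to-one correspondence between vectors.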
Experimental results demonstrate that our method reduces the character error rate (CER) by 2.23% and 3.94% over a Transformer-based OCR baseline when using Euclidean distance and MMD, respectively. Furthermore, our approach improves the discriminative quality of self-attention representations, leading to more effective OCR performance for historical documents.
[73] Calibration Attention: Instance-wise Temperature Scaling for Vision Transformers
Wenhao Liang,Wei Emma Zhang,Lin Yue,Miao Xu,Olaf Maennel,Weitong Chen
Main category: cs.CV
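TL;DR: This paper proposes CalAttn, a probability-calibration module for Vision Transformers that learns an adaptive per-instance temperature from the ViT's CLS token, yielding more reliable probabilities without sacrificing accuracy; the method is simple, efficient, and architecture-agnostic.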
Details
Motivation: Probability calibration is critical when Vision Transformers are deployed in risk-sensitive applications. The standard approach, post-hoc temperature scaling, uses a single global scalar and requires a held-out validation set, which motivated CalAttn, a module that learns an adaptive per-instance temperature directly from the ViT's CLS token. Method: The paper introduces CalAttn, a module that calibrates probabilities by learning a per-instance temperature from the ViT's CLS token, unlike standard post-hoc temperature scaling, which needs a separate validation set. Result: On CIFAR-10/100, MNIST, Tiny-ImageNet, and ImageNet-1K, CalAttn reduces calibration error by up to 4x on ViT-224, DeiT, and Swin while adding under 0.1% additional parameters. The learned temperatures cluster tightly around 1.0, in contrast to the larger global values used by standard temperature scaling. Conclusion: CalAttn is a simple, efficient, and architecture-agnostic module that learns an adaptive per-instance temperature from the ViT's CLS token, yielding more reliable probabilities without sacrificing accuracy. Abstract: Probability calibration is critical when Vision Transformers are deployed in risk-sensitive applications. The standard fix, post-hoc temperature scaling, uses a single global scalar and requires a held-out validation set. We introduce Calibration Attention (CalAttn), a drop-in module that learns an adaptive, per-instance temperature directly from the ViT's CLS token. Across CIFAR-10/100, MNIST, Tiny-ImageNet, and ImageNet-1K, CalAttn reduces calibration error by up to 4x on ViT-224, DeiT, and Swin, while adding under 0.1 percent additional parameters. The learned temperatures cluster tightly around 1.0, in contrast to the large global values used by standard temperature scaling. CalAttn is simple, efficient, and architecture-agnostic, and yields more trustworthy probabilities without sacrificing accuracy. Code: https://github.com/EagleAdelaide/CalibrationAttention-CalAttn-
[74] Boosting Generic Semi-Supervised Medical Image Segmentation via Diverse Teaching and Label Propagation
Wei Li,Pengcheng Zhou,Linye Ma,Wenyi Zhao,Huihua Yang
Main category: cs.CV
TL;DR: This paper proposes a Diverse Teaching and Label Propagation Network (DTLP-Net) to address challenges in semi-supervised medical image segmentation, achieving significant improvements over existing methods across multiple tasks.
Details
Motivation: The motivation is to overcome challenges in medical image segmentation such as limited annotation and domain shift, which lead to scenarios like semi-supervised medical image segmentation (SSMIS), semi-supervised medical domain generalization (Semi-MDG), and unsupervised medical domain adaptation (UMDA). Method: The authors propose a Diverse Teaching and Label Propagation Network (DTLP-Net) involving a student model and two diverse teacher models, along with inter-sample and intra-sample data augmentation and label propagation techniques. Result: The results show notable improvements compared to state-of-the-art methods across all five benchmark dataset settings, indicating the framework's potential for tackling challenging SSL scenarios. Conclusion: The paper concludes that the proposed DTLP-Net framework shows notable improvements in tackling semi-supervised medical image segmentation challenges across various scenarios. Abstract: Both limited annotation and domain shift are significant challenges frequently encountered in medical image segmentation, leading to derivative scenarios like semi-supervised medical image segmentation (SSMIS), semi-supervised medical domain generalization (Semi-MDG) and unsupervised medical domain adaptation (UMDA). Conventional methods are generally tailored to specific tasks in isolation, and error accumulation hinders the effective utilization of unlabeled data and limits further improvements, resulting in suboptimal performance when these issues occur. In this paper, we aim to develop a generic framework that masters all three tasks. We found that the key to solving the problem lies in how to generate reliable pseudo-labels for the unlabeled data in the presence of domain shift relative to the labeled data, and how to increase the diversity of the model. To tackle this issue, we employ a Diverse Teaching and Label Propagation Network (DTLP-Net) to boost generic semi-supervised medical image segmentation.
Our DTLP-Net involves a single student model and two diverse teacher models, which can generate reliable pseudo-labels for the student model. The first teacher model decouples the training process for labeled and unlabeled data. The second teacher is momentum-updated periodically, thus generating reliable yet diverse pseudo-labels. To fully utilize the information within the data, we adopt inter-sample and intra-sample data augmentation to learn the global and local knowledge. In addition, to further capture the voxel-level correlations, we propose label propagation to enhance the model's robustness. We evaluate our proposed framework on five benchmark datasets for SSMIS, UMDA, and Semi-MDG tasks. The results showcase notable improvements compared to state-of-the-art methods across all five settings, indicating the potential of our framework to tackle more challenging SSL scenarios.
[75] Unlocking the Potential of Diffusion Priors in Blind Face Restoration
Yunqi Miao,Zhiyu Qu,Mingqi Gao,Changrui Chen,Jifei Song,Jungong Han,Jiankang Deng
Main category: cs.CV
TL;DR: FLIPNET bridges the gap between diffusion models and blind face restoration by using dual modes for better restoration and realistic degradation synthesis.
Details
Motivation: The vanilla diffusion model faces an inherent gap when applied to blind face restoration (BFR) due to discrepancies between high-quality/synthesized images and real-world low-quality images. This work aims to bridge that gap. Method: A unified network called FLIPNET that operates in two modes: Restoration mode integrates BFR-oriented features and face embeddings for restoration, while Degradation mode synthesizes real-world-like degraded images based on learned knowledge. Result: Extensive evaluations show that FLIPNET outperforms previous diffusion-prior-based BFR methods in authenticity and fidelity, and better models real-world degradations compared to naive degradation models. Conclusion: FLIPNET effectively addresses the gaps between vanilla diffusion models and BFR settings by switching between Restoration and Degradation modes, leading to improved authenticity, fidelity, and real-world degradation modeling. Abstract: Although the diffusion prior is emerging as a powerful solution for blind face restoration (BFR), the inherent gap between the vanilla diffusion model and BFR settings hinders its seamless adaptation. The gap mainly stems from the discrepancy between 1) high-quality (HQ) and low-quality (LQ) images and 2) synthesized and real-world images. The vanilla diffusion model is trained on images with little or no degradation, whereas BFR handles moderately to severely degraded images. Additionally, LQ images used for training are synthesized by a naive degradation model with limited degradation patterns, which fails to simulate complex and unknown degradations in real-world scenarios. In this work, we use a unified network FLIPNET that switches between two modes to resolve specific gaps. In Restoration mode, the model gradually integrates BFR-oriented features and face embeddings from LQ images to achieve authentic and faithful face restoration.
In Degradation mode, the model synthesizes real-world-like degraded images based on the knowledge learned from real-world degradation datasets. Extensive evaluations on benchmark datasets show that our model 1) outperforms previous diffusion-prior-based BFR methods in terms of authenticity and fidelity, and 2) outperforms the naive degradation model in modeling the real-world degradations.
[76] Think as Cardiac Sonographers: Marrying SAM with Left Ventricular Indicators Measurements According to Clinical Guidelines
Tuo Liu,Qinghan Yang,Yu Zhang,Rongjun Ge,Yang Chen,Guangquan Zhou
Main category: cs.CV
TL;DR: AutoSAME improves LV indicator measurements in echocardiography by combining SAM with segmentation and landmark localization tasks, using FCBA and SGPA for enhanced accuracy.
Details
Motivation: Existing algorithms struggle with automated LV quantification due to small datasets and limitations in identifying critical anatomical points, necessitating the use of vision foundational models. Method: AutoSAME integrates filtered cross-branch attention (FCBA) and spatial-guided prompt alignment (SGPA) to enhance heatmap regression and prompt embeddings for accurate LV segmentation and landmark localization. Result: Experiments show that AutoSAME outperforms existing methods in LV segmentation, landmark localization, and indicator measurements on echocardiography datasets. Conclusion: The proposed AutoSAME framework effectively combines the visual understanding of SAM with segmentation and landmark localization tasks, enhancing LV indicator measurements in clinical echocardiography. Abstract: Left ventricular (LV) indicator measurements following clinical echocardiography guidelines are important for diagnosing cardiovascular disease. Although existing algorithms have explored automated LV quantification, they can struggle to capture generic visual representations due to the normally small training datasets. Therefore, it is necessary to introduce vision foundational models (VFM) with abundant knowledge. However, VFMs represented by the segment anything model (SAM) are usually suitable for segmentation but incapable of identifying key anatomical points, which are critical in LV indicator measurements. In this paper, we propose a novel framework named AutoSAME, combining the powerful visual understanding of SAM with segmentation and landmark localization tasks simultaneously. Consequently, the framework mimics the operation of cardiac sonographers, achieving LV indicator measurements consistent with clinical guidelines.
We further present filtered cross-branch attention (FCBA) in AutoSAME, which leverages relatively comprehensive features in the segmentation to enhance the heatmap regression (HR) of key points from the frequency domain perspective, optimizing the visual representation learned by the latter. Moreover, we propose spatial-guided prompt alignment (SGPA) to automatically generate prompt embeddings guided by spatial properties of LV, thereby improving the accuracy of dense predictions by prior spatial knowledge. The extensive experiments on an echocardiography dataset demonstrate the efficiency of each design and the superiority of our AutoSAME in LV segmentation, landmark localization, and indicator measurements. The code will be available at https://github.com/QC-LIU-1997/AutoSAME.
[77] Superclass-Guided Representation Disentanglement for Spurious Correlation Mitigation
Chenruo Liu,Hongjun Liu,Zeyu Lai,Yiqiu Shen,Chen Zhao,Qi Lei
Main category: cs.CV
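TL;DR: This paper proposes a method that exploits the semantic structure inherent in class labels, specifically superclass information, to improve group robustness, removing the unnatural and impractical requirements of existing methods for auxiliary annotations and identical group sets.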
Details
Motivation: Existing methods rely on auxiliary annotations for groups or spurious features and assume identical sets of groups in the source and target domains, which is unnatural and impractical in real-world settings. The authors therefore seek a new method that needs no annotation of source samples. Method: The method exploits the semantic structure inherent in class labels, using gradient-based attention guided by a pre-trained vision-language model to disentangle superclass-relevant from superclass-irrelevant features, and promotes the use of all superclass-relevant features for prediction to improve robustness. Result: Experiments across diverse datasets show that the method significantly outperforms baselines on domain generalization tasks, with clear improvements in both quantitative metrics and qualitative visualizations. Conclusion: By exploiting the semantic structure of class labels, the proposed method effectively reduces reliance on spurious features and improves generalization across domains without annotating any source samples. Abstract: To enhance group robustness to spurious correlations, prior work often relies on auxiliary annotations for groups or spurious features and assumes identical sets of groups across source and target domains. These two requirements are both unnatural and impractical in real-world settings. To overcome these limitations, we propose a method that leverages the semantic structure inherent in class labels--specifically, superclass information--to naturally reduce reliance on spurious features. Our model employs gradient-based attention guided by a pre-trained vision-language model to disentangle superclass-relevant and irrelevant features. Then, by promoting the use of all superclass-relevant features for prediction, our approach achieves robustness to more complex spurious correlations without the need to annotate any source samples. Experiments across diverse datasets demonstrate that our method significantly outperforms baselines in domain generalization tasks, with clear improvements in both quantitative metrics and qualitative visualizations.
[78] RealisMotion: Decomposed Human Motion Control and Video Generation in the World Space
Jingyun Liang,Jingkai Zhou,Shikai Li,Chenjie Cao,Lei Sun,Yichen Qian,Weihua Chen,Fan Wang
Main category: cs.CV
TL;DR: This paper introduces a new framework for generating realistic human videos with separate control over foreground, background, trajectory, and action, achieving superior performance in both controllability and quality.
Details
Motivation: Existing methods for generating human videos lack separate control over four key elements: foreground subject, background video, human trajectory, and action patterns. The goal is to enable flexible mix-and-match composition of these elements. Method: A decomposed human motion control and video generation framework that decouples motion from appearance, subject from background, and action from trajectory, using a ground-aware 3D world coordinate system and motion editing in 3D space, combined with text-to-video diffusion transformer models. Result: The method achieves state-of-the-art performance on both element-wise controllability and overall video quality, as demonstrated by extensive experiments on benchmark datasets and real-world cases. Conclusion: The proposed framework allows for flexible and separate control over key video elements, achieving state-of-the-art performance in controllability and video quality. Abstract: Generating human videos with realistic and controllable motions is a challenging task. While existing methods can generate visually compelling videos, they lack separate control over four key video elements: foreground subject, background video, human trajectory and action patterns. In this paper, we propose a decomposed human motion control and video generation framework that explicitly decouples motion from appearance, subject from background, and action from trajectory, enabling flexible mix-and-match composition of these elements. Concretely, we first build a ground-aware 3D world coordinate system and perform motion editing directly in the 3D space. Trajectory control is implemented by unprojecting edited 2D trajectories into 3D with focal-length calibration and coordinate transformation, followed by speed alignment and orientation adjustment; actions are supplied by a motion bank or generated via text-to-motion methods. 
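The unprojection step can be illustrated with a pinhole-camera sketch. The geometry here is an assumption made for the example, not the paper's calibration procedure: the camera looks horizontally, sits `cam_height` metres above a flat ground plane, and uses a frame with x to the right, y down, and z forward. The function names and all parameter values are hypothetical.

```python
def unproject_to_ground(u, v, focal, cx, cy, cam_height):
    """Intersect the viewing ray of pixel (u, v) with the ground plane.

    Assumed camera frame: x right, y down, z forward, so the ground is the
    plane y = cam_height. `focal` is the focal length in pixels and
    (cx, cy) is the principal point.
    """
    dir_x = (u - cx) / focal
    dir_y = (v - cy) / focal
    if dir_y <= 0:
        raise ValueError("pixel lies at or above the horizon; no ground hit")
    depth = cam_height / dir_y  # z coordinate where the ray meets the ground
    return (dir_x * depth, cam_height, depth)


def unproject_trajectory(points_2d, focal, cx, cy, cam_height):
    """Lift an edited 2D trajectory (list of pixels) to 3D ground points."""
    return [unproject_to_ground(u, v, focal, cx, cy, cam_height)
            for u, v in points_2d]


# Hypothetical 1920x1080 camera, 1000 px focal length, 1.5 m above ground.
trajectory_3d = unproject_trajectory(
    [(960, 1040), (1460, 1040)], focal=1000, cx=960, cy=540, cam_height=1.5
)
```

Both pixels share the same image row, so they land at the same depth (3 m) and differ only in their sideways position. A full pipeline would additionally apply a camera-to-world rotation and translation, followed by the speed alignment and orientation adjustment mentioned in the abstract.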
Then, based on modern text-to-video diffusion transformer models, we inject the subject as tokens for full attention, concatenate the background along the channel dimension, and add motion (trajectory and action) control signals by addition. Such a design opens up the possibility for us to generate realistic videos of anyone doing anything anywhere. Extensive experiments on benchmark datasets and real-world cases demonstrate that our method achieves state-of-the-art performance on both element-wise controllability and overall video quality.
[79] DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
Wenwen Yu,Zhibo Yang,Yuliang Liu,Xiang Bai
Main category: cs.CV
TL;DR: DocThinker proposes a rule-based reinforcement learning framework to improve the explainability and adaptability of the reasoning process in MLLM-based document understanding.
Details
Motivation: Existing methods use fixed Chain-of-Thought (CoT) reasoning with supervised fine-tuning (SFT) and suffer from catastrophic forgetting, poor adaptability, and limited cross-domain generalization. A new approach is therefore needed to improve reliability and transparency. Method: DocThinker is a rule-based reinforcement learning (RL) framework that autonomously refines reasoning strategies through policy learning instead of static CoT templates. It integrates multi-objective rule-based rewards with KL-constrained optimization to mitigate catastrophic forgetting and improve adaptability and transparency. Result: In experiments on multiple benchmarks, DocThinker substantially improves generalization and produces explainable, human-understandable reasoning steps. Conclusion: DocThinker demonstrates the potential of RL as a powerful alternative for improving explainability and adaptability in MLLM-based document understanding. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in document understanding. However, their reasoning processes remain largely black-box, making it difficult to ensure reliability and trustworthiness, especially in high-stakes domains such as legal, financial, and medical document analysis. Existing methods use fixed Chain-of-Thought (CoT) reasoning with supervised fine-tuning (SFT) but suffer from catastrophic forgetting, poor adaptability, and limited generalization across domain tasks. In this paper, we propose DocThinker, a rule-based Reinforcement Learning (RL) framework for dynamic inference-time reasoning. Instead of relying on static CoT templates, DocThinker autonomously refines reasoning strategies via policy learning, generating explainable intermediate results, including structured reasoning processes, rephrased questions, regions of interest (RoI) supporting the answer, and the final answer. By integrating multi-objective rule-based rewards and KL-constrained optimization, our method mitigates catastrophic forgetting and enhances both adaptability and transparency. Extensive experiments on multiple benchmarks demonstrate that DocThinker significantly improves generalization while producing more explainable and human-understandable reasoning steps. Our findings highlight RL as a powerful alternative for enhancing explainability and adaptability in MLLM-based document understanding.
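A rule-based, multi-objective reward with a KL penalty might look roughly like the sketch below. Everything specific in it is an assumption for illustration: the tag-based output schema (`<think>`, `<rephrase>`, `<roi>`, `<answer>`), the three equally weighted reward terms, the `beta` coefficient, and the single-sample KL estimate are hypothetical and are not taken from the DocThinker paper or code.

```python
import re

# Hypothetical output schema: one tag per explainable intermediate result.
REQUIRED_TAGS = ("think", "rephrase", "roi", "answer")


def extract(tag, text):
    """Return the content of <tag>...</tag>, or None if the tag is missing."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def rule_based_reward(output, gold_answer, gold_box):
    """Sum of format, answer-match, and RoI-overlap rewards (assumed terms)."""
    fields = {tag: extract(tag, output) for tag in REQUIRED_TAGS}
    format_reward = 1.0 if all(v is not None for v in fields.values()) else 0.0
    answer = fields["answer"]
    answer_reward = 1.0 if answer is not None and answer.lower() == gold_answer.lower() else 0.0
    roi_reward = 0.0
    if fields["roi"] is not None:
        try:
            box = tuple(float(x) for x in fields["roi"].split(","))
            if len(box) == 4:
                roi_reward = box_iou(box, gold_box)
        except ValueError:
            pass  # malformed coordinates earn no RoI reward
    return format_reward + answer_reward + roi_reward


def kl_penalized_objective(reward, logp_policy, logp_reference, beta=0.05):
    """Reward minus a single-sample KL estimate against a reference model."""
    return reward - beta * (logp_policy - logp_reference)


sample = ("<think>total is in the last row</think><rephrase>What is the total?"
          "</rephrase><roi>0,0,10,10</roi><answer>42</answer>")
```

Because each term is computed by a fixed rule rather than a learned reward model, the score is cheap to evaluate and easy to audit; the KL term discourages the policy from drifting far from the reference model, which is one common way to limit forgetting.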
Code will be available at https://github.com/wenwenyu/DocThinker.
[80] QueryCraft: Transformer-Guided Query Initialization for Enhanced Human-Object Interaction Detection
Yuxiao Wang,Wolin Liang,Yu Lei,Weiying Xue,Nan Zhuang,Qi Liu
Main category: cs.CV
TL;DR: QueryCraft improves HOI detection by incorporating semantic priors and guided feature learning through transformer-based query initialization, achieving state-of-the-art results.
Details
Motivation: DETR-based methods for HOI detection suffer from suboptimal performance due to randomly initialized queries lacking explicit semantics. Method: QueryCraft uses ACTOR, a cross-modal Transformer encoder, to extract action-relevant features by jointly attending to visual regions and textual prompts. Additionally, PDQD distills object category awareness from a pre-trained detector to improve object query initiation. Result: Extensive experiments on HICO-Det and V-COCO benchmarks demonstrate that the proposed method achieves state-of-the-art performance and strong generalization. Conclusion: The proposed QueryCraft framework with ACTOR and PDQD achieves state-of-the-art performance and strong generalization in HOI detection. Abstract: Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions in images. Although DETR-based methods have recently emerged as the mainstream framework for HOI detection, they still suffer from a key limitation: Randomly initialized queries lack explicit semantics, leading to suboptimal detection performance. To address this challenge, we propose QueryCraft, a novel plug-and-play HOI detection framework that incorporates semantic priors and guided feature learning through transformer-based query initialization. Central to our approach is ACTOR (Action-aware Cross-modal TransfORmer), a cross-modal Transformer encoder that jointly attends to visual regions and textual prompts to extract action-relevant features. Rather than merely aligning modalities, ACTOR leverages language-guided attention to infer interaction semantics and produce semantically meaningful query representations. To further enhance object-level query quality, we introduce a Perceptual Distilled Query Decoder (PDQD), which distills object category awareness from a pre-trained detector to serve as object query initiation.
This dual-branch query initialization enables the model to generate more interpretable and effective queries for HOI detection. Extensive experiments on HICO-Det and V-COCO benchmarks demonstrate that our method achieves state-of-the-art performance and strong generalization. Code will be released upon publication.
[81] Yan: Foundational Interactive Video Generation
Yan Team
Main category: cs.CV
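TL;DR: Yan is a foundational framework for interactive video generation comprising three core modules (simulation, generation, and editing) that enables real-time, high-definition interactive video creation.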
Details
Motivation: To advance interactive video generation toward high-quality, real-time, editable AI-driven creation tools. Method: The framework uses a highly compressed, low-latency 3D-VAE with a KV-cache-based shift-window denoising inference process, a hierarchical autoregressive caption method combined with multi-modal video diffusion models (VDMs), and a hybrid model for multi-granularity editing. Result: Yan achieves real-time 1080P/60FPS interactive video generation, supports cross-domain style blending, and allows multi-granularity editing of video content through text. Conclusion: Yan provides an integrated framework for interactive video generation, advancing AI applications in creative tools, media, and entertainment. Abstract: We present Yan, a foundational framework for interactive video generation, covering the entire pipeline from simulation and generation to editing. Specifically, Yan comprises three core modules. AAA-level Simulation: We design a highly-compressed, low-latency 3D-VAE coupled with a KV-cache-based shift-window denoising inference process, achieving real-time 1080P/60FPS interactive simulation. Multi-Modal Generation: We introduce a hierarchical autoregressive caption method that injects game-specific knowledge into open-domain multi-modal video diffusion models (VDMs), and then transform the VDM into a frame-wise, action-controllable, real-time infinite interactive video generator. Notably, when the textual and visual prompts are sourced from different domains, the model demonstrates strong generalization, allowing it to blend and compose the style and mechanics across domains flexibly according to user prompts. Multi-Granularity Editing: We propose a hybrid model that explicitly disentangles interactive mechanics simulation from visual rendering, enabling multi-granularity video content editing during interaction through text. Collectively, Yan offers an integration of these modules, pushing interactive video generation beyond isolated capabilities toward a comprehensive AI-driven interactive creation paradigm, paving the way for the next generation of creative tools, media, and entertainment. The project page is: https://greatx3.github.io/Yan/.
[82] Transferable Model-agnostic Vision-Language Model Adaptation for Efficient Weak-to-Strong Generalization
Jihwan Park,Taehoon Song,Sanghyeok Lee,Miso Choi,Hyunwoo J. Kim
Main category: cs.CV
TL;DR: TransMiter is a lightweight, model-agnostic adapter that efficiently transfers adaptation knowledge across VLMs without backpropagation, enhancing performance in visual recognition tasks.
Details
Motivation: As VLMs grow in size and complexity, fine-tuning becomes costly. Existing adaptation transfer methods are limited due to model-specific design and high computational demands. Method: TransMiter uses an unsupervised, lightweight adapter approach to capture the knowledge gap between pre-trained and fine-tuned VLMs, enabling transfer without backpropagation. Result: TransMiter improves VLMs with minimal inference cost and marginal training cost, often surpassing fine-tuned stronger models, especially when supplemented with a few labeled data. Conclusion: TransMiter proves to be an effective and efficient method for transferring adaptation knowledge across different VLMs, maintaining generalization abilities in visual recognition tasks. Abstract: Vision-Language Models (VLMs) have been widely used in various visual recognition tasks due to their remarkable generalization capabilities. As these models grow in size and complexity, fine-tuning becomes costly, emphasizing the need to reuse adaptation knowledge from 'weaker' models to efficiently enhance 'stronger' ones. However, existing adaptation transfer methods exhibit limited transferability across models due to their model-specific design and high computational demands. To tackle this, we propose Transferable Model-agnostic adapter (TransMiter), a light-weight adapter that improves vision-language models 'without backpropagation'. TransMiter captures the knowledge gap between pre-trained and fine-tuned VLMs, in an 'unsupervised' manner. Once trained, this knowledge can be seamlessly transferred across different models without the need for backpropagation. Moreover, TransMiter consists of only a few layers, inducing a negligible additional inference cost. Notably, supplementing the process with a few labeled data further yields additional performance gain, often surpassing a fine-tuned stronger model, with a marginal training cost. 
Experimental results and analyses demonstrate that TransMiter effectively and efficiently transfers adaptation knowledge while preserving generalization abilities across VLMs of different sizes and architectures in visual recognition tasks.
[83] SelfHVD: Self-Supervised Handheld Video Deblurring for Mobile Phones
Honglei Xu,Zhilu Zhang,Junjie Fan,Xiaohe Wu,Wangmeng Zuo
Main category: cs.CV
TL;DR: The paper introduces a self-supervised video deblurring method tailored for handheld mobile phone videos, overcoming the domain gap issue with novel training strategies and model regularization, leading to superior performance over existing approaches.
Details
Motivation: The motivation stems from the problem of blurry video frames caused by handheld instability and the limitations of existing video deblurring methods, which perform poorly on real-world handheld videos due to the domain gap between training and testing data. Method: The authors propose a self-supervised video deblurring approach using sharp clues from videos to create misalignment labels. They also introduce a Self-Enhanced Video Deblurring (SEVD) method for generating higher-quality paired data and a Self-Constrained Spatial Consistency Maintenance (SCSCM) method to regularize the model. They also constructed new synthetic and real-world datasets for evaluation. Result: The proposed method significantly outperforms existing self-supervised video deblurring techniques on both newly created and commonly used real-world datasets. The approach improves model performance and maintains spatial consistency between input and output frames. Conclusion: The proposed self-supervised method, which includes SEVD and SCSCM, significantly outperforms existing self-supervised approaches for handheld video deblurring, as demonstrated by experiments on newly constructed and existing datasets. Abstract: Shooting video with a handheld mobile phone, the most common photographic device, often results in blurry frames due to shaking hands and other instability factors. Although previous video deblurring methods have achieved impressive progress, they still struggle to perform satisfactorily on real-world handheld video due to the blur domain gap between training and testing data. To address the issue, we propose a self-supervised method for handheld video deblurring, which is driven by sharp clues in the video. First, to train the deblurring model, we extract the sharp clues from the video and take them as misalignment labels of neighboring blurry frames. 
Second, to improve the model's ability, we propose a novel Self-Enhanced Video Deblurring (SEVD) method to create higher-quality paired video data. Third, we propose a Self-Constrained Spatial Consistency Maintenance (SCSCM) method to regularize the model, preventing position shifts between the output and input frames. Moreover, we construct a synthetic and a real-world handheld video dataset for handheld video deblurring. Extensive experiments on these two and other common real-world datasets demonstrate that our method significantly outperforms existing self-supervised ones. The code and datasets are publicly available at https://github.com/cshonglei/SelfHVD.[84] Neural Artistic Style and Color Transfer Using Deep Learning
Justin London
Main category: cs.CV
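TL;DR: This paper introduces a new method that combines neural artistic style with color transfer, and uses the KL divergence to evaluate the effectiveness of different color transfer algorithms.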
Details
Motivation: Color transfer algorithms are important in digital image processing: they adjust the color information of a target image to match a source image, which enhances the quality of images and videos. Method: The Kullback-Leibler (KL) divergence is used to quantitatively evaluate color transfer algorithms including Reinhard global color transfer, iterative distribution transfer (IDT), IDT with regrain, Cholesky, and PCA. Result: Various experiments evaluate the KL divergence of these algorithms and their color histograms in style-to-content transfer. Conclusion: The paper proposes a method that combines neural artistic style with color transfer and uses the KL divergence to evaluate the performance of different color transfer algorithms. Abstract: Neural artistic style transfers and blends the content and style representation of one image with the style of another. This enables artists to create unique innovative visuals and enhances artistic expression in various fields including art, design, and film. Color transfer algorithms are important in digital image processing, adjusting the color information in a target image based on the colors in the source image. Color transfer enhances images and videos in film and photography, and can aid in image correction. We introduce a methodology that combines neural artistic style with color transfer. The method uses the Kullback-Leibler (KL) divergence to quantitatively evaluate color and luminance histogram matching algorithms including Reinhard global color transfer, iterative distribution transfer (IDT), IDT with regrain, Cholesky, and PCA between the original and neural artistic style transferred image using deep learning. We estimate the color channel kernel densities. Various experiments are performed to evaluate the KL of these algorithms and their color histograms for style to content transfer.[85] Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation
Jiahua Dong,Hui Yin,Wenqi Liang,Hanbin Zhao,Henghui Ding,Nicu Sebe,Salman Khan,Fahad Shahbaz Khan
Main category: cs.CV
TL;DR: This paper proposes the HVPL model for video instance segmentation, which effectively addresses catastrophic forgetting at both frame and video levels, outperforming baseline methods.
Details
Motivation: The authors aim to solve the problem where existing video instance segmentation (VIS) methods assume fixed object categories over time and suffer from catastrophic forgetting when learning new categories. Method: The paper proposes a Hierarchical Visual Prompt Learning (HVPL) model that includes a task-specific frame prompt, an orthogonal gradient correction (OGC) module, a task-specific video prompt, and a video context decoder to address catastrophic forgetting at both frame and video levels. Result: The HVPL model demonstrates more effective performance compared to baseline approaches in handling catastrophic forgetting. Conclusion: The HVPL model effectively addresses the issue of catastrophic forgetting in video instance segmentation, showing superior performance over baseline approaches. Abstract: Video instance segmentation (VIS) has gained significant attention for its capability in tracking and segmenting object instances across video frames. However, most of the existing VIS approaches unrealistically assume that the categories of object instances remain fixed over time. Moreover, they experience catastrophic forgetting of old classes when required to continuously learn object instances belonging to new categories. To resolve these challenges, we develop a novel Hierarchical Visual Prompt Learning (HVPL) model that overcomes catastrophic forgetting of previous categories from both frame-level and video-level perspectives. Specifically, to mitigate forgetting at the frame level, we devise a task-specific frame prompt and an orthogonal gradient correction (OGC) module. The OGC module helps the frame prompt encode task-specific global instance information for new classes in each individual frame by projecting its gradients onto the orthogonal feature space of old classes. Furthermore, to address forgetting at the video level, we design a task-specific video prompt and a video context decoder. 
This decoder first embeds structural inter-class relationships across frames into the frame prompt features, and then propagates task-specific global video contexts from the frame prompt features to the video prompt. Through rigorous comparisons, our HVPL model proves to be more effective than baseline approaches. The code is available at https://github.com/JiahuaDong/HVPL.[86] AME: Aligned Manifold Entropy for Robust Vision-Language Distillation
Guiming Cao,Yuming Ou
Main category: cs.CV
TL;DR: The paper proposes AME, a method for robust vision-language knowledge distillation that works well even with limited data by applying entropy minimization over a shared manifold.
Details
Motivation: Vision-language knowledge distillation requires large amounts of data to generalize well, which is often impractical in real-world scenarios. Method: The paper proposes Aligned Manifold Entropy (AME), which applies entropy minimization over a reconfigured shared manifold to achieve robust knowledge distillation. Result: AME enables robust knowledge distillation under low-data regimes without modifying the backbone architecture and achieves better generalization performance across various downstream tasks. Conclusion: AME can serve as a plug-and-play module compatible with a wide range of vision-language distillation frameworks and results in superior generalization performance. Abstract: Knowledge distillation is a long-established technique for knowledge transfer, and has regained attention in the context of the recent emergence of large vision-language models (VLMs). However, vision-language knowledge distillation often requires sufficient training data to achieve robust generalization on samples with ambiguous or boundary-adjacent representations, which are associated with high predictive uncertainty. Critically, collecting such large-scale, task-specific data for training is often impractical in real-world scenarios. To address this major challenge arising from the entanglement of uncertainty and cross-modal feature representation, we propose Aligned Manifold Entropy for Robust Vision-Language Distillation (AME), aiming to achieve robust generalization under real-world conditions. AME applies entropy minimization over a reconfigured shared manifold, where multi-modal data (i.e., image and text) are bridged through a pair of projection functions, conducive to structural compression for cross-modal feature representations. This enables robust knowledge distillation under low-data regimes, while requiring no architectural modifications to the backbone. 
As a result, it can serve as a plug-and-play module compatible with a wide range of vision-language distillation frameworks. Notably, our theoretical analysis reveals that integrating knowledge distillation with entropy minimization over the shared manifold leads to a tighter generalization error bound. Extensive experiments across diverse distillation architectures and training settings demonstrate that AME consistently facilitates robust knowledge distillation, resulting in superior generalization performance across a wide spectrum of downstream tasks.[87] Unified and Semantically Grounded Domain Adaptation for Medical Image Segmentation
Xin Wang,Yin Guo,Jiamin Xia,Kaiyu Zhang,Niranjan Balu,Mahmud Mossa-Basha,Linda Shapiro,Chun Yuan
Main category: cs.CV
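TL;DR: This paper proposes a unified, semantically grounded unsupervised domain adaptation framework for medical image segmentation that supports both source-accessible and source-free settings and achieves state-of-the-art performance with strong interpretability.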
Details
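Motivation: Most prior unsupervised domain adaptation methods for medical image segmentation apply only to either the source-accessible or the source-free setting, and lack an explicit, structured construction of anatomical knowledge that naturally generalizes across domains and settings. Method: The paper proposes a unified, semantically grounded framework whose adaptability comes from the model architecture itself, without any handcrafted adaptation strategies. The model learns a domain-agnostic probabilistic manifold as a global space of anatomical regularities, and interprets the structural content of each image as a canonical anatomy retrieved from that manifold plus a spatial transformation capturing individual-specific geometry. Result: Experiments show that the framework's source-free performance closely approaches its source-accessible counterpart, a level of consistency rarely observed in prior work. The framework is also highly interpretable, enabling smooth shape manipulation via manifold traversal. Conclusion: By learning a domain-agnostic probabilistic manifold, the framework delivers disentangled, interpretable predictions for medical image segmentation and achieves state-of-the-art performance in both the source-accessible and source-free settings. Abstract: Most prior unsupervised domain adaptation approaches for medical image segmentation are narrowly tailored to either the source-accessible setting, where adaptation is guided by source-target alignment, or the source-free setting, which typically resorts to implicit supervision mechanisms such as pseudo-labeling and model distillation. This substantial divergence in methodological designs between the two settings reveals an inherent flaw: the lack of an explicit, structured construction of anatomical knowledge that naturally generalizes across domains and settings. To bridge this longstanding divide, we introduce a unified, semantically grounded framework that supports both source-accessible and source-free adaptation. Fundamentally distinct from all prior works, our framework's adaptability emerges naturally as a direct consequence of the model architecture, without the need for any handcrafted adaptation strategies. Specifically, our model learns a domain-agnostic probabilistic manifold as a global space of anatomical regularities, mirroring how humans establish visual understanding. Thus, the structural content in each image can be interpreted as a canonical anatomy retrieved from the manifold and a spatial transformation capturing individual-specific geometry. This disentangled, interpretable formulation enables semantically meaningful prediction with intrinsic adaptability. 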
Extensive experiments on challenging cardiac and abdominal datasets show that our framework achieves state-of-the-art results in both settings, with source-free performance closely approaching its source-accessible counterpart, a level of consistency rarely observed in prior works. Beyond quantitative improvement, we demonstrate strong interpretability of the proposed framework via manifold traversal for smooth shape manipulation.[88] Learning Generalizable and Efficient Image Watermarking via Hierarchical Two-Stage Optimization
Ke Liu,Xuanhan Wang,Qilong Zhang,Lianli Gao,Jingkuan Song
Main category: cs.CV
TL;DR: HiWL is a new deep image watermarking method that improves invisibility, robustness, and efficiency, outperforming existing methods in accuracy and speed.
Details
Motivation: Existing watermarking methods struggle to simultaneously achieve invisibility, robustness, and broad applicability. HiWL was developed to overcome these limitations. Method: HiWL uses a two-stage optimization process: distribution alignment learning to establish a common latent space and ensure invisibility and robustness, followed by generalized watermark representation learning to separate watermarks from image content in RGB space. Result: HiWL achieves 7.6% higher watermark extraction accuracy than existing methods and can process 100K images in 8 seconds. Conclusion: Hierarchical Watermark Learning (HiWL) is a highly effective method for deep image watermarking, achieving improved invisibility, robustness, and broad applicability. Abstract: Deep image watermarking, which refers to enabling imperceptible watermark embedding and reliable extraction in cover images, has been shown to be effective for copyright protection of image assets. However, existing methods face limitations in simultaneously satisfying three essential criteria for generalizable watermarking: 1) invisibility (imperceptible hiding of watermarks), 2) robustness (reliable watermark recovery under diverse conditions), and 3) broad applicability (low latency in the watermarking process). To address these limitations, we propose Hierarchical Watermark Learning (HiWL), a two-stage optimization that enables a watermarking model to simultaneously achieve all three criteria. In the first stage, distribution alignment learning is designed to establish a common latent space with two constraints: 1) visual consistency between watermarked and non-watermarked images, and 2) information invariance across watermark latent representations. In this way, multi-modal inputs including the watermark message (binary codes) and cover images (RGB pixels) can be well represented, thereby ensuring the invisibility of watermarks and robustness in the watermarking process. 
The second stage employs generalized watermark representation learning to establish a disentanglement policy for separating watermarks from image content in RGB space. In particular, it strongly penalizes substantial fluctuations in separated RGB watermarks corresponding to identical messages. Consequently, HiWL effectively learns generalizable latent-space watermark representations while maintaining broad applicability. Extensive experiments demonstrate the effectiveness of the proposed method. In particular, it achieves 7.6% higher accuracy in watermark extraction than existing methods, while maintaining extremely low latency (100K images processed in 8s).[89] MMIF-AMIN: Adaptive Loss-Driven Multi-Scale Invertible Dense Network for Multimodal Medical Image Fusion
Tao Luo,Weihua Xu
Main category: cs.CV
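TL;DR: The paper proposes MMIF-AMIN, a new multimodal medical image fusion method whose architecture effectively extracts unique and complementary features, outperforming existing methods in both quantitative and qualitative analyses.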
Details
Motivation: Multimodal medical image fusion (MMIF) aims to integrate images from different modalities into a comprehensive image that enhances medical diagnosis by accurately depicting organ structures, tissue textures, and metabolic information. Capturing both the unique and the complementary information across modalities is a key research challenge in MMIF. Method: The paper proposes a novel image fusion method, MMIF-AMIN, which employs an Invertible Dense Network (IDN) for lossless feature extraction and a Multi-scale Complementary Feature Extraction Module (MCFEM) that incorporates a hybrid attention mechanism, convolutional layers of varying sizes, and Transformers. An adaptive loss function is also introduced to guide model learning. Result: Extensive experiments show that MMIF-AMIN outperforms nine state-of-the-art MMIF methods, achieving superior results in both quantitative and qualitative analyses. Ablation experiments confirm the effectiveness of each component, and extending MMIF-AMIN to other image fusion tasks also yields promising performance. Conclusion: MMIF-AMIN is an effective multimodal medical image fusion method; its innovations in data mining depth and feature extraction let it outperform existing techniques across multiple analyses. Abstract: Multimodal medical image fusion (MMIF) aims to integrate images from different modalities to produce a comprehensive image that enhances medical diagnosis by accurately depicting organ structures, tissue textures, and metabolic information. Capturing both the unique and complementary information across multiple modalities simultaneously is a key research challenge in MMIF. To address this challenge, this paper proposes a novel image fusion method, MMIF-AMIN, which features a new architecture that can effectively extract these unique and complementary features. Specifically, an Invertible Dense Network (IDN) is employed for lossless feature extraction from individual modalities. To extract complementary information between modalities, a Multi-scale Complementary Feature Extraction Module (MCFEM) is designed, which incorporates a hybrid attention mechanism, convolutional layers of varying sizes, and Transformers. An adaptive loss function is introduced to guide model learning, addressing the limitations of traditional manually-designed loss functions and enhancing the depth of data mining. Extensive experiments demonstrate that MMIF-AMIN outperforms nine state-of-the-art MMIF methods, delivering superior results in both quantitative and qualitative analyses. Ablation experiments confirm the effectiveness of each component of the proposed method. 
Additionally, extending MMIF-AMIN to other image fusion tasks also achieves promising performance.[90] PADReg: Physics-Aware Deformable Registration Guided by Contact Force for Ultrasound Sequences
Yimeng Geng,Mingyang Zhao,Fan Xu,Guanglin Cao,Gaofeng Meng,Hongbin Liu
Main category: cs.CV
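TL;DR: The paper proposes PADReg, a new ultrasound deformable registration framework that improves registration accuracy by combining multi-modal information from contact force and ultrasound images.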
Details
Motivation: Ultrasound deformable registration is crucial for estimating biomechanical properties and improving diagnostic accuracy in diseases such as thyroid nodules and breast cancer. However, the low contrast, heavy noise, and ambiguous tissue boundaries of ultrasound images severely hinder reliable feature extraction and correspondence matching. Method: PADReg uses synchronized contact force measured by robotic ultrasound systems as a physical prior to constrain registration. It constructs a pixel-wise stiffness map and combines it with contact force data to estimate a dense deformation field. Result: Experiments show that PADReg attains an HD95 of 12.90, which is 21.34% better than state-of-the-art methods. Conclusion: By incorporating physical information, PADReg achieves more accurate and physically plausible registration than methods that rely solely on image similarity. Abstract: Ultrasound deformable registration estimates spatial transformations between pairs of deformed ultrasound images, which is crucial for capturing biomechanical properties and enhancing diagnostic accuracy in diseases such as thyroid nodules and breast cancer. However, ultrasound deformable registration remains highly challenging, especially under large deformation. The inherently low contrast, heavy noise and ambiguous tissue boundaries in ultrasound images severely hinder reliable feature extraction and correspondence matching. Existing methods often suffer from poor anatomical alignment and lack physical interpretability. To address the problem, we propose PADReg, a physics-aware deformable registration framework guided by contact force. PADReg leverages synchronized contact force measured by robotic ultrasound systems as a physical prior to constrain the registration. Specifically, instead of directly predicting deformation fields, we first construct a pixel-wise stiffness map utilizing the multi-modal information from contact force and ultrasound images. The stiffness map is then combined with force data to estimate a dense deformation field, through a lightweight physics-aware module inspired by Hooke's law. This design enables PADReg to achieve physically plausible registration with better anatomical alignment than previous methods relying solely on image similarity. Experiments on in-vivo datasets demonstrate that it attains an HD95 of 12.90, which is 21.34% better than state-of-the-art methods. The source code is available at https://github.com/evelynskip/PADReg.[91] ROD: RGB-Only Fast and Efficient Off-road Freespace Detection
Tong Sun,Hongliang Ye,Jilin Mei,Liang Chen,Fangzhou Zhao,Leiqiang Zong,Yu Hu
Main category: cs.CV
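TL;DR: This paper presents ROD, a novel RGB-only off-road freespace detection method that does not rely on LiDAR data and achieves higher precision along with the inference speed required for real-time applications.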
Details
Motivation: Off-road freespace detection is more challenging than on-road scenarios because the boundaries of traversable areas are blurred. Existing state-of-the-art methods use multi-modal fusion of RGB images and LiDAR data, but computing surface normal maps from LiDAR data significantly increases inference time, making multi-modal methods unsuitable for real-time applications. Method: A pre-trained Vision Transformer (ViT) extracts rich features from RGB images, and a lightweight yet efficient decoder is designed; together they improve precision and inference speed. Result: ROD establishes a new state of the art on the ORFD and RELLIS-3D datasets and achieves an inference speed of 50 FPS, significantly outperforming prior models. Conclusion: ROD addresses the challenges of off-road freespace detection without LiDAR data, making real-time application more feasible. Abstract: Off-road freespace detection is more challenging than on-road scenarios because of the blurred boundaries of traversable areas. Previous state-of-the-art (SOTA) methods employ multi-modal fusion of RGB images and LiDAR data. However, due to the significant increase in inference time when calculating surface normal maps from LiDAR data, multi-modal methods are not suitable for real-time applications, particularly in real-world scenarios where higher FPS is required compared to slow navigation. This paper presents a novel RGB-only approach for off-road freespace detection, named ROD, eliminating the reliance on LiDAR data and its computational demands. Specifically, we utilize a pre-trained Vision Transformer (ViT) to extract rich features from RGB images. Additionally, we design a lightweight yet efficient decoder, which together improve both precision and inference speed. ROD establishes a new SOTA on the ORFD and RELLIS-3D datasets and reaches an inference speed of 50 FPS, significantly outperforming prior models.[92] Subjective and Objective Quality Assessment of Banding Artifacts on Compressed Videos
Qi Zheng,Li-Heng Chen,Chenlong He,Neil Berkbeck,Yilin Wang,Balu Adsumilli,Alan C. Bovik,Yibo Fan,Zhengzhong Tu
Main category: cs.CV
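TL;DR: This paper proposes CBAND, a new no-reference video quality assessment model for predicting perceptual banding artifacts, and releases a new video dataset, LIVE-YT-Banding.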
Details
Motivation: Despite notable advances in video compression in recent years, banding artifacts remain a serious issue affecting video quality, particularly in smooth regions of high-definition videos. Existing datasets for studying banding artifacts are limited to still-picture data and cannot account for temporal banding dynamics, so a systematic study of the banding video quality assessment problem is needed. Method: The authors develop CBAND, a new no-reference video quality assessment model that uses natural image statistics expressed in deep neural network embeddings, and test and compare it on the newly created LIVE-YT-Banding dataset. Result: Experimental results show that CBAND significantly outperforms existing state-of-the-art models in perceptual banding prediction and is orders of magnitude faster. The LIVE-YT-Banding database, code, and pre-trained model are all publicly available. Conclusion: The paper introduces CBAND, a new no-reference video quality assessment model that significantly outperforms existing state-of-the-art models at predicting perceptual banding artifacts and can serve as a differentiable loss function for optimizing video debanding models. The authors also release a new video dataset, LIVE-YT-Banding. Abstract: Although there have been notable advancements in video compression technologies in recent years, banding artifacts remain a serious issue affecting the quality of compressed videos, particularly on smooth regions of high-definition videos. Noticeable banding artifacts can severely impact the perceptual quality of videos viewed on a high-end HDTV or high-resolution screen. Hence, there is a pressing need for a systematic investigation of the banding video quality assessment problem for advanced video codecs. Given that the existing publicly available datasets for studying banding artifacts are limited to still picture data only, which cannot account for temporal banding dynamics, we have created a first-of-a-kind open video dataset, dubbed LIVE-YT-Banding, which consists of 160 videos generated by four different compression parameters using the AV1 video codec. A total of 7,200 subjective opinions are collected from a cohort of 45 human subjects. To demonstrate the value of this new resource, we tested and compared a variety of models that detect banding occurrences, and measure their impact on perceived quality. Among these, we introduce an effective and efficient new no-reference (NR) video quality evaluator which we call CBAND. CBAND leverages the properties of the learned statistics of natural images expressed in the embeddings of deep neural networks. 
Our experimental results show that the perceptual banding prediction performance of CBAND significantly exceeds that of previous state-of-the-art models, and is also orders of magnitude faster. Moreover, CBAND can be employed as a differentiable loss function to optimize video debanding models. The LIVE-YT-Banding database, code, and pre-trained model are all publicly available at https://github.com/uniqzheng/CBAND.[93] SafeFix: Targeted Model Repair via Controlled Image Generation
Ouyang Xu,Baoming Zhang,Ruiyu Mao,Yunhui Guo
Main category: cs.CV
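TL;DR: The paper proposes a model repair module based on interpretable failure attribution and synthetic data generation to address systematic errors of deep learning models in visual recognition.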
Details
Motivation: The paper aims to address systematic errors of deep learning models in visual recognition caused by underrepresented semantic subpopulations, and the difficulty existing debugging frameworks have in effectively repairing models. Method: The approach uses a conditional text-to-image model to generate semantically faithful images targeted at failure cases, and uses a large vision-language model to filter the outputs so that the generated samples remain high-quality and relevant. Result: Retraining vision models with the rare-case-augmented synthetic dataset significantly reduces errors associated with rare cases, demonstrating the method's effectiveness. Conclusion: The paper concludes that a model repair module built on an interpretable failure attribution pipeline can significantly reduce errors associated with rare cases and improve model robustness without introducing new errors. Abstract: Deep learning models for visual recognition often exhibit systematic errors due to underrepresented semantic subpopulations. Although existing debugging frameworks can pinpoint these failures by identifying key failure attributes, repairing the model effectively remains difficult. Current solutions often rely on manually designed prompts to generate synthetic training images -- an approach prone to distribution shift and semantic errors. To overcome these challenges, we introduce a model repair module that builds on an interpretable failure attribution pipeline. Our approach uses a conditional text-to-image model to generate semantically faithful and targeted images for failure cases. To preserve the quality and relevance of the generated samples, we further employ a large vision-language model (LVLM) to filter the outputs, enforcing alignment with the original data distribution and maintaining semantic consistency. By retraining vision models with this rare-case-augmented synthetic dataset, we significantly reduce errors associated with rare cases. Our experiments demonstrate that this targeted repair strategy improves model robustness without introducing new bugs. Code is available at https://github.com/oxu2/SafeFix.[94] Adaptive Confidence-Wise Loss for Improved Lens Structure Segmentation in AS-OCT
Zunjie Xiao,Xiao Wu,Tianhang Liu,Lingxi Hu,Yinling Zhang,Xiaoqing Zhang,Risa Higashita,Jiang Liu
Main category: cs.CV
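TL;DR: This paper proposes the ACW loss and the BECE metric for lens structure segmentation; by exploiting the confidence prior in expert annotations, they significantly improve segmentation accuracy and calibration.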
Details
Motivation: Existing deep segmentation networks typically weight all pixels equally in lens structure segmentation, overlooking the inhomogeneity of lens structure sub-regions and the calibration problems in boundary regions. Clinical experts annotate different sub-regions with varying confidence, so a new method is needed to exploit this prior. Method: The ACW loss groups lens sub-regions into high-confidence and low-confidence groups and reweights each group with a region-weighted loss; an optimization algorithm that dynamically adjusts the ACW confidence threshold is also designed. Result: Experiments on a clinical lens structure AS-OCT dataset and other multi-structure datasets show that ACW significantly outperforms competitive segmentation losses across multiple deep segmentation networks. Under U-Net, ACW surpasses cross-entropy (CE) loss with a 6.13% IoU gain, a 4.33% DSC increase, and a 4.79% BECE reduction. Conclusion: The paper proposes a new Adaptive Confidence-Wise (ACW) loss and a Boundary Expected Calibration Error (BECE) metric, significantly improving the accuracy and calibration of lens structure segmentation. Abstract: Precise lens structure segmentation is essential for the design of intraocular lenses (IOLs) in cataract surgery. Existing deep segmentation networks typically weight all pixels equally under cross-entropy (CE) loss, overlooking the fact that sub-regions of lens structures are inhomogeneous (e.g., some regions perform better than others) and that boundary regions often suffer from poor segmentation calibration at the pixel level. Clinically, experts annotate different sub-regions of lens structures with varying confidence levels, considering factors such as sub-region proportions, ambiguous boundaries, and lens structure shapes. Motivated by this observation, we propose an Adaptive Confidence-Wise (ACW) loss to group each lens structure sub-region into different confidence sub-regions via a confidence threshold from the unique region aspect, aiming to exploit the potential of expert annotation confidence prior. Specifically, ACW clusters each target region into low-confidence and high-confidence groups and then applies a region-weighted loss to reweigh each confidence group. Moreover, we design an adaptive confidence threshold optimization algorithm to adjust the confidence threshold of ACW dynamically. Additionally, to better quantify the miscalibration errors in boundary region segmentation, we propose a new metric, termed Boundary Expected Calibration Error (BECE). 
Extensive experiments on a clinical lens structure AS-OCT dataset and other multi-structure datasets demonstrate that our ACW significantly outperforms competitive segmentation loss methods across different deep segmentation networks (e.g., MedSAM). Notably, our method surpasses CE with 6.13% IoU gain, 4.33% DSC increase, and 4.79% BECE reduction in lens structure segmentation under U-Net. The code of this paper is available at https://github.com/XiaoLing12138/Adaptive-Confidence-Wise-Loss.[95] Bridging the Gap: A Framework for Real-World Video Deepfake Detection via Social Network Compression Emulation
Andrea Montibeller,Dasara Shullani,Daniele Baracchi,Alessandro Piva,Giulia Boato
Main category: cs.CV
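TL;DR: This paper proposes a new method that improves deepfake detectors by emulating the video sharing pipelines of social networks; it reproduces platform-specific artifacts without direct API access and yields detectors that perform comparably to those trained on actual shared media.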
Details
Motivation: The growing number of AI-generated videos on social networks poses new challenges for deepfake detection, because detectors trained under controlled conditions often fail to generalize to real-world scenarios. Method: Compression and resizing parameters are estimated from a small set of uploaded videos to build a local emulator that can reproduce platform-specific artifacts on large datasets. Result: Experiments show that detectors fine-tuned on emulated videos achieve performance comparable to detectors trained on actual shared media, and that the emulated data closely matches the degradation patterns of real uploads. Conclusion: The paper proposes a new framework that emulates social network video sharing pipelines by estimating compression and resizing parameters to reproduce platform-specific artifacts, bridging the gap between lab-based training and real-world deployment. Abstract: The growing presence of AI-generated videos on social networks poses new challenges for deepfake detection, as detectors trained under controlled conditions often fail to generalize to real-world scenarios. A key factor behind this gap is the aggressive, proprietary compression applied by platforms like YouTube and Facebook, which launders low-level forensic cues. However, replicating these transformations at scale is difficult due to API limitations and data-sharing constraints. For these reasons, we propose a first framework that emulates the video sharing pipelines of social networks by estimating compression and resizing parameters from a small set of uploaded videos. These parameters enable a local emulator capable of reproducing platform-specific artifacts on large datasets without direct API access. Experiments on FaceForensics++ videos shared via social networks demonstrate that our emulated data closely matches the degradation patterns of real uploads. Furthermore, detectors fine-tuned on emulated videos achieve comparable performance to those trained on actual shared media. Our approach offers a scalable and practical solution for bridging the gap between lab-based training and real-world deployment of deepfake detectors, particularly in the underexplored domain of compressed video content.[96] SHREC 2025: Retrieval of Optimal Objects for Multi-modal Enhanced Language and Spatial Assistance (ROOMELSA)
Trong-Thuan Nguyen,Viet-Tham Huynh,Quang-Thuc Nguyen,Hoang-Phuc Nguyen,Long Le Bao,Thai Hoang Minh,Minh Nguyen Anh,Thang Nguyen Tien,Phat Nguyen Thuan,Huy Nguyen Phong,Bao Huynh Thai,Vinh-Tiep Nguyen,Duc-Vu Nguyen,Phu-Hoa Pham,Minh-Huy Le-Hoang,Nguyen-Khang Le,Minh-Chinh Nguyen,Minh-Quan Ho,Ngoc-Long Tran,Hien-Long Le-Hoang,Man-Khoi Tran,Anh-Duong Tran,Kim Nguyen,Quan Nguyen Hung,Dat Phan Thanh,Hoang Tran Van,Tien Huynh Viet,Nhan Nguyen Viet Thien,Dinh-Khoi Vo,Van-Loc Nguyen,Trung-Nghia Le,Tam V. Nguyen,Minh-Triet Tran
Main category: cs.CV
TL;DR: ROOMELSA is a new 3D retrieval benchmark designed to evaluate a system's ability to interpret natural language in complex real-world scenarios.
Details
Motivation: Current 3D retrieval systems are mainly designed for simple, controlled scenarios, whereas real-world scenarios are more complex and require identifying objects in cluttered scenes from vague, free-form descriptions. ROOMELSA is proposed as a new benchmark for this setting. Method: ROOMELSA attends to a specific region within a panoramic room image and accurately retrieves the corresponding 3D model from a large database. Result: ROOMELSA includes over 1,600 apartment scenes, nearly 5,200 rooms, and more than 44,000 targeted queries. Experiments show that while coarse object retrieval is largely solved, only one top-performing model consistently ranked the correct match first across nearly all test cases. Conclusion: ROOMELSA highlights the importance of tightly integrating visual and language understanding and, by bridging the gap between scene-level grounding and fine-grained 3D retrieval, establishes a new benchmark for advancing robust, real-world 3D recognition systems. Abstract: Recent 3D retrieval systems are typically designed for simple, controlled scenarios, such as identifying an object from a cropped image or a brief description. However, real-world scenarios are more complex, often requiring the recognition of an object in a cluttered scene based on a vague, free-form description. To this end, we present ROOMELSA, a new benchmark designed to evaluate a system's ability to interpret natural language. Specifically, ROOMELSA attends to a specific region within a panoramic room image and accurately retrieves the corresponding 3D model from a large database. In addition, ROOMELSA includes over 1,600 apartment scenes, nearly 5,200 rooms, and more than 44,000 targeted queries. Empirically, while coarse object retrieval is largely solved, only one top-performing model consistently ranked the correct match first across nearly all test cases. Notably, a lightweight CLIP-based model also performed well, although it struggled with subtle variations in materials, part structures, and contextual cues, resulting in occasional errors. These findings highlight the importance of tightly integrating visual and language understanding. By bridging the gap between scene-level grounding and fine-grained 3D retrieval, ROOMELSA establishes a new benchmark for advancing robust, real-world 3D recognition systems.[97] DiffPose-Animal: A Language-Conditioned Diffusion Framework for Animal Pose Estimation
Tianyu Xiong,Dayi Tan,Wei Tian
Main category: cs.CV
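TL;DR: This paper proposes DiffPose-Animal, an animal pose estimation method based on diffusion models and large language models that effectively handles species diversity, sparse annotations, and complex backgrounds.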
Details
Motivation: Animal pose estimation is more challenging than human pose estimation because of high interspecies morphological diversity, complex body structures, and limited annotated data. Method: Pose estimation is reformulated as a denoising process under the generative framework of diffusion models, and a diffusion-based keypoint decoder is designed to progressively refine pose predictions. Result: Experiments on public animal pose datasets show that DiffPose-Animal retains strong performance and generalization under diverse species, cluttered backgrounds, and incomplete keypoints. Conclusion: DiffPose-Animal is a diffusion-based animal pose estimation framework that improves accuracy and robustness by using large language models to extract global anatomical priors and local keypoint-wise semantics. Abstract: Animal pose estimation is a fundamental task in computer vision, with growing importance in ecological monitoring, behavioral analysis, and intelligent livestock management. Compared to human pose estimation, animal pose estimation is more challenging due to high interspecies morphological diversity, complex body structures, and limited annotated data. In this work, we introduce DiffPose-Animal, a novel diffusion-based framework for top-down animal pose estimation. Unlike traditional heatmap regression methods, DiffPose-Animal reformulates pose estimation as a denoising process under the generative framework of diffusion models. To enhance semantic guidance during keypoint generation, we leverage large language models (LLMs) to extract both global anatomical priors and local keypoint-wise semantics based on species-specific prompts. These textual priors are encoded and fused with image features via cross-attention modules to provide biologically meaningful constraints throughout the denoising process. Additionally, a diffusion-based keypoint decoder is designed to progressively refine pose predictions, improving robustness to occlusion and annotation sparsity. Extensive experiments on public animal pose datasets demonstrate the effectiveness and generalization capability of our method, especially under challenging scenarios with diverse species, cluttered backgrounds, and incomplete keypoints.[98] Region-Adaptive Video Sharpening via Rate-Perception Optimization
Yingxue Pang,Shijie Zhao,Mengxi Guo,Junlin Li,Li Zhang
Main category: cs.CV
TL;DR: This paper introduces RPO-AdaSharp, a video sharpening model that improves perceptual quality and saves bitrate by adapting sharpening intensity to regional variations using CTU partition masks.
Details
Motivation: Uniform sharpening intensity degrades video quality and increases bitrate without optimal allocation techniques for diverse regions. Method: The authors proposed RPO-AdaSharp, an end-to-end region-adaptive video sharpening model, using coding tree unit (CTU) partition masks as prior information to guide and constrain bit allocation. Result: Experiments on benchmarks showed the proposed model's effectiveness in both qualitative and quantitative evaluations. Conclusion: RPO-AdaSharp effectively enhances video perception while saving bitrate by leveraging CTU partition masks to guide bit allocation. Abstract: Sharpening is a widely adopted video enhancement technique. However, uniform sharpening intensity ignores texture variations, degrading video quality. Sharpening also increases bitrate, and there's a lack of techniques to optimally allocate these additional bits across diverse regions. Thus, this paper proposes RPO-AdaSharp, an end-to-end region-adaptive video sharpening model for both perceptual enhancement and bitrate savings. We use the coding tree unit (CTU) partition mask as prior information to guide and constrain the allocation of increased bits. Experiments on benchmarks demonstrate the effectiveness of the proposed model qualitatively and quantitatively.[99] MonoPartNeRF:Human Reconstruction from Monocular Video via Part-Based Neural Radiance Fields
Yao Lu,Jiawei Li,Ming Jiang
Main category: cs.CV
TL;DR: This paper proposes MonoPartNeRF, a novel framework for monocular dynamic human rendering that effectively handles complex poses and occlusions through a bidirectional deformation model, part-based pose embedding, and attention-based appearance modeling.
Details
Motivation: Existing part-based rendering methods struggle with unnatural transitions at part boundaries and inaccurate reconstruction of occluded regions under complex pose variations in monocular settings. Method: The study proposes MonoPartNeRF, which uses a bidirectional deformation model combining rigid and non-rigid transformations, part-based pose embedding, keyframe pose retrieval and interpolation, and a learnable appearance code via attention mechanism. Result: Experiments on ZJU-MoCap and MonoCap datasets show that MonoPartNeRF significantly outperforms prior approaches in handling complex poses and occlusions, with improved joint alignment, texture fidelity, and structural continuity. Conclusion: MonoPartNeRF offers a robust solution for monocular dynamic human rendering, effectively addressing challenges in complex pose variations and occlusion recovery. Abstract: In recent years, Neural Radiance Fields (NeRF) have achieved remarkable progress in dynamic human reconstruction and rendering. Part-based rendering paradigms, guided by human segmentation, allow for flexible parameter allocation based on structural complexity, thereby enhancing representational efficiency. However, existing methods still struggle with complex pose variations, often producing unnatural transitions at part boundaries and failing to reconstruct occluded regions accurately in monocular settings. We propose MonoPartNeRF, a novel framework for monocular dynamic human rendering that ensures smooth transitions and robust occlusion recovery. First, we build a bidirectional deformation model that combines rigid and non-rigid transformations to establish a continuous, reversible mapping between observation and canonical spaces. Sampling points are projected into a parameterized surface-time space (u, v, t) to better capture non-rigid motion. A consistency loss further suppresses deformation-induced artifacts and discontinuities. 
We introduce a part-based pose embedding mechanism that decomposes global pose vectors into local joint embeddings based on body regions. This is combined with keyframe pose retrieval and interpolation, along three orthogonal directions, to guide pose-aware feature sampling. A learnable appearance code is integrated via attention to model dynamic texture changes effectively. Experiments on the ZJU-MoCap and MonoCap datasets demonstrate that our method significantly outperforms prior approaches under complex pose and occlusion conditions, achieving superior joint alignment, texture fidelity, and structural continuity.[100] Identity-Preserving Aging and De-Aging of Faces in the StyleGAN Latent Space
Luis S. Luevano,Pavel Korshunov,Sebastien Marcel
Main category: cs.CV
TL;DR: This paper introduces a novel method for face aging and de-aging using StyleGAN2's latent space editing, ensuring identity preservation without relying on complex conditioning, and provides a public dataset for cross-age face recognition research.
Details
Motivation: The motivation stems from the limitations of current state-of-the-art methods in face aging/de-aging, which rely on complex conditioning mechanisms and often neglect identity preservation, leading to inconsistent results and high data demands. Method: The method involves editing the latent space of StyleGAN2 using support vector modeling to determine aging/de-aging directions, and feature selection techniques to identify the identity-preserving subspace. The authors also develop a formula to estimate aging/de-aging parameters for identity preservation. Result: The authors successfully synthesized aged and de-aged faces while preserving identity, validated using two state-of-the-art face recognition systems. They also generated a public dataset of synthetic faces at different ages for benchmarking purposes. Conclusion: The paper concludes that their proposed method allows for the synthesis of aged and de-aged faces while preserving identity, addressing the limitations of existing methods that rely on complex conditioning techniques. Abstract: Face aging or de-aging with generative AI has gained significant attention for its applications in fields such as forensics, security, and media. However, most state-of-the-art methods rely on conditional Generative Adversarial Networks (GANs), Diffusion-based models, or Visual Language Models (VLMs) to age or de-age faces based on predefined age categories and conditioning via loss functions, fine-tuning, or text prompts. The reliance on such conditioning leads to complex training requirements, increased data needs, and challenges in generating consistent results. Additionally, identity preservation is rarely taken into account, or it is evaluated on only a single face recognition system without any control or guarantees on whether identity would be preserved in a generated aged/de-aged face. 
In this paper, we propose to synthesize aged and de-aged faces by editing the latent space of StyleGAN2 using a simple support vector modeling of the aging/de-aging direction and several feature selection approaches. By using two state-of-the-art face recognition systems, we empirically find the identity-preserving subspace within the StyleGAN2 latent space, so that the apparent age of a given face can be changed while preserving the identity. We then propose a simple yet practical formula for estimating the limits on aging/de-aging parameters that ensures identity preservation for a given input face. Using our method and estimated parameters, we have generated a public dataset of synthetic faces at different ages that can be used for benchmarking cross-age face recognition, age assurance systems, or systems for detection of synthetic images. Our code and dataset are available at the project page https://www.idiap.ch/paper/agesynth/[101] Revisiting Efficient Semantic Segmentation: Learning Offsets for Better Spatial and Class Feature Alignment
Shi-Chen Zhang,Yunheng Li,Yu-Huan Wu,Qibin Hou,Ming-Ming Cheng
Main category: cs.CV
TL;DR: The paper proposes an offset learning paradigm to address the misalignment between class representations and image features in lightweight semantic segmentation models, resulting in improved performance with minimal additional parameters.
Details
Motivation: The authors identify a limitation in current lightweight semantic segmentation models: the misalignment between class representations and image features due to the per-pixel classification paradigm. This paradigm assumes that pixel features for the same category should not vary across images, which is challenging in practice. Method: The authors propose a coupled dual-branch offset learning paradigm that dynamically refines class representations and spatial image features. They construct an efficient semantic segmentation network called OffSeg and test its effectiveness on four datasets. Result: The offset learning paradigm improves the mIoU performance of existing methods (SegFormer-B0, SegNeXt-T, Mask2Former-Tiny) by 2.7%, 1.9%, and 2.6% respectively on the ADE20K dataset, with only 0.1-0.2M additional parameters. Conclusion: The proposed offset learning paradigm can improve the performance of existing semantic segmentation methods with minimal additional parameters. Abstract: Semantic segmentation is fundamental to vision systems requiring pixel-level scene understanding, yet deploying it on resource-constrained devices demands efficient architectures. Although existing methods achieve real-time inference through lightweight designs, we reveal their inherent limitation: misalignment between class representations and image features caused by a per-pixel classification paradigm. With experimental analysis, we find that this paradigm results in a highly challenging assumption for efficient scenarios: Image pixel features should not vary for the same category in different images. To address this dilemma, we propose a coupled dual-branch offset learning paradigm that explicitly learns feature and class offsets to dynamically refine both class representations and spatial image features. Based on the proposed paradigm, we construct an efficient semantic segmentation network, OffSeg. 
Notably, the offset learning paradigm can be applied to existing methods with no additional architectural changes. Extensive experiments on four datasets, including ADE20K, Cityscapes, COCO-Stuff-164K, and Pascal Context, demonstrate consistent improvements with a negligible number of additional parameters. For instance, on the ADE20K dataset, our proposed offset learning paradigm improves SegFormer-B0, SegNeXt-T, and Mask2Former-Tiny by 2.7%, 1.9%, and 2.6% mIoU, respectively, with only 0.1-0.2M additional parameters required.[102] TARA: Token-Aware LoRA for Composable Personalization in Diffusion Models
Yuqi Peng,Lingtao Zheng,Yufeng Yang,Yi Huang,Mingfu Yan,Jianzhuang Liu,Shifeng Chen
Main category: cs.CV
TL;DR: This paper proposes TARA, a method that improves multi-concept text-to-image generation by addressing token-wise interference and spatial misalignment in LoRA modules.
Details
Motivation: Recent LoRA-based methods face challenges in multi-concept generation, including identity missing and visual feature leakage, due to token-wise interference and spatial misalignment. This work aims to address these issues. Method: The study proposes Token-Aware LoRA (TARA), which introduces a token mask to constrain each LoRA module to focus on its associated rare token, avoiding interference, and a training objective to align spatial attention with the concept region. Result: Experimental results show that TARA effectively enables multi-concept composition without additional training, preserving visual identity and reducing interference between modules. Conclusion: TARA enables efficient multi-concept inference while preserving each concept's visual identity by avoiding mutual interference between LoRA modules. Abstract: Personalized text-to-image generation aims to synthesize novel images of a specific subject or style using only a few reference images. Recent methods based on Low-Rank Adaptation (LoRA) enable efficient single-concept customization by injecting lightweight, concept-specific adapters into pre-trained diffusion models. However, combining multiple LoRA modules for multi-concept generation often leads to identity missing and visual feature leakage. In this work, we identify two key issues behind these failures: (1) token-wise interference among different LoRA modules, and (2) spatial misalignment between the attention map of a rare token and its corresponding concept-specific region. To address these issues, we propose Token-Aware LoRA (TARA), which introduces a token mask to explicitly constrain each module to focus on its associated rare token to avoid interference, and a training objective that encourages the spatial attention of a rare token to align with its concept region. Our method enables training-free multi-concept composition by directly injecting multiple independently trained TARA modules at inference time. 
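The token-mask idea above can be illustrated with a small sketch. This is only one possible reading of the abstract, not the authors' implementation: it assumes that each adapter is a low-rank pair (A, B) applied to per-token text embeddings, and that the mask is a hard switch that lets an adapter act only at the position of its own rare token. The function names, the tuple layout, and the `rare_token_index` field are all hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def masked_lora_project(tokens, base_weight, adapters):
    """Project each token embedding with the frozen base weight, then add
    each adapter's low-rank update B @ (A @ x) only at that adapter's own
    rare-token position (hypothetical hard token mask).

    tokens:      list of embedding vectors, one per prompt token
    base_weight: frozen projection matrix W (list of rows)
    adapters:    list of (A, B, rare_token_index) tuples
    """
    outputs = []
    for position, x in enumerate(tokens):
        y = matvec(base_weight, x)
        for A, B, rare_token_index in adapters:
            if position != rare_token_index:
                continue  # mask is 0 here: this adapter leaves the token untouched
            delta = matvec(B, matvec(A, x))  # rank-r update for the masked-in token
            y = [yi + di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0]]   # two toy token embeddings
    W = [[1.0, 0.0], [0.0, 1.0]]        # identity stands in for the frozen weight
    adapter = ([[1.0, 1.0]], [[1.0], [2.0]], 1)  # rank-1 adapter tied to token 1
    print(masked_lora_project(tokens, W, [adapter]))
```

With several adapters in the list, each one modifies only its own token position, which is the sense in which independently trained modules would not interfere under this assumed masking scheme.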
Experimental results demonstrate that TARA enables efficient multi-concept inference and effectively preserves the visual identity of each concept by avoiding mutual interference between LoRA modules. The code and models are available at https://github.com/YuqiPeng77/TARA.[103] 3DFroMLLM: 3D Prototype Generation only from Pretrained Multimodal LLMs
Noor Ahmed,Cameron Braunstein,Steffen Eger,Eddy Ilg
Main category: cs.CV
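TL;DR: This paper proposes the 3DFroMLLM framework, which generates 3D object prototypes through an agentic pipeline, improves the spatial reasoning of multimodal large language models, and achieves notable gains on image classification and fine-grained vision tasks.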
Details
Motivation: Although multimodal large language models excel at learning joint representations from text and images, their spatial reasoning remains limited; 3DFroMLLM is proposed to strengthen their 3D generation capability. Method: An agentic pipeline comprising a designer, a coder, and a visual inspector generates 3D object prototypes in a refinement loop. Result: Images generated with 3DFroMLLM, used for image classification pretraining tasks, outperform previous methods by 15%; for fine-grained vision-language models, fine-tuning CLIP with the generated prototypes yields a 55% accuracy improvement. Conclusion: The 3DFroMLLM framework generates 3D object prototypes directly from MLLMs without additional training data or detailed user instructions, and it can improve the performance of fine-grained vision-language models. Abstract: Recent Multi-Modal Large Language Models (MLLMs) have demonstrated strong capabilities in learning joint representations from text and images. However, their spatial reasoning remains limited. We introduce 3DFroMLLM, a novel framework that enables the generation of 3D object prototypes directly from MLLMs, including geometry and part labels. Our pipeline is agentic, comprising a designer, coder, and visual inspector operating in a refinement loop. Notably, our approach requires no additional training data or detailed user instructions. Building on prior work in 2D generation, we demonstrate that rendered images produced by our framework can be effectively used for image classification pretraining tasks and outperform previous methods by 15%. As a compelling real-world use case, we show that the generated prototypes can be leveraged to improve fine-grained vision-language models by using the rendered, part-labeled prototypes to fine-tune CLIP for part segmentation and achieving a 55% accuracy improvement without relying on any additional human-labeled data.[104] A Parametric Bi-Directional Curvature-Based Framework for Image Artifact Classification and Quantification
Diego Frias
Main category: cs.CV
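TL;DR: This paper proposes a new no-reference image quality assessment framework that accurately classifies and quantifies image degradation by computing Anisotropic Texture Richness (ATR).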
Details
Motivation: The authors propose a new framework to improve the accuracy of no-reference image quality assessment. Method: The paper defines a measure of Anisotropic Texture Richness (ATR) that uses two tunable thresholds, optimizes its parameters for a specific distortion, and then uses the ATR scores to classify the distortion type and produce a quality rating. Result: On the LIVE dataset, the method achieves Spearman correlations of approximately -0.93 for Gaussian blur and -0.95 for white noise; it predicts human scores with a coefficient of determination (R²) of 0.892 and a Root Mean Square Error (RMSE) of 5.17 DMOS points. Conclusion: The paper proposes a new no-reference image quality assessment framework that accurately classifies and quantifies image degradation with high predictive accuracy. Abstract: This work presents a novel framework for No-Reference Image Quality Assessment (NR-IQA) founded on the analysis of directional image curvature. Within this framework, we define a measure of Anisotropic Texture Richness (ATR), which is computed at the pixel level using two tunable thresholds -- one permissive and one restrictive -- that quantify orthogonal texture suppression. When its parameters are optimized for a specific artifact, the resulting ATR score serves as a high-performance quality metric, achieving Spearman correlations with human perception of approximately -0.93 for Gaussian blur and -0.95 for white noise on the LIVE dataset. The primary contribution is a two-stage system that leverages the differential response of ATR to various distortions. First, the system utilizes the signature from two specialist ATR configurations to classify the primary artifact type (blur vs. noise) with over 97% accuracy. Second, following classification, it employs a dedicated regression model mapping the relevant ATR score to a quality rating to quantify the degradation. On a combined dataset, the complete system predicts human scores with a coefficient of determination (R²) of 0.892 and a Root Mean Square Error (RMSE) of 5.17 DMOS points. This error corresponds to just 7.4% of the dataset's total quality range, demonstrating high predictive accuracy. This establishes our framework as a robust, dual-purpose tool for the classification and subsequent quantification of image degradation.[105] Adaptive High-Frequency Preprocessing for Video Coding
Yingxue Pang,Shijie Zhao,Junlin Li,Li Zhang
Main category: cs.CV
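TL;DR: This paper proposes an end-to-end learning framework for video coding that improves subjective quality and reduces bitrate through adaptive high-frequency preprocessing.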
Details
Motivation: High-frequency components are crucial for maintaining video clarity and realism, but they also significantly affect coding bitrate, increasing bandwidth and storage costs. Method: The framework employs the Frequency-attentive Feature pyramid Prediction Network (FFPN) to predict the optimal high-frequency preprocessing strategy, guiding subsequent filtering operators to achieve the optimal tradeoff between bitrate and quality after compression. Result: The paper presents an end-to-end learning framework for adaptive high-frequency preprocessing that enhances subjective quality and saves bitrate in video coding. Conclusion: Comprehensive evaluations on multiple datasets show visually appealing enhancement and bitrate savings. Abstract: High-frequency components are crucial for maintaining video clarity and realism, but they also significantly impact coding bitrate, resulting in increased bandwidth and storage costs. This paper presents an end-to-end learning-based framework for adaptive high-frequency preprocessing to enhance subjective quality and save bitrate in video coding. The framework employs the Frequency-attentive Feature pyramid Prediction Network (FFPN) to predict the optimal high-frequency preprocessing strategy, guiding subsequent filtering operators to achieve the optimal tradeoff between bitrate and quality after compression. For training FFPN, we pseudo-label each training video with the optimal strategy, determined by comparing the rate-distortion (RD) performance across different preprocessing types and strengths. Distortion is measured using the latest quality assessment metric. Comprehensive evaluations on multiple datasets demonstrate the visually appealing enhancement capabilities and bitrate savings achieved by our framework.[106] GaussianUpdate: Continual 3D Gaussian Splatting Update for Changing Environments
Lin Zeng,Boming Zhao,Jiarui Hu,Xujie Shen,Ziqiang Dang,Hujun Bao,Zhaopeng Cui
Main category: cs.CV
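TL;DR: This paper proposes GaussianUpdate, which combines 3D Gaussian representation with continual learning to adapt to scene changes and enables self-aware updating without storing images.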
Details
Motivation: Existing methods either require extensive model retraining or fail to capture detailed types of changes over time. Method: A novel multi-stage update strategy is proposed, together with a visibility-aware continual learning approach with generative replay. Result: Experiments on the benchmark dataset show that the method achieves superior, real-time rendering and can visualize changes over different times. Conclusion: By combining 3D Gaussian representation with continual learning, GaussianUpdate effectively updates Gaussian radiance fields and supports self-aware updating without storing images. Abstract: Novel view synthesis with neural models has advanced rapidly in recent years, yet adapting these models to scene changes remains an open problem. Existing methods are either labor-intensive, requiring extensive model retraining, or fail to capture detailed types of changes over time. In this paper, we present GaussianUpdate, a novel approach that combines 3D Gaussian representation with continual learning to address these challenges. Our method effectively updates the Gaussian radiance fields with current data while preserving information from past scenes. Unlike existing methods, GaussianUpdate explicitly models different types of changes through a novel multi-stage update strategy. Additionally, we introduce a visibility-aware continual learning approach with generative replay, enabling self-aware updating without the need to store images. The experiments on the benchmark dataset demonstrate that our method achieves superior, real-time rendering with the capability of visualizing changes over different times.[107] Preview WB-DH: Towards Whole Body Digital Human Bench for the Generation of Whole-body Talking Avatar Videos
Chaoyi Wang,Yifan Yang,Jun Pei,Lijie Xia,Jianpo Liu,Xiaobing Yuan,Xinhan Di
Main category: cs.CV
TL;DR: The paper introduces WB-DH, an open-source benchmark dataset with multi-modal annotations and evaluation tools, designed to improve the assessment of whole-body animatable avatar generation.
Details
Motivation: The motivation is to address the limitations in current datasets and metrics for evaluating whole-body animatable avatars, especially regarding subtle expressions, body movements, and dynamic backgrounds. Method: The authors created the Whole-Body Benchmark Dataset (WB-DH), which includes multi-modal annotations, a versatile evaluation framework, and provides public access to the dataset and tools. Result: The result is the development and release of the WB-DH dataset, featuring detailed annotations, an evaluation framework, and publicly accessible resources. Conclusion: The paper concludes by introducing WB-DH, an open-source, multi-modal benchmark for evaluating whole-body animatable avatar generation, addressing the lack of proper evaluation in existing datasets. Abstract: Creating realistic, fully animatable whole-body avatars from a single portrait is challenging due to limitations in capturing subtle expressions, body movements, and dynamic backgrounds. Current evaluation datasets and metrics fall short in addressing these complexities. To bridge this gap, we introduce the Whole-Body Benchmark Dataset (WB-DH), an open-source, multi-modal benchmark designed for evaluating whole-body animatable avatar generation. Key features include: (1) detailed multi-modal annotations for fine-grained guidance, (2) a versatile evaluation framework, and (3) public access to the dataset and tools at https://github.com/deepreasonings/WholeBodyBenchmark.[108] A Robust Epipolar-Domain Regularization Algorithm for Light Field Depth Estimation
Noor Islam S. Mohammad
Main category: cs.CV
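TL;DR: This paper proposes a lightweight light field depth estimation method that combines disparity information with random walk refinement, requires no large training set, and is computationally efficient with good accuracy.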
Details
Motivation: Robust depth estimation in light field imaging remains a key challenge for pattern recognition applications such as augmented reality, biomedical imaging, and scene reconstruction. Existing deep convolutional neural network approaches are computationally expensive and perform poorly in noisy environments. Method: A depth estimation method that does not rely on large-scale training data or deep convolutional neural networks is proposed; it integrates light field disparity information with a directed random walk refinement algorithm to improve depth map consistency. Result: Experiments on the 4D Light Field Benchmark dataset and a variety of real-world images show that performance declines slightly under uncontrolled conditions, while computational complexity stays low and accuracy remains competitive. Conclusion: The paper presents a novel lightweight depth estimation pipeline that combines light field disparity information with a directed random walk refinement algorithm, offering a robust and efficient alternative for depth estimation and segmentation in light field imaging. Abstract: Robust depth estimation in light field imaging remains a critical challenge for pattern recognition applications such as augmented reality, biomedical imaging, and scene reconstruction. While existing approaches often rely heavily on deep convolutional neural networks, they tend to incur high computational costs and struggle in noisy real-world environments. This paper proposes a novel lightweight depth estimation pipeline that integrates light field-based disparity information with a directed random walk refinement algorithm. Unlike traditional CNN-based methods, our approach enhances depth map consistency without requiring extensive training or large-scale datasets. The proposed method was evaluated on the 4D Light Field Benchmark dataset and a diverse set of real-world images. Experimental results indicate that while performance slightly declines under uncontrolled conditions, the algorithm consistently maintains low computational complexity and competitive accuracy compared to state-of-the-art deep learning models. These findings highlight the potential of our method as a robust and efficient alternative for depth estimation and segmentation in light field imaging. The work provides insights into practical algorithm design for light field-based pattern recognition and opens new directions for integrating probabilistic graph models with depth sensing frameworks.[109] Masked Clustering Prediction for Unsupervised Point Cloud Pre-training
Bin Ren,Xiaoshui Huang,Mengyuan Liu,Hong Liu,Fabio Poiesi,Nicu Sebe,Guofeng Mei
Main category: cs.CV
TL;DR: This paper proposes MaskClu, an unsupervised pre-training method for Vision Transformers on 3D point clouds that combines masked point modeling with clustering-based learning and global contrastive learning to learn richer semantic representations.
Details
Motivation: The challenge of learning dense and informative semantic features from point clouds via standard ViTs remains underexplored. Method: MaskClu integrates masked point modeling with clustering-based learning and introduces a global contrastive learning mechanism. Result: MaskClu sets new competitive results in multiple 3D tasks like part segmentation, semantic segmentation, object detection, and classification. Conclusion: MaskClu enables ViTs to learn richer and more semantically meaningful representations from 3D point clouds. Abstract: Vision transformers (ViTs) have recently been widely applied to 3D point cloud understanding, with masked autoencoding as the predominant pre-training paradigm. However, the challenge of learning dense and informative semantic features from point clouds via standard ViTs remains underexplored. We propose MaskClu, a novel unsupervised pre-training method for ViTs on 3D point clouds that integrates masked point modeling with clustering-based learning. MaskClu is designed to reconstruct both cluster assignments and cluster centers from masked point clouds, thus encouraging the model to capture dense semantic information. Additionally, we introduce a global contrastive learning mechanism that enhances instance-level feature learning by contrasting different masked views of the same point cloud. By jointly optimizing these complementary objectives, i.e., dense semantic reconstruction and instance-level contrastive learning, MaskClu enables ViTs to learn richer and more semantically meaningful representations from 3D point clouds. We validate the effectiveness of our method via multiple 3D tasks, including part segmentation, semantic segmentation, object detection, and classification, where MaskClu sets new competitive results. The code and models will be released at: https://github.com/Amazingren/maskclu.[110] Automatic and standardized surgical reporting for central nervous system tumors
David Bouget,Mathilde Gajda Faanes,Asgeir Store Jakola,Frederik Barkhof,Hilko Ardon,Lorenzo Bello,Mitchel S. Berger,Shawn L. Hervey-Jumper,Julia Furtner,Albert J. S. Idema,Barbara Kiesel,Georg Widhalm,Rishi Nandoe Tewarie,Emmanuel Mandonnet,Pierre A. Robe,Michiel Wagemakers,Timothy R. Smith,Philip C. De Witt Hamer,Ole Solheim,Ingerid Reinertsen
Main category: cs.CV
TL;DR: This study presents an automated pipeline for postoperative CNS tumor analysis using deep learning models (Attention U-Net and DenseNet), achieving high accuracy in segmentation and classification, and integrated into the open-source Raidionics platform following RANO 2.0 guidelines.
Details
Motivation: While automated analysis of preoperative CNS tumor data has advanced, postoperative imaging analysis has received limited attention. This study aims to fill that gap by developing a standardized pipeline for postoperative evaluation, improving clinical decision-making. Method: The study used Attention U-Net for tumor segmentation tasks and DenseNet for MR sequence and tumor type classification. The models were trained on multicenter datasets (2000 to 7000 patients) with 5-fold cross-validation and evaluated using patient-wise, voxel-wise, and object-wise metrics, benchmarked against BraTS challenge results. Result: Segmentation models achieved average voxel-wise Dice scores of 87%, 66%, 70%, and 77% for tumor core, non-enhancing tumor core, contrast-enhancing residual tumor, and resection cavity, respectively. Classification models achieved 99.5% balanced accuracy in MR sequence classification and 80% in tumor type classification. Conclusion: The study successfully developed a comprehensive automated pipeline for postoperative reporting of CNS tumors, integrating segmentation and classification models aligned with RANO 2.0 guidelines and implemented as an open-source platform, Raidionics. Abstract: Magnetic resonance (MR) imaging is essential for evaluating central nervous system (CNS) tumors, guiding surgical planning, treatment decisions, and assessing postoperative outcomes and complication risks. While recent work has advanced automated tumor segmentation and report generation, most efforts have focused on preoperative data, with limited attention to postoperative imaging analysis. This study introduces a comprehensive pipeline for standardized postsurgical reporting in CNS tumors. Using the Attention U-Net architecture, segmentation models were trained for the preoperative (non-enhancing) tumor core, postoperative contrast-enhancing residual tumor, and resection cavity. 
Additionally, MR sequence classification and tumor type identification for contrast-enhancing lesions were explored using the DenseNet architecture. The models were integrated into a reporting pipeline, following the RANO 2.0 guidelines. Training was conducted on multicentric datasets comprising 2000 to 7000 patients, using 5-fold cross-validation. Evaluation included patient-, voxel-, and object-wise metrics, with benchmarking against the latest BraTS challenge results. The segmentation models achieved average voxel-wise Dice scores of 87%, 66%, 70%, and 77% for the tumor core, non-enhancing tumor core, contrast-enhancing residual tumor, and resection cavity, respectively. Classification models reached 99.5% balanced accuracy in MR sequence classification and 80% in tumor type classification. The pipeline presented in this study enables robust, automated segmentation, MR sequence classification, and standardized report generation aligned with RANO 2.0 guidelines, enhancing postoperative evaluation and clinical decision-making. The proposed models and methods were integrated into Raidionics, an open-source software platform for CNS tumor analysis, now including a dedicated module for postsurgical analysis.[111] A Pseudo Global Fusion Paradigm-Based Cross-View Network for LiDAR-Based Place Recognition
Jintao Cheng,Jiehao Luo,Xieyuanli Chen,Jin Wu,Rui Fan,Xiaoyu Tang,Wei Zhang
Main category: cs.CV
TL;DR: The paper proposes a new LiDAR-based Place Recognition method that improves performance in complex environments through a novel cross-view network and geometric formulation of the feature space.
Details
Motivation: LiDAR-based Place Recognition is crucial for localization in GPS-denied environments and loop closure detection, yet existing methods limit performance due to Euclidean-centric formulations that do not capture nonlinear data distributions. Method: The paper introduces a novel cross-view network with a pseudo-global information guidance mechanism, together with a Manifold Adaptation and Pairwise Variance-Locality Learning Metric that uses an SPD matrix to compute Mahalanobis distance. Result: The proposed algorithm demonstrates superior performance in complex and temporally varying scenarios by accurately characterizing intrinsic data distributions and capturing complex inter-class dependencies. Conclusion: The proposed method excels in complex environmental conditions, achieving competitive performance in LiDAR-based Place Recognition. Abstract: LiDAR-based Place Recognition (LPR) remains a critical task in Embodied Artificial Intelligence (AI) and Autonomous Driving, primarily addressing localization challenges in GPS-denied environments and supporting loop closure detection. Existing approaches reduce place recognition to a Euclidean distance-based metric learning task, neglecting the feature space's intrinsic structures and intra-class variances. Such a Euclidean-centric formulation inherently limits the model's capacity to capture nonlinear data distributions, leading to suboptimal performance in complex environments and temporally varying scenarios. To address these challenges, we propose a novel cross-view network based on an innovative fusion paradigm. Our framework introduces a pseudo-global information guidance mechanism that coordinates multi-modal branches to perform feature learning within a unified semantic space. Concurrently, we propose a Manifold Adaptation and Pairwise Variance-Locality Learning Metric that constructs a Symmetric Positive Definite (SPD) matrix to compute Mahalanobis distance, superseding traditional Euclidean distance metrics. 
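As a rough illustration of the metric described above (not the paper's learned metric, whose construction the abstract does not specify), the sketch below builds an SPD matrix as M = L Lᵀ + εI and evaluates the Mahalanobis distance d(x, y) = sqrt((x − y)ᵀ M (x − y)). The factor L, the jitter ε, and the function names are assumptions chosen for the example; in a metric-learning setting L would typically be a learned parameter.

```python
import math


def spd_from_factor(factor, eps=1e-6):
    """Return M = L L^T + eps * I for a square factor L (list of rows).

    L L^T is symmetric positive semi-definite for any real L; adding
    eps * I with eps > 0 makes it strictly positive definite.
    """
    n = len(factor)
    M = [[sum(factor[i][k] * factor[j][k] for k in range(len(factor[i])))
          for j in range(n)] for i in range(n)]
    for i in range(n):
        M[i][i] += eps
    return M


def mahalanobis(x, y, M):
    """Distance sqrt((x - y)^T M (x - y)) under the SPD matrix M."""
    diff = [a - b for a, b in zip(x, y)]
    M_diff = [sum(m * d for m, d in zip(row, diff)) for row in M]
    return math.sqrt(sum(d * md for d, md in zip(diff, M_diff)))


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    # With M = I the metric reduces to the ordinary Euclidean distance.
    print(mahalanobis([0.0, 0.0], [3.0, 4.0], identity))  # 5.0

    # A non-identity SPD matrix stretches the first feature axis.
    M = spd_from_factor([[2.0, 0.0], [0.0, 1.0]], eps=0.0)
    print(mahalanobis([0.0, 0.0], [1.0, 0.0], M))  # 2.0
```

The identity case shows why Euclidean distance is a special case of this family: any other SPD matrix reweights and correlates feature directions, which is what lets such a metric reflect variance structure that a plain Euclidean distance ignores.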
This geometric formulation enables the model to accurately characterize intrinsic data distributions and capture complex inter-class dependencies within the feature space. Experimental results demonstrate that the proposed algorithm achieves competitive performance, particularly excelling in complex environmental conditions.
[112] Shape Completion and Real-Time Visualization in Robotic Ultrasound Spine Acquisitions
Miruna-Alexandra Gafencu,Reem Shaban,Yordanka Velikova,Mohammad Farid Azampour,Nassir Navab
Main category: cs.CV
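TL;DR: This study develops a real-time spine visualization system based on robotic ultrasound and deep learning, improving the consistency and reproducibility of spine imaging and reducing reliance on conventional CT images.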
Details
Motivation: Ultrasound imaging is widely used in spinal procedures, but its effectiveness is strongly affected by shadowing artifacts. Traditional approaches such as CT-to-US registration are limited by high complexity, differences in spine curvature, and the need for recent CT images, so an alternative is needed. Method: The system uses a robotic platform to autonomously acquire ultrasound sweeps of the lumbar spine, reconstructs the complete spinal anatomy with a deep learning-based shape completion network, and provides interactive real-time visualization with automatic repeat scanning. Result: Quantitative experiments validate the shape completion accuracy and evaluate multiple spine acquisition protocols on a phantom setup; qualitative visualization results are shown on a volunteer scan. Conclusion: The study presents a new system that combines robotic ultrasound with real-time shape completion to enhance spinal visualization, improving consistency, reproducibility, and anatomical understanding. Abstract: Ultrasound (US) imaging is increasingly used in spinal procedures due to its real-time, radiation-free capabilities; however, its effectiveness is hindered by shadowing artifacts that obscure deeper tissue structures. Traditional approaches, such as CT-to-US registration, incorporate anatomical information from preoperative CT scans to guide interventions, but they are limited by complex registration requirements, differences in spine curvature, and the need for recent CT imaging. Recent shape completion methods can offer an alternative by reconstructing spinal structures in US data, while being pretrained on a large set of publicly available CT scans. However, these approaches are typically offline and have limited reproducibility. In this work, we introduce a novel integrated system that combines robotic ultrasound with real-time shape completion to enhance spinal visualization. Our robotic platform autonomously acquires US sweeps of the lumbar spine, extracts vertebral surfaces from ultrasound, and reconstructs the complete anatomy using a deep learning-based shape completion network. This framework provides interactive, real-time visualization with the capability to autonomously repeat scans and can enable navigation to target locations. This can contribute to better consistency, reproducibility, and understanding of the underlying anatomy. We validate our approach through quantitative experiments assessing shape completion accuracy and evaluations of multiple spine acquisition protocols on a phantom setup. 
Additionally, we present qualitative results of the visualization on a volunteer scan.
[113] Accelerated Volumetric Compression without Hierarchies: A Fourier Feature Based Implicit Neural Representation Approach
Leona Žůrková,Petr Strakoš,Michal Kravčenko,Tomáš Brzobohatý,Lubomír Říha
Main category: cs.CV
TL;DR: The paper proposes a structure-free neural compression method that improves volumetric data efficiency with minimal quality loss and faster training times.
Details
Motivation: Volumetric data compression matters in fields like medical imaging, scientific simulation, and entertainment; the paper aims to provide a more efficient and compact representation method. Method: A dynamic voxel selection process uses morphological dilation to prioritize active regions, reducing redundant computation without hierarchical metadata. Result: Sparse training reduced training time by 63.7% with only minor quality loss, achieving a compression rate of 14 while eliminating traditional data-loading overhead. Conclusion: The structure-free neural compression method combines Fourier feature encoding with selective voxel sampling to yield compact volumetric representations and faster convergence, offering a scalable solution for practical applications. Abstract: Volumetric data compression is critical in fields like medical imaging, scientific simulation, and entertainment. We introduce a structure-free neural compression method combining Fourier feature encoding with selective voxel sampling, yielding compact volumetric representations and faster convergence. Our dynamic voxel selection uses morphological dilation to prioritize active regions, reducing redundant computation without any hierarchical metadata. In the experiment, sparse training reduced training time by 63.7% (from 30 to 11 minutes) with only minor quality loss: PSNR dropped 0.59 dB (from 32.60 to 32.01) and SSIM by 0.008 (from 0.948 to 0.940). The resulting neural representation, stored solely as network weights, achieves a compression rate of 14 and eliminates traditional data-loading overhead. 
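The two ingredients named above, Fourier feature encoding of coordinates and dilation-based voxel selection, can be sketched in plain Python. This is an illustration under assumptions, not the paper's implementation: the frequency rows `freqs`, the cubic structuring element, and both function names are hypothetical, and the abstract does not specify how active voxels are detected or how the frequencies are chosen.

```python
import math

def fourier_features(coord, freqs):
    """Encode a coordinate as [sin(2*pi*b.x), cos(2*pi*b.x)] for each frequency row b."""
    feats = []
    for b in freqs:
        phase = 2.0 * math.pi * sum(bi * ci for bi, ci in zip(b, coord))
        feats.extend([math.sin(phase), math.cos(phase)])
    return feats

def dilate(active, shape, radius=1):
    """Binary dilation of a set of voxel indices with a cubic structuring element."""
    grown = set()
    offsets = range(-radius, radius + 1)
    for (x, y, z) in active:
        for dx in offsets:
            for dy in offsets:
                for dz in offsets:
                    v = (x + dx, y + dy, z + dz)
                    # Discard neighbours that fall outside the volume.
                    if all(0 <= c < s for c, s in zip(v, shape)):
                        grown.add(v)
    return grown

# Grow the set of "active" voxels, then encode only those coordinates for training.
shape = (3, 3, 3)
training_voxels = dilate({(1, 1, 1)}, shape)
freqs = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]  # illustrative frequencies only
batch = [fourier_features([c / s for c, s in zip(v, shape)], freqs)
         for v in sorted(training_voxels)]
```

Because the selection is a plain set of indices recomputed on the fly, nothing like an octree needs to be stored alongside the network weights, which is the sense in which such a scheme is structure-free.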
This connects coordinate-based neural representation with efficient volumetric compression, offering a scalable, structure-free solution for practical applications.
[114] MADPromptS: Unlocking Zero-Shot Morphing Attack Detection with Multiple Prompt Aggregation
Eduarda Caldeira,Fadi Boutros,Naser Damer
Main category: cs.CV
TL;DR: This paper proposes a zero-shot approach for Face Morphing Attack Detection using CLIP by designing and aggregating multiple textual prompts, showing that prompt engineering can effectively utilize multimodal foundation models without fine-tuning.
Details
Motivation: The motivation stems from the challenge of Face Morphing Attacks in recognition systems and the underutilized potential of multimodal foundation models like CLIP for direct deployment without fine-tuning. Method: The research employs a zero-shot approach using the CLIP model, focusing on the design and aggregation of multiple textual prompts per class to align the model's representations with the MAD task. Result: The results demonstrate that aggregating textual prompts significantly improves the ability to detect morphed face attacks in a zero-shot setting, showcasing the viability of using CLIP's built-in multimodal knowledge. Conclusion: The study concludes that prompt aggregation can effectively enhance zero-shot detection performance for Face Morphing Attack Detection using CLIP, highlighting the importance of leveraging multimodal knowledge through prompt engineering. Abstract: Face Morphing Attack Detection (MAD) is a critical challenge in face recognition security, where attackers can fool systems by interpolating the identity information of two or more individuals into a single face image, resulting in samples that can be verified as belonging to multiple identities by face recognition systems. While multimodal foundation models (FMs) like CLIP offer strong zero-shot capabilities by jointly modeling images and text, most prior works on FMs for biometric recognition have relied on fine-tuning for specific downstream tasks, neglecting their potential for direct, generalizable deployment. This work explores a pure zero-shot approach to MAD by leveraging CLIP without any additional training or fine-tuning, focusing instead on the design and aggregation of multiple textual prompts per class. By aggregating the embeddings of diverse prompts, we better align the model's internal representations with the MAD task, capturing richer and more varied cues indicative of bona-fide or attack samples. 
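A minimal sketch of prompt aggregation for zero-shot classification, using toy vectors in place of real CLIP embeddings. Averaging normalized prompt embeddings and scoring by cosine similarity is one common aggregation scheme and is an assumption here; the abstract does not state which aggregation the authors use. All names and numbers below are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def aggregate_prompts(prompt_embeddings):
    """Average the unit-normalized prompt embeddings of one class, then renormalize."""
    unit = [normalize(e) for e in prompt_embeddings]
    dim = len(unit[0])
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return normalize(mean)

def classify(image_embedding, class_prompts):
    """Pick the class whose aggregated prompt embedding is most cosine-similar to the image."""
    img = normalize(image_embedding)
    scores = {
        label: sum(a * b for a, b in zip(img, aggregate_prompts(embs)))
        for label, embs in class_prompts.items()
    }
    return max(scores, key=scores.get), scores

# Toy 2-D stand-ins for text embeddings of several prompts per class.
class_prompts = {
    "bona fide": [[1.0, 0.1], [0.9, 0.0]],
    "attack": [[0.0, 1.0], [0.2, 0.9]],
}
label, scores = classify([0.8, 0.3], class_prompts)
```

In practice the embeddings would come from a text and image encoder; no training step is involved, which is what makes the setting zero-shot.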
Our results show that prompt aggregation substantially improves zero-shot detection performance, demonstrating the effectiveness of exploiting foundation models' built-in multimodal knowledge through efficient prompt engineering.
[115] UniSTFormer: Unified Spatio-Temporal Lightweight Transformer for Efficient Skeleton-Based Action Recognition
Wenhan Wu,Zhishuai Guo,Chen Chen,Aidong Lu
Main category: cs.CV
TL;DR: This paper proposes a lightweight transformer framework for skeleton-based action recognition that integrates spatial and temporal modeling and uses a simplified multi-scale pooling fusion module, achieving high efficiency with minimal performance loss.
Details
Motivation: Existing skeleton-based action recognition methods rely on complex module compositions and heavy designs, which increase parameter counts, computational costs, and limit scalability. This work aims to address these limitations. Method: The authors propose a unified spatio-temporal lightweight transformer framework that integrates spatial and temporal modeling within a single attention module and introduces a simplified multi-scale pooling fusion module. Result: Extensive experiments show that the proposed model reduces parameter complexity by over 58% and computational cost by over 60% compared to state-of-the-art transformer-based baselines while maintaining competitive recognition performance. Conclusion: The paper concludes that the proposed lightweight framework achieves a better balance between accuracy and efficiency, significantly reducing parameter complexity and computational cost while maintaining competitive performance. Abstract: Skeleton-based action recognition (SAR) has achieved impressive progress with transformer architectures. However, existing methods often rely on complex module compositions and heavy designs, leading to increased parameter counts, high computational costs, and limited scalability. In this paper, we propose a unified spatio-temporal lightweight transformer framework that integrates spatial and temporal modeling within a single attention module, eliminating the need for separate temporal modeling blocks. This approach reduces redundant computations while preserving temporal awareness within the spatial modeling process. Furthermore, we introduce a simplified multi-scale pooling fusion module that combines local and global pooling pathways to enhance the model's ability to capture fine-grained local movements and overarching global motion patterns. 
Extensive experiments on benchmark datasets demonstrate that our lightweight model achieves a superior balance between accuracy and efficiency, reducing parameter complexity by over 58% and lowering computational cost by over 60% compared to state-of-the-art transformer-based baselines, while maintaining competitive recognition performance.
[116] Lay2Story: Extending Diffusion Transformers for Layout-Togglable Story Generation
Ao Ma,Jiasong Feng,Ke Cao,Jing Wang,Yun Wang,Quanwei Zhang,Zhanjie Zhang
Main category: cs.CV
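TL;DR: This paper introduces a new Layout-Togglable Storytelling task and proposes the Lay2Story framework and accompanying datasets to improve subject consistency and fine-grained control in storytelling.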
Details
Motivation: Existing storytelling methods struggle to maintain subject consistency, and the field lacks high-quality datasets that enable precise control over a subject's position, appearance, clothing, expression, and posture. Method: The authors design Lay2Story, a robust framework built on the Diffusion Transformers (DiTs) architecture, and introduce Lay2Story-1M, a dataset of over 1 million images, together with Lay2Story-Bench, a benchmark of 3,000 prompts for evaluating methods. Result: In qualitative and quantitative experiments, Lay2Story achieves the best results in consistency, semantic correlation, and aesthetic quality. Conclusion: The paper proposes Lay2Story, a layout-conditioned, togglable storytelling framework that outperforms existing state-of-the-art methods in subject consistency and fine-grained control. Abstract: Storytelling tasks involving generating consistent subjects have gained significant attention recently. However, existing methods, whether training-free or training-based, continue to face challenges in maintaining subject consistency due to the lack of fine-grained guidance and inter-frame interaction. Additionally, the scarcity of high-quality data in this field makes it difficult to precisely control storytelling tasks, including the subject's position, appearance, clothing, expression, and posture, thereby hindering further advancements. In this paper, we demonstrate that layout conditions, such as the subject's position and detailed attributes, effectively facilitate fine-grained interactions between frames. This not only strengthens the consistency of the generated frame sequence but also allows for precise control over the subject's position, appearance, and other key details. Building on this, we introduce an advanced storytelling task: Layout-Togglable Storytelling, which enables precise subject control by incorporating layout conditions. To address the lack of high-quality datasets with layout annotations for this task, we develop Lay2Story-1M, which contains over 1 million 720p and higher-resolution images, processed from approximately 11,300 hours of cartoon videos. Building on Lay2Story-1M, we create Lay2Story-Bench, a benchmark with 3,000 prompts designed to evaluate the performance of different methods on this task. Furthermore, we propose Lay2Story, a robust framework based on the Diffusion Transformers (DiTs) architecture for Layout-Togglable Storytelling tasks. 
Through both qualitative and quantitative experiments, we find that our method outperforms the previous state-of-the-art (SOTA) techniques, achieving the best results in terms of consistency, semantic correlation, and aesthetic quality.
[117] Text-conditioned State Space Model For Domain-generalized Change Detection Visual Question Answering
Elman Ghazaei,Erchan Aptoula
Main category: cs.CV
TL;DR: This paper introduces the TCSSM framework and BrightVQA dataset to address domain shifts in Change Detection Visual Question Answering, showing improved performance over existing methods.
Details
Motivation: Traditional change detection methods require expert knowledge for interpretation, limiting broader access for non-expert users. Additionally, existing CDVQA methods assume similar distributions for training and testing datasets, which is not valid in real-world applications with domain shifts. The paper aims to address these limitations. Method: The paper introduces a new multi-modal and multi-domain dataset, BrightVQA, and proposes the Text-Conditioned State Space Model (TCSSM). This model dynamically predicts input-dependent parameters using bi-temporal images and textual descriptions to extract domain-invariant features. Result: Extensive experiments demonstrate that the proposed TCSSM framework outperforms state-of-the-art models in handling domain shifts in CDVQA tasks. Conclusion: The paper concludes that the proposed TCSSM framework effectively addresses the domain shift problem in CDVQA by leveraging both bi-temporal imagery and geo-disaster-related textual information, showing superior performance compared to state-of-the-art models. Abstract: The Earth's surface is constantly changing, and detecting these changes provides valuable insights that benefit various aspects of human society. While traditional change detection methods have been employed to detect changes from bi-temporal images, these approaches typically require expert knowledge for accurate interpretation. To enable broader and more flexible access to change information by non-expert users, the task of Change Detection Visual Question Answering (CDVQA) has been introduced. However, existing CDVQA methods have been developed under the assumption that training and testing datasets share similar distributions. This assumption does not hold in real-world applications, where domain shifts often occur. In this paper, the CDVQA task is revisited with a focus on addressing domain shift. 
To this end, a new multi-modal and multi-domain dataset, BrightVQA, is introduced to facilitate domain generalization research in CDVQA. Furthermore, a novel state space model, termed Text-Conditioned State Space Model (TCSSM), is proposed. The TCSSM framework is designed to leverage both bi-temporal imagery and geo-disaster-related textual information in a unified manner to extract domain-invariant features across domains. Input-dependent parameters in TCSSM are dynamically predicted from both the bi-temporal images and the geo-disaster-related description, thereby facilitating the alignment between bi-temporal visual data and the associated textual descriptions. Extensive experiments are conducted to evaluate the proposed method against state-of-the-art models, and superior performance is consistently demonstrated. The code and dataset will be made publicly available upon acceptance at https://github.com/Elman295/TCSSM.
[118] TaoCache: Structure-Maintained Video Generation Acceleration
Zhentao Fan,Zongzuo Wang,Weiwei Zhang
Main category: cs.CV
TL;DR: TaoCache is a training-free, plug-and-play caching strategy that improves video diffusion model acceleration by preserving structure and enhancing late denoising stages.
Details
Motivation: Existing cache-based acceleration methods for video diffusion models often lead to structural discrepancies and hinder instruction following and character consistency. Method: TaoCache uses a fixed-point perspective to predict the model's noise output, calibrating cosine similarities and norm ratios of consecutive noise deltas for effective late denoising. Result: TaoCache preserves high-resolution structure while enabling aggressive skipping and is orthogonal to complementary accelerations like Pyramid Attention Broadcast and TeaCache. Conclusion: TaoCache achieves higher visual quality compared to prior caching methods under the same speedups across multiple frameworks. Abstract: Existing cache-based acceleration methods for video diffusion models primarily skip early or mid denoising steps, which often leads to structural discrepancies relative to full-timestep generation and can hinder instruction following and character consistency. We present TaoCache, a training-free, plug-and-play caching strategy that, instead of residual-based caching, adopts a fixed-point perspective to predict the model's noise output and is specifically effective in late denoising stages. By calibrating cosine similarities and norm ratios of consecutive noise deltas, TaoCache preserves high-resolution structure while enabling aggressive skipping. The approach is orthogonal to complementary accelerations such as Pyramid Attention Broadcast (PAB) and TeaCache, and it integrates seamlessly into DiT-based frameworks. Across Latte-1, OpenSora-Plan v110, and Wan2.1, TaoCache attains substantially higher visual quality (LPIPS, SSIM, PSNR) than prior caching methods under the same speedups.
[119] ColorGPT: Leveraging Large Language Models for Multimodal Color Recommendation
Ding Xia,Naoto Inoue,Qianru Qiu,Kotaro Kikuchi
Main category: cs.CV
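TL;DR: This paper proposes ColorGPT, a pipeline built on pretrained large language models for color recommendation, which outperforms existing methods in color suggestion accuracy and color distribution.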
Details
Motivation: Traditional methods often struggle with color recommendation because of the complexity of color design and limited data availability, so the authors explore applying pretrained large language models to the task. Method: By systematically testing multiple color representations and applying effective prompt engineering techniques, the authors develop ColorGPT, a robust and rigorously validated pipeline. Result: Experiments show that the LLM-based pipeline outperforms existing methods in color suggestion accuracy and color distribution, and improves color diversity and similarity. Conclusion: Using pretrained large language models (LLMs) for color recommendation outperforms existing methods in color suggestion accuracy and color distribution, and also improves color diversity and similarity. Abstract: Colors play a crucial role in the design of vector graphic documents by enhancing visual appeal, facilitating communication, improving usability, and ensuring accessibility. In this context, color recommendation involves suggesting appropriate colors to complete or refine a design when one or more colors are missing or require alteration. Traditional methods often struggled with these challenges due to the complex nature of color design and the limited data availability. In this study, we explored the use of pretrained Large Language Models (LLMs) and their commonsense reasoning capabilities for color recommendation, raising the question: Can pretrained LLMs serve as superior designers for color recommendation tasks? To investigate this, we developed a robust, rigorously validated pipeline, ColorGPT, that was built by systematically testing multiple color representations and applying effective prompt engineering techniques. Our approach primarily targeted color palette completion by recommending colors based on a set of given colors and accompanying context. Moreover, our method can be extended to full palette generation, producing an entire color palette corresponding to a provided textual description. Experimental results demonstrated that our LLM-based pipeline outperformed existing methods in terms of color suggestion accuracy and the distribution of colors in the color palette completion task. For the full palette generation task, our approach also yielded improvements in color diversity and similarity compared to current techniques.
[120] KFFocus: Highlighting Keyframes for Enhanced Video Understanding
Ming Nie,Chunwei Wang,Hang Xu,Li Zhang
Main category: cs.CV
TL;DR: KFFocus improves video comprehension by efficiently compressing video tokens and emphasizing informative contexts, outperforming current methods in both efficiency and accuracy.
Details
Motivation: Current Vid-LLMs use uniform sampling and intra-frame compression, which can lead to omission of keyframes containing essential temporal and semantic details due to uneven temporal distribution of information. Method: KFFocus uses a keyframe-based compression method and a spatiotemporal modeling module to reduce token redundancy and enhance understanding of spatial-temporal dynamics. Result: Experiments show that KFFocus significantly outperforms existing methods on video understanding benchmarks, especially in long video scenarios, achieving better computational efficiency and accuracy. Conclusion: KFFocus provides a more efficient and effective approach for video comprehension in Vid-LLMs, particularly for long video sequences. Abstract: Recently, with the emergence of large language models, multimodal LLMs have demonstrated exceptional capabilities in image and video modalities. Despite advancements in video comprehension, the substantial computational demands of long video sequences lead current video LLMs (Vid-LLMs) to employ compression strategies at both the inter-frame level (e.g., uniform sampling of video frames) and intra-frame level (e.g., condensing all visual tokens of each frame into a limited number). However, this approach often neglects the uneven temporal distribution of critical information across frames, risking the omission of keyframes that contain essential temporal and semantic details. To tackle these challenges, we propose KFFocus, a method designed to efficiently compress video tokens and emphasize the informative context present within video frames. We substitute uniform sampling with a refined approach inspired by classic video compression principles to identify and capture keyframes based on their temporal redundancy. By assigning varying condensation ratios to frames based on their contextual relevance, KFFocus efficiently reduces token redundancy while preserving informative content details. 
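As a rough illustration of redundancy-based keyframe selection with per-frame condensation ratios, the sketch below keeps a frame as a keyframe when it differs enough from the last kept keyframe, then gives keyframes a larger token budget. The difference measure, the threshold, the budgets, and all function names are assumptions for illustration; the abstract does not specify KFFocus's actual criterion.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two frames given as flat lists of pixel values."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def select_keyframes(frames, threshold):
    """Keep frame 0, then every frame that differs from the last keyframe by more than threshold."""
    keep = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[keep[-1]]) > threshold:
            keep.append(i)
    return keep

def token_budget(num_frames, keyframes, key_tokens, other_tokens):
    """Assign more visual tokens to keyframes and fewer to temporally redundant frames."""
    keys = set(keyframes)
    return [key_tokens if i in keys else other_tokens for i in range(num_frames)]

# Toy video: frames 1 and 3 are near-duplicates of the frames before them.
frames = [[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 1.05], [0.0, 0.0]]
keyframes = select_keyframes(frames, threshold=0.5)
budgets = token_budget(len(frames), keyframes, key_tokens=64, other_tokens=16)
```

Compared with uniform sampling, this kind of selection spends tokens where the content changes, so a long static stretch costs little while a brief but distinct event is still kept.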
Additionally, we introduce a spatiotemporal modeling module that encodes both the temporal relationships between video frames and the spatial structure within each frame, thus providing Vid-LLMs with a nuanced understanding of spatial-temporal dynamics. Extensive experiments on widely recognized video understanding benchmarks, especially long video scenarios, demonstrate that KFFocus significantly outperforms existing methods, achieving substantial computational efficiency and enhanced accuracy.
[121] Spatial-Temporal Multi-Scale Quantization for Flexible Motion Generation
Zan Wang,Jingze Zhang,Yixin Chen,Baoxiong Jia,Wei Liang,Siyuan Huang
Main category: cs.CV
TL;DR: MSQ introduces a multi-scale quantization method for motion sequences that improves flexibility and performance in motion generation tasks.
Details
Motivation: Current motion representations are limited by their inability to capture multi-scale motion patterns and lack of compositional flexibility. Method: MSQ compresses motion sequences into multi-scale discrete tokens using different encoders for spatial granularities and temporal interpolation. Result: The approach outperforms baseline methods on various benchmarks and allows seamless composition of motion tokens without specialized design or re-training. Conclusion: MSQ provides a new multi-scale discrete token representation for motion sequences that improves compositional flexibility and generalization in generation tasks. Abstract: Despite significant advancements in human motion generation, current motion representations, typically formulated as discrete frame sequences, still face two critical limitations: (i) they fail to capture motion from a multi-scale perspective, limiting the capability in complex patterns modeling; (ii) they lack compositional flexibility, which is crucial for model's generalization in diverse generation tasks. To address these challenges, we introduce MSQ, a novel quantization method that compresses the motion sequence into multi-scale discrete tokens across spatial and temporal dimensions. MSQ employs distinct encoders to capture body parts at varying spatial granularities and temporally interpolates the encoded features into multiple scales before quantizing them into discrete tokens. Building on this representation, we establish a generative mask modeling model to effectively support motion editing, motion control, and conditional motion generation. Through quantitative and qualitative analysis, we show that our quantization method enables the seamless composition of motion tokens without requiring specialized design or re-training. 
Furthermore, extensive evaluations demonstrate that our approach outperforms existing baseline methods on various benchmarks.
[122] UniConvNet: Expanding Effective Receptive Field while Maintaining Asymptotically Gaussian Distribution for ConvNets of Any Scale
Yuhao Wang,Wei Xi
Main category: cs.CV
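TL;DR: This paper proposes a new ConvNet design that effectively expands the receptive field while maintaining the asymptotically Gaussian distribution of the ERF, outperforming existing CNN and ViT models on multiple vision recognition tasks.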
Details
Motivation: Existing convolutional neural networks (ConvNets) with a large effective receptive field (ERF) are effective, but they incur high parameter counts and FLOPs and disrupt the asymptotically Gaussian distribution (AGD) of the ERF. This work aims to expand the ERF more effectively and efficiently while maintaining the AGD. Method: The paper proposes a Three-layer Receptive Field Aggregator and a Layer Operator as the fundamental operator, combining smaller kernels (such as 7×7, 9×9, and 11×11) to expand the receptive field while maintaining the AGD of the ERF. Result: UniConvNet-T achieves 84.2% top-1 accuracy on ImageNet-1K with 30M parameters and 5.1G FLOPs, and UniConvNet-XL reaches 88.4% top-1 accuracy on ImageNet, showing strong performance and scalability. Conclusion: By stacking Three-layer Receptive Field Aggregator modules, UniConvNet expands the ERF to the level of existing large-kernel ConvNets while maintaining its AGD. The approach outperforms existing CNN and ViT models across vision recognition tasks and applies to both lightweight and large-scale models. Abstract: Convolutional neural networks (ConvNets) with a large effective receptive field (ERF), still in their early stages, have demonstrated promising effectiveness while constrained by high parameters and FLOPs costs and disrupted asymptotically Gaussian distribution (AGD) of ERF. This paper proposes an alternative paradigm: rather than merely employing extremely large ERF, it is more effective and efficient to expand the ERF while maintaining AGD of ERF by proper combination of smaller kernels, such as $7\times{7}$, $9\times{9}$, $11\times{11}$. This paper introduces a Three-layer Receptive Field Aggregator and designs a Layer Operator as the fundamental operator from the perspective of receptive field. The ERF can be expanded to the level of existing large-kernel ConvNets through the stack of proposed modules while maintaining AGD of ERF. Using these designs, we propose a universal model for ConvNet of any scale, termed UniConvNet. Extensive experiments on ImageNet-1K, COCO2017, and ADE20K demonstrate that UniConvNet outperforms state-of-the-art CNNs and ViTs across various vision recognition tasks for both lightweight and large-scale models with comparable throughput. Surprisingly, UniConvNet-T achieves $84.2\%$ ImageNet top-1 accuracy with $30M$ parameters and $5.1G$ FLOPs. UniConvNet-XL also shows competitive scalability to big data and large models, acquiring $88.4\%$ top-1 accuracy on ImageNet. 
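For intuition about why combining smaller kernels can reach a large receptive field, the standard formula for stride-1, undilated convolutions applied in sequence is RF = 1 + Σ(kᵢ − 1). The sketch below applies it to the kernel sizes mentioned above. It computes the theoretical receptive field, not the ERF, and it assumes a purely sequential arrangement, which may differ from how the Three-layer Receptive Field Aggregator actually combines its kernels.

```python
def stacked_receptive_field(kernel_sizes):
    """Theoretical receptive field of stride-1, undilated convolutions applied in sequence."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each layer widens the field by (kernel size - 1)
    return rf

def conv_weights_per_channel(kernel_sizes):
    """Number of spatial weights per channel for depthwise k x k kernels."""
    return sum(k * k for k in kernel_sizes)

small_stack_rf = stacked_receptive_field([7, 9, 11])
small_stack_weights = conv_weights_per_channel([7, 9, 11])
single_kernel_weights = conv_weights_per_channel([small_stack_rf])
```

Under these assumptions the three small kernels cover the same theoretical field as one much larger kernel while using fewer spatial weights per channel, which is the kind of trade-off the abstract points to.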
Code and models are publicly available at https://github.com/ai-paperwithcode/UniConvNet.
[123] Towards Perfection: Building Inter-component Mutual Correction for Retinex-based Low-light Image Enhancement
Luyang Cao,Han Xu,Jian Zhang,Lei Qi,Jiayi Ma,Yinghuan Shi,Yang Gao
Main category: cs.CV
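TL;DR: This paper proposes IRetinex, a new low-light image enhancement method that significantly improves image quality by reducing inter-component residuals (ICR).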
Details
Motivation: Retinex-based deep learning methods for low-light image enhancement have attracted attention for their interpretability, but perfectly decomposing the illumination and reflectance components is difficult, and the resulting inter-component residuals (ICR) have been underestimated, harming decomposition accuracy and image quality. Method: The authors propose an Inter-correction Retinex model (IRetinex) that uses an inter-component residual reduction module in the decomposition stage and exploits the feature similarity between the two components in the enhancement stage to detect and mitigate the impact of ICR. Result: On three low-light benchmark datasets, the method demonstrates the effectiveness of reducing ICR and outperforms existing methods. Conclusion: By reducing ICR, the method outperforms state-of-the-art approaches both qualitatively and quantitatively on three low-light benchmark datasets. Abstract: In low-light image enhancement, Retinex-based deep learning methods have garnered significant attention due to their exceptional interpretability. These methods decompose images into mutually independent illumination and reflectance components, allowing each component to be enhanced separately. In fact, achieving perfect decomposition of illumination and reflectance components proves to be quite challenging, with some residuals still existing after decomposition. In this paper, we formally name these residuals inter-component residuals (ICR), which have been largely underestimated by previous methods. In our investigation, ICR not only affects the accuracy of the decomposition but also causes enhanced components to deviate from the ideal outcome, ultimately reducing the final synthesized image quality. To address this issue, we propose a novel Inter-correction Retinex model (IRetinex) to alleviate ICR during the decomposition and enhancement stages. In the decomposition stage, we leverage an inter-component residual reduction module to reduce the feature similarity between illumination and reflectance components. In the enhancement stage, we utilize the feature similarity between the two components to detect and mitigate the impact of ICR within each enhancement unit. Extensive experiments on three low-light benchmark datasets demonstrated that by reducing ICR, our method outperforms state-of-the-art approaches both qualitatively and quantitatively.
[124] Uncertainty-aware Cross-training for Semi-supervised Medical Image Segmentation
Kaiwen Huang,Tao Zhou,Huazhu Fu,Yizhe Zhang,Yi Zhou,Xiao-Jun Wu
Main category: cs.CV
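TL;DR: This paper proposes UC-Seg, an uncertainty-aware cross-training framework that uses two subnets and a consistency preservation strategy to improve semi-supervised medical image segmentation.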
Details
Motivation: Semi-supervised learning reduces reliance on expert annotations in medical image segmentation, but existing methods suffer from cognitive biases, and pseudo-label generation remains challenging. Method: The authors propose an Uncertainty-aware Cross-training framework (UC-Seg) comprising a Cross-subnet Consistency Preservation (CCP) strategy and an Uncertainty-aware Pseudo-label Generation (UPG) component. Result: UC-Seg outperforms existing semi-supervised methods across a variety of medical image segmentation tasks. Conclusion: By combining two subnets with an uncertainty-aware mechanism, UC-Seg achieves superior segmentation accuracy and generalization in semi-supervised medical image segmentation. Abstract: Semi-supervised learning has gained considerable popularity in medical image segmentation tasks due to its capability to reduce reliance on expert-examined annotations. Several mean-teacher (MT) based semi-supervised methods utilize consistency regularization to effectively leverage valuable information from unlabeled data. However, these methods often heavily rely on the student model and overlook the potential impact of cognitive biases within the model. Furthermore, some methods employ co-training using pseudo-labels derived from different inputs, yet generating high-confidence pseudo-labels from perturbed inputs during training remains a significant challenge. In this paper, we propose an Uncertainty-aware Cross-training framework for semi-supervised medical image Segmentation (UC-Seg). Our UC-Seg framework incorporates two distinct subnets to effectively explore and leverage the correlation between them, thereby mitigating cognitive biases within the model. Specifically, we present a Cross-subnet Consistency Preservation (CCP) strategy to enhance feature representation capability and ensure feature consistency across the two subnets. This strategy enables each subnet to correct its own biases and learn shared semantics from both labeled and unlabeled data. Additionally, we propose an Uncertainty-aware Pseudo-label Generation (UPG) component that leverages segmentation results and corresponding uncertainty maps from both subnets to generate high-confidence pseudo-labels. We extensively evaluate the proposed UC-Seg on various medical image segmentation tasks involving different modality images, such as MRI, CT, ultrasound, colonoscopy, and so on. 
The results demonstrate that our method achieves superior segmentation accuracy and generalization performance compared to other state-of-the-art semi-supervised methods. Our code will be released at https://github.com/taozh2017/UCSeg.
[125] When Deepfakes Look Real: Detecting AI-Generated Faces with Unlabeled Data due to Annotation Challenges
Zhiqiang Yang,Renshuai Tao,Xiaolong Zheng,Guodong Yang,Chunjie Zhang
Main category: cs.CV
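TL;DR: This paper proposes DPGNet to address the domain gap and the use of unlabeled data in deepfake detection, improving detection performance through text-guided cross-domain alignment and curriculum-driven pseudo label generation.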
Details
Motivation: Existing deepfake detection methods depend on labeled training data, but as AI-generated content becomes more realistic, manual annotation becomes time-consuming and unreliable. Method: The authors introduce a Dual-Path Guidance Network with text-guided cross-domain alignment and curriculum-driven pseudo label generation modules, and use cross-domain knowledge distillation to prevent catastrophic forgetting. Result: Experiments on 11 popular datasets show that DPGNet outperforms existing state-of-the-art approaches by 6.3%. Conclusion: DPGNet effectively addresses the domain gap and the utilization of unlabeled data in deepfake detection and outperforms existing state-of-the-art approaches. Abstract: Existing deepfake detection methods heavily depend on labeled training data. However, as AI-generated content becomes increasingly realistic, even human annotators struggle to distinguish between deepfakes and authentic images. This makes the labeling process both time-consuming and less reliable. Specifically, there is a growing demand for approaches that can effectively utilize large-scale unlabeled data from online social networks. Unlike typical unsupervised learning tasks, where categories are distinct, AI-generated faces closely mimic real image distributions and share strong similarities, causing a performance drop in conventional strategies. In this paper, we introduce the Dual-Path Guidance Network (DPGNet) to tackle two key challenges: (1) bridging the domain gap between faces from different generation models, and (2) utilizing unlabeled image samples. The method features two core modules: text-guided cross-domain alignment, which uses learnable prompts to unify visual and textual embeddings into a domain-invariant feature space, and curriculum-driven pseudo label generation, which dynamically exploits more informative unlabeled samples. To prevent catastrophic forgetting, we also facilitate bridging between domains via cross-domain knowledge distillation. Extensive experiments on 11 popular datasets show that DPGNet outperforms SoTA approaches by 6.3%, highlighting its effectiveness in leveraging unlabeled data to address the annotation challenges posed by the increasing realism of deepfakes.
[126] Spatial Traces: Enhancing VLA Models with Spatial-Temporal Understanding
Maxim A. Patratskiy,Alexey K. Kovalev,Aleksandr I. Panov
Main category: cs.CV
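TL;DR: This paper introduces a new approach for vision-language-action models that projects visual traces of key points onto depth maps to capture spatial and temporal information simultaneously, yielding notable gains in task-solving ability.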
Details
Motivation: While recent research has focused on enhancing spatial and temporal understanding independently, this paper proposes a new approach that combines the two.
Method: Projects visual traces of key points from observations onto depth maps so that the model captures spatial and temporal information simultaneously.
Result: Experiments in SimplerEnv show that the mean number of successfully solved tasks increased by 4% compared to SpatialVLA and by 19% compared to TraceVLA.
Conclusion: The paper proposes a new method that integrates spatial and temporal understanding through visual prompting, improves task-solving ability, and is valuable for real-world applications where data collection is challenging.
Abstract: Vision-Language-Action models have demonstrated remarkable capabilities in predicting agent movements within virtual environments and real-world scenarios based on visual observations and textual instructions. Although recent research has focused on enhancing spatial and temporal understanding independently, this paper presents a novel approach that integrates both aspects through visual prompting. We introduce a method that projects visual traces of key points from observations onto depth maps, enabling models to capture both spatial and temporal information simultaneously. The experiments in SimplerEnv show that the mean number of tasks successfully solved increased by 4% compared to SpatialVLA and by 19% compared to TraceVLA. Furthermore, we show that this enhancement can be achieved with minimal training data, making it particularly valuable for real-world applications where data collection is challenging. The project page is available at https://ampiromax.github.io/ST-VLA.
[127] Per-Query Visual Concept Learning
Ori Malca,Dvir Samuel,Gal Chechik
Main category: cs.CV
TL;DR: This paper proposes an improved method for text-to-image personalization by incorporating a specific personalization step that enhances semantic similarity using attention-based loss terms and PDM features.
Details
Motivation: The motivation is to enhance the process of visual concept learning or text-to-image personalization, which has applications in various fields like product placement, entertainment, and personalized design.
Method: The method involves augmenting existing personalization techniques with a new step that utilizes PDM features and applies two loss terms related to attention mechanisms to capture identity.
Result: The result demonstrates significant improvements in personalized semantic similarity when the proposed method is applied on top of six different personalization methods and various base text-to-image models.
Conclusion: The paper concludes that by adding a prompt- and noise-seed-specific personalization step using self- and cross-attention loss terms, the method significantly improves personalized semantic similarity over existing methods.
Abstract: Visual concept learning, also known as text-to-image personalization, is the process of teaching new concepts to a pretrained model. This has numerous applications from product placement to entertainment and personalized design. Here we show that many existing methods can be substantially augmented by adding a personalization step that is (1) specific to the prompt and noise seed, and (2) using two loss terms based on the self- and cross-attention, capturing the identity of the personalized concept. Specifically, we leverage PDM features -- previously designed to capture identity -- and show how they can be used to improve personalized semantic similarity. We evaluate the benefit that our method gains on top of six different personalization methods, and several base text-to-image models (both UNet- and DiT-based). We find significant improvements even over previous per-query personalization methods.
[128] ALFred: An Active Learning Framework for Real-world Semi-supervised Anomaly Detection with Adaptive Thresholds
Shanle Yao,Ghazal Alinezhad Noghre,Armin Danesh Pazho,Hamed Tabkhi
Main category: cs.CV
TL;DR: The paper proposes an active learning framework with a human-in-the-loop approach for Video Anomaly Detection, enabling adaptive thresholds and improved performance in dynamic real-world environments.
Details
Motivation: The motivation is to address the limitations of traditional VAD evaluation metrics, which rely on static assumptions and fail to identify thresholds distinguishing normal from anomalous behavior in dynamic settings. This is due to the dynamic nature of human actions, environmental variations, and domain shifts in real-world applications.
Method: The paper introduces an active learning framework for VAD that continuously selects the most informative data points for labeling, incorporates a human-in-the-loop mechanism to identify actual normal and anomalous instances, and defines an adaptive threshold for different environments. It is tested using a lab-based framework simulating real-world conditions with a new evaluation metric.
Result: The experimental results show that the proposed method achieves an EBI (Error Balance Index) of 68.91 for Q3 in real-world simulated scenarios, demonstrating its practical effectiveness in enhancing VAD adaptability.
Conclusion: The paper concludes that the proposed active learning framework with a human-in-the-loop mechanism significantly enhances the applicability of Video Anomaly Detection (VAD) in dynamic environments by adapting to changing conditions and defining environment-specific thresholds.
Abstract: Video Anomaly Detection (VAD) can play a key role in spotting unusual activities in video footage. VAD is difficult to use in real-world settings due to the dynamic nature of human actions, environmental variations, and domain shifts. Traditional evaluation metrics often prove inadequate for such scenarios, as they rely on static assumptions and fall short of identifying a threshold that distinguishes normal from anomalous behavior in dynamic settings. To address this, we introduce an active learning framework tailored for VAD, designed for adapting to the ever-changing real-world conditions.
Our approach leverages active learning to continuously select the most informative data points for labeling, thereby enhancing model adaptability. A critical innovation is the incorporation of a human-in-the-loop mechanism, which enables the identification of actual normal and anomalous instances from pseudo-labeling results generated by AI. This collected data allows the framework to define an adaptive threshold tailored to different environments, ensuring that the system remains effective as the definition of 'normal' shifts across various settings. Implemented within a lab-based framework that simulates real-world conditions, our approach allows rigorous testing and refinement of VAD algorithms with a new metric. Experimental results show that our method achieves an EBI (Error Balance Index) of 68.91 for Q3 in real-world simulated scenarios, demonstrating its practical effectiveness and significantly enhancing the applicability of VAD in dynamic environments.
[129] VLM-3D: End-to-End Vision-Language Models for Open-World 3D Perception
Fuhao Chang,Shuxin Li,Yabei Li,Lei He
Main category: cs.CV
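TL;DR: This paper proposes VLM-3D, a new end-to-end framework that uses low-rank adaptation and a joint semantic-geometric loss design to significantly improve 3D perception accuracy in autonomous driving scenarios.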
Details
Motivation: Autonomous driving systems face the challenge of open-set perception in complex traffic environments, particularly in recognizing previously unseen object categories. Vision-language models (VLMs) have strong semantic reasoning capabilities, but existing approaches suffer from multi-stage error propagation, which hurts perception accuracy.
Method: VLM-3D uses Low-Rank Adaptation (LoRA) to quickly adapt VLMs to driving tasks and adopts a joint semantic-geometric loss design, with a semantic loss in early training and a 3D IoU loss in later stages.
Result: Evaluation on the nuScenes dataset shows that the joint semantic-geometric loss design of VLM-3D improves perception accuracy by 12.8%.
Conclusion: VLM-3D is a new end-to-end framework that, by introducing low-rank adaptation and a joint semantic-geometric loss design, effectively improves 3D perception accuracy in autonomous driving scenarios.
Abstract: Open-set perception in complex traffic environments poses a critical challenge for autonomous driving systems, particularly in identifying previously unseen object categories, which is vital for ensuring safety. Visual Language Models (VLMs), with their rich world knowledge and strong semantic reasoning capabilities, offer new possibilities for addressing this task. However, existing approaches typically leverage VLMs to extract visual features and couple them with traditional object detectors, resulting in multi-stage error propagation that hinders perception accuracy. To overcome this limitation, we propose VLM-3D, the first end-to-end framework that enables VLMs to perform 3D geometric perception in autonomous driving scenarios. VLM-3D incorporates Low-Rank Adaptation (LoRA) to efficiently adapt VLMs to driving tasks with minimal computational overhead, and introduces a joint semantic-geometric loss design: token-level semantic loss is applied during early training to ensure stable convergence, while 3D IoU loss is introduced in later stages to refine the accuracy of 3D bounding box predictions. Evaluations on the nuScenes dataset demonstrate that the proposed joint semantic-geometric loss in VLM-3D leads to a 12.8% improvement in perception accuracy, fully validating the effectiveness and advancement of our method.
[130] Scaling Learned Image Compression Models up to 1 Billion
Yuqi Li,Haotian Zhang,Li Li,Dong Liu,Feng Wu
Main category: cs.CV
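TL;DR: This paper studies how model scale affects learned image compression performance, finds that scaling up model size significantly improves performance, and points to a potential connection between compression and intelligence.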
Details
Motivation: Current learned image compression models are limited in scale, which restricts their representation capacity, and the effect of model scale on compression performance has not been explored.
Method: Using the HPCM model as the baseline, the authors scale model parameters from 68.5 million to 1 billion and fit power-law relations between test loss and both model size and training compute.
Result: Experiments show that the scaled-up HPCM-1B model achieves state-of-the-art rate-distortion performance, and reveal a trend between model scale and performance that can be extrapolated.
Conclusion: The paper concludes that scaling up model size can significantly improve the performance of learned image compression models, and points to a potential connection between compression and intelligence.
Abstract: Recent advances in large language models (LLMs) highlight a strong connection between intelligence and compression. Learned image compression, a fundamental task in modern data compression, has made significant progress in recent years. However, current models remain limited in scale, restricting their representation capacity, and how scaling model size influences compression performance remains unexplored. In this work, we present a pioneering study on scaling up learned image compression models and revealing the performance trends through scaling laws. Using the recent state-of-the-art HPCM model as baseline, we scale model parameters from 68.5 million to 1 billion and fit power-law relations between test loss and key scaling variables, including model size and optimal training compute. The results reveal a scaling trend, enabling extrapolation to larger scale models. Experimental results demonstrate that the scaled-up HPCM-1B model achieves state-of-the-art rate-distortion performance. We hope this work inspires future exploration of large-scale compression models and deeper investigations into the connection between compression and intelligence.
[131] Addressing Bias in VLMs for Glaucoma Detection Without Protected Attribute Supervision
Ahsan Habib Akash,Greg Murray,Annahita Amireskandari,Joel Palko,Carol Laxson,Binod Bhattarai,Prashnna Gyawali
Main category: cs.CV
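TL;DR: This paper proposes a new unsupervised debiasing method for vision-language models in glaucoma screening that reduces prediction disparities across subgroups and improves fairness.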
Details
Motivation: Vision-language models perform well on multimodal tasks but can exhibit demographic bias even without explicit protected attributes. This paper aims to address model unfairness toward underserved groups in glaucoma screening.
Method: Building on a reweighting-based contrastive learning framework, the method infers proxy subgroups through unsupervised clustering of image embeddings and computes gradient-similarity weights that are applied to a weighted joint objective.
Result: The method is evaluated on the Harvard FairVLMed dataset, using Equalized Odds Distance (EOD), Equalized Subgroup AUC (ES AUC), and Groupwise AUC to demonstrate equitable performance across the inferred subgroups.
Conclusion: The paper proposes an attribute-agnostic debiasing method that reduces subgroup disparities in glaucoma screening from retinal images and achieves equitable model performance.
Abstract: Vision-Language Models (VLMs) have achieved remarkable success on multimodal tasks such as image-text retrieval and zero-shot classification, yet they can exhibit demographic biases even when explicit protected attributes are absent during training. In this work, we focus on automated glaucoma screening from retinal fundus images, a critical application given that glaucoma is a leading cause of irreversible blindness and disproportionately affects underserved populations. Building on a reweighting-based contrastive learning framework, we introduce an attribute-agnostic debiasing method that (i) infers proxy subgroups via unsupervised clustering of image-image embeddings, (ii) computes gradient-similarity weights between the CLIP-style multimodal loss and a SimCLR-style image-pair contrastive loss, and (iii) applies these weights in a joint, top-$k$ weighted objective to upweight underperforming clusters. This label-free approach adaptively targets the hardest examples, thereby reducing subgroup disparities. We evaluate our method on the Harvard FairVLMed glaucoma subset, reporting Equalized Odds Distance (EOD), Equalized Subgroup AUC (ES AUC), and Groupwise AUC to demonstrate equitable performance across inferred demographic subgroups.
[132] Deep Learning Models for Robust Facial Liveness Detection
Oleksandr Kuznetsov,Emanuele Frontoni,Luca Romeo,Riccardo Rosati,Andrea Maranesi,Alessandro Muscatello
Main category: cs.CV
TL;DR: This study introduces improved deep learning models for facial recognition security, effectively countering advanced spoofing attacks and achieving 99.9% accuracy.
Details
Motivation: The increasing sophistication of spoofing attacks, especially those using deepfakes and AI-driven manipulations, has exposed a significant gap in current liveness detection methodologies, necessitating the development of more robust anti-spoofing techniques.
Method: The research employs novel deep learning models that integrate texture analysis and reflective properties of genuine human traits to distinguish real biometric features from spoofed ones. Evaluations were conducted across five diverse datasets with various attack vectors and environmental conditions.
Result: The best model, AttackNet V2.2, achieved a 99.9% average accuracy when trained on combined data, demonstrating a substantial improvement over existing systems. The study also provided insights into the behavioral patterns of impostor attacks.
Conclusion: The study successfully developed advanced deep learning models, namely AttackNet V2.2, which significantly enhance the liveness detection in facial recognition systems, making them more resilient to spoofing attacks. These models contribute to increased confidence in biometric systems across various sectors.
Abstract: In the rapidly evolving landscape of digital security, biometric authentication systems, particularly facial recognition, have emerged as integral components of various security protocols. However, the reliability of these systems is compromised by sophisticated spoofing attacks, where imposters gain unauthorized access by falsifying biometric traits. Current literature reveals a concerning gap: existing liveness detection methodologies - designed to counteract these breaches - fall short against advanced spoofing tactics employing deepfakes and other artificial intelligence-driven manipulations. This study introduces a robust solution through novel deep learning models addressing the deficiencies in contemporary anti-spoofing techniques.
By innovatively integrating texture analysis and reflective properties associated with genuine human traits, our models distinguish authentic presence from replicas with remarkable precision. Extensive evaluations were conducted across five diverse datasets, encompassing a wide range of attack vectors and environmental conditions. Results demonstrate substantial advancement over existing systems, with our best model (AttackNet V2.2) achieving 99.9% average accuracy when trained on combined data. Moreover, our research unveils critical insights into the behavioral patterns of impostor attacks, contributing to a more nuanced understanding of their evolving nature. The implications are profound: our models do not merely fortify the authentication processes but also instill confidence in biometric systems across various sectors reliant on secure access.
[133] Turbo-VAED: Fast and Stable Transfer of Video-VAEs to Mobile Devices
Ya Zou,Jingfeng Yao,Siyuan Yu,Shuai Zhang,Wenyu Liu,Xinggang Wang
Main category: cs.CV
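TL;DR: Through architectural optimization and an improved training method, Turbo-VAED achieves efficient video VAE decoding on mobile devices for the first time, balancing speed, parameter count, and reconstruction quality.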
Details
Motivation: The VAEs in currently popular video generation models are a computational bottleneck when deployed on mobile devices: large parameter counts and mismatched kernels cause out-of-memory errors or slow inference. A low-cost way to efficiently transfer video VAEs to mobile devices is therefore needed.
Method: The authors analyze redundancy in existing VAE architectures and introduce 3D depthwise separable convolutions to reduce the parameter count; propose a decoupled 3D pixel shuffle scheme to optimize upsampling; and develop an efficient VAE decoder training method that trains only the decoder for fast mobile adaptation.
Result: The method achieves up to 84.5x speedup on GPUs, reduces the parameter count to 17.5% of the original, and retains 96.9% of the original reconstruction quality. On the iPhone 16 Pro, it achieves a 2.9x FPS improvement over mobile-optimized VAEs with better reconstruction quality.
Conclusion: Turbo-VAED is an efficient decoder for video generation models that, by reducing parameter count and improving hardware suitability, achieves real-time 720p video VAE decoding on mobile devices for the first time.
Abstract: There is a growing demand for deploying large generative AI models on mobile devices. For recent popular video generative models, however, the Variational AutoEncoder (VAE) represents one of the major computational bottlenecks. Both large parameter sizes and mismatched kernels cause out-of-memory errors or extremely slow inference on mobile devices. To address this, we propose a low-cost solution that efficiently transfers widely used video VAEs to mobile devices. (1) We analyze redundancy in existing VAE architectures and get empirical design insights. By integrating 3D depthwise separable convolutions into our model, we significantly reduce the number of parameters. (2) We observe that the upsampling techniques in mainstream video VAEs are poorly suited to mobile hardware and form the main bottleneck. In response, we propose a decoupled 3D pixel shuffle scheme that slashes end-to-end delay. Building upon these, we develop a universal mobile-oriented VAE decoder, Turbo-VAED. (3) We propose an efficient VAE decoder training method. Since only the decoder is used during deployment, we distill it to Turbo-VAED instead of retraining the full VAE, enabling fast mobile adaptation with minimal performance loss. To our knowledge, our method enables real-time 720p video VAE decoding on mobile devices for the first time. This approach is widely applicable to most video VAEs.
When integrated into four representative models, with training cost as low as $95, it accelerates original VAEs by up to 84.5x at 720p resolution on GPUs, uses as low as 17.5% of original parameter count, and retains 96.9% of the original reconstruction quality. Compared to mobile-optimized VAEs, Turbo-VAED achieves a 2.9x speedup in FPS and better reconstruction quality on the iPhone 16 Pro. The code and models will soon be available at https://github.com/hustvl/Turbo-VAED.
[134] HumanOLAT: A Large-Scale Dataset for Full-Body Human Relighting and Novel-View Synthesis
Timo Teufel,Pulkit Gera,Xilong Zhou,Umar Iqbal,Pramod Rao,Jan Kautz,Vladislav Golyanik,Christian Theobalt
Main category: cs.CV
TL;DR: The HumanOLAT dataset is a new resource for advancing research on digital human relighting and rendering.
Details
Motivation: The lack of high-quality, publicly available datasets for full-body human captures has limited progress in simultaneous relighting and novel-view rendering.
Method: The authors introduced the HumanOLAT dataset, which consists of large-scale, multi-view One-Light-at-a-Time captures of full-body humans under various illuminations.
Result: The dataset provides HDR RGB frames under different lighting conditions, including white light, environment maps, color gradients, and fine-grained OLAT illuminations.
Conclusion: The HumanOLAT dataset is expected to facilitate future research in relighting and rendering techniques for digital human representations.
Abstract: Simultaneous relighting and novel-view rendering of digital human representations is an important yet challenging task with numerous applications. Progress in this area has been significantly limited due to the lack of publicly available, high-quality datasets, especially for full-body human captures. To address this critical gap, we introduce the HumanOLAT dataset, the first publicly accessible large-scale dataset of multi-view One-Light-at-a-Time (OLAT) captures of full-body humans. The dataset includes HDR RGB frames under various illuminations, such as white light, environment maps, color gradients and fine-grained OLAT illuminations. Our evaluations of state-of-the-art relighting and novel-view synthesis methods underscore both the dataset's value and the significant challenges still present in modeling complex human-centric appearance and lighting interactions. We believe HumanOLAT will significantly facilitate future research, enabling rigorous benchmarking and advancements in both general and human-specific relighting and rendering techniques.
eess.IV [Back]
[135] SharpXR: Structure-Aware Denoising for Pediatric Chest X-Rays
Ilerioluwakiiye Abolade,Emmanuel Idoko,Solomon Odelola,Promise Omoigui,Adetola Adebanwo,Aondana Iorumbur,Udunna Anazodo,Alessandro Crimi,Raymond Confidence
Main category: eess.IV
TL;DR: SharpXR is a deep learning model that effectively denoises low-dose pediatric chest X-rays while preserving critical diagnostic features, enhancing diagnostic accuracy and suitability for low-resource settings.