Table of Contents
cs.CL
[1] PEACH: A sentence-aligned Parallel English-Arabic Corpus for Healthcare
Rania Al-Sabbagh
Main category: cs.CL
TL;DR: This paper presents PEACH, a manually aligned parallel English-Arabic healthcare corpus with 51,671 sentences, designed to support research in translation, linguistics, and healthcare-related natural language processing.
Details
Motivation: The motivation for creating PEACH is to provide a reliable and manually aligned gold-standard resource for contrastive linguistics, translation studies, and natural language processing in the healthcare domain. Method: The paper introduces PEACH, a sentence-aligned parallel corpus containing 51,671 sentences with healthcare-related content, including patient information leaflets and educational materials. Result: PEACH consists of 51,671 parallel sentences, with approximately 590,517 English and 567,707 Arabic word tokens, and an average sentence length between 9.52 and 11.83 words. Conclusion: PEACH is a publicly accessible, manually aligned parallel English-Arabic corpus of healthcare texts that serves as a gold-standard resource for various research and educational purposes. Abstract: This paper introduces PEACH, a sentence-aligned parallel English-Arabic corpus of healthcare texts encompassing patient information leaflets and educational materials. The corpus contains 51,671 parallel sentences, totaling approximately 590,517 English and 567,707 Arabic word tokens. Sentence lengths vary between 9.52 and 11.83 words on average. As a manually aligned corpus, PEACH is a gold-standard corpus, aiding researchers in contrastive linguistics, translation studies, and natural language processing. It can be used to derive bilingual lexicons, adapt large language models for domain-specific machine translation, evaluate user perceptions of machine translation in healthcare, assess patient information leaflets and educational materials' readability and lay-friendliness, and as an educational resource in translation studies. PEACH is publicly accessible.
[2] Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation
Chi Zhang,Changjia Zhu,Junjie Xiong,Xiaoran Xu,Lingyao Li,Yao Liu,Zhuo Lu
Main category: cs.CL
TL;DR: The paper reviews recent studies on the risks and benefits of large language models (LLMs), proposes a taxonomy of harms and defenses, and assesses mitigation efforts to guide the development of safe and ethical language technologies.
Details
Motivation: The motivation of the paper is to address the sociotechnical challenge posed by the dual role of LLMs as both powerful tools and potential sources of harmful language. Method: The paper systematically reviews recent studies on LLM-related harms and defenses, including unintentional toxicity, adversarial jailbreaking attacks, and content moderation techniques. Result: The paper proposes a unified taxonomy of LLM-related harms and defenses, analyzes emerging jailbreak strategies, and assesses mitigation efforts such as RLHF, prompt engineering, and safety alignment. Conclusion: The paper concludes that while LLMs offer significant benefits, they also pose serious risks, and there is a need for robust and ethically aligned language technologies. Abstract: Large Language Models (LLMs) have revolutionized content creation across digital platforms, offering unprecedented capabilities in natural language generation and understanding. These models enable beneficial applications such as content generation, question and answering (Q&A), programming, and code reasoning. Meanwhile, they also pose serious risks by inadvertently or intentionally producing toxic, offensive, or biased content. This dual role of LLMs, both as powerful tools for solving real-world problems and as potential sources of harmful language, presents a pressing sociotechnical challenge. In this survey, we systematically review recent studies spanning unintentional toxicity, adversarial jailbreaking attacks, and content moderation techniques. We propose a unified taxonomy of LLM-related harms and defenses, analyze emerging multimodal and LLM-assisted jailbreak strategies, and assess mitigation efforts, including reinforcement learning with human feedback (RLHF), prompt engineering, and safety alignment. 
Our synthesis highlights the evolving landscape of LLM safety, identifies limitations in current evaluation methodologies, and outlines future research directions to guide the development of robust and ethically aligned language technologies.
[3] FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification
Xiangyan Chen,Yufeng Li,Yujian Gan,Arkaitz Zubiaga,Matthew Purver
Main category: cs.CL
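TL;DR: This paper proposes the FineDialFact benchmark for finer-grained verification of the factuality of content generated by dialogue systems; even with CoT reasoning, the task remains challenging.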
Details
Motivation: Existing hallucination detection methods for dialogue systems assign a single factuality label, which is too coarse to handle responses that mix accurate, inaccurate, and unverifiable facts. Method: The authors construct FineDialFact, a fine-grained dialogue fact verification benchmark built on publicly available dialogue datasets, and evaluate it with a range of baseline methods. Result: Methods that incorporate Chain-of-Thought (CoT) reasoning improve performance, but the best F1 score on the HybriDialogue dataset is only 0.75. Conclusion: FineDialFact is a challenging new benchmark for dialogue fact verification; even with methods such as CoT reasoning, there is still room for improvement. Abstract: Large Language Models (LLMs) are known to produce hallucinations - factually incorrect or fabricated information - which poses significant challenges for many Natural Language Processing (NLP) applications, such as dialogue systems. As a result, detecting hallucinations has become a critical area of research. Current approaches to hallucination detection in dialogue systems primarily focus on verifying the factual consistency of generated responses. However, these responses often contain a mix of accurate, inaccurate or unverifiable facts, making one factual label overly simplistic and coarse-grained. In this paper, we introduce a benchmark, FineDialFact, for fine-grained dialogue fact verification, which involves verifying atomic facts extracted from dialogue responses. To support this, we construct a dataset based on publicly available dialogue datasets and evaluate it using various baseline methods. Experimental results demonstrate that methods incorporating Chain-of-Thought (CoT) reasoning can enhance performance in dialogue fact verification. Despite this, the best F1-score achieved on the HybriDialogue, an open-domain dialogue dataset, is only 0.75, indicating that the benchmark remains a challenging task for future research. Our dataset and code will be public on GitHub.
[4] Human-like fleeting memory improves language learning but impairs reading time prediction in transformer language models
Abishek Thamma,Micha Heilbron
Main category: cs.CL
TL;DR: Fleeting memory helps Transformer models learn language but hurts their prediction of human reading times.
Details
Motivation: Cognitive scientists have argued that memory limitations may help language learning, but the rise of Transformers challenges this view, so the actual effect of fleeting memory on language learning needs to be tested. Method: Transformer models are trained with and without fleeting memory on a developmentally realistic training set and compared on language modelling and syntactic evaluation. Result: Models with fleeting memory perform better on language modelling and syntactic evaluation but worse on surprisal-based prediction of human reading times. Conclusion: The results support a benefit of memory limitations for neural network language learning, but not for predicting behavior. Abstract: Human memory is fleeting. As words are processed, the exact wordforms that make up incoming sentences are rapidly lost. Cognitive scientists have long believed that this limitation of memory may, paradoxically, help in learning language - an idea supported by classic connectionist modelling work. The rise of Transformers appears to challenge this idea, as these models can learn language effectively, despite lacking memory limitations or other architectural recency biases. Here, we investigate the hypothesized benefit of fleeting memory for language learning in tightly controlled experiments on transformer language models. Training transformers with and without fleeting memory on a developmentally realistic training set, we find that fleeting memory consistently improves language learning (as quantified by both overall language modelling performance and targeted syntactic evaluation) but, unexpectedly, impairs surprisal-based prediction of human reading times. Interestingly, follow up analyses revealed that this discrepancy - better language modeling, yet worse reading time prediction - could not be accounted for by prior explanations of why better language models sometimes fit human reading time worse. Together, these results support a benefit of memory limitations on neural network language learning - but not on predicting behavior.
[5] "Mirror" Language AI Models of Depression are Criterion-Contaminated
Tong Li,Rasiq Hussain,Mehak Gupta,Joshua R. Oltmanns
Main category: cs.CL
TL;DR: Mirror models for predicting depression have inflated effect sizes due to criterion contamination, whereas Non-Mirror models provide more generalizable and reliable results.
Details
Motivation: To address the issue of criterion contamination in AI models predicting depression, which leads to inflated effect sizes and reduced model generalizability. Method: The study compared Mirror models, which use structured diagnostic interview data, with Non-Mirror models, which use life history interview data, to predict depression scores. The models used were GPT-4, GPT-4o, and LLaMA3-70B. Result: Mirror models showed high effect sizes (e.g., R2 = .80), but similar correlations with self-reported symptoms as Non-Mirror models (r = ~.54), indicating bias. Non-Mirror models demonstrated smaller but relatively large effect sizes (e.g., R2 = .27). Conclusion: Mirror models of depression show inflated effect sizes and less generalizability due to criterion contamination, while Non-Mirror models offer more reliable and interpretable features for real-world psychological assessments. Abstract: A growing number of studies show near-perfect LLM language-based prediction of depression assessment scores (up to R2 of .70). However, many develop these models directly from language responses to depression assessments. These "Mirror models" suffer from "criterion contamination", which arises when a predicted score depends in part on the predictors themselves. This causes artificial effect size inflation which reduces model generalizability. The present study compares the performance of Mirror models versus "Non-Mirror models", which are developed from language that does not mirror the assessment they are developed to predict. N = 110 research participants completed two different interviews: structured diagnostic and life history interviews. GPT-4, GPT-4o and LLaMA3-70B were then prompted to predict structured diagnostic interview depression scores from the two transcripts separately. Mirror models (using structured diagnostic data) showed very large effect sizes (e.g., R2 = .80). 
As expected, Non-Mirror models (using life history data) demonstrated smaller effect sizes, but were relatively large (e.g., R2 = .27). When Mirror and Non-Mirror model-predicted structured interview depression scores were correlated with self-reported depression symptoms, Mirror and Non-Mirror performed the same (e.g., r = ~.54), indicating that Mirror models contain bias perhaps due to criterion contamination. Topic modeling identified clusters across Mirror and Non-Mirror models, as well as between true-positive and false-positive predictions. In this head-to-head comparison study, Mirror language AI models of depression showed artificially inflated effect sizes and less generalizability. As language AI models for depression continue to evolve, incorporating Non-Mirror models may identify interpretable, and generalizable semantic features that have unique utility in real-world psychological assessment.
[6] Discovering Properties of Inflectional Morphology in Neural Emergent Communication
Miles Gilberti,Shane Storks,Huteng Dai
Main category: cs.CL
TL;DR: This paper explores how emergent communication using deep neural networks can simulate natural language features like concatenative and fusional morphology, showing that phonological constraints encourage these naturalistic patterns in artificial communication systems.
Details
Motivation: The research aims to expand the focus of emergent communication studies beyond subfield-specific goals and metrics that prioritize one-to-one representation and syntactic composition, seeking to better understand the nature of human language through deep neural network-based agents. Method: The authors reinterpret a common EmCom setting, the attribute-value reconstruction game, by imposing a small-vocabulary constraint to simulate double articulation and formulating a novel setting analogous to naturalistic inflectional morphology. They develop new metrics and explore variations motivated by real properties of inflectional morphology, such as concatenativity and fusionality. Result: The experiments reveal that simulated phonological constraints promote concatenative morphology, and emergent languages demonstrate a tendency to fuse grammatical attributes, aligning with patterns observed in natural languages. Conclusion: The study concludes that simulated phonological constraints encourage concatenative morphology, and emergent languages tend to fuse grammatical attributes, replicating natural languages' tendencies. Abstract: Emergent communication (EmCom) with deep neural network-based agents promises to yield insights into the nature of human language, but remains focused primarily on a few subfield-specific goals and metrics that prioritize communication schemes which represent attributes with unique characters one-to-one and compose them syntactically. We thus reinterpret a common EmCom setting, the attribute-value reconstruction game, by imposing a small-vocabulary constraint to simulate double articulation, and formulating a novel setting analogous to naturalistic inflectional morphology (enabling meaningful comparison to natural language communication schemes). We develop new metrics and explore variations of this game motivated by real properties of inflectional morphology: concatenativity and fusionality. 
Through our experiments, we discover that simulated phonological constraints encourage concatenative morphology, and emergent languages replicate the tendency of natural languages to fuse grammatical attributes.
[7] Do Machines Think Emotionally? Cognitive Appraisal Analysis of Large Language Models
Sree Bhattacharyya,Lucas Craig,Tharun Dilliraj,Jia Li,James Z. Wang
Main category: cs.CL
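TL;DR: This paper introduces CoRE, a large-scale benchmark for evaluating the cognitive structures that large language models (LLMs) use in emotional reasoning, aiming to move beyond surface-level emotion tasks and examine how LLMs reason about emotions through cognitive dimensions.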
Details
Motivation: Affective computing is an important field for AI development, but most studies approach emotion tasks in a supervised manner and are limited to standard, superficial emotion-related tasks. The paper aims to probe the internal cognitive structures that LLMs use in emotional reasoning. Method: Drawing on cognitive appraisal theory, the authors introduce the large-scale CoRE benchmark and study, through extensive evaluation experiments and analysis, the internal cognitive structures that LLMs use in emotional reasoning. Result: The results reveal diverse reasoning patterns across LLMs and identify which cognitive dimensions matter for characterizing specific emotions. Conclusion: Using the proposed CoRE benchmark, the paper characterizes the internal cognitive structures of LLMs in emotional reasoning and reveals how different models tend to rely on specific cognitive appraisal dimensions. Abstract: Affective Computing has been established as a crucial field of inquiry to advance the holistic development of Artificial Intelligence (AI) systems. Foundation models -- especially Large Language Models (LLMs) -- have been evaluated, trained, or instruction-tuned in several past works, to become better predictors or generators of emotion. Most of these studies, however, approach emotion-related tasks in a supervised manner, assessing or training the capabilities of LLMs using discrete emotion labels associated with stimuli (e.g., text, images, video, audio). Evaluation studies, in particular, have often been limited to standard and superficial emotion-related tasks, such as the recognition of evoked or expressed emotions. In this paper, we move beyond surface-level emotion tasks to investigate how LLMs reason about emotions through cognitive dimensions. Drawing from cognitive appraisal theory, we examine whether LLMs produce coherent and plausible cognitive reasoning when reasoning about emotionally charged stimuli. We introduce a large-scale benchmark on Cognitive Reasoning for Emotions - CoRE - to evaluate internal cognitive structures implicitly used by LLMs for emotional reasoning. Through a plethora of evaluation experiments and analysis, we seek to answer: (a) Are models more likely to implicitly rely on specific cognitive appraisal dimensions?, (b) What cognitive dimensions are important for characterizing specific emotions?, and, (c) Can the internal representations of different emotion categories in LLMs be interpreted through cognitive appraisal dimensions? Our results and analyses reveal diverse reasoning patterns across different LLMs.
Our benchmark and code will be made publicly available.
[8] Spectrum Projection Score: Aligning Retrieved Summaries with Reader Models in Retrieval-Augmented Generation
Zhanghao Hu,Qinglin Zhu,Siya Qi,Yulan He,Hanqi Yan,Lin Gui
Main category: cs.CL
TL;DR: This paper introduces Spectrum Projection Score (SPS) and xCompress framework to better evaluate and enhance retrieval-augmented generation by isolating retrieval contributions and improving performance across QA tasks.
Details
Motivation: Prior work often evaluates retrieval-augmented generation (RAG) holistically, making it difficult to isolate the true contribution of retrieval. This work aims to address this issue by introducing a new metric and framework for better evaluation and performance. Method: Spectrum Projection Score (SPS), a lightweight, supervision-free metric, was introduced to gauge semantic alignment. xCompress, an inference time controller framework, dynamically samples, ranks, and compresses retrieval summary candidates. Result: Extensive experiments on five QA benchmarks with four open-source LLMs show that SPS enhances performance across a range of tasks. Conclusion: SPS provides a principled perspective on the interaction between retrieval and generation, enhancing performance across tasks. Abstract: Large Language Models (LLMs) have shown improved generation performance through retrieval-augmented generation (RAG) following the retriever-reader paradigm, which supplements model inputs with externally retrieved knowledge. However, prior work often evaluates RAG holistically, assessing the retriever and reader jointly, making it difficult to isolate the true contribution of retrieval, particularly given the prompt sensitivity of LLMs used as readers. We introduce Spectrum Projection Score (SPS), a lightweight, supervision-free metric that allows the reader to gauge the semantic alignment of a retrieved summary with its hidden representation by comparing the area formed by generated tokens from the summary, and the principal directions of subspace in the reader and to measure the relevance. Building on SPS we present xCompress, an inference time controller framework that dynamically samples, ranks, and compresses retrieval summary candidates. 
Extensive experiments on five QA benchmarks with four open source LLMs show that SPS not only enhances performance across a range of tasks but also provides a principled perspective on the interaction between retrieval and generation.
[9] Prosocial Behavior Detection in Player Game Chat: From Aligning Human-AI Definitions to Efficient Annotation at Scale
Rafal Kocielnik,Min Kim,Penphob Boonyarungsrit,Fereshteh Soltani,Deshawn Sambrano,Animashree Anandkumar,R. Michael Alvarez
Main category: cs.CL
TL;DR: This study develops a three-stage AI pipeline for efficient, low-cost detection of prosociality in text, offering a new solution for responsible AI tasks.
Details
Motivation: Detecting prosociality in text (communication intended to affirm, support, or improve others' behavior) is a novel and increasingly important challenge for trust and safety systems. Unlike toxic content detection, prosociality lacks well-established definitions and labeled data, requiring new approaches to annotation and deployment. Method: First, a small set of human-labeled examples is used to identify the best LLM-based labeling strategy. Second, a human-AI refinement loop is introduced, in which high-disagreement cases between GPT-4 and humans are reviewed to iteratively clarify and expand the task definition. Finally, GPT-4 is used to synthesize 10k high-quality labels, and a two-stage inference system is trained in which a lightweight classifier handles high-confidence predictions while the ambiguous instances, roughly 35%, are escalated to GPT-4o. Result: The approach achieves high precision (about 0.90) while reducing inference costs by about 70%. Through targeted human-AI interaction, careful task formulation, and deployment-aware architecture design, it shows how to build scalable solutions for emerging responsible AI tasks. Conclusion: The paper presents a practical three-stage pipeline that enables scalable, high-precision prosocial content classification while minimizing human labeling effort and inference cost. Abstract: Detecting prosociality in text--communication intended to affirm, support, or improve others' behavior--is a novel and increasingly important challenge for trust and safety systems. Unlike toxic content detection, prosociality lacks well-established definitions and labeled data, requiring new approaches to both annotation and deployment. We present a practical, three-stage pipeline that enables scalable, high-precision prosocial content classification while minimizing human labeling effort and inference costs. First, we identify the best LLM-based labeling strategy using a small seed set of human-labeled examples. We then introduce a human-AI refinement loop, where annotators review high-disagreement cases between GPT-4 and humans to iteratively clarify and expand the task definition--a critical step for emerging annotation tasks like prosociality. This process results in improved label quality and definition alignment. Finally, we synthesize 10k high-quality labels using GPT-4 and train a two-stage inference system: a lightweight classifier handles high-confidence predictions, while only ~35% of ambiguous instances are escalated to GPT-4o. This architecture reduces inference costs by ~70% while achieving high precision (~0.90).
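The two-stage inference system described above can be sketched as a confidence-threshold cascade. This is a minimal illustration under stated assumptions, not the authors' implementation: the thresholds, the function name `cascade_predict`, and the toy scorers `toy_cheap` and `toy_expensive` are all hypothetical stand-ins for the lightweight classifier and the GPT-4o call.

```python
from typing import Callable, List, Tuple


def cascade_predict(
    texts: List[str],
    cheap_model: Callable[[str], float],
    expensive_model: Callable[[str], bool],
    low: float = 0.2,
    high: float = 0.8,
) -> Tuple[List[bool], float]:
    """Trust the cheap classifier when it is confident, otherwise escalate.

    Returns (labels, fraction of texts that were escalated).
    The thresholds `low` and `high` are illustrative assumptions.
    """
    labels: List[bool] = []
    escalated = 0
    for text in texts:
        p = cheap_model(text)  # estimated probability that the text is prosocial
        if p >= high:
            labels.append(True)
        elif p <= low:
            labels.append(False)
        else:
            # Ambiguous case: pay for the expensive model.
            escalated += 1
            labels.append(expensive_model(text))
    rate = escalated / len(texts) if texts else 0.0
    return labels, rate


# Hypothetical toy stand-ins: a keyword scorer and a stub for the LLM call.
def toy_cheap(text: str) -> float:
    t = text.lower()
    if "thanks" in t or "nice" in t:
        return 0.95
    if "noob" in t:
        return 0.05
    return 0.5


def toy_expensive(text: str) -> bool:
    return "help" in text.lower()


chat = ["nice shot, thanks!", "uninstall, noob", "I can help you with that quest", "brb"]
labels, rate = cascade_predict(chat, toy_cheap, toy_expensive)
print(labels, rate)
```

In a setup like this, the escalation rate is what drives cost: the narrower the band between `low` and `high`, the fewer messages reach the expensive model, at the risk of more errors from the cheap one.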
Our pipeline demonstrates how targeted human-AI interaction, careful task formulation, and deployment-aware architecture design can unlock scalable solutions for novel responsible AI tasks.
[10] Adversarial Topic-aware Prompt-tuning for Cross-topic Automated Essay Scoring
Chunyun Zhang,Hongyan Zhao,Chaoran Cui,Qilong Song,Zhiqing Lu,Shuai Gong,Kailin Liu
Main category: cs.CL
TL;DR: This paper introduces ATOP, a novel method for cross-topic automated essay scoring that improves performance by jointly learning topic-shared and topic-specific features using adversarial training and a neighbor-based classifier.
Details
Motivation: Existing methods for cross-topic AES primarily focus on topic-shared features, neglecting topic-specific features which are crucial for evaluating traits like topic adherence. Method: ATOP uses adversarial training and a neighbor-based classifier to jointly learn topic-shared and topic-specific features from pre-trained language models. Result: Experiments on the ASAP++ dataset show that ATOP outperforms state-of-the-art methods in both holistic and multi-trait essay scoring. Conclusion: The proposed ATOP method significantly outperforms existing state-of-the-art methods in cross-topic automated essay scoring by effectively capturing both topic-shared and topic-specific features. Abstract: Cross-topic automated essay scoring (AES) aims to develop a transferable model capable of effectively evaluating essays on a target topic. A significant challenge in this domain arises from the inherent discrepancies between topics. While existing methods predominantly focus on extracting topic-shared features through distribution alignment of source and target topics, they often neglect topic-specific features, limiting their ability to assess critical traits such as topic adherence. To address this limitation, we propose an Adversarial TOpic-aware Prompt-tuning (ATOP), a novel method that jointly learns topic-shared and topic-specific features to improve cross-topic AES. ATOP achieves this by optimizing a learnable topic-aware prompt--comprising both shared and specific components--to elicit relevant knowledge from pre-trained language models (PLMs). To enhance the robustness of topic-shared prompt learning and mitigate feature scale sensitivity introduced by topic alignment, we incorporate adversarial training within a unified regression and classification framework. In addition, we employ a neighbor-based classifier to model the local structure of essay representations and generate pseudo-labels for target-topic essays. 
These pseudo-labels are then used to guide the supervised learning of topic-specific prompts tailored to the target topic. Extensive experiments on the publicly available ASAP++ dataset demonstrate that ATOP significantly outperforms existing state-of-the-art methods in both holistic and multi-trait essay scoring. The implementation of our method is publicly available at: https://anonymous.4open.science/r/ATOP-A271.
[11] Crisp Attention: Regularizing Transformers via Structured Sparsity
Sagar Gandhi,Vishal Gandhi
Main category: cs.CL
TL;DR: This paper demonstrates that applying attention sparsity in a DistilBERT model during fine-tuning can improve model accuracy, challenging the belief that sparsity reduces performance.
Details
Motivation: The quadratic computational cost of the self-attention mechanism poses a challenge in scaling Transformer models, and attention sparsity is often believed to reduce model accuracy. Method: Structured, post-hoc sparsity was introduced to the attention mechanism of a DistilBERT model during fine-tuning on the SST-2 sentiment analysis task. Result: The model with 80% attention sparsity achieved a validation accuracy of 91.59%, a 0.97% improvement over the dense baseline. Conclusion: Attention sparsity can not only improve computational efficiency but also enhance the generalization and performance of Transformer models. Abstract: The quadratic computational cost of the self-attention mechanism is a primary challenge in scaling Transformer models. While attention sparsity is widely studied as a technique to improve computational efficiency, it is almost universally assumed to come at the cost of model accuracy. In this paper, we report a surprising counter-example to this common wisdom. By introducing structured, post-hoc sparsity to the attention mechanism of a DistilBERT model during fine-tuning on the SST-2 sentiment analysis task, we find that model accuracy improves significantly. Our model with 80% attention sparsity achieves a validation accuracy of 91.59%, a 0.97% absolute improvement over the dense baseline. We hypothesize that this phenomenon is due to sparsity acting as a powerful implicit regularizer, preventing the model from overfitting by forcing it to make predictions with a more constrained and robust set of features. Our work recasts attention sparsity not just as a tool for computational efficiency, but as a potential method for improving the generalization and performance of Transformer models.
[12] Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
Yidong Wang,Xin Wang,Cunxiang Wang,Junfeng Fang,Qiufeng Wang,Jianing Chu,Xuran Meng,Shuxun Yang,Libo Qin,Yue Zhang,Wei Ye,Shikun Zhang
Main category: cs.CL
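TL;DR: Temporal Self-Rewarding Language Models coordinate outputs from different model generations to overcome the limitations of conventional Self-Rewarding, achieving better generation quality and generalization.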
Details
Motivation: Conventional Self-Rewarding methods improve chosen and rejected responses in lockstep during generation and evaluation, which shrinks the difference between contrasting samples and weakens preference learning. Method: A dual-phase framework is used: Anchored Rejection (fixing rejected responses using outputs from the past model) and Future-Guided Chosen (dynamically curating chosen samples using next-generation model predictions), trained with iterative DPO. Result: Experiments across several model families (Llama, Qwen, Mistral) and sizes (Llama3B/8B/70B) show that the method outperforms conventional Self-Rewarding; for example, Llama3.1-8B reaches a win rate of 29.44 on AlpacaEval 2.0, versus 19.69 for the baseline. Conclusion: Temporal Self-Rewarding Language Models propose a new framework that, by coordinating past, present, and future model generations, addresses the progressive narrowing of the representational difference between contrasting samples in the conventional Self-Rewarding paradigm. Abstract: Self-Rewarding Language Models propose an architecture in which the Large Language Model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improving its generative capabilities through iterative Direct Preference Optimization (DPO). However, our analysis reveals a critical limitation in existing Self-Rewarding paradigms: the synchronized improvement of chosen and rejected responses progressively narrows the representational difference between contrasting samples, undermining effective preference learning. We propose Temporal Self-Rewarding Language Models that strategically coordinate past, present, and future model generations to sustain learning signals. Our dual-phase framework introduces: (1) Anchored Rejection - fixing rejected responses using the past initial model's outputs and (2) Future-Guided Chosen - dynamically curating chosen samples using next-generation model predictions. Extensive experiments across three model families (Llama, Qwen, Mistral) and different model sizes (Llama3B/8B/70B) demonstrate significant improvements when trained with our method compared to Self-Rewarding using same computation resources. For example, Llama3.1-8B reaches a 29.44 win rate on AlpacaEval 2.0 with our method, outperforming the Self-Rewarding baseline (19.69) by 9.75.
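For readers unfamiliar with the objective being iterated here, the per-pair DPO loss can be written out in a few lines. This is a sketch of the standard DPO formulation only; it does not reproduce the paper's Anchored Rejection or Future-Guided Chosen steps, and the function name `dpo_loss`, the default `beta`, and the example log-probabilities are assumptions made for illustration.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard per-pair DPO loss: -log(sigmoid(beta * margin)).

    margin compares how much the policy has moved toward the chosen response
    versus the rejected one, relative to a frozen reference model.
    All inputs are sequence log-probabilities (illustrative values below).
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # Numerically stable softplus(-x) = -log(sigmoid(x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# No preference margin relative to the reference: loss sits at log(2).
flat = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy favors the chosen response more than the reference does: loss drops.
separated = dpo_loss(-1.0, -5.0, -2.0, -2.0)

print(flat, separated)
```

The first call shows the baseline value of the loss when the policy expresses no preference beyond the reference model; the second shows the loss falling once the chosen response is clearly favored.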
Notably, our method also demonstrates superior out-of-distribution generalization across mathematical reasoning (GSM8K), knowledge-based QA (ARC, TruthfulQA), and code generation (HumanEval) tasks, even though we do not specifically collect such training data.
[13] Efficient Knowledge Probing of Large Language Models by Adapting Pre-trained Embeddings
Kartik Sharma,Yiqiao Jin,Rakshit Trivedi,Srijan Kumar
Main category: cs.CL
TL;DR: This paper proposes PEEK, a method using adapted embeddings to efficiently estimate LLMs' knowledge, achieving high accuracy and offering insights into their representations.
Details
Motivation: Existing methods to probe LLM knowledge are computationally expensive; this work aims to develop a more efficient approach. Method: Proposed PEEK, using pre-trained embedding models as proxies for LLMs, adapting them with a linear decoder layer to predict LLM outputs. Result: Embeddings predicted LLM knowledge with up to 90% accuracy, showing sentence embeddings outperform graph embeddings. Conclusion: Knowledge-adapted embeddings can identify knowledge gaps in LLMs and provide insights into their internal inductive bias. Abstract: Large language models (LLMs) acquire knowledge across diverse domains such as science, history, and geography encountered during generative pre-training. However, due to their stochasticity, it is difficult to predict what LLMs have acquired. Prior work has developed different ways to probe this knowledge by investigating the hidden representations, crafting specific task prompts, curating representative samples, and estimating their uncertainty. However, these methods require making forward passes through the underlying model to probe the LLM's knowledge about a specific fact, making them computationally expensive and time-consuming. To bridge this gap, we propose PEEK, or Proxy Embeddings to Estimate Knowledge of LLMs, by leveraging the pre-trained embedding models that effectively encode factual knowledge as text or graphs as proxies for LLMs. First, we identify a training set of facts known by LLMs through various probing strategies and then adapt embedding models to predict the LLM outputs with a linear decoder layer. Comprehensive evaluation on 3 Wikipedia-derived datasets, 4 LLMs, and 7 embedding models shows that embeddings can predict LLM knowledge on a held-out set with up to 90% accuracy.
Furthermore, we find that sentence embedding models are more suitable than graph embeddings to predict LLM knowledge, shedding light on the underlying representation of the factual landscape. Thus, we believe that knowledge-adapted embeddings can be used to identify knowledge gaps in LLMs at scale and can provide deeper insights into LLMs' internal inductive bias. The code and data are made available at https://github.com/claws-lab/peek.
[14] EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation
Xinda Wang,Zhengxu Hou,Yangshijie Zhang,Bingren Yan,Zhibo Yang,Xingsheng Zhang,Luxi Xing,Qiang Zhou,Chen Zhang
Main category: cs.CL
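TL;DR: This paper proposes EvolvR, a new self-evolving reasoning framework that generates and filters Chain-of-Thought data via a multi-persona strategy, improving the quality of story evaluation and achieving state-of-the-art performance on multiple benchmarks.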
Details
Motivation: Existing story evaluation methods adapt poorly when built on closed-source models and lack rigorous reasoning when built on open-source models, so a new approach is needed to resolve this dilemma. Method: The study uses a multi-persona strategy to self-synthesize score-aligned Chain-of-Thought (CoT) data, ensures data quality through a multi-agent self-filtering process, and then deploys the evaluator trained on the refined data as a reward model for story generation. Result: Experiments show that EvolvR achieves state-of-the-art performance on three evaluation benchmarks (StoryER, HANNA, and OpenMEVA) and significantly improves the quality of generated stories. Conclusion: The paper proposes EvolvR, a self-evolving reasoning framework grounded in pairwise comparison for improving story evaluation, and demonstrates its state-of-the-art performance on multiple benchmarks and its benefits when applied to story generation. Abstract: Although the effectiveness of Large Language Models (LLMs) as judges (LLM-as-a-judge) has been validated, their performance remains limited in open-ended tasks, particularly in story evaluation. Accurate story evaluation is crucial not only for assisting human quality judgment but also for providing key signals to guide story generation. However, existing methods face a dilemma: prompt engineering for closed-source models suffers from poor adaptability, while fine-tuning approaches for open-source models lack the rigorous reasoning capabilities essential for story evaluation. To address this, we propose the Self-Evolving Pairwise Reasoning (EvolvR) framework. Grounded in pairwise comparison, the framework first self-synthesizes score-aligned Chain-of-Thought (CoT) data via a multi-persona strategy. To ensure data quality, these raw CoTs undergo a self-filtering process, utilizing multi-agents to guarantee their logical rigor and robustness. Finally, the evaluator trained on the refined data is deployed as a reward model to guide the story generation task. Experimental results demonstrate that our framework achieves state-of-the-art (SOTA) performance on three evaluation benchmarks including StoryER, HANNA and OpenMEVA. Furthermore, when served as a reward model, it significantly enhances the quality of generated stories, thereby fully validating the superiority of our self-evolving approach.
[15] ConlangCrafter: Constructing Languages with a Multi-Hop LLM Pipeline
Morris Alper,Moran Yanuka,Raja Giryes,Gašper Beguš
Main category: cs.CL
TL;DR: ConlangCrafter is a multi-hop pipeline that uses modern LLMs as computational creativity aids for end-to-end constructed language (conlang) creation.
Details
Motivation: Constructed languages play diverse roles in art, philosophy, and international communication, while large-scale foundation models have revolutionized creative generation. Method: The paper introduces ConlangCrafter, a multi-hop pipeline that decomposes language design into modular stages and leverages LLMs' meta-linguistic reasoning capabilities. Result: ConlangCrafter is evaluated on metrics measuring coherence and typological diversity, showing that it can produce coherent and varied conlangs. Conclusion: ConlangCrafter can generate coherent and varied conlangs without human linguistic expertise. Abstract: Constructed languages (conlangs) such as Esperanto and Quenya have played diverse roles in art, philosophy, and international communication. Meanwhile, large-scale foundation models have revolutionized creative generation in text, images, and beyond. In this work, we leverage modern LLMs as computational creativity aids for end-to-end conlang creation. We introduce ConlangCrafter, a multi-hop pipeline that decomposes language design into modular stages -- phonology, morphology, syntax, lexicon generation, and translation. At each stage, our method leverages LLMs' meta-linguistic reasoning capabilities, injecting randomness to encourage diversity and leveraging self-refinement feedback to encourage consistency in the emerging language description. We evaluate ConlangCrafter on metrics measuring coherence and typological diversity, demonstrating its ability to produce coherent and varied conlangs without human linguistic expertise.
[16] Few-Shot Prompting for Extractive Quranic QA with Instruction-Tuned LLMs
Mohamed Basem,Islam Oshallah,Ali Hamdi,Ammar Mohammed
Main category: cs.CL
TL;DR: This paper introduces an effective method for Extractive Question Answering on the Quran using few-shot prompting with instruction-tuned large language models and a specialized Arabic prompt framework, achieving better performance than traditional models.
Details
Motivation: The motivation stems from the challenges in Extractive QA on the Quran, such as its complex language, unique terminology, and deep semantic meaning, which traditional models struggle to handle effectively. Method: The paper proposes two approaches for Extractive QA on the Quran. The second, more effective approach uses few-shot prompting with instruction-tuned large language models (e.g., Gemini and DeepSeek). A specialized Arabic prompt framework is developed for span extraction, and a post-processing system incorporating subword alignment, overlap suppression, and semantic filtering is implemented to enhance precision and reduce hallucinations. Result: Evaluations demonstrate that large language models with Arabic instructions outperform traditional fine-tuned models, with the best configuration achieving a pAP10 score of 0.637, confirming the effectiveness of the proposed approach. Conclusion: The paper concludes that prompt-based instruction tuning is effective for low-resource, semantically rich QA tasks, as evidenced by the improved performance of large language models with Arabic instructions over traditional fine-tuned models. Abstract: This paper presents two effective approaches for Extractive Question Answering (QA) on the Quran. It addresses challenges related to complex language, unique terminology, and deep meaning in the text. The second uses few-shot prompting with instruction-tuned large language models such as Gemini and DeepSeek. A specialized Arabic prompt framework is developed for span extraction. A strong post-processing system integrates subword alignment, overlap suppression, and semantic filtering. This improves precision and reduces hallucinations. Evaluations show that large language models with Arabic instructions outperform traditional fine-tuned models. The best configuration achieves a pAP10 score of 0.637. 
The results confirm that prompt-based instruction tuning is effective for low-resource, semantically rich QA tasks.
[17] You Don't Need Pre-built Graphs for RAG: Retrieval Augmented Generation with Adaptive Reasoning Structures
Shengyuan Chen,Chuang Zhou,Zheng Yuan,Qinggang Zhang,Zeyang Cui,Hao Chen,Yilin Xiao,Jiannong Cao,Xiao Huang
Main category: cs.CL
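TL;DR: LogicRAG is a retrieval-augmented generation framework that needs no pre-built graph and improves retrieval efficiency and performance by dynamically extracting reasoning structures.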
Details
Motivation: Existing graph-based retrieval-augmented generation methods rely on a costly graph construction process, and a pre-built graph may not match the logical structure a given query actually requires, leading to poor retrieval. A dynamic method that needs no pre-built graph is therefore needed to improve the performance and efficiency of retrieval-augmented generation. Method: LogicRAG decomposes the input query into subproblems, constructs a directed acyclic graph (DAG) to model their logical dependencies, linearizes the graph with a topological sort to ensure logical consistency, applies graph pruning to reduce redundant retrieval, and applies context pruning to filter irrelevant information. Result: Experiments show that LogicRAG achieves better performance and higher efficiency than state-of-the-art baselines on retrieval-augmented generation tasks. Conclusion: By dynamically constructing reasoning structures, LogicRAG significantly improves the performance and efficiency of retrieval-augmented generation while reducing the dependence on pre-built graphs and the associated cost. Abstract: Large language models (LLMs) often suffer from hallucination, generating factually incorrect statements when handling questions beyond their knowledge and perception. Retrieval-augmented generation (RAG) addresses this by retrieving query-relevant contexts from knowledge bases to support LLM reasoning. Recent advances leverage pre-constructed graphs to capture the relational connections among distributed documents, showing remarkable performance in complex tasks. However, existing Graph-based RAG (GraphRAG) methods rely on a costly process to transform the corpus into a graph, introducing overwhelming token cost and update latency. Moreover, real-world queries vary in type and complexity, requiring different logic structures for accurate reasoning. The pre-built graph may not align with these required structures, resulting in ineffective knowledge retrieval. To this end, we propose a Logic-aware Retrieval-Augmented Generation framework (LogicRAG) that dynamically extracts reasoning structures at inference time to guide adaptive retrieval without any pre-built graph. LogicRAG begins by decomposing the input query into a set of subproblems and constructing a directed acyclic graph (DAG) to model the logical dependencies among them. To support coherent multi-step reasoning, LogicRAG then linearizes the graph using topological sort, so that subproblems can be addressed in a logically consistent order.
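The decompose-then-order step described above can be illustrated with a minimal sketch. It is not LogicRAG's implementation: the subproblem names and the `dependencies` mapping are hypothetical, and Python's standard `graphlib.TopologicalSorter` stands in for whatever sorting routine the authors use.

```python
from graphlib import TopologicalSorter

# Hypothetical subproblems for a multi-hop query (illustration only).
# Each key maps to the set of subproblems that must be answered before it.
dependencies = {
    "find_director": set(),
    "find_birthplace": {"find_director"},
    "find_population": {"find_birthplace"},
    "compose_answer": {"find_director", "find_population"},
}


def linearize(deps):
    """Return the subproblems in an order that respects every dependency.

    TopologicalSorter raises graphlib.CycleError if the graph is not a DAG.
    """
    return list(TopologicalSorter(deps).static_order())


order = linearize(dependencies)
print(order)
```

Processing subproblems in this order means that every answer a later subproblem depends on is already available when its retrieval step runs.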
In addition, LogicRAG applies graph pruning to reduce redundant retrieval and uses context pruning to filter irrelevant context, significantly reducing the overall token cost. Extensive experiments demonstrate that LogicRAG achieves both superior performance and efficiency compared to state-of-the-art baselines.
[18] AURA: Affordance-Understanding and Risk-aware Alignment Technique for Large Language Models
Sayantan Adak,Pratyush Chatterjee,Somnath Banerjee,Rima Hazra,Somak Aditya,Animesh Mukherjee
Main category: cs.CL
TL;DR: AURA is introduced as a multi-layered framework using Process Reward Models to enhance the safety and logical coherence of LLM outputs, surpassing existing methods.
Details
Motivation: Current LLM safety measures lack the granularity and proactivity to detect and address subtle reasoning steps that may lead to harmful outputs. Method: The study proposes AURA, which integrates Process Reward Models (PRMs), introspective self-critique, and adaptive safety-aware decoding to provide step-level evaluations for logical coherence and safety. Result: AURA significantly outperforms traditional safety approaches, offering improved logical integrity and affordance-sensitive safety for model outputs. Conclusion: This research introduces AURA, a novel framework that considerably enhances the safety and logical integrity of LLM outputs, marking a significant step toward responsible AI. Abstract: Present-day LLMs face the challenge of managing affordance-based safety risks: situations where outputs inadvertently facilitate harmful actions due to overlooked logical implications. Traditional safety solutions, such as scalar outcome-based reward models, parameter tuning, or heuristic decoding strategies, lack the granularity and proactive nature needed to reliably detect and intervene during subtle yet crucial reasoning steps. Addressing this fundamental gap, we introduce AURA, an innovative, multi-layered framework centered around Process Reward Models (PRMs), providing comprehensive, step-level evaluations across logical coherence and safety-awareness. Our framework seamlessly combines introspective self-critique, fine-grained PRM assessments, and adaptive safety-aware decoding to dynamically and proactively guide models toward safer reasoning trajectories. Empirical evidence demonstrates that this approach surpasses existing methods, significantly improving the logical integrity and affordance-sensitive safety of model outputs.
This research represents a pivotal step toward safer, more responsible, and contextually aware AI, setting a new benchmark for alignment-sensitive applications.
[19] Less is More: Selective Reflection for Compatible and Efficient Knowledge Distillation in Large Language Models
Lingyuan Liu,Mengxiang Zhang
Main category: cs.CL
TL;DR: Selective Reflection Distillation (SRD) enhances knowledge distillation of large language models by curating high-quality training data and improving model compatibility, leading to better performance and reduced computational costs.
Details
Motivation: The authors aimed to address the limitations of existing white-box KD methods that neglect training data quality and student-model compatibility. Method: SRD uses reflections from student models to curate high-quality, compatible training data through automated ranking and employs a curriculum scheduling strategy to introduce this data incrementally. Result: Experiments showed that SRD improves distillation outcomes across various methods and model architectures, enhances sample efficiency, and reduces training runtime by up to 39%. Conclusion: Selective Reflection Distillation (SRD) effectively improves the performance and efficiency of knowledge distillation in large language models by focusing on data quality and compatibility. Abstract: Knowledge Distillation (KD) is a fundamental technique for compressing large language models (LLMs) into compact, efficient student models. However, existing white-box KD methods mainly focus on balancing ground truth and student-generated responses while overlooking two critical factors: training data quality and student-model compatibility. To address these limitations, we propose Selective Reflection Distillation (SRD), a novel data curation framework that leverages reflections from student models to systematically refine training data. SRD dynamically evaluates and selects prompt-response pairs by comparing ground truth data with student model outputs, selectively curating high-quality, student-compatible training instances through automated ranking based on difficulty. Furthermore, after selecting the training data, a curriculum scheduling strategy is employed to incrementally introduce these curated subsets into the distillation process at fixed intervals. As a plug-and-play enhancement, SRD consistently improves distillation outcomes across diverse white-box KD approaches and model architectures, as well as decreases computational cost significantly during KD training. 
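The selection and scheduling steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-sample difficulty scores, the `keep_ratio` cutoff, the easy-to-hard ordering, and the `interval` parameter are all hypothetical, since the abstract does not specify them.

```python
def curate(samples, keep_ratio=0.5):
    """Rank (prompt, difficulty) pairs and keep the lowest-difficulty fraction.

    Keeping the low-difficulty end is an assumption made for this sketch only.
    """
    ranked = sorted(samples, key=lambda s: s[1])
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]


def split_stages(curated, num_stages):
    """Split the ranked samples into contiguous, roughly equal subsets."""
    size = -(-len(curated) // num_stages)  # ceiling division
    return [curated[i:i + size] for i in range(0, len(curated), size)]


def active_pool(stages, step, interval):
    """Return every sample unlocked at `step`; one more stage opens each `interval` steps."""
    unlocked = min(len(stages), step // interval + 1)
    return [s for stage in stages[:unlocked] for s in stage]


# Hypothetical difficulty scores, e.g. derived from comparing student
# outputs against ground truth (the scoring itself is not modeled here).
samples = [("p1", 0.9), ("p2", 0.1), ("p3", 0.4),
           ("p4", 0.7), ("p5", 0.2), ("p6", 0.3)]
curated = curate(samples, keep_ratio=0.67)
stages = split_stages(curated, num_stages=2)
print(len(active_pool(stages, step=0, interval=100)))
print(len(active_pool(stages, step=100, interval=100)))
```

Because the pool only grows, a training loop can call `active_pool` at each step and sample from whatever has been unlocked so far, leaving the underlying distillation loss untouched.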
Experiments on a range of language model benchmarks demonstrate SRD's consistent improvements in distilled model performance, as well as a reduction in training runtime by up to 39%, under diverse KD methods and model families. Notably, SRD operates as a plug-and-play module, enhancing sample efficiency without modifying underlying KD algorithms. Our findings highlight that data quality and compatibility are pivotal to effective and efficient distillation of LLMs, and SRD provides a principled framework to achieve both. This work advances the understanding of data-centric factors in KD and offers practical insights for enhancing the capability and efficiency of compressed LLMs.
[20] Scaling Personality Control in LLMs with Big Five Scaler Prompts
Gunhee Cho,Yun-Gyung Cheong
Main category: cs.CL
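TL;DR: Big5-Scaler is a method for controlling the personality traits of large language models through natural language prompts, enabling efficient personalized dialogue generation without additional training.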
Details
Motivation: To make the personality traits of large language models controllable and adjustable, and to improve the personalization of dialogue generation. Method: Numeric personality trait values are embedded into natural language prompts to achieve fine-grained personality control over large language models. Result: Experiments show that Big5-Scaler induces consistent and distinguishable personality traits across models; prompt type and intensity affect performance, with concise prompts and lower trait intensities working better. Conclusion: Big5-Scaler is an effective method for personality control of large language models without additional training, suitable for building personality-aware dialogue agents. Abstract: We present Big5-Scaler, a prompt-based framework for conditioning large language models (LLMs) with controllable Big Five personality traits. By embedding numeric trait values into natural language prompts, our method enables fine-grained personality control without additional training. We evaluate Big5-Scaler across trait expression, dialogue generation, and human trait imitation tasks. Results show that it induces consistent and distinguishable personality traits across models, with performance varying by prompt type and scale. Our analysis highlights the effectiveness of concise prompts and lower trait intensities, providing an efficient approach for building personality-aware dialogue agents.
[21] Semantic and Structural Analysis of Implicit Biases in Large Language Models: An Interpretable Approach
Renhan Zhang,Lian Lian,Zhen Qi,Guiran Liu
Main category: cs.CL
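TL;DR: This paper proposes a new interpretable bias detection method for identifying implicit social biases in large language models and demonstrates strong detection performance and interpretability across multiple dimensions.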
Details
Motivation: Large language models may produce implicit biases during generation, so an interpretable bias detection method is needed to identify semantic tendencies that are not easily captured through explicit linguistic features. Method: The method combines nested semantic representation with a contextual contrast mechanism, extracts latent bias features from the vector space structure of model outputs, and uses attention weight perturbation to analyze the model's sensitivity to specific social attribute terms. Result: Experiments on the StereoSet dataset show strong bias detection performance across multiple stereotype dimensions, including gender, profession, religion, and race; the method accurately identifies bias differences between semantically similar texts while maintaining high semantic consistency and output stability. Conclusion: The paper proposes an interpretable bias detection method that effectively identifies implicit social biases in large language model outputs and reveals the model's internal bias association mechanisms, providing a transparent and reliable technical foundation for real-world applications that require highly trustworthy generated content. Abstract: This paper addresses the issue of implicit stereotypes that may arise during the generation process of large language models. It proposes an interpretable bias detection method aimed at identifying hidden social biases in model outputs, especially those semantic tendencies that are not easily captured through explicit linguistic features. The method combines nested semantic representation with a contextual contrast mechanism. It extracts latent bias features from the vector space structure of model outputs. Using attention weight perturbation, it analyzes the model's sensitivity to specific social attribute terms, thereby revealing the semantic pathways through which bias is formed. To validate the effectiveness of the method, this study uses the StereoSet dataset, which covers multiple stereotype dimensions including gender, profession, religion, and race. The evaluation focuses on several key metrics, such as bias detection accuracy, semantic consistency, and contextual sensitivity. Experimental results show that the proposed method achieves strong detection performance across various dimensions. It can accurately identify bias differences between semantically similar texts while maintaining high semantic alignment and output stability. The method also demonstrates high interpretability in its structural design. It helps uncover the internal bias association mechanisms within language models. This provides a more transparent and reliable technical foundation for bias detection.
The approach is suitable for real-world applications where high trustworthiness of generated content is required.
[22] One Size Does Not Fit All: A Distribution-Aware Sparsification for More Precise Model Merging
Yingfeng Luo,Dingyang Lin,Junxin Wang,Ziqiang Xu,Kaiyan Chang,Tong Zheng,Bei Li,Anxiang Ma,Tong Xiao,Zhengtao Yu,Jingbo Zhu
Main category: cs.CL
TL;DR: TADrop is an adaptive sparsification strategy that assigns tailored sparsity levels to each parameter tensor in model merging, improving performance by respecting the heterogeneity of model parameters.
Details
Motivation: The prevailing sparsification approaches in model merging use a uniform sparsity ratio, which overlooks the heterogeneity of model parameters and often leads to suboptimal results. Method: TADrop assigns a tailored sparsification level to each parameter tensor based on its distributional properties instead of applying a global ratio. Result: Extensive experiments show that TADrop consistently and significantly boosts the performance of foundational, classic, and SOTA merging methods across diverse tasks and models. Conclusion: TADrop provides a more effective way to mitigate parameter interference by tailoring sparsification to the model's structure, offering a new baseline for high-performance model merging. Abstract: Model merging has emerged as a compelling data-free paradigm for multi-task learning, enabling the fusion of multiple fine-tuned models into a single, powerful entity. A key technique in merging methods is sparsification, which prunes redundant parameters from task vectors to mitigate interference. However, prevailing approaches employ a ``one-size-fits-all'' strategy, applying a uniform sparsity ratio that overlooks the inherent structural and statistical heterogeneity of model parameters. This often leads to a suboptimal trade-off, where critical parameters are inadvertently pruned while less useful ones are retained. To address this limitation, we introduce \textbf{TADrop} (\textbf{T}ensor-wise \textbf{A}daptive \textbf{Drop}), an adaptive sparsification strategy that respects this heterogeneity. Instead of a global ratio, TADrop assigns a tailored sparsity level to each parameter tensor based on its distributional properties. The core intuition is that tensors with denser, more redundant distributions can be pruned aggressively, while sparser, more critical ones are preserved. As a simple and plug-and-play module, we validate TADrop by integrating it with foundational, classic, and SOTA merging methods. 
Extensive experiments across diverse tasks (vision, language, and multimodal) and models (ViT, BEiT) demonstrate that TADrop consistently and significantly boosts their performance. For instance, when enhancing a leading merging method, it achieves an average performance gain of 2.0% across 8 ViT-B/32 tasks. TADrop provides a more effective way to mitigate parameter interference by tailoring sparsification to the model's structure, offering a new baseline for high-performance model merging.
[23] UR$^2$: Unify RAG and Reasoning through Reinforcement Learning
Weitao Li,Boran Xiang,Xiaolong Wang,Zhinan Gou,Weizhi Ma,Yang Liu
Main category: cs.CL
TL;DR: This paper introduces UR2, a unified framework combining Retrieval-Augmented Generation and Reinforcement Learning to enhance adaptability and performance across diverse tasks, outperforming current methods and matching advanced models like GPT-4o-mini.
Details
Motivation: Current approaches often develop RAG and RLVR separately, limiting the generalization and applicability of RAG-RL methods across broader domains. The proposed method aims to unify these paradigms for improved adaptability. Method: UR2 integrates Retrieval-Augmented Generation (RAG) and Reinforcement Learning from Verifiable Rewards (RLVR) through a difficulty-aware curriculum training and a hybrid knowledge access strategy combining domain-specific offline corpora with LLM-generated summaries. Result: Experiments on open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks show that UR2 (built on Qwen2.5-3/7B and LLaMA-3.1-8B) significantly outperforms existing RAG and RL methods. Conclusion: UR2 significantly outperforms existing RAG and RL methods in various tasks, achieving performance comparable to advanced models like GPT-4o-mini and GPT-4.1-mini. Abstract: Large Language Models (LLMs) have shown remarkable capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG), which enhances knowledge grounding, and Reinforcement Learning from Verifiable Rewards (RLVR), which optimizes complex reasoning abilities. However, these two capabilities are often developed in isolation, and existing efforts to unify them remain narrow in scope-typically limited to open-domain QA with fixed retrieval settings and task-specific assumptions. This lack of integration constrains generalization and limits the applicability of RAG-RL methods to broader domains. To bridge this gap, we propose UR2 (Unified RAG and Reasoning), a general framework that unifies retrieval and reasoning through reinforcement learning. UR2 introduces two key contributions: a difficulty-aware curriculum training that selectively invokes retrieval only for challenging problems, and a hybrid knowledge access strategy combining domain-specific offline corpora with LLM-generated summaries. 
These components are designed to enable dynamic coordination between retrieval and reasoning, improving adaptability across a diverse range of tasks. Experiments across open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks demonstrate that UR2 (built on Qwen2.5-3/7B and LLaMA-3.1-8B) significantly outperforms existing RAG and RL methods, achieving comparable performance to GPT-4o-mini and GPT-4.1-mini on several benchmarks. We have released all code, models, and data at https://github.com/Tsinghua-dhy/UR2.
[24] Pragmatics beyond humans: meaning, communication, and LLMs
Vít Gvoždiak
Main category: cs.CL
TL;DR: This paper argues that traditional pragmatic theories are insufficient for understanding communication with large language models and proposes new frameworks like HMC and probabilistic pragmatics to better address human-machine interaction.
Details
Motivation: The emergence of large language models (LLMs) in communicative contexts necessitates a re-evaluation and refinement of traditional pragmatic theories, which are largely human-centered and not well-suited for understanding machine-generated communication. Method: The paper uses theoretical analysis and critique of traditional pragmatic theories, including the semiotic trichotomy and Gricean-inspired pragmatics, and proposes alternatives such as the Human-Machine Communication (HMC) framework and probabilistic pragmatics like the Rational Speech Act framework. Result: The paper successfully challenges traditional views of pragmatics, highlights the limitations of human-centered theories when applied to LLMs, and proposes alternative frameworks that better accommodate the dynamics of human-machine communication. Conclusion: The paper concludes that pragmatic theory needs to be adjusted or expanded to better account for communication involving generative AI, as traditional theories are not fully compatible with the machine-centered nature of LLMs. Abstract: The paper reconceptualizes pragmatics not as a subordinate, third dimension of meaning, but as a dynamic interface through which language operates as a socially embedded tool for action. With the emergence of large language models (LLMs) in communicative contexts, this understanding needs to be further refined and methodologically reconsidered. The first section challenges the traditional semiotic trichotomy, arguing that connectionist LLM architectures destabilize established hierarchies of meaning, and proposes the Human-Machine Communication (HMC) framework as a more suitable alternative. The second section examines the tension between human-centred pragmatic theories and the machine-centred nature of LLMs. While traditional, Gricean-inspired pragmatics continue to dominate, it relies on human-specific assumptions ill-suited to predictive systems like LLMs. 
Probabilistic pragmatics, particularly the Rational Speech Act framework, offers a more compatible teleology by focusing on optimization rather than truth-evaluation. The third section addresses the issue of substitutionalism in three forms - generalizing, linguistic, and communicative - highlighting the anthropomorphic biases that distort LLM evaluation and obscure the role of human communicative subjects. Finally, the paper introduces the concept of context frustration to describe the paradox of increased contextual input paired with a collapse in contextual understanding, emphasizing how users are compelled to co-construct pragmatic conditions both for the model and themselves. These arguments suggest that pragmatic theory may need to be adjusted or expanded to better account for communication involving generative AI.
[25] Comparing Knowledge Injection Methods for LLMs in a Low-Resource Regime
Hugo Abonizio,Thales Almeida,Roberto Lotufo,Rodrigo Nogueira
Main category: cs.CL
TL;DR: This paper explores efficient methods for injecting knowledge into large language models using limited data, showing that diverse prompting and synthetic data generation improve learning, while highlighting challenges like catastrophic forgetting and the limitations of RAG-based approaches.
Details
Motivation: The motivation stems from the challenge of updating LLMs with limited data, as traditional methods like large-scale pre-training or RAG are not always feasible. The study aims to explore more efficient ways to inject knowledge into LLMs using small, unstructured datasets. Method: The authors used a dataset of recent news to evaluate knowledge acquisition in LLMs. They compared methods such as continued pre-training and synthetic data generation through diverse prompting. They also analyzed the impact of retrieval-augmented generation (RAG) on knowledge injection and model performance on control datasets. Result: The experiments showed that simple continued pre-training on limited data offers modest improvements, while diverse prompting significantly enhances knowledge acquisition. The study also found that RAG-based approaches are sensitive and may degrade performance on control tasks. Models were also shown to be capable of generating effective synthetic training data. Conclusion: The study concludes that generating diverse textual variations significantly enhances the acquisition of new facts in large language models (LLMs), especially in small-data regimes, while highlighting the challenge of catastrophic forgetting. The research also demonstrates that models can generate effective synthetic training data for self-improvement. Abstract: Large language models (LLMs) often require vast amounts of text to effectively acquire new knowledge. While continuing pre-training on large corpora or employing retrieval-augmented generation (RAG) has proven successful, updating an LLM with only a few thousand or million tokens remains challenging. In this work, we investigate the task of injecting small, unstructured information into LLMs and its relation to the catastrophic forgetting phenomenon. 
We use a dataset of recent news -- ensuring no overlap with the model's pre-training data -- to evaluate knowledge acquisition by probing the model with question-answer pairs related to the learned information. Starting from a continued pre-training baseline, we explored different augmentation algorithms to generate synthetic data to improve the knowledge acquisition capabilities. Our experiments show that simply continuing pre-training on limited data yields modest improvements, whereas exposing the model to diverse textual variations significantly improves the learning of new facts -- particularly with methods that induce greater variability through diverse prompting. Furthermore, we shed light on the forgetting phenomenon in small-data regimes, illustrating the delicate balance between learning new content and retaining existing capabilities. We also confirm the sensitivity of RAG-based approaches for knowledge injection, which often lead to greater degradation on control datasets compared to parametric methods. Finally, we demonstrate that models can generate effective synthetic training data themselves, suggesting a pathway toward self-improving model updates. All code and generated data used in our experiments are publicly available, providing a resource for studying efficient knowledge injection in LLMs with limited data at https://github.com/hugoabonizio/knowledge-injection-methods.
[26] DKG-LLM : A Framework for Medical Diagnosis and Personalized Treatment Recommendations via Dynamic Knowledge Graph and Large Language Model Integration
Ali Sarabadani,Maryam Abdollahi Shamami,Hamidreza Sadeghsalehi,Borhan Asadi,Saba Hesaraki
Main category: cs.CL
TL;DR: This paper presents the DKG-LLM framework, which combines a dynamic knowledge graph with a large language model to improve medical diagnosis and personalized treatment recommendations.
Details
Motivation: The exponential growth of Large Language Models (LLMs) since the release of ChatGPT has led to transformative advancements in natural language understanding, taking a significant step towards artificial general intelligence (AGI). Method: The DKG-LLM framework integrates a dynamic knowledge graph (DKG) with the Grok 3 large language model using the Adaptive Semantic Fusion Algorithm (ASFA). Result: The evaluation results show that DKG-LLM achieves a diagnostic accuracy of 84.19%, a treatment recommendation accuracy of 89.63%, and a semantic coverage of 93.48%. Conclusion: DKG-LLM is a reliable and transformative tool that handles noisy data and complex multi-symptom diseases, along with feedback-based learning from physician input. Abstract: Large Language Models (LLMs) have grown exponentially since the release of ChatGPT. These models have gained attention due to their robust performance on various tasks, including language processing tasks. These models achieve understanding and comprehension of tasks by training billions of parameters. The development of these models is a transformative force in enhancing natural language understanding and has taken a significant step towards artificial general intelligence (AGI). In this study, we aim to present the DKG-LLM framework. The DKG-LLM framework introduces a groundbreaking approach to medical diagnosis and personalized treatment recommendations by integrating a dynamic knowledge graph (DKG) with the Grok 3 large language model. Using the Adaptive Semantic Fusion Algorithm (ASFA), heterogeneous medical data (including clinical reports and PubMed articles) and patient records dynamically generate a knowledge graph consisting of 15,964 nodes in 13 distinct types (e.g., diseases, symptoms, treatments, patient profiles) and 127,392 edges in 26 relationship types (e.g., causal, therapeutic, association). 
ASFA utilizes advanced probabilistic models, Bayesian inference, and graph optimization to extract semantic information, dynamically updating the graph with approximately 150 new nodes and edges in each data category while maintaining scalability with up to 987,654 edges. Real-world datasets, including MIMIC-III and PubMed, were utilized to evaluate the proposed architecture. The evaluation results show that DKG-LLM achieves a diagnostic accuracy of 84.19%. The model also has a treatment recommendation accuracy of 89.63% and a semantic coverage of 93.48%. DKG-LLM is a reliable and transformative tool that handles noisy data and complex multi-symptom diseases, along with feedback-based learning from physician input.
[27] Beyond Uniform Criteria: Scenario-Adaptive Multi-Dimensional Jailbreak Evaluation
Lai Jiang,Yuekang Li,Xiaohan Zhang,Youtao Ding,Li Pan
Main category: cs.CL
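TL;DR: SceneJailEval proposes a flexible scenario-adaptive multi-dimensional evaluation framework and builds a comprehensive 14-scenario dataset, significantly improving the precision of evaluating LLM jailbreaks across different scenarios.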
Details
Motivation: Existing evaluation methods use either binary classification or multi-dimensional frameworks with uniform criteria, and cannot accurately quantify harm intensity across different scenarios, which lowers evaluation precision. Method: The paper proposes the SceneJailEval framework, builds a comprehensive 14-scenario dataset, and evaluates performance with the F1 score. Result: SceneJailEval achieves an F1 score of 0.917 on the full-scenario dataset (+6% over prior SOTA) and 0.995 on JBB (+3% over prior SOTA). Conclusion: SceneJailEval is a highly extensible scenario-adaptive multi-dimensional evaluation framework that achieves state-of-the-art results on both the comprehensive scenario dataset and JBB, overcoming the limitations of existing methods. Abstract: Precise jailbreak evaluation is vital for LLM red teaming and jailbreak research. Current approaches employ binary classification (e.g., string matching, toxic text classifiers, LLM-driven methods), yielding only "yes/no" labels without quantifying harm intensity. Existing multi-dimensional frameworks (e.g., Security Violation, Relative Truthfulness, Informativeness) apply uniform evaluation criteria across scenarios, resulting in scenario-specific mismatches--for instance, "Relative Truthfulness" is irrelevant to "hate speech"--which compromise evaluation precision. To tackle these limitations, we introduce SceneJailEval, with key contributions: (1) A groundbreaking scenario-adaptive multi-dimensional framework for jailbreak evaluation, overcoming the critical "one-size-fits-all" constraint of existing multi-dimensional methods, and featuring strong extensibility to flexibly adapt to customized or emerging scenarios. (2) A comprehensive 14-scenario dataset with diverse jailbreak variants and regional cases, filling the long-standing gap in high-quality, holistic benchmarks for scenario-adaptive evaluation. (3) SceneJailEval achieves state-of-the-art results, with an F1 score of 0.917 on our full-scenario dataset (+6% over prior SOTA) and 0.995 on JBB (+3% over prior SOTA), surpassing accuracy limits of existing evaluation methods in heterogeneous scenarios and confirming its advantage.
[28] EICAP: Deep Dive in Assessment and Enhancement of Large Language Models in Emotional Intelligence through Multi-Turn Conversations
Nizi Nazar,Ehsaneddin Asgari
Main category: cs.CL
TL;DR: This paper introduces a new EI framework and benchmark for LLMs, showing that current models have limited emotional reasoning and suggesting the need for more targeted approaches to improve EI alignment.
Details
Motivation: Emotional Intelligence (EI) is an important but underexplored area in the development of human-aligned LLMs. This research aims to address this gap by creating a structured framework and benchmark for evaluating and improving EI in LLMs. Method: The researchers introduced a four-layer taxonomy of EI for LLMs and developed EICAP-Bench, a benchmark to assess EI in LLMs. They evaluated six models and fine-tuned two models using LoRA adapters on the UltraChat dataset in English and Arabic. Result: Among the six evaluated LLMs, Qwen2.5-Instruct performed best. Fine-tuning with UltraChat showed improvement only in the Appraisal layer of EI, indicating limited effectiveness of current methods in enhancing deeper emotional reasoning. Conclusion: The study concludes that current pretraining and instruction-tuning methods have limitations in developing deep emotional reasoning in LLMs, and targeted strategies are needed for better emotional intelligence (EI) alignment. Abstract: Emotional Intelligence (EI) is a critical yet underexplored dimension in the development of human-aligned LLMs. To address this gap, we introduce a unified, psychologically grounded four-layer taxonomy of EI tailored for large language models (LLMs), encompassing emotional tracking, cause inference, appraisal, and emotionally appropriate response generation. Building on this framework, we present EICAP-Bench, a novel MCQ style multi-turn benchmark designed to evaluate EI capabilities in open-source LLMs across diverse linguistic and cultural contexts. We evaluate six LLMs: LLaMA3 (8B), LLaMA3-Instruct, Gemma (9B), Gemma-Instruct, Qwen2.5 (7B), and Qwen2.5-Instruct on EmoCap-Bench, identifying Qwen2.5-Instruct as the strongest baseline. To assess the potential for enhancing EI capabilities, we fine-tune both Qwen2.5-Base and Qwen2.5-Instruct using LoRA adapters on UltraChat (UC), a large-scale, instruction-tuned dialogue dataset, in both English and Arabic. 
Our statistical analysis reveals that among the five EI layers, only the Appraisal layer shows significant improvement through UC-based fine-tuning. These findings highlight the limitations of existing pretraining and instruction-tuning paradigms in equipping LLMs with deeper emotional reasoning and underscore the need for targeted data and modeling strategies for comprehensive EI alignment.[29] Classification is a RAG problem: A case study on hate speech detection
Richard Willats,Josh Pennington,Aravind Mohan,Bertie Vidgen
Main category: cs.CL
TL;DR: This paper introduces a Retrieval-Augmented Generation (RAG) approach for content moderation that improves flexibility, explainability, and adaptability without retraining, offering a more efficient and transparent classification system.
Details
Motivation: Content moderation systems need to adapt quickly to evolving policies without the cost and delay of retraining models. Traditional classification systems are limited in flexibility and explainability. Method: The authors proposed a Retrieval-Augmented Generation (RAG) system, called Contextual Policy Engine (CPE), which shifts traditional classification tasks by evaluating content in relation to contextual knowledge retrieved at inference rather than relying solely on pre-trained parameters. Result: The CPE system demonstrated robust classification accuracy comparable to leading commercial systems, provided inherent explainability through retrieved policy segments, and enabled dynamic policy updates without retraining. It also allowed fine-grained policy control, such as adjusting protection for specific identity groups, without compromising overall performance. Conclusion: RAG-based classification proves to be a flexible, transparent, and adaptable method for content moderation and broader classification problems without costly retraining. Abstract: Robust content moderation requires classification systems that can quickly adapt to evolving policies without costly retraining. We present classification using Retrieval-Augmented Generation (RAG), which shifts traditional classification tasks from determining the correct category in accordance with pre-trained parameters to evaluating content in relation to contextual knowledge retrieved at inference. In hate speech detection, this transforms the task from "is this hate speech?" to "does this violate the hate speech policy?" Our Contextual Policy Engine (CPE) - an agentic RAG system - demonstrates this approach and offers three key advantages: (1) robust classification accuracy comparable to leading commercial systems, (2) inherent explainability via retrieved policy segments, and (3) dynamic policy updates without model retraining. 
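The shift from "is this hate speech?" to "does this violate the policy?" can be illustrated with a toy sketch. This is not the authors' Contextual Policy Engine: the `PolicySegment` schema, the keyword-based `retrieve` step, and the `ATTACK_WORDS` rule are all hypothetical stand-ins for the retrieval and LLM judgement an agentic RAG system would perform. The sketch only shows how a decision can cite retrieved policy segments and how a policy edit can change the outcome without touching any model parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicySegment:
    """One retrievable piece of a moderation policy (hypothetical schema)."""
    segment_id: str
    protected_group: str  # placeholder group name this segment protects


# Assumption: a fixed word list stands in for the model's judgement
# of whether a text attacks someone.
ATTACK_WORDS = {"worthless", "inferior"}


def retrieve(content, policy):
    """Return the policy segments whose protected group is mentioned in content."""
    tokens = set(content.lower().split())
    return [seg for seg in policy if seg.protected_group in tokens]


def classify(content, policy):
    """Judge content against retrieved policy segments.

    Returns a label plus the IDs of the segments that justify it,
    which is where the explainability comes from.
    """
    tokens = set(content.lower().split())
    retrieved = retrieve(content, policy)
    if retrieved and tokens & ATTACK_WORDS:
        return "violates", [seg.segment_id for seg in retrieved]
    return "allowed", []


policy = [PolicySegment("P1", "group_a")]
post = "group_b people are worthless"

before = classify(post, policy)
# Policy update: extend protection to group_b. Nothing is retrained.
policy.append(PolicySegment("P2", "group_b"))
after = classify(post, policy)
```

Here the same post moves from `allowed` to `violates` purely because the policy changed, and the returned segment ID (`P2`) records which policy text drove the decision.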
Through three experiments, we demonstrate strong baseline performance and show that the system can apply fine-grained policy control by correctly adjusting protection for specific identity groups without requiring retraining or compromising overall performance. These findings establish that RAG can transform classification into a more flexible, transparent, and adaptable process for content moderation and wider classification problems.[30] InfoCausalQA:Can Models Perform Non-explicit Causal Reasoning Based on Infographic?
Keummin Ka,Junhyeong Park,Jahyun Jeon,Youngjae Yu
Main category: cs.CL
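TL;DR: This paper presents InfoCausalQA, a new benchmark for evaluating infographic-based causal reasoning, revealing that current vision-language models fall short in causal reasoning.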
Details
Motivation: Although Vision-Language Models (VLMs) have made notable progress in perception and reasoning, their ability to perform causal inference in multimodal settings remains underexplored. Method: The authors build a new benchmark, InfoCausalQA, with two tasks: Task 1 covers quantitative causal reasoning based on inferred numerical trends, and Task 2 covers semantic causal reasoning over five types of causal relations. GPT-4o is used to generate 1,482 high-quality multiple-choice QA pairs, which humans then revise so that the questions require genuine visual grounding to answer. Result: Experiments show that current VLMs have limited capability in both quantitative and semantic causal reasoning, with a significant gap relative to humans. Conclusion: Current VLMs show limitations in computational and semantic causal reasoning, indicating that multimodal AI systems still have considerable room for improvement in causal inference. Abstract: Recent advances in Vision-Language Models (VLMs) have demonstrated impressive capabilities in perception and reasoning. However, the ability to perform causal inference -- a core aspect of human cognition -- remains underexplored, particularly in multimodal settings. In this study, we introduce InfoCausalQA, a novel benchmark designed to evaluate causal reasoning grounded in infographics that combine structured visual data with textual context. The benchmark comprises two tasks: Task 1 focuses on quantitative causal reasoning based on inferred numerical trends, while Task 2 targets semantic causal reasoning involving five types of causal relations: cause, effect, intervention, counterfactual, and temporal. We manually collected 494 infographic-text pairs from four public sources and used GPT-4o to generate 1,482 high-quality multiple-choice QA pairs. These questions were then carefully revised by humans to ensure they cannot be answered based on surface-level cues alone but instead require genuine visual grounding. Our experimental results reveal that current VLMs exhibit limited capability in computational reasoning and even more pronounced limitations in semantic causal reasoning. Their significantly lower performance compared to humans indicates a substantial gap in leveraging infographic-based information for causal inference. Through InfoCausalQA, we highlight the need for advancing the causal reasoning abilities of multimodal AI systems.[31] Large Language Model Data Generation for Enhanced Intent Recognition in German Speech
Theresa Pekarek Rosin,Burak Can Kaplan,Stefan Wermter
Main category: cs.CL
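TL;DR: This paper proposes a new approach to improve intent recognition for elderly German speech by combining an adapted Whisper ASR model with Transformer-based language models.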
Details
Motivation: Intent recognition is essential for AI assistant systems, but existing approaches are largely limited to short English commands, and elderly German speech remains understudied. Method: An adapted Whisper ASR model, fine-tuned on elderly German speech (SVC-de), is combined with Transformer-based language models trained on synthetic text datasets generated by LeoLM, Llama3, and ChatGPT. Result: Experiments show that synthetic LLM-generated data significantly improves classification performance and robustness to different speaking styles and unseen vocabulary. Conclusion: The paper proposes a new approach that combines an adapted Whisper ASR model with Transformer-based language models for intent recognition in elderly German speech, and demonstrates the effectiveness of generative AI in low-resource domains. Abstract: Intent recognition (IR) for speech commands is essential for artificial intelligence (AI) assistant systems; however, most existing approaches are limited to short commands and are predominantly developed for English. This paper addresses these limitations by focusing on IR from speech by elderly German speakers. We propose a novel approach that combines an adapted Whisper ASR model, fine-tuned on elderly German speech (SVC-de), with Transformer-based language models trained on synthetic text datasets generated by three well-known large language models (LLMs): LeoLM, Llama3, and ChatGPT. To evaluate the robustness of our approach, we generate synthetic speech with a text-to-speech model and conduct extensive cross-dataset testing. Our results show that synthetic LLM-generated data significantly boosts classification performance and robustness to different speaking styles and unseen vocabulary. Notably, we find that LeoLM, a smaller, domain-specific 13B LLM, surpasses the much larger ChatGPT (175B) in dataset quality for German intent recognition. Our approach demonstrates that generative AI can effectively bridge data gaps in low-resource domains. We provide detailed documentation of our data generation and training process to ensure transparency and reproducibility.[32] Matrix-Driven Instant Review: Confident Detection and Reconstruction of LLM Plagiarism on PC
Ruichong Zhang
Main category: cs.CL
TL;DR: Matrix-Driven Instant Review (MDIR) improves LLM plagiarism detection through matrix analysis and statistical methods.
Details
Motivation: Concerns about intellectual property in large language models have grown, and current plagiarism detection methods have significant limitations. Method: Matrix-Driven Instant Review (MDIR) uses matrix analysis and Large Deviation Theory to reconstruct weight relationships and estimate p-values. Result: MDIR accurately detects plagiarism, computes statistical significance, and works efficiently on a single PC within an hour. Conclusion: MDIR is an efficient and accessible method for detecting LLM plagiarism, even after extensive transformations. Abstract: In recent years, concerns about intellectual property (IP) in large language models (LLMs) have grown significantly. Plagiarizing other LLMs (through direct weight copying, upcycling, pruning, or continual pretraining) and claiming authorship without properly attributing to the original license, is a serious misconduct that can lead to significant financial and reputational harm to the original developers. However, existing methods for detecting LLM plagiarism fall short in key areas. They fail to accurately reconstruct weight correspondences, lack the ability to compute statistical significance measures such as $p$-values, and may mistakenly flag models trained on similar data as being related. To address these limitations, we propose Matrix-Driven Instant Review (MDIR), a novel method that leverages matrix analysis and Large Deviation Theory. MDIR achieves accurate reconstruction of weight relationships, provides rigorous $p$-value estimation, and focuses exclusively on weight similarity without requiring full model inference. Experimental results demonstrate that MDIR reliably detects plagiarism even after extensive transformations, such as random permutations and continual pretraining with trillions of tokens. 
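To make "reconstructing weight correspondences after a random permutation" concrete, here is a minimal sketch under loud assumptions. It is not MDIR: it swaps in plain greedy cosine matching of rows and does not attempt the matrix analysis, Large Deviation Theory, or $p$-value estimation the abstract describes. The matrices, sizes, and noise level are synthetic and hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_rows(w_ref, w_sus):
    """For each row of w_sus, return the index of the most similar row in w_ref."""
    return [
        max(range(len(w_ref)), key=lambda j: cosine(row, w_ref[j]))
        for row in w_sus
    ]


rng = random.Random(0)
n_rows, n_cols = 8, 32

# Synthetic "original" weight matrix (hypothetical, Gaussian entries).
w_ref = [[rng.gauss(0, 1) for _ in range(n_cols)] for _ in range(n_rows)]

# A "suspect" matrix: the same rows in a hidden random order, plus small
# noise as a crude stand-in for light further training.
perm = list(range(n_rows))
rng.shuffle(perm)
w_sus = [[x + rng.gauss(0, 0.01) for x in w_ref[p]] for p in perm]

# The hidden permutation is recovered from weight similarity alone,
# without running either model.
recovered = match_rows(w_ref, w_sus)
```

A real detector also has to handle transformations far harsher than this toy noise and to quantify how unlikely such a match is by chance, which is the part this sketch leaves out.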
Moreover, all detections can be performed on a single PC within an hour, making MDIR both efficient and accessible.[33] Harnessing Adaptive Topology Representations for Zero-Shot Graph Question Answering
Yanbin Wei,Jiangyue Yan,Chun Kang,Yang Chen,Hua Liu,James T. Kwok,Yu Zhang
Main category: cs.CL
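TL;DR: This paper proposes the DynamicTRF framework, which adaptively selects the most suitable graph representation and significantly improves the accuracy and conciseness of large multimodal models on zero-shot graph question answering.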
Details
Motivation: Existing approaches that rely on a single graph representation fail to account for the specific preferences of different models or tasks, leading to incorrect or overly long answers, so a framework that adaptively selects the best graph representation is needed. Method: The authors first analyze the characteristics and weaknesses of existing graph representation forms and design a set of forms tailored to zero-shot graph QA, denoted $F_{ZS}$. They introduce a new metric, Graph Response Efficiency (GRE), and build the DynamicTRF framework, which creates a TRF Preference (TRFP) dataset and trains a TRF router to adaptively assign the best TRF to each question. Result: Extensive experiments on 7 in-domain algorithmic graph QA tasks and 2 out-of-domain downstream tasks show that DynamicTRF significantly improves the accuracy and efficiency of zero-shot graph QA. Conclusion: The DynamicTRF framework significantly improves the accuracy and conciseness of large multimodal models on zero-shot graph QA tasks. Abstract: Large Multimodal Models (LMMs) have shown generalized zero-shot capabilities in diverse domain question-answering (QA) tasks, including graph QA that involves complex graph topologies. However, most current approaches use only a single type of graph representation, namely Topology Representation Form (TRF), such as prompt-unified text descriptions or style-fixed visual styles. Those "one-size-fits-all" approaches fail to consider the specific preferences of different models or tasks, often leading to incorrect or overly long responses. To address this, we first analyze the characteristics and weaknesses of existing TRFs, and then design a set of TRFs, denoted by $F_{ZS}$, tailored to zero-shot graph QA. We then introduce a new metric, Graph Response Efficiency (GRE), which measures the balance between the performance and the brevity in graph QA. Built on these, we develop the DynamicTRF framework, which aims to improve both the accuracy and conciseness of graph QA. To be specific, DynamicTRF first creates a TRF Preference (TRFP) dataset that ranks TRFs based on their GRE scores, to probe the question-specific TRF preferences. Then it trains a TRF router on the TRFP dataset, to adaptively assign the best TRF from $F_{ZS}$ for each question during the inference. Extensive experiments across 7 in-domain algorithmic graph QA tasks and 2 out-of-domain downstream tasks show that DynamicTRF significantly enhances the zero-shot graph QA of LMMs in terms of accuracy[34] Cyberbullying Detection via Aggression-Enhanced Prompting
Aisha Saeid,Anu Sabu,Girish A. Koushik,Ferrante Neri,Diptesh Kanojia
Main category: cs.CL
TL;DR: This study explores how incorporating aggression detection improves cyberbullying detection using LLMs, proposing an enriched prompt pipeline that enhances performance through contextual augmentation.
Details
Motivation: Detecting cyberbullying on social media is a critical challenge due to its subtle and varied expressions, and this study aims to improve the generalization and performance of LLMs by integrating aggression detection as an auxiliary task. Method: The study uses instruction-tuned LLMs and evaluates multiple strategies including zero-shot, few-shot, independent LoRA fine-tuning, and multi-task learning (MTL). An enriched prompt pipeline is proposed, embedding aggression predictions into cyberbullying detection prompts for contextual augmentation. Result: Preliminary results show that the enriched prompt pipeline approach consistently outperforms standard LoRA fine-tuning in cyberbullying detection, demonstrating the value of aggression-informed context. Conclusion: The study concludes that integrating aggression detection as an auxiliary task enhances the performance and generalization of large language models (LLMs) in detecting cyberbullying on social media, particularly through the proposed enriched prompt pipeline approach. Abstract: Detecting cyberbullying on social media remains a critical challenge due to its subtle and varied expressions. This study investigates whether integrating aggression detection as an auxiliary task within a unified training framework can enhance the generalisation and performance of large language models (LLMs) in cyberbullying detection. Experiments are conducted on five aggression datasets and one cyberbullying dataset using instruction-tuned LLMs. We evaluated multiple strategies: zero-shot, few-shot, independent LoRA fine-tuning, and multi-task learning (MTL). Given the inconsistent results of MTL, we propose an enriched prompt pipeline approach in which aggression predictions are embedded into cyberbullying detection prompts to provide contextual augmentation. 
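The enriched prompt pipeline described above can be sketched as two chained calls. Everything concrete here is an assumption for illustration: the prompt wording, the label strings, and the two `toy_*` functions, which are trivial stand-ins for the fine-tuned LLMs used in the study.

```python
def build_enriched_prompt(post, aggression_label):
    """Embed an aggression prediction into a cyberbullying detection prompt.

    The template wording is hypothetical, not taken from the paper.
    """
    return (
        "Task: decide whether the post below is cyberbullying.\n"
        f"Auxiliary signal: an aggression detector labelled this post as "
        f"'{aggression_label}'.\n"
        f"Post: {post}\n"
        "Answer with 'cyberbullying' or 'not cyberbullying'."
    )


def enriched_pipeline(post, aggression_model, cyberbullying_model):
    """Run aggression detection first, then pass its output as prompt context."""
    aggression_label = aggression_model(post)
    prompt = build_enriched_prompt(post, aggression_label)
    return cyberbullying_model(prompt)


# Toy stand-ins so the sketch runs end to end; real use would call LLMs here.
def toy_aggression_model(post):
    return "aggressive" if "loser" in post.lower() else "non-aggressive"


def toy_cyberbullying_model(prompt):
    return "cyberbullying" if "'aggressive'" in prompt else "not cyberbullying"


flagged = enriched_pipeline(
    "You are such a loser", toy_aggression_model, toy_cyberbullying_model
)
benign = enriched_pipeline(
    "Nice game today", toy_aggression_model, toy_cyberbullying_model
)
```

Passing the two models in as callables keeps the pipeline independent of any particular LLM client, so the aggression detector can be swapped without changing the prompt construction.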
Preliminary results show that the enriched prompt pipeline consistently outperforms standard LoRA fine-tuning, indicating that aggression-informed context significantly boosts cyberbullying detection. This study highlights the potential of auxiliary tasks, such as aggression detection, to improve the generalisation of LLMs for safety-critical applications on social networks.[35] Evaluating Style-Personalized Text Generation: Challenges and Directions
Anubhav Jangra,Bahareh Sarrafzadeh,Adrian de Wynter,Silviu Cucerzan,Sujay Kumar Jauhar
Main category: cs.CL
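TL;DR: This paper questions the effectiveness of traditional evaluation metrics such as BLEU and ROUGE for style-personalized text generation and proposes using an ensemble of diverse evaluation metrics instead.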
Details
Motivation: The paper explores evaluation in the low-resource author style-personalized text generation space and questions the effectiveness of widely adopted metrics such as BLEU and ROUGE. Method: The paper evaluates widely adopted metrics, including BLEU and ROUGE, alongside alternatives such as style embeddings and LLM-as-judge, using a style discrimination benchmark that spans eight writing tasks. Result: The findings indicate that an ensemble of diverse evaluation metrics should be adopted to evaluate style-personalized text generation effectively. Conclusion: To evaluate style-personalized text generation effectively, an ensemble of diverse evaluation metrics should be adopted. Abstract: While prior research has built tools and benchmarks towards style personalized text generation, there has been limited exploration of evaluation in low-resource author style personalized text generation space. Through this work, we question the effectiveness of the widely adopted evaluation metrics like BLEU and ROUGE, and explore other evaluation paradigms such as style embeddings and LLM-as-judge to holistically evaluate the style personalized text generation task. We evaluate these metrics and their ensembles using our style discrimination benchmark, that spans eight writing tasks, and evaluates across three settings, domain discrimination, authorship attribution, and LLM personalized vs non-personalized discrimination. We provide conclusive evidence to adopt ensemble of diverse evaluation metrics to effectively evaluate style personalized text generation.[36] LLMs vs. Chinese Anime Enthusiasts: A Comparative Study on Emotionally Supportive Role-Playing
Lanlan Qiu,Xiao Pu,Yeqi Feng,Tianxing He
Main category: cs.CL
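TL;DR: This paper introduces ChatAnime, a dataset and evaluation system for assessing the role-playing and emotional support capabilities of large language models.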
Details
Motivation: The paper aims to fill an important research gap by combining LLM research on role-playing conversations and on emotional support, enabling emotionally supportive interactions with virtual characters. Method: The paper introduces ChatAnime, the first Emotionally Supportive Role-Playing (ESRP) dataset. The researchers select 20 top-tier anime characters, design 60 emotion-centric real-world scenario questions, run a nationwide selection process to identify 40 Chinese anime enthusiasts, and systematically collect dialogue data. They also design a user experience-oriented evaluation system. Result: Experiments show that top-performing LLMs surpass human fans in role-playing and emotional support, while humans still lead in response diversity. Conclusion: The best-performing LLMs surpass human fans in role-playing and emotional support, while humans still lead in response diversity. Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in role-playing conversations and providing emotional support as separate research directions. However, there remains a significant research gap in combining these capabilities to enable emotionally supportive interactions with virtual characters. To address this research gap, we focus on anime characters as a case study because of their well-defined personalities and large fan bases. This choice enables us to effectively evaluate how well LLMs can provide emotional support while maintaining specific character traits. We introduce ChatAnime, the first Emotionally Supportive Role-Playing (ESRP) dataset. We first thoughtfully select 20 top-tier characters from popular anime communities and design 60 emotion-centric real-world scenario questions. Then, we execute a nationwide selection process to identify 40 Chinese anime enthusiasts with profound knowledge of specific characters and extensive experience in role-playing. Next, we systematically collect two rounds of dialogue data from 10 LLMs and these 40 Chinese anime enthusiasts. To evaluate the ESRP performance of LLMs, we design a user experience-oriented evaluation system featuring 9 fine-grained metrics across three dimensions: basic dialogue, role-playing and emotional support, along with an overall metric for response diversity. In total, the dataset comprises 2,400 human-written and 24,000 LLM-generated answers, supported by over 132,000 human annotations. 
Experimental results show that top-performing LLMs surpass human fans in role-playing and emotional support, while humans still lead in response diversity. We hope this work can provide valuable resources and insights for future research on optimizing LLMs in ESRP. Our datasets are available at https://github.com/LanlanQiu/ChatAnime.[37] Quantifying Conversation Drift in MCP via Latent Polytope
Haoran Shi,Hongwei Yao,Shuo Shao,Shaopeng Jiao,Ziqi Peng,Zhan Qin,Cong Wang
Main category: cs.CL
TL;DR: SecMCP is a secure framework that proactively detects and quantifies conversation drift caused by adversarial external knowledge in large language models, demonstrating strong performance with AUROC scores exceeding 0.915.
Details
Motivation: The Model Context Protocol (MCP) introduces security and privacy risks such as tool poisoning and prompt injection. Existing defenses are inadequate due to static signatures, computational inefficiency, and inability to quantify conversational hijacking. Method: SecMCP models LLM activation vectors within a latent polytope space to identify anomalous shifts in conversational dynamics, enabling proactive detection of hijacking, misleading, and data exfiltration. Result: SecMCP was evaluated on Llama3, Vicuna, and Mistral across MS MARCO, HotpotQA, and FinQA datasets, demonstrating robust detection performance with AUROC scores exceeding 0.915. Conclusion: SecMCP effectively detects and quantifies conversation drift caused by adversarial external knowledge, demonstrating robust detection with AUROC scores exceeding 0.915 across three state-of-the-art LLMs while maintaining system usability. Abstract: The Model Context Protocol (MCP) enhances large language models (LLMs) by integrating external tools, enabling dynamic aggregation of real-time data to improve task execution. However, its non-isolated execution context introduces critical security and privacy risks. In particular, adversarially crafted content can induce tool poisoning or indirect prompt injection, leading to conversation hijacking, misinformation propagation, or data exfiltration. Existing defenses, such as rule-based filters or LLM-driven detection, remain inadequate due to their reliance on static signatures, computational inefficiency, and inability to quantify conversational hijacking. To address these limitations, we propose SecMCP, a secure framework that detects and quantifies conversation drift, deviations in latent space trajectories induced by adversarial external knowledge. By modeling LLM activation vectors within a latent polytope space, SecMCP identifies anomalous shifts in conversational dynamics, enabling proactive detection of hijacking, misleading, and data exfiltration. 
We evaluate SecMCP on three state-of-the-art LLMs (Llama3, Vicuna, Mistral) across benchmark datasets (MS MARCO, HotpotQA, FinQA), demonstrating robust detection with AUROC scores exceeding 0.915 while maintaining system usability. Our contributions include a systematic categorization of MCP security threats, a novel latent polytope-based methodology for quantifying conversation drift, and empirical validation of SecMCP's efficacy.[38] Memp: Exploring Agent Procedural Memory
Runnan Fang,Yuan Liang,Xiaobin Wang,Jialong Wu,Shuofei Qiao,Pengjun Xie,Fei Huang,Huajun Chen,Ningyu Zhang
Main category: cs.CL
TL;DR: The paper introduces Memp, a strategy to equip LLM-based agents with a learnable and updatable procedural memory, resulting in improved performance and efficiency across tasks, particularly when transferring memory from stronger to weaker models.
Details
Motivation: The motivation of this paper is to overcome the limitations of brittle procedural memory in Large Language Model (LLM) based agents, which are typically manually engineered or entangled in static parameters. The authors aim to provide agents with a more robust, adaptable form of procedural memory that can evolve with new experiences. Method: The paper proposes Memp, which distills past agent trajectories into both fine-grained, step-by-step instructions and higher-level, script-like abstractions to create a learnable, updatable, and lifelong procedural memory. The approach explores the impact of different strategies for building, retrieving, and updating procedural memory in conjunction with a dynamic regimen that continuously updates, corrects, and deprecates its contents. Result: Empirical evaluations on TravelPlanner and ALFWorld demonstrate that as the memory repository is refined, agents achieve higher success rates and greater efficiency on analogous tasks. Additionally, procedural memory built from a stronger model retains its value, showing substantial performance gains when migrated to a weaker model. Conclusion: The paper concludes that procedural memory built from a stronger model retains its value and can substantially improve the performance of weaker models when applied to similar tasks. Abstract: Large Language Models (LLMs) based agents excel at diverse tasks, yet they suffer from brittle procedural memory that is manually engineered or entangled in static parameters. In this work, we investigate strategies to endow agents with a learnable, updatable, and lifelong procedural memory. We propose Memp that distills past agent trajectories into both fine-grained, step-by-step instructions and higher-level, script-like abstractions, and explore the impact of different strategies for Build, Retrieval, and Update of procedural memory. 
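The Build, Retrieval, and Update operations can be pictured with a small in-memory store. This is only a sketch under stated assumptions, not the Memp implementation: the `MemoryEntry` fields, the word-overlap (Jaccard) retrieval, and the `max_failures` deprecation rule are hypothetical choices made so the example is self-contained.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    """One stored procedure (hypothetical schema)."""
    task: str
    steps: list        # fine-grained, step-by-step instructions
    script: str        # higher-level, script-like abstraction
    failures: int = 0  # consecutive failed reuses


class ProceduralMemory:
    def __init__(self, max_failures=2):
        self.entries = []
        self.max_failures = max_failures

    def build(self, task, trajectory, script):
        """Build: distill a past trajectory into a stored entry."""
        entry = MemoryEntry(task, list(trajectory), script)
        self.entries.append(entry)
        return entry

    def retrieve(self, task):
        """Retrieval: return the entry with the highest word overlap, or None."""
        query = set(task.lower().split())

        def score(entry):
            words = set(entry.task.lower().split())
            union = query | words
            return len(query & words) / len(union) if union else 0.0

        best = max(self.entries, key=score, default=None)
        return best if best is not None and score(best) > 0 else None

    def update(self, entry, success, corrected_steps=None):
        """Update: reset on success, correct on failure, deprecate when stale."""
        if success:
            entry.failures = 0
            return
        entry.failures += 1
        if corrected_steps is not None:
            entry.steps = list(corrected_steps)
        if entry.failures >= self.max_failures:
            self.entries.remove(entry)


memory = ProceduralMemory(max_failures=2)
stored = memory.build(
    "heat an egg in the microwave",
    ["find egg", "open microwave", "heat egg"],
    "locate the object, then use the appliance",
)
hit = memory.retrieve("heat a potato in the microwave")

# Two failed reuses in a row deprecate the entry.
memory.update(hit, success=False)
memory.update(hit, success=False)
after_failures = memory.retrieve("heat a potato in the microwave")
```

Keeping both `steps` and `script` on each entry mirrors the two granularities named in the abstract; which one is handed to the agent at retrieval time is left open here.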
Coupled with a dynamic regimen that continuously updates, corrects, and deprecates its contents, this repository evolves in lockstep with new experience. Empirical evaluation on TravelPlanner and ALFWorld shows that as the memory repository is refined, agents achieve steadily higher success rates and greater efficiency on analogous tasks. Moreover, procedural memory built from a stronger model retains its value: migrating the procedural memory to a weaker model yields substantial performance gains.[39] Learning the Topic, Not the Language: How LLMs Classify Online Immigration Discourse Across Languages
Andrea Nasuto,Stefano Maria Iacus,Francisco Rowe,Devika Jain
Main category: cs.CL
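TL;DR: The study finds that fine-tuned large language models (LLMs) can reliably classify immigration-related content in unseen languages, and that even a small amount of fine-tuning significantly improves performance.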
Details
Motivation: The study asks whether knowledge acquired through fine-tuning can transfer to languages that appeared only during pre-training. Method: LLaMA 3.2-3B models are fine-tuned on monolingual, bilingual, or multilingual datasets to classify immigration-related tweets across 13 languages. Result: LLMs fine-tuned in one or two languages can reliably classify immigration-related content in unseen languages. Bias favours dominant languages, but even minimal exposure to under-represented languages during fine-tuning significantly improves performance. Conclusion: Knowledge gained through fine-tuning transfers to unseen languages, and pre-training bias can be corrected with a small amount of fine-tuning. Abstract: Large language models (LLMs) are transforming social-science research by enabling scalable, precise analysis. Their adaptability raises the question of whether knowledge acquired through fine-tuning in a few languages can transfer to unseen languages that only appeared during pre-training. To examine this, we fine-tune lightweight LLaMA 3.2-3B models on monolingual, bilingual, or multilingual data sets to classify immigration-related tweets from X/Twitter across 13 languages, a domain characterised by polarised, culturally specific discourse. We evaluate whether minimal language-specific fine-tuning enables cross-lingual topic detection and whether adding targeted languages corrects pre-training biases. Results show that LLMs fine-tuned in one or two languages can reliably classify immigration-related content in unseen languages. However, identifying whether a tweet expresses a pro- or anti-immigration stance benefits from multilingual fine-tuning. Pre-training bias favours dominant languages, but even minimal exposure to under-represented languages during fine-tuning (as little as $9.62\times10^{-11}$ of the original pre-training token volume) yields significant gains. These findings challenge the assumption that cross-lingual mastery requires extensive multilingual training: limited language coverage suffices for topic-level generalisation, and structural biases can be corrected with lightweight interventions. 
By releasing 4-bit-quantised, LoRA fine-tuned models, we provide an open-source, reproducible alternative to proprietary LLMs that delivers 35 times faster inference at just 0.00000989% of the dollar cost of the OpenAI GPT-4o model, enabling scalable, inclusive research.[40] Echoes of Automation: The Increasing Use of LLMs in Newsmaking
Abolfazl Ansari,Delvin Ce Zhang,Nafis Irtiza Tripto,Dongwon Lee
Main category: cs.CL
TL;DR: This study explores the growing use of Generative AI in journalism, finding increased reliance on AI for content creation, particularly in local and college news, which impacts linguistic diversity and formality in writing styles.
Details
Motivation: The rapid development of Generative AI, especially Large Language Models (LLMs), has raised concerns regarding journalistic integrity and authorship, prompting an investigation into its usage and impact. Method: The research analyzed over 40,000 news articles from various media sources using three AI-text detectors: Binoculars, Fast-Detect GPT, and GPTZero, with a focus on sentence-level analysis. Result: Findings indicate that LLMs are frequently used in the introduction of news articles while conclusions are typically manually written. AI usage enhances word richness and readability but reduces formality and diversity in writing styles. Conclusion: The study concludes that the use of Generative AI in journalism, particularly in local and college news, has seen a substantial increase, affecting linguistic aspects of writing and leading to more uniform styles. Abstract: The rapid rise of Generative AI (GenAI), particularly LLMs, poses concerns for journalistic integrity and authorship. This study examines AI-generated content across over 40,000 news articles from major, local, and college news media, in various media formats. Using three advanced AI-text detectors (e.g., Binoculars, Fast-Detect GPT, and GPTZero), we find a substantial increase in GenAI use in recent years, especially in local and college news. Sentence-level analysis reveals LLMs are often used in the introduction of news, while conclusions are usually written manually. Linguistic analysis shows GenAI boosts word richness and readability but lowers formality, leading to more uniform writing styles, particularly in local media.[41] SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning
Lingkun Long,Rubing Yang,Yushi Huang,Desheng Hui,Ao Zhou,Jianlei Yang
Main category: cs.CL
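TL;DR: SlimInfer significantly improves LLM inference efficiency without sacrificing performance by dynamically pruning redundant tokens and optimizing KV cache management.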
Details
Motivation: Long-context inference is constrained by computational resources, and existing methods still process the full set of hidden states at each layer, so a more efficient inference method is needed. Method: SlimInfer introduces a dynamic fine-grained pruning mechanism based on an information diffusion phenomenon, combined with an asynchronous KV cache manager that reduces memory usage and I/O overhead. Result: On LLaMA3.1-8B-Instruct, SlimInfer achieves up to 2.53x TTFT speedup and 1.88x end-to-end latency reduction while maintaining performance on LongBench. Conclusion: SlimInfer is a new inference acceleration framework that dynamically prunes redundant prompt tokens during the forward pass, reducing compute and memory demands while preserving model performance. Abstract: Long-context inference for Large Language Models (LLMs) is heavily limited by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer, limiting overall efficiency. In this work, we propose SlimInfer, an innovative framework that aims to accelerate inference by directly pruning less critical prompt tokens during the forward pass. Our key insight is an information diffusion phenomenon: As information from critical tokens propagates through layers, it becomes distributed across the entire sequence. This diffusion process suggests that LLMs can maintain their semantic integrity when excessive tokens, even including these critical ones, are pruned in hidden states. Motivated by this, SlimInfer introduces a dynamic fine-grained pruning mechanism that accurately removes redundant tokens of hidden state at intermediate layers. This layer-wise pruning naturally enables an asynchronous KV cache manager that prefetches required token blocks without complex predictors, reducing both memory usage and I/O costs. Extensive experiments show that SlimInfer can achieve up to $\mathbf{2.53\times}$ time-to-first-token (TTFT) speedup and $\mathbf{1.88\times}$ end-to-end latency reduction for LLaMA3.1-8B-Instruct on a single RTX 4090, without sacrificing performance on LongBench. Our code will be released upon acceptance.[42] GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
GLM-4.5 Team: Aohan Zeng,Xin Lv,Qinkai Zheng,Zhenyu Hou,Bin Chen,Chengxing Xie,Cunxiang Wang,Da Yin,Hao Zeng,Jiajie Zhang,Kedong Wang,Lucen Zhong,Mingdao Liu,Rui Lu,Shulin Cao,Xiaohan Zhang,Xuancheng Huang,Yao Wei,Yean Cheng,Yifan An,Yilin Niu,Yuanhao Wen,Yushi Bai,Zhengxiao Du,Zihan Wang,Zilin Zhu,Bohan Zhang,Bosi Wen,Bowen Wu,Bowen Xu,Can Huang,Casey Zhao,Changpeng Cai,Chao Yu,Chen Li,Chendi Ge,Chenghua Huang,Chenhui Zhang,Chenxi Xu,Chenzheng Zhu,Chuang Li,Congfeng Yin,Daoyan Lin,Dayong Yang,Dazhi Jiang,Ding Ai,Erle Zhu,Fei Wang,Gengzheng Pan,Guo Wang,Hailong Sun,Haitao Li,Haiyang Li,Haiyi Hu,Hanyu Zhang,Hao Peng,Hao Tai,Haoke Zhang,Haoran Wang,Haoyu Yang,He Liu,He Zhao,Hongwei Liu,Hongxi Yan,Huan Liu,Huilong Chen,Ji Li,Jiajing Zhao,Jiamin Ren,Jian Jiao,Jiani Zhao,Jianyang Yan,Jiaqi Wang,Jiayi Gui,Jiayue Zhao,Jie Liu,Jijie Li,Jing Li,Jing Lu,Jingsen Wang,Jingwei Yuan,Jingxuan Li,Jingzhao Du,Jinhua Du,Jinxin Liu,Junkai Zhi,Junli Gao,Ke Wang,Lekang Yang,Liang Xu,Lin Fan,Lindong Wu,Lintao Ding,Lu Wang,Man Zhang,Minghao Li,Minghuan Xu,Mingming Zhao,Mingshu Zhai,Pengfan Du,Qian Dong,Shangde Lei,Shangqing Tu,Shangtong Yang,Shaoyou Lu,Shijie Li,Shuang Li,Shuang-Li,Shuxun Yang,Sibo Yi,Tianshu Yu,Wei Tian,Weihan Wang,Wenbo Yu,Weng Lam Tam,Wenjie Liang,Wentao Liu,Xiao Wang,Xiaohan Jia,Xiaotao Gu,Xiaoying Ling,Xin Wang,Xing Fan,Xingru Pan,Xinyuan Zhang,Xinze Zhang,Xiuqing Fu,Xunkai Zhang,Yabo Xu,Yandong Wu,Yida Lu,Yidong Wang,Yilin Zhou,Yiming Pan,Ying Zhang,Yingli Wang,Yingru Li,Yinpei Su,Yipeng Geng,Yitong Zhu,Yongkun Yang,Yuhang Li,Yuhao Wu,Yujiang Li,Yunan Liu,Yunqing Wang,Yuntao Li,Yuxuan Zhang,Zezhen Liu,Zhen Yang,Zhengda Zhou,Zhongpei Qiao,Zhuoer Feng,Zhuorui Liu,Zichen Zhang,Zihan Wang,Zijun Yao,Zikang Wang,Ziqiang Liu,Ziwei Chai,Zixuan Li,Zuodong Zhao,Wenguang Chen,Jidong Zhai,Bin Xu,Minlie Huang,Hongning Wang,Juanzi Li,Yuxiao Dong,Jie Tang
Main category: cs.CL
TL;DR: GLM-4.5 is an open-source Mixture-of-Experts language model with 355B total parameters (32B activated) that performs strongly on agentic, reasoning, and coding tasks; a compact 106B version, GLM-4.5-Air, is released alongside it.
Details
Motivation: To develop a powerful, open-source language model that excels in agentic, reasoning, and coding tasks while maintaining efficiency through a compact version. Method: The model uses a Mixture-of-Experts (MoE) architecture with 355B total parameters and 32B activated parameters, incorporating a hybrid reasoning method and undergoing multi-stage training on 23T tokens with expert model iteration and reinforcement learning. Result: GLM-4.5 achieved 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified, ranking 3rd overall among all evaluated models and 2nd on agentic benchmarks while using far fewer parameters than several competitors. Conclusion: GLM-4.5 is a high-performing, open-source Mixture-of-Experts large language model that advances research in reasoning and agentic AI systems, offering both a full and compact version. Abstract: We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With much fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks. We release both GLM-4.5 (355B parameters) and a compact version, GLM-4.5-Air (106B parameters), to advance research in reasoning and agentic AI systems. Code, models, and more information are available at https://github.com/zai-org/GLM-4.5.
[43] HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning
Guimin Hu,Daniel Hershcovich,Hasti Seifi
Main category: cs.CL
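TL;DR: This paper proposes HapticLLaMA, a model that converts haptic vibration signals into natural language descriptions; it performs well on both automated metrics and human evaluation, highlighting the potential of large language models for processing sensory data.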
Details
Motivation: Haptic signals matter for virtual reality, accessibility, and rehabilitation applications, yet they remain underexplored compared with visual and audio signals. The work aims to fill this gap by formalizing the haptic captioning task and exploring the potential of haptic signals. Method: Proposes HapticLLaMA, which uses two types of haptic tokenizers (a frequency-based tokenizer and an EnCodec-based tokenizer) to convert haptic signals into sequences of discrete units, and trains on the LLaMA architecture with supervised fine-tuning followed by reinforcement learning from human feedback. Result: HapticLLaMA performs well at captioning haptic vibration signals, with a METEOR score of 59.98 and a BLEU-4 score of 32.06. Over 61% of the generated captions received human ratings above 3.5 on a 7-point scale, and RLHF yielded a 10% improvement in the overall rating distribution. Conclusion: HapticLLaMA shows the potential of large language models to process and adapt to sensory data, providing a foundation for future research, particularly in virtual reality, accessibility, and rehabilitation applications. Abstract: Haptic captioning is the task of generating natural language descriptions from haptic signals, such as vibrations, for use in virtual reality, accessibility, and rehabilitation applications. While previous multimodal research has focused primarily on vision and audio, haptic signals for the sense of touch remain underexplored. To address this gap, we formalize the haptic captioning task and propose HapticLLaMA, a multimodal sensory language model that interprets vibration signals into descriptions in a given sensory, emotional, or associative category. We investigate two types of haptic tokenizers, a frequency-based tokenizer and an EnCodec-based tokenizer, that convert haptic signals into sequences of discrete units, enabling their integration with the LLaMA model. HapticLLaMA is trained in two stages: (1) supervised fine-tuning using the LLaMA architecture with LoRA-based adaptation, and (2) fine-tuning via reinforcement learning from human feedback (RLHF). We assess HapticLLaMA's captioning performance using both automated n-gram metrics and human evaluation. HapticLLaMA demonstrates strong capability in interpreting haptic vibration signals, achieving a METEOR score of 59.98 and a BLEU-4 score of 32.06 respectively. Additionally, over 61% of the generated captions received human ratings above 3.5 on a 7-point scale, with RLHF yielding a 10% improvement in the overall rating distribution, indicating stronger alignment with human haptic perception. These findings highlight the potential of large language models to process and adapt to sensory data.
[44] Post-training for Efficient Communication via Convention Formation
Yilun Hua,Evan Wang,Yoav Artzi
Main category: cs.CL
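TL;DR: This paper introduces a post-training process that uses targeted fine-tuning to significantly improve LLMs' ability to form ad-hoc conventions in multi-turn interactions, and validates the effect on two new benchmarks.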
Details
Motivation: Humans improve communication efficiency in multi-turn interactions by adapting their language and forming ad-hoc conventions; LLMs do not naturally have this ability. Method: Develops a post-training process based on targeted fine-tuning on heuristically identified demonstrations of convention formation. Result: On two new benchmarks, post-trained LLMs show significantly improved convention formation abilities. Conclusion: Post-trained LLMs show significantly improved convention formation abilities. Abstract: Humans communicate with increasing efficiency in multi-turn interactions, by adapting their language and forming ad-hoc conventions. In contrast, prior work shows that LLMs do not naturally show this behavior. We develop a post-training process to develop this ability through targeted fine-tuning on heuristically identified demonstrations of convention formation. We evaluate with two new benchmarks focused on this capability. First, we design a focused, cognitively-motivated interaction benchmark that consistently elicits strong convention formation trends in humans. Second, we create a new document-grounded reference completion task that reflects in-the-wild convention formation behavior. Our studies show significantly improved convention formation abilities in post-trained LLMs across the two evaluation methods.
cs.CV [Back]
[45] Boosting Adversarial Transferability via Residual Perturbation Attack
Jinjia Peng,Zeze Tao,Huibing Wang,Meng Wang,Yang Wang
Main category: cs.CV
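TL;DR: This paper proposes a new attack method named ResPA, which uses the residual gradient as the perturbation direction to generate adversarial examples with better transferability.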
Details
Motivation: Adversarial examples in flat loss landscapes show better transferability, but existing methods overlook the influence of the perturbation direction, resulting in limited transferability. Method: ResPA applies an exponential moving average to the input gradients to obtain the first moment as a reference gradient, and considers the residual between the current gradient and the reference gradient to capture changes in the global perturbation direction. Result: Experiments show that ResPA achieves better transferability than existing typical transfer-based attack methods, and that transferability improves further when ResPA is combined with current input transformation methods. Conclusion: ResPA effectively generates more transferable adversarial examples, offering a new approach for research on the security of deep neural networks. Abstract: Deep neural networks are susceptible to adversarial examples while suffering from incorrect predictions via imperceptible perturbations. Transfer-based attacks create adversarial examples for surrogate models and transfer these examples to target models under black-box scenarios. Recent studies reveal that adversarial examples in flat loss landscapes exhibit superior transferability to alleviate overfitting on surrogate models. However, the prior arts overlook the influence of perturbation directions, resulting in limited transferability. In this paper, we propose a novel attack method, named Residual Perturbation Attack (ResPA), relying on the residual gradient as the perturbation direction to guide the adversarial examples toward the flat regions of the loss function. Specifically, ResPA conducts an exponential moving average on the input gradients to obtain the first moment as the reference gradient, which encompasses the direction of historical gradients. Instead of heavily relying on the local flatness that stems from the current gradients as the perturbation direction, ResPA further considers the residual between the current gradient and the reference gradient to capture the changes in the global perturbation direction. The experimental results demonstrate the better transferability of ResPA than the existing typical transfer-based attack methods, while the transferability can be further improved by combining ResPA with the current input transformation methods. The code is available at https://github.com/ZezeTao/ResPA.
[46] Generalized Few-Shot Out-of-Distribution Detection
Pinxuan Li,Bing Cao,Changqing Zhang,Qinghua Hu
Main category: cs.CV
TL;DR: This paper proposes the GOOD framework for Generalized Few-shot OOD Detection, using a General Knowledge Model and a dynamic embedding mechanism to enhance generalization and balance specificity with generality.
Details
Motivation: Existing few-shot OOD detection methods struggle with generalization due to overfitting on limited training data, prompting the need for a more robust and adaptable framework. Method: The study introduces a Generality-Specificity balance (GS-balance) and a Knowledge Dynamic Embedding (KDE) mechanism that dynamically aligns the OOD detection model's outputs with a General Knowledge Model (GKM). Result: The proposed method demonstrates superior performance on real-world OOD benchmarks, validating the effectiveness of the GOOD framework in improving generalization. Conclusion: The proposed GOOD framework enhances few-shot OOD detection by incorporating a general knowledge model, thereby improving generalization and performance consistency across scenarios. Abstract: Few-shot Out-of-Distribution (OOD) detection has emerged as a critical research direction in machine learning for practical deployment. Most existing Few-shot OOD detection methods suffer from insufficient generalization capability for the open world. Due to the few-shot learning paradigm, the OOD detection ability is often overfit to the limited training data itself, thus degrading the performance on generalized data and performing inconsistently across different scenarios. To address this challenge, we proposed a Generalized Few-shot OOD Detection (GOOD) framework, which empowers the general knowledge of the OOD detection model with an auxiliary General Knowledge Model (GKM), instead of directly learning from few-shot data. We proceed to reveal the few-shot OOD detection from a generalization perspective and theoretically derive the Generality-Specificity balance (GS-balance) for OOD detection, which provably reduces the upper bound of generalization error with a general knowledge model. Accordingly, we propose a Knowledge Dynamic Embedding (KDE) mechanism to adaptively modulate the guidance of general knowledge. 
KDE dynamically aligns the output distributions of the OOD detection model to the general knowledge model based on the Generalized Belief (G-Belief) of GKM, thereby boosting the GS-balance. Experiments on real-world OOD benchmarks demonstrate our superiority. Codes will be available.
[47] UnGuide: Learning to Forget with LoRA-Guided Diffusion Models
Agnieszka Polowczyk,Alicja Polowczyk,Dawid Malarz,Artur Kasymov,Marcin Mazur,Jacek Tabor,Przemysław Spurek
Main category: cs.CV
TL;DR: UnGuide is a novel approach for machine unlearning in text-to-image diffusion models, utilizing UnGuidance to selectively remove specific concepts without compromising overall performance, effectively outperforming LoRA-based methods.
Details
Motivation: Concerns about the misuse of large-scale text-to-image diffusion models necessitate the development of effective machine unlearning techniques that can remove specific harmful or misleading concepts without compromising overall model performance. Method: UnGuide utilizes a dynamic inference mechanism called UnGuidance, which leverages Classifier-Free Guidance (CFG) to modulate the guidance scale based on the stability of the initial denoising steps, enabling selective unlearning via a LoRA adapter. Result: UnGuide successfully enables targeted unlearning by the LoRA module for prompts containing the erased concept, while preserving content fidelity for unrelated prompts. It outperforms existing LoRA-based methods in both object erasure and explicit content removal tasks. Conclusion: UnGuide is effective in achieving controlled concept removal while preserving the expressive power of diffusion models, surpassing current LoRA-based methods in object and explicit content removal tasks. Abstract: Recent advances in large-scale text-to-image diffusion models have heightened concerns about their potential misuse, especially in generating harmful or misleading content. This underscores the urgent need for effective machine unlearning, i.e., removing specific knowledge or concepts from pretrained models without compromising overall performance. One possible approach is Low-Rank Adaptation (LoRA), which offers an efficient means to fine-tune models for targeted unlearning. However, LoRA often inadvertently alters unrelated content, leading to diminished image fidelity and realism. To address this limitation, we introduce UnGuide -- a novel approach which incorporates UnGuidance, a dynamic inference mechanism that leverages Classifier-Free Guidance (CFG) to exert precise control over the unlearning process. UnGuide modulates the guidance scale based on the stability of a few first steps of denoising processes, enabling selective unlearning by LoRA adapter. 
For prompts containing the erased concept, the LoRA module predominates and is counterbalanced by the base model; for unrelated prompts, the base model governs generation, preserving content fidelity. Empirical results demonstrate that UnGuide achieves controlled concept removal and retains the expressive power of diffusion models, outperforming existing LoRA-based methods in both object erasure and explicit content removal tasks.
[48] Improving Masked Style Transfer using Blended Partial Convolution
Seyed Hadi Seyed,Ayberk Cansever,David Hart
Main category: cs.CV
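TL;DR: This paper proposes a new partial-convolution-based style transfer method that applies style features precisely to specific regions of an image and improves transfer quality through network-internal blending techniques.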
Details
Motivation: Existing style transfer algorithms usually apply the style to the whole image, but users may only need to stylize a specific region. This paper therefore proposes a more precise solution. Method: Uses a partial-convolution-based style transfer network and proposes network-internal blending techniques to improve the stylization. Result: Examples from the SA-1B dataset show that the method improves stylization both visually and quantitatively. Conclusion: The paper proposes a partial-convolution-based style transfer network that accurately applies style features only to the region of interest, and uses network-internal blending techniques to account for imperfections in the region selection. Abstract: Artistic style transfer has long been possible with the advancements of convolution- and transformer-based neural networks. Most algorithms apply the artistic style transfer to the whole image, but individual users may only need to apply a style transfer to a specific region in the image. The standard practice is to simply mask the image after the stylization. This work shows that this approach tends to improperly capture the style features in the region of interest. We propose a partial-convolution-based style transfer network that accurately applies the style features exclusively to the region of interest. Additionally, we present network-internal blending techniques that account for imperfections in the region selection. We show that this visually and quantitatively improves stylization using examples from the SA-1B dataset. Code is publicly available at https://github.com/davidmhart/StyleTransferMasked.
[49] MAISI-v2: Accelerated 3D High-Resolution Medical Image Synthesis with Rectified Flow and Region-specific Contrastive Loss
Can Zhao,Pengfei Guo,Dong Yang,Yucheng Tang,Yufan He,Benjamin Simon,Mason Belue,Stephanie Harmon,Baris Turkbey,Daguang Xu
Main category: cs.CV
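TL;DR: MAISI-v2 is an accelerated 3D medical image synthesis framework that integrates rectified flow for fast, high-quality generation and introduces a novel region-specific contrastive loss to enhance sensitivity to the region of interest.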
Details
Motivation: Existing medical image synthesis methods are limited in generalizability, inference speed, and consistency with input conditions. Diffusion models in particular tend to be slow at inference and align weakly with input conditions in medical imaging. Method: MAISI-v2 uses rectified flow to accelerate generation and a region-specific contrastive loss to improve the consistency of generated images with input conditions. Result: Experiments show that MAISI-v2 achieves a 33× acceleration over the latent diffusion model while maintaining state-of-the-art image quality. A downstream segmentation experiment further shows that the synthetic images can be used for data augmentation. Conclusion: MAISI-v2 is presented as the first accelerated 3D medical image synthesis framework; it improves generation speed and strengthens alignment with input conditions, and shows promise for practical use. Abstract: Medical image synthesis is an important topic for both clinical and research applications. Recently, diffusion models have become a leading approach in this area. Despite their strengths, many existing methods struggle with (1) limited generalizability that only work for specific body regions or voxel spacings, (2) slow inference, which is a common issue for diffusion models, and (3) weak alignment with input conditions, which is a critical issue for medical imaging. MAISI, a previously proposed framework, addresses generalizability issues but still suffers from slow inference and limited condition consistency. In this work, we present MAISI-v2, the first accelerated 3D medical image synthesis framework that integrates rectified flow to enable fast and high quality generation. To further enhance condition fidelity, we introduce a novel region-specific contrastive loss to enhance the sensitivity to region of interest. Our experiments show that MAISI-v2 can achieve SOTA image quality with $33 \times$ acceleration for latent diffusion model. We also conducted a downstream segmentation experiment to show that the synthetic images can be used for data augmentation. We release our code, training details, model weights, and a GUI demo to facilitate reproducibility and promote further development within the community.
[50] Few-Shot Deployment of Pretrained MRI Transformers in Brain Imaging Tasks
Mengyu Li,Guoyao Shen,Chad W. Farris,Xin Zhang
Main category: cs.CV
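TL;DR: This paper proposes a Masked Autoencoder-based framework that uses pretrained MRI transformers to address the scarcity of annotated data in medical imaging, enabling efficient classification and segmentation in low-resource clinical environments.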
Details
Motivation: Because annotated data is scarce, the real-world applicability of transformer-based machine learning in medical imaging remains limited. The study aims to provide a few-shot deployment framework for diverse brain imaging tasks. Method: Uses the Masked Autoencoder (MAE) pretraining strategy on a large-scale, multi-cohort brain MRI dataset of over 31 million slices to obtain highly transferable latent representations. For high-level tasks, a frozen MAE encoder is combined with a lightweight linear head; for low-level tasks such as segmentation, the MAE-FUnet hybrid architecture fuses multiscale CNN features with pretrained MAE embeddings. Result: For high-level tasks such as classification, the method achieves state-of-the-art accuracy in MRI sequence identification with minimal supervision. For low-level tasks such as segmentation, MAE-FUnet consistently outperforms other strong baselines under data-limited conditions. Conclusion: The study proposes a practical framework that uses pretrained MRI transformers to address the scarcity of annotated data in medical imaging, and suggests it is suitable for low-resource clinical environments and broader neuroimaging applications. Abstract: Machine learning using transformers has shown great potential in medical imaging, but its real-world applicability remains limited due to the scarcity of annotated data. In this study, we propose a practical framework for the few-shot deployment of pretrained MRI transformers in diverse brain imaging tasks. By utilizing the Masked Autoencoder (MAE) pretraining strategy on a large-scale, multi-cohort brain MRI dataset comprising over 31 million slices, we obtain highly transferable latent representations that generalize well across tasks and datasets. For high-level tasks such as classification, a frozen MAE encoder combined with a lightweight linear head achieves state-of-the-art accuracy in MRI sequence identification with minimal supervision. For low-level tasks such as segmentation, we propose MAE-FUnet, a hybrid architecture that fuses multiscale CNN features with pretrained MAE embeddings. This model consistently outperforms other strong baselines in both skull stripping and multi-class anatomical segmentation under data-limited conditions. With extensive quantitative and qualitative evaluations, our framework demonstrates efficiency, stability, and scalability, suggesting its suitability for low-resource clinical environments and broader neuroimaging applications.
[51] Optimization-Free Style Transfer for 3D Gaussian Splats
Raphael Du Sablon,David Hart
Main category: cs.CV
TL;DR: The paper proposes a fast and efficient style transfer method for 3D Gaussian splats that requires no additional training or optimization.
Details
Motivation: The motivation is to simplify the style transfer process for 3D Gaussian splats by eliminating the need for reconstruction, fine-tuning, or optimization of the splat or feature extraction network. Method: The method involves generating a graph structure across the implicit surface of the splat representation and using a feed-forward, surface-based stylization technique which is then interpolated back to the individual splats. Result: The result is a fast stylization method that works with any style image and 3D Gaussian splat, achieving speeds under 2 minutes on consumer-grade hardware. Conclusion: The paper concludes that their proposed method for style transfer in 3D Gaussian splats is efficient and effective, achieving fast stylization without the need for additional training or optimization. Abstract: The task of style transfer for 3D Gaussian splats has been explored in many previous works, but these require reconstructing or fine-tuning the splat while incorporating style information or optimizing a feature extraction network on the splat representation. We propose a reconstruction- and optimization-free approach to stylizing 3D Gaussian splats. This is done by generating a graph structure across the implicit surface of the splat representation. A feed-forward, surface-based stylization method is then used and interpolated back to the individual splats in the scene. This allows for any style image and 3D Gaussian splat to be used without any additional training or optimization. This also allows for fast stylization of splats, achieving speeds under 2 minutes even on consumer-grade hardware. We demonstrate the quality results this approach achieves and compare to other 3D Gaussian splat style transfer methods. Code is publicly available at https://github.com/davidmhart/FastSplatStyler.
[52] MZEN: Multi-Zoom Enhanced NeRF for 3-D Reconstruction with Unknown Camera Poses
Jong-Ik Park,Carlee Joe-Wong,Gary K. Fedder
Main category: cs.CV
TL;DR: Multi-Zoom Enhanced NeRF (MZEN) is presented as the first NeRF framework to natively handle multi-zoom image sets, improving 3D reconstruction for industrial inspection by preserving global accuracy while capturing fine details.
Details
Motivation: NeRF methods struggle with capturing fine-detailed structures crucial in industrial inspection, such as sub-micron defects, due to fixed sensor resolution and tight compute budgets. Method: MZEN augments the pin-hole camera model with a learnable zoom scalar and introduces a novel pose strategy involving wide-field and zoom-in images for consistent multi-zoom handling. Result: Across eight forward-facing scenes, MZEN outperforms pose-free baselines and high-resolution variants by improving PSNR up to 28%, SSIM by 10%, and reducing LPIPS by up to 222%. Conclusion: The proposed MZEN framework extends NeRF to real-world factory settings, preserving global accuracy while capturing micron-level details for industrial inspection. Abstract: Neural Radiance Fields (NeRF) methods excel at 3D reconstruction from multiple 2D images, even those taken with unknown camera poses. However, they still miss the fine-detailed structures that matter in industrial inspection, e.g., detecting sub-micron defects on a production line or analyzing chips with Scanning Electron Microscopy (SEM). In these scenarios, the sensor resolution is fixed and compute budgets are tight, so the only way to expose fine structure is to add zoom-in images; yet, this breaks the multi-view consistency that pose-free NeRF training relies on. We propose Multi-Zoom Enhanced NeRF (MZEN), the first NeRF framework that natively handles multi-zoom image sets. MZEN (i) augments the pin-hole camera model with an explicit, learnable zoom scalar that scales the focal length, and (ii) introduces a novel pose strategy: wide-field images are solved first to establish a global metric frame, and zoom-in images are then pose-primed to the nearest wide-field counterpart via a zoom-consistent crop-and-match procedure before joint refinement. 
Across eight forward-facing scenes (synthetic TCAD models, real SEM of micro-structures, and BLEFF objects), MZEN consistently outperforms pose-free baselines and even high-resolution variants, boosting PSNR by up to $28 \%$, SSIM by $10 \%$, and reducing LPIPS by up to $222 \%$. MZEN, therefore, extends NeRF to real-world factory settings, preserving global accuracy while capturing the micron-level details essential for industrial inspection.
[53] TSMS-SAM2: Multi-scale Temporal Sampling Augmentation and Memory-Splitting Pruning for Promptable Video Object Segmentation and Tracking in Surgical Scenarios
Guoping Xu,Hua-Chieh Shao,You Zhang
Main category: cs.CV
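TL;DR: TSMS-SAM2 uses multi-temporal-scale video sampling augmentation and a memory splitting and pruning mechanism to address rapid object motion and memory redundancy in surgical video analysis, achieving efficient and accurate video object segmentation and tracking.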
Details
Motivation: Complex motion dynamics and memory redundancy in surgical video analysis hinder the effective application of foundation models such as SAM2. Method: The TSMS-SAM2 framework introduces two key strategies: multi-temporal-scale video sampling augmentation to improve robustness against motion variability, and a memory splitting and pruning mechanism for more efficient and accurate segmentation. Result: Evaluated on the EndoVis2017 and EndoVis2018 datasets, TSMS-SAM2 achieves mean Dice scores of 95.24 and 86.73, respectively, outperforming prior SAM-based and task-specific methods. Conclusion: TSMS-SAM2 addresses rapid object motion and memory redundancy in complex surgical video scenes, showing potential for robust and efficient segmentation. Abstract: Promptable video object segmentation and tracking (VOST) has seen significant advances with the emergence of foundation models like Segment Anything Model 2 (SAM2); however, their application in surgical video analysis remains challenging due to complex motion dynamics and the redundancy of memory that impedes effective learning. In this work, we propose TSMS-SAM2, a novel framework that enhances promptable VOST in surgical videos by addressing challenges of rapid object motion and memory redundancy in SAM2. TSMS-SAM2 introduces two key strategies: multi-temporal-scale video sampling augmentation to improve robustness against motion variability, and a memory splitting and pruning mechanism that organizes and filters past frame features for more efficient and accurate segmentation. Evaluated on EndoVis2017 and EndoVis2018 datasets, TSMS-SAM2 achieved the highest mean Dice scores of 95.24 and 86.73, respectively, outperforming prior SAM-based and task-specific methods. Extensive ablation studies confirm the effectiveness of multiscale temporal augmentation and memory splitting, highlighting the framework's potential for robust, efficient segmentation in complex surgical scenarios. Our source code will be available at https://github.com/apple1986/TSMS-SAM2.
[54] Temporal Cluster Assignment for Efficient Real-Time Video Segmentation
Ka-Wai Yung,Felix J. S. Bragman,Jialang Xu,Imanol Luengo,Danail Stoyanov,Evangelos B. Mazomenos
Main category: cs.CV
TL;DR: This paper introduces Temporal Cluster Assignment (TCA), a training-free method that improves video segmentation efficiency by leveraging temporal coherence to refine token clustering, reducing computational cost while maintaining accuracy.
Details
Motivation: Despite the effectiveness of Swin Transformers in capturing hierarchical representations for video segmentation, their window-attention scheme incurs high computational costs, especially for larger variants. Existing token reduction methods are limited due to the fixed token-per-window requirement and the failure to exploit temporal redundancy in video data. Method: TCA utilizes temporal correlations to refine token clusters instead of indiscriminately dropping redundant tokens, ensuring fine-grained detail retention while reducing computation. Result: Extensive evaluations on YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and a private surgical video dataset demonstrate that TCA consistently improves the accuracy-speed trade-off of clustering-based methods and generalizes well across both natural and domain-specific videos. Conclusion: Temporal Cluster Assignment (TCA) is a lightweight and effective strategy that enhances token clustering by leveraging temporal coherence across frames, improving the accuracy-speed trade-off for video segmentation. Abstract: Vision Transformers have substantially advanced the capabilities of segmentation models across both image and video domains. Among them, the Swin Transformer stands out for its ability to capture hierarchical, multi-scale representations, making it a popular backbone for segmentation in videos. However, despite its window-attention scheme, it still incurs a high computational cost, especially in larger variants commonly used for dense prediction in videos. This remains a major bottleneck for real-time, resource-constrained applications. Whilst token reduction methods have been proposed to alleviate this, the window-based attention mechanism of Swin requires a fixed number of tokens per window, limiting the applicability of conventional pruning techniques. Meanwhile, training-free token clustering approaches have shown promise in image segmentation while maintaining window consistency. 
Nevertheless, they fail to exploit temporal redundancy, missing a key opportunity to further optimize video segmentation performance. We introduce Temporal Cluster Assignment (TCA), a lightweight and effective, fine-tuning-free strategy that enhances token clustering by leveraging temporal coherence across frames. Instead of indiscriminately dropping redundant tokens, TCA refines token clusters using temporal correlations, thereby retaining fine-grained details while significantly reducing computation. Extensive evaluations on YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and a private surgical video dataset show that TCA consistently boosts the accuracy-speed trade-off of existing clustering-based methods. Our results demonstrate that TCA generalizes competently across both natural and domain-specific videos.
[55] VISTA: Vision-Language Imitation of Situational Thinking and Attention for Human-Like Driver Focus in Dynamic Environments
Kaiser Hamid,Khandakar Ashrafi Akbar,Nade Liang
Main category: cs.CV
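TL;DR: This paper proposes a new method that predicts driver visual attention through natural language, combining low-level cues with high-level context; it improves attention shift detection and provides a foundation for explainable AI in autonomous driving.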
Details
Motivation: Most prior studies focus on estimating attention allocation at a single moment in time, typically using static RGB images, and overlook context and dynamic changes. Method: Models changes in drivers' gaze through natural language using few-shot and zero-shot learning on single RGB images, fine-tuning LLaVA to align visual perception with attention-centric scene understanding. Result: The method outperforms general-purpose vision-language models in attention shift detection and interpretability, and introduces domain-specific metrics for semantic alignment and response diversity. Conclusion: The paper proposes a vision-language framework for predicting driver visual attention allocation and shifting, offering a new direction for explainable AI in autonomous driving. Abstract: Driver visual attention prediction is a critical task in autonomous driving and human-computer interaction (HCI) research. Most prior studies focus on estimating attention allocation at a single moment in time, typically using static RGB images such as driving scene pictures. In this work, we propose a vision-language framework that models the changing landscape of drivers' gaze through natural language, using few-shot and zero-shot learning on single RGB images. We curate and refine high-quality captions from the BDD-A dataset using human-in-the-loop feedback, then fine-tune LLaVA to align visual perception with attention-centric scene understanding. Our approach integrates both low-level cues and top-down context (e.g., route semantics, risk anticipation), enabling language-based descriptions of gaze behavior. We evaluate performance across training regimes (few shot, and one-shot) and introduce domain-specific metrics for semantic alignment and response diversity. Results show that our fine-tuned model outperforms general-purpose VLMs in attention shift detection and interpretability. To our knowledge, this is among the first attempts to generate driver visual attention allocation and shifting predictions in natural language, offering a new direction for explainable AI in autonomous driving. Our approach provides a foundation for downstream tasks such as behavior forecasting, human-AI teaming, and multi-agent coordination.
[56] Multi-view Gaze Target Estimation
Qiaomu Miao,Vivek Raju Golani,Jingyi Xu,Progga Paromita Dutta,Minh Hoai,Dimitris Samaras
Main category: cs.CV
TL;DR: The paper introduces a multi-view method for gaze target estimation that overcomes the limitations of single-view methods by integrating information from multiple camera views.
Details
Motivation: The motivation is to address the limitations of existing single-view methods for gaze target estimation, including issues like face occlusion, target ambiguity, and out-of-view targets. Method: The method involves a Head Information Aggregation (HIA) module, an Uncertainty-based Gaze Selection (UGS), and an Epipolar-based Scene Attention (ESA) module to integrate information from multiple camera views. Result: The approach significantly outperforms single-view baselines, particularly when the second camera provides a clear view of the person's face. It also allows for gaze target estimation using only the second view's image. Conclusion: The paper concludes that the proposed multi-view method for gaze target estimation outperforms single-view baselines and enables new capabilities, such as estimating gaze from one view using only the image from another view. Abstract: This paper presents a method that utilizes multiple camera views for the gaze target estimation (GTE) task. The approach integrates information from different camera views to improve accuracy and expand applicability, addressing limitations in existing single-view methods that face challenges such as face occlusion, target ambiguity, and out-of-view targets. Our method processes a pair of camera views as input, incorporating a Head Information Aggregation (HIA) module for leveraging head information from both views for more accurate gaze estimation, an Uncertainty-based Gaze Selection (UGS) for identifying the most reliable gaze output, and an Epipolar-based Scene Attention (ESA) module for cross-view background information sharing. This approach significantly outperforms single-view baselines, especially when the second camera provides a clear view of the person's face. Additionally, our method can estimate the gaze target in the first view using the image of the person in the second view only, a capability not possessed by single-view GTE methods. 
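The abstract names an Epipolar-based Scene Attention (ESA) module but does not describe its internals here. As a rough illustration of the geometric constraint that any epipolar-based cross-view module can build on (not the authors' implementation), the sketch below maps a pixel in one view to its epipolar line in the other view and measures how far candidate pixels lie from that line. The fundamental matrix `F_RECTIFIED` and the helper names are assumptions chosen for illustration; a real camera pair would need a calibrated or estimated fundamental matrix.

```python
import numpy as np

def epipolar_line(fundamental, point_xy):
    """Return the line (a, b, c) in view 2 on which the match of point_xy must lie."""
    x = np.array([point_xy[0], point_xy[1], 1.0])  # homogeneous pixel in view 1
    return fundamental @ x

def point_line_distance(line, point_xy):
    """Perpendicular pixel distance from point_xy to the line a*x + b*y + c = 0."""
    a, b, c = line
    return abs(a * point_xy[0] + b * point_xy[1] + c) / np.hypot(a, b)

# Assumed toy geometry: identity intrinsics and a pure sideways camera shift,
# for which the fundamental matrix is the cross-product matrix of (1, 0, 0).
F_RECTIFIED = np.array([[0.0, 0.0, 0.0],
                        [0.0, 0.0, -1.0],
                        [0.0, 1.0, 0.0]])

line = epipolar_line(F_RECTIFIED, (120.0, 80.0))
on_line = point_line_distance(line, (300.0, 80.0))   # same image row
off_line = point_line_distance(line, (300.0, 95.0))  # 15 rows away
print(line, on_line, off_line)
```

In this toy setup the epipolar line is simply the image row of the query pixel, so distance to the line could serve as a weight that favors scene regions in the second view that are geometrically consistent with the first.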
Furthermore, the paper introduces a multi-view dataset for developing and evaluating multi-view GTE methods. Data and code are available at https://www3.cs.stonybrook.edu/~cvl/multiview_gte.html
[57] ETTA: Efficient Test-Time Adaptation for Vision-Language Models through Dynamic Embedding Updates
Hamidreza Dastmalchi,Aijun An,Ali Cheraghian
Main category: cs.CV
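TL;DR: ETTA uses a Recursive Updating module and an Adaptive Ensemble module to improve both the accuracy and the computational efficiency of pretrained vision-language models in test-time adaptation.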
Details
Motivation: Pretrained vision-language models (VLMs) perform well on zero-shot tasks but generalize poorly under distribution shift. Existing cache-based test-time adaptation methods store only high-confidence samples, ignoring the influence of other test data and limiting refinement of the decision boundary. Method: ETTA introduces a Recursive Updating module that uses all incoming test samples to progressively refine the decision boundary, and an Adaptive Ensemble module that dynamically selects optimal prompts. ETTA also combines the outputs of the two modules according to confidence levels to exploit their complementary strengths. Result: Experiments on two benchmarks show that ETTA outperforms state-of-the-art test-time adaptation models in both computational complexity and accuracy, setting a new standard for efficient adaptation. Conclusion: ETTA provides an efficient test-time adaptation strategy that significantly improves the generalization of vision-language models in new domains. Abstract: Pretrained vision-language models (VLMs) like CLIP show strong zero-shot performance but struggle with generalization under distribution shifts. Test-Time Adaptation (TTA) addresses this by adapting VLMs to unlabeled test data in new domains. While some TTA methods rely on prompt-tuning, training-free cache-based approaches are preferred for efficiency. However, current cache-based TTA models store only a limited set of high-confidence samples, restricting the decision boundary to these samples and ignoring the influence of other incoming test data. To address this, we propose Efficient Test-Time Adaptation (ETTA), introducing a Recursive Updating module that integrates all incoming test samples, progressively refining the decision boundary. This strategy mimics an unbounded cache, dynamically updating contextual embeddings for improved accuracy with minimal memory and computational overhead. ETTA also includes an Adaptive Ensemble module to reduce prompt dependency in image-to-text scores by dynamically selecting optimal prompts for each class. Furthermore, ETTA adaptively combines scores from both modules based on confidence levels, leveraging their complementary strengths. Extensive experiments on two benchmarks confirm that ETTA surpasses the state-of-the-art TTA models in computational complexity and accuracy, setting a new standard for effective, efficient test-time adaptation. The code has been released at https://github.com/hamidreza-dastmalchi/ETTA.
[58] HOLODECK 2.0: Vision-Language-Guided 3D World Generation with Editing
Zixuan Bian,Ruohan Ren,Yue Yang,Chris Callison-Burch
Main category: cs.CV
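TL;DR: HOLODECK 2.0 is an advanced vision-language-guided framework for 3D world generation that supports interactive editing and generates high-quality, semantically consistent 3D scenes from detailed text descriptions.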
Details
Motivation: 3D scene design still requires extensive manual effort, and existing automated methods struggle to generate open-domain scenes or support flexible editing, so a tool that generates 3D worlds directly from text is needed. Method: HOLODECK 2.0 uses vision-language models (VLMs) to parse the objects a scene requires and creates high-quality assets with state-of-the-art 3D generative models. It iteratively applies spatial constraints derived from the VLMs to achieve semantically coherent and physically plausible layouts. Result: HOLODECK 2.0 generates diverse, stylistically rich 3D scenes for both indoor and open-domain environments, and outperforms baselines in human and CLIP-based evaluations. Conclusion: HOLODECK 2.0 provides an efficient framework for 3D world generation that supports interactive scene editing based on human feedback and performs strongly in generation quality and application potential. Abstract: 3D scene generation plays a crucial role in gaming, artistic creation, virtual reality and many other domains. However, current 3D scene design still relies heavily on extensive manual effort from creators, and existing automated methods struggle to generate open-domain scenes or support flexible editing. As a result, generating 3D worlds directly from text has garnered increasing attention. In this paper, we introduce HOLODECK 2.0, an advanced vision-language-guided framework for 3D world generation with support for interactive scene editing based on human feedback. HOLODECK 2.0 can generate diverse and stylistically rich 3D scenes (e.g., realistic, cartoon, anime, and cyberpunk styles) that exhibit high semantic fidelity to fine-grained input descriptions, suitable for both indoor and open-domain environments. HOLODECK 2.0 leverages vision-language models (VLMs) to identify and parse the objects required in a scene and generates corresponding high-quality assets via state-of-the-art 3D generative models. It then iteratively applies spatial constraints derived from the VLMs to achieve semantically coherent and physically plausible layouts. Human evaluations and CLIP-based assessments demonstrate that HOLODECK 2.0 effectively generates high-quality scenes closely aligned with detailed textual descriptions, consistently outperforming baselines across indoor and open-domain scenarios. Additionally, we provide editing capabilities that flexibly adapt to human feedback, supporting layout refinement and style-consistent object edits. 
Finally, we present a practical application of HOLODECK 2.0 in procedural game modeling, generating visually rich and immersive environments, potentially boosting efficiency.
[59] Robust Image Stitching with Optimal Plane
Lang Nie,Yuan Mei,Kang Liao,Yunqiu Xu,Chunyu Lin,Bin Xiao
Main category: cs.CV
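TL;DR: RopStitch is a new unsupervised deep image stitching framework that uses a dual-branch architecture and the concept of virtual optimal planes to improve scene robustness while preserving content naturalness.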
Details
Motivation: Image stitching must improve scene robustness while preserving content naturalness. Traditional stitching methods are limited on diverse real-world scenes, and content alignment often conflicts with structural preservation. Method: RopStitch uses a dual-branch architecture in which a pretrained branch captures semantically invariant representations and a learnable branch extracts fine-grained discriminative features; the two are merged by a controllable factor at the correlation level. It also proposes the concept of virtual optimal planes, estimating homography decomposition coefficients and using an iterative coefficient predictor with a minimal semantic distortion constraint to identify the optimal plane. Result: Extensive experiments across various datasets show that RopStitch significantly outperforms existing methods in scene robustness and content naturalness. Conclusion: RopStitch is an unsupervised deep image stitching framework that achieves both robustness and naturalness by incorporating a universal prior of content perception and the concept of virtual optimal planes. Abstract: We present RopStitch, an unsupervised deep image stitching framework with both robustness and naturalness. To ensure the robustness of RopStitch, we propose to incorporate the universal prior of content perception into the image stitching model by a dual-branch architecture. It separately captures coarse and fine features and integrates them to achieve highly generalizable performance across diverse unseen real-world scenes. Concretely, the dual-branch model consists of a pretrained branch to capture semantically invariant representations and a learnable branch to extract fine-grained discriminative features, which are then merged into a whole by a controllable factor at the correlation level. Besides, considering that content alignment and structural preservation are often contradictory to each other, we propose a concept of virtual optimal planes to relieve this conflict. To this end, we model this problem as a process of estimating homography decomposition coefficients, and design an iterative coefficient predictor and minimal semantic distortion constraint to identify the optimal plane. This scheme is finally incorporated into RopStitch by warping both views onto the optimal plane bidirectionally. Extensive experiments across various datasets demonstrate that RopStitch significantly outperforms existing methods, particularly in scene robustness and content naturalness. The code is available at https://github.com/MmelodYy/RopStitch.
[60] Neural Field Representations of Mobile Computational Photography
Ilya Chugunov
Main category: cs.CV
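TL;DR: This thesis studies neural field models for processing mobile photography data, achieving efficient and accurate depth estimation and image processing.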
Details
Motivation: Advances in mobile imaging have made the cell phone a versatile pocket-sized computational imaging platform, while neural fields can reconstruct complex scenes without explicit data representations. Method: The thesis designs neural field models that compactly represent complex geometry and lighting effects, using well-constructed, self-regularized models fit directly to raw smartphone measurements via stochastic gradient descent. Result: The methods outperform state-of-the-art approaches in applications such as depth estimation, layer separation, and image stitching. Conclusion: Neural field models can solve challenging inverse problems by fitting directly to raw smartphone measurements, without complex pre-processing steps, labeled ground truth data, or machine learning priors. Abstract: Over the past two decades, mobile imaging has experienced a profound transformation, with cell phones rapidly eclipsing all other forms of digital photography in popularity. Today's cell phones are equipped with a diverse range of imaging technologies - laser depth ranging, multi-focal camera arrays, and split-pixel sensors - alongside non-visual sensors such as gyroscopes, accelerometers, and magnetometers. This, combined with on-board integrated chips for image and signal processing, makes the cell phone a versatile pocket-sized computational imaging platform. Parallel to this, we have seen in recent years how neural fields - small neural networks trained to map continuous spatial input coordinates to output signals - enable the reconstruction of complex scenes without explicit data representations such as pixel arrays or point clouds. In this thesis, I demonstrate how carefully designed neural field models can compactly represent complex geometry and lighting effects, enabling applications such as depth estimation, layer separation, and image stitching directly from collected in-the-wild mobile photography data. These methods outperform state-of-the-art approaches without relying on complex pre-processing steps, labeled ground truth data, or machine learning priors. Instead, they leverage well-constructed, self-regularized models that tackle challenging inverse problems through stochastic gradient descent, fitting directly to raw measurements from a smartphone.
[61] Enhancing Construction Site Analysis and Understanding with 3D Segmentation
Sri Ramana Saketh Vasanthawada,Pengkun Liu,Pingbo Tang
Main category: cs.CV
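TL;DR: This paper studies computer vision methods for making construction progress monitoring more efficient, evaluates two advanced 3D segmentation methods in indoor and outdoor environments, highlights the lack of benchmarks for outdoor scenes, and argues that tailored segmentation workflows are needed to extract information from construction site data.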
Details
Motivation: Monitoring construction progress is crucial but resource-intensive, motivating computer-vision-based methods for better efficiency and scalability. Traditional data acquisition methods perform poorly in the complex, cluttered, and dynamically changing conditions of construction sites. Method: The paper critically evaluates two advanced 3D segmentation methods, Segment Anything Model (SAM) and Mask3D, in challenging indoor and outdoor environments. Result: Both models, trained initially on indoor datasets, are assessed for adaptability and performance in real-world construction settings. Conclusion: The paper highlights that the absence of benchmarks, particularly for outdoor scenarios, leaves a gap in current segmentation approaches for construction environments. Its comparative analysis shows the relative effectiveness of SAM and Mask3D and points to the need for tailored segmentation workflows that extract actionable insights from construction site data, advancing monitoring toward more automated and precise techniques. Abstract: Monitoring construction progress is crucial yet resource-intensive, prompting the exploration of computer-vision-based methodologies for enhanced efficiency and scalability. Traditional data acquisition methods, primarily focusing on indoor environments, falter in construction sites' complex, cluttered, and dynamically changing conditions. This paper critically evaluates the application of two advanced 3D segmentation methods, Segment Anything Model (SAM) and Mask3D, in challenging outdoor and indoor conditions. Trained initially on indoor datasets, both models' adaptability and performance are assessed in real-world construction settings, highlighting the gap in current segmentation approaches due to the absence of benchmarks for outdoor scenarios. Through a comparative analysis, this study not only showcases the relative effectiveness of SAM and Mask3D but also addresses the critical need for tailored segmentation workflows capable of extracting actionable insights from construction site data, thereby advancing the field towards more automated and precise monitoring techniques.
[62] A 3DGS-Diffusion Self-Supervised Framework for Normal Estimation from a Single Image
Yanxing Liang,Yinghui Wang,Jinlong Yang,Wei Li
Main category: cs.CV
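TL;DR: SINGAD is a new self-supervised framework for single-image normal estimation that uses 3D Gaussian splatting guided diffusion to address multi-view geometric inconsistency and data dependency.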
Details
Motivation: The lack of spatial dimensional information remains a challenge in normal estimation from a single image. Recent diffusion-based methods rely on data-driven statistical priors and lack explicit modeling of light-surface interaction, leading to multi-view normal direction conflicts. In addition, the discrete sampling mechanism of diffusion models causes gradient discontinuity in differentiable rendering reconstruction modules, which prevents 3D geometric errors from being backpropagated to the normal generation network and forces existing methods to depend on dense normal annotations. Method: SINGAD is a self-supervised framework for single-image normal estimation via 3D Gaussian splatting guided diffusion. It integrates physics-driven light-interaction modeling with a differentiable rendering-based reprojection strategy to convert 3D geometric errors directly into normal optimization signals. Result: Quantitative evaluations on the Google Scanned Objects dataset show that the method outperforms state-of-the-art approaches across multiple metrics. Conclusion: By combining physics-driven light-interaction modeling with a differentiable rendering-based reprojection strategy, SINGAD converts 3D geometric errors directly into normal optimization signals, addressing multi-view geometric inconsistency and data dependency. Abstract: The lack of spatial dimensional information remains a challenge in normal estimation from a single image. Although recent diffusion-based methods have demonstrated significant potential in 2D-to-3D implicit mapping, they rely on data-driven statistical priors and miss the explicit modeling of light-surface interaction, leading to multi-view normal direction conflicts. Moreover, the discrete sampling mechanism of diffusion models causes gradient discontinuity in differentiable rendering reconstruction modules, preventing 3D geometric errors from being backpropagated to the normal generation network, thereby forcing existing methods to depend on dense normal annotations. This paper proposes SINGAD, a novel Self-supervised framework from a single Image for Normal estimation via 3D GAussian splatting guided Diffusion. By integrating physics-driven light-interaction modeling and a differentiable rendering-based reprojection strategy, our framework directly converts 3D geometric errors into normal optimization signals, solving the challenges of multi-view geometric inconsistency and data dependency. Specifically, the framework constructs a light-interaction-driven 3DGS reparameterization model to generate multi-scale geometric features consistent with light transport principles, ensuring multi-view normal consistency. A cross-domain feature fusion module is designed within a conditional diffusion model, embedding geometric priors to constrain normal generation while maintaining accurate geometric error propagation. 
Furthermore, a differentiable 3D reprojection loss strategy is introduced for self-supervised optimization that minimizes geometric error between the reconstructed and input image, eliminating dependence on annotated normal datasets. Quantitative evaluations on the Google Scanned Objects dataset demonstrate that our method outperforms state-of-the-art approaches across multiple metrics.
[63] Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents
Han Lin,Jaemin Cho,Amir Zadeh,Chuan Li,Mohit Bansal
Main category: cs.CV
TL;DR: Bifrost-1通过使用patch-level CLIP图像嵌入作为潜在变量,将预训练的多模态LLM(MLLM)和扩散模型结合起来,实现了高效的高保真可控图像生成。
Details
Motivation: Integrating high-fidelity visual synthesis into large language models (LLMs) while preserving their strong reasoning capabilities is an active research topic. Existing methods are limited by costly training. Method: Bifrost-1 connects a pretrained MLLM and a diffusion model using patch-level CLIP image embeddings as latent variables. These embeddings are integrated into the diffusion model through a lightweight adaptation of its ControlNet. To retain the MLLM's multimodal reasoning capabilities, the MLLM is equipped with a visual generation branch, initialized from the original MLLM parameters, for predicting the patch-level image embeddings. Result: Experiments show that Bifrost-1 matches or exceeds previous methods in visual fidelity and multimodal understanding while using substantially less compute during training. Comprehensive ablation studies support the design choices. Conclusion: The Bifrost-1 framework efficiently integrates pretrained MLLMs and diffusion models, enabling high-fidelity controllable image generation while preserving the MLLM's multimodal reasoning capabilities. Abstract: There is growing interest in integrating high-fidelity visual synthesis capabilities into large language models (LLMs) without compromising their strong reasoning capabilities. Existing methods that directly train LLMs or bridge LLMs and diffusion models usually suffer from costly training since the backbone LLMs have not seen image representations during pretraining. We present Bifrost-1, a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models using patch-level CLIP image embeddings as latent variables, which are natively aligned with the MLLM's CLIP visual encoder. These patch-level image embeddings are integrated into the diffusion model with a lightweight adaptation of its ControlNet. To retain the original multimodal reasoning capabilities of MLLMs, we equip the MLLM with a visual generation branch initialized from the original MLLM parameters when predicting the patch-level image embeddings. By seamlessly integrating pretrained MLLMs and diffusion models with patch-level CLIP latents, our framework enables high-fidelity controllable image generation with significant training efficiency. Our experiments demonstrate that Bifrost-1 achieves comparable or better performance than previous methods in terms of visual fidelity and multimodal understanding, with substantially lower compute during training. 
We also provide comprehensive ablation studies showing the effectiveness of our design choices.
[64] PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation
Zhihao Zhu,Yifan Zheng,Siyu Pan,Yaohui Jin,Yao Mu
Main category: cs.CV
TL;DR: This paper proposes PASG, a closed-loop framework that bridges geometric features and task semantics in robotic manipulation using automatic primitive extraction, VLM-driven semantic anchoring, and spatial-semantic reasoning.
Details
Motivation: The fragmentation between high-level task semantics and low-level geometric features poses a challenge in robotic manipulation. Vision-language models (VLMs) lack semantic grounding in canonical spaces and rely on manual annotations, limiting their ability to capture dynamic semantic-affordance relationships. Method: The proposed Primitive-Aware Semantic Grounding (PASG) framework includes automatic primitive extraction, VLM-driven semantic anchoring, and spatial-semantic reasoning. A fine-tuned VLM (Qwen2.5VL-PA) and a benchmark are also introduced. Result: PASG demonstrates effectiveness in robotic manipulation tasks across diverse scenarios with performance comparable to manual annotations, achieving finer-grained semantic-affordance understanding. Conclusion: PASG establishes a unified paradigm for bridging geometric primitives with task semantics in robotic manipulation, achieving performance comparable to manual annotations and enabling finer-grained semantic-affordance understanding of objects. Abstract: The fragmentation between high-level task semantics and low-level geometric features remains a persistent challenge in robotic manipulation. While vision-language models (VLMs) have shown promise in generating affordance-aware visual representations, the lack of semantic grounding in canonical spaces and reliance on manual annotations severely limit their ability to capture dynamic semantic-affordance relationships. To address these, we propose Primitive-Aware Semantic Grounding (PASG), a closed-loop framework that introduces: (1) Automatic primitive extraction through geometric feature aggregation, enabling cross-category detection of keypoints and axes; (2) VLM-driven semantic anchoring that dynamically couples geometric primitives with functional affordances and task-relevant description; (3) A spatial-semantic reasoning benchmark and a fine-tuned VLM (Qwen2.5VL-PA). 
We demonstrate PASG's effectiveness in practical robotic manipulation tasks across diverse scenarios, achieving performance comparable to manual annotations. PASG achieves a finer-grained semantic-affordance understanding of objects, establishing a unified paradigm for bridging geometric primitives with task semantics in robotic manipulation.
[65] AnimateScene: Camera-controllable Animation in Any Scene
Qingyang Liu,Bingjie Gao,Weiheng Huang,Jun Zhang,Zhongqian Sun,Yang Wei,Zelin Peng,Qianli Ma,Shuai Yang,Zhaohe Liao,Haonan Zhao,Li Niu
Main category: cs.CV
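TL;DR: AnimateScene is a unified framework that addresses the integration of 3D scene reconstruction with 4D human animation, generating high-quality dynamic scene videos.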
Details
Motivation: Integrating 3D scene reconstruction with 4D human animation to produce visually engaging results remains challenging. Method: AnimateScene comprises an accurate placement module, a training-free style alignment method, and a joint post-reconstruction method for the 4D human and the 3D scene. Result: AnimateScene generates dynamic scene videos with high geometric detail and spatiotemporal coherence across various camera and action combinations. Conclusion: AnimateScene is a unified framework that addresses the problems of integrating 3D scene reconstruction with 4D human animation, producing dynamic scene videos with high geometric detail and spatiotemporal coherence. Abstract: 3D scene reconstruction and 4D human animation have seen rapid progress and broad adoption in recent years. However, seamlessly integrating reconstructed scenes with 4D human animation to produce visually engaging results remains challenging. One key difficulty lies in placing the human at the correct location and scale within the scene while avoiding unrealistic interpenetration. Another challenge is that the human and the background may exhibit different lighting and style, leading to unrealistic composites. In addition, appealing character motion videos are often accompanied by camera movements, which means that the viewpoints need to be reconstructed along a specified trajectory. We present AnimateScene, which addresses the above issues in a unified framework. First, we design an accurate placement module that automatically determines a plausible 3D position for the human and prevents any interpenetration within the scene during motion. Second, we propose a training-free style alignment method that adapts the 4D human representation to match the background's lighting and style, achieving coherent visual integration. Finally, we design a joint post-reconstruction method for both the 4D human and the 3D scene that allows camera trajectories to be inserted, enabling the final rendered video to feature visually appealing camera movements. Extensive experiments show that AnimateScene generates dynamic scene videos with high geometric detail and spatiotemporal coherence across various camera and action combinations.
[66] ETA: Energy-based Test-time Adaptation for Depth Completion
Younjoon Chung,Hyoungseob Park,Patrick Rim,Xiaoran Zhang,Jihe He,Ziyao Zeng,Safa Cicek,Byung-Woo Hong,James S. Duncan,Alex Wong
Main category: cs.CV
TL;DR: The paper introduces ETA, an energy-based test-time adaptation method for depth completion models, which improves performance in new environments without prior target data knowledge.
Details
Motivation: Depth completion models often fail when transferred to new environments due to covariate shift. Existing methods assume prior knowledge of the target distribution, which may not be available. Method: ETA uses adversarial perturbations to train an energy model that scores depth predictions as in- or out-of-distribution. Model parameters are then updated at test time to minimize energy, aligning predictions with the source distribution. Result: Evaluated on three indoor and three outdoor datasets, ETA outperforms the previous state-of-the-art method by 6.94% for outdoor and 10.23% for indoor scenarios. Conclusion: The proposed ETA method effectively improves test-time adaptation of depth completion models by aligning predictions with the source distribution, leading to better performance than existing approaches. Abstract: We propose a method for test-time adaptation of pretrained depth completion models. Depth completion models, trained on some "source" data, often predict erroneous outputs when transferred to "target" data captured in novel environmental conditions due to a covariate shift. The crux of our method lies in quantifying the likelihood of depth predictions belonging to the source data distribution. The challenge is in the lack of access to out-of-distribution (target) data prior to deployment. Hence, rather than making assumptions regarding the target distribution, we utilize adversarial perturbations as a mechanism to explore the data space. This enables us to train an energy model that scores local regions of depth predictions as in- or out-of-distribution. We update the parameters of pretrained depth completion models at test time to minimize energy, effectively aligning test-time predictions to those of the source distribution. We call our method "Energy-based Test-time Adaptation", or ETA for short. 
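The test-time update described above (adjust model parameters so that an energy score over the predictions decreases) can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the "model" is a single scale parameter, and the quadratic `energy` function is a hypothetical stand-in for the learned energy model, which in the paper is trained with adversarial perturbations and scores local regions of depth predictions.

```python
import numpy as np

def energy(depth, source_mean):
    """Hypothetical stand-in energy: low when depth resembles source-domain predictions."""
    return float(np.mean((depth - source_mean) ** 2))

def adapt_scale(features, scale, source_mean, lr=0.1, steps=50):
    """Update one model parameter at test time by gradient descent on the energy."""
    for _ in range(steps):
        depth = scale * features  # toy model prediction
        # Analytic gradient of the quadratic energy with respect to the scale.
        grad = float(np.mean(2.0 * (depth - source_mean) * features))
        scale -= lr * grad
    return scale

features = np.array([0.9, 1.0, 1.1])  # toy test-time inputs
scale_before = 5.0                    # parameter that fits the target domain poorly
scale_after = adapt_scale(features, scale_before, source_mean=2.0)

energy_before = energy(scale_before * features, 2.0)
energy_after = energy(scale_after * features, 2.0)
print(scale_after, energy_before, energy_after)
```

No target-domain labels are used: the only training signal is the energy of the model's own predictions, which is the property that makes this kind of adaptation possible at deployment time.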
We evaluate our method across three indoor and three outdoor datasets, where ETA improves over the previous state-of-the-art method by an average of 6.94% for outdoors and 10.23% for indoors. Project Page: https://fuzzythecat.github.io/eta.
[67] Fast Motion Estimation and Context-Aware Refinement for Efficient Bayer-Domain Video Vision
Haichao Wang,Xinyue Xi,Jiangtao Wen,Yuxing Han
Main category: cs.CV
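TL;DR: This paper proposes an efficient video computer vision system that removes the image signal processor and uses a fast block-matching-based motion estimation algorithm, reducing redundancy and computational overhead in video processing.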
Details
Motivation: The efficiency of video computer vision systems is limited by the high temporal redundancy within video; existing methods neither fully reduce this redundancy nor account for front-end computation overhead. Method: First, the image signal processor is removed and Bayer-format data is fed directly into the model, saving front-end computation. Second, a fast block-matching-based motion estimation algorithm is used, with an MV refinement module and a context-aware block refinement network to correct errors, together with a frame selection strategy to balance accuracy and efficiency. Result: Experiments show significant acceleration on multiple video computer vision tasks with only slight performance loss. Conclusion: The study presents a more efficient video computer vision system that reduces computational redundancy and increases processing speed. Abstract: The efficiency of video computer vision systems remains a challenge due to the high temporal redundancy inside a video. Existing works have been proposed for efficient video computer vision. However, they do not fully reduce the temporal redundancy and neglect the front-end computation overhead. In this paper, we propose an efficient video computer vision system. First, the image signal processor is removed and Bayer-format data is directly fed into video computer vision models, thus saving the front-end computation. Second, instead of optical flow models and video codecs, a fast block-matching-based motion estimation algorithm is proposed specifically for efficient video computer vision, with an MV refinement module. To correct the error, a context-aware block refinement network is introduced to refine regions with large error. To further balance the accuracy and efficiency, a frame selection strategy is employed. Experiments on multiple video computer vision tasks demonstrate that our method achieves significant acceleration with slight performance loss.
[68] ECMF: Enhanced Cross-Modal Fusion for Multimodal Emotion Recognition in MER-SEMI Challenge
Juewen Hu,Yexin Li,Jiulin Li,Shuo Chen,Pring Wong
Main category: cs.CV
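TL;DR: This paper proposes a new multimodal emotion recognition framework that leverages large-scale pre-trained models and a novel fusion strategy, outperforming the official baseline in the MER-SEMI challenge of the MER2025 competition.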
Details
Motivation: Emotion recognition plays a key role in enhancing human-computer interaction. The paper addresses the MER-SEMI challenge by proposing a new multimodal emotion recognition framework. Method: To address data scarcity, the authors use large-scale pre-trained models to extract informative features from the visual, audio, and textual modalities. For the visual modality, a dual-branch visual encoder captures global frame-level features and localized facial representations. For the textual modality, a context-enriched method uses large language models to enrich emotional cues in the input text. To integrate these multimodal features, the authors propose a fusion strategy that combines self-attention mechanisms with residual connections. Result: The method achieves a substantial improvement over the official baseline on the MER2025-SEMI dataset, with a weighted F-score of 87.49% versus 78.63%. Conclusion: The results validate the effectiveness of the proposed framework, which performs strongly in the MER-SEMI challenge of the MER2025 competition. Abstract: Emotion recognition plays a vital role in enhancing human-computer interaction. In this study, we tackle the MER-SEMI challenge of the MER2025 competition by proposing a novel multimodal emotion recognition framework. To address the issue of data scarcity, we leverage large-scale pre-trained models to extract informative features from visual, audio, and textual modalities. Specifically, for the visual modality, we design a dual-branch visual encoder that captures both global frame-level features and localized facial representations. For the textual modality, we introduce a context-enriched method that employs large language models to enrich emotional cues within the input text. To effectively integrate these multimodal features, we propose a fusion strategy comprising two key components, i.e., self-attention mechanisms for dynamic modality weighting, and residual connections to preserve original representations. Beyond architectural design, we further refine noisy labels in the training set by a multi-source labeling strategy. Our approach achieves a substantial performance improvement over the official baseline on the MER2025-SEMI dataset, attaining a weighted F-score of 87.49% compared to 78.63%, thereby validating the effectiveness of the proposed framework.
[69] EvoMakeup: High-Fidelity and Controllable Makeup Editing with MakeupQuad
Huadong Wu,Yi Fu,Yunhao Li,Yuan Gao,Kang Du
Main category: cs.CV
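TL;DR: This paper presents EvoMakeup, a facial makeup editing method built on a large-scale, high-quality dataset, achieving high fidelity and identity preservation.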
Details
Motivation: Existing facial makeup editing methods produce poor results and lack structured paired data; the paper aims to address this problem. Method: The authors construct the MakeupQuad dataset and propose EvoMakeup, a unified training framework that mitigates image degradation during multi-stage distillation. Result: Experiments show that the method outperforms prior methods in both makeup fidelity and identity preservation. Conclusion: EvoMakeup enables high-quality, controllable, multi-task facial makeup editing and performs well on real-world benchmarks. Abstract: Facial makeup editing aims to realistically transfer makeup from a reference to a target face. Existing methods often produce low-quality results with coarse makeup details and struggle to preserve both identity and makeup fidelity, mainly due to the lack of structured paired data -- where source and result share identity, and reference and result share identical makeup. To address this, we introduce MakeupQuad, a large-scale, high-quality dataset with non-makeup faces, references, edited results, and textual makeup descriptions. Building on this, we propose EvoMakeup, a unified training framework that mitigates image degradation during multi-stage distillation, enabling iterative improvement of both data and model quality. Although trained solely on synthetic data, EvoMakeup generalizes well and outperforms prior methods on real-world benchmarks. It supports high-fidelity, controllable, multi-task makeup editing -- including full-face and partial reference-based editing, as well as text-driven makeup editing -- within a single model. Experimental results demonstrate that our method achieves superior makeup fidelity and identity preservation, effectively balancing both aspects. Code and dataset will be released upon acceptance.
[70] MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models
Jun Feng,Zixin Wang,Zhentao Zhang,Yue Guo,Zhihan Zhou,Xiuyi Chen,Zhenyang Li,Dawei Yin
Main category: cs.CV
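TL;DR: This paper presents MathReal, a dataset for evaluating the mathematical reasoning of multimodal large language models in real educational scenarios, and finds that existing models face substantial challenges in this setting.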
Details
Motivation: Existing multimodal benchmarks are based mainly on clean or processed inputs and lack images provided by real-world K-12 educational users. Method: The authors build MathReal, a dataset of 2,000 mathematical questions with images captured in real scenarios, and design six experimental settings to systematically evaluate model performance. Result: Experiments show that the problem-solving abilities of current state-of-the-art multimodal large language models are significantly challenged in real educational scenarios. Conclusion: Existing multimodal large language models face significant challenges in real educational scenarios and need further improvement in recognition, comprehension, and reasoning. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in visual mathematical reasoning across various existing benchmarks. However, these benchmarks are predominantly based on clean or processed multimodal inputs, without incorporating the images provided by real-world Kindergarten through 12th grade (K-12) educational users. To address this gap, we introduce MathReal, a meticulously curated dataset comprising 2,000 mathematical questions with images captured by handheld mobile devices in authentic scenarios. Each question is an image, containing the question text and visual element. We systematically classify the real images into three primary categories: image quality degradation, perspective variation, and irrelevant content interference, which are further delineated into 14 subcategories. Additionally, MathReal spans five core knowledge and ability categories, which encompass three question types and are divided into three difficulty levels. To comprehensively evaluate the multimodal mathematical reasoning abilities of state-of-the-art MLLMs in real-world scenarios, we design six experimental settings that enable a systematic analysis of their performance. Through extensive experimentation, we find that the problem-solving abilities of existing MLLMs are significantly challenged in realistic educational contexts. Based on this, we conduct a thorough analysis of their performance and error patterns, providing insights into their recognition, comprehension, and reasoning capabilities, and outlining directions for future improvements. Data and code: https://github.com/junfeng0288/MathReal.
[71] ExploreGS: Explorable 3D Scene Reconstruction with Virtual Camera Samplings and Diffusion Priors
Minsu Kim,Subin Jeon,In Cho,Mijin Yoo,Seon Joo Kim
Main category: cs.CV
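TL;DR: This paper proposes a 3D Gaussian Splatting-based pipeline that generates additional training views to enhance reconstruction, addressing the artifacts and missing regions that appear when rendering from viewpoints that deviate from the training trajectory.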
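The information-gain-driven virtual camera placement mentioned in this entry can be illustrated with a minimal greedy sketch. It assumes, purely for illustration, that information gain is approximated by the number of scene cells a candidate camera sees that no earlier view covers; the paper's actual gain measure is not given here, and `greedy_camera_selection` and the visibility matrix are hypothetical names and data.

```python
import numpy as np

def greedy_camera_selection(visibility, covered, num_views):
    """Pick candidate cameras that add the most not-yet-covered scene cells.

    visibility: (num_candidates, num_cells) boolean matrix, True where a
                candidate camera sees a cell (hypothetical input).
    covered:    (num_cells,) boolean vector of cells already seen by the
                training trajectory.
    """
    covered = covered.copy()
    chosen = []
    for _ in range(num_views):
        # Gain proxy (an assumption): count of newly covered cells per candidate.
        gains = (visibility & ~covered).sum(axis=1)
        best = int(np.argmax(gains))
        if gains[best] <= 0:
            break  # no candidate adds coverage any more
        chosen.append(best)
        covered |= visibility[best]
    return chosen, covered

# Toy scene: 4 candidate cameras, 5 cells, cell 0 already covered.
visibility = np.array(
    [[1, 1, 0, 0, 0],
     [0, 1, 1, 1, 0],
     [0, 0, 0, 1, 1],
     [1, 0, 0, 0, 0]], dtype=bool)
covered = np.array([1, 0, 0, 0, 0], dtype=bool)
chosen, final_cover = greedy_camera_selection(visibility, covered, num_views=3)
print(chosen, final_cover)
```

In a pipeline like the one summarized here, the views rendered from the selected cameras would then be refined (the entry names video diffusion priors) before being used for fine-tuning; that step is outside this sketch.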
Details
Motivation: Existing methods produce artifacts and missing regions when rendering from viewpoints that deviate from the training trajectory, which limits seamless scene exploration. Method: The authors propose a 3D Gaussian Splatting-based pipeline that combines an information-gain-driven virtual camera placement strategy with video diffusion priors to generate and refine additional training views. Result: Experiments show that the method performs strongly on the Wild-Explore benchmark and enables high-quality, artifact-free rendering from arbitrary viewpoints. Conclusion: The method effectively improves scene reconstruction quality and addresses the problems existing methods have when rendering from challenging viewpoints. Abstract: Recent advances in novel view synthesis (NVS) have enabled real-time rendering with 3D Gaussian Splatting (3DGS). However, existing methods struggle with artifacts and missing regions when rendering from viewpoints that deviate from the training trajectory, limiting seamless scene exploration. To address this, we propose a 3DGS-based pipeline that generates additional training views to enhance reconstruction. We introduce an information-gain-driven virtual camera placement strategy to maximize scene coverage, followed by video diffusion priors to refine rendered results. Fine-tuning 3D Gaussians with these enhanced views significantly improves reconstruction quality. To evaluate our method, we present Wild-Explore, a benchmark designed for challenging scene exploration. Experiments demonstrate that our approach outperforms existing 3DGS-based methods, enabling high-quality, artifact-free rendering from arbitrary viewpoints. https://exploregs.github.io
[72] Improved Sub-Visible Particle Classification in Flow Imaging Microscopy via Generative AI-Based Image Synthesis
Utku Ozbulak,Michaela Cohrs,Hristo L. Svilenov,Joris Vankerschaver,Wesley De Neve
Main category: cs.CV
TL;DR: This paper introduces a diffusion model to generate synthetic particle images, effectively addressing data imbalance in particle analysis and enhancing deep learning classification performance.
Details
Motivation: The motivation stems from the challenge of data scarcity and imbalance in sub-visible particle analysis, especially for rare particles like silicone oil and air bubbles. Method: The researchers developed a state-of-the-art diffusion model to generate high-fidelity particle images, augmenting training datasets and enabling better classifier training. Result: The proposed method successfully generated realistic particle images and significantly improved classification performance in experiments involving 500,000 protein particle images. Conclusion: The study concludes that using diffusion models to generate high-fidelity images can address data imbalance in particle analysis, improving the performance of multi-class deep neural networks. Abstract: Sub-visible particle analysis using flow imaging microscopy combined with deep learning has proven effective in identifying particle types, enabling the distinction of harmless components such as silicone oil from protein particles. However, the scarcity of available data and severe imbalance between particle types within datasets remain substantial hurdles when applying multi-class classifiers to such problems, often forcing researchers to rely on less effective methods. The aforementioned issue is particularly challenging for particle types that appear unintentionally and in lower numbers, such as silicone oil and air bubbles, as opposed to protein particles, where obtaining large numbers of images through controlled settings is comparatively straightforward. In this work, we develop a state-of-the-art diffusion model to address data imbalance by generating high-fidelity images that can augment training datasets, enabling the effective training of multi-class deep neural networks. We validate this approach by demonstrating that the generated samples closely resemble real particle images in terms of visual quality and structure. 
To assess the effectiveness of using diffusion-generated images in training datasets, we conduct large-scale experiments on a validation dataset comprising 500,000 protein particle images and demonstrate that this approach improves classification performance with no negligible downside. Finally, to promote open research and reproducibility, we publicly release both our diffusion models and the trained multi-class deep neural network classifiers, along with a straightforward interface for easy integration into future studies, at https://github.com/utkuozbulak/svp-generative-ai.
[73] Learning 3D Texture-Aware Representations for Parsing Diverse Human Clothing and Body Parts
Kiran Chhatre,Christopher Peters,Srikrishna Karanam
Main category: cs.CV
TL;DR: Spectrum improves human parsing by using a repurposed Image-to-Texture diffusion model to accurately segment body parts and diverse clothing categories, outperforming existing methods.
Details
Motivation: Existing human parsing methods use fixed mask categories that obscure fine-grained clothing types, while open-vocabulary segmentation approaches typically group entire humans into a single category, failing to distinguish detailed clothing or body parts. Spectrum addresses these limitations by providing more precise parsing. Method: Spectrum repurposes an Image-to-Texture (I2Tx) diffusion model, derived by fine-tuning a text-to-image model on 3D human texture maps, to extract human-part internal features and generate semantically valid masks aligned to diverse clothing categories through prompt-guided grounding. Result: Spectrum consistently outperforms baseline methods in prompt-based segmentation across extensive cross-dataset experiments, including assessments of body parts, clothing parts, unseen clothing categories, and full-body masks. Conclusion: Spectrum is a unified network for part-level pixel parsing and instance-level grouping that outperforms baseline methods in prompt-based segmentation by leveraging an Image-to-Texture diffusion model for improved alignment with body parts and clothing. Abstract: Existing methods for human parsing into body parts and clothing often use fixed mask categories with broad labels that obscure fine-grained clothing types. Recent open-vocabulary segmentation approaches leverage pretrained text-to-image (T2I) diffusion model features for strong zero-shot transfer, but typically group entire humans into a single person category, failing to distinguish diverse clothing or detailed body parts. To address this, we propose Spectrum, a unified network for part-level pixel parsing (body parts and clothing) and instance-level grouping. While diffusion-based open-vocabulary models generalize well across tasks, their internal representations are not specialized for detailed human parsing. 
We observe that, unlike diffusion models with broad representations, image-driven 3D texture generators maintain faithful correspondence to input images, enabling stronger representations for parsing diverse clothing and body parts. Spectrum introduces a novel repurposing of an Image-to-Texture (I2Tx) diffusion model -- obtained by fine-tuning a T2I model on 3D human texture maps -- for improved alignment with body parts and clothing. From an input image, we extract human-part internal features via the I2Tx diffusion model and generate semantically valid masks aligned to diverse clothing categories through prompt-guided grounding. Once trained, Spectrum produces semantic segmentation maps for every visible body part and clothing category, ignoring standalone garments or irrelevant objects, for any number of humans in the scene. We conduct extensive cross-dataset experiments -- separately assessing body parts, clothing parts, unseen clothing categories, and full-body masks -- and demonstrate that Spectrum consistently outperforms baseline methods in prompt-based segmentation.
[74] InstantEdit: Text-Guided Few-Step Image Editing with Piecewise Rectified Flow
Yiming Gong,Zhen Zhu,Minjia Zhang
Main category: cs.CV
TL;DR: InstantEdit is a fast text-guided image editing method based on RectifiedFlow that improves editability, detail preservation, and coherence using novel inversion and regeneration techniques.
Details
Motivation: The motivation is to develop a fast, text-guided image editing method that preserves critical content while closely following textual instructions. Method: InstantEdit uses the RectifiedFlow framework with the PerRFI inversion strategy, the Inversion Latent Injection regeneration method, Disentangled Prompt Guidance, and a Canny-conditioned ControlNet for structural cues. Result: InstantEdit runs faster and achieves better qualitative and quantitative results on the PIE dataset than existing few-step editing methods. Conclusion: InstantEdit is a fast and efficient text-guided image editing method that outperforms state-of-the-art few-step editing methods in both qualitative and quantitative evaluations. Abstract: We propose a fast text-guided image editing method called InstantEdit based on the RectifiedFlow framework, which is structured as a few-step editing process that preserves critical content while closely following textual instructions. Our approach leverages the straight sampling trajectories of RectifiedFlow by introducing a specialized inversion strategy called PerRFI. To keep results consistent yet editable for the RectifiedFlow model, we further propose a novel regeneration method, Inversion Latent Injection, which effectively reuses latent information obtained during inversion to facilitate more coherent and detailed regeneration. Additionally, we propose a Disentangled Prompt Guidance technique to balance editability with detail preservation, and integrate a Canny-conditioned ControlNet to incorporate structural cues and suppress artifacts. Evaluation on the PIE image editing dataset demonstrates that InstantEdit is not only fast but also achieves better qualitative and quantitative results compared to state-of-the-art few-step editing methods.
[75] More Is Better: A MoE-Based Emotion Recognition Framework with Human Preference Alignment
Jun Xie,Yingjian Zhu,Feng Chen,Zhenghao Zhang,Xiaohui Fan,Hongzhu Yi,Xinming Wang,Chen Yu,Yue Bi,Zhaoran Zhao,Xiongjun Guan,Zhepeng Wang
Main category: cs.CV
TL;DR: This paper proposes a semi-supervised Mixture of Experts (MoE) emotion recognition system with a consensus-based pseudo-labeling strategy, achieving a high F1-score of 0.8772 and securing 2nd place in the MER2025-SEMI challenge.
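The consensus-based pseudo-labeling and multi-expert voting described in this entry can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function names and sample IDs are hypothetical, agreement is plain label equality between two labelers, and the rule-based re-ranking step is not shown because its rules are not given here.

```python
from collections import Counter

def consensus_pseudo_labels(baseline_preds, reference_preds):
    """Keep an unlabeled sample only when both labelers assign the same label.

    Both arguments map a sample ID to a predicted emotion label.
    """
    return {
        sample_id: label
        for sample_id, label in baseline_preds.items()
        if reference_preds.get(sample_id) == label
    }

def majority_vote(expert_labels):
    """Return the label predicted by the largest number of experts."""
    return Counter(expert_labels).most_common(1)[0][0]

# Hypothetical predictions from a baseline model and a second labeler.
baseline = {"clip_a": "happy", "clip_b": "sad", "clip_c": "angry"}
reference = {"clip_a": "happy", "clip_b": "neutral", "clip_c": "angry"}
pseudo = consensus_pseudo_labels(baseline, reference)  # clip_b is dropped

# Hypothetical per-expert predictions for one test clip.
final_label = majority_vote(["happy", "sad", "happy"])
print(pseudo, final_label)
```

Filtering by agreement trades coverage for label quality: disagreeing samples are simply discarded rather than assigned a possibly wrong label.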
Details
Motivation: The motivation is to build a robust emotion recognition system by leveraging semi-supervised learning and incorporating diverse data modalities, thus improving performance and aligning predictions with human preferences. Method: The method involves constructing a Mixture of Experts (MoE) framework using diverse input modalities, including Vision-Language Models (VLMs) and Action Unit (AU) information. A consensus-based pseudo-labeling strategy is introduced to leverage unlabeled data, and a two-stage training paradigm is employed alongside a multi-expert voting ensemble and rule-based re-ranking for improved prediction. Result: The framework achieved an F1-score of 0.8772 on the MER2025-SEMI challenge dataset, ranking 2nd in the competition. Conclusion: The proposed framework for the semi-supervised learning track in MER2025 achieved a high F1-score of 0.8772 and ranked 2nd in the track, demonstrating the effectiveness of the MoE emotion recognition system and the pseudo-labeling strategy. Abstract: In this paper, we present our solution for the semi-supervised learning track (MER-SEMI) in MER2025. We propose a comprehensive framework, grounded in the principle that "more is better," to construct a robust Mixture of Experts (MoE) emotion recognition system. Our approach integrates a diverse range of input modalities as independent experts, including novel signals such as knowledge from large Vision-Language Models (VLMs) and temporal Action Unit (AU) information. To effectively utilize unlabeled data, we introduce a consensus-based pseudo-labeling strategy, generating high-quality labels from the agreement between a baseline model and Gemini, which are then used in a two-stage training paradigm. Finally, we employ a multi-expert voting ensemble combined with a rule-based re-ranking process to correct prediction bias and better align the outputs with human preferences. 
Evaluated on the MER2025-SEMI challenge dataset, our method achieves an F1-score of 0.8772 on the test set, ranking 2nd in the track. Our code is available at https://github.com/zhuyjan/MER2025-MRAC25.
[76] Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models
Huanyu Wang,Jushi Kai,Haoli Bai,Lu Hou,Bo Jiang,Ziwei He,Zhouhan Lin
Main category: cs.CV
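TL;DR: Fourier-VLM uses frequency-domain compression to reduce the number of vision tokens, significantly lowering computational overhead and improving the inference efficiency of vision-language models.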
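The frequency-domain compression in this entry (a 2D DCT followed by a low-pass filter) can be sketched with NumPy. The sketch assumes the vision features form an H x W grid of C-dimensional tokens and that low-pass filtering means keeping the top-left `keep` x `keep` block of DCT coefficients; how the paper maps the retained coefficients back into tokens is not stated here, and `dct_matrix` and `compress_tokens` are hypothetical names.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix; row k is frequency k."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0, :] = np.sqrt(1.0 / n)
    return basis

def compress_tokens(features, keep):
    """Apply a 2D DCT over the spatial grid and keep the low-frequency block.

    features: (H, W, C) grid of vision features.
    Returns:  (keep, keep, C) low-frequency coefficients, i.e. keep*keep
              entries instead of H*W (an illustrative reading of the method).
    """
    h, w, _ = features.shape
    coeffs = np.einsum("kh,hwc->kwc", dct_matrix(h), features)  # DCT along H
    coeffs = np.einsum("lw,kwc->klc", dct_matrix(w), coeffs)    # DCT along W
    return coeffs[:keep, :keep, :]

# A constant 4x4 grid has all its energy in the single lowest frequency.
grid = np.ones((4, 4, 2))
low = compress_tokens(grid, keep=2)
print(low.shape)
```

Going from a 4 x 4 grid to a 2 x 2 block cuts the token count from 16 to 4; the entry's motivation is that encoder features concentrate their energy in such low-frequency components, so little information is lost.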
Details
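Motivation: The features output by the vision encoder concentrate their energy in low-frequency components, so frequency-domain compression can reduce the number of vision tokens and lower computational overhead. Method: Vision features are transformed into the frequency domain with a two-dimensional discrete cosine transform (DCT), and a low-pass filter retains the low-frequency components, reducing the number of vision tokens. Result: Fourier-VLM reduces inference FLOPs by up to 83.8% while increasing generation speed by 31.2%, and performs well across multiple vision-language model architectures. Conclusion: Fourier-VLM is a simple yet effective method that significantly reduces the number of vision tokens without adding extra parameters, increasing generation speed and lowering computational overhead, and it applies to a variety of vision-language model architectures. Abstract: Vision-Language Models (VLMs) typically replace the predefined image placeholder token (
[77] NEP: Autoregressive Image Editing via Next Editing Token Prediction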
Huimin Wu,Xiaojian Ma,Haozhe Zhao,Yanpeng Zhao,Qing Li
Main category: cs.CV
TL;DR: The paper introduces NEP, a method for efficient and precise text-guided image editing that avoids unnecessary modifications and achieves top performance.
Details
Motivation: To address the limitations of existing approaches that unnecessarily regenerate entire images and compromise edit quality. Method: Formulating image editing as Next Editing-token Prediction (NEP) using an any-order autoregressive text-to-image (T2I) model. Result: Achieves state-of-the-art on image editing benchmarks, enables zero-shot editing, and supports test-time scaling for iterative refinement. Conclusion: The proposed NEP method for image editing achieves state-of-the-art results and supports zero-shot editing and test-time scaling. Abstract: Text-guided image editing involves modifying a source image based on a language instruction and, typically, requires changes to only small local regions. However, existing approaches generate the entire target image rather than selectively regenerate only the intended editing areas. This results in (1) unnecessary computational costs and (2) a bias toward reconstructing non-editing regions, which compromises the quality of the intended edits. To resolve these limitations, we propose to formulate image editing as Next Editing-token Prediction (NEP) based on autoregressive image generation, where only regions that need to be edited are regenerated, thus avoiding unintended modification to the non-editing areas. To enable any-region editing, we propose to pre-train an any-order autoregressive text-to-image (T2I) model. Once trained, it is capable of zero-shot image editing and can be easily adapted to NEP for image editing, which achieves a new state-of-the-art on widely used image editing benchmarks. Moreover, our model naturally supports test-time scaling (TTS) through iteratively refining its generation in a zero-shot manner. The project page is: https://nep-bigai.github.io/
[78] VQAThinker: Exploring Generalizable and Explainable Video Quality Assessment via Reinforcement Learning
Linhan Cao,Wei Sun,Weixia Zhang,Xiangyang Zhu,Jun Jia,Kaiwei Zhang,Dandan Zhu,Guangtao Zhai,Xiongkuo Min
Main category: cs.CV
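TL;DR: To address the poor generalization and limited explainability of video quality assessment models, this paper proposes the VQAThinker framework, which combines reinforcement learning with multimodal models to achieve strong generalization in video quality assessment.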
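The three VQA-specific rewards named in this entry's summary can be illustrated with simple scalar functions. The exact functional forms are not given here, so everything below is an assumption: a Gaussian stands in for the bell-shaped regression reward, the ranking and temporal rewards are binary, and `sigma` and all function names are hypothetical.

```python
import math

def bell_reward(pred, target, sigma=0.5):
    """Gaussian-shaped reward: 1.0 at zero error, flat near the ground truth,
    and falling off as the prediction error grows (assumed form)."""
    return math.exp(-((pred - target) ** 2) / (2 * sigma ** 2))

def pairwise_ranking_reward(pred_a, pred_b, mos_a, mos_b):
    """1.0 when the predicted order of two videos matches the order of their
    ground-truth quality scores, else 0.0. Ties in ground truth give 0.0."""
    if mos_a == mos_b:
        return 0.0
    return 1.0 if (pred_a - pred_b) * (mos_a - mos_b) > 0 else 0.0

def temporal_consistency_reward(score_original, score_perturbed):
    """1.0 when the temporally coherent video is scored above its
    perturbed counterpart, else 0.0."""
    return 1.0 if score_original > score_perturbed else 0.0

# Example values on a hypothetical 1-5 quality scale.
r_reg = bell_reward(pred=3.4, target=3.5)
r_rank = pairwise_ranking_reward(4.0, 2.0, mos_a=4.5, mos_b=3.0)
r_temp = temporal_consistency_reward(3.8, 2.1)
print(r_reg, r_rank, r_temp)
```

In a GRPO-style setup these scalars would be combined into a per-response reward; the weighting between them is not specified in this entry.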
Details
Motivation: Existing video quality assessment models generalize poorly to out-of-distribution videos and lack explainability, which limits their use in real-world scenarios. Method: The authors adopt GRPO, a rule-guided reinforcement learning algorithm, and introduce three rewards specific to video quality assessment: a bell-shaped regression reward, a pairwise ranking reward, and a temporal consistency reward. Result: VQAThinker achieves state-of-the-art performance on both in-domain and out-of-distribution video quality assessment benchmarks and outperforms existing explainable video quality assessment models in distortion attribution and quality description. Conclusion: VQAThinker is a reasoning-based video quality assessment framework that uses multimodal models with reinforcement learning to jointly model video quality understanding and scoring, achieving strong generalization in video quality scoring. Abstract: Video quality assessment (VQA) aims to objectively quantify perceptual quality degradation in alignment with human visual perception. Despite recent advances, existing VQA models still suffer from two critical limitations: poor generalization to out-of-distribution (OOD) videos and limited explainability, which restrict their applicability in real-world scenarios. To address these challenges, we propose VQAThinker, a reasoning-based VQA framework that leverages large multimodal models (LMMs) with reinforcement learning to jointly model video quality understanding and scoring, emulating human perceptual decision-making. Specifically, we adopt group relative policy optimization (GRPO), a rule-guided reinforcement learning algorithm that enables reasoning over video quality under score-level supervision, and introduce three VQA-specific rewards: (1) a bell-shaped regression reward that increases rapidly as the prediction error decreases and becomes progressively less sensitive near the ground truth; (2) a pairwise ranking reward that guides the model to correctly determine the relative quality between video pairs; and (3) a temporal consistency reward that encourages the model to prefer temporally coherent videos over their perturbed counterparts. Extensive experiments demonstrate that VQAThinker achieves state-of-the-art performance on both in-domain and OOD VQA benchmarks, showing strong generalization for video quality scoring. Furthermore, evaluations on video quality understanding tasks validate its superiority in distortion attribution and quality description compared to existing explainable VQA models and LMMs. These findings demonstrate that reinforcement learning offers an effective pathway toward building generalizable and explainable VQA models solely with score-level supervision.
[79] LV-Net: Anatomy-aware lateral ventricle shape modeling with a case study on Alzheimer's disease, the Australian Imaging Biomarkers and Lifestyle flagship study of ageing
Wonjung Park,Suhyun Ahn,Jinah Park
Main category: cs.CV
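TL;DR: LV-Net is a new 3D lateral ventricle shape analysis framework that deforms an anatomy-aware joint template mesh to improve reconstruction accuracy and robustness, and it is applied to Alzheimer's disease analysis.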
Details
Motivation: Lateral ventricle shape analysis is a potential biomarker for neurological diseases, but traditional methods fall short because of large shape variability across individuals and segmentation difficulties caused by limited MRI resolution. Method: LV-Net introduces an anatomy-aware joint lateral ventricle-hippocampus template mesh, uses mesh deformation to produce individualized 3D lateral ventricle models, and enhances point correspondence by classifying template vertices according to their anatomical adjacency. Result: LV-Net shows superior reconstruction accuracy and more reliable shape descriptors across diverse datasets, and performs well even when segmentation is imperfect. Conclusion: LV-Net is an effective method for lateral ventricle shape modeling and shows potential value in Alzheimer's disease research. Abstract: Lateral ventricle (LV) shape analysis holds promise as a biomarker for neurological diseases; however, challenges remain due to substantial shape variability across individuals and segmentation difficulties arising from limited MRI resolution. We introduce LV-Net, a novel framework for producing individualized 3D LV meshes from brain MRI by deforming an anatomy-aware joint LV-hippocampus template mesh. By incorporating anatomical relationships embedded within the joint template, LV-Net reduces boundary segmentation artifacts and improves reconstruction robustness. In addition, by classifying the vertices of the template mesh based on their anatomical adjacency, our method enhances point correspondence across subjects, leading to more accurate LV shape statistics. We demonstrate that LV-Net achieves superior reconstruction accuracy, even in the presence of segmentation imperfections, and delivers more reliable shape descriptors across diverse datasets. Finally, we apply LV-Net to Alzheimer's disease analysis, identifying LV subregions that show significant associations with the disease relative to cognitively normal controls. The codes for LV shape modeling are available at https://github.com/PWonjung/LV_Shape_Modeling.
[80] AGI for the Earth, the path, possibilities and how to evaluate intelligence of models that work with Earth Observation Data?
Mojtaba Valipour,Kelly Zheng,James Lowman,Spencer Szabados,Mike Gartner,Bobby Braswell
Main category: cs.CV
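TL;DR: This paper discusses the potential of satellite spectral imagery for Artificial General Intelligence (AGI), points out the shortcomings of existing evaluation benchmarks, and proposes a more comprehensive set of evaluation tasks.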
Details
Motivation: Satellite spectral imagery, as an additional modality, has not yet received the attention it deserves; the paper highlights its unique challenges and its great potential for advancing AGI's ability to understand the natural world. Method: The authors review existing benchmarks and highlight their limitations in evaluating the generalization ability of foundation models in this domain. Result: The paper argues that Earth Observation data is useful for intelligent models and proposes a set of tasks to support more comprehensive benchmarking. Conclusion: The paper emphasizes the need for a more comprehensive benchmark for evaluating Earth observation models and proposes a comprehensive set of tasks to effectively assess a model's ability to understand and interact with Earth observation data. Abstract: Artificial General Intelligence (AGI) is closer than ever to becoming a reality, sparking widespread enthusiasm in the research community to collect and work with various modalities, including text, image, video, and audio. Despite recent efforts, satellite spectral imagery, as an additional modality, has yet to receive the attention it deserves. This area presents unique challenges, but also holds great promise in advancing the capabilities of AGI in understanding the natural world. In this paper, we argue why Earth Observation data is useful for an intelligent model, and then we review existing benchmarks and highlight their limitations in evaluating the generalization ability of foundation models in this domain. This paper emphasizes the need for a more comprehensive benchmark to evaluate earth observation models. To facilitate this, we propose a comprehensive set of tasks that a benchmark should encompass to effectively assess a model's ability to understand and interact with Earth observation data.
[81] Lightweight Quad Bayer HybridEVS Demosaicing via State Space Augmented Cross-Attention
Shiyang Zhou,Haijin Zeng,Yunfan Lu,Yongyong Chen,Jie Liu,Jingyong Su
Main category: cs.CV
TL;DR: This paper proposes TSANet, a lightweight two-stage network for efficient image demosaicing on mobile devices, outperforming existing methods in performance and computational efficiency.
Details
Motivation: The motivation stems from the challenges of combining a Quad Bayer Color Filter Array sensor with event pixels lacking color information in HybridEVS cameras, leading to aliasing and artifacts in the demosaicing process. Current methods struggle to address these issues, especially on resource-limited mobile devices. Method: The paper introduces TSANet, a lightweight Two-stage network using State space augmented cross-Attention, along with a lightweight Cross-Swin State Block that utilizes positional prior for demosaicing and enhances global dependencies through the state space model. Result: TSANet demonstrates excellent demosaicing performance on both simulated and real data of HybridEVS while maintaining a lightweight model, averaging better results than DemosaicFormer across seven diverse datasets in both PSNR and SSIM, while reducing parameter and computation costs by 1.86x and 3.29x. Conclusion: TSANet presents new possibilities for efficient image demosaicing on mobile devices by dividing complex tasks into manageable subtasks, achieving better performance with reduced parameter and computation costs compared to existing methods. Abstract: Event cameras like the Hybrid Event-based Vision Sensor (HybridEVS) camera capture brightness changes as asynchronous "events" instead of frames, offering advanced application on mobile photography. However, challenges arise from combining a Quad Bayer Color Filter Array (CFA) sensor with event pixels lacking color information, resulting in aliasing and artifacts on the demosaicing process before downstream application. Current methods struggle to address these issues, especially on resource-limited mobile devices. In response, we introduce TSANet, a lightweight Two-stage network via State space augmented cross-Attention, which can handle event pixels inpainting and demosaicing separately, leveraging the benefits of dividing complex tasks into manageable subtasks. Furthermore, we introduce a lightweight Cross-Swin State Block that uniquely utilizes positional prior for demosaicing and enhances global dependencies through the state space model with linear complexity. In summary, TSANet demonstrates excellent demosaicing performance on both simulated and real data of HybridEVS while maintaining a lightweight model, averaging better results than the previous state-of-the-art method DemosaicFormer across seven diverse datasets in both PSNR and SSIM, while respectively reducing parameter and computation costs by 1.86x and 3.29x. Our approach presents new possibilities for efficient image demosaicing on mobile devices. Code is available in the supplementary materials.
[82] Distribution-Specific Learning for Joint Salient and Camouflaged Object Detection
Chao Hao,Zitong Yu,Xin Liu,Yuhao Wang,Weicheng Xie,Jingang Shi,Huanjing Yue,Jingyu Yang
Main category: cs.CV
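TL;DR: This paper proposes SCJoint and SBSS to enable joint detection of salient and camouflaged objects, and trains JoNet, a generalist network with strong performance.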
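The core idea of SCJoint as summarized in this entry, a fully shared network plus a small number of task-specific parameters that capture each task's mean and variance, can be illustrated with a normalization layer that keeps one scale and shift vector per task. This is only a sketch under that assumption: the paper's actual layer design is not described here, and `TaskSpecificNorm` is a hypothetical name.

```python
import numpy as np

class TaskSpecificNorm:
    """Shared normalization with per-task scale and shift (assumed design).

    Only 2 * channels parameters are added per task; everything else in the
    network would be shared between SOD and COD.
    """

    def __init__(self, channels, tasks=("sod", "cod")):
        self.scale = {t: np.ones(channels) for t in tasks}   # per-task spread
        self.shift = {t: np.zeros(channels) for t in tasks}  # per-task mean

    def __call__(self, x, task, eps=1e-5):
        # x: (batch, channels). Normalize with shared batch statistics,
        # then re-scale and re-center with the selected task's parameters.
        mean = x.mean(axis=0, keepdims=True)
        var = x.var(axis=0, keepdims=True)
        x_hat = (x - mean) / np.sqrt(var + eps)
        return x_hat * self.scale[task] + self.shift[task]

norm = TaskSpecificNorm(channels=3)
norm.shift["cod"][:] = 1.0  # pretend training moved the COD mean
feats = np.array([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [5.0, 8.0, 2.0]])
out_sod = norm(feats, "sod")
out_cod = norm(feats, "cod")
print(out_sod.mean(axis=0), out_cod.mean(axis=0))
```

The same input features yield differently distributed outputs depending on the task flag, which is one inexpensive way to decouple two tasks inside a shared decoder.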
Details
Motivation: Salient object detection (SOD) and camouflaged object detection (COD) are two related but distinct vision tasks. Prior work has held that learning them jointly degrades performance, but the authors argue that a suitable approach lets both tasks benefit. Method: SCJoint shares the network structure and inserts a small number of task-specific learnable parameters that learn the mean and variance of each task's decoding process, enabling joint learning of the two tasks. A saliency-based sampling strategy (SBSS) is also proposed to balance the training set sizes and improve training efficiency. Result: The authors train a powerful network named JoNet that detects both salient and camouflaged objects, and experiments demonstrate its competitiveness and effectiveness. Conclusion: With a suitable joint learning method, a network can recognize both salient and camouflaged objects and gain performance on both tasks. Abstract: Salient object detection (SOD) and camouflaged object detection (COD) are two closely related but distinct computer vision tasks. Although both are class-agnostic segmentation tasks that map from RGB space to binary space, the former aims to identify the most salient objects in the image, while the latter focuses on detecting perfectly camouflaged objects that blend into the background in the image. These two tasks exhibit strong contradictory attributes. Previous works have mostly believed that joint learning of these two tasks would confuse the network, reducing its performance on both tasks. However, here we present an opposite perspective: with the correct approach to learning, the network can simultaneously possess the capability to find both salient and camouflaged objects, allowing both tasks to benefit from joint learning. We propose SCJoint, a joint learning scheme for SOD and COD tasks, assuming that the decoding processes of SOD and COD have different distribution characteristics. The key to our method is to learn the respective means and variances of the decoding processes for both tasks by inserting a minimal amount of task-specific learnable parameters within a fully shared network structure, thereby decoupling the contradictory attributes of the two tasks at a minimal cost. Furthermore, we propose a saliency-based sampling strategy (SBSS) to sample the training set of the SOD task to balance the training set sizes of the two tasks. In addition, SBSS improves the training set quality and shortens the training time. Based on the proposed SCJoint and SBSS, we train a powerful generalist network, named JoNet, which has the ability to simultaneously capture both "salient" and "camouflaged". Extensive experiments demonstrate the competitive performance and effectiveness of our proposed method. The code is available at https://github.com/linuxsino/JoNet.
[83] Can Large Models Fool the Eye? A New Turing Test for Biological Animation
Zijian Chen,Lirong Deng,Zhengyu Chen,Kaiwei Zhang,Qi Jia,Yuan Tian,Yucheng Zhu,Guangtao Zhai
Main category: cs.CV
TL;DR: BioMotion Arena introduces a visual animation-based evaluation framework for large language models, offering intuitive and discriminative feedback through human evaluation, highlighting significant performance gaps in current models.
Details
Motivation: Current benchmarks for evaluating large language models often lack immediate, intuitive feedback on performance differences. This work aims to address this by introducing a novel visual animation-based evaluation framework. Method: BioMotion Arena uses pairwise comparison evaluation, collecting over 45k votes from human participants on 53 mainstream LLMs and MLLMs across 90 biological motion variants, validated by comparison with expert raters. Result: The crowd-sourced human votes align well with expert ratings, showing the framework's effectiveness in providing discriminative feedback. Over 90% of evaluated models, including advanced ones like InternVL3 and Claude-4 series, failed to produce coherent humanoid point-light motions. Conclusion: BioMotion Arena serves as a challenging benchmark and flexible evaluation framework for evaluating large language models and multimodal large language models through visual animation, without reliance on ground-truth. Abstract: Evaluating the abilities of large models and manifesting their gaps are challenging. Current benchmarks adopt either ground-truth-based score-form evaluation on static datasets or indistinct textual chatbot-style human preferences collection, which may not provide users with immediate, intuitive, and perceptible feedback on performance differences. In this paper, we introduce BioMotion Arena, a novel framework for evaluating large language models (LLMs) and multimodal large language models (MLLMs) via visual animation. Our methodology draws inspiration from the inherent visual perception of motion patterns characteristic of living organisms that utilizes point-light source imaging to amplify the performance discrepancies between models. Specifically, we employ a pairwise comparison evaluation and collect more than 45k votes for 53 mainstream LLMs and MLLMs on 90 biological motion variants. 
Data analyses show that the crowd-sourced human votes are in good agreement with those of expert raters, demonstrating the superiority of our BioMotion Arena in offering discriminative feedback. We also find that over 90% of evaluated models, including the cutting-edge open-source InternVL3 and proprietary Claude-4 series, fail to produce fundamental humanoid point-light groups, much less smooth and biologically plausible motions. This enables BioMotion Arena to serve as a challenging benchmark for performance visualization and a flexible evaluation framework without restrictions on ground-truth.
[84] Towards MR-Based Trochleoplasty Planning
Michael Wehrli,Alicia Durrer,Paul Friedrich,Sidaty El Hadramy,Edwin Li,Luana Brahaj,Carol C. Hasler,Philippe C. Cattin
Main category: cs.CV
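TL;DR: This paper proposes an automated pipeline, based on conventional clinical MR scans, that generates sub-millimeter-resolution, patient-specific 3D pseudo-healthy target morphologies to improve the treatment of trochlear dysplasia.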
Details
Motivation: Current approaches to treating trochlear dysplasia rely mainly on low-resolution clinical MR scans and surgeons' intuition, leading to inconsistent outcomes and limited adoption of minimally invasive techniques. Method: First, an isotropic super-resolved MR volume is computed using an implicit neural representation. Next, a multi-label custom-trained network segments the femur, tibia, patella, and fibula. Finally, a wavelet diffusion model is trained to generate pseudo-healthy target morphologies of the trochlear region. Result: The method can be used pre- and intraoperatively, generates sub-millimeter-resolution 3D shapes, and significantly improves the sulcus angle and trochlear groove depth. It also requires no CT scan, which reduces radiation exposure. Conclusion: The proposed method offers a more precise and personalized solution for treating trochlear dysplasia, improving the consistency and effectiveness of surgery. Abstract: To treat Trochlear Dysplasia (TD), current approaches rely mainly on low-resolution clinical Magnetic Resonance (MR) scans and surgical intuition. The surgeries are planned based on surgeons' experience, have limited adoption of minimally invasive techniques, and lead to inconsistent outcomes. We propose a pipeline that generates super-resolved, patient-specific 3D pseudo-healthy target morphologies from conventional clinical MR scans. First, we compute an isotropic super-resolved MR volume using an Implicit Neural Representation (INR). Next, we segment femur, tibia, patella, and fibula with a multi-label custom-trained network. Finally, we train a Wavelet Diffusion Model (WDM) to generate pseudo-healthy target morphologies of the trochlear region. In contrast to prior work producing pseudo-healthy low-resolution 3D MR images, our approach enables the generation of sub-millimeter resolved 3D shapes compatible for pre- and intraoperative use. These can serve as preoperative blueprints for reshaping the femoral groove while preserving the native patella articulation. Furthermore, and in contrast to other work, we do not require a CT for our pipeline - reducing the amount of radiation. We evaluated our approach on 25 TD patients and could show that our target morphologies significantly improve the sulcus angle (SA) and trochlear groove depth (TGD). The code and interactive visualization are available at https://wehrlimi.github.io/sr-3d-planning/.
[85] DreamVE: Unified Instruction-based Image and Video Editing
Bin Xia,Jiyang Liu,Yuechen Zhang,Bohao Peng,Ruihang Chu,Yitong Wang,Xinglong Wu,Bei Yu,Jiaya Jia
Main category: cs.CV
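TL;DR: DreamVE is a unified instruction-based image and video editing model that uses a two-stage training strategy and data synthesis methods to improve performance and generalization.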
Details
Motivation: Instruction-based video editing is constrained by scarce training data, so more efficient and practical editing models are needed. Method: The authors propose a two-stage training strategy (image editing first, then video editing) and design collage-based and generative model-based data synthesis pipelines together with an efficient editing framework. Result: Pretraining on collage-based data gives DreamVE strong editing performance, but attribute editing performance drops, so the model is further fine-tuned on generative model-based data. Conclusion: DreamVE is a unified model for instruction-based image and video editing that achieves good performance and generalization through its two-stage training strategy and efficient editing framework. Abstract: Instruction-based editing holds vast potential due to its simple and efficient interactive editing format. However, instruction-based editing, particularly for video, has been constrained by limited training data, hindering its practical application. To this end, we introduce DreamVE, a unified model for instruction-based image and video editing. Specifically, we propose a two-stage training strategy: first image editing, then video editing. This offers two main benefits: (1) Image data scales more easily, and models are more efficient to train, providing useful priors for faster and better video editing training. (2) Unifying image and video generation is natural and aligns with current trends. Moreover, we present comprehensive training data synthesis pipelines, including collage-based and generative model-based data synthesis. The collage-based data synthesis combines foreground objects and backgrounds to generate diverse editing data, such as object manipulation, background changes, and text modifications. It can easily generate billions of accurate, consistent, realistic, and diverse editing pairs. We pretrain DreamVE on extensive collage-based data to achieve strong performance in key editing types and enhance generalization and transfer capabilities. However, collage-based data lacks some attribute editing cases, leading to a relative drop in performance. In contrast, the generative model-based pipeline, despite being hard to scale up, offers flexibility in handling attribute editing cases. Therefore, we use generative model-based data to further fine-tune DreamVE. Besides, we design an efficient and powerful editing framework for DreamVE. We build on the SOTA T2V model and use a token concatenation with early drop approach to inject source image guidance, ensuring strong consistency and editability. The codes and models will be released.
[86] SwiftVideo: A Unified Framework for Few-Step Video Generation through Trajectory-Distribution Alignment
Yanxiao Sun,Jiafu Wu,Yun Cao,Chengming Xu,Yabiao Wang,Weijian Cao,Donghao Luo,Chengjie Wang,Yanwei Fu
Main category: cs.CV
TL;DR: SwiftVideo is a unified and stable distillation framework that combines trajectory-preserving and distribution-matching strategies to accelerate video generation while maintaining high quality.
Details
Motivation: Diffusion-based or flow-based video synthesis models require multiple iterative sampling steps, leading to high computational overhead. Existing distillation methods based on trajectory-preserving or distribution-matching suffer from performance breakdown or increased artifacts in few-step settings. Method: The method introduces continuous-time consistency distillation to preserve ODE trajectories and proposes a dual-perspective alignment, including distribution alignment between synthetic and real data and trajectory alignment across different inference steps. Result: Quantitative evaluations on the OpenVid-1M benchmark demonstrate that SwiftVideo significantly outperforms existing approaches in few-step video generation. Conclusion: SwiftVideo maintains high-quality video generation while significantly reducing the number of inference steps, outperforming existing approaches in few-step video generation. Abstract: Diffusion-based or flow-based models have achieved significant progress in video synthesis but require multiple iterative sampling steps, which incurs substantial computational overhead. While many distillation methods that are solely based on trajectory-preserving or distribution-matching have been developed to accelerate video generation models, these approaches often suffer from performance breakdown or increased artifacts under few-step settings. To address these limitations, we propose SwiftVideo, a unified and stable distillation framework that combines the advantages of trajectory-preserving and distribution-matching strategies. Our approach introduces continuous-time consistency distillation to ensure precise preservation of ODE trajectories. Subsequently, we propose a dual-perspective alignment that includes distribution alignment between synthetic and real data along with trajectory alignment across different inference steps. 
Our method maintains high-quality video generation while substantially reducing the number of inference steps. Quantitative evaluations on the OpenVid-1M benchmark demonstrate that our method significantly outperforms existing approaches in few-step video generation.[87] AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance
Weichen Zhang,Zhui Zhu,Ningbo Li,Kebin Liu,Yunhao Liu
Main category: cs.CV
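TL;DR: AdaptInfer is a plug-and-play framework for adaptive vision token pruning in vision-language models (VLMs) that lowers inference cost while surpassing prior pruning methods in accuracy under the same token budget.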
Details
Motivation: Existing pruning methods often rely directly on attention patterns or on static text-prompt guidance and fail to exploit the dynamic internal signals generated during inference. Method: The paper introduces a fine-grained, dynamic text-guided pruning mechanism and performs an offline analysis of cross-modal attention shifts, which motivates a more principled and efficient pruning schedule. Result: Experiments confirm the method's effectiveness; for example, it reduces CUDA latency by 61.3% while maintaining an average accuracy of 92.9%. Conclusion: AdaptInfer is a lightweight, plug-and-play framework that generalizes across multimodal tasks. Abstract: Vision-language models (VLMs) have achieved impressive performance on multimodal reasoning tasks such as visual question answering (VQA), but their inference cost remains a significant challenge due to the large number of vision tokens processed during the prefill stage. Existing pruning methods often rely on directly using the attention patterns or static text prompt guidance, failing to exploit the dynamic internal signals generated during inference. To address these issues, we propose AdaptInfer, a plug-and-play framework for adaptive vision token pruning in VLMs. First, we introduce a fine-grained, dynamic text-guided pruning mechanism that reuses layer-wise text-to-text attention maps to construct soft priors over text-token importance, allowing more informed scoring of vision tokens at each stage. Second, we perform an offline analysis of cross-modal attention shifts and identify consistent inflection locations in inference, which inspire us to propose a more principled and efficient pruning schedule. Our method is lightweight and plug-and-play, also generalizable across multi-modal tasks. Experimental results have verified the effectiveness of the proposed method. For example, it reduces CUDA latency by 61.3% while maintaining an average accuracy of 92.9% on vanilla LLaVA-1.5-7B. Under the same token budget, AdaptInfer surpasses SOTA in accuracy.[88] Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation
Yachun Mi,Yu Li,Yanting Li,Shixin Sun,Chen Hui,Tong Zhang,Yuanyuan Liu,Chenyue Song,Shaohui Liu
Main category: cs.CV
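TL;DR: Q-CLIP is a video quality assessment framework built on vision-language models. It enhances visual and textual representations with a shared cross-modal adapter and learnable quality-level prompts, and it trains only a small number of parameters, which greatly reduces computational cost.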
Details
Motivation: Mainstream video quality assessment (VQA) methods improve performance by pretraining on large-scale classification datasets and then fine-tuning on VQA datasets, but this approach has two notable problems: transferring only the semantic knowledge learned in pretraining is insufficient for VQA, and pretraining demands enormous computational resources. Method: Q-CLIP enhances visual and textual representations through a Shared Cross-Modal Adapter (SCMA) and introduces learnable quality-level prompts that guide the model to perceive subtle quality variations. Result: Q-CLIP performs well on several VQA datasets, and a frame-difference-based sampling strategy improves cross-dataset generalization. Conclusion: Q-CLIP is a fully VLM-based VQA framework that enhances visual and textual representations through a shared cross-modal adapter and learnable quality-level prompts, and it performs well on several VQA datasets. Abstract: Accurate and efficient Video Quality Assessment (VQA) has long been a key research challenge. Current mainstream VQA methods typically improve performance by pretraining on large-scale classification datasets (e.g., ImageNet, Kinetics-400), followed by fine-tuning on VQA datasets. However, this strategy presents two significant challenges: (1) merely transferring semantic knowledge learned from pretraining is insufficient for VQA, as video quality depends on multiple factors (e.g., semantics, distortion, motion, aesthetics); (2) pretraining on large-scale datasets demands enormous computational resources, often dozens or even hundreds of times greater than training directly on VQA datasets. Recently, Vision-Language Models (VLMs) have shown remarkable generalization capabilities across a wide range of visual tasks, and have begun to demonstrate promising potential in quality assessment. In this work, we propose Q-CLIP, the first fully VLMs-based framework for VQA. Q-CLIP enhances both visual and textual representations through a Shared Cross-Modal Adapter (SCMA), which contains only a minimal number of trainable parameters and is the only component that requires training. This design significantly reduces computational cost. In addition, we introduce a set of five learnable quality-level prompts to guide the VLMs in perceiving subtle quality variations, thereby further enhancing the model's sensitivity to video quality. Furthermore, we investigate the impact of different frame sampling strategies on VQA performance, and find that frame-difference-based sampling leads to better generalization performance across datasets. 
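The frame-difference-based sampling mentioned above can be illustrated with a small sketch. The abstract does not describe the procedure, so everything here is an assumption made for illustration: the hypothetical function `frame_difference_sample` scores each frame by its mean absolute pixel difference from the previous frame and keeps the `k` highest-scoring frames in temporal order. It is not Q-CLIP's published sampler.

```python
def frame_difference_sample(frames, k):
    """Return the sorted indices of the k frames that differ most from their predecessor.

    `frames` is a list of equally sized, flattened frames (lists of pixel values).
    This scoring rule is an illustrative assumption, not Q-CLIP's published procedure.
    """
    scores = []
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        # Mean absolute pixel difference between consecutive frames.
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        scores.append((diff, i))
    # Keep the k largest differences, then restore temporal order.
    top = sorted(scores, key=lambda s: s[0], reverse=True)[:k]
    return sorted(i for _, i in top)


# Toy clip: six 2-pixel frames whose content changes at frames 2, 4 and 5.
clip = [[0, 0], [0, 0], [5, 5], [5, 5], [6, 6], [20, 20]]
print(frame_difference_sample(clip, k=2))  # prints [2, 5]
```

Selecting by inter-frame change favors frames where motion or distortion varies, in contrast to uniform sampling, which may repeatedly pick near-identical frames.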
Extensive experiments demonstrate that Q-CLIP exhibits excellent performance on several VQA datasets.[89] E-React: Towards Emotionally Controlled Synthesis of Human Reactions
Chen Zhu,Buzhen Huang,Zijing Wu,Binghui Zuo,Yangang Wang
Main category: cs.CV
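TL;DR: The study proposes a new method that combines emotion analysis with motion generation: it builds an emotion prior through semi-supervised learning and applies it in a diffusion model to generate more natural reaction motions, improving human-computer interaction.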
Details
Motivation: Emotion plays a vital role in everyday human interaction, yet existing motion generation frameworks ignore its influence, which reduces the naturalness of generated motion and limits its use in interactive tasks. The paper therefore aims to improve the naturalness and applicability of reaction motion generation by incorporating emotion. Method: The paper combines a semi-supervised emotion prior with a diffusion model. First, based on the observation that motion clips within a short sequence tend to share the same emotion, it designs a semi-supervised learning framework to train an emotion prior. It then uses this prior to train an actor-reactor diffusion model that accounts for both spatial interaction and emotional response when generating reactions. Result: Experiments show that the method outperforms existing techniques on reaction generation and can produce diverse, realistic reaction motions given an actor's motion sequence and an emotional condition. Conclusion: The paper proposes an emotion-prior-based reaction generation method that produces realistic reactions under different emotional states and outperforms existing reaction generation methods in experiments. Abstract: Emotion serves as an essential component in daily human interactions. Existing human motion generation frameworks do not consider the impact of emotions, which reduces naturalness and limits their application in interactive tasks, such as human reaction synthesis. In this work, we introduce a novel task: generating diverse reaction motions in response to different emotional cues. However, learning emotion representation from limited motion data and incorporating it into a motion generation framework remains a challenging problem. To address the above obstacles, we introduce a semi-supervised emotion prior in an actor-reactor diffusion model to facilitate emotion-driven reaction synthesis. Specifically, based on the observation that motion clips within a short sequence tend to share the same emotion, we first devise a semi-supervised learning framework to train an emotion prior. With this prior, we further train an actor-reactor diffusion model to generate reactions by considering both spatial interaction and emotional response. Finally, given a motion sequence of an actor, our approach can generate realistic reactions under various emotional conditions. Experimental results demonstrate that our model outperforms existing reaction generation methods. The code and data will be made publicly available at https://ereact.github.io/ [90] UGD-IML: A Unified Generative Diffusion-based Framework for Constrained and Unconstrained Image Manipulation Localization
Yachun Mi,Xingyang He,Shixin Sun,Yu Li,Yanting Li,Zhixuan Li,Jian Jin,Chen Hui,Shaohui Liu
Main category: cs.CV
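TL;DR: The paper proposes UGD-IML, a new diffusion-based generative framework that unifies the IML and CIML tasks, reduces reliance on large-scale annotated data, and improves performance under limited-data conditions.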
Details
Motivation: Existing IML methods depend on large amounts of high-quality annotated data, but current datasets lack sufficient scale and diversity, which limits model performance in real-world scenarios. Existing CIML methods typically rely on complex multi-stage pipelines, which makes annotation inefficient. Method: UGD-IML is a diffusion-based generative framework that uses a class embedding mechanism and a parameter-sharing design to switch seamlessly between IML and CIML modes, and its end-to-end design simplifies the data annotation process. Result: Experiments on multiple datasets show that UGD-IML outperforms SOTA methods by an average of 9.66 and 4.36 in F1 on the IML and CIML tasks, respectively, and also performs well in uncertainty estimation, visualization, and robustness. Conclusion: UGD-IML unifies the IML and CIML tasks with a generative diffusion model, reduces reliance on large-scale annotated datasets, remains effective under limited-data conditions, and achieves superior performance on multiple datasets. Abstract: In the digital age, advanced image editing tools pose a serious threat to the integrity of visual content, making image forgery detection and localization a key research focus. Most existing Image Manipulation Localization (IML) methods rely on discriminative learning and require large, high-quality annotated datasets. However, current datasets lack sufficient scale and diversity, limiting model performance in real-world scenarios. To overcome this, recent studies have explored Constrained IML (CIML), which generates pixel-level annotations through algorithmic supervision. However, existing CIML approaches often depend on complex multi-stage pipelines, making the annotation process inefficient. In this work, we propose a novel generative framework based on diffusion models, named UGD-IML, which for the first time unifies both IML and CIML tasks within a single framework. By learning the underlying data distribution, generative diffusion models inherently reduce the reliance on large-scale labeled datasets, allowing our approach to perform effectively even under limited data conditions. In addition, by leveraging a class embedding mechanism and a parameter-sharing design, our model seamlessly switches between IML and CIML modes without extra components or training overhead. Furthermore, the end-to-end design enables our model to avoid cumbersome steps in the data annotation process. Extensive experimental results on multiple datasets demonstrate that UGD-IML outperforms the SOTA methods by an average of 9.66 and 4.36 in terms of F1 metrics for IML and CIML tasks, respectively. 
Moreover, the proposed method also excels in uncertainty estimation, visualization and robustness.[91] MCA: 2D-3D Retrieval with Noisy Labels via Multi-level Adaptive Correction and Alignment
Gui Zou,Chaofan Gan,Chern Hong Lim,Supavadee Aramvith,Weiyao Lin
Main category: cs.CV
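TL;DR: The paper proposes MCA, a new framework that uses multimodal joint label correction and a multi-level alignment strategy to significantly improve 2D-3D cross-modal retrieval under noisy labels.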
Details
Motivation: As 2D and 3D data become more widely available, cross-modal retrieval has advanced, but imperfect annotations challenge model robustness. Method: The paper introduces a Multimodal Joint label Correction (MJC) mechanism and a Multi-level Adaptive Alignment (MAA) strategy to improve the robustness and performance of cross-modal retrieval. Result: Experiments show that MCA achieves state-of-the-art performance on both conventional and noisy 3D benchmarks. Conclusion: The MCA framework performs well on 2D-3D cross-modal retrieval with noisy labels and shows broad applicability and effectiveness. Abstract: With the increasing availability of 2D and 3D data, significant advancements have been made in the field of cross-modal retrieval. Nevertheless, the existence of imperfect annotations presents considerable challenges, demanding robust solutions for 2D-3D cross-modal retrieval in the presence of noisy label conditions. Existing methods generally address the issue of noise by dividing samples independently within each modality, making them susceptible to overfitting on corrupted labels. To address these issues, we propose a robust 2D-3D Multi-level cross-modal adaptive Correction and Alignment framework (MCA). Specifically, we introduce a Multimodal Joint label Correction (MJC) mechanism that leverages multimodal historical self-predictions to jointly model the modality prediction consistency, enabling reliable label refinement. Additionally, we propose a Multi-level Adaptive Alignment (MAA) strategy to effectively enhance cross-modal feature semantics and discrimination across different levels. Extensive experiments demonstrate the superiority of our method, MCA, which achieves state-of-the-art performance on both conventional and realistic noisy 3D benchmarks, highlighting its generality and effectiveness.[92] Mask & Match: Learning to Recognize Handwritten Math with Self-Supervised Attention
Shree Mitra,Ritabrata Chakraborty,Nilkanta Sahu
Main category: cs.CV
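TL;DR: The paper proposes a self-supervised method for handwritten mathematical expression recognition that combines image-encoder pretraining, a self-supervised attention network, and a Transformer decoder, achieving strong recognition performance without large amounts of labeled data.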
Details
Motivation: The two-dimensional structure of handwritten mathematical expressions, varying symbol scales, and complex spatial relationships among symbols make recognition highly challenging, so a self-supervised method that does not need expensive labeled data is required. Method: The method first pretrains an image encoder with global and local contrastive losses, then trains a novel self-supervised attention network with a progressive spatial masking strategy, and finally combines the encoder and attention mechanism with a Transformer decoder in supervised fine-tuning to generate LaTeX sequences. Result: Experiments show that the method outperforms existing self-supervised and fully supervised baselines on the CROHME benchmarks, validating the progressive attention mechanism's contribution to handwritten mathematical expression recognition. Conclusion: The paper proposes a self-supervised learning framework for handwritten mathematical expression recognition that eliminates the need for labeled data, and experiments on the CROHME benchmarks validate its effectiveness. Abstract: Recognizing handwritten mathematical expressions (HMER) is a challenging task due to the inherent two-dimensional structure, varying symbol scales, and complex spatial relationships among symbols. In this paper, we present a self-supervised learning (SSL) framework for HMER that eliminates the need for expensive labeled data. Our approach begins by pretraining an image encoder using a combination of global and local contrastive loss, enabling the model to learn both holistic and fine-grained representations. A key contribution of this work is a novel self-supervised attention network, which is trained using a progressive spatial masking strategy. This attention mechanism is designed to learn semantically meaningful focus regions, such as operators, exponents, and nested mathematical notation, without requiring any supervision. The progressive masking curriculum encourages the network to become increasingly robust to missing or occluded visual information, ultimately improving structural understanding. Our complete pipeline consists of (1) self-supervised pretraining of the encoder, (2) self-supervised attention learning, and (3) supervised fine-tuning with a transformer decoder to generate LaTeX sequences. Extensive experiments on CROHME benchmarks demonstrate that our method outperforms existing SSL and fully supervised baselines, validating the effectiveness of our progressive attention mechanism in enhancing HMER performance. Our codebase can be found here.[93] FMCE-Net++: Feature Map Convergence Evaluation and Training
Zhibo Zhu,Renyu Huang,Lei He
Main category: cs.CV
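TL;DR: The paper proposes FMCE-Net++, a new training framework that uses a pretrained FMCE-Net as an auxiliary head to generate FMCS predictions, which jointly supervise backbone optimization together with task labels. This addresses the lack of experimental validation and closed-loop integration in Feature Map Convergence Evaluation (FMCE). Experiments show that FMCE-Net++ consistently improves model performance and can raise state-of-the-art performance ceilings.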
Details
Motivation: To address the lack of experimental validation and closed-loop integration in Feature Map Convergence Evaluation (FMCE). Method: A pretrained FMCE-Net serves as an auxiliary head that generates FMCS predictions, which, together with task labels, jointly supervise backbone optimization through a Representation Auxiliary Loss. Result: Extensive experiments on MNIST, CIFAR-10, FashionMNIST, and CIFAR-100 show that FMCE-Net++ consistently improves model performance, with accuracy gains of +1.16 percentage points (ResNet-50/CIFAR-10) and +1.08 percentage points (ShuffleNet v2/CIFAR-100). Conclusion: FMCE-Net++ can effectively raise state-of-the-art performance ceilings. Abstract: Deep Neural Networks (DNNs) face interpretability challenges due to their opaque internal representations. While Feature Map Convergence Evaluation (FMCE) quantifies module-level convergence via Feature Map Convergence Scores (FMCS), it lacks experimental validation and closed-loop integration. To address this limitation, we propose FMCE-Net++, a novel training framework that integrates a pretrained, frozen FMCE-Net as an auxiliary head. This module generates FMCS predictions, which, combined with task labels, jointly supervise backbone optimization through a Representation Auxiliary Loss (RAL). The RAL dynamically balances the primary classification loss and feature convergence optimization via a tunable Representation Abstraction Factor. Extensive experiments conducted on MNIST, CIFAR-10, FashionMNIST, and CIFAR-100 demonstrate that FMCE-Net++ consistently enhances model performance without architectural modifications or additional data. Key experimental outcomes include accuracy gains of +1.16 pp (ResNet-50/CIFAR-10) and +1.08 pp (ShuffleNet v2/CIFAR-100), validating that FMCE-Net++ can effectively elevate state-of-the-art performance ceilings.[94] Effective Training Data Synthesis for Improving MLLM Chart Understanding
Yuwei Yang,Zeyu Zhang,Yunzhong Hou,Zhuowan Li,Gaowen Liu,Ali Payani,Yuan-Sen Ting,Liang Zheng
Main category: cs.CV
TL;DR: This study introduces the Effective Chart Dataset (ECD), which improves the ability of multimodal large language models (MLLMs) to understand complex scientific plots by using a modular and visually diverse data synthesis approach.
Details
Motivation: The motivation behind the study is to improve the performance of open-source multimodal large language models (MLLMs) in understanding scientific plots, as existing models have a low success rate on challenging benchmarks. Method: The researchers designed a five-step data synthesis pipeline to generate a fine-tuning dataset called the Effective Chart Dataset (ECD), which includes 10k+ chart images and 300k+ QA pairs. This dataset was used to train and evaluate various MLLMs. Result: The proposed approach improved the performance of various MLLMs on both real-world and synthetic test sets, demonstrating the effectiveness of the ECD dataset. Conclusion: The study concludes that modularizing chart generation and diversifying visual details enhance chart understanding capabilities of MLLMs. Abstract: Being able to effectively read scientific plots, or chart understanding, is a central part toward building effective agents for science. However, existing multimodal large language models (MLLMs), especially open-source ones, are still falling behind with a typical success rate of 30%-50% on challenging benchmarks. Previous studies on fine-tuning MLLMs with synthetic charts are often restricted by their inadequate similarity to the real charts, which could compromise model training and performance on complex real-world charts. In this study, we show that modularizing chart generation and diversifying visual details improves chart understanding capabilities. In particular, we design a five-step data synthesis pipeline, where we separate data and function creation for single plot generation, condition the generation of later subplots on earlier ones for multi-subplot figures, visually diversify the generated figures, filter out low quality data, and finally generate the question-answer (QA) pairs with GPT-4o. 
This approach allows us to streamline the generation of fine-tuning datasets and introduce the effective chart dataset (ECD), which contains 10k+ chart images and 300k+ QA pairs, covering 25 topics and featuring 250+ chart type combinations with high visual complexity. We show that ECD consistently improves the performance of various MLLMs on a range of real-world and synthetic test sets. Code, data and models are available at: https://github.com/yuweiyang-anu/ECD.[95] GMF-Drive: Gated Mamba Fusion with Spatial-Aware BEV Representation for End-to-End Autonomous Driving
Jian Wang,Chaokang Jiang,Haitao Xu
Main category: cs.CV
TL;DR: This paper introduces GMF-Drive, an end-to-end autonomous driving framework that improves upon diffusion-based models by replacing transformer-based fusion with a more efficient and spatially-aware state-space model.
Details
Motivation: The motivation is to address the limitations of diffusion-based models in autonomous driving, which rely heavily on transformer-based fusion. These transformers have quadratic computational complexity and lack spatial priors, limiting their effectiveness in modeling Bird's Eye View (BEV) representations. Method: The paper proposes GMF-Drive, which includes a geometrically-augmented pillar format for LiDAR representation and a hierarchical gated mamba fusion (GM-Fusion) architecture that replaces transformers with a spatially-aware state-space model (SSM). The method aims to overcome the limitations of transformer-based fusion by using directional sequencing and adaptive fusion mechanisms. Result: The result shows that GMF-Drive achieves new state-of-the-art performance on the NAVSIM benchmark, significantly outperforming DiffusionDrive. Ablation studies confirm the effectiveness of each component of GMF-Drive. Conclusion: The paper concludes that GMF-Drive, with its innovative geometrically-augmented pillar format and hierarchical gated mamba fusion architecture, achieves state-of-the-art performance on the NAVSIM benchmark, surpassing DiffusionDrive in both performance and efficiency. Abstract: Diffusion-based models are redefining the state-of-the-art in end-to-end autonomous driving, yet their performance is increasingly hampered by a reliance on transformer-based fusion. These architectures face fundamental limitations: quadratic computational complexity restricts the use of high-resolution features, and a lack of spatial priors prevents them from effectively modeling the inherent structure of Bird's Eye View (BEV) representations. This paper introduces GMF-Drive (Gated Mamba Fusion for Driving), an end-to-end framework that overcomes these challenges through two principled innovations. 
First, we supersede the information-limited histogram-based LiDAR representation with a geometrically-augmented pillar format encoding shape descriptors and statistical features, preserving critical 3D geometric details. Second, we propose a novel hierarchical gated mamba fusion (GM-Fusion) architecture that substitutes an expensive transformer with a highly efficient, spatially-aware state-space model (SSM). Our core BEV-SSM leverages directional sequencing and adaptive fusion mechanisms to capture long-range dependencies with linear complexity, while explicitly respecting the unique spatial properties of the driving scene. Extensive experiments on the challenging NAVSIM benchmark demonstrate that GMF-Drive achieves a new state-of-the-art performance, significantly outperforming DiffusionDrive. Comprehensive ablation studies validate the efficacy of each component, demonstrating that task-specific SSMs can surpass a general-purpose transformer in both performance and efficiency for autonomous driving.[96] SynSeg: Feature Synergy for Multi-Category Contrastive Learning in Open-Vocabulary Semantic Segmentation
Weichen Zhang,Kebin Liu,Fan Dang,Zhui Zhu,Xikai Sun,Yunhao Liu
Main category: cs.CV
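TL;DR: The study proposes SynSeg, a new weakly supervised semantic segmentation method that addresses the challenges of open-vocabulary segmentation through multi-category contrastive learning and a feature synergy structure.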
Details
Motivation: Existing weakly supervised methods rely on category-specific supervision and ill-suited feature construction for contrastive learning, leading to semantic misalignment and poor performance. Method: The paper proposes SynSeg, a new weakly supervised approach comprising Multi-Category Contrastive Learning (MCCL) and a Feature Synergy Structure (FSS). Result: Experiments show that the method outperforms the state of the art on benchmarks, with gains of 4.5% on VOC, 8.9% on Context, 2.6% on Object, and 2.0% on City. Conclusion: SynSeg effectively improves semantic localization and discrimination under weak supervision. Abstract: Semantic segmentation in open-vocabulary scenarios presents significant challenges due to the wide range and granularity of semantic categories. Existing weakly-supervised methods often rely on category-specific supervision and ill-suited feature construction methods for contrastive learning, leading to semantic misalignment and poor performance. In this work, we propose a novel weakly-supervised approach, SynSeg, to address the challenges. SynSeg performs Multi-Category Contrastive Learning (MCCL) as a stronger training signal with a new feature reconstruction framework named Feature Synergy Structure (FSS). Specifically, MCCL strategy robustly combines both intra- and inter-category alignment and separation in order to make the model learn the knowledge of correlations from different categories within the same image. Moreover, FSS reconstructs discriminative features for contrastive learning through prior fusion and semantic-activation-map enhancement, effectively avoiding the foreground bias introduced by the visual encoder. In general, SynSeg effectively improves the abilities in semantic localization and discrimination under weak supervision. Extensive experiments on benchmarks demonstrate that our method outperforms state-of-the-art (SOTA) performance. For instance, SynSeg achieves higher accuracy than SOTA baselines by 4.5% on VOC, 8.9% on Context, 2.6% on Object and 2.0% on City.[97] Learning Representations of Satellite Images with Evaluations on Synoptic Weather Events
Ting-Shuo Yo,Shih-Hao Su,Chien-Ming Wu,Wei-Ting Chen,Jung-Lien Chu,Chiao-Wei Chang,Hung-Chi Kuo
Main category: cs.CV
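TL;DR: The study compares representation learning algorithms on satellite images and finds that the convolutional autoencoder (CAE) performs best across all classification tasks, although its learned representations lack a direct connection to physical attributes. It proposes developing a physics-informed CAE as future work.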
Details
Motivation: To explore and evaluate the latent spaces that different representation learning algorithms learn from satellite images, with the aim of improving weather event classification accuracy. Method: Three representation learning algorithms, principal component analysis (PCA), a convolutional autoencoder (CAE), and a pre-trained residual network (PT), are applied and evaluated through classification tasks. Result: CAE shows higher threat scores across all classification tasks; PCA yields high hit rates but also high false-alarm rates; PT excels at recognizing tropical cyclones but performs worse on other tasks; latent spaces learned from higher-resolution datasets are better for the deep-learning algorithms; and smaller latent space dimensions significantly increase the false-alarm rate. Conclusion: CAE performs best but lacks a direct connection to physical attributes, so developing a physics-informed CAE is a direction for future research. Abstract: This study applied representation learning algorithms to satellite images and evaluated the learned latent spaces with classifications of various weather events. The algorithms investigated include the classical linear transformation, i.e., principal component analysis (PCA), state-of-the-art deep learning method, i.e., convolutional autoencoder (CAE), and a residual network pre-trained with large image datasets (PT). The experiment results indicated that the latent space learned by CAE consistently showed higher threat scores for all classification tasks. The classifications with PCA yielded high hit rates but also high false-alarm rates. In addition, the PT performed exceptionally well at recognizing tropical cyclones but was inferior in other tasks. Further experiments suggested that representations learned from higher-resolution datasets are superior in all classification tasks for deep-learning algorithms, i.e., CAE and PT. We also found that smaller latent space sizes had minor impact on the classification task's hit rate. Still, a latent space dimension smaller than 128 caused a significantly higher false alarm rate. Though the CAE can learn latent spaces effectively and efficiently, the interpretation of the learned representation lacks direct connections to physical attributions. Therefore, developing a physics-informed version of CAE can be a promising outlook for the current work.[98] SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning
Lin Zhang,Xianfang Zeng,Kangcong Li,Gang Yu,Tao Chen
Main category: cs.CV
TL;DR: This paper proposes SC-Captioner, a reinforcement learning framework that enables the self-correcting capability of image caption models, and a new dataset named RefinedCaps. The proposed method outperforms the direct preference optimization training strategy in generating image captions.
Details
Motivation: Our crucial technique lies in the design of the reward function to incentivize accurate caption corrections. Method: We propose SC-Captioner, a reinforcement learning framework that enables the self-correcting capability of image caption models. For image caption quality assessment, we propose a set of metrics refined from CAPTURE that alleviate its incomplete precision evaluation and inefficient relation matching problems. We collect a fine-grained annotated image caption dataset, RefinedCaps, consisting of 6.5K diverse images from COCO dataset. Result: Experiments show that applying SC-Captioner on large visual-language models can generate better image captions across various scenarios, significantly outperforming the direct preference optimization training strategy. Conclusion: Applying SC-Captioner on large visual-language models can generate better image captions across various scenarios, significantly outperforming the direct preference optimization training strategy. Abstract: We propose SC-Captioner, a reinforcement learning framework that enables the self-correcting capability of image caption models. Our crucial technique lies in the design of the reward function to incentivize accurate caption corrections. Specifically, the predicted and reference captions are decomposed into object, attribute, and relation sets using scene-graph parsing algorithms. We calculate the set difference between sets of initial and self-corrected captions to identify added and removed elements. These elements are matched against the reference sets to calculate correctness bonuses for accurate refinements and mistake punishments for wrong additions and removals, thereby forming the final reward. For image caption quality assessment, we propose a set of metrics refined from CAPTURE that alleviate its incomplete precision evaluation and inefficient relation matching problems. 
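The reward described above (set differences between the initial and self-corrected captions, matched against reference sets) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `correction_reward`, the unit `bonus` and `penalty` weights, and exact string matching are all hypothetical, and the scene-graph parsing that produces the element sets is assumed to have run already.

```python
def correction_reward(initial, corrected, reference, bonus=1.0, penalty=1.0):
    """Score a caption self-correction from three sets of parsed elements.

    `initial` and `corrected` hold the elements (objects, attributes, or relations)
    of the caption before and after self-correction; `reference` holds the elements
    of the reference caption. Weights and exact matching are illustrative assumptions.
    """
    added = corrected - initial      # elements introduced by the correction
    removed = initial - corrected    # elements dropped by the correction
    reward = 0.0
    for element in added:
        # Adding a reference element is an accurate refinement; otherwise a mistake.
        reward += bonus if element in reference else -penalty
    for element in removed:
        # Removing a non-reference element is an accurate refinement; otherwise a mistake.
        reward += bonus if element not in reference else -penalty
    return reward


initial = {"dog", "red ball"}
corrected = {"dog", "blue ball", "frisbee"}
reference = {"dog", "blue ball"}
print(correction_reward(initial, corrected, reference))  # prints 1.0
```

In the toy example, adding "blue ball" and removing "red ball" each earn a bonus, while adding "frisbee" is penalized, so the net reward is 1.0. Elements left unchanged contribute nothing, which keeps the signal focused on what the correction step actually changed.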
Furthermore, we collect a fine-grained annotated image caption dataset, RefinedCaps, consisting of 6.5K diverse images from COCO dataset. Experiments show that applying SC-Captioner on large visual-language models can generate better image captions across various scenarios, significantly outperforming the direct preference optimization training strategy.[99] SAM Encoder Breach by Adversarial Simplicial Complex Triggers Downstream Model Failures
Yi Qin,Rui Wang,Tao Huang,Tong Xiao,Liping Jing
Main category: cs.CV
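TL;DR: The paper proposes VeSCA, a new adversarial attack that uses the encoder of the Segment Anything Model (SAM) to generate transferable adversarial examples and identify the vulnerable regions shared by SAM and downstream models.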
Details
Motivation: SAM is widely used for its zero-shot abilities in interactive segmentation, so its inherent vulnerabilities could cause many downstream applications to fail, and these transferable vulnerabilities need to be evaluated proactively. Method: VeSCA explicitly characterizes the vulnerable regions shared by SAM and downstream models through a parametric simplicial complex and identifies such complexes by iterative vertex-wise refinement. A lightweight domain re-adaptation strategy uses a small amount of reference data to initialize the simplicial complex, and adversarial examples are generated through random simplicial complex sampling. Result: Experiments show that VeSCA improves performance by 12.7% over state-of-the-art methods across three downstream model categories on five domain-specific datasets. Conclusion: VeSCA effectively identifies SAM's vulnerabilities and generates transferable adversarial examples, revealing the risk these vulnerabilities pose to downstream models and underscoring the urgency of developing more robust foundation models. Abstract: While the Segment Anything Model (SAM) transforms interactive segmentation with zero-shot abilities, its inherent vulnerabilities present a single-point risk, potentially leading to the failure of numerous downstream applications. Proactively evaluating these transferable vulnerabilities is thus imperative. Prior adversarial attacks on SAM often present limited transferability due to insufficient exploration of common weakness across domains. To address this, we propose Vertex-Refining Simplicial Complex Attack (VeSCA), a novel method that leverages only the encoder of SAM for generating transferable adversarial examples. Specifically, it achieves this by explicitly characterizing the shared vulnerable regions between SAM and downstream models through a parametric simplicial complex. Our goal is to identify such complexes within adversarially potent regions by iterative vertex-wise refinement. A lightweight domain re-adaptation strategy is introduced to bridge domain divergence using minimal reference data during the initialization of simplicial complex. Ultimately, VeSCA generates consistently transferable adversarial examples through random simplicial complex sampling. Extensive experiments demonstrate that VeSCA achieves performance improved by 12.7% compared to state-of-the-art methods across three downstream model categories across five domain-specific datasets. Our findings further highlight the downstream model risks posed by SAM's vulnerabilities and emphasize the urgency of developing more robust foundation models.[100] Roll Your Eyes: Gaze Redirection via Explicit 3D Eyeball Rotation
YoungChan Choi,HengFei Wang,YiHua Cheng,Boeun Kim,Hyung Jin Chang,YoungGeun Choi,Sang-Il Choi
Main category: cs.CV
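TL;DR: This paper proposes a gaze redirection framework built on an explicit 3D eyeball structure; using 3D Gaussian Splatting and an adaptive deformation module, it generates high-quality gaze images and outperforms existing methods.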
Details
Motivation: Existing gaze redirection methods are typically based on neural radiance fields (NeRF), whose implicit neural representations, rendered by volume rendering, do not explicitly model the rotation and translation of 3D representations. This paper addresses the problem with an explicit 3D eyeball structure. Method: The paper represents the eyeball structure with 3D Gaussian Splatting (3DGS) and introduces an adaptive deformation module to reproduce subtle muscle movements around the eyes. Result: Experiments on the ETH-XGaze dataset show that the method generates diverse novel gaze images and outperforms previous state-of-the-art methods in image quality and gaze estimation accuracy. Conclusion: The paper proposes a novel 3D gaze redirection framework based on an explicit 3D eyeball structure; by explicitly modeling eyeball rotation and translation, it achieves better image quality and gaze estimation accuracy than previous state-of-the-art methods. Abstract: We propose a novel 3D gaze redirection framework that leverages an explicit 3D eyeball structure. Existing gaze redirection methods are typically based on neural radiance fields, which employ implicit neural representations via volume rendering. Unlike these NeRF-based approaches, where the rotation and translation of 3D representations are not explicitly modeled, we introduce a dedicated 3D eyeball structure to represent the eyeballs with 3D Gaussian Splatting (3DGS). Our method generates photorealistic images that faithfully reproduce the desired gaze direction by explicitly rotating and translating the 3D eyeball structure. In addition, we propose an adaptive deformation module that enables the replication of subtle muscle movements around the eyes. Through experiments conducted on the ETH-XGaze dataset, we demonstrate that our framework is capable of generating diverse novel gaze images, achieving superior image quality and gaze estimation accuracy compared to previous state-of-the-art methods.[101] DiffCap: Diffusion-based Real-time Human Motion Capture using Sparse IMUs and a Monocular Camera
Shaohua Pan,Xinyu Yi,Yan Zhou,Weihua Jian,Yuan Zhang,Pengfei Wan,Feng Xu
Main category: cs.CV
TL;DR: This paper proposes a diffusion-based method that fuses sparse IMU and monocular camera signals for efficient real-time human motion capture.
Details
Motivation: Combining sparse IMUs with a monocular camera is a promising new setting for real-time human motion capture. Method: A diffusion-based solution fuses the sparse IMU and monocular camera signals in a unified framework. Result: Experiments validate the effectiveness of the system design and its superior performance in pose estimation. Conclusion: The experimental results show that the system design performs strongly in pose estimation, reaching state-of-the-art performance. Abstract: Combining sparse IMUs and a monocular camera is a new promising setting to perform real-time human motion capture. This paper proposes a diffusion-based solution to learn human motion priors and fuse the two modalities of signals together seamlessly in a unified framework. By delicately considering the characteristics of the two signals, the sequential visual information is considered as a whole and transformed into a condition embedding, while the inertial measurement is concatenated with the noisy body pose frame by frame to construct a sequential input for the diffusion model. Firstly, we observe that the visual information may be unavailable in some frames due to occlusions or subjects moving out of the camera view. Thus incorporating the sequential visual features as a whole to get a single feature embedding is robust to the occasional degenerations of visual information in those frames. On the other hand, the IMU measurements are robust to occlusions and always stable when signal transmission has no problem. So incorporating them frame-wisely could better explore the temporal information for the system. Experiments have demonstrated the effectiveness of the system design and its state-of-the-art performance in pose estimation compared with the previous works. Our codes are available for research at https://shaohua-pan.github.io/diffcap-page.[102] SDEval: Safety Dynamic Evaluation for Multimodal Large Language Models
Hanqing Wang,Yuan Tian,Mingyu Liu,Zhenhao Zhang,Xiangyang Zhu
Main category: cs.CV
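TL;DR: This paper proposes SDEval, a safety dynamic evaluation framework that addresses the safety of multimodal large language model outputs, and validates its effectiveness on several evaluation benchmarks.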
Details
Motivation: Existing safety evaluation datasets for multimodal large language models may become outdated as models advance and are susceptible to data contamination. Method: SDEval mainly uses text, image, and text-image dynamic strategies to generate new samples, and explores how the different dynamic strategies affect model safety. Result: Experiments show that, on both safety and capability benchmarks, SDEval significantly influences safety evaluation, mitigates data contamination, and exposes safety limitations of multimodal large language models. Conclusion: SDEval is a general safety dynamic evaluation framework that effectively counters data contamination and exposes the safety limitations of multimodal large language models. Abstract: In the rapidly evolving landscape of Multimodal Large Language Models (MLLMs), the safety concerns of their outputs have earned significant attention. Although numerous datasets have been proposed, they may become outdated with MLLM advancements and are susceptible to data contamination issues. To address these problems, we propose SDEval, the first safety dynamic evaluation framework to controllably adjust the distribution and complexity of safety benchmarks. Specifically, SDEval mainly adopts three dynamic strategies: text, image, and text-image dynamics to generate new samples from original benchmarks. We first explore the individual effects of text and image dynamics on model safety. Then, we find that injecting text dynamics into images can further impact safety, and conversely, injecting image dynamics into text also leads to safety risks. SDEval is general enough to be applied to various existing safety and even capability benchmarks. Experiments across safety benchmarks, MLLMGuard and VLSBench, and capability benchmarks, MMBench and MMVet, show that SDEval significantly influences safety evaluation, mitigates data contamination, and exposes safety limitations of MLLMs. Code is available at https://github.com/hq-King/SDEval[103] Text-guided Visual Prompt DINO for Generic Segmentation
Yuchen Guan,Chong Sun,Canmiao Fu,Zhipeng Huang,Chun Yuan,Chen Li
Main category: cs.CV
TL;DR: Prompt-DINO is a new multimodal detection framework that combines text and visual information for more effective open-world object detection.
Details
Motivation: To address limitations of multimodal vision models in late-stage feature fusion, query selection for hybrid prompts, and caption-derived vocabularies. Method: Proposes the Prompt-DINO framework, comprising early fusion of text/visual prompts, order-aligned query selection, and a generative data engine based on the RAP model. Result: Experiments show that Prompt-DINO performs best on open-world detection tasks, and the generative data engine reduces label noise by 80.5%. Conclusion: Through its early fusion mechanism, order-aligned query selection, and generative data engine, Prompt-DINO achieves state-of-the-art performance on open-world detection tasks and significantly expands semantic coverage. Abstract: Recent advancements in multimodal vision models have highlighted limitations in late-stage feature fusion and suboptimal query selection for hybrid prompts open-world segmentation, alongside constraints from caption-derived vocabularies. To address these challenges, we propose Prompt-DINO, a text-guided visual Prompt DINO framework featuring three key innovations. First, we introduce an early fusion mechanism that unifies text/visual prompts and backbone features at the initial encoding stage, enabling deeper cross-modal interactions to resolve semantic ambiguities. Second, we design order-aligned query selection for DETR-based architectures, explicitly optimizing the structural alignment between text and visual queries during decoding to enhance semantic-spatial consistency. Third, we develop a generative data engine powered by the Recognize Anything via Prompting (RAP) model, which synthesizes 0.5B diverse training instances through a dual-path cross-verification pipeline, reducing label noise by 80.5% compared to conventional approaches. Extensive experiments demonstrate that Prompt-DINO achieves state-of-the-art performance on open-world detection benchmarks while significantly expanding semantic coverage beyond fixed-vocabulary constraints. Our work establishes a new paradigm for scalable multimodal detection and data generation in open-world scenarios. Data&Code are available at https://github.com/WeChatCV/WeVisionOne.[104] DSConv: Dynamic Splitting Convolution for Pansharpening
Xuanyu Liu,Bonan An
Main category: cs.CV
TL;DR: DSConv is a new dynamic kernel-splitting convolution that, combined with attention, effectively improves performance on pansharpening.
Details
Motivation: Existing pansharpening methods rely mainly on standard convolutions and rarely use adaptive convolutions. DSConv aims to improve feature extraction through a dynamic kernel-splitting strategy and thereby improve network performance. Method: DSConv dynamically splits convolution kernels in conjunction with an attention mechanism, selecting positions of interest and splitting the original kernel into multiple smaller kernels. Result: Experiments show that DSConv performs well on pansharpening with state-of-the-art performance, and comprehensive, rigorous discussion demonstrates its superiority and optimal usage conditions. Conclusion: DSConv is a novel dynamic kernel-splitting strategy that, combined with attention, more effectively extracts features at different positions within the receptive field, improving the network's generalization, optimization, and feature representation capabilities. DSConv performs well on pansharpening, and the experiments demonstrate its effectiveness and state-of-the-art performance. Abstract: Aiming to obtain a high-resolution image, pansharpening involves the fusion of a multi-spectral image (MS) and a panchromatic image (PAN), the low-level vision task remaining significant and challenging in contemporary research. Most existing approaches rely predominantly on standard convolutions, few making the effort to adaptive convolutions, which are effective owing to the inter-pixel correlations of remote sensing images. In this paper, we propose a novel strategy for dynamically splitting convolution kernels in conjunction with attention, selecting positions of interest, and splitting the original convolution kernel into multiple smaller kernels, named DSConv. The proposed DSConv more effectively extracts features of different positions within the receptive field, enhancing the network's generalization, optimization, and feature representation capabilities. Furthermore, we innovate and enrich concepts of dynamic splitting convolution and provide a novel network architecture for pansharpening capable of achieving the tasks more efficiently, building upon this methodology. Adequate fair experiments illustrate the effectiveness and the state-of-the-art performance attained by DSConv. Comprehensive and rigorous discussions proved the superiority and optimal usage conditions of DSConv.[105] VISTAR: A User-Centric and Role-Driven Benchmark for Text-to-Image Evaluation
Kaiyuan Jiang,Ruoxi Sun,Ying Cao,Yuqi Xu,Xinran Zhang,Junyan Guo,ChengSheng Deng
Main category: cs.CV
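TL;DR: VISTAR is a user-centric, multi-dimensional evaluation benchmark that improves text-to-image evaluation with deterministic metrics and a hierarchical weighted questioning scheme, achieving high human alignment and offering guidance for domain-specific deployment.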
Details
Motivation: Existing text-to-image evaluation metrics have limitations; a more comprehensive, user-centric benchmark is needed that aligns better with human evaluation and provides actionable guidance. Method: VISTAR adopts a two-tier hybrid evaluation paradigm: deterministic, scriptable metrics for physically quantifiable attributes, and a novel Hierarchical Weighted P/N Questioning (HWPQ) scheme that uses constrained vision-language models to assess abstract semantics. Result: VISTAR's metrics achieve high human alignment (>75%), with the HWPQ scheme reaching 85.9% accuracy on abstract semantics, significantly outperforming VQA baselines. Conclusion: VISTAR provides a comprehensive, user-centric evaluation benchmark for text-to-image generation, addresses the limitations of existing metrics, and releases its resources publicly to foster reproducible T2I assessment. Abstract: We present VISTAR, a user-centric, multi-dimensional benchmark for text-to-image (T2I) evaluation that addresses the limitations of existing metrics. VISTAR introduces a two-tier hybrid paradigm: it employs deterministic, scriptable metrics for physically quantifiable attributes (e.g., text rendering, lighting) and a novel Hierarchical Weighted P/N Questioning (HWPQ) scheme that uses constrained vision-language models to assess abstract semantics (e.g., style fusion, cultural fidelity). Grounded in a Delphi study with 120 experts, we defined seven user roles and nine evaluation angles to construct the benchmark, which comprises 2,845 prompts validated by over 15,000 human pairwise comparisons. Our metrics achieve high human alignment (>75%), with the HWPQ scheme reaching 85.9% accuracy on abstract semantics, significantly outperforming VQA baselines. Comprehensive evaluation of state-of-the-art models reveals no universal champion, as role-weighted scores reorder rankings and provide actionable guidance for domain-specific deployment. All resources are publicly released to foster reproducible T2I assessment.[106] An Interpretable Multi-Plane Fusion Framework With Kolmogorov-Arnold Network Guided Attention Enhancement for Alzheimer's Disease Diagnosis
Xiaoxiao Yang,Meiliang Liu,Yunfang Xu,Zijin Li,Zhengye Si,Xinyue Yang,Zhiwen Zhao
Main category: cs.CV
TL;DR: The paper proposes MPF-KANSC, a deep learning model that improves Alzheimer's disease diagnosis by integrating multi-plane sMRI features and a novel attention mechanism to better capture brain atrophy patterns.
Details
Motivation: Early and accurate diagnosis of Alzheimer's disease is challenging due to complex structural brain changes, and existing deep learning methods are limited by focusing on single-plane sMRI and failing to capture nonlinear relationships between brain regions. Method: The MPF-KANSC framework combines multi-plane fusion (MPF) for feature extraction from multiple sMRI planes and a Kolmogorov-Arnold Network-guided spatial-channel attention mechanism (KANSC) to better capture nonlinear relationships in brain atrophy features. Result: Experiments on the ADNI dataset show that MPF-KANSC achieves superior performance in AD diagnosis and reveals interpretable insights into right-lateralized asymmetry in subcortical structural changes. Conclusion: The proposed MPF-KANSC model outperforms existing methods in diagnosing Alzheimer's disease and provides insights into right-lateralized subcortical structural changes during AD progression. Abstract: Alzheimer's disease (AD) is a progressive neurodegenerative disorder that severely impairs cognitive function and quality of life. Timely intervention in AD relies heavily on early and precise diagnosis, which remains challenging due to the complex and subtle structural changes in the brain. Most existing deep learning methods focus only on a single plane of structural magnetic resonance imaging (sMRI) and struggle to accurately capture the complex and nonlinear relationships among pathological regions of the brain, thus limiting their ability to precisely identify atrophic features. To overcome these limitations, we propose an innovative framework, MPF-KANSC, which integrates multi-plane fusion (MPF) for combining features from the coronal, sagittal, and axial planes, and a Kolmogorov-Arnold Network-guided spatial-channel attention mechanism (KANSC) to more effectively learn and represent sMRI atrophy features. 
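As a rough illustration of the three anatomical planes named above, the following minimal Python sketch extracts the central sagittal, coronal, and axial slices from a 3D volume. It is not the authors' pipeline: the function name `central_planes` and the axis ordering (left-right, posterior-anterior, inferior-superior) are assumptions made for illustration, and real sMRI files would need their orientation checked before slicing.

```python
import numpy as np


def central_planes(volume):
    """Return the central slice of a 3D volume along each axis.

    Assumption (hypothetical, for illustration only): the axes are ordered
    (x: left-right, y: posterior-anterior, z: inferior-superior), so fixing
    x gives a sagittal slice, fixing y a coronal slice, and fixing z an
    axial slice.
    """
    vol = np.asarray(volume)
    if vol.ndim != 3:
        raise ValueError("expected a 3D volume")
    cx, cy, cz = (size // 2 for size in vol.shape)
    return {
        "sagittal": vol[cx, :, :],
        "coronal": vol[:, cy, :],
        "axial": vol[:, :, cz],
    }


if __name__ == "__main__":
    # Toy volume standing in for an sMRI scan.
    toy = np.arange(2 * 3 * 4).reshape(2, 3, 4)
    for name, plane in central_planes(toy).items():
        print(name, plane.shape)
```

A multi-plane model would feed each of the three 2D views to its own feature extractor before fusing them; the fusion step itself is not sketched here because the abstract does not specify it.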
Specifically, the proposed model enables parallel feature extraction from multiple anatomical planes, thus capturing more comprehensive structural information. The KANSC attention mechanism further leverages a more flexible and accurate nonlinear function approximation technique, facilitating precise identification and localization of disease-related abnormalities. Experiments on the ADNI dataset confirm that the proposed MPF-KANSC achieves superior performance in AD diagnosis. Moreover, our findings provide new evidence of right-lateralized asymmetry in subcortical structural changes during AD progression, highlighting the model's promising interpretability.[107] Fewer Denoising Steps or Cheaper Per-Step Inference: Towards Compute-Optimal Diffusion Model Deployment
Zhenbang Du,Yonggan Fu,Lifu Wang,Jiayi Qian,Xiao Luo,Yingyan Lin
Main category: cs.CV
TL;DR: This paper introduces PostDiff, a training-free method to accelerate diffusion models by optimizing inference efficiency while maintaining generation quality.
Details
Motivation: Diffusion models are computationally heavy, limiting their deployment on resource-constrained platforms. This work explores how to optimally compress models post-training without fine-tuning for better efficiency and fidelity. Method: PostDiff proposes a mixed-resolution denoising scheme at the input level and a hybrid module caching strategy at the module level to reduce redundancy in pre-trained diffusion models. Result: Experiments show that PostDiff significantly enhances both generation fidelity and efficiency, with reducing per-step cost being more effective than reducing denoising steps. Conclusion: PostDiff improves the fidelity-efficiency trade-off of diffusion models without fine-tuning, showing that reducing per-step inference cost is more effective than reducing denoising steps for efficient deployment. Abstract: Diffusion models have shown remarkable success across generative tasks, yet their high computational demands challenge deployment on resource-limited platforms. This paper investigates a critical question for compute-optimal diffusion model deployment: Under a post-training setting without fine-tuning, is it more effective to reduce the number of denoising steps or to use a cheaper per-step inference? Intuitively, reducing the number of denoising steps increases the variability of the distributions across steps, making the model more sensitive to compression. In contrast, keeping more denoising steps makes the differences smaller, preserving redundancy, and making post-training compression more feasible. To systematically examine this, we propose PostDiff, a training-free framework for accelerating pre-trained diffusion models by reducing redundancy at both the input level and module level in a post-training manner. 
At the input level, we propose a mixed-resolution denoising scheme based on the insight that reducing generation resolution in early denoising steps can enhance low-frequency components and improve final generation fidelity. At the module level, we employ a hybrid module caching strategy to reuse computations across denoising steps. Extensive experiments and ablation studies demonstrate that (1) PostDiff can significantly improve the fidelity-efficiency trade-off of state-of-the-art diffusion models, and (2) to boost efficiency while maintaining decent generation fidelity, reducing per-step inference cost is often more effective than reducing the number of denoising steps. Our code is available at https://github.com/GATECH-EIC/PostDiff.[108] UW-3DGS: Underwater 3D Reconstruction with Physics-Aware Gaussian Splatting
Wenpeng Xing,Jie Chen,Zaifeng Yang,Changting Lin,Jianfeng Dong,Chaochao Chen,Xun Zhou,Meng Han
Main category: cs.CV
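TL;DR: UW-3DGS is a new underwater 3D reconstruction framework that uses 3D Gaussian Splatting and a physical model to handle light absorption and scattering in underwater environments, improving reconstruction quality and efficiency.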
Details
Motivation: Underwater 3D scene reconstruction is challenged by light absorption, scattering, and turbidity, and traditional methods such as NeRF are limited in geometry and color fidelity, so a more efficient, higher-resolution method is needed. Method: UW-3DGS adapts 3D Gaussian Splatting (3DGS), combining a voxel-based regression approach with Physics-Aware Uncertainty Pruning (PAUP) to optimize underwater reconstruction. Result: Experiments show that UW-3DGS achieves a PSNR of 27.604, an SSIM of 0.868, and an LPIPS of 0.104 on the SeaThru-NeRF dataset, with a reduction of about 65% in floating artifacts. Conclusion: By introducing a learnable underwater image formation module and a physics-aware uncertainty pruning branch, UW-3DGS performs strongly in underwater 3D scene reconstruction and overcomes the geometry and color fidelity limitations of traditional methods. Abstract: Underwater 3D scene reconstruction faces severe challenges from light absorption, scattering, and turbidity, which degrade geometry and color fidelity in traditional methods like Neural Radiance Fields (NeRF). While NeRF extensions such as SeaThru-NeRF incorporate physics-based models, their MLP reliance limits efficiency and spatial resolution in hazy environments. We introduce UW-3DGS, a novel framework adapting 3D Gaussian Splatting (3DGS) for robust underwater reconstruction. Key innovations include: (1) a plug-and-play learnable underwater image formation module using voxel-based regression for spatially varying attenuation and backscatter; and (2) a Physics-Aware Uncertainty Pruning (PAUP) branch that adaptively removes noisy floating Gaussians via uncertainty scoring, ensuring artifact-free geometry. The pipeline operates in training and rendering stages. During training, noisy Gaussians are optimized end-to-end with underwater parameters, guided by PAUP pruning and scattering modeling. In rendering, refined Gaussians produce clean Unattenuated Radiance Images (URIs) free from media effects, while learned physics enable realistic Underwater Images (UWIs) with accurate light transport. Experiments on SeaThru-NeRF and UWBundle datasets show superior performance, achieving PSNR of 27.604, SSIM of 0.868, and LPIPS of 0.104 on SeaThru-NeRF, with ~65% reduction in floating artifacts.[109] Synthetic Data-Driven Multi-Architecture Framework for Automated Polyp Segmentation Through Integrated Detection and Mask Generation
Ojonugwa Oluwafemi Ejiga Peter,Akingbola Oluwapemiisin,Amalahu Chetachi,Adeniran Opeyemi,Fahmi Khalifa,Md Mahmudur Rahman
Main category: cs.CV
TL;DR: This research proposes a multidirectional framework for automated polyp detection in colonoscopy images using synthetic data generation and advanced detection and segmentation models, achieving high performance and addressing challenges in healthcare dataset limitations.
Details
Motivation: Colorectal cancer is a major cause of cancer-related mortality globally, making early diagnosis through colonoscopy crucial. However, challenges such as limited healthcare dataset sizes and annotation complexities hinder effective polyp detection. This research aims to address these issues through automation and advanced modeling. Method: The study implemented a system that utilizes Stable Diffusion for synthetic data generation, Faster R-CNN for initial object localization, and the Segment Anything Model (SAM) for refining segmentation masks. It also evaluated five state-of-the-art segmentation models—U-Net, PSPNet, FPN, LinkNet, and MANet—using ResNet34 as a base model. Result: The Faster R-CNN detection algorithm achieved a recall of 93.08%, precision of 88.97%, and an F1 score of 90.98%. Among the segmentation models evaluated, FPN showed the highest PSNR (7.205893) and SSIM (0.492381), U-Net excelled in recall (84.85%), and LinkNet demonstrated balanced performance in IoU (64.20%) and Dice score (77.53%). Conclusion: The research successfully introduced a multidirectional framework for automated polyp detection and segmentation in colonoscopy images, combining synthetic data generation with advanced detection and segmentation models, thereby addressing challenges related to limited healthcare datasets and annotation complexities. Abstract: Colonoscopy is a vital tool for the early diagnosis of colorectal cancer, which is one of the main causes of cancer-related mortality globally; hence, it is deemed an essential technique for the prevention and early detection of colorectal cancer. The research introduces a unique multidirectional architectural framework to automate polyp detection within colonoscopy images while helping resolve limited healthcare dataset sizes and annotation complexities. 
The research implements a comprehensive system that delivers synthetic data generation through Stable Diffusion enhancements together with detection and segmentation algorithms. This detection approach combines Faster R-CNN for initial object localization while the Segment Anything Model (SAM) refines the segmentation masks. The Faster R-CNN detection algorithm achieved a recall of 93.08% combined with a precision of 88.97% and an F1 score of 90.98%. SAM is then used to generate the image mask. The research evaluated five state-of-the-art segmentation models that included U-Net, PSPNet, FPN, LinkNet, and MANet using ResNet34 as a base model. The results demonstrate the superior performance of FPN with the highest scores of PSNR (7.205893) and SSIM (0.492381), while U-Net excels in recall (84.85%) and LinkNet shows balanced performance in IoU (64.20%) and Dice score (77.53%).[110] Graph-based Robot Localization Using a Graph Neural Network with a Floor Camera and a Feature Rich Industrial Floor
Dominik Brämer,Diana Kleingarn,Oliver Urbann
Main category: cs.CV
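TL;DR: This paper proposes a robot localization method based on graph representations and GCNs that addresses the scalability and adaptability limits of traditional methods and achieves accurate, efficient localization.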
Details
Motivation: Traditional methods, such as Lidar or QR-code based systems, are limited in scalability and adaptability, and struggle to localize accurately and efficiently, particularly in complex environments. Method: Floor features are represented as graphs and combined with Graph Convolutional Networks (GCNs) for robot localization; the approach also addresses the kidnapped robot problem. Result: The proposed method achieves higher localization accuracy (an error of only 0.64 cm) and efficiency, and solves the kidnapped robot problem without complex filtering processes. Conclusion: The study proposes a new localization framework based on graph representations and Graph Convolutional Networks (GCNs), offering an effective solution for robot localization in complex environments. Abstract: Accurate localization represents a fundamental challenge in robotic navigation. Traditional methodologies, such as Lidar or QR-code based systems, suffer from inherent scalability and adaptability constraints, particularly in complex environments. In this work, we propose an innovative localization framework that harnesses flooring characteristics by employing graph-based representations and Graph Convolutional Networks (GCNs). Our method uses graphs to represent floor features, which helps localize the robot more accurately (0.64cm error) and more efficiently than comparing individual image features. Additionally, this approach successfully addresses the kidnapped robot problem in every frame without requiring complex filtering processes. These advancements open up new possibilities for robotic navigation in diverse environments.[111] MA-CBP: A Criminal Behavior Prediction Framework Based on Multi-Agent Asynchronous Collaboration
Cheng Liu,Daou Zhang,Tingxu Liu,Yuhan Wang,Jinyang Chen,Yuexuan Li,Xinying Xiao,Chenbo Xin,Ziru Wang,Weichao Wu
Main category: cs.CV
TL;DR: This paper proposes MA-CBP, a multi-agent asynchronous collaboration framework for criminal behavior prediction, effectively combining real-time video analysis with historical context to enable early warnings in urban public safety scenarios.
Details
Motivation: Traditional anomaly detection methods struggle to capture high-level behavioral semantics, while generative approaches based on LLMs often lack real-time performance. This limitation motivates the development of a more effective and timely criminal behavior prediction framework. Method: The study proposes MA-CBP, a framework that transforms real-time video streams into semantic descriptions, constructs historical summaries, and performs joint reasoning over long- and short-term contexts for behavioral prediction. Result: The proposed MA-CBP framework demonstrates superior performance on multiple datasets and enables early warning of potential criminal activities by capturing key behavioral elements such as event subjects, locations, and causes. Conclusion: The study concludes that MA-CBP offers a promising approach for predicting criminal behavior by leveraging multi-agent asynchronous collaboration, achieving superior performance in urban public safety scenarios. Abstract: With the acceleration of urbanization, criminal behavior in public scenes poses an increasingly serious threat to social security. Traditional anomaly detection methods based on feature recognition struggle to capture high-level behavioral semantics from historical information, while generative approaches based on Large Language Models (LLMs) often fail to meet real-time requirements. To address these challenges, we propose MA-CBP, a criminal behavior prediction framework based on multi-agent asynchronous collaboration. This framework transforms real-time video streams into frame-level semantic descriptions, constructs causally consistent historical summaries, and fuses adjacent image frames to perform joint reasoning over long- and short-term contexts. The resulting behavioral decisions include key elements such as event subjects, locations, and causes, enabling early warning of potential criminal activity. 
In addition, we construct a high-quality criminal behavior dataset that provides multi-scale language supervision, including frame-level, summary-level, and event-level semantic annotations. Experimental results demonstrate that our method achieves superior performance on multiple datasets and offers a promising solution for risk warning in urban public safety scenarios.[112] A Semantic Segmentation Algorithm for Pleural Effusion Based on DBIF-AUNet
Ruixiang Tang,Jianglong Qin,Mingda Zhang,Yan Song,Yi Wu,Wei Wu
Main category: cs.CV
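TL;DR: This paper proposes DBIF-AUNet, a new segmentation model for pleural effusion CT images, which significantly improves segmentation accuracy through a dual-branch interactive fusion attention model and a nested deep supervision mechanism.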
Details
Motivation: Semantic segmentation of pleural effusion can significantly improve the accuracy and timeliness of clinical diagnosis and treatment, but it faces challenges such as similar gray levels, blurred edges, and variable morphology that existing methods handle poorly. Method: The proposed DBIF-AUNet comprises a densely nested skip-connection network, a Dual-Domain Feature Disentanglement module (DDFD), and a Branch Interaction Attention Fusion module (BIAF), together with a nested deep supervision mechanism and a hierarchical adaptive hybrid loss. Result: Validated on 1,622 pleural effusion CT images from Southwest Hospital, DBIF-AUNet achieved IoU and Dice scores of 80.1% and 89.0% respectively, outperforming the state-of-the-art medical image segmentation models U-Net++ and Swin-UNet by 5.7%/2.7% and 2.2%/1.5% respectively. Conclusion: With its dual-branch interactive fusion attention model, DBIF-AUNet significantly improves segmentation accuracy on complex pleural effusion CT images and outperforms existing medical image segmentation models. Abstract: Pleural effusion semantic segmentation can significantly enhance the accuracy and timeliness of clinical diagnosis and treatment by precisely identifying disease severity and lesion areas. Currently, semantic segmentation of pleural effusion CT images faces multiple challenges. These include similar gray levels between effusion and surrounding tissues, blurred edges, and variable morphology. Existing methods often struggle with diverse image variations and complex edges, primarily because direct feature concatenation causes semantic gaps. To address these challenges, we propose the Dual-Branch Interactive Fusion Attention model (DBIF-AUNet). This model constructs a densely nested skip-connection network and innovatively refines the Dual-Domain Feature Disentanglement module (DDFD). The DDFD module orthogonally decouples the functions of dual-domain modules to achieve multi-scale feature complementarity and enhance characteristics at different levels. Concurrently, we design a Branch Interaction Attention Fusion module (BIAF) that works synergistically with the DDFD. This module dynamically weights and fuses global, local, and frequency band features, thereby improving segmentation robustness. Furthermore, we implement a nested deep supervision mechanism with hierarchical adaptive hybrid loss to effectively address class imbalance. Through validation on 1,622 pleural effusion CT images from Southwest Hospital, DBIF-AUNet achieved IoU and Dice scores of 80.1% and 89.0% respectively. 
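For readers less familiar with the two metrics, here is a minimal Python sketch of how IoU and Dice are commonly computed for a pair of binary masks. It is a generic illustration, not the authors' evaluation code; the function name `iou_and_dice` and the convention of scoring two empty masks as a perfect match are assumptions.

```python
import numpy as np


def iou_and_dice(pred, target):
    """Compute IoU and Dice for two binary masks of the same shape.

    Assumption (illustrative convention): if both masks are empty,
    both scores are reported as 1.0.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    if total == 0:
        return 1.0, 1.0
    iou = intersection / union
    dice = 2.0 * intersection / total
    return float(iou), float(dice)


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    target = [[1, 0], [0, 0]]
    print(iou_and_dice(pred, target))
```

For a single mask pair the two scores are linked by Dice = 2·IoU / (1 + IoU); plugging in IoU = 0.801 gives roughly 0.89, in line with the reported pair, although scores averaged over a dataset need not satisfy the identity exactly.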
These results outperform state-of-the-art medical image segmentation models U-Net++ and Swin-UNet by 5.7%/2.7% and 2.2%/1.5% respectively, demonstrating significant optimization in segmentation accuracy for complex pleural effusion CT images.[113] LoRA in LoRA: Towards Parameter-Efficient Architecture Expansion for Continual Visual Instruction Tuning
Chang Che,Ziqi Wang,Pengwan Yang,Qi Wang,Hui Ma,Zenglin Shi
Main category: cs.CV
TL;DR: LiLoRA improves continual visual instruction tuning in MLLMs by efficiently managing task-specific parameters and preserving past knowledge with minimal overhead.
Details
Motivation: To address the issue of catastrophic forgetting in MLLMs during continual learning, while avoiding the parameter inefficiency of existing architecture expansion methods. Method: LiLoRA uses a shared LoRA matrix A, applies low-rank decomposition to matrix B, and incorporates a cosine-regularized stability loss. Result: LiLoRA achieves better performance in sequential task learning and significantly improves parameter efficiency compared to existing methods. Conclusion: LiLoRA is a more efficient method for CVIT in MLLMs that reduces parameter overhead and improves scalability while maintaining performance on previously learned tasks. Abstract: Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models (MLLMs) to incrementally learn new tasks over time. However, this process is challenged by catastrophic forgetting, where performance on previously learned tasks deteriorates as the model adapts to new ones. A common approach to mitigate forgetting is architecture expansion, which introduces task-specific modules to prevent interference. Yet, existing methods often expand entire layers for each task, leading to significant parameter overhead and poor scalability. To overcome these issues, we introduce LoRA in LoRA (LiLoRA), a highly efficient architecture expansion method tailored for CVIT in MLLMs. LiLoRA shares the LoRA matrix A across tasks to reduce redundancy, applies an additional low-rank decomposition to matrix B to minimize task-specific parameters, and incorporates a cosine-regularized stability loss to preserve consistency in shared representations over time. Extensive experiments on a diverse CVIT benchmark show that LiLoRA consistently achieves superior performance in sequential task learning while significantly improving parameter efficiency compared to existing approaches.[114] AnomalyMoE: Towards a Language-free Generalist Model for Unified Visual Anomaly Detection
Zhaopeng Gu,Bingke Zhu,Guibo Zhu,Yingying Chen,Wei Ge,Ming Tang,Jinqiao Wang
Main category: cs.CV
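TL;DR: Proposes AnomalyMoE, a universal anomaly detection framework that uses a Mixture-of-Experts architecture to handle anomaly detection across multiple domains and achieves state-of-the-art performance.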
Details
Motivation: Existing anomaly detection methods are usually highly specialized, which limits their generalizability, so a universal solution is needed. Method: The proposed AnomalyMoE framework uses a Mixture-of-Experts architecture, combining expert networks at three levels with two modules (an Expert Information Repulsion module and an Expert Selection Balancing module) to detect anomalies. Result: Experiments show that AnomalyMoE performs strongly on 8 challenging datasets, significantly outperforming specialized methods in their respective domains. Conclusion: AnomalyMoE is a universal anomaly detection framework based on a Mixture-of-Experts architecture; by decomposing anomaly detection into three distinct semantic hierarchies and introducing the Expert Information Repulsion and Expert Selection Balancing modules, it achieves state-of-the-art performance across multiple domains. Abstract: Anomaly detection is a critical task across numerous domains and modalities, yet existing methods are often highly specialized, limiting their generalizability. These specialized models, tailored for specific anomaly types like textural defects or logical errors, typically exhibit limited performance when deployed outside their designated contexts. To overcome this limitation, we propose AnomalyMoE, a novel and universal anomaly detection framework based on a Mixture-of-Experts (MoE) architecture. Our key insight is to decompose the complex anomaly detection problem into three distinct semantic hierarchies: local structural anomalies, component-level semantic anomalies, and global logical anomalies. AnomalyMoE correspondingly employs three dedicated expert networks at the patch, component, and global levels, and is specialized in reconstructing features and identifying deviations at its designated semantic level. This hierarchical design allows a single model to concurrently understand and detect a wide spectrum of anomalies. Furthermore, we introduce an Expert Information Repulsion (EIR) module to promote expert diversity and an Expert Selection Balancing (ESB) module to ensure the comprehensive utilization of all experts. Experiments on 8 challenging datasets spanning industrial imaging, 3D point clouds, medical imaging, video surveillance, and logical anomaly detection demonstrate that AnomalyMoE establishes new state-of-the-art performance, significantly outperforming specialized methods in their respective domains.[115] PA-HOI: A Physics-Aware Human and Object Interaction Dataset
Ruiyan Wang,Lin Zuo,Zonghao Lin,Qiang Wang,Zhengxue Cheng,Rong Xie,Jun Ling,Li Song
Main category: cs.CV
TL;DR: This paper presents PA-HOI, a new motion capture dataset for studying how objects' physical attributes affect human motion dynamics.
Details
Motivation: Existing HOI datasets focus on affordance details and often neglect the influence of objects' physical properties on long-term human motion, so the authors introduce a new dataset to bridge this gap. Method: The authors introduce the PA-HOI Motion Capture dataset, which contains 562 motion sequences of human-object interactions, and validate its applicability by integrating it with existing motion generation methods. Result: The PA-HOI dataset extends the scope of existing datasets and shows how objects' physical attributes influence human posture, speed, motion scale, and interaction strategies. Conclusion: The PA-HOI dataset improves understanding of how objects' physical attributes affect human motion dynamics and shows potential for use with motion generation methods. Abstract: The Human-Object Interaction (HOI) task explores the dynamic interactions between humans and objects in physical environments, providing essential biomechanical and cognitive-behavioral foundations for fields such as robotics, virtual reality, and human-computer interaction. However, existing HOI datasets focus on details of affordance, often neglecting the influence of physical properties of objects on human long-term motion. To bridge this gap, we introduce the PA-HOI Motion Capture dataset, which highlights the impact of objects' physical attributes on human motion dynamics, including human posture, moving velocity, and other motion characteristics. The dataset comprises 562 motion sequences of human-object interactions, with each sequence performed by subjects of different genders interacting with 35 3D objects that vary in size, shape, and weight. This dataset stands out by significantly extending the scope of existing ones for understanding how the physical attributes of different objects influence human posture, speed, motion scale, and interacting strategies. We further demonstrate the applicability of the PA-HOI dataset by integrating it with existing motion generation methods, validating its capacity to transfer realistic physical awareness.[116] Interpretable Rheumatoid Arthritis Scoring via Anatomy-aware Multiple Instance Learning
Zhiyan Bo,Laura C. Coates,Bartlomiej W. Papiez
Main category: cs.CV
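TL;DR: This study develops an automated two-stage pipeline for interpretable image-level SvdH score prediction, aiming to make radiographic damage assessment in rheumatoid arthritis more efficient.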
Details
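Motivation: The SvdH score is widely used in RA clinical trials, but its complexity has limited its adoption in routine clinical practice. To address the inefficiency of manual scoring, this study proposes an automated two-stage pipeline for image-level SvdH score prediction. Method: The study proposes a two-stage pipeline for interpretable image-level SvdH score prediction from dual-hand radiographs. The two stages are: 1) extracting disease-relevant image regions, and 2) integrating these regions with attention-based multiple instance learning to generate image-level features for prediction. Two region extraction schemes are proposed: 1) sampling the image tiles most likely to contain abnormalities, and 2) cropping patches that contain disease-relevant joints. Result: With the second region extraction scheme, the best individual score prediction model achieved a Pearson's correlation coefficient (PCC) of 0.943 and a root mean squared error (RMSE) of 15.73. Ensemble learning further improved prediction accuracy to a PCC of 0.945 and an RMSE of 15.57, achieving state-of-the-art performance comparable to that of experienced radiologists (PCC = 0.97, RMSE = 18.75). Conclusion: The proposed two-stage pipeline performs well on image-level SvdH score prediction, particularly with the second region extraction scheme combined with ensemble learning, where it reaches performance comparable to experienced radiologists and bases its decisions on anatomical structures relevant to RA progression. Abstract: The Sharp/van der Heijde (SvdH) score has been widely used in clinical trials to quantify radiographic damage in Rheumatoid Arthritis (RA), but its complexity has limited its adoption in routine clinical practice. To address the inefficiency of manual scoring, this work proposes a two-stage pipeline for interpretable image-level SvdH score prediction using dual-hand radiographs. Our approach extracts disease-relevant image regions and integrates them using attention-based multiple instance learning to generate image-level features for prediction. We propose two region extraction schemes: 1) sampling image tiles most likely to contain abnormalities, and 2) cropping patches containing disease-relevant joints. With Scheme 2, our best individual score prediction model achieved a Pearson's correlation coefficient (PCC) of 0.943 and a root mean squared error (RMSE) of 15.73. Ensemble learning further boosted prediction accuracy, yielding a PCC of 0.945 and RMSE of 15.57, achieving state-of-the-art performance that is comparable to that of experienced radiologists (PCC = 0.97, RMSE = 18.75). Finally, our pipeline effectively identified and made decisions based on anatomical structures which clinicians consider relevant to RA progression.[117] TEFormer: Texture-Aware and Edge-Guided Transformer for Semantic Segmentation of Urban Remote Sensing Images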
Guoyu Zhou,Jing Zhang,Yi Yan,Hui Zhang,Li Zhuo
Main category: cs.CV
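TL;DR: TEFormer introduces texture-aware and edge-guided mechanisms that significantly improve semantic segmentation accuracy on urban remote sensing images.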
Details
Motivation: Semantic segmentation of urban remote sensing images faces challenges such as subtle texture differences between geospatial objects, similar spatial structures, irregular shapes, blurred boundaries, and overlapping semantic objects, which easily lead to semantic ambiguity and misclassification. Method: TEFormer proposes a texture-aware module (TaM), an edge-guided tri-branch decoder (Eg3Head), and an edge-guided feature fusion module (EgFFM) to capture fine-grained texture differences, preserve local edges, and fuse contextual and edge information. Result: TEFormer reaches mIoU of 88.57%, 81.46%, and 53.55% on the Potsdam, Vaihingen, and LoveDA datasets, respectively, demonstrating its effectiveness for URSI semantic segmentation. Conclusion: TEFormer is a new method for semantic segmentation of urban remote sensing images that integrates texture awareness and edge guidance to improve segmentation accuracy. Abstract: Semantic segmentation of urban remote sensing images (URSIs) is crucial for applications such as urban planning and environmental monitoring. However, geospatial objects often exhibit subtle texture differences and similar spatial structures, which can easily lead to semantic ambiguity and misclassification. Moreover, challenges such as irregular object shapes, blurred boundaries, and overlapping spatial distributions of semantic objects contribute to complex and diverse edge morphologies, further complicating accurate segmentation. To tackle these issues, we propose a texture-aware and edge-guided Transformer (TEFormer) that integrates texture awareness and edge-guidance mechanisms for semantic segmentation of URSIs. In the encoder, a texture-aware module (TaM) is designed to capture fine-grained texture differences between visually similar categories to enhance semantic discrimination. Then, an edge-guided tri-branch decoder (Eg3Head) is constructed to preserve local edges and details for multiscale context-awareness. Finally, an edge-guided feature fusion module (EgFFM) is designed to fuse contextual and detail information with edge information to realize refined semantic segmentation. Extensive experiments show that TEFormer achieves mIoU of 88.57%, 81.46%, and 53.55% on the Potsdam, Vaihingen, and LoveDA datasets, respectively, demonstrating its effectiveness in URSI semantic segmentation.[118] Depth Jitter: Seeing through the Depth
Md Sazidur Rahman,David Cabecinhas,Ricard Marxer
Main category: cs.CV
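TL;DR: This paper proposes Depth-Jitter, a depth-aware data augmentation technique that simulates natural depth variations to improve model stability and generalization across different depth conditions.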
Details
Motivation: Conventional data augmentation methods overlook depth-aware transformations, which limits model robustness to real-world depth variations. The paper therefore proposes a new depth-aware augmentation method. Method: Depth-Jitter applies adaptive depth offsetting, guided by depth variance thresholds, to generate synthetic depth perturbations that simulate natural depth variations while preserving structural integrity. Result: Depth-Jitter was evaluated on the FathomNet and UTDAC2020 datasets. Experiments show that, although its absolute performance does not always exceed that of conventional methods, it consistently improves model stability and generalization. Conclusion: Depth-Jitter is a new depth-based data augmentation method that improves model stability and generalization under different depth conditions and provides a foundation for further research on depth-aware augmentation. Abstract: Depth information is essential in computer vision, particularly in underwater imaging, robotics, and autonomous navigation. However, conventional augmentation techniques overlook depth-aware transformations, limiting model robustness in real-world depth variations. In this paper, we introduce Depth-Jitter, a novel depth-based augmentation technique that simulates natural depth variations to improve generalization. Our approach applies adaptive depth offsetting, guided by depth variance thresholds, to generate synthetic depth perturbations while preserving structural integrity. We evaluate Depth-Jitter on two benchmark datasets, FathomNet and UTDAC2020, demonstrating its impact on model stability under diverse depth conditions. Extensive experiments compare Depth-Jitter against traditional augmentation strategies such as ColorJitter, analyzing performance across varying learning rates, encoders, and loss functions. While Depth-Jitter does not always outperform conventional methods in absolute performance, it consistently enhances model stability and generalization in depth-sensitive environments. These findings highlight the potential of depth-aware augmentation for real-world applications and provide a foundation for further research into depth-based learning strategies. The code is publicly available on GitHub at https://github.com/mim-team/Depth-Jitter to support advancements in depth-aware augmentation.[119] Towards Unified Image Deblurring using a Mixture-of-Experts Decoder
Daniel Feijoo,Paula Garrido-Mellado,Jaesung Rim,Alvaro Garcia,Marcos V. Conde
Main category: cs.CV
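TL;DR: This paper proposes an all-in-one image deblurring method that uses a mixture-of-experts decoding module to efficiently restore images affected by multiple blur types.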
Details
Motivation: Existing methods focus on solutions tailored to particular blur types and lack generalization, so multiple models are needed to cover several blur types, which is impractical in many real scenarios. Method: The paper introduces a mixture-of-experts (MoE) decoding module that dynamically routes image features based on the recognized blur degradation, enabling precise and efficient end-to-end restoration. Result: The unified approach achieves performance comparable to dedicated task-specific models and also performs well on unseen blur degradation scenarios. Conclusion: The paper presents a unified image deblurring method that restores images affected by diverse blur degradations and shows strong robustness and generalization. Abstract: Image deblurring, removing blurring artifacts from images, is a fundamental task in computational photography and low-level computer vision. Existing approaches focus on specialized solutions tailored to particular blur types, so these solutions lack generalization. This limitation in current methods implies requiring multiple models to cover several blur types, which is not practical in many real scenarios. In this paper, we introduce the first all-in-one deblurring method capable of efficiently restoring images affected by diverse blur degradations, including global motion, local motion, blur in low-light conditions, and defocus blur. We propose a mixture-of-experts (MoE) decoding module, which dynamically routes image features based on the recognized blur degradation, enabling precise and efficient restoration in an end-to-end manner. Our unified approach not only achieves performance comparable to dedicated task-specific models, but also demonstrates remarkable robustness and generalization capabilities on unseen blur degradation scenarios.[120] Deepfake Detection that Generalizes Across Benchmarks
Andrii Yermakov,Jan Cech,Jiri Matas,Mario Fritz
Main category: cs.CV
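TL;DR: This paper proposes LNCLIP-DF, an efficient deepfake detection method that adapts a pre-trained CLIP model by fine-tuning a small number of parameters and applying a feature-space augmentation strategy, achieving leading cross-dataset detection performance.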
Details
Motivation: The generalization of deepfake detectors to unseen manipulation techniques remains a challenge for practical deployment, and existing methods typically address it by introducing significant architectural complexity. Method: Only the Layer Normalization parameters of the CLIP vision encoder (0.03% of the total) are fine-tuned, and generalization is enhanced through L2 normalization and latent space augmentations. Result: An extensive evaluation on 13 benchmark datasets shows that LNCLIP-DF outperforms more complex, recent methods in average cross-dataset AUROC. Conclusion: By fine-tuning only the Layer Normalization parameters and applying L2 normalization and latent space augmentations, LNCLIP-DF adapts a pre-trained CLIP model efficiently and reproducibly and reaches state-of-the-art cross-dataset AUROC. Abstract: The generalization of deepfake detectors to unseen manipulation techniques remains a challenge for practical deployment. Although many approaches adapt foundation models by introducing significant architectural complexity, this work demonstrates that robust generalization is achievable through a parameter-efficient adaptation of a pre-trained CLIP vision encoder. The proposed method, LNCLIP-DF, fine-tunes only the Layer Normalization parameters (0.03% of the total) and enhances generalization by enforcing a hyperspherical feature manifold using L2 normalization and latent space augmentations. We conducted an extensive evaluation on 13 benchmark datasets spanning from 2019 to 2025. The proposed method achieves state-of-the-art performance, outperforming more complex, recent approaches in average cross-dataset AUROC. Our analysis yields two primary findings for the field: 1) training on paired real-fake data from the same source video is essential for mitigating shortcut learning and improving generalization, and 2) detection difficulty on academic datasets has not strictly increased over time, with models trained on older, diverse datasets showing strong generalization capabilities. This work delivers a computationally efficient and reproducible method, proving that state-of-the-art generalization is attainable by making targeted, minimal changes to a pre-trained CLIP model. The code will be made publicly available upon acceptance.[121] FedX: Explanation-Guided Pruning for Communication-Efficient Federated Learning in Remote Sensing
Barış Büyüktaş,Jonas Klotz,Begüm Demir
Main category: cs.CV
TL;DR: FedX reduces communication overhead in federated learning for remote sensing by pruning less relevant model components, minimizing transmitted model size without compromising performance.
Details
Motivation: The motivation is to address the communication overhead caused by the frequent exchange of large model updates between clients and the central server in federated learning for remote sensing image classification tasks. Method: A novel strategy called FedX was proposed, which uses explanation-guided pruning to reduce communication overhead in federated learning. Backpropagation-based explanation methods estimate the importance of model components, pruning the least relevant ones at the central server. Result: FedX significantly reduces the number of shared model parameters while enhancing the generalization capability of the global model. It was evaluated on multi-label and single-label scene classification using the BigEarthNet-S2 and EuroSAT datasets, respectively. Conclusion: FedX is successful in reducing shared model parameters and enhancing the global model's generalization capability compared to unpruned models and state-of-the-art pruning methods. Abstract: Federated learning (FL) enables the collaborative training of deep neural networks across decentralized data archives (i.e., clients), where each client stores data locally and only shares model updates with a central server. This makes FL a suitable learning paradigm for remote sensing (RS) image classification tasks, where data centralization may be restricted due to legal and privacy constraints. However, a key challenge in applying FL to RS tasks is the communication overhead caused by the frequent exchange of large model updates between clients and the central server. To address this issue, in this paper we propose a novel strategy (denoted as FedX) that uses explanation-guided pruning to reduce communication overhead by minimizing the size of the transmitted models without compromising performance. FedX leverages backpropagation-based explanation methods to estimate the task-specific importance of model components and prunes the least relevant ones at the central server. 
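The server-side pruning step described above can be illustrated with a minimal sketch. It assumes the relevance scores have already been produced by some explanation method and simply drops the lowest-ranked components; the names `prune_least_relevant`, `count_parameters`, and `prune_ratio`, and the dictionary-based model layout, are hypothetical and are not taken from the FedX code.

```python
from typing import Dict, List


def prune_least_relevant(
    weights: Dict[str, List[float]],
    relevance: Dict[str, float],
    prune_ratio: float,
) -> Dict[str, List[float]]:
    """Return a sparse copy of the model without its least relevant components.

    `relevance` holds one precomputed importance score per component
    (assumed to come from an explanation method, not computed here).
    """
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    n_prune = int(len(weights) * prune_ratio)
    # Rank component names from least to most relevant.
    ranked = sorted(weights, key=lambda name: relevance[name])
    pruned = set(ranked[:n_prune])
    return {name: w for name, w in weights.items() if name not in pruned}


def count_parameters(model: Dict[str, List[float]]) -> int:
    """Number of scalar parameters that would be transmitted."""
    return sum(len(w) for w in model.values())


# Toy global model with four components (illustrative values only).
global_model = {
    "conv1": [0.1, 0.2],
    "conv2": [0.3],
    "fc1": [0.4, 0.5, 0.6],
    "fc2": [0.7],
}
scores = {"conv1": 0.9, "conv2": 0.1, "fc1": 0.5, "fc2": 0.05}
sparse_model = prune_least_relevant(global_model, scores, prune_ratio=0.5)
```

In this toy setting the two lowest-scoring components are removed before the model is sent out, so fewer parameters need to be communicated; how FedX actually selects the pruning granularity and ratio is not specified in this entry.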
The resulting sparse global model is then sent to clients, substantially reducing communication overhead. We evaluate FedX on multi-label scene classification using the BigEarthNet-S2 dataset and single-label scene classification using the EuroSAT dataset. Experimental results show the success of FedX in significantly reducing the number of shared model parameters while enhancing the generalization capability of the global model, compared to both the unpruned model and state-of-the-art pruning methods. The code of FedX will be available at https://git.tu-berlin.de/rsim/FedX.[122] XAG-Net: A Cross-Slice Attention and Skip Gating Network for 2.5D Femur MRI Segmentation
Byunghyun Ko,Anning Tian,Jeongkyu Lee
Main category: cs.CV
TL;DR: XAG-Net, a new 2.5D U-Net-based architecture, enhances inter-slice contextual modeling and intra-slice feature refinement for efficient and accurate femur MRI segmentation.
Details
Motivation: Accurate segmentation of femur structures from MRI is crucial for orthopedic diagnosis and surgical planning, but existing 2D and 3D deep learning-based segmentation approaches have limitations. Method: XAG-Net, a novel 2.5D U-Net-based architecture with pixel-wise cross-slice attention (CSA) and skip attention gating (AG) mechanisms was proposed. Result: XAG-Net outperforms baseline 2D, 2.5D, and 3D U-Net models in femur segmentation accuracy while maintaining computational efficiency. Conclusion: XAG-Net proves to be a promising framework for efficient and accurate femur MRI segmentation. Abstract: Accurate segmentation of femur structures from Magnetic Resonance Imaging (MRI) is critical for orthopedic diagnosis and surgical planning but remains challenging due to the limitations of existing 2D and 3D deep learning-based segmentation approaches. In this study, we propose XAG-Net, a novel 2.5D U-Net-based architecture that incorporates pixel-wise cross-slice attention (CSA) and skip attention gating (AG) mechanisms to enhance inter-slice contextual modeling and intra-slice feature refinement. Unlike previous CSA-based models, XAG-Net applies pixel-wise softmax attention across adjacent slices at each spatial location for fine-grained inter-slice modeling. Extensive evaluations demonstrate that XAG-Net surpasses baseline 2D, 2.5D, and 3D U-Net models in femur segmentation accuracy while maintaining computational efficiency. Ablation studies further validate the critical role of the CSA and AG modules, establishing XAG-Net as a promising framework for efficient and accurate femur MRI segmentation.[123] SIFThinker: Spatially-Aware Image Focus for Visual Reasoning
Zhangquan Chen,Ruihui Zhao,Chuwei Luo,Mingze Sun,Xinlei Yu,Yangyang Kang,Ruqi Huang
Main category: cs.CV
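TL;DR: This paper proposes the SIFThinker framework, which improves the performance of multimodal large language models on visual tasks through attention correction and image region focusing.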
Details
Motivation: Current multimodal large language models struggle with complex visual tasks, especially spatial understanding and fine-grained perception, and need to combine attention correction with spatial cues to improve performance. Method: The paper introduces the SIFThinker framework, which includes a reverse-expansion-forward-inference strategy and the GRPO-SIF training paradigm, interleaving depth-enhanced bounding boxes with natural language to perform attention correction and image region focusing. Result: SIFThinker outperforms existing state-of-the-art methods in spatial understanding and fine-grained visual perception, and the work also produces the SIF-50K dataset. Conclusion: SIFThinker improves the performance of multimodal large language models on complex visual tasks while maintaining strong general capabilities. Abstract: Current multimodal large language models (MLLMs) still face significant challenges in complex visual tasks (e.g., spatial understanding, fine-grained perception). Prior methods have tried to incorporate visual reasoning; however, they fail to leverage attention correction with spatial cues to iteratively refine their focus on prompt-relevant regions. In this paper, we introduce SIFThinker, a spatially-aware "think-with-images" framework that mimics human visual perception. Specifically, SIFThinker enables attention correcting and image region focusing by interleaving depth-enhanced bounding boxes and natural language. Our contributions are twofold: First, we introduce a reverse-expansion-forward-inference strategy that facilitates the generation of interleaved image-text chains of thought for process-level supervision, which in turn leads to the construction of the SIF-50K dataset. Second, we propose GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into a unified reasoning pipeline, teaching the model to dynamically correct and focus on prompt-relevant regions. Extensive experiments demonstrate that SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception, while maintaining strong general capabilities, highlighting the effectiveness of our method.[124] Uncertainty-quantified Rollout Policy Adaptation for Unlabelled Cross-domain Temporal Grounding
Jian Hu,Zixu Cheng,Shaogang Gong,Isabel Guan,Jianye Hao,Jun Wang,Kun Shao
Main category: cs.CV
TL;DR: This paper proposes URPA, a method for cross-domain video temporal grounding that works with minimal unlabelled data, avoids the need for target domain labels, and maintains low computational cost for real-time applications.
Details
Motivation: The motivation is to address the limitations of existing methods like GRPO, which require labeled data and are computationally heavy, making them unsuitable for unlabelled domains and real-time deployment. Method: The paper introduces URPA, which uses GRPO rollouts to generate multiple predictions, averages them to create pseudo-labels, and incorporates uncertainty quantification through variance estimation to weight training rewards. Result: Experiments show that URPA performs well across six cross-domain settings on three datasets using only a small number of unlabelled videos, achieving good generalization and maintaining efficiency. Conclusion: The paper concludes that URPA effectively enables cross-domain temporal grounding without target labels, generalizes well with minimal unlabelled data, and maintains low computational overhead for real-time use. Abstract: Video Temporal Grounding (TG) aims to temporally locate video segments matching a natural language description (a query) in a long video. While Vision-Language Models (VLMs) are effective at holistic semantic matching, they often struggle with fine-grained temporal localisation. Recently, Group Relative Policy Optimisation (GRPO) reformulates the inference process as a reinforcement learning task, enabling fine-grained grounding and achieving strong in-domain performance. However, GRPO relies on labelled data, making it unsuitable in unlabelled domains. Moreover, because videos are large and expensive to store and process, performing full-scale adaptation introduces prohibitive latency and computational overhead, making it impractical for real-time deployment. To overcome both problems, we introduce a Data-Efficient Unlabelled Cross-domain Temporal Grounding method, from which a model is first trained on a labelled source domain, then adapted to a target domain using only a small number of unlabelled videos from the target domain. 
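The pseudo-labelling idea summarised in the Method field (average several rollout predictions, then weight rewards by a variance-based confidence) can be sketched as follows. This is only an illustration under stated assumptions: the mapping from variance to confidence (`1 / (1 + variance)`), the use of temporal IoU against the pseudo label as the reward, and all function names are hypothetical choices, not details confirmed by the paper.

```python
from statistics import mean, pvariance
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def pseudo_label_with_confidence(rollouts: List[Segment]) -> Tuple[Segment, float]:
    """Average rollout predictions and derive a confidence from their spread.

    Requires at least one rollout.
    """
    starts = [s for s, _ in rollouts]
    ends = [e for _, e in rollouts]
    pseudo = (mean(starts), mean(ends))
    spread = pvariance(starts) + pvariance(ends)
    # Assumption: higher variance across rollouts means lower confidence.
    confidence = 1.0 / (1.0 + spread)
    return pseudo, confidence


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def weighted_rewards(rollouts: List[Segment]) -> List[float]:
    """Reward each rollout by its IoU with the pseudo label, scaled by confidence."""
    pseudo, confidence = pseudo_label_with_confidence(rollouts)
    return [confidence * temporal_iou(r, pseudo) for r in rollouts]


# Rollouts that agree give full-confidence rewards; disagreement shrinks them.
agreeing = weighted_rewards([(2.0, 5.0), (2.0, 5.0), (2.0, 5.0)])
disagreeing = weighted_rewards([(0.0, 4.0), (2.0, 6.0)])
```

When the rollouts agree, the confidence stays at 1 and the rewards are unchanged; when they disagree, every reward is scaled down, which is the intended effect of focusing training on reliable supervision.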
This approach eliminates the need for target annotation and keeps both computational and storage overhead low enough to run in real time. Specifically, we introduce Uncertainty-quantified Rollout Policy Adaptation (URPA) for cross-domain knowledge transfer in learning video temporal grounding without target labels. URPA generates multiple candidate predictions using GRPO rollouts, averages them to form a pseudo label, and estimates confidence from the variance across these rollouts. This confidence then weights the training rewards, guiding the model to focus on reliable supervision. Experiments on three datasets across six cross-domain settings show that URPA generalises well using only a few unlabelled target videos. Code will be released upon publication.[125] Mixture of Experts Guided by Gaussian Splatters Matters: A new Approach to Weakly-Supervised Video Anomaly Detection
Giacomo D'Amicantonio,Snehashis Majhi,Quan Kong,Lorenzo Garattoni,Gianpiero Francesca,François Bremond,Egor Bondarev
Main category: cs.CV
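TL;DR: We propose GS-MoE, a new framework that employs a set of expert models, each specialized in capturing a specific anomaly type, to address weakly-supervised video anomaly detection.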
Details
Motivation: Current models cannot handle the diversity of anomaly types, and the weak supervision signal lacks precise temporal information. Method: We propose GS-MoE, a new framework that employs a set of expert models, each specialized in capturing a specific anomaly type. Result: Our method achieves a 91.58% AUC on the UCF-Crime dataset and shows superior results on the XD-Violence and MSAD datasets. Conclusion: By leveraging category-specific expertise and temporal guidance, GS-MoE sets a new benchmark for VAD under weak supervision. Abstract: Video Anomaly Detection (VAD) is a challenging task due to the variability of anomalous events and the limited availability of labeled data. Under the Weakly-Supervised VAD (WSVAD) paradigm, only video-level labels are provided during training, while predictions are made at the frame level. Although state-of-the-art models perform well on simple anomalies (e.g., explosions), they struggle with complex real-world events (e.g., shoplifting). This difficulty stems from two key issues: (1) the inability of current models to address the diversity of anomaly types, as they process all categories with a shared model, overlooking category-specific features; and (2) the weak supervision signal, which lacks precise temporal information, limiting the ability to capture nuanced anomalous patterns blended with normal events. To address these challenges, we propose Gaussian Splatting-guided Mixture of Experts (GS-MoE), a novel framework that employs a set of expert models, each specialized in capturing specific anomaly types. These experts are guided by a temporal Gaussian splatting loss, enabling the model to leverage temporal consistency and enhance weak supervision. The Gaussian splatting approach encourages a more precise and comprehensive representation of anomalies by focusing on temporal segments most likely to contain abnormal events. The predictions from these specialized experts are integrated through a mixture-of-experts mechanism to model complex relationships across diverse anomaly patterns. Our approach achieves state-of-the-art performance, with a 91.58% AUC on the UCF-Crime dataset, and demonstrates superior results on XD-Violence and MSAD datasets. 
By leveraging category-specific expertise and temporal guidance, GS-MoE sets a new benchmark for VAD under weak supervision.[126] Can Diffusion Models Bridge the Domain Gap in Cardiac MR Imaging?
Xin Ci Wong,Duygu Sarikaya,Kieran Zucker,Marc De Kamps,Nishant Ravikumar
Main category: cs.CV
TL;DR: This paper proposes a diffusion model to generate synthetic cardiac MR images, improving segmentation performance and addressing domain shift issues in AI deployment for medical imaging.
Details
Motivation: Domain shifts in MR imaging due to variations in devices and protocols limit the real-world performance of AI models. Traditional approaches like augmentation and transfer learning have limitations, and synthetic data generation faces anatomical consistency challenges. Method: A diffusion model was trained on a source domain to generate synthetic cardiac MR images resembling a reference, evaluated using segmentation networks like 2D nnU-Net, 3D nnU-Net, and vanilla U-Net for domain generalization and adaptation. Result: The synthetic data generation approach improved segmentation performance on unseen domains, significantly enhancing surface-based metrics (p < 0.01) compared to training on real data alone, reducing the need for transfer or online learning. Conclusion: The proposed diffusion model effectively generates synthetic cardiac MR images that maintain structural fidelity and improve segmentation performance, offering a promising solution for domain shift challenges in real-world applications. Abstract: Magnetic resonance (MR) imaging, including cardiac MR, is prone to domain shift due to variations in imaging devices and acquisition protocols. This challenge limits the deployment of trained AI models in real-world scenarios, where performance degrades on unseen domains. Traditional solutions involve increasing the size of the dataset through ad-hoc image augmentation or additional online training/transfer learning, which have several limitations. Synthetic data offers a promising alternative, but anatomical/structural consistency constraints limit the effectiveness of generative models in creating image-label pairs. To address this, we propose a diffusion model (DM) trained on a source domain that generates synthetic cardiac MR images that resemble a given reference. The synthetic data maintains spatial and structural fidelity, ensuring similarity to the source domain and compatibility with the segmentation mask. 
We assess the utility of our generative approach in multi-centre cardiac MR segmentation, using the 2D nnU-Net, 3D nnU-Net and vanilla U-Net segmentation networks. We explore domain generalisation, where domain-invariant segmentation models are trained on synthetic source domain data, and domain adaptation, where we shift target domain data towards the source domain using the DM. Both strategies significantly improved segmentation performance on data from an unseen target domain, in terms of surface-based metrics (Welch's t-test, p < 0.01), compared to training segmentation models on real data alone. The proposed method ameliorates the need for transfer learning or online training to address domain shift challenges in cardiac MR image analysis, which is especially useful in data-scarce settings.[127] ViPro-2: Unsupervised State Estimation via Integrated Dynamics for Guiding Video Prediction
Patrick Takenaka,Johannes Maucher,Marco F. Huber
Main category: cs.CV
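TL;DR: This paper improves the ViPro model so that it can correctly infer states from observations in an unsupervised manner, and extends the Orbits dataset to bring it closer to real-world scenarios.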
Details
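Motivation: ViPro's performance in complex dynamical settings showed that procedural knowledge can help deep models, but it relied on a given ground truth initial symbolic state, which left it unable to estimate states accurately under noise. This paper aims to fix that so the model can correctly infer states from observations. Method: ViPro is improved so that it can infer states from observations in an unsupervised manner, and the original Orbits dataset is extended with a 3D variant to narrow the gap to real-world scenarios. Result: The improved model correctly infers states from observations without supervision, and the new 3D Orbits dataset narrows the gap to real-world scenarios. Conclusion: The earlier version of ViPro relied on a given ground truth initial symbolic state, which led the model to learn a shortcut that does not connect the observed environment with the predicted symbolic state. This paper proposes several improvements that let the model correctly infer states from observations without a full ground truth initial state. Abstract: Predicting future video frames is a challenging task with many downstream applications. Previous work has shown that procedural knowledge enables deep models for complex dynamical settings; however, their model ViPro assumed a given ground truth initial symbolic state. We show that this approach led to the model learning a shortcut that does not actually connect the observed environment with the predicted symbolic state, resulting in the inability to estimate states given an observation if previous states are noisy. In this work, we add several improvements to ViPro that enable the model to correctly infer states from observations without providing a full ground truth state in the beginning. We show that this is possible in an unsupervised manner, and extend the original Orbits dataset with a 3D variant to close the gap to real-world scenarios.[128] Street View Sociability: Interpretable Analysis of Urban Social Behavior Across 15 Cities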
Kieran Elrod,Katherine Flanigan,Mario Bergés
Main category: cs.CV
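TL;DR: By analyzing social interaction information in street view images, this study finds results consistent with urban planning theory and proposes a new tool for studying urban sociability.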
Details
Motivation: Designing socially active streets has long been a goal of urban planning, but existing quantitative research mostly measures pedestrian volume rather than the quality of social interactions. Method: The study analyzed 2,998 street view images from 15 cities using a multimodal large language model guided by Mehta's taxonomy of passive, fleeting, and enduring sociability, with linear regression models that control for factors such as weather, time of day, and pedestrian counts. Result: The results are consistent with long-standing urban planning theory: the sky view index was associated with all three sociability types, the green view index predicted enduring sociability, and place attachment was positively associated with fleeting sociability. Conclusion: The study indicates that street view images can be used to infer relationships between specific types of social interactions and built environment variables, and could provide a scalable, privacy-preserving tool for cross-cultural theory testing and evidence-based design of socially vibrant cities. Abstract: Designing socially active streets has long been a goal of urban planning, yet existing quantitative research largely measures pedestrian volume rather than the quality of social interactions. We hypothesize that street view imagery -- an inexpensive data source with global coverage -- contains latent social information that can be extracted and interpreted through established social science theory. As a proof of concept, we analyzed 2,998 street view images from 15 cities using a multimodal large language model guided by Mehta's taxonomy of passive, fleeting, and enduring sociability -- one illustrative example of a theory grounded in urban design that could be substituted or complemented by other sociological frameworks. We then used linear regression models, controlling for factors like weather, time of day, and pedestrian counts, to test whether the inferred sociability measures correlate with city-level place attachment scores from the World Values Survey and with environmental predictors (e.g., green, sky, and water view indices) derived from individual street view images. Results aligned with long-standing urban planning theory: the sky view index was associated with all three sociability types, the green view index predicted enduring sociability, and place attachment was positively associated with fleeting sociability. These results provide preliminary evidence that street view images can be used to infer relationships between specific types of social interactions and built environment variables. 
Further research could establish street view imagery as a scalable, privacy-preserving tool for studying urban sociability, enabling cross-cultural theory testing and evidence-based design of socially vibrant cities.[129] Aligning Effective Tokens with Video Anomaly in Large Language Models
Yingxian Chen,Jiahui Liu,Ruifan Di,Yanwei Li,Chirui Chang,Shizhen Zhao,Wilton W. T. Fok,Xiaojuan Qi,Yik-Chung Wu
Main category: cs.CV
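TL;DR: The paper proposes VA-GPT, a new approach to analyzing abnormal events in videos that improves the accuracy of summarizing and localizing such events through purpose-built modules and a dedicated dataset.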
Details
Motivation: Current video understanding multimodal large language models struggle with the redundant information that results from the spatial and temporal sparsity of abnormal events. Method: The paper proposes VA-GPT, a new multimodal large language model with Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG) modules, and constructs a dataset specifically for fine-tuning video-anomaly-aware MLLMs. Result: VA-GPT captures and analyzes the spatial and temporal information associated with abnormal events, producing more accurate responses and interactions. Conclusion: VA-GPT outperforms existing state-of-the-art methods on multiple benchmarks, demonstrating its effectiveness in summarizing and localizing abnormal events in videos. Abstract: Understanding abnormal events in videos is a vital and challenging task that has garnered significant attention in a wide range of applications. Although current video understanding Multi-modal Large Language Models (MLLMs) are capable of analyzing general videos, they often struggle to handle anomalies due to the spatial and temporal sparsity of abnormal events, where the redundant information always leads to suboptimal outcomes. To address these challenges, exploiting the representation and generalization capabilities of Vision Language Models (VLMs) and Large Language Models (LLMs), we propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules: Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG). These modules enable our model to effectively capture and analyze both spatial and temporal information associated with abnormal events, resulting in more accurate responses and interactions. Furthermore, we construct an instruction-following dataset specifically for fine-tuning video-anomaly-aware MLLMs, and introduce a cross-domain evaluation benchmark based on the XD-Violence dataset. Our proposed method outperforms existing state-of-the-art methods on various benchmarks.[130] An Implemention of Two-Phase Image Segmentation using the Split Bregman Method
Olakunle S. Abawonse,Günay Doğan
Main category: cs.CV
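TL;DR: This paper implements the two-phase image segmentation algorithm based on the split Bregman method proposed by Goldstein et al., and experimentally validates its effectiveness and robustness for image segmentation.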
Details
Motivation: The goal is to implement the efficient image segmentation algorithm proposed by Goldstein, Bresson, and Osher, which addresses the computational inefficiency of the traditional Chan-Vese model, and to validate experimentally its effectiveness across different images and parameters. Method: Building on the region membership function model proposed by Chan and Vese, the authors use the energy as modified by Goldstein et al. and minimize it efficiently with the split Bregman method, segmenting the image into foreground and background. Result: Experiments show that the method effectively separates image foreground from background and performs well across a variety of images and parameter settings. Conclusion: The paper provides a detailed implementation of the two-phase image segmentation algorithm proposed by Goldstein, Bresson, and Osher, and documents its performance on several images under different parameters. Abstract: In this paper, we describe an implementation of the two-phase image segmentation algorithm proposed by Goldstein, Bresson, and Osher in \cite{gold:bre}. This algorithm partitions the domain of a given 2d image into foreground and background regions, and each pixel of the image is assigned membership to one of these two regions. The underlying assumption for the segmentation model is that the pixel values of the input image can be summarized by two distinct average values, and that the region boundaries are smooth. Accordingly, the model is defined as an energy in which the variable is a region membership function to assign pixels to either region, originally proposed by Chan and Vese in \cite{chan:vese}. This energy is the sum of image data terms in the regions and a length penalty for region boundaries. Goldstein, Bresson, and Osher modify the energy of Chan-Vese in \cite{gold:bre} so that their new energy can be minimized efficiently using the split Bregman method to produce an equivalent two-phase segmentation. We provide a detailed implementation of this method \cite{gold:bre}, and document its performance with several images over a range of algorithm parameters.[131] Are you In or Out (of gallery)? Wisdom from the Same-Identity Crowd
Aman Bhatta,Maria Dhakal,Michael C. King,Kevin W. Bowyer
Main category: cs.CV
TL;DR: The paper proposes a new approach to detect Out-of-gallery results in one-to-many facial identification by using additional enrolled images to train a classifier, showing viability across different datasets, matchers, and demographic groups, especially with matchers trained using advanced margin-based loss functions.
Details
Motivation: A central problem in one-to-many facial identification is determining whether the person in the probe image is In-gallery or Out-of-gallery. Past approaches have focused on finding a suitable threshold on the similarity score for detection, but this work introduces a new approach using additional enrolled images to make this prediction. Method: The method involves generating In-gallery and Out-of-gallery training data by extracting ranks of additional enrolled images corresponding to the rank-one identity. A classifier is then trained to predict whether a rank-one result is In-gallery or Out-of-gallery based on this feature vector. Result: Experimental results using two datasets and four different matchers show that the approach is viable for mugshot quality probe images and also for probes degraded by blur, reduced resolution, atmospheric turbulence, and sunglasses. The classification accuracy is similar across demographic groups. Conclusion: The proposed approach has the potential to provide an objective estimate of whether a one-to-many facial identification is Out-of-gallery, thereby reducing false positive identifications, wrongful arrests, and wasted investigative time. The effectiveness of the Out-of-gallery detection approach emerges only with matchers trained using advanced margin-based loss functions. Abstract: A central problem in one-to-many facial identification is that the person in the probe image may or may not have enrolled image(s) in the gallery; that is, may be In-gallery or Out-of-gallery. Past approaches to detect when a rank-one result is Out-of-gallery have mostly focused on finding a suitable threshold on the similarity score. We take a new approach, using the additional enrolled images of the identity with the rank-one result to predict if the rank-one result is In-gallery / Out-of-gallery. 
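The rank-based feature described above can be illustrated with a minimal sketch. It assumes the matcher returns one similarity score per enrolled gallery image; the function name `rank_features` and the `(identity, similarity)` input layout are hypothetical and not taken from the paper.

```python
def rank_features(gallery_scores):
    """Return the rank-one identity and the ranks of its other enrolled images.

    gallery_scores: list of (identity, similarity) pairs, one per enrolled
    gallery image, each scored against a single probe (assumed layout).
    """
    # Sort gallery images from most to least similar to the probe.
    ordered = sorted(gallery_scores, key=lambda item: item[1], reverse=True)
    top_identity = ordered[0][0]
    # Collect the 1-based ranks of the remaining images of the rank-one identity.
    ranks = [
        rank
        for rank, (identity, _) in enumerate(ordered, start=1)
        if identity == top_identity and rank > 1
    ]
    return top_identity, ranks


scores = [("a", 0.9), ("b", 0.8), ("a", 0.7), ("c", 0.6), ("a", 0.2)]
top, ranks = rank_features(scores)
print(top, ranks)
```

A classifier would then take such rank lists as input; how the ranks are encoded into a fixed-length vector is not specified here and is left open in this sketch.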
Given a gallery of identities and images, we generate In-gallery and Out-of-gallery training data by extracting the ranks of additional enrolled images corresponding to the rank-one identity. We then train a classifier to utilize this feature vector to predict whether a rank-one result is In-gallery or Out-of-gallery. Using two different datasets and four different matchers, we present experimental results showing that our approach is viable for mugshot quality probe images, and also, importantly, for probes degraded by blur, reduced resolution, atmospheric turbulence and sunglasses. We also analyze results across demographic groups, and show that In-gallery / Out-of-gallery classification accuracy is similar across demographics. Our approach has the potential to provide an objective estimate of whether a one-to-many facial identification is Out-of-gallery, and thereby to reduce false positive identifications, wrongful arrests, and wasted investigative time. Interestingly, comparing the results of older deep CNN-based face matchers with newer ones suggests that the effectiveness of our Out-of-gallery detection approach emerges only with matchers trained using advanced margin-based loss functions.[132] Text as Any-Modality for Zero-Shot Classification by Consistent Prompt Tuning
Xiangyu Wu,Feng Yu,Yang Yang,Jianfeng Lu
Main category: cs.CV
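TL;DR: This study proposes TaAM-CPT, a scalable method that handles tasks across multiple modalities without modality-specific labeled data and performs strongly on several tasks.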
Details
Motivation: Existing multimodal learning methods depend heavily on large amounts of modality-specific labeled data or apply only to a single modality, so a more general and scalable approach is needed. Method: TaAM-CPT comprises modality prompt pools, text construction, and modality-aligned text encoders, and designs intra- and inter-modal learning objectives to harmonize learning across modalities. Result: Without any modality-specific labeled data, TaAM-CPT achieves leading results on datasets spanning multiple modalities, including video, image, and audio classification. Conclusion: TaAM-CPT is a scalable approach that builds a general representation model toward unlimited modalities using only text data. Abstract: The integration of prompt tuning with multimodal learning has shown significant generalization abilities for various downstream tasks. Despite advancements, existing methods heavily depend on massive modality-specific labeled data (e.g., video, audio, and image), or are customized for a single modality. In this study, we present Text as Any-Modality by Consistent Prompt Tuning (TaAM-CPT), a scalable approach for constructing a general representation model toward unlimited modalities using solely text data. TaAM-CPT comprises modality prompt pools, text construction, and modality-aligned text encoders from pre-trained models, which allows for extending new modalities by simply adding prompt pools and modality-aligned text encoders. To harmonize the learning across different modalities, TaAM-CPT designs intra- and inter-modal learning objectives, which can capture category details within modalities while maintaining semantic consistency across different modalities. Benefiting from its scalable architecture and pre-trained models, TaAM-CPT can be seamlessly extended to accommodate unlimited modalities. Remarkably, without any modality-specific labeled data, TaAM-CPT achieves leading results on diverse datasets spanning various modalities, including video classification, image classification, and audio classification. The code is available at https://github.com/Jinx630/TaAM-CPT.[133] FVGen: Accelerating Novel-View Synthesis with Adversarial Video Diffusion Distillation
Wenbin Teng,Gonglin Chen,Haiwei Chen,Yajie Zhao
Main category: cs.CV
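TL;DR: FVGen is a new framework that distills a video diffusion model into a few-step denoising model, substantially speeding up novel-view generation from sparse input views.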
Details
Motivation: Despite progress in 3D reconstruction, sparse views remain challenging and often produce artifacts in unseen areas. Recent work uses Video Diffusion Models (VDMs) to generate dense observations that fill the gaps when only sparse views are available, but the slow sampling speed of VDMs is a significant limitation. Method: FVGen distills a multi-step denoising teacher model into a few-step denoising student model using Generative Adversarial Networks (GANs) and softened reverse KL-divergence minimization. Result: Compared to previous work, FVGen generates the same number of novel views with similar (or better) visual quality while reducing sampling time by more than 90%. Conclusion: FVGen significantly improves time efficiency for downstream reconstruction tasks with sparse input views, particularly when a pre-trained VDM must be run multiple times to achieve better spatial coverage. Abstract: Recent progress in 3D reconstruction has enabled realistic 3D models from dense image captures, yet challenges persist with sparse views, often leading to artifacts in unseen areas. Recent works leverage Video Diffusion Models (VDMs) to generate dense observations, filling the gaps when only sparse views are available for 3D reconstruction tasks. A significant limitation of these methods is their slow sampling speed when using VDMs. In this paper, we present FVGen, a novel framework that addresses this challenge by enabling fast novel view synthesis using VDMs in as few as four sampling steps. We propose a novel video diffusion model distillation method that distills a multi-step denoising teacher model into a few-step denoising student model using Generative Adversarial Networks (GANs) and softened reverse KL-divergence minimization. Extensive experiments on real-world datasets show that, compared to previous works, our framework generates the same number of novel views with similar (or even better) visual quality while reducing sampling time by more than 90%. FVGen significantly improves time efficiency for downstream reconstruction tasks, particularly when working with sparse input views (more than 2) where pre-trained VDMs need to be run multiple times to achieve better spatial coverage.[134] A Classification-Aware Super-Resolution Framework for Ship Targets in SAR Imagery
Ch Muhammad Awais,Marco Reggiannini,Davide Moroni,Oktay Karakus
Main category: cs.CV
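TL;DR: This paper studies the relationship between super-resolution and classification performance and proposes a new method that increases the resolution of synthetic aperture radar imagery by optimizing loss functions that combine image quality and classification objectives.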
Details
Motivation: Traditional approaches focus mainly on enhancing image quality according to pixel-level metrics, leaving the relationship between super-resolved image fidelity and downstream classification performance underexplored. Method: The paper explores the relationship between super-resolution and classification by deploying a specialised algorithmic strategy, and proposes a new method that combines image quality and classification objectives. Result: The method improves both image quality and classification accuracy. Conclusion: The paper proposes a new method that increases the resolution of synthetic aperture radar imagery by optimising loss functions that account for both image quality and classification performance, improving image quality while also enhancing classification accuracy. Abstract: High-resolution imagery plays a critical role in improving the performance of visual recognition tasks such as classification, detection, and segmentation. In many domains, including remote sensing and surveillance, low-resolution images can limit the accuracy of automated analysis. To address this, super-resolution (SR) techniques have been widely adopted to attempt to reconstruct high-resolution images from low-resolution inputs. Related traditional approaches focus solely on enhancing image quality based on pixel-level metrics, leaving the relationship between super-resolved image fidelity and downstream classification performance largely underexplored. This raises a key question: can integrating classification objectives directly into the super-resolution process further improve classification accuracy? In this paper, we try to respond to this question by investigating the relationship between super-resolution and classification through the deployment of a specialised algorithmic strategy. We propose a novel methodology that increases the resolution of synthetic aperture radar imagery by optimising loss functions that account for both image quality and classification performance. Our approach improves image quality, as measured by scientifically ascertained image quality indicators, while also enhancing classification accuracy.[135] Feature-Space Oversampling for Addressing Class Imbalance in SAR Ship Classification
Ch Muhammad Awais,Marco Reggiannini,Davide Moroni,Oktay Karakus
Main category: cs.CV
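TL;DR: This paper examines oversampling as a remedy for class imbalance in SAR ship classification and proposes two new feature-space algorithms, M2m_f and M2m_u, which experiments show outperform the original M2m and baseline methods.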
Details
Motivation: SAR ship classification faces the challenge of long-tailed datasets, which makes underrepresented classes hard to classify. Oversampling methods have proven effective at addressing class imbalance in optical data. Method: Proposes two new algorithms inspired by the Major-to-minor (M2m) method, M2m_f and M2m_u, and tests them on two public datasets, OpenSARShip (6 classes) and FuSARShip (9 classes), using three state-of-the-art models, ViT, VGG16, and ResNet50, as feature extractors. Result: The average F1-score increases by 8.82% on FuSARShip and by 4.44% on OpenSARShip. Conclusion: The proposed methods are effective at addressing class imbalance in SAR ship classification. Abstract: SAR ship classification faces the challenge of long-tailed datasets, which complicates the classification of underrepresented classes. Oversampling methods have proven effective in addressing class imbalance in optical data. In this paper, we evaluated the effect of oversampling in the feature space for SAR ship classification. We propose two novel algorithms inspired by the Major-to-minor (M2m) method: M2m$_f$ and M2m$_u$. The algorithms are tested on two public datasets, OpenSARShip (6 classes) and FuSARShip (9 classes), using three state-of-the-art models as feature extractors: ViT, VGG16, and ResNet50. We also analyzed the impact of oversampling methods on different class sizes. The results demonstrated the effectiveness of our novel methods over the original M2m and baselines, with an average F1-score increase of 8.82% for FuSARShip and 4.44% for OpenSARShip.[136] SPARSE Data, Rich Results: Few-Shot Semi-Supervised Learning via Class-Conditioned Image Translation
Guido Manni,Clemente Lauretti,Loredana Zollo,Paolo Soda
Main category: cs.CV
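TL;DR: This paper proposes a new GAN-based semi-supervised learning framework that combines a generator, a discriminator, and a classifier to substantially improve medical image classification when very little labeled data is available.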
Details
Motivation: Deep learning in medical imaging is limited by insufficient labeled data; this paper aims to develop a semi-supervised learning method that works effectively in low labeled-data regimes. Method: The method combines three specialized neural networks, a generator, a discriminator, and a classifier, within a three-phase training framework, using limited labeled data for supervised learning and abundant unlabeled data for unsupervised learning. Result: Experiments on 11 MedMNIST datasets show that the method outperforms six state-of-the-art GAN-based semi-supervised methods in settings with 5 to 50 labeled samples per class, with particularly strong performance in the 5-shot setting. Conclusion: The paper proposes a novel GAN-based semi-supervised learning framework that outperforms existing methods on medical image classification, especially when labeled data is scarce. Abstract: Deep learning has revolutionized medical imaging, but its effectiveness is severely limited by insufficient labeled training data. This paper introduces a novel GAN-based semi-supervised learning framework specifically designed for low labeled-data regimes, evaluated across settings with 5 to 50 labeled samples per class. Our approach integrates three specialized neural networks -- a generator for class-conditioned image translation, a discriminator for authenticity assessment and classification, and a dedicated classifier -- within a three-phase training framework. The method alternates between supervised training on limited labeled data and unsupervised learning that leverages abundant unlabeled images through image-to-image translation rather than generation from noise. We employ ensemble-based pseudo-labeling that combines confidence-weighted predictions from the discriminator and classifier with temporal consistency through exponential moving averaging, enabling reliable label estimation for unlabeled data. Comprehensive evaluation across eleven MedMNIST datasets demonstrates that our approach achieves statistically significant improvements over six state-of-the-art GAN-based semi-supervised methods, with particularly strong performance in the extreme 5-shot setting where the scarcity of labeled data is most challenging. The framework maintains its superiority across all evaluated settings (5, 10, 20, and 50 shots per class). Our approach offers a practical solution for medical imaging applications where annotation costs are prohibitive, enabling robust classification performance even with minimal labeled data. 
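The ensemble pseudo-labeling step mentioned in the abstract can be sketched as follows. This is only an illustration under stated assumptions: each network's confidence is taken to be its maximum class probability, the two predictions are averaged with those confidences as weights, and the result is smoothed with an exponential moving average. The function name, the weighting rule, and the `momentum` value are hypothetical; the paper's actual formulation may differ.

```python
def ensemble_pseudo_label(p_disc, p_clf, ema_prev, momentum=0.9):
    """Combine two class-probability vectors and smooth them over time.

    p_disc, p_clf: per-class probabilities from the discriminator and the
    classifier for one unlabeled sample (assumed inputs).
    ema_prev: running average of combined predictions from earlier steps.
    """
    # Assumption: confidence of each network = its maximum class probability.
    w_disc = max(p_disc)
    w_clf = max(p_clf)
    total = w_disc + w_clf
    # Confidence-weighted average of the two predictions.
    combined = [(w_disc * d + w_clf * c) / total for d, c in zip(p_disc, p_clf)]
    # Exponential moving average for temporal consistency.
    ema = [momentum * e + (1.0 - momentum) * x for e, x in zip(ema_prev, combined)]
    # Pseudo-label = class with the highest smoothed probability.
    label = max(range(len(ema)), key=ema.__getitem__)
    return ema, label


ema, label = ensemble_pseudo_label([0.9, 0.1], [0.6, 0.4], [0.5, 0.5])
print(ema, label)
```

With a high momentum, a single confident but wrong prediction moves the running average only slightly, which is the usual reason for adding this kind of temporal smoothing.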
Code is available at https://github.com/GuidoManni/SPARSE.[137] MotionSwap
Om Patil,Jinesh Modi,Suryabha Mukhopadhyay,Meghaditya Giri,Chhavi Malhotra
Main category: cs.CV
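TL;DR: This paper presents an improved version of SimSwap, an efficient face-swapping framework, that significantly improves model performance by introducing attention mechanisms, dynamic loss weighting, and learning rate scheduling, and identifies directions for future work.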
Details
Motivation: Face swapping technology has attracted wide attention in both academic research and commercial applications; the paper aims to improve the fidelity and performance of the SimSwap model. Method: The paper proposes improvements to SimSwap, including self- and cross-attention mechanisms in the generator architecture, dynamic loss weighting, and cosine annealing learning rate scheduling. Result: Over 400,000 training iterations, the enhanced model outperforms the baseline with better identity similarity, lower FID scores, and visibly superior results; ablation studies confirm the importance of each improvement. Conclusion: The enhanced SimSwap model achieves significant improvements in identity preservation, attribute consistency, and overall visual quality; future directions include integrating StyleGAN3, improving lip synchronization, incorporating 3D facial modeling, and adding temporal consistency for video applications. Abstract: Face swapping technology has gained significant attention in both academic research and commercial applications. This paper presents our implementation and enhancement of SimSwap, an efficient framework for high fidelity face swapping. We introduce several improvements to the original model, including the integration of self and cross-attention mechanisms in the generator architecture, dynamic loss weighting, and cosine annealing learning rate scheduling. These enhancements lead to significant improvements in identity preservation, attribute consistency, and overall visual quality. Our experimental results, spanning 400,000 training iterations, demonstrate progressive improvements in generator and discriminator performance. The enhanced model achieves better identity similarity, lower FID scores, and visibly superior qualitative results compared to the baseline. Ablation studies confirm the importance of each architectural and training improvement. We conclude by identifying key future directions, such as integrating StyleGAN3, improving lip synchronization, incorporating 3D facial modeling, and introducing temporal consistency for video-based applications.[138] CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment
Shengzhu Yang,Jiawei Du,Shuai Lu,Weihang Zhang,Ningli Wang,Huiqi Li
Main category: cs.CV
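TL;DR: This study proposes CLIPin, a new non-contrastive plug-in that improves multimodal semantic alignment in CLIP models, with good compatibility and broad application potential.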
Details
Motivation: Large-scale natural image-text datasets suffer from loose semantic alignment due to weak supervision, while medical datasets have high cross-modal correlation but low content diversity; both hinder the ability of contrastive language-image pretraining (CLIP) models to learn robust and generalizable representations. Method: Proposes the CLIPin plug-in and designs two shared pre-projectors, one each for the image and text modalities, to facilitate the integration of contrastive and non-contrastive learning in a parameter-compromise manner. Result: Extensive experiments on various downstream tasks demonstrate the effectiveness and generality of CLIPin as a plug-and-play component. Conclusion: CLIPin is a unified non-contrastive plug-in that effectively improves multimodal semantic alignment in CLIP-style architectures while offering good compatibility and generality. Abstract: Large-scale natural image-text datasets, especially those automatically collected from the web, often suffer from loose semantic alignment due to weak supervision, while medical datasets tend to have high cross-modal correlation but low content diversity. These properties pose a common challenge for contrastive language-image pretraining (CLIP): they hinder the model's ability to learn robust and generalizable representations. In this work, we propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures to improve multimodal semantic alignment, providing stronger supervision and enhancing alignment robustness. Furthermore, two shared pre-projectors are designed for image and text modalities respectively to facilitate the integration of contrastive and non-contrastive learning in a parameter-compromise manner. Extensive experiments on diverse downstream tasks demonstrate the effectiveness and generality of CLIPin as a plug-and-play component compatible with various contrastive frameworks. Code is available at https://github.com/T6Yang/CLIPin.[139] TRUST: Leveraging Text Robustness for Unsupervised Domain Adaptation
Mattia Litrico,Mario Valerio Giuffrida,Sebastiano Battiato,Devis Tuia
Main category: cs.CV
TL;DR: TRUST is a new UDA approach leveraging language modality to guide vision model adaptation, featuring pseudo-label generation, uncertainty estimation, and multimodal contrastive learning, achieving superior performance on domain shift benchmarks.
Details
Motivation: Existing UDA methods struggle with complex domain shifts (e.g., geographical shifts), where both background and object appearances differ significantly. Language modality has shown robustness to such shifts, motivating its use in guiding vision model adaptation. Method: TRUST uses the robustness of the language modality to guide vision model adaptation. It generates pseudo-labels from captions, estimates pseudo-label uncertainty using normalized CLIP similarity scores, and reweights classification loss. It also introduces a multimodal soft-contrastive learning loss to align vision and language feature spaces. Result: TRUST achieves superior performance compared to previous methods on both classical and complex domain shift benchmarks, setting a new state-of-the-art. Conclusion: The proposed TRUST method outperforms previous UDA approaches, establishing a new state-of-the-art on both classical and complex domain shifts, including DomainNet and GeoNet. Abstract: Recent unsupervised domain adaptation (UDA) methods have shown great success in addressing classical domain shifts (e.g., synthetic-to-real), but they still suffer under complex shifts (e.g. geographical shift), where both the background and object appearances differ significantly across domains. Prior works showed that the language modality can help in the adaptation process, exhibiting more robustness to such complex shifts. In this paper, we introduce TRUST, a novel UDA approach that exploits the robustness of the language modality to guide the adaptation of a vision model. TRUST generates pseudo-labels for target samples from their captions and introduces a novel uncertainty estimation strategy that uses normalised CLIP similarity scores to estimate the uncertainty of the generated pseudo-labels. Such estimated uncertainty is then used to reweight the classification loss, mitigating the adverse effects of wrong pseudo-labels obtained from low-quality captions. 
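The uncertainty-based reweighting just described can be sketched as below. This is a rough illustration, not the authors' implementation: it assumes the CLIP similarity scores between an image and each class prompt are normalised with a softmax, and that the normalised score of the pseudo-label class is used directly as the weight on a cross-entropy loss. The function names and this particular normalisation are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def reweighted_ce(logits, pseudo_label, clip_scores):
    """Cross-entropy on a pseudo-label, scaled by a confidence weight.

    logits: vision model outputs for one target image.
    pseudo_label: class index derived from the image caption.
    clip_scores: image-to-class-prompt similarities (assumed input).
    """
    # Assumption: confidence = normalised similarity of the pseudo-label class.
    weight = softmax(clip_scores)[pseudo_label]
    ce = -math.log(softmax(logits)[pseudo_label])
    # Low-confidence pseudo-labels contribute less to the total loss.
    return weight * ce


confident = reweighted_ce([0.0, 0.0], 0, [5.0, 0.0])
uncertain = reweighted_ce([0.0, 0.0], 0, [0.0, 0.0])
print(confident, uncertain)
```

In this sketch, a pseudo-label whose class is only weakly supported by the similarity scores receives a smaller weight, so a wrong label from a poor caption pulls the model less.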
To further increase the robustness of the vision model, we propose a multimodal soft-contrastive learning loss that aligns the vision and language feature spaces, by leveraging captions to guide the contrastive training of the vision model on target images. In our contrastive loss, each pair of images acts as both a positive and a negative pair and their feature representations are attracted and repulsed with a strength proportional to the similarity of their captions. This solution avoids the need to make hard assignments of positive and negative pairs, which is critical in the UDA setting. Our approach outperforms previous methods, setting a new state of the art on classical (DomainNet) and complex (GeoNet) domain shifts. The code will be available upon acceptance.[140] Text Embedded Swin-UMamba for DeepLesion Segmentation
Ruida Cheng,Tejas Sudharshan Mathai,Pritam Mukherjee,Benjamin Hou,Qingqing Zhu,Zhiyong Lu,Matthew McAuliffe,Ronald M. Summers
Main category: cs.CV
TL;DR: This study introduces the Text-Swin-UMamba model that integrates text and imaging features for lesion segmentation, achieving high accuracy and outperforming existing methods.
Details
Motivation: Lesion segmentation on CT scans is crucial for clinical assessment of chronic diseases like lymphoma. Integrating imaging features with textual descriptions from radiology reports using large language models (LLMs) can enhance segmentation accuracy and clinical relevance. Method: The study integrates text descriptions from radiology reports into the Swin-UMamba architecture (Text-Swin-UMamba) for lesion segmentation. The publicly available ULS23 DeepLesion dataset and corresponding short-form descriptions of findings are used for training and testing. Result: On the test dataset, the Text-Swin-UMamba model achieved a Dice Score of 82% and a Hausdorff distance of 6.58 pixels. It outperformed previous approaches, including the LLM-driven LanGuideMedSeg model by 37% (p < 0.001), and surpassed the image-based xLSTM-UNet and nnUNet models by 1.74% and 0.22%, respectively. Conclusion: The proposed Text-Swin-UMamba model demonstrates superior performance in lesion segmentation by integrating text and imaging features, outperforming existing methods significantly. Abstract: Segmentation of lesions on CT enables automatic measurement for clinical assessment of chronic diseases (e.g., lymphoma). Integrating large language models (LLMs) into the lesion segmentation workflow offers the potential to combine imaging features with descriptions of lesion characteristics from the radiology reports. In this study, we investigate the feasibility of integrating text into the Swin-UMamba architecture for the task of lesion segmentation. The publicly available ULS23 DeepLesion dataset was used along with short-form descriptions of the findings from the reports. On the test dataset, a high Dice Score of 82% and a low Hausdorff distance of 6.58 pixels were obtained for lesion segmentation. 
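For readers unfamiliar with the Dice Score reported above, the standard definition for two binary masks is twice the overlap divided by the sum of the mask sizes. The short sketch below computes it on flattened 0/1 masks; the function name and the convention of returning 1.0 for two empty masks are choices made for this example, not details from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two flattened binary masks (sequences of 0/1)."""
    # Overlap: pixels marked as lesion in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention for this sketch: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
print(dice_score(pred, target))
```

A score of 82% therefore means the predicted and reference lesion masks overlap substantially but not perfectly.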
The proposed Text-Swin-UMamba model outperformed prior approaches, with a 37% improvement over the LLM-driven LanGuideMedSeg model (p < 0.001), and surpassed the purely image-based xLSTM-UNet and nnUNet models by 1.74% and 0.22%, respectively. The dataset and code can be accessed at https://github.com/ruida/LLM-Swin-UMamba[141] WGAST: Weakly-Supervised Generative Network for Daily 10 m Land Surface Temperature Estimation via Spatio-Temporal Fusion
Sofiane Bouaziz,Adel Hafiane,Raphael Canals,Rachid Nedjai
Main category: cs.CV
TL;DR: This study introduces WGAST, an end-to-end weakly-supervised deep learning framework for daily 10 m Land Surface Temperature estimation using spatio-temporal fusion of satellite data, outperforming existing methods in accuracy and robustness.
Details
Motivation: There is a growing demand for precise and timely environmental monitoring due to urbanization, climate change, and agricultural stress. Current systems face a trade-off between spatial and temporal resolution, and few methods have addressed the estimation of daily LST at 10 m resolution. Method: WGAST uses a Weakly-Supervised Generative Adversarial Network with a four-stage generator (feature extraction, fusion, LST reconstruction, and noise suppression) and a PatchGAN discriminator. Result: WGAST reduces RMSE by 17.18% and improves SSIM by 11.00% compared to the best-performing baseline. It is validated against 33 ground-based sensors and robust to cloud-induced LST. Conclusion: WGAST is the first end-to-end deep learning framework for daily 10 m LST estimation, outperforming existing methods and showing robustness to cloud-induced LST while capturing fine-scale thermal patterns. Abstract: Urbanization, climate change, and agricultural stress are increasing the demand for precise and timely environmental monitoring. Land Surface Temperature (LST) is a key variable in this context and is retrieved from remote sensing satellites. However, these systems face a trade-off between spatial and temporal resolution. While spatio-temporal fusion methods offer promising solutions, few have addressed the estimation of daily LST at 10 m resolution. In this study, we present WGAST, a Weakly-Supervised Generative Network for Daily 10 m LST Estimation via Spatio-Temporal Fusion of Terra MODIS, Landsat 8, and Sentinel-2. WGAST is the first end-to-end deep learning framework designed for this task. It adopts a conditional generative adversarial architecture, with a generator composed of four stages: feature extraction, fusion, LST reconstruction, and noise suppression. 
The first stage employs a set of encoders to extract multi-level latent representations from the inputs, which are then fused in the second stage using cosine similarity, normalization, and temporal attention mechanisms. The third stage decodes the fused features into high-resolution LST, followed by a Gaussian filter to suppress high-frequency noise. Training follows a weakly supervised strategy based on physical averaging principles and reinforced by a PatchGAN discriminator. Experiments demonstrate that WGAST outperforms existing methods in both quantitative and qualitative evaluations. Compared to the best-performing baseline, on average, WGAST reduces RMSE by 17.18% and improves SSIM by 11.00%. Furthermore, WGAST is robust to cloud-induced LST and effectively captures fine-scale thermal patterns, as validated against 33 ground-based sensors. The code is available at https://github.com/Sofianebouaziz1/WGAST.git.[142] LightSwitch: Multi-view Relighting with Material-guided Diffusion
Yehonathan Litman,Fernando De la Torre,Shubham Tulsiani
Main category: cs.CV
TL;DR: This paper proposes LightSwitch, a new method that uses multi-view and material information to improve the relighting of 2D images of 3D objects.