Table of Contents
cs.CL
[1] Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM
Thomas Thebaud,Yen-Ju Lu,Matthew Wiesner,Peter Viechnicki,Najim Dehak
Main category: cs.CL
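TL;DR: This study proposes a new approach that couples audio foundation models with a LLAMA language model to add speaker-characteristic metadata tags to dialogue transcripts while preserving modularity and speed.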
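The Equal Error Rate reported for this entry is the operating point at which the false acceptance rate and the false rejection rate of a speaker verification system coincide. The following is a minimal sketch of that metric, not the authors' evaluation code; the function name `equal_error_rate` and the toy scores are assumptions made for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: similarity scores for same-speaker trials.
    nontarget_scores: similarity scores for different-speaker trials.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (hypothetical): one same-speaker trial scores low and one
# different-speaker trial scores high, so each error rate is 1/4.
same_speaker = [0.9, 0.8, 0.7, 0.4]
different_speaker = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(same_speaker, different_speaker))
```

On real trial lists the two error curves rarely cross exactly at an observed score, so evaluation toolkits typically interpolate; this sketch simply takes the closest threshold.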
Details
Motivation: In dialogue transcription pipelines, LLMs are often used in post-processing to improve grammar, punctuation, and readability. This paper explores a complementary post-processing step that adds metadata tags describing speaker characteristics such as age, gender, and emotion.
Method: Frozen audio foundation models (such as Whisper or WavLM) are coupled with a frozen LLAMA language model, using lightweight connectors to bridge audio and language representations, with no task-specific fine-tuning of either model.
Result: The approach achieves competitive speaker-profiling performance and shows that a frozen LLAMA model can compare x-vectors directly, reaching an Equal Error Rate of 8.8% in some scenarios.
Conclusion: Combining frozen audio foundation models with a LLAMA language model enables effective speaker-characteristic profiling while preserving modularity and speed.
Abstract: In dialogue transcription pipelines, Large Language Models (LLMs) are frequently employed in post-processing to improve grammar, punctuation, and readability. We explore a complementary post-processing step: enriching transcribed dialogues by adding metadata tags for speaker characteristics such as age, gender, and emotion. Some of the tags are global to the entire dialogue, while some are time-variant. Our approach couples frozen audio foundation models, such as Whisper or WavLM, with a frozen LLAMA language model to infer these speaker attributes, without requiring task-specific fine-tuning of either model. Using lightweight, efficient connectors to bridge audio and language representations, we achieve competitive performance on speaker profiling tasks while preserving modularity and speed. Additionally, we demonstrate that a frozen LLAMA model can compare x-vectors directly, achieving an Equal Error Rate of 8.8% in some scenarios.
[2] Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization
Negar Foroutan,Clara Meister,Debjit Paul,Joel Niklaus,Sina Ahmadi,Antoine Bosselut,Rico Sennrich
Main category: cs.CL
TL;DR: This paper introduces Parity-aware BPE, a tokenization method that ensures fairer compression across languages, addressing inequalities caused by traditional frequency-based approaches.
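The Method summary for this entry describes choosing each merge to help the worst-compressed language. A toy sketch of that selection rule follows. It is an illustration under stated assumptions, not the authors' implementation: compression is measured here as tokens per character, the merge is the most frequent adjacent pair in the worst-compressed language, and all function names and the two-language corpus are hypothetical.

```python
from collections import Counter


def pair_counts(words):
    """Count adjacent symbol pairs over a list of tokenized words."""
    counts = Counter()
    for word in words:
        counts.update(zip(word, word[1:]))
    return counts


def apply_merge(word, pair):
    """Replace every occurrence of `pair` in `word` with the merged symbol."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)


def tokens_per_char(words):
    """Higher values mean worse compression (assumed parity measure)."""
    n_tokens = sum(len(word) for word in words)
    n_chars = sum(len(symbol) for word in words for symbol in word)
    return n_tokens / n_chars


def parity_aware_merges(corpora, num_merges):
    """Learn merges, each chosen from the currently worst-compressed language."""
    corpora = {lang: [tuple(w) for w in words] for lang, words in corpora.items()}
    merges = []
    for _ in range(num_merges):
        # Only languages that still have a mergeable pair are candidates.
        candidates = [lang for lang in corpora if pair_counts(corpora[lang])]
        if not candidates:
            break
        worst = max(candidates, key=lambda lang: tokens_per_char(corpora[lang]))
        pair = pair_counts(corpora[worst]).most_common(1)[0][0]
        merges.append((worst, pair))
        # The learned merge is applied to every language, as in ordinary BPE.
        corpora = {lang: [apply_merge(w, pair) for w in words]
                   for lang, words in corpora.items()}
    return merges, corpora


# Hypothetical corpus: "en" dominates by frequency, "xx" is low-resource.
merges, tokenized = parity_aware_merges({"en": ["the"] * 4, "xx": ["zzq"]}, 2)
print(merges)
```

In this toy run the second merge is taken from "xx" even though a pair in "en" occurs four times as often, which is the behaviour that separates the parity-aware rule from purely frequency-based BPE.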
Details
Motivation: Standard tokenization methods favor high-resource languages, creating disparities for lower-resource languages in terms of tokenization quality and contributing to computational and financial inequalities.
Method: Parity-aware BPE algorithm, which focuses on maximizing compression gain for the worst-compressed language at each merge step.
Result: Parity-aware BPE achieves more equitable token counts across languages while maintaining high global compression efficiency and downstream language-model performance.
Conclusion: Parity-aware BPE is an effective solution for achieving more equitable tokenization across languages with minimal impact on global compression rate and language-model performance.
Abstract: Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently leave lower-resource languages with tokenizations that are disproportionately longer, morphologically implausible, or even riddled with
[3] Pitch Accent Detection improves Pretrained Automatic Speech Recognition
David Sasu,Natalie Schluter
Main category: cs.CL
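TL;DR: This paper proposes a joint automatic speech recognition (ASR) and pitch accent detection model; adding a complementary pitch accent detection module improves ASR systems built on semi-supervised speech representations.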
Details
Motivation: To improve the performance of automatic speech recognition systems under limited-resource conditions and to close the performance gap on the pitch accent detection task, the authors propose a joint modeling approach.
Method: The authors introduce a joint ASR and pitch accent detection model intended both to improve ASR performance and to reach state-of-the-art results on pitch accent detection.
Result: The pitch accent detection component closes the gap in F1-score by 41%, and joint training reduces ASR word error rate (WER) by 28.3% on LibriSpeech under limited-resource fine-tuning.
Conclusion: The study shows that extending pretrained speech models to retain or re-learn important prosodic cues such as pitch accent matters for improving ASR performance.
Abstract: We show the performance of Automatic Speech Recognition (ASR) systems that use semi-supervised speech representations can be boosted by a complementary pitch accent detection module, by introducing a joint ASR and pitch accent detection model. The pitch accent detection component of our model achieves a significant improvement on the state-of-the-art for the task, closing the gap in F1-score by 41%. Additionally, the ASR performance in joint training decreases WER by 28.3% on LibriSpeech, under limited resource fine-tuning. With these results, we show the importance of extending pretrained speech models to retain or re-learn important prosodic cues such as pitch accent.
[4] Persistent Instability in LLM's Personality Measurements: Effects of Scale, Reasoning, and Conversation History
Tommaso Tosato,Saskia Helbling,Yorguin-Jose Mantilla-Ramos,Mahmood Hegazy,Alberto Tosato,David John Lemay,Irina Rish,Guillaume Dumas
Main category: cs.CL
TL;DR: The study reveals significant behavioral inconsistencies in large language models, challenging assumptions about their reliability and suggesting that current personality-based alignment strategies may be insufficient for ensuring safe deployment.
Details
Motivation: Large language models require consistent behavioral patterns for safe deployment, yet their personality-like traits remain poorly understood.
Method: The study uses traditional (BFI-44, SD3) and novel LLM-adapted personality instruments to systematically vary question order, paraphrasing, personas, and reasoning modes across 25+ open-source models with 500,000+ responses.
Result: Findings show substantial response variability even in large models, shifts in personality measurements due to minor prompt changes, increased variability from interventions meant to stabilize behavior, and equal instability in LLM-adapted instruments compared to human-centric versions.
Conclusion: Personality-based alignment strategies may be fundamentally inadequate for ensuring behavioral consistency in large language models, suggesting current LLMs lack the foundations for genuine behavioral consistency.
Abstract: Large language models require consistent behavioral patterns for safe deployment, yet their personality-like traits remain poorly understood. We present PERSIST (PERsonality Stability in Synthetic Text), a comprehensive evaluation framework testing 25+ open-source models (1B-671B parameters) across 500,000+ responses. Using traditional (BFI-44, SD3) and novel LLM-adapted personality instruments, we systematically vary question order, paraphrasing, personas, and reasoning modes. Our findings challenge fundamental deployment assumptions: (1) Even 400B+ models exhibit substantial response variability (SD > 0.4); (2) Minor prompt reordering alone shifts personality measurements by up to 20%; (3) Interventions expected to stabilize behavior, such as chain-of-thought reasoning, detailed personas instruction, inclusion of conversation history, can paradoxically increase variability; (4) LLM-adapted instruments show equal instability to human-centric versions, confirming architectural rather than translational limitations.
This persistent instability across scales and mitigation strategies suggests current LLMs lack the foundations for genuine behavioral consistency. For safety-critical applications requiring predictable behavior, these findings indicate that personality-based alignment strategies may be fundamentally inadequate.
[5] RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory
Jun Liu,Zhenglun Kong,Changdi Yang,Fan Yang,Tianqi Li,Peiyan Dong,Joannah Nanjekye,Hao Tang,Geng Yuan,Wei Niu,Wenbin Zhang,Pu Zhao,Xue Lin,Dong Huang,Yanzhi Wang
Main category: cs.CL
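TL;DR: This paper proposes RCR-Router, an efficient context routing framework for multi-agent LLM systems that dynamically selects relevant memory to reduce token consumption while preserving answer quality.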
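The idea of routing only role-relevant memory under a strict token budget can be sketched as a greedy selection. This is an illustrative stand-in, not the RCR-Router scoring policy: the keyword-overlap score, the whitespace token count, and the name `route_context` are all assumptions made for the example.

```python
def route_context(memory, role_keywords, token_budget):
    """Greedily pick the highest-scoring memory items that fit the budget.

    memory: list of dicts with a "text" field (hypothetical memory format).
    role_keywords: set of lowercase words describing the agent's role/stage.
    token_budget: maximum total tokens, counted here by whitespace splitting.
    """
    def score(item):
        # Assumed lightweight score: keyword overlap with the agent's role.
        return len(set(item["text"].lower().split()) & role_keywords)

    selected, used = [], 0
    for item in sorted(memory, key=score, reverse=True):
        cost = len(item["text"].split())
        # Skip irrelevant items and anything that would exceed the budget.
        if score(item) > 0 and used + cost <= token_budget:
            selected.append(item)
            used += cost
    return selected


shared_memory = [
    {"text": "plan search for the director birthplace"},
    {"text": "retrieved passage about the film director"},
    {"text": "unrelated chit chat"},
]
picked = route_context(shared_memory, {"director", "birthplace"}, token_budget=8)
print([item["text"] for item in picked])
```

With a budget of 8 only the best-matching item fits; raising the budget to 12 admits the second relevant item, while the unrelated item is never routed.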
Details
Motivation: Existing coordination schemes rely on static or full-context routing strategies, which lead to excessive token consumption, redundant memory exposure, and limited adaptability, so a more efficient and adaptive approach to collaboration is needed.
Method: RCR-Router is a modular, role-aware context routing framework that dynamically selects the memory subset relevant to each agent's role and task stage, with a lightweight scoring policy guiding memory selection.
Result: On three multi-hop QA benchmarks, RCR-Router reduces token usage by up to 30% while improving or maintaining answer quality.
Conclusion: RCR-Router effectively reduces token usage while improving or maintaining answer quality, highlighting the importance of structured memory routing and output-aware evaluation in multi-agent LLM systems.
Abstract: Multi-agent large language model (LLM) systems have shown strong potential in complex reasoning and collaborative decision-making tasks. However, most existing coordination schemes rely on static or full-context routing strategies, which lead to excessive token consumption, redundant memory exposure, and limited adaptability across interaction rounds. We introduce RCR-Router, a modular and role-aware context routing framework designed to enable efficient, adaptive collaboration in multi-agent LLMs. To our knowledge, this is the first routing approach that dynamically selects semantically relevant memory subsets for each agent based on its role and task stage, while adhering to a strict token budget. A lightweight scoring policy guides memory selection, and agent outputs are iteratively integrated into a shared memory store to facilitate progressive context refinement. To better evaluate model behavior, we further propose an Answer Quality Score metric that captures LLM-generated explanations beyond standard QA accuracy. Experiments on three multi-hop QA benchmarks -- HotPotQA, MuSiQue, and 2WikiMultihop -- demonstrate that RCR-Router reduces token usage (up to 30%) while improving or maintaining answer quality. These results highlight the importance of structured memory routing and output-aware evaluation in advancing scalable multi-agent LLM systems.
[6] I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations
Julia Kharchenko,Tanya Roosta,Aman Chadha,Chirag Shah
Main category: cs.CL
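TL;DR: This paper introduces a new benchmark for evaluating how large language models respond to linguistic shibboleths, revealing systematic bias against certain linguistic patterns.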
Details
Motivation: To reveal systematic biases that large language models may exhibit toward specific linguistic patterns, particularly when responses are equal in content quality but use different linguistic markers.
Method: Interview simulations built from 100 validated question-response pairs generate controlled linguistic variations that preserve semantic equivalence, enabling precise measurement of bias in automated evaluation systems.
Result: Responses using hedging language receive 25.6% lower ratings on average, and the benchmark effectively identifies model-specific biases.
Conclusion: The paper presents a comprehensive benchmark for detecting bias in how LLMs handle linguistic shibboleths, providing a foundational framework for detecting and measuring linguistic discrimination in AI systems.
Abstract: This paper introduces a comprehensive benchmark for evaluating how Large Language Models (LLMs) respond to linguistic shibboleths: subtle linguistic markers that can inadvertently reveal demographic attributes such as gender, social class, or regional background. Through carefully constructed interview simulations using 100 validated question-response pairs, we demonstrate how LLMs systematically penalize certain linguistic patterns, particularly hedging language, despite equivalent content quality. Our benchmark generates controlled linguistic variations that isolate specific phenomena while maintaining semantic equivalence, which enables the precise measurement of demographic bias in automated evaluation systems. We validate our approach along multiple linguistic dimensions, showing that hedged responses receive 25.6% lower ratings on average, and demonstrate the benchmark's effectiveness in identifying model-specific biases. This work establishes a foundational framework for detecting and measuring linguistic discrimination in AI systems, with broad applications to fairness in automated decision-making contexts.
[7] Towards Robust Evaluation of Visual Activity Recognition: Resolving Verb Ambiguity with Sense Clustering
Louie Hong Yao,Nicholas Jarvis,Tianyu Jiang
Main category: cs.CL
TL;DR: A new vision-language clustering framework improves the evaluation of visual activity recognition systems by addressing verb ambiguities and aligning better with human judgments.
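The difference between exact-match scoring and cluster-based scoring can be shown with a small sketch. The scoring rule here is an assumption made for illustration (a prediction counts as correct when its sense cluster is one of the clusters attached to the image), and the cluster names, image ids, and function names are hypothetical; the verb pairs are the examples given in this entry's abstract.

```python
def exact_match_accuracy(predictions, gold_verbs):
    """Fraction of images whose predicted verb equals the single gold verb."""
    hits = sum(predictions[image] == gold_verbs[image] for image in predictions)
    return hits / len(predictions)


def cluster_match_accuracy(predictions, image_clusters, verb_to_cluster):
    """Fraction of images whose predicted verb falls in an accepted cluster."""
    hits = 0
    for image, verb in predictions.items():
        cluster = verb_to_cluster.get(verb)
        # Verbs outside every cluster can never match.
        if cluster is not None and cluster in image_clusters[image]:
            hits += 1
    return hits / len(predictions)


# Hypothetical sense clusters and annotations.
verb_to_cluster = {
    "brushing": "groom", "grooming": "groom",
    "piloting": "fly", "operating": "operate",
    "eating": "eat",
}
image_clusters = {"img1": {"groom"}, "img2": {"fly", "operate"}, "img3": {"groom"}}
gold_verbs = {"img1": "brushing", "img2": "piloting", "img3": "grooming"}
predictions = {"img1": "grooming", "img2": "operating", "img3": "eating"}

print(exact_match_accuracy(predictions, gold_verbs))
print(cluster_match_accuracy(predictions, image_clusters, verb_to_cluster))
```

Exact match rejects all three predictions, while cluster matching accepts the synonym (grooming for brushing) and the alternative perspective (operating for piloting) and still rejects the genuinely wrong verb.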
Details
Motivation: Standard exact-match evaluation methods are insufficient for assessing visual activity recognition systems due to ambiguities in verb semantics and image interpretation. Synonymous verbs and varying perspectives can lead to equally valid but distinct verb choices, necessitating a more robust evaluation approach.
Method: A vision-language clustering framework was developed to construct verb sense clusters, allowing for a more comprehensive evaluation of activity recognition models. The imSitu dataset was analyzed to assess the average number of sense clusters per image, and multiple models were evaluated using the proposed cluster-based method.
Result: The analysis of the imSitu dataset revealed that each image maps to an average of 2.8 sense clusters, representing distinct perspectives. The proposed cluster-based evaluation method demonstrated better alignment with human judgments compared to traditional evaluation techniques.
Conclusion: The proposed vision-language clustering framework provides a more robust and nuanced evaluation of visual activity recognition systems, better aligning with human judgments compared to standard evaluation methods.
Abstract: Evaluating visual activity recognition systems is challenging due to inherent ambiguities in verb semantics and image interpretation. When describing actions in images, synonymous verbs can refer to the same event (e.g., brushing vs. grooming), while different perspectives can lead to equally valid but distinct verb choices (e.g., piloting vs. operating). Standard exact-match evaluation, which relies on a single gold answer, fails to capture these ambiguities, resulting in an incomplete assessment of model performance. To address this, we propose a vision-language clustering framework that constructs verb sense clusters, providing a more robust evaluation.
Our analysis of the imSitu dataset shows that each image maps to an average of 2.8 sense clusters, with each cluster representing a distinct perspective of the image. We evaluate multiple activity recognition models and compare our cluster-based evaluation with standard evaluation methods. Additionally, our human alignment analysis suggests that the cluster-based evaluation better aligns with human judgements, offering a more nuanced assessment of model performance.
[8] A Multi-Stage Large Language Model Framework for Extracting Suicide-Related Social Determinants of Health
Song Wang,Yishu Wei,Haotian Ma,Max Lovitt,Kelly Deng,Yuan Meng,Zihan Xu,Jingze Zhang,Yunyu Xiao,Ying Ding,Xuhai Xu,Joydeep Ghosh,Yifan Peng
Main category: cs.CL
TL;DR: This paper proposes a multi-stage large language model framework that improves the accuracy and explainability of extracting suicide-related social determinants of health (SDoH) factors from unstructured text, supporting early risk identification and prevention strategies.
Details
Motivation: Understanding SDoH factors contributing to suicide is crucial for early intervention, but challenges like long-tailed distributions, identifying key stressors, and model explainability limit data-driven approaches.
Method: A multi-stage large language model framework was proposed to enhance SDoH factor extraction and explainability, compared with models like BioBERT, GPT-3.5-turbo, and DeepSeek-R1. Fine-tuning a smaller, task-specific model was also evaluated.
Result: The proposed framework improved performance in extracting SDoH factors and retrieving relevant context. Fine-tuned smaller models reduced inference costs while maintaining performance. The multi-stage design enhanced explainability.
Conclusion: The approach improves both the accuracy and transparency of extracting suicide-related SDoH from unstructured texts, potentially supporting early identification of at-risk individuals and informing prevention strategies.
Abstract: Background: Understanding social determinants of health (SDoH) factors contributing to suicide incidents is crucial for early intervention and prevention. However, data-driven approaches to this goal face challenges such as long-tailed factor distributions, analyzing pivotal stressors preceding suicide incidents, and limited model explainability. Methods: We present a multi-stage large language model framework to enhance SDoH factor extraction from unstructured text. Our approach was compared to other state-of-the-art language models (i.e., pre-trained BioBERT and GPT-3.5-turbo) and reasoning models (i.e., DeepSeek-R1). We also evaluated how the model's explanations help people annotate SDoH factors more quickly and accurately. The analysis included both automated comparisons and a pilot user study. Results: We show that our proposed framework demonstrated performance boosts in the overarching task of extracting SDoH factors and in the finer-grained tasks of retrieving relevant context.
Additionally, we show that fine-tuning a smaller, task-specific model achieves comparable or better performance with reduced inference costs. The multi-stage design not only enhances extraction but also provides intermediate explanations, improving model explainability. Conclusions: Our approach improves both the accuracy and transparency of extracting suicide-related SDoH from unstructured texts. These advancements have the potential to support early identification of individuals at risk and inform more effective prevention strategies.
[9] Dialogues Aspect-based Sentiment Quadruple Extraction via Structural Entropy Minimization Partitioning
Kun Peng,Cong Cao,Hao Peng,Zhifeng Hao,Lei Jiang,Kongjing Gu,Yanbing Liu,Philip S. Yu
Main category: cs.CL
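TL;DR: This paper proposes a dialogue partitioning method based on structural entropy minimization, combined with a two-step framework for quadruple extraction, improving both the effectiveness and the efficiency of dialogue-level sentiment quadruple extraction.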
Details
Motivation: Existing methods typically learn word relations across the entire dialogue, assuming sentiment elements are uniformly distributed. In practice, dialogues often contain multiple semantically independent sub-dialogues, so learning across the whole dialogue introduces noise that hurts extraction.
Method: A structural entropy minimization algorithm partitions each dialogue into semantically independent sub-dialogues, followed by a two-step framework: individual sentiment elements are first extracted at the utterance level, then quadruples are matched at the sub-dialogue level.
Result: The approach achieves state-of-the-art performance on the DiaASQ task with significantly lower computational cost.
Conclusion: The paper proposes a new dialogue partitioning method based on structural entropy minimization which, combined with a two-step quadruple extraction framework, reaches SOTA performance on dialogue-level sentiment quadruple extraction at lower computational cost.
Abstract: Dialogues Aspect-based Sentiment Quadruple Extraction (DiaASQ) aims to extract all target-aspect-opinion-sentiment quadruples from a given multi-round, multi-participant dialogue. Existing methods typically learn word relations across entire dialogues, assuming a uniform distribution of sentiment elements. However, we find that dialogues often contain multiple semantically independent sub-dialogues without clear dependencies between them. Therefore, learning word relationships across the entire dialogue inevitably introduces additional noise into the extraction process. To address this, our method focuses on partitioning dialogues into semantically independent sub-dialogues. Achieving completeness while minimizing these sub-dialogues presents a significant challenge. Simply partitioning based on reply relationships is ineffective. Instead, we propose utilizing a structural entropy minimization algorithm to partition the dialogues. This approach aims to preserve relevant utterances while distinguishing irrelevant ones as much as possible. Furthermore, we introduce a two-step framework for quadruple extraction: first extracting individual sentiment elements at the utterance level, then matching quadruples at the sub-dialogue level. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in DiaASQ with much lower computational costs.
[10] Evaluation of LLMs in AMR Parsing
Shu Han Ho
Main category: cs.CL
TL;DR: This paper shows that simple finetuning of decoder-only LLMs like LLaMA 3.2 can match the performance of advanced AMR parsers, offering a simpler alternative for semantic graph parsing.
Details
Motivation: To explore whether straightforward finetuning of decoder-only LLMs can serve as a promising and simpler alternative for AMR parsing compared to complex state-of-the-art methods.
Method: The study evaluates four LLM architectures (Phi 3.5, Gemma 2, LLaMA 3.2, and DeepSeek R1 LLaMA Distilled) through straightforward finetuning on the LDC2020T02 Gold AMR3.0 test set.
Result: LLaMA 3.2 achieved an SMATCH F1 score of 0.804, matching APT + Silver and approaching Graphene Smatch. Phi 3.5 excelled in structural validity while LLaMA 3.2 led in semantic performance.
Conclusion: Finetuning decoder-only LLMs can achieve competitive performance compared to complex SOTA AMR parsers, with LLaMA 3.2 performing well in semantic understanding and Phi 3.5 excelling in structural validity.
Abstract: Abstract Meaning Representation (AMR) is a semantic formalism that encodes sentence meaning as rooted, directed, acyclic graphs, where nodes represent concepts and edges denote semantic relations. Finetuning decoder-only Large Language Models (LLMs) represents a promising, novel, and straightforward direction for AMR parsing. This paper presents a comprehensive evaluation of finetuning four distinct LLM architectures, Phi 3.5, Gemma 2, LLaMA 3.2, and DeepSeek R1 LLaMA Distilled, using the LDC2020T02 Gold AMR3.0 test set. Our results show that straightforward finetuning of decoder-only LLMs can achieve comparable performance to complex State of the Art (SOTA) AMR parsers. Notably, LLaMA 3.2 demonstrates competitive performance against SOTA AMR parsers given a straightforward finetuning approach. We achieved SMATCH F1: 0.804 on the full LDC2020T02 test split, on par with APT + Silver (IBM) at 0.804 and approaching Graphene Smatch (MBSE) at 0.854. Across our analysis, we also observed a consistent pattern where LLaMA 3.2 leads in semantic performance while Phi 3.5 excels in structural validity.
[11] Align, Don't Divide: Revisiting the LoRA Architecture in Multi-Task Learning
Jinda Liu,Bo Cheng,Yi Chang,Yuan Wu
Main category: cs.CL
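TL;DR: This paper challenges the prevailing multi-task learning paradigm of complex multi-adapter or multi-head LoRA structures, shows that a simpler single-adapter LoRA is competitive, and proposes Align-LoRA, which explicitly aligns task representations to achieve better multi-task performance.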
Details
Motivation: Existing multi-task learning methods favor complex multi-adapter or multi-head structures, but this paper finds that the paradigm may not be optimal, which motivates a simpler single-adapter LoRA approach and, building on it, Align-LoRA.
Method: Align-LoRA, a new multi-task learning method that adds an explicit loss to align task representations within the shared adapter space.
Result: Experiments show that both a simplified multi-head architecture and a single-adapter LoRA with increased rank perform strongly, while Align-LoRA significantly surpasses all baselines on multi-task learning.
Conclusion: Align-LoRA significantly outperforms all baselines, offering a simpler yet more effective parameter-efficient fine-tuning paradigm for multi-task learning.
Abstract: Parameter-Efficient Fine-Tuning (PEFT) is essential for adapting Large Language Models (LLMs). In practice, LLMs are often required to handle a diverse set of tasks from multiple domains, a scenario naturally addressed by multi-task learning (MTL). Within this MTL context, a prevailing trend involves LoRA variants with multiple adapters or heads, which advocate for structural diversity to capture task-specific knowledge. Our findings present a direct challenge to this paradigm. We first show that a simplified multi-head architecture with high inter-head similarity substantially outperforms complex multi-adapter and multi-head systems. This leads us to question the multi-component paradigm itself, and we further demonstrate that a standard single-adapter LoRA, with a sufficiently increased rank, also achieves highly competitive performance. These results lead us to a new hypothesis: effective MTL generalization hinges on learning robust shared representations, not isolating task-specific features. To validate this, we propose Align-LoRA, which incorporates an explicit loss to align task representations within the shared adapter space. Experiments confirm that Align-LoRA significantly surpasses all baselines, establishing a simpler yet more effective paradigm for adapting LLMs to multiple tasks. The code is available at https://github.com/jinda-liu/Align-LoRA.
[12] Multimodal Fact Checking with Unified Visual, Textual, and Contextual Representations
Aditya Kishore,Gaurav Kumar,Jasabanta Patro
Main category: cs.CL
TL;DR: This paper proposes MultiCheck, a multimodal framework for fact verification that effectively combines textual and visual evidence to improve misinformation detection.
Details
Motivation: The motivation was the growing rate of multimodal misinformation, which poses challenges to traditional fact-checking systems that rely mainly on textual evidence.
Method: The authors proposed a unified framework for multimodal fact verification called MultiCheck, which combines text and image encoders with a fusion module to capture cross-modal relationships. They used a classification head with a contrastive learning objective to predict claim veracity.
Result: On the Factify 2 dataset, the MultiCheck framework achieved a weighted F1 score of 0.84, significantly outperforming the baseline method.
Conclusion: The study concludes that the proposed MultiCheck framework is effective for multimodal fact verification, showing potential for scalable and interpretable fact-checking in real-world scenarios.
Abstract: The growing rate of multimodal misinformation, where claims are supported by both text and images, poses significant challenges to fact-checking systems that rely primarily on textual evidence. In this work, we have proposed a unified framework for fine-grained multimodal fact verification called "MultiCheck", designed to reason over structured textual and visual signals. Our architecture combines dedicated encoders for text and images with a fusion module that captures cross-modal relationships using element-wise interactions. A classification head then predicts the veracity of a claim, supported by a contrastive learning objective that encourages semantic alignment between claim-evidence pairs in a shared latent space. We evaluate our approach on the Factify 2 dataset, achieving a weighted F1 score of 0.84, substantially outperforming the baseline. These results highlight the effectiveness of explicit multimodal reasoning and demonstrate the potential of our approach for scalable and interpretable fact-checking in complex, real-world scenarios.
[13] BEE-RAG: Balanced Entropy Engineering for Retrieval-Augmented Generation
Yuhao Wang,Ruiyang Ren,Yucheng Wang,Jing Liu,Wayne Xin Zhao,Hua Wu,Haifeng Wang
Main category: cs.CL
TL;DR: This paper proposes BEE-RAG, a framework that improves retrieval-augmented generation by balancing context entropy and ensuring stable performance with varying context lengths.
Details
Motivation: RAG systems face performance challenges due to entropy growth and attention dilution from long retrieval contexts, necessitating a more adaptable framework.
Method: BEE-RAG applies entropy invariance to balance context entropy and reformulate attention dynamics, incorporating a zero-shot inference strategy and adaptive fine-tuning.
Result: Experiments show that BEE-RAG effectively enhances RAG performance across multiple tasks by maintaining stable entropy levels and optimizing attention dynamics.
Conclusion: BEE-RAG improves RAG system adaptability by managing context entropy, ensuring stable performance across varying context lengths.
Abstract: With the rapid advancement of large language models (LLMs), retrieval-augmented generation (RAG) has emerged as a critical approach to supplement the inherent knowledge limitations of LLMs. However, due to the typically large volume of retrieved information, RAG tends to operate with long context lengths. From the perspective of entropy engineering, we identify unconstrained entropy growth and attention dilution due to long retrieval context as significant factors affecting RAG performance. In this paper, we propose the balanced entropy-engineered RAG (BEE-RAG) framework, which improves the adaptability of RAG systems to varying context lengths through the principle of entropy invariance. By leveraging balanced context entropy to reformulate attention dynamics, BEE-RAG separates attention sensitivity from context length, ensuring a stable entropy level. Building upon this, we introduce a zero-shot inference strategy for multi-importance estimation and a parameter-efficient adaptive fine-tuning mechanism to obtain the optimal balancing factor for different settings. Extensive experiments across multiple RAG tasks demonstrate the effectiveness of BEE-RAG.
[14] Attention Basin: Why Contextual Position Matters in Large Language Models
Zihao Yi,Delong Zeng,Zhenqing Ling,Haohao Luo,Zhe Xu,Wei Liu,Jian Luan,Wanxia Cao,Ying Shen
Main category: cs.CL
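TL;DR: This paper examines the sensitivity of large language models (LLMs) to the position of information in the input and proposes Attention-Driven Reranking (AttnRank), a two-stage framework that improves model performance.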
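The reordering stage of such a framework can be sketched as matching the most relevant items to the positions that receive the most attention. This is a minimal illustration, not the authors' code: the per-position attention profile is assumed to have been estimated beforehand on a calibration set, and the function name `attn_rank`, the relevance scores, and the profile values are hypothetical.

```python
def attn_rank(docs, relevance, position_attention):
    """Place the most relevant documents at the highest-attention positions.

    docs: list of documents (any objects).
    relevance: one relevance score per document.
    position_attention: estimated attention mass per slot in the prompt.
    """
    if not (len(docs) == len(relevance) == len(position_attention)):
        raise ValueError("all three inputs must have the same length")
    # Document indices from most to least relevant.
    docs_by_relevance = sorted(range(len(docs)),
                               key=lambda i: relevance[i], reverse=True)
    # Slot indices from most to least attended.
    slots_by_attention = sorted(range(len(docs)),
                                key=lambda p: position_attention[p], reverse=True)
    ordered = [None] * len(docs)
    for doc_index, slot in zip(docs_by_relevance, slots_by_attention):
        ordered[slot] = docs[doc_index]
    return ordered


# Hypothetical basin-shaped profile: high at both ends, low in the middle.
profile = [0.30, 0.15, 0.05, 0.10, 0.40]
docs = ["a", "b", "c", "d", "e"]
relevance = [0.1, 0.9, 0.3, 0.7, 0.5]
print(attn_rank(docs, relevance, profile))  # → ['d', 'e', 'a', 'c', 'b']
```

The most relevant document "b" lands in the last slot and the runner-up "d" in the first, while the least relevant document "a" is pushed into the low-attention middle.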
Details
Motivation: The authors investigate the mechanism behind positional bias and find that, when processing a sequence of structured items, models attend more to the beginning and end of the sequence while neglecting the middle.
Method: AttnRank estimates a model's positional attention preferences and reorders retrieved documents or few-shot examples accordingly to improve performance.
Result: Experiments show that AttnRank achieves substantial improvements across 10 large language models of varying architectures and scales.
Conclusion: AttnRank is a model-agnostic, training-free, plug-and-play method that effectively improves the performance of large language models.
Abstract: The performance of Large Language Models (LLMs) is significantly sensitive to the contextual position of information in the input. To investigate the mechanism behind this positional bias, our extensive experiments reveal a consistent phenomenon we term the attention basin: when presented with a sequence of structured items (e.g., retrieved documents or few-shot examples), models systematically assign higher attention to the items at the beginning and end of the sequence, while neglecting those in the middle. Crucially, our analysis further reveals that allocating higher attention to critical information is key to enhancing model performance. Based on these insights, we introduce Attention-Driven Reranking (AttnRank), a two-stage framework that (i) estimates a model's intrinsic positional attention preferences using a small calibration set, and (ii) reorders retrieved documents or few-shot examples to align the most salient content with these high-attention positions. AttnRank is a model-agnostic, training-free, and plug-and-play method with minimal computational overhead. Experiments on multi-hop QA and few-shot in-context learning tasks demonstrate that AttnRank achieves substantial improvements across 10 large language models of varying architectures and scales, without modifying model parameters or training procedures.
[15] Towards Assessing Medical Ethics from Knowledge to Practice
Chang Hong,Minghao Wu,Qingying Xiao,Yuchi Wang,Xiang Wan,Guangjun Yu,Benyou Wang,Yan Hu
Main category: cs.CL
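TL;DR: PrinciplismQA is a new benchmark for evaluating the medical ethics reasoning of large language models; it finds that models struggle to apply the principle of Beneficence and offers a framework for diagnosing such weaknesses to improve ethical alignment.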
Details
Motivation: Integrating large language models into healthcare requires rigorous evaluation of their ethical reasoning, an area current benchmarks often overlook.
Method: The paper introduces PrinciplismQA, a comprehensive benchmark of 3,648 questions grounded in Principlism, comprising multiple-choice questions curated from authoritative textbooks and open-ended questions sourced from authoritative medical ethics case study literature, all validated by medical experts.
Result: Experiments reveal a significant gap between models' ethical knowledge and their practical application, especially in dynamically applying ethical principles to real-world scenarios. Most LLMs struggle with dilemmas concerning Beneficence, often over-emphasizing other principles.
Conclusion: PrinciplismQA offers a scalable framework for diagnosing specific ethical weaknesses, paving the way for more balanced and responsible medical AI.
Abstract: The integration of large language models into healthcare necessitates a rigorous evaluation of their ethical reasoning, an area current benchmarks often overlook. We introduce PrinciplismQA, a comprehensive benchmark with 3,648 questions designed to systematically assess LLMs' alignment with core medical ethics. Grounded in Principlism, our benchmark features a high-quality dataset. This includes multiple-choice questions curated from authoritative textbooks and open-ended questions sourced from authoritative medical ethics case study literature, all validated by medical experts. Our experiments reveal a significant gap between models' ethical knowledge and their practical application, especially in dynamically applying ethical principles to real-world scenarios. Most LLMs struggle with dilemmas concerning Beneficence, often over-emphasizing other principles. Frontier closed-source models, driven by strong general capabilities, currently lead the benchmark. Notably, medical domain fine-tuning can enhance models' overall ethical competence, but further progress requires better alignment with medical ethical knowledge. PrinciplismQA offers a scalable framework to diagnose these specific ethical weaknesses, paving the way for more balanced and responsible medical AI.
[16] ATLANTIS at SemEval-2025 Task 3: Detecting Hallucinated Text Spans in Question Answering
Catherine Kobus,François Lancelot,Marion-Cécile Martin,Nawal Ould Amer
Main category: cs.CL
TL;DR: The ATLANTIS team presents methods for detecting hallucinated text spans in question answering systems for SemEval-2025 Task 3 and achieves strong results.
Details
Motivation: Large language models (LLMs) have significantly advanced natural language generation but remain susceptible to hallucinations, generating incorrect or misleading content.
Method: The team explored approaches with and without external context, using few-shot prompting, token-level classification, and an LLM fine-tuned on synthetic data.
Result: The approaches achieved top rankings in Spanish and competitive placements in English and German.
Conclusion: The work highlights the importance of integrating relevant context to mitigate hallucinations and demonstrates the potential of fine-tuned models and prompt engineering.
Abstract: This paper presents the contributions of the ATLANTIS team to SemEval-2025 Task 3, focusing on detecting hallucinated text spans in question answering systems. Large Language Models (LLMs) have significantly advanced Natural Language Generation (NLG) but remain susceptible to hallucinations, generating incorrect or misleading content. To address this, we explored methods both with and without external context, utilizing few-shot prompting with an LLM, token-level classification or LLM fine-tuned on synthetic data. Notably, our approaches achieved top rankings in Spanish and competitive placements in English and German. This work highlights the importance of integrating relevant context to mitigate hallucinations and demonstrates the potential of fine-tuned models and prompt engineering.
[17] Resource-Limited Joint Multimodal Sentiment Reasoning and Classification via Chain-of-Thought Enhancement and Distillation
Haonan Shangguan,Xiaocui Yang,Shi Feng,Daling Wang,Yifei Zhang,Ge Yu
Main category: cs.CL
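TL;DR: This paper proposes MulCoT-RD, a multimodal reasoning distillation model for efficient multimodal sentiment reasoning and classification in resource-constrained environments.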
Details
Motivation: Current multimodal sentiment analysis methods rely on parameter-heavy models and lack autonomous multimodal sentiment reasoning capability in resource-constrained environments. Method: A "Teacher-Assistant-Student" distillation framework is adopted: a high-performance multimodal large language model generates the initial reasoning dataset, a medium-sized assistant model is trained on it, and a lightweight student model is then trained with multi-task learning. Result: Experiments on four datasets show that MulCoT-RD, with only 3B parameters, performs strongly on the JMSRC task while exhibiting good generalization and interpretability. Conclusion: MulCoT-RD achieves efficient multimodal sentiment reasoning and classification in resource-constrained environments, offering a new approach to deploying lightweight models. Abstract: The surge in rich multimodal content on social media platforms has greatly advanced Multimodal Sentiment Analysis (MSA), with Large Language Models (LLMs) further accelerating progress in this field. Current approaches primarily leverage the knowledge and reasoning capabilities of parameter-heavy (Multimodal) LLMs for sentiment classification, overlooking autonomous multimodal sentiment reasoning generation in resource-constrained environments. Therefore, we focus on the Resource-Limited Joint Multimodal Sentiment Reasoning and Classification task, JMSRC, which simultaneously performs multimodal sentiment reasoning chain generation and sentiment classification only with a lightweight model. We propose a Multimodal Chain-of-Thought Reasoning Distillation model, MulCoT-RD, designed for JMSRC that employs a "Teacher-Assistant-Student" distillation paradigm to address deployment constraints in resource-limited environments. We first leverage a high-performance Multimodal Large Language Model (MLLM) to generate the initial reasoning dataset and train a medium-sized assistant model with a multi-task learning mechanism. A lightweight student model is jointly trained to perform efficient multimodal sentiment reasoning generation and classification. Extensive experiments on four datasets demonstrate that MulCoT-RD with only 3B parameters achieves strong performance on JMSRC, while exhibiting robust generalization and enhanced interpretability.[18] Pruning Large Language Models by Identifying and Preserving Functional Networks
Yiheng Liu,Junhao Ning,Sichen Xia,Xiaohui Gao,Ning Qiang,Bao Ge,Junwei Han,Xintao Hu
Main category: cs.CL
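TL;DR: This paper proposes a new structured pruning method for large language models that improves pruning performance by identifying and preserving functional networks and key neurons within the model.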
Details
Motivation: Current structured pruning methods typically overlook the interaction and collaboration among artificial neurons, which disrupts the macro functional architecture of large language models and degrades pruning performance. Method: The LLM is treated as a digital brain and decomposed into functional networks, analogous to identifying functional brain networks in neuroimaging data; pruning then preserves the key neurons within these functional networks. Result: A method based on identifying and preserving functional networks is proposed that improves structured pruning of large language models. Conclusion: Experimental results show that the method can effectively identify and locate functional networks and key neurons in large language models, enabling efficient model pruning. Abstract: Structured pruning is one of the representative techniques for compressing large language models (LLMs) to reduce GPU memory consumption and accelerate inference speed. It offers significant practical value in improving the efficiency of LLMs in real-world applications. Current structured pruning methods typically rely on assessing the importance of structural units and pruning the less important ones. Most of them overlook the interaction and collaboration among artificial neurons that are crucial for the functionalities of LLMs, leading to a disruption in the macro functional architecture of LLMs and consequently a pruning performance degradation. Inspired by the inherent similarities between artificial neural networks and functional neural networks in the human brain, we alleviate this challenge and propose to prune LLMs by identifying and preserving functional networks within LLMs in this study. To achieve this, we treat an LLM as a digital brain and decompose the LLM into functional networks, analogous to identifying functional brain networks in neuroimaging data. Afterwards, an LLM is pruned by preserving the key neurons within these functional networks. Experimental results demonstrate that the proposed method can successfully identify and locate functional networks and key neurons in LLMs, enabling efficient model pruning. Our code is available at https://github.com/WhatAboutMyStar/LLM_ACTIVATION.[19] CodeBoost: Boosting Code LLMs by Squeezing Knowledge from Code Snippets with RL
Sijie Wang,Quanjiang Guo,Kai Zhao,Yawei Zhang,Xin Li,Xiang Li,Siqi Li,Rui She,Shangshu Yu,Wee Peng Tay
Main category: cs.CL
TL;DR: CodeBoost enhances code large language models using only code snippets through five innovative training techniques, eliminating reliance on labor-intensive, human-annotated instructions.
Details
Motivation: The scarcity and high cost of acquiring high-quality human-annotated coding instructions create a bottleneck in the post-training of code LLMs, while code snippets are widely available. Method: CodeBoost introduces five key components: maximum-clique curation, bi-directional prediction, error-aware prediction, heterogeneous augmentation, and heterogeneous rewarding to improve the training of code LLMs. Result: Extensive experiments show that CodeBoost consistently improves the performance of code LLMs across multiple benchmarks, proving its effectiveness as a scalable training solution. Conclusion: CodeBoost offers a scalable and effective post-training framework for enhancing code large language models using only code snippets, without the need for human-annotated instructions. Abstract: Code large language models (LLMs) have become indispensable tools for building efficient and automated coding pipelines. Existing models are typically post-trained using reinforcement learning (RL) from general-purpose LLMs using "human instruction-final answer" pairs, where the instructions are usually from manual annotations. However, collecting high-quality coding instructions is both labor-intensive and difficult to scale. On the other hand, code snippets are abundantly available from various sources. This imbalance presents a major bottleneck in instruction-based post-training. We propose CodeBoost, a post-training framework that enhances code LLMs purely from code snippets, without relying on human-annotated instructions. 
CodeBoost introduces the following key components: (1) maximum-clique curation, which selects a representative and diverse training corpus from code; (2) bi-directional prediction, which enables the model to learn from both forward and backward prediction objectives; (3) error-aware prediction, which incorporates learning signals from both correct and incorrect outputs; (4) heterogeneous augmentation, which diversifies the training distribution to enrich code semantics; and (5) heterogeneous rewarding, which guides model learning through multiple reward types including format correctness and execution feedback from both successes and failures. Extensive experiments across several code LLMs and benchmarks verify that CodeBoost consistently improves performance, demonstrating its effectiveness as a scalable and effective training pipeline.[20] ASCoT: An Adaptive Self-Correction Chain-of-Thought Method for Late-Stage Fragility in LLMs
Dongxu Zhang,Ning Yang,Jihua Zhu,Jinnan Yang,Miao Xin,Baoliang Tian
Main category: cs.CL
TL;DR: This paper challenges the assumption that early errors in reasoning chains are most harmful, revealing that late-stage errors are more detrimental. It introduces ASCoT, a method that improves accuracy by adaptively identifying and correcting these vulnerabilities.
Details
Motivation: The motivation stems from the challenge of ensuring the reliability of Chain-of-Thought (CoT) prompting in Large Language Models (LLMs), particularly the assumption that early errors are most harmful. This paper aims to investigate this hypothesis and improve CoT robustness. Method: The researchers conducted error-injection experiments to analyze the impact of error timing in reasoning chains. They developed the ASCoT method, which uses an Adaptive Verification Manager (AVM) and a Multi-Perspective Self-Correction Engine (MSCE) to identify and correct late-stage errors. Result: The experiments revealed 'Late-Stage Fragility', showing that later errors are more harmful than earlier ones. The ASCoT method outperformed standard CoT and other baselines on benchmarks like GSM8K and MATH. Conclusion: The study concludes that late-stage errors in CoT chains are more detrimental than early-stage ones, and the proposed ASCoT method effectively enhances accuracy by addressing these vulnerabilities. Abstract: Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Large Language Models (LLMs), yet the reliability of these reasoning chains remains a critical challenge. A widely held "cascading failure" hypothesis suggests that errors are most detrimental when they occur early in the reasoning process. This paper challenges that assumption through systematic error-injection experiments, revealing a counter-intuitive phenomenon we term "Late-Stage Fragility": errors introduced in the later stages of a CoT chain are significantly more likely to corrupt the final answer than identical errors made at the beginning. To address this specific vulnerability, we introduce the Adaptive Self-Correction Chain-of-Thought (ASCoT) method. ASCoT employs a modular pipeline in which an Adaptive Verification Manager (AVM) operates first, followed by the Multi-Perspective Self-Correction Engine (MSCE). 
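The position-dependent weighting used by the AVM can be illustrated with a small sketch: each step k of an n-step chain receives a score, and steps whose score reaches a threshold are flagged for verification. The power-law form of `impact_score`, the `alpha` and `threshold` parameters, and both function names are assumptions made for illustration; the abstract does not specify how the score is defined.

```python
def impact_score(k: int, n: int, alpha: float = 2.0) -> float:
    """Hypothetical positional impact score for step k (1-indexed) of n steps.

    Assumed form: (k / n) ** alpha, so later steps receive larger weights.
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return (k / n) ** alpha


def steps_to_verify(n: int, threshold: float = 0.5, alpha: float = 2.0) -> list[int]:
    """Return the 1-indexed steps whose assumed impact score reaches the threshold."""
    return [k for k in range(1, n + 1) if impact_score(k, n, alpha) >= threshold]


if __name__ == "__main__":
    # With the assumed defaults, only the late steps of an 8-step chain are flagged.
    print(steps_to_verify(8))
```

Under these assumed defaults, verification effort concentrates on the last few steps of a chain, which is the behavior a late-stage-aware scheme would want; any monotonically increasing function of k would serve the same illustrative purpose.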
The AVM leverages a Positional Impact Score function I(k) that assigns different weights based on the position within the reasoning chains, addressing the Late-Stage Fragility issue by identifying and prioritizing high-risk, late-stage steps. Once these critical steps are identified, the MSCE applies robust, dual-path correction specifically to the failure parts. Extensive experiments on benchmarks such as GSM8K and MATH demonstrate that ASCoT achieves outstanding accuracy, outperforming strong baselines, including standard CoT. Our work underscores the importance of diagnosing specific failure modes in LLM reasoning and advocates for a shift from uniform verification strategies to adaptive, vulnerability-aware correction mechanisms.[21] Decision-Making with Deliberation: Meta-reviewing as a Document-grounded Dialogue
Sukannya Purkayastha,Nils Dycke,Anne Lauscher,Iryna Gurevych
Main category: cs.CL
TL;DR: This paper explores the use of dialogue agents for enhancing meta-reviewing efficiency by generating synthetic training data through LLMs, demonstrating improved performance over existing solutions.
Details
Motivation: The motivation stems from the need to enhance the efficiency of the meta-reviewing process, a critical stage in peer-review, by leveraging dialogue agents to assist decision-making, complementing the traditional summarization approach. Method: The study uses a self-refinement strategy with LLMs to generate synthetic data for training dialogue agents, which are then evaluated against off-the-shelf LLM-based assistants in real-world meta-reviewing scenarios. Result: The proposed method produces higher-quality synthetic data and effective dialogue agents that outperform existing LLM-based assistants in meta-reviewing tasks. Conclusion: The study concludes that dialogue agents can be effectively trained to assist in meta-reviewing tasks by generating synthetic data using LLMs, which enhances the efficiency of the meta-reviewing process. Abstract: Meta-reviewing is a pivotal stage in the peer-review process, serving as the final step in determining whether a paper is recommended for acceptance. Prior research on meta-reviewing has treated this as a summarization problem over review reports. However, complementary to this perspective, meta-reviewing is a decision-making process that requires weighing reviewer arguments and placing them within a broader context. Prior research has demonstrated that decision-makers can be effectively assisted in such scenarios via dialogue agents. In line with this framing, we explore the practical challenges for realizing dialog agents that can effectively assist meta-reviewers. Concretely, we first address the issue of data scarcity for training dialogue agents by generating synthetic data using Large Language Models (LLMs) based on a self-refinement strategy to improve the relevance of these dialogues to expert domains. Our experiments demonstrate that this method produces higher-quality synthetic data and can serve as a valuable resource towards training meta-reviewing assistants. 
Subsequently, we utilize this data to train dialogue agents tailored for meta-reviewing and find that these agents outperform off-the-shelf LLM-based assistants for this task. Finally, we apply our agents in real-world meta-reviewing scenarios and confirm their effectiveness in enhancing the efficiency of meta-reviewing. (Code and data: https://github.com/UKPLab/arxiv2025-meta-review-as-dialog)[22] SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens
Nikita Dragunov,Temurbek Rahmatullaev,Elizaveta Goncharova,Andrey Kuznetsov,Anton Razzhigaev
Main category: cs.CL
TL;DR: SONAR-LLM is a decoder-only transformer that combines the semantic abstraction of LCM with a likelihood-based training signal, achieving competitive generation quality across model sizes from 39M to 1.3B parameters.
Details
Motivation: The motivation is to retain the semantic abstraction of the Large Concept Model (LCM) while addressing its limitations by eliminating its diffusion sampler and restoring a likelihood-based training signal. Method: The authors introduce SONAR-LLM, a decoder-only transformer that operates in the continuous SONAR embedding space but is trained using token-level cross-entropy through a frozen SONAR decoder, combining the semantic abstraction of LCM with a likelihood-based training signal. Result: SONAR-LLM achieves competitive generation quality across model sizes from 39M to 1.3B parameters, with the authors reporting scaling trends, ablations, benchmark results, and releasing all training code and pretrained checkpoints. Conclusion: SONAR-LLM attains competitive generation quality across model sizes from 39M to 1.3B parameters while eliminating LCM's diffusion sampler and restoring a likelihood-based training signal. Abstract: The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings and training with either mean-squared error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer that "thinks" in the same continuous SONAR embedding space, yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, benchmark results, and release the complete training code and all pretrained checkpoints to foster reproducibility and future research.[23] Efficient Reasoning for Large Reasoning Language Models via Certainty-Guided Reflection Suppression
Jiameng Huang,Baijiong Lin,Guhao Feng,Jierun Chen,Di He,Lu Hou
Main category: cs.CL
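TL;DR: This paper proposes CGRS, a new method that reduces unnecessary reflection behavior in large reasoning language models, significantly lowering computational cost while maintaining accuracy.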
Details
Motivation: The reflection behaviors of existing large reasoning language models can lead to redundant reasoning steps, raising inference costs and reducing practical utility, so a method is needed to address this. Method: The authors propose Certainty-Guided Reflection Suppression (CGRS), which dynamically suppresses the generation of reflection triggers when the model is highly confident, without retraining or architectural changes. Result: Experiments show that CGRS reduces token usage by an average of 18.5% to 41.9% while maintaining accuracy, and achieves the best balance between length reduction and performance across multiple benchmarks and model architectures. Conclusion: CGRS effectively mitigates overthinking in large reasoning language models while preserving reasoning accuracy, giving it practical value for efficient reasoning. Abstract: Recent Large Reasoning Language Models (LRLMs) employ long chain-of-thought reasoning with complex reflection behaviors, typically signaled by specific trigger words (e.g., "Wait" and "Alternatively") to enhance performance. However, these reflection behaviors can lead to the overthinking problem, in which redundant reasoning steps are generated that unnecessarily increase token usage, raise inference costs, and reduce practical utility. In this paper, we propose Certainty-Guided Reflection Suppression (CGRS), a novel method that mitigates overthinking in LRLMs while maintaining reasoning accuracy. CGRS operates by dynamically suppressing the model's generation of reflection triggers when it exhibits high confidence in its current response, thereby preventing redundant reflection cycles without compromising output quality. Our approach is model-agnostic, requires no retraining or architectural modifications, and can be integrated seamlessly with existing autoregressive generation pipelines. Extensive experiments across four reasoning benchmarks (i.e., AIME24, AMC23, MATH500, and GPQA-D) demonstrate CGRS's effectiveness: it reduces token usage by an average of 18.5% to 41.9% while preserving accuracy. It also achieves the optimal balance between length reduction and performance compared to state-of-the-art baselines. These results hold consistently across model architectures (e.g., DeepSeek-R1-Distill series, QwQ-32B, and Qwen3 family) and scales (4B to 32B parameters), highlighting CGRS's practical value for efficient reasoning.[24] Evaluation of a Sign Language Avatar on Comprehensibility, User Experience & Acceptability
Fenya Wasserroth,Eleftherios Avramidis,Vera Czehmann,Tanja Kojic,Fabrizio Nunnari,Sebastian Möller
Main category: cs.CL
TL;DR: This study explores the impact of adjustable features in sign language avatars on Microsoft Hololens 2, finding that while users prefer adjustability, it doesn't significantly improve user experience or comprehensibility unless paired with better animation, facial expressions, and interaction design.
Details
Motivation: The motivation of the paper is to understand how adjustable features affect the user experience and comprehensibility of sign language avatars in augmented reality, particularly for expert sign language users. It aims to identify key factors influencing acceptability and usability in such systems. Method: Expert German Sign Language (DGS) users interacted with both adjustable and non-adjustable avatars on a Microsoft Hololens 2 device. Their interactions were analyzed for comprehensibility, user experience (UX), acceptability, and stress levels. The study also evaluated the intuitiveness of adjustment gestures and the impact of missing SL elements and implementation flaws. Result: Despite user preference for adjustable settings, there were no significant improvements in UX or comprehensibility, which remained low due to missing SL elements (e.g., facial expressions, mouthing) and implementation issues (e.g., unclear hand shapes, lack of feedback). Hedonic quality was rated higher than pragmatic quality, stress levels were higher for the adjustable avatar, and concerns were raised about gesture intuitiveness. Acceptability of adjustability was positive but heavily dependent on usability and animation quality. Conclusion: The study concludes that personalization through adjustable features alone is insufficient for SL avatars; comprehensibility by default is essential. Adjustments must be paired with improvements in animation quality, interaction design, and the inclusion of SL elements like facial expressions and mouthing. The research emphasizes the importance of user involvement in design processes to enhance usability and acceptability. Abstract: This paper presents an investigation into the impact of adding adjustment features to an existing sign language (SL) avatar on a Microsoft Hololens 2 device. 
Through a detailed analysis of interactions of expert German Sign Language (DGS) users with both adjustable and non-adjustable avatars in a specific use case, this study identifies the key factors influencing the comprehensibility, the user experience (UX), and the acceptability of such a system. Despite user preference for adjustable settings, no significant improvements in UX or comprehensibility were observed, which remained at low levels, amid missing SL elements (mouthings and facial expressions) and implementation issues (indistinct hand shapes, lack of feedback and menu positioning). Hedonic quality was rated higher than pragmatic quality, indicating that users found the system more emotionally or aesthetically pleasing than functionally useful. Stress levels were higher for the adjustable avatar, reflecting lower performance, greater effort and more frustration. Additionally, concerns were raised about whether the Hololens adjustment gestures are intuitive and easy to familiarise oneself with. While acceptability of the concept of adjustability was generally positive, it was strongly dependent on usability and animation quality. This study highlights that personalisation alone is insufficient, and that SL avatars must be comprehensible by default. Key recommendations include enhancing mouthing and facial animation, improving interaction interfaces, and applying participatory design.[25] Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025
Samy Ateia,Udo Kruschwitz
Main category: cs.CL
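TL;DR: This paper investigates whether self-feedback mechanisms in large language models (LLMs) improve performance on biomedical professional search tasks, and compares reasoning and non-reasoning models in their ability to generate feedback.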
Details
Motivation: Although agentic Retrieval Augmented Generation (RAG) and "deep research" systems aim to enable autonomous search processes with large language models (LLMs), applying them to professional domains such as biomedical research is challenging, because automated systems may reduce user involvement and misalign with expert information needs. Method: Using expert-formulated questions from the BioASQ CLEF 2025 challenge, the authors evaluate LLMs including Gemini-Flash 2.0, o3-mini, o4-mini, and DeepSeek-R1. Through a self-feedback mechanism, models generate, evaluate, and refine their outputs for query expansion and for multiple answer types (yes/no, factoid, list, ideal). Result: Preliminary results indicate that the effectiveness of the self-feedback strategy varies across models and tasks, and that reasoning models are not clearly better than non-reasoning models at generating useful feedback. Conclusion: The paper examines the effectiveness of self-feedback for current reasoning and non-reasoning LLMs on biomedical professional search tasks, offering insights into LLM self-correction and suggestions for future work comparing LLM-generated feedback with expert input. Abstract: Agentic Retrieval Augmented Generation (RAG) and 'deep research' systems aim to enable autonomous search processes where Large Language Models (LLMs) iteratively refine outputs. However, applying these systems to domain-specific professional search, such as biomedical research, presents challenges, as automated systems may reduce user involvement and misalign with expert information needs. Professional search tasks often demand high levels of user expertise and transparency. The BioASQ CLEF 2025 challenge, using expert-formulated questions, can serve as a platform to study these issues. We explored the performance of current reasoning and non-reasoning LLMs like Gemini-Flash 2.0, o3-mini, o4-mini and DeepSeek-R1. A key aspect of our methodology was a self-feedback mechanism where LLMs generated, evaluated, and then refined their outputs for query expansion and for multiple answer types (yes/no, factoid, list, ideal). We investigated whether this iterative self-correction improves performance and if reasoning models are more capable of generating useful feedback. Preliminary results indicate varied performance for the self-feedback strategy across models and tasks. This work offers insights into LLM self-correction and informs future work on comparing the effectiveness of LLM-generated feedback with direct human expert input in these search systems.[26] The TUB Sign Language Corpus Collection
Eleftherios Avramidis,Vera Czehmann,Fabian Deckert,Lorenz Hufe,Aljoscha Lipski,Yuni Amaloa Quintero Villalobos,Tae Kwon Rhee,Mengqian Shi,Lennart Stölting,Fabrizio Nunnari,Sebastian Möller
Main category: cs.CL
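TL;DR: The paper presents a collection of video corpora covering 12 sign languages, totaling more than 1,300 hours, and describes the data collection methods and statistics.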
Details
Motivation: To build large-scale parallel sign language corpora that provide data for sign language research and applications. Method: The corpora were created by collecting and processing videos from multiple online sources, mainly broadcast material from news shows, governmental bodies, and educational channels. Result: The paper creates a video corpus collection of 12 sign languages totaling more than 1,300 hours in 4,381 video files, with subtitles containing 14 million tokens. Conclusion: By collecting and processing video material from multiple countries, the paper establishes parallel corpora for 12 sign languages and provides related statistics and an overview of the data collection methods. Abstract: We present a collection of parallel corpora of 12 sign languages in video format, together with subtitles in the dominant spoken languages of the corresponding countries. The entire collection includes more than 1,300 hours in 4,381 video files, accompanied by 1.3M subtitles containing 14M tokens. Most notably, it includes the first consistent parallel corpora for 8 Latin American sign languages, whereas the size of the German Sign Language corpora is ten times the size of the previously available corpora. The collection was created by collecting and processing videos of multiple sign languages from various online sources, mainly broadcast material of news shows, governmental bodies and educational channels. The preparation involved several stages, including data collection, informing the content creators and seeking usage approvals, scraping, and cropping. The paper provides statistics on the collection and an overview of the methods used to collect the data.[27] MyCulture: Exploring Malaysia's Diverse Culture under Low-Resource Language Constraints
Zhong Ken Hew,Jia Xin Low,Sze Jue Yang,Chee Seng Chan
Main category: cs.CL
TL;DR: This paper introduces MyCulture, a new benchmark to evaluate Large Language Models' understanding of Malaysian culture, highlighting disparities among models and the need for more inclusive evaluation methods.
Details
Motivation: The motivation stems from the cultural biases in LLMs due to training data dominated by high-resource languages like English and Chinese, which challenges the accurate representation and evaluation of diverse cultural contexts, especially in low-resource language settings. Method: The researchers introduced MyCulture, a benchmark for evaluating LLMs on Malaysian culture using an open-ended multiple-choice question format. They theoretically justified the effectiveness of this format, analyzed structural bias by comparing structured and free-form outputs, and assessed language bias through multilingual prompt variations. Result: The evaluation using MyCulture revealed significant disparities in cultural comprehension across regional and international LLMs. Conclusion: The study concludes that there are significant disparities in cultural comprehension among LLMs, emphasizing the need for more culturally grounded and linguistically inclusive benchmarks in LLM development and assessment. Abstract: Large Language Models (LLMs) often exhibit cultural biases due to training data dominated by high-resource languages like English and Chinese. This poses challenges for accurately representing and evaluating diverse cultural contexts, particularly in low-resource language settings. To address this, we introduce MyCulture, a benchmark designed to comprehensively evaluate LLMs on Malaysian culture across six pillars: arts, attire, customs, entertainment, food, and religion presented in Bahasa Melayu. Unlike conventional benchmarks, MyCulture employs a novel open-ended multiple-choice question format without predefined options, thereby reducing guessing and mitigating format bias. We provide a theoretical justification for the effectiveness of this open-ended structure in improving both fairness and discriminative power. 
Furthermore, we analyze structural bias by comparing model performance on structured versus free-form outputs, and assess language bias through multilingual prompt variations. Our evaluation across a range of regional and international LLMs reveals significant disparities in cultural comprehension, highlighting the urgent need for culturally grounded and linguistically inclusive benchmarks in the development and assessment of LLMs.[28] LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
Ming Zhang,Yujiong Shen,Jingyi Deng,Yuhui Wang,Yue Zhang,Junzhe Wang,Shichun Liu,Shihan Dou,Huayu Sha,Qiyuan Peng,Changhao Jiang,Jingqi Tong,Yilong Wu,Zhihao Zhang,Mingqi Wu,Zhiheng Xi,Mingxu Chai,Tao Liang,Zhihui Fei,Zhen Wang,Mingyang Wan,Guojun Ma,Tao Gui,Qi Zhang,Xuanjing Huang
Main category: cs.CL
TL;DR: LLMEval-3 introduces a dynamic evaluation framework for Large Language Models, addressing issues of data contamination and overfitting found in static benchmarks, and offering a robust method for assessing true model capabilities.
Details
Motivation: Existing evaluation of LLMs on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. Method: LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process. Result: A 20-month longitudinal study of nearly 50 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. Conclusion: LLMEval-3 offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards. Abstract: Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. A 20-month longitudinal study of nearly 50 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. 
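To make the idea of dynamically sampling unseen test sets concrete, here is a minimal sketch. The `QuestionBank` class, its method names, and the per-model bookkeeping are hypothetical; the abstract does not describe how LLMEval-3 actually implements sampling.

```python
import random


class QuestionBank:
    """Hypothetical question bank that never serves a model the same question twice."""

    def __init__(self, question_ids, seed=0):
        self._ids = list(question_ids)
        self._seen = {}  # model name -> set of question ids already served
        self._rng = random.Random(seed)

    def sample_unseen(self, model, size):
        """Draw `size` questions this model has not been evaluated on yet."""
        seen = self._seen.setdefault(model, set())
        pool = [q for q in self._ids if q not in seen]
        if size > len(pool):
            raise ValueError("not enough unseen questions left for this model")
        batch = self._rng.sample(pool, size)
        seen.update(batch)
        return batch


if __name__ == "__main__":
    bank = QuestionBank(range(1000))
    first_run = bank.sample_unseen("model-a", 50)
    second_run = bank.sample_unseen("model-a", 50)
    # Successive runs for the same model share no questions.
    print(set(first_run).isdisjoint(second_run))
```

Tracking served questions per model is one simple way to guarantee that repeated evaluation runs cannot reuse items; a production system would presumably also need persistence and stratification by subject, which this sketch leaves out.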
LLMEval-3 offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.[29] TASE: Token Awareness and Structured Evaluation for Multilingual Language Models
Chenzhuo Zhao,Xinda Wang,Yue Huang,Junting Lu,Ziqian Liu
Main category: cs.CL
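TL;DR: This paper introduces TASE, a comprehensive benchmark for evaluating the token-level perception and reasoning abilities of large language models in multilingual settings.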
Details
Motivation: Although large language models perform well on high-level semantic tasks, they often struggle with fine-grained token-level understanding and structural reasoning in applications that require precision and control. Method: The paper introduces TASE, a comprehensive benchmark of 10 tasks spanning token awareness and structural understanding, with a 35,927-instance evaluation set and a scalable synthetic data generation pipeline for training. Over 30 leading commercial and open-source LLMs are evaluated, and a custom Qwen2.5-14B model is trained using the GRPO training method. Result: Results show that human performance significantly outpaces current LLMs, revealing persistent weaknesses in token-level reasoning. Conclusion: TASE exposes the shortcomings of current LLMs in token-level reasoning and provides a new diagnostic lens for improving low-level language understanding and cross-lingual generalization. The code and dataset are publicly available on GitHub. Abstract: While large language models (LLMs) have demonstrated remarkable performance on high-level semantic tasks, they often struggle with fine-grained, token-level understanding and structural reasoning--capabilities that are essential for applications requiring precision and control. We introduce TASE, a comprehensive benchmark designed to evaluate LLMs' ability to perceive and reason about token-level information across languages. TASE covers 10 tasks under two core categories: token awareness and structural understanding, spanning Chinese, English, and Korean, with a 35,927-instance evaluation set and a scalable synthetic data generation pipeline for training. Tasks include character counting, token alignment, syntactic structure parsing, and length constraint satisfaction. We evaluate over 30 leading commercial and open-source LLMs, including O3, Claude 4, Gemini 2.5 Pro, and DeepSeek-R1, and train a custom Qwen2.5-14B model using the GRPO training method. Results show that human performance significantly outpaces current LLMs, revealing persistent weaknesses in token-level reasoning. TASE sheds light on these limitations and provides a new diagnostic lens for future improvements in low-level language understanding and cross-lingual generalization. Our code and dataset are publicly available at https://github.com/cyzcz/Tase .[30] Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations
Li-Chun Lu,Miri Liu,Pin-Chun Lu,Yufei Tian,Shao-Hua Sun,Nanyun Peng
Main category: cs.CL
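TL;DR: The study compares different creativity measures, finds that they have limitations, and stresses the need for improved evaluation frameworks.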
Details
Motivation: To better understand and evaluate creativity, the authors seek to identify the limitations of current creativity measures. Method: Systematically examines, analyzes, and compares representative creativity measures across diverse creative domains. Result: The analysis reveals that these measures exhibit limited consistency and highlights their key limitations. Conclusion: The study underscores the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity. Abstract: We systematically examine, analyze, and compare representative creativity measures--creativity index, perplexity, syntactic templates, and LLM-as-a-Judge--across diverse creative domains, including creative writing, unconventional problem-solving, and research ideation. Our analyses reveal that these metrics exhibit limited consistency, capturing different dimensions of creativity. We highlight key limitations, including the creativity index's focus on lexical diversity, perplexity's sensitivity to model confidence, and syntactic templates' inability to capture conceptual creativity. Additionally, LLM-as-a-Judge shows instability and bias. Our findings underscore the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity.
[31] LAG: Logic-Augmented Generation from a Cartesian Perspective
Yilin Xiao,Chuang Zhou,Qinggang Zhang,Su Dong,Shengyuan Chen,Xiao Huang
Main category: cs.CL
TL;DR: This paper introduces Logic-Augmented Generation (LAG), a new approach to improve the reasoning capabilities of large language models by decomposing complex questions, using logical dependencies, and preventing error propagation, resulting in more robust and human-aligned problem-solving.
Details
Motivation: Large language models (LLMs) show remarkable capabilities but exhibit limitations in knowledge-intensive tasks, often generating hallucinations. Retrieval-augmented generation (RAG) struggles with complex reasoning due to its reliance on direct semantic retrieval and lack of structured logical organization. The paper is inspired by Cartesian principles to develop a more effective solution. Method: This paper introduces Logic-Augmented Generation (LAG), which reframes knowledge augmentation through systematic question decomposition and dependency-aware reasoning. It decomposes complex questions into atomic sub-questions ordered by logical dependencies, resolves them sequentially while using prior answers to guide context retrieval, incorporates a logical termination mechanism to prevent error propagation, and synthesizes all sub-resolutions to generate verified responses. Result: Experiments on four benchmark datasets demonstrate that LAG significantly enhances reasoning robustness, reduces hallucination, and aligns LLM problem-solving with human cognition. Conclusion: LAG offers a principled alternative to existing RAG systems by enhancing reasoning robustness, reducing hallucinations, and aligning LLM problem-solving with human cognition. Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet exhibit critical limitations in knowledge-intensive tasks, often generating hallucinations when faced with questions requiring specialized expertise. While retrieval-augmented generation (RAG) mitigates this by integrating external knowledge, it struggles with complex reasoning scenarios due to its reliance on direct semantic retrieval and lack of structured logical organization. 
Inspired by Cartesian principles from Discours de la méthode, this paper introduces Logic-Augmented Generation (LAG), a novel paradigm that reframes knowledge augmentation through systematic question decomposition and dependency-aware reasoning. Specifically, LAG first decomposes complex questions into atomic sub-questions ordered by logical dependencies. It then resolves these sequentially, using prior answers to guide context retrieval for subsequent sub-questions, ensuring stepwise grounding in a logical chain. To prevent error propagation, LAG incorporates a logical termination mechanism that halts inference upon encountering unanswerable sub-questions and reduces wasted computation on excessive reasoning. Finally, it synthesizes all sub-resolutions to generate verified responses. Experiments on four benchmark datasets demonstrate that LAG significantly enhances reasoning robustness, reduces hallucination, and aligns LLM problem-solving with human cognition, offering a principled alternative to existing RAG systems.
[32] The World According to LLMs: How Geographic Origin Influences LLMs' Entity Deduction Capabilities
Harsh Nishant Lalai,Raj Sanjay Shah,Jiaxin Pei,Sashank Varma,Yi-Chia Wang,Ali Emami
Main category: cs.CL
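TL;DR: This study uses the 20 Questions game to test the implicit biases of large language models (LLMs), finding geographic disparities in how well models deduce geographic and cultural entities and revealing latent biases in their training data.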
Details
Motivation: Although large language models (LLMs) have been extensively tuned to mitigate explicit biases, they still exhibit implicit biases rooted in their pre-training data. This paper studies these biases through the behavior of models that proactively ask questions. Method: The study uses a new dataset, Geo20Q+, to test LLMs on the 20 Questions game and analyzes their deduction across geographic regions and culturally significant objects. It also tests LLMs across multiple languages and gameplay configurations. Result: LLMs are more successful at deducing entities from the Global North and Global West than from the Global South and Global East. Wikipedia pageviews and pre-training corpus frequency only partially explain these disparities. Conclusion: Creative, free-form evaluation frameworks can uncover subtle biases in LLMs that remain hidden in standard prompting setups. The study finds geographic and cultural disparities embedded in LLMs' reasoning processes. Abstract: Large Language Models (LLMs) have been extensively tuned to mitigate explicit biases, yet they often exhibit subtle implicit biases rooted in their pre-training data. Rather than directly probing LLMs with human-crafted questions that may trigger guardrails, we propose studying how models behave when they proactively ask questions themselves. The 20 Questions game, a multi-turn deduction task, serves as an ideal testbed for this purpose. We systematically evaluate geographic performance disparities in entity deduction using a new dataset, Geo20Q+, consisting of both notable people and culturally significant objects (e.g., foods, landmarks, animals) from diverse regions. We test popular LLMs across two gameplay configurations (canonical 20-question and unlimited turns) and in seven languages (English, Hindi, Mandarin, Japanese, French, Spanish, and Turkish). Our results reveal geographic disparities: LLMs are substantially more successful at deducing entities from the Global North than the Global South, and the Global West than the Global East. While Wikipedia pageviews and pre-training corpus frequency correlate mildly with performance, they fail to fully explain these disparities. Notably, the language in which the game is played has minimal impact on performance gaps. These findings demonstrate the value of creative, free-form evaluation frameworks for uncovering subtle biases in LLMs that remain hidden in standard prompting setups. By analyzing how models initiate and pursue reasoning goals over multiple turns, we find geographic and cultural disparities embedded in their reasoning processes.
We release the dataset (Geo20Q+) and code at https://sites.google.com/view/llmbias20q/home.
[33] CoCoLex: Confidence-guided Copy-based Decoding for Grounded Legal Text Generation
Santosh T. Y. S. S,Youssef Tarek Elkhayat,Oana Ichim,Pranav Shetty,Dongsheng Wang,Zhiqiang Ma,Armineh Nourbakhsh,Xiaomo Liu
Main category: cs.CL
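TL;DR: CoCoLex improves the fidelity of generated text by dynamically interpolating the model-produced vocabulary distribution with a distribution based on copying from the context.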
Details
Motivation: The adoption of LLMs in the legal domain has been hindered by their tendency to generate unfaithful, ungrounded, or hallucinatory outputs. Although Retrieval-Augmented Generation offers a promising solution by grounding generations in external knowledge, it offers no guarantee that the provided context will be effectively integrated. Method: Proposes CoCoLex, a decoding strategy that combines the model-produced vocabulary distribution with a distribution based on copying from the context, interpolating dynamically according to the model's confidence. Result: Experiments show that CoCoLex outperforms existing context-aware decoding methods on five legal benchmarks, particularly in long-form generation tasks. Conclusion: CoCoLex is a decoding strategy that improves the fidelity of generated text by copying from the context, outperforming existing context-aware decoding methods especially in long-form generation tasks. Abstract: Due to their ability to process long and complex contexts, LLMs can offer key benefits to the Legal domain, but their adoption has been hindered by their tendency to generate unfaithful, ungrounded, or hallucinatory outputs. While Retrieval-Augmented Generation offers a promising solution by grounding generations in external knowledge, it offers no guarantee that the provided context will be effectively integrated. To address this, context-aware decoding strategies have been proposed to amplify the influence of relevant context, but they usually do not explicitly enforce faithfulness to the context. In this work, we introduce Confidence-guided Copy-based Decoding for Legal Text Generation (CoCoLex), a decoding strategy that dynamically interpolates the model-produced vocabulary distribution with a distribution derived based on copying from the context. CoCoLex encourages direct copying based on the model's confidence, ensuring greater fidelity to the source. Experimental results on five legal benchmarks demonstrate that CoCoLex outperforms existing context-aware decoding methods, particularly in long-form generation tasks.
[34] Conformal Sets in Multiple-Choice Question Answering under Black-Box Settings with Provable Coverage Guarantees
Guang Yang,Xinyang Liu
Main category: cs.CL
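TL;DR: The paper proposes a frequency-based uncertainty quantification method to improve the reliability of large language models in multiple-choice question answering.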
Details
Motivation: Although large language models perform well in multiple-choice question answering, their inherent unreliability limits their application in high-risk domains. Method: Proposes a frequency-based uncertainty quantification method that leverages conformal prediction to ensure provable coverage guarantees and computes predictive entropy to assess the model's output distribution. Result: Experimental evaluations show that frequency-based predictive entropy outperforms logit-based predictive entropy in distinguishing correct from incorrect predictions and effectively controls the empirical miscoverage rate. Conclusion: This work provides a distribution-free, model-agnostic framework that enhances the trustworthiness of large language models in practical applications. Abstract: Large Language Models (LLMs) have shown remarkable progress in multiple-choice question answering (MCQA), but their inherent unreliability, such as hallucination and overconfidence, limits their application in high-risk domains. To address this, we propose a frequency-based uncertainty quantification method under black-box settings, leveraging conformal prediction (CP) to ensure provable coverage guarantees. Our approach involves multiple independent samplings of the model's output distribution for each input, with the most frequent sample serving as a reference to calculate predictive entropy (PE). Experimental evaluations across six LLMs and four datasets (MedMCQA, MedQA, MMLU, MMLU-Pro) demonstrate that frequency-based PE outperforms logit-based PE in distinguishing between correct and incorrect predictions, as measured by AUROC. Furthermore, the method effectively controls the empirical miscoverage rate under user-specified risk levels, validating that sampling frequency can serve as a viable substitute for logit-based probabilities in black-box scenarios. This work provides a distribution-free model-agnostic framework for reliable uncertainty quantification in MCQA with guaranteed coverage, enhancing the trustworthiness of LLMs in practical applications.
[35] Do Political Opinions Transfer Between Western Languages? An Analysis of Unaligned and Aligned Multilingual LLMs
Franziska Weeber,Tanise Ceron,Sebastian Padó
Main category: cs.CL
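TL;DR: The study finds that, in Western language contexts, political opinions transfer across languages in multilingual large language models, demonstrating the challenges in achieving explicit socio-linguistic, cultural, and political alignment.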
Details
Motivation: Public opinion surveys show differences in political opinions across socio-cultural contexts, but it is unclear whether these differences translate to cross-lingual differences in multilingual large language models. Method: Using direct preference optimization with English-only alignment data, evaluates the transfer of political opinions in multilingual large language models of various sizes across five Western languages. Result: Unaligned models show very few significant cross-lingual differences, while political alignment shifts opinions almost uniformly across all five languages. Conclusion: Political opinions transfer between languages, indicating that multilingual large language models face challenges in socio-linguistic, cultural, and political alignment. Abstract: Public opinion surveys show cross-cultural differences in political opinions between socio-cultural contexts. However, there is no clear evidence whether these differences translate to cross-lingual differences in multilingual large language models (MLLMs). We analyze whether opinions transfer between languages or whether there are separate opinions for each language in MLLMs of various sizes across five Western languages. We evaluate MLLMs' opinions by prompting them to report their (dis)agreement with political statements from voting advice applications. To better understand the interaction between languages in the models, we evaluate them both before and after aligning them with more left or right views using direct preference optimization and English alignment data only. Our findings reveal that unaligned models show only very few significant cross-lingual differences in the political opinions they reflect. The political alignment shifts opinions almost uniformly across all five languages. We conclude that in Western language contexts, political opinions transfer between languages, demonstrating the challenges in achieving explicit socio-linguistic, cultural, and political alignment of MLLMs.
[36] MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy
Shaoxiong Zhan,Yanlin Lai,Ziyu Lu,Dahua Lin,Ziqing Yang,Fei Tang
Main category: cs.CL
TL;DR: MathSmith is a novel framework for synthesizing challenging mathematical problems to enhance LLM reasoning, which constructs new problems from scratch, ensures data independence, avoids contamination, and outperforms existing baselines.
Details
Motivation: The advancement of large language models in mathematical reasoning is limited by the scarcity of high-quality, high-difficulty training data. Existing synthesis methods limit diversity and scalability by largely relying on transforming human-written templates. Method: MathSmith constructs new mathematical problems from scratch by randomly sampling concept-explanation pairs from PlanetMath, uses nine predefined strategies as soft constraints during rationales to increase difficulty, and adopts reinforcement learning to optimize structural validity, reasoning complexity, and answer consistency. Result: Experiments across five benchmarks show that MathSmith consistently outperforms existing baselines under both short and long CoT settings. Additionally, a weakness-focused variant generation module enables targeted improvement on specific concepts. Conclusion: MathSmith exhibits strong scalability, generalization, and transferability, highlighting the promise of high-difficulty synthetic data in advancing LLM reasoning capabilities. Abstract: Large language models have achieved substantial progress in mathematical reasoning, yet their advancement is limited by the scarcity of high-quality, high-difficulty training data. Existing synthesis methods largely rely on transforming human-written templates, limiting both diversity and scalability. We propose MathSmith, a novel framework for synthesizing challenging mathematical problems to enhance LLM reasoning. Rather than modifying existing problems, MathSmith constructs new ones from scratch by randomly sampling concept-explanation pairs from PlanetMath, ensuring data independence and avoiding contamination. To increase difficulty, we design nine predefined strategies as soft constraints during rationales. We further adopt reinforcement learning to jointly optimize structural validity, reasoning complexity, and answer consistency.
The length of the reasoning trace generated under autoregressive prompting is used to reflect cognitive complexity, encouraging the creation of more demanding problems aligned with long-chain-of-thought reasoning. Experiments across five benchmarks, categorized as easy & medium (GSM8K, MATH-500) and hard (AIME2024, AIME2025, OlympiadBench), show that MathSmith consistently outperforms existing baselines under both short and long CoT settings. Additionally, a weakness-focused variant generation module enables targeted improvement on specific concepts. Overall, MathSmith exhibits strong scalability, generalization, and transferability, highlighting the promise of high-difficulty synthetic data in advancing LLM reasoning capabilities.
[37] Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models
Haitao Hong,Yuchen Yan,Xingyu Wu,Guiyang Hou,Wenqi Zhang,Weiming Lu,Yongliang Shen,Jun Xiao
Main category: cs.CL
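TL;DR: This paper proposes Cooper, a framework that jointly optimizes the policy model and dynamically updates the reward model, effectively mitigating reward hacking and achieving better performance gains in reinforcement learning tasks.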
Details
Motivation: Existing model-based and rule-based reward methods suffer from poor robustness and reward hacking, calling for a more effective reinforcement learning framework. Method: Proposes Cooper, an RL framework that jointly optimizes the policy model and the reward model, and introduces a reference-based reward modeling paradigm. Result: VerifyRM outperforms models of the same size on VerifyBench, and Cooper improves average accuracy by 0.54% on Qwen2.5-1.5B-Instruct. Conclusion: By dynamically updating the reward model, Cooper effectively mitigates reward hacking and improves overall reinforcement learning performance. Abstract: Large language models (LLMs) have demonstrated remarkable performance in reasoning tasks, where reinforcement learning (RL) serves as a key algorithm for enhancing their reasoning capabilities. Currently, there are two mainstream reward paradigms: model-based rewards and rule-based rewards. However, both approaches suffer from limitations: rule-based rewards lack robustness, while model-based rewards are vulnerable to reward hacking. To address these issues, we propose Cooper (Co-optimizing Policy Model and Reward Model), an RL framework that jointly optimizes both the policy model and the reward model. Cooper leverages the high precision of rule-based rewards when identifying correct responses, and dynamically constructs and selects positive-negative sample pairs for continued training of the reward model. This design enhances robustness and mitigates the risk of reward hacking. To further support Cooper, we introduce a hybrid annotation strategy that efficiently and accurately generates training data for the reward model. We also propose a reference-based reward modeling paradigm, where the reward model takes a reference answer as input. Based on this design, we train a reward model named VerifyRM, which achieves higher accuracy on VerifyBench compared to other models of the same size. We conduct reinforcement learning using both VerifyRM and Cooper. Our experiments show that Cooper not only alleviates reward hacking but also improves end-to-end RL performance, for instance, achieving a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct.
Our findings demonstrate that dynamically updating the reward model is an effective way to combat reward hacking, providing a reference for better integrating reward models into RL.
[38] OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks
Zixuan Wang,Dingming Li,Hongxing Li,Shuo Chen,Yuchen Yan,Wenqi Zhang,Yongliang Shen,Weiming Lu,Jun Xiao,Yueting Zhuang
Main category: cs.CL
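TL;DR: OmniEAR is a framework for evaluating the reasoning abilities of language models in embodied tasks, revealing the limitations of current models in dynamic capability acquisition and multi-agent coordination.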
Details
Motivation: Large language models excel at abstract reasoning, but their capacity for embodied agent reasoning remains largely unexplored. Existing benchmarks typically provide predefined tool sets or explicit collaboration directives, whereas real scenarios require agents to dynamically acquire capabilities and make decisions autonomously. Method: OmniEAR models continuous physical properties and complex spatial relationships through text-based environment representation, comprising 1,500 scenarios spanning household and industrial domains, and systematically evaluates how language models reason about physical interactions, tool usage, and multi-agent coordination. Result: Models perform well with explicit instructions (85-96% success), but performance drops for tool reasoning and implicit collaboration (56-85% and 63-85%, respectively), and compound tasks show failure rates over 50%. Complete environmental information actually degrades coordination performance. Fine-tuning is effective for single-agent tasks (0.6% to 76.3%) but yields limited improvement on multi-agent tasks (1.5% to 5.5%). Conclusion: The OmniEAR evaluation framework reveals fundamental challenges that current models face in embodied reasoning, and fine-tuning yields only limited gains on multi-agent tasks, suggesting that improved model architectures are needed to strengthen embodied AI systems. Abstract: Large language models excel at abstract reasoning but their capacity for embodied agent reasoning remains largely unexplored. We present OmniEAR, a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Through text-based environment representation, we model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints: while achieving 85-96% success with explicit instructions, performance drops to 56-85% for tool reasoning and 63-85% for implicit collaboration, with compound tasks showing over 50% failure rates. Surprisingly, complete environmental information degrades coordination performance, indicating models cannot filter task-relevant constraints. Fine-tuning improves single-agent tasks dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations. These findings demonstrate that embodied reasoning poses fundamentally different challenges than current models can address, establishing OmniEAR as a rigorous benchmark for evaluating and advancing embodied AI systems.
Our code and data are included in the supplementary materials and will be open-sourced upon acceptance.
[39] Learning to Reason for Factuality
Xilun Chen,Ilia Kulikov,Vincent-Pierre Berges,Barlas Oğuz,Rulin Shao,Gargi Ghosh,Jason Weston,Wen-tau Yih
Main category: cs.CL
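TL;DR: This study designs a new reward function to address hallucination by reasoning large language models on long-form factuality tasks, and significantly improves model performance through online reinforcement learning.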
Details
Motivation: Reasoning large language models excel at complex reasoning tasks but often hallucinate on long-form factuality tasks, and using existing automatic factuality evaluation methods as rewards in online reinforcement learning leads to reward hacking, so a more reliable reward function design is needed. Method: Proposes a new reward function that jointly considers factual precision, response detail level, and answer relevance, and applies online reinforcement learning guided by this reward to train a factual reasoning model. Result: On six long-form factuality benchmarks, the model achieves an average reduction of 23.1 percentage points in hallucination rate and a 23% increase in answer detail level, with no degradation in overall response helpfulness. Conclusion: Designing a new reward function that combines factual precision, response detail level, and answer relevance, and applying online reinforcement learning, significantly improves reasoning models on long-form factuality, reducing the hallucination rate and increasing answer detail without degrading overall response helpfulness. Abstract: Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has utilized automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly leveraging such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or relevant responses. We propose a novel reward function that simultaneously considers the factual precision, response detail level, and answer relevance, and applies online RL to learn high quality factual reasoning. Evaluated on six long-form factuality benchmarks, our factual reasoning model achieves an average reduction of 23.1 percentage points in hallucination rate, a 23% increase in answer detail level, and no degradation in the overall response helpfulness.
[40] How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations
Brandon Jaipersaud,David Krueger,Ekdeep Singh Lubana
Main category: cs.CL
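TL;DR: This paper studies how linear probes can be used to analyze the persuasion dynamics of large language models (LLMs) in multi-turn conversations, and demonstrates their effectiveness in identifying persuasion strategy, the point of persuasion success, and persuadee personality.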
Details
Motivation: Although large language models have shown the ability to persuade humans, understanding of the underlying dynamics remains limited. Inspired by recent work that uses linear probes to analyze model representations, the authors apply this approach to study persuasion in natural multi-turn conversations. Method: Drawing on insights from cognitive science, trains probes on three aspects of persuasion: persuasion success, persuadee personality, and persuasion strategy. Result: Despite their simplicity, the probes capture various aspects of persuasion at both the sample and dataset levels, for example identifying the point in a conversation where persuasion occurs. Probes are also more efficient than prompting-based approaches and even outperform them in some settings. Conclusion: Linear probes are an effective and efficient tool for studying complex behaviours such as deception and manipulation, especially in multi-turn conversations and large-scale dataset analysis. Abstract: Large Language Models (LLMs) have started to demonstrate the ability to persuade humans, yet our understanding of how this dynamic transpires is limited. Recent work has used linear probes, lightweight tools for analyzing model representations, to study various LLM skills such as the ability to model user sentiment and political perspective. Motivated by this, we apply probes to study persuasion dynamics in natural, multi-turn conversations. We leverage insights from cognitive science to train probes on distinct aspects of persuasion: persuasion success, persuadee personality, and persuasion strategy. Despite their simplicity, we show that they capture various aspects of persuasion at both the sample and dataset levels. For instance, probes can identify the point in a conversation where the persuadee was persuaded or where persuasive success generally occurs across the entire dataset. We also show that in addition to being faster than expensive prompting-based approaches, probes can do just as well and even outperform prompting in some settings, such as when uncovering persuasion strategy. This suggests probes as a plausible avenue for studying other complex behaviours such as deception and manipulation, especially in multi-turn settings and large-scale dataset analysis where prompting-based methods would be computationally inefficient.
[41] H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages
Mehrdad Zakershahrak,Samira Ghodratnama
Main category: cs.CL
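TL;DR: H-NET++ is a tokenizer-free byte-level language model that uses hierarchical dynamic chunking to achieve state-of-the-art performance and greater computational efficiency on tasks in morphologically-rich languages such as Persian.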
Details
Motivation: Byte-level language models avoid the fragility of traditional tokenizers but face computational challenges in morphologically-rich languages, where words often span many bytes. Method: H-NET++ is a hierarchical dynamic-chunking model that learns linguistically-informed segmentation through end-to-end training. Its key innovations include a lightweight Transformer context-mixer, a two-level latent hyper-prior, specialized handling of orthographic artifacts (e.g. Persian ZWNJ), and curriculum-based training. Result: On a 1.4B-token Persian corpus, H-NET++ achieves a 0.159 BPB reduction versus BPE-based GPT-2-fa (12% better compression), a 5.4pp gain on ParsGLUE, 53% improved robustness to ZWNJ corruption, and 73.8% F1 on gold morphological boundaries. Conclusion: H-NET++ offers a tokenizer-free, byte-level language modelling solution that effectively handles the computational challenges of morphologically-rich languages such as Persian and achieves state-of-the-art performance on multiple metrics. Abstract: Byte-level language models eliminate fragile tokenizers but face computational challenges in morphologically-rich languages (MRLs), where words span many bytes. We propose H-NET++, a hierarchical dynamic-chunking model that learns linguistically-informed segmentation through end-to-end training. Key innovations include: (1) a lightweight Transformer context-mixer (1.9M parameters) for cross-chunk attention, (2) a two-level latent hyper-prior for document-level consistency, (3) specialized handling of orthographic artifacts (e.g. Persian ZWNJ), and (4) curriculum-based training with staged sequence lengths. On a 1.4B-token Persian corpus, H-NET++ achieves state-of-the-art results: 0.159 BPB reduction versus BPE-based GPT-2-fa (12% better compression), 5.4pp gain on ParsGLUE, 53% improved robustness to ZWNJ corruption, and 73.8% F1 on gold morphological boundaries. Our learned chunks align with Persian morphology without explicit supervision, demonstrating that hierarchical dynamic chunking provides an effective tokenizer-free solution for MRLs while maintaining computational efficiency.
cs.CV [Back]
[42] RetinexDual: Retinex-based Dual Nature Approach for Generalized Ultra-High-Definition Image Restoration
Mohab Kishawy,Ali Abdellatif Hussein,Jun Chen
Main category: cs.CV
TL;DR: RetinexDual is a novel framework for Ultra-High-Definition Image Restoration that combines spatial and frequency domain techniques to effectively address common image degradation issues, outperforming existing methods.
Details
Motivation: Traditional methods like downsampling and frequency-domain approaches have significant drawbacks in UHD image restoration, such as information loss and ineffectiveness for spatially confined artifacts. Method: RetinexDual uses two complementary sub-networks, SAMBA and FIA, to correct reflectance, color, and illumination distortions across four image restoration tasks. Result: RetinexDual outperforms recent methods in both qualitative and quantitative evaluations on UHD image restoration tasks like deraining, deblurring, dehazing, and low-light enhancement. Conclusion: RetinexDual demonstrates superior performance in Ultra-High-Definition Image Restoration by combining spatial and frequency domain approaches, addressing the limitations of traditional methods. Abstract: Advancements in image sensing have elevated the importance of Ultra-High-Definition Image Restoration (UHD IR). Traditional methods, such as extreme downsampling or transformation from the spatial to the frequency domain, encounter significant drawbacks: downsampling induces irreversible information loss in UHD images, while our frequency analysis reveals that pure frequency-domain approaches are ineffective for spatially confined image artifacts, primarily due to the loss of degradation locality. To overcome these limitations, we present RetinexDual, a novel Retinex theory-based framework designed for generalized UHD IR tasks. RetinexDual leverages two complementary sub-networks: the Scale-Attentive maMBA (SAMBA) and the Frequency Illumination Adaptor (FIA). SAMBA, responsible for correcting the reflectance component, utilizes a coarse-to-fine mechanism to overcome the causal modeling of mamba, which effectively reduces artifacts and restores intricate details. On the other hand, FIA ensures precise correction of color and illumination distortions by operating in the frequency domain and leveraging the global context provided by it. 
Evaluating RetinexDual on four UHD IR tasks, namely deraining, deblurring, dehazing, and Low-Light Image Enhancement (LLIE), shows that it outperforms recent methods qualitatively and quantitatively. Ablation studies demonstrate the importance of employing distinct designs for each branch in RetinexDual, as well as the effectiveness of its various components.
[43] ACM Multimedia Grand Challenge on ENT Endoscopy Analysis
Trong-Thuan Nguyen,Viet-Tham Huynh,Thao Thi Phuong Dao,Ha Nguyen Thi,Tien To Vu Thuy,Uyen Hanh Tran,Tam V. Nguyen,Thanh Dinh Le,Minh-Triet Tran
Main category: cs.CV
TL;DR: The paper introduces ENTRep, a new benchmark for endoscopic image analysis in ENT care, featuring classification, retrieval capabilities, and multi-language support.
Details
Motivation: The motivation is to overcome the limitations in automated analysis of endoscopic imagery in ENT care, such as device and operator variability, subtle findings, and the lack of reliable case retrieval systems. Method: The study introduces ENTRep, a benchmark challenge that includes a dataset of expert-annotated images with anatomical labels and dual-language descriptions, along with three defined benchmark tasks and standardized submission protocols. Result: The result includes the development of ENTRep with expert-annotated images and descriptions, the definition of benchmark tasks, and performance evaluation from top-performing teams. Conclusion: The paper concludes that ENTRep provides a robust framework for analyzing endoscopic imagery with integrated classification and retrieval capabilities, enhancing ENT care. Abstract: Automated analysis of endoscopic imagery is a critical yet underdeveloped component of ENT (ear, nose, and throat) care, hindered by variability in devices and operators, subtle and localized findings, and fine-grained distinctions such as laterality and vocal-fold state. In addition to classification, clinicians require reliable retrieval of similar cases, both visually and through concise textual descriptions. These capabilities are rarely supported by existing public benchmarks. To this end, we introduce ENTRep, the ACM Multimedia 2025 Grand Challenge on ENT endoscopy analysis, which integrates fine-grained anatomical classification with image-to-image and text-to-image retrieval under bilingual (Vietnamese and English) clinical supervision. Specifically, the dataset comprises expert-annotated images, labeled for anatomical region and normal or abnormal status, and accompanied by dual-language narrative descriptions. In addition, we define three benchmark tasks, standardize the submission protocol, and evaluate performance on public and private test splits using server-side scoring. 
Moreover, we report results from the top-performing teams and provide an insightful discussion.
[44] CoMAD: A Multiple-Teacher Self-Supervised Distillation Framework
Sriram Mandalika,Lalitha V
Main category: cs.CV
TL;DR: CoMAD is a lightweight framework that trains a high-performing small model by combining knowledge from multiple self-supervised learning models.
Details
Motivation: Self-supervised learning paradigms are typically pretrained in isolation, overlooking complementary insights and yielding models that are impractical for resource-constrained deployment. Method: CoMAD distills knowledge from three pretrained ViT-Base teachers (MAE, MoCo v3, and iBOT) into a student network using asymmetric masking and joint consensus gating. Result: CoMAD's ViT-Tiny student achieves 75.4% Top-1 accuracy on ImageNet-1K, 47.3% mIoU on ADE20K, and 44.5% box average precision and 40.5% mask average precision on MS-COCO. Conclusion: CoMAD is a lightweight, parameter-free framework that unifies multiple state-of-the-art self-supervised Vision Transformers into a compact student network and sets new state-of-the-art results across several tasks. Abstract: Numerous self-supervised learning paradigms, such as contrastive learning and masked image modeling, learn powerful representations from unlabeled data but are typically pretrained in isolation, overlooking complementary insights and yielding large models that are impractical for resource-constrained deployment. To overcome these challenges, we introduce Consensus-oriented Masked Distillation (CoMAD), a lightweight, parameter-free framework that unifies knowledge from multiple current state-of-the-art self-supervised Vision Transformers into a compact student network. CoMAD distills from three pretrained ViT-Base teachers, MAE, MoCo v3, and iBOT, each offering distinct semantic and contextual priors. Rather than naively averaging teacher outputs, we apply asymmetric masking: the student sees only 25 percent of patches while each teacher receives a progressively lighter, unique mask, forcing the student to interpolate missing features under richer contexts. Teacher embeddings are aligned to the student's space via a linear adapter and layer normalization, then fused through our joint consensus gating, which weights each token by combining cosine affinity with inter-teacher agreement. The student is trained with dual-level KL divergence on visible tokens and reconstructed feature maps, capturing both local and global structure. On ImageNet-1K, CoMAD's ViT-Tiny achieves 75.4 percent Top-1, an increment of 0.4 percent over the previous state-of-the-art.
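As an illustration only, the consensus gating described in the CoMAD abstract (weighting each teacher's token by cosine affinity combined with inter-teacher agreement) might be sketched as follows. The abstract does not give the formula, so everything here is an assumption: the function names `l2_normalize` and `consensus_gate`, the additive combination of the two scores, the softmax over teachers, and the `temperature` parameter are all hypothetical, and the teacher embeddings are assumed to have already been projected into the student's space.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def consensus_gate(student, teachers, temperature=1.0):
    """Fuse per-token teacher embeddings into one target per token.

    student:  (N, D) student token embeddings.
    teachers: (T, N, D) teacher embeddings, assumed already aligned to D.
    Returns (fused, weights) with shapes (N, D) and (T, N).
    NOTE: additive scoring and softmax gating are assumptions, not the
    paper's stated formula.
    """
    s = l2_normalize(student)
    t = l2_normalize(teachers)
    num_teachers = t.shape[0]

    # Cosine affinity between each teacher token and the student token.
    affinity = np.einsum("tnd,nd->tn", t, s)

    # Inter-teacher agreement: mean cosine similarity to the other teachers.
    pairwise = np.einsum("tnd,und->tun", t, t)      # (T, T, N)
    self_sim = np.einsum("tnd,tnd->tn", t, t)       # diagonal terms
    agreement = (pairwise.sum(axis=1) - self_sim) / max(num_teachers - 1, 1)

    # Numerically stable softmax over the teacher axis, per token.
    scores = (affinity + agreement) / temperature
    scores = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)

    # Weighted sum of the (unnormalized) teacher embeddings.
    fused = np.einsum("tn,tnd->nd", weights, teachers)
    return fused, weights
```

With three teachers, `consensus_gate(student, teachers)` returns one fused target per token plus a `(3, N)` weight matrix whose columns sum to one; a teacher that disagrees with both the student and the other teachers on a given token receives a smaller weight there.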
In dense-prediction transfers, it attains 47.3 percent mIoU on ADE20K, and 44.5 percent box average precision and 40.5 percent mask average precision on MS-COCO, establishing a new state-of-the-art in compact SSL distillation.[45] Single-Step Reconstruction-Free Anomaly Detection and Segmentation via Diffusion Models
Mehrdad Moradi,Marco Grasso,Bianca Maria Colosimo,Kamran Paynabar
Main category: cs.CV
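TL;DR: RADAR is an attention-based, real-time anomaly detection method that overcomes the limitations of reconstruction-based approaches and outperforms existing techniques on multiple datasets.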
Details
Motivation: Reconstruction-based anomaly detection faces three main challenges: high computational cost, reconstructions that may correspond to a different normal pattern, and the difficulty of choosing an appropriate intermediate noise level. Method: The paper introduces RADAR, an attention-based real-time anomaly detection method that does not reconstruct the input image and instead produces anomaly maps directly from the diffusion model. Result: On the MVTec-AD dataset and a real-world 3D-printed material application, RADAR surpasses state-of-the-art diffusion-based and statistical machine learning models on all key metrics, including accuracy, precision, recall, and F1 score. Specifically, it improves F1 score by 7% on MVTec-AD and by 13% on the 3D-printed material dataset. Conclusion: By producing anomaly maps directly, RADAR overcomes the limitations of reconstruction-based anomaly detection and improves both detection accuracy and computational efficiency. Abstract: Generative models have demonstrated significant success in anomaly detection and segmentation over the past decade. Recently, diffusion models have emerged as a powerful alternative, outperforming previous approaches such as GANs and VAEs. In typical diffusion-based anomaly detection, a model is trained on normal data, and during inference, anomalous images are perturbed to a predefined intermediate step in the forward diffusion process. The corresponding normal image is then reconstructed through iterative reverse sampling. However, reconstruction-based approaches present three major challenges: (1) the reconstruction process is computationally expensive due to multiple sampling steps, making real-time applications impractical; (2) for complex or subtle patterns, the reconstructed image may correspond to a different normal pattern rather than the original input; and (3) choosing an appropriate intermediate noise level is challenging because it is application-dependent and often assumes prior knowledge of anomalies, an assumption that does not hold in unsupervised settings. We introduce Reconstruction-free Anomaly Detection with Attention-based diffusion models in Real-time (RADAR), which overcomes the limitations of reconstruction-based anomaly detection. Unlike current SOTA methods that reconstruct the input image, RADAR directly produces anomaly maps from the diffusion model, improving both detection accuracy and computational efficiency. We evaluate RADAR on real-world 3D-printed material and the MVTec-AD dataset. 
Our approach surpasses state-of-the-art diffusion-based and statistical machine learning models across all key metrics, including accuracy, precision, recall, and F1 score. Specifically, RADAR improves F1 score by 7% on MVTec-AD and 13% on the 3D-printed material dataset compared to the next best model. Code available at: https://github.com/mehrdadmoradi124/RADAR
[46] A deep learning approach to track eye movements based on events
Chirag Seth,Divya Naiken,Keyan Lin
Main category: cs.CV
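TL;DR: This study develops a cost-effective eye-tracking algorithm using deep learning (specifically a CNN_LSTM model) that reaches approximately 81% accuracy, and plans to use Layer-wise Relevance Propagation (LRP) to further improve the model's interpretability and predictive performance.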
Details
Motivation: Eye movement analysis has broad applications in consumer electronics, especially in VR and AR product development. Because human eyes move rapidly, precise eye tracking usually requires expensive high-speed cameras, so a more economical and interpretable algorithm for predicting human attention is needed to improve device comfort and the overall user experience. Method: The study uses deep learning, specifically a CNN_LSTM model, to predict eye position from event-camera input, and plans to use Layer-wise Relevance Propagation (LRP) to further improve interpretability and predictive performance. Result: The CNN_LSTM model reaches approximately 81% accuracy, indicating that it is effective at predicting eye position. Conclusion: A CNN_LSTM model can effectively predict eye position from event-camera input with approximately 81% accuracy; future work will focus on using LRP to improve interpretability and predictive performance. Abstract: This research project addresses the challenge of accurately tracking eye movements during specific events by leveraging previous research. Given the rapid movements of human eyes, which can reach speeds of 300°/s, precise eye tracking typically requires expensive and high-speed cameras. Our primary objective is to locate the eye center position (x, y) using inputs from an event camera. Eye movement analysis has extensive applications in consumer electronics, especially in VR and AR product development. Therefore, our ultimate goal is to develop an interpretable and cost-effective algorithm using deep learning methods to predict human attention, thereby improving device comfort and enhancing overall user experience. To achieve this goal, we explored various approaches, with the CNN_LSTM model proving most effective, achieving approximately 81% accuracy. Additionally, we propose future work focusing on Layer-wise Relevance Propagation (LRP) to further enhance the model's interpretability and predictive performance.
[47] LuKAN: A Kolmogorov-Arnold Network Framework for 3D Human Motion Prediction
Md Zahidul Hasan,A. Ben Hamza,Nizar Bouguila
Main category: cs.CV
TL;DR: LuKAN is a computationally efficient model for 3D human motion prediction using KANs and Lucas polynomials, achieving high accuracy and temporal coherence.
Details
Motivation: Existing methods struggle to balance prediction accuracy and computational efficiency in 3D human motion prediction. Method: LuKAN uses discrete wavelet transform for temporal encoding, a spatial projection layer for capturing inter-joint dependencies, a Temporal Dependency Learner based on KAN with Lucas polynomial activations, and inverse discrete wavelet transform for reconstructing motion sequences. Result: Extensive experiments show that LuKAN performs competitively against strong baselines in both quantitative and qualitative evaluations while maintaining computational efficiency. Conclusion: The LuKAN model, based on KANs with Lucas polynomial activations, achieves competitive performance and computational efficiency in 3D human motion prediction. Abstract: The goal of 3D human motion prediction is to forecast future 3D poses of the human body based on historical motion data. Existing methods often face limitations in achieving a balance between prediction accuracy and computational efficiency. In this paper, we present LuKAN, an effective model based on Kolmogorov-Arnold Networks (KANs) with Lucas polynomial activations. Our model first applies the discrete wavelet transform to encode temporal information in the input motion sequence. Then, a spatial projection layer is used to capture inter-joint dependencies, ensuring structural consistency of the human body. At the core of LuKAN is the Temporal Dependency Learner, which employs a KAN layer parameterized by Lucas polynomials for efficient function approximation. These polynomials provide computational efficiency and an enhanced capability to handle oscillatory behaviors. Finally, the inverse discrete wavelet transform reconstructs motion sequences in the time domain, generating temporally coherent predictions. 
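The Lucas polynomials named in this abstract follow a simple three-term recurrence, which is what makes them cheap to evaluate. The sketch below uses the standard definition L_0(x) = 2, L_1(x) = x, L_n(x) = x * L_(n-1)(x) + L_(n-2)(x). The function names `lucas_basis` and `kan_edge` and the example coefficients are hypothetical, chosen only for illustration; this is not the authors' implementation.

```python
def lucas_basis(x, degree):
    """Return [L_0(x), ..., L_degree(x)] using the three-term recurrence."""
    if degree < 0:
        raise ValueError("degree must be non-negative")
    values = [2.0]  # L_0(x) = 2
    if degree >= 1:
        values.append(float(x))  # L_1(x) = x
    for _ in range(2, degree + 1):
        # L_n(x) = x * L_{n-1}(x) + L_{n-2}(x)
        values.append(x * values[-1] + values[-2])
    return values


def kan_edge(x, coeffs):
    """Hypothetical KAN-style edge function: a weighted sum of Lucas polynomials.

    In a trained network the coefficients would be learned; here they are
    supplied by the caller.
    """
    basis = lucas_basis(x, len(coeffs) - 1)
    return sum(c * b for c, b in zip(coeffs, basis))


if __name__ == "__main__":
    print(lucas_basis(1, 4))  # at x = 1 the values are the Lucas numbers
    print(kan_edge(1, [0.5, 1.0, 2.0]))
```

Each extra polynomial degree costs one multiplication and one addition, so evaluating the whole basis is linear in the degree.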
Extensive experiments on three benchmark datasets demonstrate the competitive performance of our model compared to strong baselines, as evidenced by both quantitative and qualitative evaluations. Moreover, its compact architecture, coupled with the linear recurrence of Lucas polynomials, ensures computational efficiency.
[48] VER-Bench: Evaluating MLLMs on Reasoning with Fine-Grained Visual Evidence
Chenhui Qiang,Zhaoyang Wei,Xumeng Han,Zipeng Wang,Siyao Li,Xiangyuan Lan,Jianbin Jiao,Zhenjun Han
Main category: cs.CV
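TL;DR: VER-Bench is a new framework for evaluating how well multimodal large language models (MLLMs) identify fine-grained visual clues and carry out complex reasoning.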
Details
Motivation: With the rapid development of MLLMs, evaluating their visual capabilities has become increasingly important. Current benchmarks fall mainly into basic perception benchmarks and mainstream reasoning benchmarks, neither of which comprehensively assesses the intricate analysis that subtle clues require. Method: VER-Bench is a new framework comprising 374 carefully designed questions spanning Geospatial, Temporal, Situational, Intent, System State, and Symbolic reasoning; each question is accompanied by structured evidence: visual clues and the question-related reasoning derived from them. Result: VER-Bench evaluates MLLMs' ability to identify fine-grained visual clues and to integrate them with world knowledge for complex reasoning. Conclusion: VER-Bench reveals current models' limitations in extracting subtle visual evidence and constructing evidence-based arguments, underscoring the need to strengthen fine-grained visual evidence extraction, integration, and reasoning. Abstract: With the rapid development of MLLMs, evaluating their visual capabilities has become increasingly crucial. Current benchmarks primarily fall into two main types: basic perception benchmarks, which focus on local details but lack deep reasoning (e.g., "what is in the image?"), and mainstream reasoning benchmarks, which concentrate on prominent image elements but may fail to assess subtle clues requiring intricate analysis. However, profound visual understanding and complex reasoning depend more on interpreting subtle, inconspicuous local details than on perceiving salient, macro-level objects. These details, though occupying minimal image area, often contain richer, more critical information for robust analysis. To bridge this gap, we introduce the VER-Bench, a novel framework to evaluate MLLMs' ability to: 1) identify fine-grained visual clues, often occupying on average just 0.25% of the image area; 2) integrate these clues with world knowledge for complex reasoning. Comprising 374 carefully designed questions across Geospatial, Temporal, Situational, Intent, System State, and Symbolic reasoning, each question in VER-Bench is accompanied by structured evidence: visual clues and question-related reasoning derived from them. VER-Bench reveals current models' limitations in extracting subtle visual evidence and constructing evidence-based arguments, highlighting the need to enhance models' capabilities in fine-grained visual evidence extraction, integration, and reasoning for genuine visual understanding and human-like analysis. 
Dataset and additional materials are available at https://github.com/verbta/ACMMM-25-Materials.
[49] Dual-Stream Attention with Multi-Modal Queries for Object Detection in Transportation Applications
Noreen Anwar,Guillaume-Alexandre Bilodeau,Wassim Bouachir
Main category: cs.CV
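TL;DR: This paper proposes DAMM, a detection framework that uses dual-stream attention and multi-modal queries to address the occlusion, fine-grained localization, and computational-efficiency problems that fixed queries and dense attention cause in Transformer-based object detection. It achieves state-of-the-art average precision (AP) and recall on four challenging benchmarks.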
Details
Motivation: Transformer-based object detectors struggle with occlusions, fine-grained localization, and computational efficiency, largely because of fixed queries and dense attention. The paper aims to address these problems by introducing query adaptation and structured cross-attention. Method: The paper proposes DAMM (Dual-stream Attention with Multi-Modal queries), a framework that combines three types of queries: appearance-based queries from vision-language models, positional queries using polygonal embeddings, and random learned queries for general scene coverage. In addition, a dual-stream cross-attention module refines semantic and spatial features separately, improving localization precision in cluttered scenes. Result: DAMM achieves state-of-the-art average precision (AP) and recall on four challenging benchmarks, demonstrating the effectiveness of multi-modal query adaptation and dual-stream attention. Conclusion: By introducing multi-modal queries and dual-stream attention, DAMM effectively addresses occlusion, fine-grained localization, and computational inefficiency in Transformer-based object detection. Abstract: Transformer-based object detectors often struggle with occlusions, fine-grained localization, and computational inefficiency caused by fixed queries and dense attention. We propose DAMM, Dual-stream Attention with Multi-Modal queries, a novel framework introducing both query adaptation and structured cross-attention for improved accuracy and efficiency. DAMM capitalizes on three types of queries: appearance-based queries from vision-language models, positional queries using polygonal embeddings, and random learned queries for general scene coverage. Furthermore, a dual-stream cross-attention module separately refines semantic and spatial features, boosting localization precision in cluttered scenes. We evaluated DAMM on four challenging benchmarks, and it achieved state-of-the-art performance in average precision (AP) and recall, demonstrating the effectiveness of multi-modal query adaptation and dual-stream attention. Source code is at: https://github.com/DET-LIP/DAMM.
[50] Revealing Temporal Label Noise in Multimodal Hateful Video Classification
Shuonan Yang,Tailin Chen,Rahul Singh,Jiangbei Yue,Jianbo Jiao,Zeyu Fu
Main category: cs.CV
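TL;DR: This paper studies the label noise that coarse video-level annotations introduce in multimodal hateful video detection. Using timestamp-based trimming and analysis, it finds that annotation noise affects model decisions and argues that temporally aware models are needed for better robustness and interpretability.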
Details
Motivation: The rapid growth of online multimedia content has intensified the spread of hate speech. Existing multimodal hateful video detection methods rely on coarse video-level annotations, which introduces label noise, so a finer-grained approach is needed to study the impact of this label ambiguity. Method: Hateful videos from the HateMM and MultiHateClip datasets are trimmed at a fine granularity using timestamp annotations, and an exploratory analysis of the trimmed segments examines the distribution and characteristics of hateful and non-hateful content. Result: The analysis shows that coarse annotations cause semantic overlap and model confusion. Experiments show that timestamp noise fundamentally alters model decision boundaries and weakens classification confidence, revealing the context dependency and temporal continuity of hate speech expression. Conclusion: Coarse video-level annotations introduce substantial label noise that affects decision boundaries and classification confidence, underscoring the need for temporally aware models and benchmarks. Abstract: The rapid proliferation of online multimedia content has intensified the spread of hate speech, presenting critical societal and regulatory challenges. While recent work has advanced multimodal hateful video detection, most approaches rely on coarse, video-level annotations that overlook the temporal granularity of hateful content. This introduces substantial label noise, as videos annotated as hateful often contain long non-hateful segments. In this paper, we investigate the impact of such label ambiguity through a fine-grained approach. Specifically, we trim hateful videos from the HateMM and MultiHateClip English datasets using annotated timestamps to isolate explicitly hateful segments. We then conduct an exploratory analysis of these trimmed segments to examine the distribution and characteristics of both hateful and non-hateful content. This analysis highlights the degree of semantic overlap and the confusion introduced by coarse, video-level annotations. Finally, controlled experiments demonstrated that time-stamp noise fundamentally alters model decision boundaries and weakens classification confidence, highlighting the inherent context dependency and temporal continuity of hate speech expression. Our findings provide new insights into the temporal dynamics of multimodal hateful videos and highlight the need for temporally aware models and benchmarks for improved robustness and interpretability. 
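To make the notion of temporal label noise concrete, the following sketch computes how much of a video labelled hateful actually lies outside its annotated hateful spans. It is an illustration only: the function names, the `(start, end)` tuple format in seconds, and the example numbers are assumptions, not the datasets' actual annotation schema or the authors' code.

```python
def merge_intervals(segments):
    """Merge overlapping (start, end) intervals given in seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def non_hateful_fraction(duration, hateful_segments):
    """Fraction of a hateful-labelled video that is outside the hateful spans.

    A video-level label treats the whole clip as hateful, so this fraction
    is the share of the clip whose label is effectively noisy.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    # Clip annotations to the video bounds and drop empty spans.
    clipped = [(max(0.0, s), min(duration, e)) for s, e in hateful_segments]
    clipped = [(s, e) for s, e in clipped if e > s]
    hateful = sum(e - s for s, e in merge_intervals(clipped))
    return 1.0 - hateful / duration


if __name__ == "__main__":
    # Hypothetical 100-second video with two overlapping hateful spans.
    print(non_hateful_fraction(100.0, [(10.0, 20.0), (15.0, 30.0)]))
```

In this made-up case the two spans merge into a single 20-second segment, so roughly four fifths of the clip would carry a misleading video-level label.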
Code and data are available at https://github.com/Multimodal-Intelligence-Lab-MIL/HatefulVideoLabelNoise.
[51] Test-Time Adaptation for Video Highlight Detection Using Meta-Auxiliary Learning and Cross-Modality Hallucinations
Zahidul Islam,Sujoy Paul,Mrigank Rochan
Main category: cs.CV
TL;DR: This paper introduces Highlight-TTA, a test-time adaptation framework that improves video highlight detection by adjusting models to each video's specific features.
Details
Motivation: Existing methods struggle to generalize due to fixed models that don't account for individual video variations. Method: Highlight-TTA dynamically adapts the model using a meta-auxiliary training scheme with cross-modality hallucinations as an auxiliary task. Result: Experiments show that Highlight-TTA boosts performance of three state-of-the-art models on three benchmark datasets. Conclusion: Highlight-TTA improves video highlight detection by adapting models during testing to each video's unique characteristics. Abstract: Existing video highlight detection methods, although advanced, struggle to generalize well to all test videos. These methods typically employ a generic highlight detection model for each test video, which is suboptimal as it fails to account for the unique characteristics and variations of individual test videos. Such fixed models do not adapt to the diverse content, styles, or audio and visual qualities present in new, unseen test videos, leading to reduced highlight detection performance. In this paper, we propose Highlight-TTA, a test-time adaptation framework for video highlight detection that addresses this limitation by dynamically adapting the model during testing to better align with the specific characteristics of each test video, thereby improving generalization and highlight detection performance. Highlight-TTA is jointly optimized with an auxiliary task, cross-modality hallucinations, alongside the primary highlight detection task. We utilize a meta-auxiliary training scheme to enable effective adaptation through the auxiliary task while enhancing the primary task. During testing, we adapt the trained model using the auxiliary task on the test video to further enhance its highlight detection performance. 
Extensive experiments with three state-of-the-art highlight detection models and three benchmark datasets show that the introduction of Highlight-TTA to these models improves their performance, yielding superior results.
[52] Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens
Suchisrit Gangopadhyay,Jung-Hee Kim,Xien Chen,Patrick Rim,Hyoungseob Park,Alex Wong
Main category: cs.CV
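TL;DR: This paper proposes a method that introduces calibration tokens to modulate latent embeddings, extending foundational monocular depth estimators to fisheye images.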
Details
Motivation: Foundational monocular depth estimators are susceptible to the covariate shift introduced by changes in camera calibration parameters, which leads to erroneous depth estimates. Method: A set of calibration tokens is introduced as a lightweight adaptation mechanism that modulates the latent embeddings for alignment. Result: The method consistently outperforms prior work on both indoor and outdoor datasets using a single set of tokens. Conclusion: The paper presents a way to extend foundational monocular depth estimators to fisheye images without retraining or fine-tuning. Abstract: We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a light-weight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images, and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs, on both indoors and outdoors, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: https://github.com/JungHeeKim29/calibration-token.
[53] Toward Errorless Training ImageNet-1k
Bo Deng,Levi Heath
Main category: cs.CV
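TL;DR: This paper presents a feedforward artificial neural network trained with a new method that reaches 98.3% accuracy on the ImageNet 2012 contest dataset, with an average of 285.9 perfectly classified labels.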
Details
Motivation: The aim is to reach higher classification accuracy on the ImageNet dataset by training a neural network with a new method. Method: A feedforward artificial neural network is trained with a new method and evaluated on the ImageNet 2012 contest dataset. Result: The model reaches 98.3% accuracy with a 99.69 Top-1 rate, and an average of 285.9 labels are perfectly classified. The best model uses 322,430,160 parameters with 4 decimal places of precision. Conclusion: The authors conjecture that the model falls short of 100% accuracy because of a double-labeling problem, in which duplicate images in the dataset carry different labels. Abstract: In this paper, we describe a feedforward artificial neural network trained on the ImageNet 2012 contest dataset [7] with the new method of [5] to an accuracy rate of 98.3% with a 99.69 Top-1 rate, and an average of 285.9 labels that are perfectly classified over the 10 batch partitions of the dataset. The best performing model uses 322,430,160 parameters, with 4 decimal places precision. We conjecture that the reason our model does not achieve a 100% accuracy rate is due to a double-labeling problem, by which there are duplicate images in the dataset with different labels.
[54] Accelerating Conditional Prompt Learning via Masked Image Modeling for Vision-Language Models
Phuoc-Nguyen Bui,Khanh-Binh Nguyen,Hyunseung Choo
Main category: cs.CV
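TL;DR: ProMIM is a lightweight framework that improves conditional prompt learning by integrating masked image modeling (MIM) into existing vision-language model (VLM) pipelines.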
Details
Motivation: Existing prompt learning techniques such as CoOp and CoCoOp tend to overfit to known classes when adapting to new tasks, which limits generalization to unseen categories. Method: The paper introduces ProMIM, a plug-and-play framework that enhances conditional prompt learning by integrating masked image modeling (MIM) into existing VLM pipelines. Result: Experiments show that ProMIM significantly improves generalization performance on zero-shot and few-shot classification tasks. Conclusion: ProMIM provides a practical, lightweight solution for real-world vision-language applications. Abstract: Vision-language models (VLMs) like CLIP excel in zero-shot learning but often require resource-intensive training to adapt to new tasks. Prompt learning techniques, such as CoOp and CoCoOp, offer efficient adaptation but tend to overfit to known classes, limiting generalization to unseen categories. We introduce ProMIM, a plug-and-play framework that enhances conditional prompt learning by integrating masked image modeling (MIM) into existing VLM pipelines. ProMIM leverages a simple yet effective masking strategy to generate robust, instance-conditioned prompts, seamlessly augmenting methods like CoOp and CoCoOp without altering their core architectures. By masking only visible image patches and using these representations to guide prompt generation, ProMIM improves feature robustness and mitigates overfitting, all while introducing negligible additional computational cost. Extensive experiments across zero-shot and few-shot classification tasks demonstrate that ProMIM consistently boosts generalization performance when plugged into existing approaches, providing a practical, lightweight solution for real-world vision-language applications.
[55] TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring
Zhu Xu,Ting Lei,Zhimin Li,Guan Wang,Qingchao Chen,Yuxin Peng,Yang Liu
Main category: cs.CV
TL;DR: The paper proposes a Temporal-enhanced Relation-aware Knowledge Transferring (TRKT) method for Weakly Supervised Dynamic Scene Graph Generation (WS-DSGG) that leverages knowledge to enhance detection in relation-aware dynamic scenarios. The TRKT method includes two components: relation-aware knowledge mining and a dual-stream fusion module. Extensive experiments demonstrate that TRKT achieves state-of-the-art performance on Action Genome dataset.
Details
Motivation: WS-DSGG methods rely on an external object detector to generate pseudo labels for subsequent DSGG training. Detectors trained on static, object-centric images struggle in dynamic, relation-aware scenarios required for DSGG, leading to inaccurate localization and low-confidence proposals. This paper aims to address the challenges posed by external object detectors in WS-DSGG. Method: The TRKT method includes two components: relation-aware knowledge mining and a dual-stream fusion module. Relation-aware knowledge mining uses object and relation class decoders to generate category-specific attention maps, and an inter-frame attention augmentation strategy is used to enhance the attention maps. The dual-stream fusion module integrates category-specific attention maps into external detections to refine object localization and boost confidence scores for object proposals. Result: Extensive experiments demonstrate that TRKT achieves state-of-the-art performance on Action Genome dataset. Conclusion: TRKT achieves state-of-the-art performance on Action Genome dataset. Abstract: Dynamic Scene Graph Generation (DSGG) aims to create a scene graph for each video frame by detecting objects and predicting their relationships. Weakly Supervised DSGG (WS-DSGG) reduces annotation workload by using an unlocalized scene graph from a single frame per video for training. Existing WS-DSGG methods depend on an off-the-shelf external object detector to generate pseudo labels for subsequent DSGG training. However, detectors trained on static, object-centric images struggle in dynamic, relation-aware scenarios required for DSGG, leading to inaccurate localization and low-confidence proposals. To address the challenges posed by external object detectors in WS-DSGG, we propose a Temporal-enhanced Relation-aware Knowledge Transferring (TRKT) method, which leverages knowledge to enhance detection in relation-aware dynamic scenarios. 
TRKT is built on two key components: (1) Relation-aware knowledge mining: we first employ object and relation class decoders that generate category-specific attention maps to highlight both object regions and interactive areas. Then we propose an Inter-frame Attention Augmentation strategy that exploits optical flow for neighboring frames to enhance the attention maps, making them motion-aware and robust to motion blur. This step yields relation- and motion-aware knowledge mining for WS-DSGG. (2) Dual-stream Fusion Module: we introduce a module that integrates category-specific attention maps into external detections to refine object localization and boost confidence scores for object proposals. Extensive experiments demonstrate that TRKT achieves state-of-the-art performance on the Action Genome dataset. Our code is available at https://github.com/XZPKU/TRKT.git.
[56] AdvDINO: Domain-Adversarial Self-Supervised Representation Learning for Spatial Proteomics
Stella Su,Marc Harary,Scott J. Rodig,William Lotter
Main category: cs.CV
TL;DR: AdvDINO is a domain-adversarial self-supervised learning framework that addresses domain shift and is particularly suited to biomedical imaging.
Details
Motivation: The robustness of standard SSL methods to domain shift remains uncertain, which is especially challenging in biomedical imaging, where batch effects can obscure true biological signals. Method: AdvDINO integrates a gradient reversal layer into the DINOv2 architecture to promote domain-invariant feature learning. Result: AdvDINO mitigates slide-specific biases better than non-adversarial baselines and learns more robust and biologically meaningful representations. Conclusion: AdvDINO is a broadly applicable framework that can be used in other imaging domains with domain shift and limited annotated data, such as radiology, remote sensing, and autonomous driving. Abstract: Self-supervised learning (SSL) has emerged as a powerful approach for learning visual representations without manual annotations. However, the robustness of standard SSL methods to domain shift -- systematic differences across data sources -- remains uncertain, posing an especially critical challenge in biomedical imaging where batch effects can obscure true biological signals. We present AdvDINO, a domain-adversarial self-supervised learning framework that integrates a gradient reversal layer into the DINOv2 architecture to promote domain-invariant feature learning. Applied to a real-world cohort of six-channel multiplex immunofluorescence (mIF) whole slide images from non-small cell lung cancer patients, AdvDINO mitigates slide-specific biases to learn more robust and biologically meaningful representations than non-adversarial baselines. Across >5.46 million mIF image tiles, the model uncovers phenotype clusters with distinct proteomic profiles and prognostic significance, and improves survival prediction in attention-based multiple instance learning. While demonstrated on mIF data, AdvDINO is broadly applicable to other imaging domains -- including radiology, remote sensing, and autonomous driving -- where domain shift and limited annotated data hinder model generalization and interpretability.
[57] Open-world Point Cloud Semantic Segmentation: A Human-in-the-loop Framework
Peng Zhang,Songru Yang,Jinsheng Sun,Weiqing Li,Zhiyong Su
Main category: cs.CV
TL;DR: The paper proposes HOW-Seg, a human-in-the-loop framework for open-world point cloud semantic segmentation, which dynamically improves predictions using sparse human feedback. It outperforms existing methods in performance while reducing reliance on dense annotations.
Details
Motivation: Existing OW-Seg methods rely on resource-intensive offline incremental learning or densely annotated support data, which limits their practicality. Open-world point cloud semantic segmentation aims to predict point labels for both base and novel classes in real-world scenarios, and the paper aims to address these limitations with a more practical framework. Method: The paper proposes HOW-Seg, a human-in-the-loop framework for OW-Seg that constructs class prototypes directly on the query data. Sparse human annotations guide prototype-based segmentation for both base and novel classes. A hierarchical prototype disambiguation mechanism refines ambiguous prototypes, while a dense CRF enhances contextual awareness by optimizing label assignments on refined prototypes. Result: Experiments show that HOW-Seg performs as well as or better than the state-of-the-art GFS-Seg method under the 5-shot setting with sparse annotations (e.g., one-novel-class-one-click). Using advanced backbones and denser annotations (e.g., 10 clicks per sub-scene), HOW-Seg achieves 85.27% mIoU on S3DIS and 66.37% mIoU on ScanNetv2, outperforming alternatives. Conclusion: HOW-Seg dynamically improves its predictions through iterative human feedback, achieving high-quality segmentation for both base and novel classes. With sparse annotations, HOW-Seg matches or surpasses the state-of-the-art GFS-Seg method under the 5-shot setting. Using advanced backbones and denser annotations, HOW-Seg achieves significantly better mIoU scores on S3DIS and ScanNetv2 compared to alternatives. Abstract: Open-world point cloud semantic segmentation (OW-Seg) aims to predict point labels of both base and novel classes in real-world scenarios. However, existing methods rely on resource-intensive offline incremental learning or densely annotated support data, limiting their practicality. To address these limitations, we propose HOW-Seg, the first human-in-the-loop framework for OW-Seg. 
Specifically, we construct class prototypes, the fundamental segmentation units, directly on the query data, avoiding the prototype bias caused by intra-class distribution shifts between the support and query data. By leveraging sparse human annotations as guidance, HOW-Seg enables prototype-based segmentation for both base and novel classes. Considering the lack of granularity of initial prototypes, we introduce a hierarchical prototype disambiguation mechanism to refine ambiguous prototypes, which correspond to annotations of different classes. To further enrich contextual awareness, we employ a dense conditional random field (CRF) upon the refined prototypes to optimize their label assignments. Through iterative human feedback, HOW-Seg dynamically improves its predictions, achieving high-quality segmentation for both base and novel classes. Experiments demonstrate that with sparse annotations (e.g., one-novel-class-one-click), HOW-Seg matches or surpasses the state-of-the-art generalized few-shot segmentation (GFS-Seg) method under the 5-shot setting. When using advanced backbones (e.g., Stratified Transformer) and denser annotations (e.g., 10 clicks per sub-scene), HOW-Seg achieves 85.27% mIoU on S3DIS and 66.37% mIoU on ScanNetv2, significantly outperforming alternatives.
[58] UGOD: Uncertainty-Guided Differentiable Opacity and Soft Dropout for Enhanced Sparse-View 3DGS
Zhihao Guo,Peng Wang,Zidong Chen,Xiangyu Kong,Yan Lyu,Guanyu Gao,Liangxiu Han
Main category: cs.CV
TL;DR: This paper proposes an adaptive Gaussian weighting method using learned uncertainties for 3D Gaussian Splatting, improving rendering quality and outperforming existing methods in sparse-view scenarios.
Details
Motivation: Existing 3D Gaussian Splatting (3DGS) methods treat all Gaussians equally, leading to overfitting, especially in sparse-view scenarios. This work aims to improve rendering quality by introducing adaptive weighting through learned uncertainties. Method: A method was developed that uses learned uncertainties to adaptively weight Gaussians during rendering. This includes guiding the differentiable update of Gaussian opacity and applying soft differentiable dropout regularization to the uncertainty, which enhances the projection and blending process. Result: Experimental results show that the method outperforms existing approaches in sparse-view 3D synthesis, achieving higher-quality reconstructions with fewer Gaussians. For example, it achieved a 3.27% improvement in PSNR on the MipNeRF 360 dataset compared to DropGaussian. Conclusion: The proposed method of adaptive weighting of Gaussians with learned uncertainties improves rendering quality in sparse-view 3D synthesis, achieving better performance than existing approaches like DropGaussian. Abstract: 3D Gaussian Splatting (3DGS) has become a competitive approach for novel view synthesis (NVS) due to its advanced rendering efficiency through 3D Gaussian projection and blending. However, Gaussians are treated equally weighted for rendering in most 3DGS methods, making them prone to overfitting, which is particularly the case in sparse-view scenarios. To address this, we investigate how adaptive weighting of Gaussians affects rendering quality, which is characterised by learned uncertainties proposed. This learned uncertainty serves two key purposes: first, it guides the differentiable update of Gaussian opacity while preserving the 3DGS pipeline integrity; second, the uncertainty undergoes soft differentiable dropout regularisation, which strategically transforms the original uncertainty into continuous drop probabilities that govern the final Gaussian projection and blending process for rendering. 
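One way to picture the step that turns an uncertainty into a continuous drop probability is sketched below. The abstract does not give the actual transformation, so everything here is an assumption made for illustration: the sigmoid mapping, the `threshold` and `temperature` parameters, and the function names are hypothetical and should not be read as the authors' formulation.

```python
import math


def drop_probability(uncertainty, threshold=0.5, temperature=0.1):
    """Map an uncertainty score to a continuous drop probability in (0, 1).

    Assumed form: a sigmoid centred at `threshold`; a smaller `temperature`
    makes the transition sharper. Not taken from the paper.
    """
    return 1.0 / (1.0 + math.exp(-(uncertainty - threshold) / temperature))


def soft_dropout_opacity(opacity, uncertainty, threshold=0.5, temperature=0.1):
    """Attenuate a Gaussian's opacity by its keep probability.

    Unlike hard dropout, nothing is removed outright, so the operation
    stays differentiable with respect to the uncertainty.
    """
    keep = 1.0 - drop_probability(uncertainty, threshold, temperature)
    return opacity * keep


if __name__ == "__main__":
    for u in (0.1, 0.5, 0.9):
        print(u, soft_dropout_opacity(0.8, u))
```

Under these assumptions, a Gaussian with high uncertainty contributes less to blending, while a confident one keeps nearly its full opacity.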
Extensive experimental results over widely adopted datasets demonstrate that our method outperforms rivals in sparse-view 3D synthesis, achieving higher quality reconstruction with fewer Gaussians in most datasets compared to existing sparse-view approaches. For example, compared to DropGaussian, our method achieves a 3.27% PSNR improvement on the MipNeRF 360 dataset.
[59] CSRAP: Enhanced Canvas Attention Scheduling for Real-Time Mission Critical Perception
Md Iftekharul Islam Sakib,Yigong Hu,Tarek Abdelzaher
Main category: cs.CV
TL;DR: This paper proposes an improved canvas attention scheduling mechanism that raises real-time perception performance on edge platforms.
Details
Motivation: A core challenge when running high-resolution object detection on edge platforms is improving the efficiency of the perception subsystem under limited computing resources and strict latency constraints. Method: Variable-size canvas frames and selectable canvas frame rates are used, and the approach is evaluated experimentally with YOLOv11 on an NVIDIA Jetson Orin Nano. Result: Experiments show that the method outperforms the state of the art in quality/cost trade-offs, achieving higher mean average precision (mAP) and recall. Conclusion: Through a canvas attention scheduling mechanism with variable-size canvas frames and selectable frame rates, the paper achieves real-time perception with higher precision and recall on resource-constrained edge platforms. Abstract: Real-time perception on edge platforms faces a core challenge: executing high-resolution object detection under stringent latency constraints on limited computing resources. Canvas-based attention scheduling was proposed in earlier work as a mechanism to reduce the resource demands of perception subsystems. It consolidates areas of interest in an input data frame onto a smaller area, called a canvas frame, that can be processed at the requisite frame rate. This paper extends prior canvas-based attention scheduling literature by (i) allowing for variable-size canvas frames and (ii) employing selectable canvas frame rates that may depart from the original data frame rate. We evaluate our solution by running YOLOv11, as the perception module, on an NVIDIA Jetson Orin Nano to inspect video frames from the Waymo Open Dataset. Our results show that the additional degrees of freedom improve the attainable quality/cost trade-offs, thereby allowing for a consistently higher mean average precision (mAP) and recall with respect to the state of the art.
[60] Steering One-Step Diffusion Model with Fidelity-Rich Decoder for Fast Image Compression
Zheng Chen,Mingde Zhou,Jinpei Guo,Jiale Yuan,Yifei Ji,Yulun Zhang
Main category: cs.CV
TL;DR: SODEC is a novel single-step diffusion image compression model that addresses the issues of excessive decoding latency and poor fidelity in diffusion-based image compression.
Details
Motivation: Diffusion-based image compression suffers from excessive decoding latency and poor fidelity. This paper proposes SODEC to address these issues. Method: SODEC uses a pre-trained VAE-based model to produce informative latents and replaces the iterative denoising process with single-step decoding. It also introduces a fidelity guidance module and uses a rate annealing training strategy. Result: SODEC significantly outperforms existing methods in rate-distortion-perception performance and improves decoding speed by more than 20x. Conclusion: SODEC is a novel single-step diffusion image compression model that significantly outperforms existing methods in rate-distortion-perception performance and decoding speed. Abstract: Diffusion-based image compression has demonstrated impressive perceptual performance. However, it suffers from two critical drawbacks: (1) excessive decoding latency due to multi-step sampling, and (2) poor fidelity resulting from over-reliance on generative priors. To address these issues, we propose SODEC, a novel single-step diffusion image compression model. We argue that in image compression, a sufficiently informative latent renders multi-step refinement unnecessary. Based on this insight, we leverage a pre-trained VAE-based model to produce latents with rich information, and replace the iterative denoising process with a single-step decoding. Meanwhile, to improve fidelity, we introduce the fidelity guidance module, encouraging output that is faithful to the original image. Furthermore, we design the rate annealing training strategy to enable effective training under extremely low bitrates. Extensive experiments show that SODEC significantly outperforms existing methods, achieving superior rate-distortion-perception performance. Moreover, compared to previous diffusion-based compression models, SODEC improves decoding speed by more than 20×.
Code is released at: https://github.com/zhengchen1999/SODEC.
[61] Propagating Sparse Depth via Depth Foundation Model for Out-of-Distribution Depth Completion
Shenglun Chen,Xinzhu Ma,Hong Zhang,Haojie Li,Zhihui Wang
Main category: cs.CV
TL;DR: This paper proposes a novel depth completion framework leveraging depth foundation models to achieve robust performance in out-of-distribution scenarios without large-scale training.
Details
Motivation: Existing depth completion models suffer performance degradation in OOD scenarios due to reliance on limited training data. This work aims to enhance robustness by leveraging depth foundation models without requiring large-scale retraining. Method: The framework uses a depth foundation model to extract structural and semantic cues from RGB images, guiding sparse depth propagation in both 3D and 2D spaces. A dual-space propagation approach and a learnable correction module are employed to maintain geometric structure and refine depth predictions. Result: The framework was trained on NYUv2 and KITTI datasets and evaluated on 16 other datasets, showing superior performance in OOD scenarios compared to existing methods. Conclusion: The proposed depth completion framework demonstrates exceptional robustness in out-of-distribution (OOD) scenarios, outperforming existing state-of-the-art methods while avoiding the need for large-scale training. Abstract: Depth completion is a pivotal challenge in computer vision, aiming at reconstructing the dense depth map from a sparse one, typically with a paired RGB image. Existing learning based models rely on carefully prepared but limited data, leading to significant performance degradation in out-of-distribution (OOD) scenarios. Recent foundation models have demonstrated exceptional robustness in monocular depth estimation through large-scale training, and using such models to enhance the robustness of depth completion models is a promising solution. In this work, we propose a novel depth completion framework that leverages depth foundation models to attain remarkable robustness without large-scale training. Specifically, we leverage a depth foundation model to extract environmental cues, including structural and semantic context, from RGB images to guide the propagation of sparse depth information into missing regions. 
We further design a dual-space propagation approach, without any learnable parameters, to effectively propagate sparse depth in both 3D and 2D spaces to maintain geometric structure and local consistency. To refine the intricate structure, we introduce a learnable correction module to progressively adjust the depth prediction towards the real depth. We train our model on the NYUv2 and KITTI datasets as in-distribution datasets and extensively evaluate the framework on 16 other datasets. Our framework performs remarkably well in the OOD scenarios and outperforms existing state-of-the-art depth completion methods. Our models are released at https://github.com/shenglunch/PSD.
[62] Unified modality separation: A vision-language framework for unsupervised domain adaptation
Xinyao Li,Jingjing Li,Zhekai Du,Lei Zhu,Heng Tao Shen
Main category: cs.CV
TL;DR: This paper proposes a modality separation framework to address the modality gap in unsupervised domain adaptation with vision-language models, achieving better performance and efficiency by handling modality-specific and modality-invariant components separately.
Details
Motivation: The motivation stems from the inherent modality gap between different data types in vision-language models, which limits the performance of unsupervised domain adaptation. The paper aims to improve adaptation by addressing both modality-invariant and modality-specific characteristics. Method: The method involves disentangling modality components from vision-language model (VLM) features during training and handling them separately. At test time, modality-adaptive ensemble weights are determined automatically. A modality discrepancy metric is also designed for categorizing samples into modality-invariant, modality-specific, and uncertain types. Result: The proposed method achieves up to a 9% performance gain and 9 times greater computational efficiency. Extensive experiments demonstrate its efficacy across various backbones, datasets, and adaptation settings. Conclusion: The paper concludes that the proposed unified modality separation framework effectively addresses the modality gap in unsupervised domain adaptation by separately handling modality-specific and modality-invariant components, achieving improved performance and computational efficiency. Abstract: Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains. Recently, pre-trained vision-language models (VLMs) have demonstrated promising zero-shot performance by leveraging semantic information to facilitate target tasks. By aligning vision and text embeddings, VLMs have shown notable success in bridging domain gaps. However, inherent differences naturally exist between modalities, which is known as the modality gap. Our findings reveal that direct UDA in the presence of the modality gap only transfers modality-invariant knowledge, leading to suboptimal target performance. To address this limitation, we propose a unified modality separation framework that accommodates both modality-specific and modality-invariant components.
During training, different modality components are disentangled from VLM features and then handled separately in a unified manner. At test time, modality-adaptive ensemble weights are automatically determined to maximize the synergy of different components. To evaluate instance-level modality characteristics, we design a modality discrepancy metric to categorize samples into modality-invariant, modality-specific, and uncertain ones. The modality-invariant samples are exploited to facilitate cross-modal alignment, while uncertain ones are annotated to enhance model capabilities. Building upon prompt tuning techniques, our methods achieve up to a 9% performance gain with 9 times the computational efficiency. Extensive experiments and analysis across various backbones, baselines, datasets and adaptation settings demonstrate the efficacy of our design.
[63] Modeling Rapid Contextual Learning in the Visual Cortex with Fast-Weight Deep Autoencoder Networks
Yue Li,Weifan Wang,Tai Sing Lee
Main category: cs.CV
TL;DR: This study investigates how familiarity training can induce sensitivity to global context in early layers of a deep neural network using a Vision Transformer-based autoencoder and Low-Rank Adaptation to implement fast weights.
Details
Motivation: Recent neurophysiological studies have revealed that the early visual cortex can rapidly learn global image context, and this phenomenon has been attributed primarily to local recurrent interactions. The study aims to explore this rapid learning via fast weights, which encode transient or short-term memory traces. Method: A Vision Transformer (ViT)-based autoencoder was employed to investigate how familiarity training can induce sensitivity to global context in the early layers of a deep neural network. Low-Rank Adaptation (LoRA) was used to implement fast weights within each Transformer layer. Result: The results show that (1) The proposed ViT-based autoencoder's self-attention circuit performs a manifold transform similar to a neural circuit model of the familiarity effect. (2) Familiarity training aligns latent representations in early layers with those in the top layer that contains global context information. (3) Familiarity training broadens the self-attention scope within the remembered image context. (4) These effects are significantly amplified by LoRA-based fast weights. Conclusion: Familiarity training introduces global sensitivity to earlier layers in a hierarchical network, and a hybrid fast-and-slow weight architecture may provide a viable computational model for studying rapid global context learning in the brain. Abstract: Recent neurophysiological studies have revealed that the early visual cortex can rapidly learn global image context, as evidenced by a sparsification of population responses and a reduction in mean activity when exposed to familiar versus novel image contexts. This phenomenon has been attributed primarily to local recurrent interactions, rather than changes in feedforward or feedback pathways, supported by both empirical findings and circuit-level modeling. 
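Illustrative sketch (not from the paper): the fast-and-slow weight idea can be pictured with the generic LoRA formulation, where a frozen weight matrix W plays the role of slow weights and a low-rank product B·A is the rapidly adapted part. The function names, the `scale` parameter, and the toy matrices are hypothetical; the paper's actual layer placement and training procedure are not reproduced here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute y = W x + scale * B (A x).

    W: d_out x d_in frozen "slow" weights.
    A: r x d_in and B: d_out x r form the low-rank "fast" update (r is small).
    """
    slow = matvec(W, x)
    fast = matvec(B, matvec(A, x))
    return [s + scale * f for s, f in zip(slow, fast)]


# Toy usage with rank r = 1: only A and B would be trained during familiarisation.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[1.0], [0.0]]
y = lora_forward(W, A, B, [1.0, 2.0])
```

With B set to all zeros (the usual LoRA initialisation), the layer reproduces the frozen network exactly, so any change in behaviour is attributable to the low-rank fast weights.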
Recurrent neural circuits capable of simulating these effects have been shown to reshape the geometry of neural manifolds, enhancing robustness and invariance to irrelevant variations. In this study, we employ a Vision Transformer (ViT)-based autoencoder to investigate, from a functional perspective, how familiarity training can induce sensitivity to global context in the early layers of a deep neural network. We hypothesize that rapid learning operates via fast weights, which encode transient or short-term memory traces, and we explore the use of Low-Rank Adaptation (LoRA) to implement such fast weights within each Transformer layer. Our results show that (1) The proposed ViT-based autoencoder's self-attention circuit performs a manifold transform similar to a neural circuit model of the familiarity effect. (2) Familiarity training aligns latent representations in early layers with those in the top layer that contains global context information. (3) Familiarity training broadens the self-attention scope within the remembered image context. (4) These effects are significantly amplified by LoRA-based fast weights. Together, these findings suggest that familiarity training introduces global sensitivity to earlier layers in a hierarchical network, and that a hybrid fast-and-slow weight architecture may provide a viable computational model for studying rapid global context learning in the brain.
[64] Attribute Guidance With Inherent Pseudo-label For Occluded Person Re-identification
Rui Zhi,Zhen Yang,Haiyang Zhang
Main category: cs.CV
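TL;DR: This paper proposes AG-ReID, a new framework that combines holistic and fine-grained attribute information through a two-stage process to improve person re-identification under occlusion.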
Details
Motivation: Pre-trained vision-language models are limited in occluded scenarios because they focus on holistic image semantics and neglect fine-grained attribute information. Method: A two-stage process: attribute pseudo-labels are generated first, and then a dual-guidance mechanism combining holistic and fine-grained attribute information is introduced. Result: AG-ReID achieves state-of-the-art results on multiple widely used Re-ID datasets, showing significant improvements in handling occlusions and subtle attribute differences. Conclusion: AG-ReID achieves state-of-the-art results while showing significant improvements in handling occlusions and subtle attribute differences. Abstract: Person re-identification (Re-ID) aims to match person images across different camera views, with occluded Re-ID addressing scenarios where pedestrians are partially visible. While pre-trained vision-language models have shown effectiveness in Re-ID tasks, they face significant challenges in occluded scenarios by focusing on holistic image semantics while neglecting fine-grained attribute information. This limitation becomes particularly evident when dealing with partially occluded pedestrians or when distinguishing between individuals with subtle appearance differences. To address this limitation, we propose Attribute-Guide ReID (AG-ReID), a novel framework that leverages pre-trained models' inherent capabilities to extract fine-grained semantic attributes without additional data or annotations. Our framework operates through a two-stage process: first generating attribute pseudo-labels that capture subtle visual characteristics, then introducing a dual-guidance mechanism that combines holistic and fine-grained attribute information to enhance image feature extraction. Extensive experiments demonstrate that AG-ReID achieves state-of-the-art results on multiple widely-used Re-ID datasets, showing significant improvements in handling occlusions and subtle attribute differences while maintaining competitive performance on standard Re-ID scenarios.
[65] CRAM: Large-scale Video Continual Learning with Bootstrapped Compression
Shivani Mall,Joao F. Henriques
Main category: cs.CV
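TL;DR: This paper proposes CRAM, a method that trains a video compressor online and refreshes stored video codes, effectively addressing the high memory requirements of video continual learning and performing well on large-scale benchmarks.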
Details
Motivation: Video continual learning is challenged by high memory requirements, especially with long videos and continual streams, which conflict with common rehearsal-buffer size constraints. Method: The paper proposes Continually Refreshed Amodal Memory (CRAM), which trains a video compressor online and refreshes video codes to cope with catastrophic forgetting. Result: CRAM performs well on large-scale video continual learning benchmarks such as EpicKitchens-100 and Kinetics-700, with a significantly reduced memory footprint. Conclusion: CRAM effectively addresses the high memory requirements of video continual learning, enabling efficient video continual learning with a small memory footprint. Abstract: Continual learning (CL) promises to allow neural networks to learn from continuous streams of inputs, instead of IID (independent and identically distributed) sampling, which requires random access to a full dataset. This would allow for much smaller storage requirements and self-sufficiency of deployed systems that cope with natural distribution shifts, similarly to biological learning. We focus on video CL employing a rehearsal-based approach, which reinforces past samples from a memory buffer. We posit that part of the reason why practical video CL is challenging is the high memory requirements of video, further exacerbated by long videos and continual streams, which are at odds with the common rehearsal-buffer size constraints. To address this, we propose to use compressed vision, i.e. store video codes (embeddings) instead of raw inputs, and train a video classifier by IID sampling from this rolling buffer. Training a video compressor online (so not depending on any pre-trained networks) means that it is also subject to catastrophic forgetting. We propose a scheme to deal with this forgetting by refreshing video codes, which requires careful decompression with a previous version of the network and recompression with a new one. We name our method Continually Refreshed Amodal Memory (CRAM). We expand current video CL benchmarks to large-scale settings, namely EpicKitchens-100 and Kinetics-700, storing thousands of relatively long videos in under 2 GB, and demonstrate empirically that our video CL method outperforms prior art with a significantly reduced memory footprint.
[66] Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation
Xusheng Liang,Lihua Zhou,Nianxin Li,Miao Xu,Ziyang Song,Dong Yi,Jinlin Wu,Hongbin Liu,Jiebo Luo,Zhen Lei
Main category: cs.CV
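TL;DR: The MCDRL framework combines causal inference with vision-language models (VLMs), improving the accuracy and generalization of medical image segmentation by identifying and eliminating the influence of domain-specific variations.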
Details
Motivation: Medical images exhibit significant domain shifts, making it difficult for existing vision-language models (VLMs) to generalize to unseen domains. Method: MCDRL works in two steps: (1) it uses CLIP to identify candidate lesion regions and builds a confounder dictionary through text prompts; (2) it trains a causal intervention network that uses this dictionary to eliminate the influence of domain-specific variations while preserving key anatomical structural information. Result: MCDRL outperforms existing methods across multiple experiments, showing higher segmentation accuracy and stronger generalization. Conclusion: MCDRL effectively addresses domain generalization in medical image segmentation and shows good application prospects. Abstract: Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot capabilities in various computer vision tasks. However, their application to medical imaging remains challenging due to the high variability and complexity of medical data. Specifically, medical images often exhibit significant domain shifts caused by various confounders, including equipment differences, procedure artifacts, and imaging modes, which can lead to poor generalization when models are applied to unseen domains. To address this limitation, we propose Multimodal Causal-Driven Representation Learning (MCDRL), a novel framework that integrates causal inference with the VLM to tackle domain generalization in medical image segmentation. MCDRL is implemented in two steps: first, it leverages CLIP's cross-modal capabilities to identify candidate lesion regions and construct a confounder dictionary through text prompts, specifically designed to represent domain-specific variations; second, it trains a causal intervention network that utilizes this dictionary to identify and eliminate the influence of these domain-specific variations while preserving the anatomical structural information critical for segmentation tasks. Extensive experiments demonstrate that MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
[67] AU-IQA: A Benchmark Dataset for Perceptual Quality Assessment of AI-Enhanced User-Generated Content
Shushi Wang,Chunyi Li,Zicheng Zhang,Han Zhou,Wei Dong,Jun Chen,Guangtao Zhai,Xiaohong Liu
Main category: cs.CV
TL;DR: This paper constructs the AU-IQA dataset for assessing the quality of AI-enhanced UGC images, advancing research in this area.
Details
Motivation: The effectiveness of existing image quality assessment models on AI-enhanced user-generated content (AI-UGC) has not been sufficiently studied, calling for dedicated datasets and evaluation methods. Method: A benchmark dataset, AU-IQA, comprising 4,800 AI-enhanced UGC images was constructed, and the performance of a range of existing quality assessment models was evaluated on it. Result: The study provides a comprehensive analysis of current quality assessment methods for AI-UGC images and releases the AU-IQA dataset to facilitate related research. Conclusion: The AU-IQA dataset fills a gap in quality assessment of AI-enhanced UGC images and provides a foundation for future research. Abstract: AI-based image enhancement techniques have been widely adopted in various visual applications, significantly improving the perceptual quality of user-generated content (UGC). However, the lack of specialized quality assessment models has become a significant limiting factor in this field, limiting user experience and hindering the advancement of enhancement methods. While perceptual quality assessment methods have shown strong performance on UGC and AIGC individually, their effectiveness on AI-enhanced UGC (AI-UGC), which blends features from both, remains largely unexplored. To address this gap, we construct AU-IQA, a benchmark dataset comprising 4,800 AI-UGC images produced by three representative enhancement types which include super-resolution, low-light enhancement, and denoising. On this dataset, we further evaluate a range of existing quality assessment models, including traditional IQA methods and large multimodal models. Finally, we provide a comprehensive analysis of how well current approaches perform in assessing the perceptual quality of AI-UGC. The access link to the AU-IQA is https://github.com/WNNGGU/AU-IQA-Dataset.
[68] Skin-SOAP: A Weakly Supervised Framework for Generating Structured SOAP Notes
Sadia Kamal,Tim Oates,Joy Wan
Main category: cs.CV
TL;DR: Skin-SOAP is a weakly supervised multimodal framework designed to automate the generation of structured SOAP notes from limited inputs, reducing manual effort and achieving performance comparable to leading models while introducing new metrics for clinical relevance evaluation.
Details
Motivation: Manually generating detailed SOAP notes is labor-intensive and contributes to clinician burnout, necessitating an automated solution for scalable and clinically grounded documentation. Method: A weakly supervised multimodal framework named skin-SOAP was developed to generate structured SOAP notes using limited inputs such as lesion images and sparse clinical text. Result: Skin-SOAP achieves performance comparable to advanced models like GPT-4o, Claude, and DeepSeek Janus Pro across key clinical relevance metrics, including newly introduced MedConceptEval and Clinical Coherence Score (CCS). Conclusion: The proposed skin-SOAP framework effectively generates structured SOAP notes, offering a scalable solution that reduces clinician burden and the need for large annotated datasets. Abstract: Skin carcinoma is the most prevalent form of cancer globally, accounting for over $8 billion in annual healthcare expenditures. Early diagnosis and accurate, timely treatment are critical to improving patient survival rates. In clinical settings, physicians document patient visits using detailed SOAP (Subjective, Objective, Assessment, and Plan) notes. However, manually generating these notes is labor-intensive and contributes to clinician burnout. In this work, we propose skin-SOAP, a weakly supervised multimodal framework to generate clinically structured SOAP notes from limited inputs, including lesion images and sparse clinical text. Our approach reduces reliance on manual annotations, enabling scalable, clinically grounded documentation while alleviating clinician burden and reducing the need for large annotated datasets. Our method achieves performance comparable to GPT-4o, Claude, and DeepSeek Janus Pro across key clinical relevance metrics.
To evaluate this clinical relevance, we introduce two novel metrics, MedConceptEval and Clinical Coherence Score (CCS), which assess semantic alignment with expert medical concepts and input features, respectively.
[69] A Novel Image Similarity Metric for Scene Composition Structure
Md Redwanul Haque,Manzur Murshed,Manoranjan Paul,Tsz-Kwan Lee
Main category: cs.CV
TL;DR: The paper proposes SCSSIM, a new training-free metric for evaluating the structural integrity of images generated by AI models, which outperforms existing metrics in maintaining Scene Composition Structure (SCS).
Details
Motivation: Traditional image similarity metrics often fall short in assessing Scene Composition Structure (SCS) integrity, which is crucial for ensuring faithful and structurally accurate GenAI outputs. Method: The SCS Similarity Index Measure (SCSSIM) is introduced as a novel, analytical, and training-free metric that quantifies SCS preservation by exploiting statistical measures derived from the Cuboidal hierarchical partitioning of images. Result: SCSSIM demonstrates high invariance to non-compositional distortions and shows a strong monotonic decrease for compositional distortions, accurately reflecting unchanged SCS and precisely indicating when SCS has been altered. Conclusion: SCSSIM exhibits superior properties for structural evaluation, making it an invaluable tool for developing and evaluating generative models, ensuring the integrity of scene composition. Abstract: The rapid advancement of generative AI models necessitates novel methods for evaluating image quality that extend beyond human perception. A critical concern for these models is the preservation of an image's underlying Scene Composition Structure (SCS), which defines the geometric relationships among objects and the background, their relative positions, sizes, orientations, etc. Maintaining SCS integrity is paramount for ensuring faithful and structurally accurate GenAI outputs. Traditional image similarity metrics often fall short in assessing SCS. Pixel-level approaches are overly sensitive to minor visual noise, while perception-based metrics prioritize human aesthetic appeal, neither adequately capturing structural fidelity. Furthermore, recent neural-network-based metrics introduce training overheads and potential generalization issues. 
We introduce the SCS Similarity Index Measure (SCSSIM), a novel, analytical, and training-free metric that quantifies SCS preservation by exploiting statistical measures derived from the Cuboidal hierarchical partitioning of images, robustly capturing non-object-based structural relationships. Our experiments demonstrate SCSSIM's high invariance to non-compositional distortions, accurately reflecting unchanged SCS. Conversely, it shows a strong monotonic decrease for compositional distortions, precisely indicating when SCS has been altered. Compared to existing metrics, SCSSIM exhibits superior properties for structural evaluation, making it an invaluable tool for developing and evaluating generative models, ensuring the integrity of scene composition.
[70] HAMoBE: Hierarchical and Adaptive Mixture of Biometric Experts for Video-based Person ReID
Yiyang Su,Yunping Shi,Feng Liu,Xiaoming Liu
Main category: cs.CV
TL;DR: This paper proposes HAMoBE, a novel framework for video-based person re-identification that adaptively integrates biometric features and dynamically adjusts expert contributions, leading to significant performance improvements.
Details
Motivation: Existing video-based person re-identification (ReID) methods often neglect the importance of selecting the most discriminative features from query-gallery video pairs, which is crucial for effective matching. Method: The authors propose a two-level framework called HAMoBE that uses multi-layer features from a pre-trained large model (e.g., CLIP) and includes specialized experts for long-term, short-term, and temporal features. A dual-input decision gating network dynamically adjusts the contributions of each expert. Result: Extensive evaluations on benchmarks like MEVID show that the proposed approach achieves significant performance improvements, such as a +13.0% increase in Rank-1 accuracy. Conclusion: The proposed HAMoBE framework significantly improves the performance of video-based person re-identification by adaptively integrating key biometric features and dynamically adjusting expert contributions. Abstract: Recently, research interest in person re-identification (ReID) has increasingly focused on video-based scenarios, which are essential for robust surveillance and security in varied and dynamic environments. However, existing video-based ReID methods often overlook the necessity of identifying and selecting the most discriminative features from both videos in a query-gallery pair for effective matching. To address this issue, we propose a novel Hierarchical and Adaptive Mixture of Biometric Experts (HAMoBE) framework, which leverages multi-layer features from a pre-trained large model (e.g., CLIP) and is designed to mimic human perceptual mechanisms by independently modeling key biometric features--appearance, static body shape, and dynamic gait--and adaptively integrating them. Specifically, HAMoBE includes two levels: the first level extracts low-level features from multi-layer representations provided by the frozen large model, while the second level consists of specialized experts focusing on long-term, short-term, and temporal features. 
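Illustrative sketch (not from the paper): a gating network that "dynamically adjusts the contributions of each expert" is commonly realised as a softmax over per-expert relevance logits, followed by a weighted sum of the experts' outputs. The function names and the use of scalar match scores below are assumptions; how HAMoBE derives its gate logits from the query-gallery pair is not described in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_expert_scores(expert_scores, gate_logits):
    """Weight each expert's match score by its gate probability and sum.

    expert_scores: one similarity score per expert (e.g. appearance, shape, gait).
    gate_logits: hypothetical relevance logits produced by a gating network.
    """
    weights = softmax(gate_logits)
    return sum(w * s for w, s in zip(weights, expert_scores))


# Equal logits give a plain average; a dominant logit lets one expert decide.
balanced = fuse_expert_scores([0.9, 0.3, 0.6], [0.0, 0.0, 0.0])
appearance_led = fuse_expert_scores([0.9, 0.3, 0.6], [10.0, 0.0, 0.0])
```

The softmax keeps the weights positive and summing to one, so the fused score stays within the range spanned by the individual expert scores.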
To ensure robust matching, we introduce a new dual-input decision gating network that dynamically adjusts the contributions of each expert based on their relevance to the input scenarios. Extensive evaluations on benchmarks like MEVID demonstrate that our approach yields significant performance improvements (e.g., +13.0% Rank-1 accuracy).
[71] Finding Needles in Images: Can Multimodal LLMs Locate Fine Details?
Parth Thakkar,Ankush Agarwal,Prasad Kasu,Pulkit Bansal,Chaitanya Devaguptapu
Main category: cs.CV
TL;DR: The paper introduces NiM, a benchmark for evaluating MLLMs in locating fine-grained details in complex documents, and proposes Spot-IT, a method to enhance MLLMs' performance in such tasks.
Details
Motivation: MLLMs' ability to locate and reason about fine-grained details in complex documents remains understudied, prompting the need for a benchmark like NiM and a solution like Spot-IT. Method: Spot-IT uses intelligent patch selection and Gaussian attention to improve MLLMs' focus on fine-grained details. Result: Spot-IT achieves significant improvements over baseline methods in handling intricate document understanding tasks. Conclusion: Spot-IT enhances MLLMs' capability in fine-grained document understanding tasks, especially in extracting precise details from complex layouts. Abstract: While Multi-modal Large Language Models (MLLMs) have shown impressive capabilities in document understanding tasks, their ability to locate and reason about fine-grained details within complex documents remains understudied. Consider searching a restaurant menu for a specific nutritional detail or identifying a disclaimer in a lengthy newspaper article: tasks that demand careful attention to small but significant details within a broader narrative, akin to Finding Needles in Images (NiM). To address this gap, we introduce NiM, a carefully curated benchmark spanning diverse real-world documents including newspapers, menus, and lecture images, specifically designed to evaluate MLLMs' capability in these intricate tasks. Building on this, we further propose Spot-IT, a simple yet effective approach that enhances MLLMs' capability through intelligent patch selection and Gaussian attention, motivated by how humans zoom and focus when searching documents. Our extensive experiments reveal both the capabilities and limitations of current MLLMs in handling fine-grained document understanding tasks, while demonstrating the effectiveness of our approach. Spot-IT achieves significant improvements over baseline methods, particularly in scenarios requiring precise detail extraction from complex layouts.
[72] DualMat: PBR Material Estimation via Coherent Dual-Path Diffusion
Yifeng Huang,Zhang Chen,Yi Xu,Minh Hoai,Zhong Li
Main category: cs.CV
TL;DR: DualMat is a novel approach for estimating PBR materials from single images under complex lighting conditions, achieving state-of-the-art performance by operating in two distinct latent spaces and introducing feature distillation during training.
Details
Motivation: The motivation is to accurately estimate Physically Based Rendering (PBR) materials from single images under complex lighting conditions, which is challenging for existing methods. Method: DualMat operates in two distinct latent spaces: an albedo-optimized path leveraging pretrained visual knowledge through RGB latent space, and a material-specialized path operating in a compact latent space for precise metallic and roughness estimation. Feature distillation is introduced during training to ensure coherent predictions between the two paths. Rectified flow is employed to enhance efficiency, and the framework extends to high-resolution and multi-view inputs through patch-based estimation and cross-view attention. Result: DualMat significantly outperforms existing methods with up to 28% improvement in albedo estimation and 39% reduction in metallic-roughness prediction errors on both Objaverse and real-world data. Conclusion: DualMat is a novel dual-path diffusion framework that achieves state-of-the-art performance in estimating PBR materials from single images under complex lighting conditions. Abstract: We present DualMat, a novel dual-path diffusion framework for estimating Physically Based Rendering (PBR) materials from single images under complex lighting conditions. Our approach operates in two distinct latent spaces: an albedo-optimized path leveraging pretrained visual knowledge through RGB latent space, and a material-specialized path operating in a compact latent space designed for precise metallic and roughness estimation. To ensure coherent predictions between the albedo-optimized and material-specialized paths, we introduce feature distillation during training. We employ rectified flow to enhance efficiency by reducing inference steps while maintaining quality. Our framework extends to high-resolution and multi-view inputs through patch-based estimation and cross-view attention, enabling seamless integration into image-to-3D pipelines. 
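Illustrative sketch (not from the paper): rectified flow samples by integrating an ODE dx/dt = v(x, t) from t = 0 to t = 1, and its training objective encourages nearly straight trajectories, which is why few Euler steps can suffice. The sampler below uses a toy constant velocity field in place of DualMat's learned network; `euler_sample` and `toy_velocity` are hypothetical names.

```python
def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t):
    """Stand-in for a learned velocity network: a perfectly straight flow."""
    return [2.0, -1.0]


# For a straight (constant-velocity) path, one step matches many steps.
one_step = euler_sample(toy_velocity, [0.0, 0.0], num_steps=1)
many_steps = euler_sample(toy_velocity, [0.0, 0.0], num_steps=4)
```

A real learned field is only approximately straight, so the number of steps trades speed against integration error.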
DualMat achieves state-of-the-art performance on both Objaverse and real-world data, significantly outperforming existing methods with up to 28% improvement in albedo estimation and 39% reduction in metallic-roughness prediction errors.
[73] Decoupling Continual Semantic Segmentation
Yifu Guo,Yuquan Lu,Wentao Zhang,Zishan Xu,Dexia Chen,Siyu Zhang,Yizhe Zhang,Ruixuan Wang
Main category: cs.CV
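TL;DR: This paper proposes DecoupleCSS, a two-stage framework that decouples class-aware detection from class-agnostic segmentation, addressing catastrophic forgetting in continual semantic segmentation and improving both retention and adaptability.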
Details
Motivation: Existing continual semantic segmentation methods typically use single-stage encoder-decoder architectures, which cause interference between old and new class learning and an imbalance between retention and plasticity, so a new approach is needed. Method: Proposes DecoupleCSS, a two-stage framework: the first stage uses LoRA-adapted pre-trained text and image encoders to generate location-aware prompts, and the second stage uses the Segment Anything Model (SAM) to produce precise segmentation masks, so that segmentation knowledge is shared across old and new classes. Result: DecoupleCSS achieves state-of-the-art performance on multiple continual semantic segmentation tasks, mitigating catastrophic forgetting while preserving adaptability to new classes. Conclusion: By decoupling class-aware detection from class-agnostic segmentation, DecoupleCSS addresses catastrophic forgetting in continual semantic segmentation, balances new-class learning with retention of old knowledge, and achieves state-of-the-art performance on multiple tasks. Abstract: Continual Semantic Segmentation (CSS) requires learning new classes without forgetting previously acquired knowledge, addressing the fundamental challenge of catastrophic forgetting in dense prediction tasks. However, existing CSS methods typically employ single-stage encoder-decoder architectures where segmentation masks and class labels are tightly coupled, leading to interference between old and new class learning and suboptimal retention-plasticity balance. We introduce DecoupleCSS, a novel two-stage framework for CSS. By decoupling class-aware detection from class-agnostic segmentation, DecoupleCSS enables more effective continual learning, preserving past knowledge while learning new classes. The first stage leverages pre-trained text and image encoders, adapted using LoRA, to encode class-specific information and generate location-aware prompts. In the second stage, the Segment Anything Model (SAM) is employed to produce precise segmentation masks, ensuring that segmentation knowledge is shared across both new and previous classes. This approach improves the balance between retention and adaptability in CSS, achieving state-of-the-art performance across a variety of challenging tasks. Our code is publicly available at: https://github.com/euyis1019/Decoupling-Continual-Semantic-Segmentation.
[74] Automatic Image Colorization with Convolutional Neural Networks and Generative Adversarial Networks
Ruiyu Li,Changyuan Qiu,Hangrui Cao,Qihan Ren,Yuqing Qiu
Main category: cs.CV
TL;DR: This project explores automatic image colorization using classification and adversarial learning techniques, aiming to improve upon traditional regression-based methods by leveraging semantic and texture cues from available data.
Details
Motivation: Image colorization is a challenging yet important task in computer vision, with applications such as color restoration and automatic animation colorization. Its ill-posed nature and the availability of large training datasets provide an opportunity to explore semantic and texture cues for accurate color prediction. Method: The project employs classification and adversarial learning techniques for image colorization, building and modifying models based on prior works. Result: The project demonstrates the effectiveness of using classification and adversarial learning for automatic image colorization, indicating potential improvements over traditional regression-based approaches. Conclusion: The project concludes that automatic image colorization can be effectively achieved through classification and adversarial learning, building on prior works while adapting them to the specific scenario under study. Abstract: Image colorization, the task of adding colors to grayscale images, has been the focus of significant research efforts in computer vision in recent years for its various application areas such as color restoration and automatic animation colorization [15, 1]. The colorization problem is challenging as it is highly ill-posed with two out of three image dimensions lost, resulting in large degrees of freedom. However, semantics of the scene as well as the surface texture could provide important cues for colors: the sky is typically blue, the clouds are typically white and the grass is typically green, and there are huge amounts of training data available for learning such priors since any colored image could serve as a training data point [20]. Colorization is initially formulated as a regression task[5], which ignores the multi-modal nature of color prediction. In this project, we explore automatic image colorization via classification and adversarial learning. 
We will build our models on prior works, apply modifications for our specific scenario and make comparisons.
[75] FLUX-Makeup: High-Fidelity, Identity-Consistent, and Robust Makeup Transfer via Diffusion Transformer
Jian Zhu,Shanyuan Liu,Liuzhuozheng Li,Yue Gong,He Wang,Bo Cheng,Yuhang Ma,Liebucha Wu,Xiaoyu Wu,Dawei Leng,Yuhui Yin,Yang Xu
Main category: cs.CV
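TL;DR: FLUX-Makeup is an efficient makeup transfer method that needs no additional control components; by introducing RefLoRAInjector and an optimized data generation pipeline, it achieves high-quality, identity-consistent makeup transfer.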
Details
Motivation: Existing GAN-based and diffusion-based makeup transfer methods rely on complex loss functions or additional face-control modules, which tend to introduce errors and lead to suboptimal results. Method: Built on the FLUX-Kontext framework, the method uses the source image as the conditional input, introduces RefLoRAInjector, a lightweight makeup feature injector, and designs a robust and scalable data generation pipeline. Result: FLUX-Makeup performs well across diverse scenarios and achieves state-of-the-art performance, and the paired makeup datasets it generates are of significantly higher quality than existing datasets. Conclusion: FLUX-Makeup achieves high-quality, identity-consistent, and robust makeup transfer without any auxiliary face-control components. Abstract: Makeup transfer aims to apply the makeup style from a reference face to a target face and has been increasingly adopted in practical applications. Existing GAN-based approaches typically rely on carefully designed loss functions to balance transfer quality and facial identity consistency, while diffusion-based methods often depend on additional face-control modules or algorithms to preserve identity. However, these auxiliary components tend to introduce extra errors, leading to suboptimal transfer results. To overcome these limitations, we propose FLUX-Makeup, a high-fidelity, identity-consistent, and robust makeup transfer framework that eliminates the need for any auxiliary face-control components. Instead, our method directly leverages source-reference image pairs to achieve superior transfer performance. Specifically, we build our framework upon FLUX-Kontext, using the source image as its native conditional input. Furthermore, we introduce RefLoRAInjector, a lightweight makeup feature injector that decouples the reference pathway from the backbone, enabling efficient and comprehensive extraction of makeup-related information. In parallel, we design a robust and scalable data generation pipeline to provide more accurate supervision during training. The paired makeup datasets produced by this pipeline significantly surpass the quality of all existing datasets. Extensive experiments demonstrate that FLUX-Makeup achieves state-of-the-art performance, exhibiting strong robustness across diverse scenarios.
[76] AdaFusion: Prompt-Guided Inference with Adaptive Fusion of Pathology Foundation Models
Yuxiang Xiao,Yang Hu,Bin Li,Tianyang Zhang,Zexi Li,Huazhu Fu,Jens Rittscher,Kaixiang Yang
Main category: cs.CV
TL;DR: AdaFusion dynamically integrates knowledge from multiple PFMs to enhance performance and interpretability in pathology applications.
Details
Motivation: The motivation is to overcome the latent biases introduced by diverse and opaque pre-training contexts of PFMs, which hinder generalisability and transparency in downstream applications. Method: AdaFusion uses a prompt-guided inference framework that compresses and aligns tile-level features from diverse models and employs a lightweight attention mechanism to adaptively fuse them based on tissue phenotype context. Result: AdaFusion consistently surpasses individual PFMs across both classification and regression tasks and provides interpretable insights into each model's biosemantic specialisation. Conclusion: AdaFusion successfully integrates multiple PFMs to overcome their individual biases, enhancing both performance and interpretability in various downstream applications such as treatment response prediction, tumor grading, and spatial gene expression inference. Abstract: Pathology foundation models (PFMs) have demonstrated strong representational capabilities through self-supervised pre-training on large-scale, unannotated histopathology image datasets. However, their diverse yet opaque pretraining contexts, shaped by both data-related and structural/training factors, introduce latent biases that hinder generalisability and transparency in downstream applications. In this paper, we propose AdaFusion, a novel prompt-guided inference framework that, to our knowledge, is among the very first to dynamically integrate complementary knowledge from multiple PFMs. Our method compresses and aligns tile-level features from diverse models and employs a lightweight attention mechanism to adaptively fuse them based on tissue phenotype context. We evaluate AdaFusion on three real-world benchmarks spanning treatment response prediction, tumour grading, and spatial gene expression inference. 
Our approach consistently surpasses individual PFMs across both classification and regression tasks, while offering interpretable insights into each model's biosemantic specialisation. These results highlight AdaFusion's ability to bridge heterogeneous PFMs, achieving both enhanced performance and interpretability of model-specific inductive biases.
[77] PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation
Jingxuan He,Busheng Su,Finn Wong
Main category: cs.CV
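TL;DR: PoseGen is a new framework that generates arbitrarily long videos from a single reference image and a driving pose sequence, addressing identity drift while maintaining high fidelity and smoothness in long video generation.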
Details
Motivation: Current diffusion models suffer from identity drift and clip-length limits when generating long, temporally coherent videos, so a new method is needed to improve the quality and duration of generated video. Method: PoseGen uses an in-context LoRA finetuning strategy that injects subject appearance at the token level to preserve identity and conditions on pose information at the channel level for fine-grained motion control. It also uses an interleaved segment generation method with a shared KV cache mechanism and a specialized transition process to ensure background consistency and temporal smoothness. Result: Trained on a video dataset of only 33 hours, PoseGen significantly outperforms existing state-of-the-art methods in identity fidelity, pose accuracy, and the ability to generate artifact-free videos of unlimited length. Conclusion: PoseGen offers an effective new solution to the challenges of long video generation, with broad application potential. Abstract: Generating long, temporally coherent videos with precise control over subject identity and motion is a formidable challenge for current diffusion models, which often suffer from identity drift and are limited to short clips. We introduce PoseGen, a novel framework that generates arbitrarily long videos of a specific subject from a single reference image and a driving pose sequence. Our core innovation is an in-context LoRA finetuning strategy that injects subject appearance at the token level for identity preservation, while simultaneously conditioning on pose information at the channel level for fine-grained motion control. To overcome duration limits, PoseGen pioneers an interleaved segment generation method that seamlessly stitches video clips together, using a shared KV cache mechanism and a specialized transition process to ensure background consistency and temporal smoothness. Trained on a remarkably small 33-hour video dataset, extensive experiments show that PoseGen significantly outperforms state-of-the-art methods in identity fidelity, pose accuracy, and its unique ability to produce coherent, artifact-free videos of unlimited duration.
[78] Sculpting Margin Penalty: Intra-Task Adapter Merging and Classifier Calibration for Few-Shot Class-Incremental Learning
Liang Bai,Hong Song,Jinfu Li,Yucong Lin,Jingfan Fan,Tianyu Fu,Danni Ai,Deqiang Xiao,Jian Yang
Main category: cs.CV
TL;DR: The paper proposes SMP, a novel method for Few-Shot Class-Incremental Learning (FSCIL), which balances base-class discriminability and new-class generalization by integrating margin penalties at different stages of the parameter-efficient fine-tuning paradigm. It includes the Margin-aware Intra-task Adapter Merging (MIAM) mechanism for base task learning and the Margin Penalty-based Classifier Calibration (MPCC) strategy for incremental tasks. Experiments show that SMP achieves state-of-the-art performance on FSCIL.
Details
Motivation: Real-world applications often face data privacy constraints and high acquisition costs, making the assumption of sufficient training data in incremental tasks unrealistic and leading to significant performance degradation in class-incremental learning. Forward-compatible learning has emerged as a promising solution for Few-Shot Class-Incremental Learning (FSCIL), but existing methods struggle to balance base-class discriminability and new-class generalization. Method: The proposed method is called SMP (Sculpting Margin Penalty), which integrates margin penalties at different stages within the parameter-efficient fine-tuning paradigm. It includes two mechanisms: Margin-aware Intra-task Adapter Merging (MIAM) for base task learning and Margin Penalty-based Classifier Calibration (MPCC) for incremental tasks. Result: Extensive experiments on CIFAR100, ImageNet-R, and CUB200 demonstrate that SMP achieves state-of-the-art performance in FSCIL. Conclusion: SMP achieves state-of-the-art performance in FSCIL while maintaining a better balance between base and new classes. Abstract: Real-world applications often face data privacy constraints and high acquisition costs, making the assumption of sufficient training data in incremental tasks unrealistic and leading to significant performance degradation in class-incremental learning. Forward-compatible learning, which prospectively prepares for future tasks during base task training, has emerged as a promising solution for Few-Shot Class-Incremental Learning (FSCIL). However, existing methods still struggle to balance base-class discriminability and new-class generalization. Moreover, limited access to original data during incremental tasks often results in ambiguous inter-class decision boundaries. To address these challenges, we propose SMP (Sculpting Margin Penalty), a novel FSCIL method that strategically integrates margin penalties at different stages within the parameter-efficient fine-tuning paradigm. 
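To make the idea of a margin penalty concrete, here is a minimal sketch of a cosine classifier loss with an additive margin on the target class. The digest does not give SMP's exact formulation, so the additive (CosFace-style) form, the function name `margin_cross_entropy`, and the `scale` value are assumptions chosen for illustration only.

```python
import math


def margin_cross_entropy(cosines, target, margin=0.0, scale=16.0):
    """Cross-entropy over scaled cosine logits, with `margin` subtracted
    from the target-class cosine (assumed additive-margin form)."""
    logits = [
        scale * (c - margin if i == target else c)
        for i, c in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


# Cosine similarities between one embedding and three class prototypes.
cosines = [0.8, 0.3, 0.1]
loss_plain = margin_cross_entropy(cosines, target=0, margin=0.0)
loss_margin = margin_cross_entropy(cosines, target=0, margin=0.2)
```

The margin makes the same prediction cost more, so training pushes the target cosine further above the others. That tightens base-class clusters, which is the discriminability side of the trade-off the abstract describes; the no-margin loss leaves more room in the embedding space for future classes.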
Specifically, we introduce the Margin-aware Intra-task Adapter Merging (MIAM) mechanism for base task learning. MIAM trains two sets of low-rank adapters with distinct classification losses: one with a margin penalty to enhance base-class discriminability, and the other without margin constraints to promote generalization to future new classes. These adapters are then adaptively merged to improve forward compatibility. For incremental tasks, we propose a Margin Penalty-based Classifier Calibration (MPCC) strategy to refine decision boundaries by fine-tuning classifiers on all seen classes' embeddings with a margin penalty. Extensive experiments on CIFAR100, ImageNet-R, and CUB200 demonstrate that SMP achieves state-of-the-art performance in FSCIL while maintaining a better balance between base and new classes.
[79] AHDMIL: Asymmetric Hierarchical Distillation Multi-Instance Learning for Fast and Accurate Whole-Slide Image Classification
Jiuyang Dong,Jiahan Li,Junjun Jiang,Kui Jiang,Yongbing Zhang
Main category: cs.CV
TL;DR: AHDMIL improves pathological image classification by reducing inference costs while enhancing classification performance through a novel hierarchical distillation framework.
Details
Motivation: Multi-instance learning (MIL) faces high inference costs due to processing thousands of patches from gigapixel whole slide images (WSI). AHDMIL addresses this challenge by eliminating irrelevant patches through a two-step training process. Method: AHDMIL uses an Asymmetric Hierarchical Distillation Multi-Instance Learning framework with two key components: Dynamic Multi-Instance Network (DMIN) and Dual-Branch Lightweight Instance Pre-screening Network (DB-LIPN), along with a Chebyshev-polynomial-based Kolmogorov-Arnold (CKA) classifier. Result: Experiments on four public datasets show that AHDMIL consistently outperforms previous methods in classification performance and inference speed, with improvements in accuracy, AUC, F1 score, and Brier score, and average inference speedups of 1.2 to 2.1 times. Conclusion: AHDMIL is an effective method for pathological image classification, providing improved classification performance and inference speed compared to previous state-of-the-art methods. Abstract: Although multi-instance learning (MIL) has succeeded in pathological image classification, it faces the challenge of high inference costs due to the need to process thousands of patches from each gigapixel whole slide image (WSI). To address this, we propose AHDMIL, an Asymmetric Hierarchical Distillation Multi-Instance Learning framework that enables fast and accurate classification by eliminating irrelevant patches through a two-step training process. AHDMIL comprises two key components: the Dynamic Multi-Instance Network (DMIN), which operates on high-resolution WSIs, and the Dual-Branch Lightweight Instance Pre-screening Network (DB-LIPN), which analyzes corresponding low-resolution counterparts. In the first step, self-distillation (SD), DMIN is trained for WSI classification while generating per-instance attention scores to identify irrelevant patches. 
These scores guide the second step, asymmetric distillation (AD), where DB-LIPN learns to predict the relevance of each low-resolution patch. The relevant patches predicted by DB-LIPN have spatial correspondence with patches in high-resolution WSIs, which are used for fine-tuning and efficient inference of DMIN. In addition, we design the first Chebyshev-polynomial-based Kolmogorov-Arnold (CKA) classifier in computational pathology, which improves classification performance through learnable activation layers. Extensive experiments on four public datasets demonstrate that AHDMIL consistently outperforms previous state-of-the-art methods in both classification performance and inference speed. For example, on the Camelyon16 dataset, it achieves a relative improvement of 5.3% in accuracy and accelerates inference by 1.2×. Across all datasets, area under the curve (AUC), accuracy, F1 score, and Brier score show consistent gains, with average inference speedups ranging from 1.2 to 2.1 times. The code is available.
[80] MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs
Yufei Gao,Jiaying Fei,Nuo Chen,Ruirui Chen,Guohang Yan,Yunshi Lan,Botian Shi
Main category: cs.CV
TL;DR: The study addresses the limitations of MLLMs in low-resource languages by introducing MELLA, a dataset that enhances both linguistic and cultural understanding, leading to improved performance.
Details
Motivation: Current MLLMs perform poorly in low-resource languages, and existing enhancement methods neglect multimodal informativeness and cultural groundedness. Method: Proposed a dual-source strategy for data collection and introduced the MELLA dataset. Result: Fine-tuning on MELLA led to performance improvements across eight languages, with models producing 'thick descriptions'. Conclusion: MELLA dataset improves the effectiveness of MLLMs in low-resource languages by enhancing both linguistic capability and cultural groundedness. Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable performance in high-resource languages. However, their effectiveness diminishes significantly in the contexts of low-resource languages. Current multilingual enhancement methods are often limited to text modality or rely solely on machine translation. While such approaches help models acquire basic linguistic capabilities and produce "thin descriptions", they neglect the importance of multimodal informativeness and cultural groundedness, both of which are crucial for serving low-resource language users effectively. To bridge this gap, in this study, we identify two significant objectives for a truly effective MLLM in low-resource language settings, namely 1) linguistic capability and 2) cultural groundedness, placing special emphasis on cultural awareness. To achieve these dual objectives, we propose a dual-source strategy that guides the collection of data tailored to each goal, sourcing native web alt-text for culture and MLLM-generated captions for linguistics. As a concrete implementation, we introduce MELLA, a multimodal, multilingual dataset. Experiment results show that after fine-tuning on MELLA, there is a general performance improvement for the eight languages on various MLLM backbones, with models producing "thick descriptions". We verify that the performance gains are from both cultural knowledge enhancement and linguistic capability enhancement. 
Our dataset can be found at https://opendatalab.com/applyMultilingualCorpus.
[81] Latent Expression Generation for Referring Image Segmentation and Grounding
Seonghoon Yu,Joonbeom Hong,Joonseok Lee,Jeany Son
Main category: cs.CV
TL;DR: This paper proposes a new visual grounding framework that generates multiple latent expressions to better align textual descriptions with visual details, resulting in improved performance on RIS and REC tasks.
Details
Motivation: The motivation is to bridge the gap between rich visual details in images and sparse textual cues in existing methods, which often leads to misidentification of similar objects. Method: The method introduces subject distributor and visual concept injector modules to generate multiple latent expressions and uses a positive-margin contrastive learning strategy to align these expressions with the original text. Result: The experimental results show that the proposed method outperforms state-of-the-art approaches on multiple benchmarks, including the GRES benchmark. Conclusion: The proposed visual grounding framework significantly improves the performance of RIS and REC tasks by generating multiple latent expressions that incorporate complementary visual details. Abstract: Visual grounding tasks, such as referring image segmentation (RIS) and referring expression comprehension (REC), aim to localize a target object based on a given textual description. The target object in an image can be described in multiple ways, reflecting diverse attributes such as color, position, and more. However, most existing methods rely on a single textual input, which captures only a fraction of the rich information available in the visual domain. This mismatch between rich visual details and sparse textual cues can lead to the misidentification of similar objects. To address this, we propose a novel visual grounding framework that leverages multiple latent expressions generated from a single textual input by incorporating complementary visual details absent from the original description. Specifically, we introduce subject distributor and visual concept injector modules to embed both shared-subject and distinct-attributes concepts into the latent representations, thereby capturing unique and target-specific visual cues. We also propose a positive-margin contrastive learning strategy to align all latent expressions with the original text while preserving subtle variations. 
Experimental results show that our method not only outperforms state-of-the-art RIS and REC approaches on multiple benchmarks but also achieves outstanding performance on the generalized referring expression segmentation (GRES) benchmark.
[82] Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
Luozheng Qin,Jia Gong,Yuqing Sun,Tianjiao Li,Mengping Yang,Xiaomeng Yang,Chao Qu,Zhiyu Tan,Hao Li
Main category: cs.CV
TL;DR: This paper proposes a unified Chain-of-Thought (Uni-CoT) framework that enables coherent and grounded multimodal reasoning within a single unified model.
Details
Motivation: Extending CoT to vision-language reasoning tasks remains challenging because it often requires interpreting transitions of visual states to support reasoning. Existing methods often struggle here, either because of limited capacity to model visual state transitions or because fragmented architectures produce incoherent visual trajectories. Method: Uni-CoT introduces a novel two-level reasoning paradigm: a macro-level CoT for high-level task planning and a micro-level CoT for subtask execution. It also introduces a structured training paradigm that combines interleaved image-text supervision for the macro-level CoT with multi-task objectives for the micro-level CoT. Result: Experiments show that Uni-CoT achieves SOTA performance and strong generalization on the reasoning-driven image generation benchmark (WISE) and the editing benchmarks (RISE and KRIS). Conclusion: Uni-CoT is a promising solution for multimodal reasoning, enabling coherent and grounded multimodal reasoning within a single unified model. Abstract: Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. However, extending CoT to vision-language reasoning tasks remains challenging, as it often requires interpreting transitions of visual states to support reasoning. Existing methods often struggle with this due to limited capacity for modeling visual state transitions or incoherent visual trajectories caused by fragmented architectures. To overcome these limitations, we propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning within a single unified model. The key idea is to leverage a model capable of both image understanding and generation to reason over visual content and model evolving visual states. However, empowering a unified model to achieve that is non-trivial, given the high computational cost and the burden of training. To address this, Uni-CoT introduces a novel two-level reasoning paradigm: a Macro-Level CoT for high-level task planning and a Micro-Level CoT for subtask execution. This design significantly reduces the computational overhead. Furthermore, we introduce a structured training paradigm that combines interleaved image-text supervision for macro-level CoT with multi-task objectives for micro-level CoT. Together, these innovations allow Uni-CoT to perform scalable and coherent multi-modal reasoning. In addition, thanks to our design, all experiments can be efficiently completed using only 8 A100 GPUs with 80GB VRAM each.
Experimental results on the reasoning-driven image generation benchmark (WISE) and the editing benchmarks (RISE and KRIS) indicate that Uni-CoT demonstrates SOTA performance and strong generalization, establishing Uni-CoT as a promising solution for multi-modal reasoning. Project Page and Code: https://sais-fuxi.github.io/projects/uni-cot/
[83] FedGIN: Federated Learning with Dynamic Global Intensity Non-linear Augmentation for Organ Segmentation using Multi-modal Images
Sachin Dudda Nagaraju,Ashkan Moradi,Bendik Skarre Abrahamsen,Mattijs Elschot
Main category: cs.CV
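TL;DR: FedGIN is a new federated learning framework for multimodal medical image segmentation that integrates a GIN augmentation module to achieve strong cross-modality generalization while preserving privacy.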
Details
Motivation: Medical image segmentation plays a key role in AI-assisted diagnostics, but data scarcity, domain shift between modalities, and privacy restrictions make it challenging to develop a unified model that generalizes effectively across modalities. Method: Proposes FedGIN, which incorporates a Global Intensity Non-linear (GIN) augmentation module to harmonize modality-specific intensity distributions during local training. Result: In the limited-data scenario, FedGIN improves 3D Dice scores on MRI test cases by 12 to 18% over FL without GIN and consistently outperforms local baselines. In the complete dataset scenario, FedGIN shows near-centralized performance, with a 30% Dice score improvement over the MRI-only baseline and a 10% improvement over the CT-only baseline. Conclusion: FedGIN is an effective federated learning framework that enables multimodal organ segmentation while protecting patient privacy. Abstract: Medical image segmentation plays a crucial role in AI-assisted diagnostics, surgical planning, and treatment monitoring. Accurate and robust segmentation models are essential for enabling reliable, data-driven clinical decision making across diverse imaging modalities. Given the inherent variability in image characteristics across modalities, developing a unified model capable of generalizing effectively to multiple modalities would be highly beneficial. This model could streamline clinical workflows and reduce the need for modality-specific training. However, real-world deployment faces major challenges, including data scarcity, domain shift between modalities (e.g., CT vs. MRI), and privacy restrictions that prevent data sharing. To address these issues, we propose FedGIN, a Federated Learning (FL) framework that enables multimodal organ segmentation without sharing raw patient data. Our method integrates a lightweight Global Intensity Non-linear (GIN) augmentation module that harmonizes modality-specific intensity distributions during local training. We evaluated FedGIN using two types of datasets: an imputed dataset and a complete dataset. In the limited dataset scenario, the model was initially trained using only MRI data, and CT data was added to assess its performance improvements. In the complete dataset scenario, both MRI and CT data were fully utilized for training on all clients. In the limited-data scenario, FedGIN achieved a 12 to 18% improvement in 3D Dice scores on MRI test cases compared to FL without GIN and consistently outperformed local baselines.
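For readers unfamiliar with the metric behind these numbers, the Dice score of a predicted mask against a reference mask is 2|P ∩ T| / (|P| + |T|). The sketch below computes it over flattened binary voxel labels; the function name `dice_score` and the convention of returning 1.0 for two empty masks are assumptions, not details taken from the paper.

```python
def dice_score(pred, truth):
    """Dice coefficient for two equal-length sequences of 0/1 voxel labels."""
    pred, truth = list(pred), list(truth)
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# A 3D volume is flattened to a 1D list of voxels before scoring.
pred = [1, 1, 0, 0]
truth = [1, 0, 1, 0]
score = dice_score(pred, truth)  # one shared voxel out of 2 + 2 -> 0.5
```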
In the complete dataset scenario, FedGIN demonstrated near-centralized performance, with a 30% Dice score improvement over the MRI-only baseline and a 10% improvement over the CT-only baseline, highlighting its strong cross-modality generalization under privacy constraints.
[84] Test-Time Reinforcement Learning for GUI Grounding via Region Consistency
Yong Du,Yuchen Yan,Fei Tang,Zhengxi Lu,Chang Zong,Weiming Lu,Shengpei Jiang,Yongliang Shen
Main category: cs.CV
TL;DR: This paper introduces GUI-RC and GUI-RCPO, two test-time methods for improving GUI grounding by leveraging prediction consistency without additional training, achieving significant accuracy improvements on benchmark datasets.
Details
Motivation: Existing GUI grounding methods rely heavily on supervised training or labeled rewards, which are costly and limit scalability. The authors aim to improve performance by leveraging implicit confidence signals from prediction overlaps without additional training. Method: GUI-RC constructs spatial voting grids from multiple predictions to identify consensus regions, while GUI-RCPO uses consistency patterns as rewards for test-time reinforcement learning, enabling iterative refinement of outputs on unlabeled data. Result: GUI-RC improves accuracy by 2-3% across different architectures on ScreenSpot benchmarks, and GUI-RCPO further boosts performance, raising Qwen2.5-VL-3B-Instruct's accuracy from 80.11% to 85.14% on ScreenSpot-v2. Conclusion: The proposed GUI-RC and GUI-RCPO methods improve the accuracy of GUI grounding without requiring additional training, highlighting the potential of test-time scaling and reinforcement learning for more robust and data-efficient GUI agents. Abstract: Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. 
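The spatial voting idea behind GUI-RC can be sketched in a few lines. Each sampled prediction is treated as a box that votes for the grid cells it covers, and the cells with the most votes form the consensus. This is a simplified illustration under stated assumptions: the function name `region_consistency`, the per-pixel grid, and returning the bounding box of all top-voted cells (rather than, say, the largest connected region) are choices made here, not details confirmed by the abstract.

```python
def region_consistency(boxes, width, height):
    """Vote over a width x height grid with (x1, y1, x2, y2) boxes
    (x2 and y2 exclusive) and return (consensus_box, max_votes)."""
    grid = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                grid[y][x] += 1
    top = max(max(row) for row in grid)
    cells = [
        (x, y)
        for y in range(height)
        for x in range(width)
        if grid[y][x] == top
    ]
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    # Simplification: bounding box around every cell with the top vote count.
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1), top


# Three sampled predictions for the same GUI element on a 10x10 screen.
samples = [(0, 0, 4, 4), (2, 2, 6, 6), (3, 3, 8, 8)]
consensus, votes = region_consistency(samples, width=10, height=10)
# Only the cell at (3, 3) is covered by all three boxes.
```

A consistency reward of the kind GUI-RCPO is described as using could then score each sampled box by how many votes its region collects relative to `votes`, though the exact reward is not specified in this digest.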
We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: GUI-RC boosts Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, while GUI-RCPO further improves it to 85.14% through self-supervised optimization. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
[85] Deep Learning-based Animal Behavior Analysis: Insights from Mouse Chronic Pain Models
Yu-Hsi Chen,Wei-Hsin Chen,Chien-Yao Wang,Hong-Yuan Mark Liao,James C. Liao,Chien-Chang Chen
Main category: cs.CV
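TL;DR: This study proposes a chronic pain behavior analysis method that requires no manual annotation and builds a mouse pain behavior dataset to validate the method's effectiveness.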
Details
Motivation: Existing methods rely on manually labeled behavioral features and struggle to accurately capture the subtle and persistent behavioral changes of chronic pain. Method: Proposes a universal action space projector that automatically extracts mouse action features without human-defined labels. Result: The method reaches 48.41% accuracy on a 15-class pain classification task, significantly outperforming human experts (21.33%) and the widely used method B-SOiD (30.52%); on a three-class task it reaches 73.1% accuracy, again outperforming human experts (48%) and B-SOiD (58.43%). Conclusion: This study demonstrates a new approach to chronic pain assessment that can offer new perspectives for pain research and potential applications in drug development. Abstract: Assessing chronic pain behavior in mice is critical for preclinical studies. However, existing methods mostly rely on manual labeling of behavioral features, and humans lack a clear understanding of which behaviors best represent chronic pain. For this reason, existing methods struggle to accurately capture the insidious and persistent behavioral changes in chronic pain. This study proposes a framework to automatically discover features related to chronic pain without relying on human-defined action labels. Our method uses a universal action space projector to automatically extract mouse action features, and avoids the potential bias of human labeling by retaining the rich behavioral information in the original video. In this paper, we also collected a mouse pain behavior dataset that captures the disease progression of both neuropathic and inflammatory pain across multiple time points. Our method achieves 48.41% accuracy in a 15-class pain classification task, significantly outperforming human experts (21.33%) and the widely used method B-SOiD (30.52%). Furthermore, when the classification is simplified to only three categories, i.e., neuropathic pain, inflammatory pain, and no pain, our method achieves an accuracy of 73.1%, which is notably higher than that of human experts (48%) and B-SOiD (58.43%). Finally, our method revealed differences in drug efficacy for different types of pain on zero-shot Gabapentin drug testing, and the results were consistent with past drug efficacy literature.
This study demonstrates the potential clinical application of our method, which can provide new insights into pain research and related drug development.[86] Rotation Equivariant Arbitrary-scale Image Super-Resolution
Qi Xie,Jiahong Fu,Zongben Xu,Deyu Meng
Main category: cs.CV
TL;DR: This paper introduces a rotation-equivariant ASISR network that preserves structural integrity in high-resolution image recovery, outperforming existing methods on various datasets.
Details
Motivation: The motivation stems from the challenge of preserving geometric patterns like edges and textures in arbitrary-scale image super-resolution, as conventional methods often produce artifacts due to a lack of rotational invariance. Method: The method involves redesigning the implicit neural representation (INR) and encoder modules to incorporate intrinsic rotation equivariance, enabling end-to-end rotational equivariance in the ASISR network. Result: The proposed method achieves end-to-end rotational equivariance in ASISR networks, reduces equivariance error, and enhances performance across simulated and real-world datasets. Conclusion: The proposed rotation equivariant ASISR method enhances the recovery of high-resolution images by maintaining the structural integrity and orientation of geometric patterns, showing superior performance on simulated and real datasets. Abstract: The arbitrary-scale image super-resolution (ASISR), a recent popular topic in computer vision, aims to achieve arbitrary-scale high-resolution recoveries from a low-resolution input image. This task is realized by representing the image as a continuous implicit function through two fundamental modules, a deep-network-based encoder and an implicit neural representation (INR) module. Despite achieving notable progress, a crucial challenge of such a highly ill-posed setting is that many common geometric patterns, such as repetitive textures, edges, or shapes, are seriously warped and deformed in the low-resolution images, naturally leading to unexpected artifacts appearing in their high-resolution recoveries. Embedding rotation equivariance into the ASISR network is thus necessary, as it has been widely demonstrated that this enhancement enables the recovery to faithfully maintain the original orientations and structural integrity of geometric patterns underlying the input image. Motivated by this, we make efforts to construct a rotation equivariant ASISR method in this study. 
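Rotation equivariance means that rotating the input and then applying an operator gives the same result as applying the operator and then rotating its output. The sketch below measures that gap for two toy image operators. It is only an illustration under stated assumptions: it uses exact 90-degree rotations on small grids, whereas the paper's analysis concerns its own network and error definition, and the helper names (`rot90`, `neighbor_mean`, `horizontal_diff`, `equivariance_error`) are hypothetical.

```python
from typing import Callable, List

Grid = List[List[float]]


def rot90(img: Grid) -> Grid:
    """Rotate a 2D grid by 90 degrees counter-clockwise."""
    h, w = len(img), len(img[0])
    return [[img[r][w - 1 - c] for r in range(h)] for c in range(w)]


def _at(img: Grid, r: int, c: int) -> float:
    """Zero-padded pixel lookup."""
    if 0 <= r < len(img) and 0 <= c < len(img[0]):
        return img[r][c]
    return 0.0


def neighbor_mean(img: Grid) -> Grid:
    """Average of each pixel and its 4 neighbours; symmetric under 90-degree turns."""
    return [
        [
            (_at(img, r, c) + _at(img, r - 1, c) + _at(img, r + 1, c)
             + _at(img, r, c - 1) + _at(img, r, c + 1)) / 5.0
            for c in range(len(img[0]))
        ]
        for r in range(len(img))
    ]


def horizontal_diff(img: Grid) -> Grid:
    """Left-to-right finite difference; has a preferred direction, so not equivariant."""
    return [
        [_at(img, r, c) - _at(img, r, c - 1) for c in range(len(img[0]))]
        for r in range(len(img))
    ]


def equivariance_error(f: Callable[[Grid], Grid], img: Grid) -> float:
    """Mean absolute difference between f(rot(img)) and rot(f(img))."""
    a = f(rot90(img))
    b = rot90(f(img))
    diffs = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]]
print(equivariance_error(neighbor_mean, image))    # symmetric operator: error near zero
print(equivariance_error(horizontal_diff, image))  # directional operator: clearly non-zero
```

An operator with a built-in direction fails the check, which is the kind of behavior an equivariant architecture is designed to rule out.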
Specifically, we elaborately redesign the basic architectures of INR and encoder modules, incorporating intrinsic rotation equivariance capabilities beyond those of conventional ASISR networks. Through such amelioration, the ASISR network can, for the first time, be implemented with end-to-end rotational equivariance maintained from input to output. We also provide a solid theoretical analysis to evaluate its intrinsic equivariance error, demonstrating its inherent nature of embedding such an equivariance structure. The superiority of the proposed method is substantiated by experiments conducted on both simulated and real datasets. We also validate that the proposed framework can be readily integrated into current ASISR methods in a plug & play manner to further enhance their performance.[87] X-MoGen: Unified Motion Generation across Humans and Animals
Xuan Wang,Kai Ruan,Liyang Qian,Zhizhi Guo,Chang Su,Gaoang Wang
Main category: cs.CV
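TL;DR: X-MoGen is the first unified framework for cross-species text-driven motion generation covering both humans and animals; it uses a two-stage architecture and the large-scale UniMo4D dataset to improve unified modeling and generalization.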
Details
Motivation: Existing methods typically model human and animal motion separately, whereas a cross-species approach offers a unified representation and better generalization; morphological differences across species remain a key challenge. Method: X-MoGen adopts a two-stage architecture. In the first stage, a conditional graph variational autoencoder learns canonical T-pose priors, and an autoencoder encodes motion into a shared latent space regularized by a morphological loss. In the second stage, masked motion modeling generates motion embeddings conditioned on text descriptions. A morphological consistency module is used during training to improve skeletal plausibility across species, and the large-scale UniMo4D dataset is constructed to support unified modeling. Result: Extensive experiments on UniMo4D show that X-MoGen outperforms state-of-the-art methods on both seen and unseen species. Conclusion: X-MoGen is an effective framework for cross-species text-driven motion generation that handles morphological differences and achieves unified modeling with better generalization. Abstract: Text-driven motion generation has attracted increasing attention due to its broad applications in virtual reality, animation, and robotics. While existing methods typically model human and animal motion separately, a joint cross-species approach offers key advantages, such as a unified representation and improved generalization. However, morphological differences across species remain a key challenge, often compromising motion plausibility. To address this, we propose X-MoGen, the first unified framework for cross-species text-driven motion generation covering both humans and animals. X-MoGen adopts a two-stage architecture. First, a conditional graph variational autoencoder learns canonical T-pose priors, while an autoencoder encodes motion into a shared latent space regularized by morphological loss. In the second stage, we perform masked motion modeling to generate motion embeddings conditioned on textual descriptions. During training, a morphological consistency module is employed to promote skeletal plausibility across species. To support unified modeling, we construct UniMo4D, a large-scale dataset of 115 species and 119k motion sequences, which integrates human and animal motions under a shared skeletal topology for joint training. Extensive experiments on UniMo4D demonstrate that X-MoGen outperforms state-of-the-art methods on both seen and unseen species.[88] PhysPatch: A Physically Realizable and Transferable Adversarial Patch Attack for Multimodal Large Language Models-based Autonomous Driving Systems
Qi Guo,Xiaojun Jia,Shanmin Pang,Simeng Qin,Lin Wang,Ju Jia,Yang Liu,Qing Guo
Main category: cs.CV
TL;DR: PhysPatch is a new adversarial patch framework designed specifically for MLLM-based autonomous driving systems, outperforming existing methods by optimizing patch location, shape, and content for real-world applicability and effectiveness.
Details
Motivation: MLLMs are increasingly used in autonomous driving systems but are vulnerable to adversarial patch attacks. Existing patch-based attacks perform poorly on MLLMs due to their complex architectures, necessitating a more effective and physically realizable solution. Method: PhysPatch jointly optimizes patch location, shape, and content using a semantic-based mask initialization strategy, an SVD-based local alignment loss with patch-guided crop-resize, and a potential field-based mask refinement method. Result: PhysPatch significantly outperforms prior methods in influencing MLLM-based AD systems' perception and planning outputs, while ensuring patches are placed in physically feasible regions. Conclusion: PhysPatch is a highly effective and physically realizable adversarial patch framework for MLLM-based autonomous driving systems, showing superior performance and real-world applicability compared to existing methods. Abstract: Multimodal Large Language Models (MLLMs) are becoming integral to autonomous driving (AD) systems due to their strong vision-language reasoning capabilities. However, MLLMs are vulnerable to adversarial attacks, particularly adversarial patch attacks, which can pose serious threats in real-world scenarios. Existing patch-based attack methods are primarily designed for object detection models and perform poorly when transferred to MLLM-based systems due to the latter's complex architectures and reasoning abilities. To address these limitations, we propose PhysPatch, a physically realizable and transferable adversarial patch framework tailored for MLLM-based AD systems. PhysPatch jointly optimizes patch location, shape, and content to enhance attack effectiveness and real-world applicability. It introduces a semantic-based mask initialization strategy for realistic placement, an SVD-based local alignment loss with patch-guided crop-resize to improve transferability, and a potential field-based mask refinement method. 
Extensive experiments across open-source, commercial, and reasoning-capable MLLMs demonstrate that PhysPatch significantly outperforms prior methods in steering MLLM-based AD systems toward target-aligned perception and planning outputs. Moreover, PhysPatch consistently places adversarial patches in physically feasible regions of AD scenes, ensuring strong real-world applicability and deployability.[89] Multi-tracklet Tracking for Generic Targets with Adaptive Detection Clustering
Zewei Wu,Longhao Wang,Cui Wang,César Teixeira,Wei Ke,Zhang Xiong
Main category: cs.CV
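TL;DR: This paper proposes MTT, a new multi-object tracking framework that adaptively generates tracklets and associates them using multiple cues, handling low-confidence detections and long-term occlusions and performing well on multi-object tracking tasks.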
Details
Motivation: Existing methods perform poorly on unseen target categories because of low-confidence detections, weak motion and appearance constraints, and long-term occlusions, so a more robust tracking method is needed. Method: Multi-Tracklet Tracking (MTT) first adaptively clusters detections into robust tracklets, then estimates the best tracklet partition using cues such as location and appearance. Result: Extensive experiments on a generic multiple object tracking benchmark demonstrate the competitive performance of the MTT framework. Conclusion: With flexible tracklet generation and a multi-cue tracklet association framework, MTT improves multi-object tracking in complex scenes. Abstract: Tracking specific targets, such as pedestrians and vehicles, has been the focus of recent vision-based multitarget tracking studies. However, in some real-world scenarios, unseen categories often challenge existing methods due to low-confidence detections, weak motion and appearance constraints, and long-term occlusions. To address these issues, this article proposes a tracklet-enhanced tracker called Multi-Tracklet Tracking (MTT) that integrates flexible tracklet generation into a multi-tracklet association framework. This framework first adaptively clusters the detection results according to their short-term spatio-temporal correlation into robust tracklets and then estimates the best tracklet partitions using multiple clues, such as location and appearance over time, to mitigate error propagation in long-term association. Finally, extensive experiments on the benchmark for generic multiple object tracking demonstrate the competitiveness of the proposed framework.[90] SPA++: Generalized Graph Spectral Alignment for Versatile Domain Adaptation
Zhiqing Xiao,Haobo Wang,Xu Lu,Wentao Ye,Gang Chen,Junbo Zhao
Main category: cs.CV
TL;DR: SPA++ is a novel domain adaptation framework that leverages graph spectral alignment and propagation mechanisms to improve discriminability and adaptability in complex scenarios.
Details
Motivation: Domain Adaptation often neglects intra-domain structures, leading to reduced discriminability. This work aims to address this tradeoff effectively. Method: The paper proposes a graph spectral alignment framework named SPA++ that includes coarse graph alignment, neighbor-aware propagation, data augmentation, and consistency regularization. Result: SPA++ demonstrates superior performance over state-of-the-art methods, showing enhanced robustness and adaptability across challenging domain adaptation scenarios. Conclusion: SPA++ provides a robust and adaptable framework for domain adaptation, outperforming existing methods in various challenging scenarios. Abstract: Domain Adaptation (DA) aims to transfer knowledge from a labeled source domain to an unlabeled or sparsely labeled target domain under domain shifts. Most prior works focus on capturing the inter-domain transferability but largely overlook rich intra-domain structures, which empirically results in even worse discriminability. To tackle this tradeoff, we propose a generalized graph SPectral Alignment framework, SPA++. Its core is briefly condensed as follows: (1)-by casting the DA problem to graph primitives, it composes a coarse graph alignment mechanism with a novel spectral regularizer toward aligning the domain graphs in eigenspaces; (2)-we further develop a fine-grained neighbor-aware propagation mechanism for enhanced discriminability in the target domain; (3)-by incorporating data augmentation and consistency regularization, SPA++ can adapt to complex scenarios including most DA settings and even challenging distribution scenarios. Furthermore, we also provide theoretical analysis to support our method, including the generalization bound of graph-based DA and the role of spectral alignment and smoothing consistency. 
Extensive experiments on benchmark datasets demonstrate that SPA++ consistently outperforms existing cutting-edge methods, achieving superior robustness and adaptability across various challenging adaptation scenarios.[91] SPEX: A Vision-Language Model for Land Cover Extraction on Spectral Remote Sensing Images
Dongchen Si,Di Wang,Erzhong Gao,Xiaolei Qin,Liu Zhao,Jing Zhang,Minqiang Xu,Jianbo Zhan,Jianshe Wang,Lin Liu,Bo Du,Liangpei Zhang
Main category: cs.CV
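TL;DR: This paper proposes SPEX, a land cover extraction method for remote sensing images based on a multimodal vision-language model. Built on SPIE, a vision-language instruction dataset that encodes spectral priors, and trained with strategies such as multiscale feature aggregation, SPEX performs strongly on several public multispectral datasets and can generate textual explanations.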
Details
Motivation: Spectral information is critical in remote sensing observations, but current vision-language models underuse it, which leads to suboptimal performance in multispectral scenarios. The paper therefore aims to exploit spectral information effectively to improve land cover extraction accuracy. Method: Based on classical spectral index computations, the authors construct a vision-language instruction dataset, SPIE, and propose SPEX, a multimodal large language model. Components and strategies including multiscale feature aggregation, token context condensation, and multispectral visual pre-training enable precise and flexible pixel-level interpretation. Result: Extensive experiments on five public multispectral datasets show that SPEX consistently outperforms existing state-of-the-art methods in extracting land cover categories such as vegetation, buildings, and water bodies. SPEX can also generate textual explanations for its predictions, improving interpretability and user-friendliness. Conclusion: SPEX is the first multimodal vision-language model dedicated to land cover extraction from spectral remote sensing imagery. It uses spectral information effectively, performs strongly across multiple datasets, and generates textual explanations that improve interpretability and practicality. Abstract: Spectral information has long been recognized as a critical cue in remote sensing observations. Although numerous vision-language models have been developed for pixel-level interpretation, spectral information remains underutilized, resulting in suboptimal performance, particularly in multispectral scenarios. To address this limitation, we construct a vision-language instruction-following dataset named SPIE, which encodes spectral priors of land-cover objects into textual attributes recognizable by large language models (LLMs), based on classical spectral index computations. Leveraging this dataset, we propose SPEX, a multimodal LLM designed for instruction-driven land cover extraction. To this end, we introduce several carefully designed components and training strategies, including multiscale feature aggregation, token context condensation, and multispectral visual pre-training, to achieve precise and flexible pixel-level interpretation. To the best of our knowledge, SPEX is the first multimodal vision-language model dedicated to land cover extraction in spectral remote sensing imagery. Extensive experiments on five public multispectral datasets demonstrate that SPEX consistently outperforms existing state-of-the-art methods in extracting typical land cover categories such as vegetation, buildings, and water bodies. Moreover, SPEX is capable of generating textual explanations for its predictions, thereby enhancing interpretability and user-friendliness. 
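As an illustration of how a classical spectral index can be turned into a textual attribute, the sketch below computes NDVI and NDWI for one pixel and maps them to a short description. The abstract does not say which indices, thresholds, or text templates SPIE uses, so the choice of NDVI and NDWI, the thresholds, and the function names are all assumptions made for this example.

```python
def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    denom = nir + red
    return 0.0 if denom == 0 else (nir - red) / denom


def ndwi(green: float, nir: float) -> float:
    """Normalized Difference Water Index: (Green - NIR) / (Green + NIR)."""
    denom = green + nir
    return 0.0 if denom == 0 else (green - nir) / denom


def describe_pixel(green: float, red: float, nir: float,
                   veg_thresh: float = 0.3, water_thresh: float = 0.0) -> str:
    """Turn band reflectances into a text attribute.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    v = ndvi(nir, red)
    w = ndwi(green, nir)
    if v > veg_thresh:
        label = "likely vegetation"
    elif w > water_thresh:
        label = "likely water"
    else:
        label = "other"
    return f"NDVI={v:.2f}, NDWI={w:.2f}: {label}"


print(describe_pixel(green=0.08, red=0.10, nir=0.50))
print(describe_pixel(green=0.30, red=0.20, nir=0.10))
```

Strings of this kind are one plausible way for spectral priors to reach a language model as plain text alongside the image.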
Code will be released at: https://github.com/MiliLab/SPEX.[92] EndoMatcher: Generalizable Endoscopic Image Matcher via Multi-Domain Pre-training for Robot-Assisted Surgery
Bingyu Yang,Qingyao Tian,Yimeng Geng,Huai Liao,Xinyan Huang,Jiebo Luo,Hongbin Liu
Main category: cs.CV
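TL;DR: EndoMatcher is a generalizable framework for endoscopic image matching. By building the multi-domain dataset Endo-Mix6 and using a two-branch Vision Transformer with a progressive multi-objective training strategy, it addresses data scarcity and difficult visual conditions and achieves strong zero-shot matching performance.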
Details
Motivation: Generalizable dense feature matching in endoscopic images is crucial for robot-assisted tasks, but it remains challenging because of difficult visual conditions (e.g., weak textures, large viewpoint variations) and scarce annotated data. Method: EndoMatcher uses a two-branch Vision Transformer to extract multi-scale features, with dual interaction blocks for robust correspondence learning. The authors also construct Endo-Mix6, a multi-domain dataset of approximately 1.2M real and synthetic image pairs, and adopt a progressive multi-objective training strategy to improve representation quality across domains. Result: EndoMatcher increases the number of inlier matches by 140.69% and 201.43% on the Hamlyn and Bladder datasets, respectively, and improves MDPA by 9.40% on the Gastro-Matching dataset. Conclusion: EndoMatcher is a highly generalizable endoscopic image matching framework. Large-scale, multi-domain pre-training addresses data scarcity and difficult visual conditions, and the model performs well in a zero-shot fashion on unseen organs and imaging conditions. Abstract: Generalizable dense feature matching in endoscopic images is crucial for robot-assisted tasks, including 3D reconstruction, navigation, and surgical scene understanding. Yet, it remains a challenge due to difficult visual conditions (e.g., weak textures, large viewpoint variations) and a scarcity of annotated data. To address these challenges, we propose EndoMatcher, a generalizable endoscopic image matcher via large-scale, multi-domain data pre-training. To address difficult visual conditions, EndoMatcher employs a two-branch Vision Transformer to extract multi-scale features, enhanced by dual interaction blocks for robust correspondence learning. To overcome data scarcity and improve domain diversity, we construct Endo-Mix6, the first multi-domain dataset for endoscopic matching. Endo-Mix6 consists of approximately 1.2M real and synthetic image pairs across six domains, with correspondence labels generated using Structure-from-Motion and simulated transformations. The diversity and scale of Endo-Mix6 introduce new challenges in training stability due to significant variations in dataset sizes, distribution shifts, and error imbalance. To address them, a progressive multi-objective training strategy is employed to promote balanced learning and improve representation quality across domains. This enables EndoMatcher to generalize across unseen organs and imaging conditions in a zero-shot fashion. 
Extensive zero-shot matching experiments demonstrate that EndoMatcher increases the number of inlier matches by 140.69% and 201.43% on the Hamlyn and Bladder datasets over state-of-the-art methods, respectively, and improves the Matching Direction Prediction Accuracy (MDPA) by 9.40% on the Gastro-Matching dataset, achieving dense and accurate matching under challenging endoscopic conditions. The code is publicly available at https://github.com/Beryl2000/EndoMatcher.[93] VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization
Sihan Yang,Runsen Xu,Chenhang Cui,Tai Wang,Dahua Lin,Jiangmiao Pang
Main category: cs.CV
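TL;DR: This paper proposes VFlowOpt, a framework that introduces importance map derivation and a progressive pruning module to reduce the number of visual tokens in large multimodal models, substantially lowering computational cost and speeding up inference.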
Details
Motivation: Large Multimodal Models (LMMs) excel at vision-language tasks by using many visual tokens for fine-grained visual information, but this token redundancy incurs significant computational cost. Method: VFlowOpt is a token pruning framework that introduces an importance map derivation process and a progressive pruning module with a recycling mechanism. It computes an importance map for image tokens from their attention-derived context relevance and patch-level information entropy, decides which tokens to retain or prune, and aggregates the pruned tokens into recycled tokens to avoid potential information loss. Finally, a visual information flow-guided method, which treats the last token in the LMM as the most representative signal of text-visual interaction, is used to optimize the hyperparameters of the pruning strategy. Result: Experiments show that VFlowOpt can prune 90% of visual tokens while maintaining comparable performance, yielding an 89% reduction in KV-Cache memory and a 3.8x inference speedup. Conclusion: VFlowOpt prunes 90% of visual tokens with comparable performance, reducing KV-Cache memory by 89% and speeding up inference by 3.8x. Abstract: Large Multimodal Models (LMMs) excel in visual-language tasks by leveraging numerous visual tokens for fine-grained visual information, but this token redundancy results in significant computational costs. Previous research aimed at reducing visual tokens during inference typically leverages importance maps derived from attention scores among vision-only tokens or vision-language tokens to prune tokens across one or multiple pruning stages. Despite this progress, pruning frameworks and strategies remain simplistic and insufficiently explored, often resulting in substantial performance degradation. In this paper, we propose VFlowOpt, a token pruning framework that introduces an importance map derivation process and a progressive pruning module with a recycling mechanism. The hyperparameters of its pruning strategy are further optimized by a visual information flow-guided method. Specifically, we compute an importance map for image tokens based on their attention-derived context relevance and patch-level information entropy. We then decide which tokens to retain or prune and aggregate the pruned ones as recycled tokens to avoid potential information loss. Finally, we apply a visual information flow-guided method that regards the last token in the LMM as the most representative signal of text-visual interactions. This method minimizes the discrepancy between token representations in LMMs with and without pruning, thereby enabling superior pruning strategies tailored to different LMMs. 
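The retain, prune, and recycle steps described above can be sketched as a single pruning stage. This is a simplified illustration, not VFlowOpt's actual formulation: the weighted-sum importance score with mixing weight `alpha`, the importance-weighted average used to build one recycled token, and both function names are assumptions for this example, and scores are assumed non-negative.

```python
from typing import List, Sequence, Tuple


def importance_map(relevance: Sequence[float], entropy: Sequence[float],
                   alpha: float = 0.5) -> List[float]:
    """Blend context relevance and patch entropy (assumed form: weighted sum)."""
    return [alpha * r + (1.0 - alpha) * e for r, e in zip(relevance, entropy)]


def prune_with_recycling(tokens: Sequence[Sequence[float]], scores: Sequence[float],
                         keep_ratio: float) -> Tuple[List[List[float]], List[int]]:
    """Keep the top-scoring tokens and merge the rest into one recycled token.

    Returns (kept tokens in original order plus the recycled token, kept indices).
    """
    n_keep = max(1, round(len(tokens) * keep_ratio))
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep_idx = sorted(order[:n_keep])
    drop_idx = order[n_keep:]
    kept = [list(tokens[i]) for i in keep_idx]
    if not drop_idx:
        return kept, keep_idx
    total = sum(scores[i] for i in drop_idx)
    if total > 0:
        weights = [scores[i] / total for i in drop_idx]
    else:
        weights = [1.0 / len(drop_idx)] * len(drop_idx)  # fall back to a plain mean
    dim = len(tokens[0])
    recycled = [sum(w * tokens[i][d] for w, i in zip(weights, drop_idx))
                for d in range(dim)]
    return kept + [recycled], keep_idx


tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
scores = importance_map([0.9, 0.1, 0.3, 0.8], [0.9, 0.1, 0.3, 0.8])
pruned, kept_indices = prune_with_recycling(tokens, scores, keep_ratio=0.5)
print(kept_indices, pruned)
```

A progressive scheme would repeat this step at several layers with different keep ratios; those ratios are the kind of hyperparameter the abstract says is tuned by the visual information flow-guided method.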
Experiments demonstrate that VFlowOpt can prune 90% of visual tokens while maintaining comparable performance, leading to an 89% reduction in KV-Cache memory and 3.8 times faster inference.[94] Textual and Visual Guided Task Adaptation for Source-Free Cross-Domain Few-Shot Segmentation
Jianming Liu,Wenlong Qiu,Haitao Wei
Main category: cs.CV
TL;DR: This paper proposes a source-free Cross-Domain Few-Shot Segmentation method using textual and visual information for better adaptation without source domain data.
Details
Motivation: To address performance degradation in Few-Shot Segmentation caused by domain discrepancies while reducing data transfer and training costs. Method: Task-Specific Attention Adapters are added to a pretrained backbone and trained through Visual-Visual and Text-Visual Embedding Alignment modules to refine prediction masks. Result: The method achieves average segmentation accuracy improvements of 2.18% and 4.11% in the 1-shot and 5-shot settings, respectively, across four cross-domain datasets. Conclusion: The proposed approach outperforms state-of-the-art CD-FSS methods by leveraging multi-modal features without requiring source domain data. Abstract: Few-Shot Segmentation (FSS) aims to segment new objects efficiently from few labeled samples. However, its performance significantly degrades when domain discrepancies exist between training and deployment. Cross-Domain Few-Shot Segmentation (CD-FSS) has been proposed to mitigate such performance degradation. Current CD-FSS methods primarily seek to develop segmentation models on a source domain that are capable of cross-domain generalization. However, driven by escalating concerns over data privacy and the imperative to minimize data transfer and training expenses, the development of source-free CD-FSS approaches has become essential. In this work, we propose a source-free CD-FSS method that leverages both textual and visual information to facilitate target domain task adaptation without requiring source domain data. Specifically, we first append Task-Specific Attention Adapters (TSAA) to the feature pyramid of a pretrained backbone, which adapt multi-level features extracted from the shared pre-trained backbone to the target task. Then, the parameters of the TSAA are trained through a Visual-Visual Embedding Alignment (VVEA) module and a Text-Visual Embedding Alignment (TVEA) module. 
The VVEA module utilizes global-local visual features to align image features across different views, while the TVEA module leverages textual priors from pre-aligned multi-modal features (e.g., from CLIP) to guide cross-modal adaptation. By combining the outputs of these modules through dense comparison operations and subsequent fusion via skip connections, our method produces refined prediction masks. Under both 1-shot and 5-shot settings, the proposed approach achieves average segmentation accuracy improvements of 2.18% and 4.11%, respectively, across four cross-domain datasets, significantly outperforming state-of-the-art CD-FSS methods. Code is available at https://github.com/ljm198134/TVGTANet.[95] ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking
Xiao Wang,Liye Jin,Xufeng Lou,Shiao Wang,Lan Chen,Bo Jiang,Zhipeng Zhang
Main category: cs.CV
TL;DR: This paper proposes ReasoningTrack, a novel vision-language tracking framework based on a pre-trained vision-language model, along with a large-scale benchmark dataset TNLLT, to improve tracking performance by leveraging reasoning and large models.
Details
Motivation: The motivation is to overcome the limitations of existing vision-language tracking methods, which either fuse language and vision features inadequately or fail to fully utilize large models and to provide insights into the model's reasoning process. Method: The paper proposes a reasoning-based vision-language tracking framework built on Qwen2.5-VL and optimized with SFT and GRPO. It also introduces a large-scale benchmark dataset, TNLLT, with 200 video sequences and 20 baseline trackers. Result: The paper delivers the ReasoningTrack framework and the TNLLT dataset, with experiments showing the effectiveness of the reasoning-based natural language generation strategy. Conclusion: The paper concludes that the proposed framework, ReasoningTrack, and the TNLLT dataset advance the field of vision-language tracking, with extensive experiments validating the effectiveness of the approach. Abstract: Vision-language tracking has received increasing attention in recent years, as textual information can effectively address the inflexibility and inaccuracy associated with specifying the target object to be tracked. Existing works either directly fuse the fixed language with vision features or simply modify it using attention; however, their performance is still limited. Recently, some researchers have explored using text generation to adapt to the variations in the target during tracking. However, these works fail to provide insights into the model's reasoning process and do not fully leverage the advantages of large models, which further limits their overall performance. To address the aforementioned issues, this paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, based on a pre-trained vision-language model Qwen2.5-VL. Both SFT (Supervised Fine-Tuning) and reinforcement learning GRPO are used for the optimization of reasoning and language generation. 
We embed the updated language descriptions and feed them into a unified tracking backbone network together with vision features. Then, we adopt a tracking head to predict the specific location of the target object. In addition, we propose a large-scale long-term vision-language tracking benchmark dataset, termed TNLLT, which contains 200 video sequences. Twenty baseline visual trackers are re-trained and evaluated on this dataset, which builds a solid foundation for the vision-language visual tracking task. Extensive experiments on multiple vision-language tracking benchmark datasets validate the effectiveness of our proposed reasoning-based natural language generation strategy. The source code of this paper will be released at https://github.com/Event-AHU/Open_VLTrack.[96] Segmenting the Complex and Irregular in Two-Phase Flows: A Real-World Empirical Study with SAM2
Semanur Küçük,Cosimo Della Santina,Angeliki Laskari
Main category: cs.CV
TL;DR: This paper proposes a fine-tuned Segment Anything Model (SAM v2.1) to accurately segment irregular and non-convex bubble structures in multiphase flows, particularly in air lubrication systems, using as few as 100 annotated images.
Details
Motivation: Traditional approaches and recent learning-based methods assume near-spherical bubble shapes, limiting their effectiveness in regimes where bubbles undergo deformation, coalescence, or breakup. This work aims to overcome this limitation. Method: The task is cast as a transfer learning problem using a fine-tuned Segment Anything Model (SAM v2.1), with as few as 100 annotated images used for training. Result: The model demonstrates the ability to accurately segment amorphous and topologically diverse bubble patches formed due to coalescence, which are challenging for existing methods. Conclusion: This work concludes that fine-tuned Segment Anything Model (SAM v2.1) can accurately segment highly non-convex, irregular bubble structures in multiphase flows, particularly in air lubrication systems. Abstract: Segmenting gas bubbles in multiphase flows is a critical yet unsolved challenge in numerous industrial settings, from metallurgical processing to maritime drag reduction. Traditional approaches-and most recent learning-based methods-assume near-spherical shapes, limiting their effectiveness in regimes where bubbles undergo deformation, coalescence, or breakup. This complexity is particularly evident in air lubrication systems, where coalesced bubbles form amorphous and topologically diverse patches. In this work, we revisit the problem through the lens of modern vision foundation models. We cast the task as a transfer learning problem and demonstrate, for the first time, that a fine-tuned Segment Anything Model SAM v2.1 can accurately segment highly non-convex, irregular bubble structures using as few as 100 annotated images.[97] ArbiViewGen: Controllable Arbitrary Viewpoint Camera Data Generation for Autonomous Driving via Stable Diffusion Models
Yatong Lan,Jingfeng Chen,Yiru Wang,Lei He
Main category: cs.CV
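TL;DR: This paper proposes Arbiviewgen, a diffusion-based framework for arbitrary-viewpoint image generation. Its FAVS and CVC-SSL modules address the lack of ground-truth data for extrapolated views, enabling controllable multi-view image generation without additional sensors.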
Details
Motivation: Arbitrary viewpoint image generation matters for autonomous driving, but the lack of ground-truth data for extrapolated views limits the training of high-fidelity generative models. Method: The paper introduces two key modules, FAVS (Feature-Aware Adaptive View Stitching) and CVC-SSL (Cross-View Consistency Self-Supervised Learning). FAVS uses a hierarchical matching strategy: it establishes coarse geometric correspondences from camera poses, performs fine-grained alignment with improved feature matching algorithms, and identifies high-confidence matching regions via clustering analysis. CVC-SSL has a diffusion model reconstruct the original camera views, which enforces cross-view consistency without supervision from extrapolated data. Result: Arbiviewgen needs only multi-camera images and their associated poses for training, with no additional sensors or depth maps, and is presented as the first method capable of controllable arbitrary-viewpoint image generation across multiple vehicle configurations. Conclusion: Arbiviewgen is a diffusion-based arbitrary-viewpoint image generation framework that requires no ground-truth extrapolated-view data and enables controllable arbitrary-viewpoint camera image generation across multiple vehicle configurations in autonomous driving. Abstract: Arbitrary viewpoint image generation holds significant potential for autonomous driving, yet remains a challenging task due to the lack of ground-truth data for extrapolated views, which hampers the training of high-fidelity generative models. In this work, we propose Arbiviewgen, a novel diffusion-based framework for the generation of controllable camera images from arbitrary points of view. To address the absence of ground-truth data in unseen views, we introduce two key components: Feature-Aware Adaptive View Stitching (FAVS) and Cross-View Consistency Self-Supervised Learning (CVC-SSL). FAVS employs a hierarchical matching strategy that first establishes coarse geometric correspondences using camera poses, then performs fine-grained alignment through improved feature matching algorithms, and identifies high-confidence matching regions via clustering analysis. Building upon this, CVC-SSL adopts a self-supervised training paradigm where the model reconstructs the original camera views from the synthesized stitched images using a diffusion model, enforcing cross-view consistency without requiring supervision from extrapolated data. Our framework requires only multi-camera images and their associated poses for training, eliminating the need for additional sensors or depth maps. To our knowledge, Arbiviewgen is the first method capable of controllable arbitrary view camera image generation in multiple vehicle configurations.[98] Navigating the Trade-off: A Synthesis of Defensive Strategies for Zero-Shot Adversarial Robustness in Vision-Language Models
Zane Xu,Jason Sun
Main category: cs.CV
TL;DR: This paper reviews methods to enhance adversarial robustness in vision-language models while preserving their zero-shot generalization ability, highlighting the evolution of defense strategies and suggesting future research directions.
Details
Motivation: The motivation is to understand and address the trade-off between adversarial robustness and zero-shot generalization in vision-language models like CLIP. Method: The paper synthesizes findings from eight seminal works on zero-shot adversarial robustness in vision-language models, analyzing defense paradigms such as Adversarial Fine-Tuning and Training-Free/Test-Time Defenses. Result: The analysis reveals the evolution of defense methods from alignment-preserving techniques to embedding space re-engineering and latent-space purification approaches. Conclusion: The paper concludes that improving adversarial robustness while maintaining zero-shot generalization in VLMs is challenging and that future research should explore hybrid defense strategies and adversarial pre-training. Abstract: This report synthesizes eight seminal papers on the zero-shot adversarial robustness of vision-language models (VLMs) like CLIP. A central challenge in this domain is the inherent trade-off between enhancing adversarial robustness and preserving the model's zero-shot generalization capabilities. We analyze two primary defense paradigms: Adversarial Fine-Tuning (AFT), which modifies model parameters, and Training-Free/Test-Time Defenses, which preserve them. We trace the evolution from alignment-preserving methods (TeCoA) to embedding space re-engineering (LAAT, TIMA), and from input heuristics (AOM, TTC) to latent-space purification (CLIPure). Finally, we identify key challenges and future directions including hybrid defense strategies and adversarial pre-training.[99] RegionMed-CLIP: A Region-Aware Multimodal Contrastive Learning Pre-trained Model for Medical Image Understanding
Tianchen Fang,Guiru Liu
Main category: cs.CV
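TL;DR: To address scarce annotated data and global features that overlook local pathological signals in medical image understanding, the paper proposes the RegionMed-CLIP framework and the MedRegion-500k dataset, which significantly improve multimodal medical image understanding.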
Details
Motivation: Medical image understanding faces two main challenges: limited high-quality annotated data and an overreliance on global image features that miss subtle but clinically important pathological regions. Method: The paper proposes RegionMed-CLIP, a region-aware multimodal contrastive learning framework with a novel region-of-interest (ROI) processor and a progressive training strategy, and constructs MedRegion-500k, a large-scale medical image-text dataset. Result: Extensive experiments on image-text retrieval, zero-shot classification, and visual question answering show that RegionMed-CLIP significantly outperforms existing state-of-the-art vision-language models. Conclusion: By introducing a region-aware contrastive learning framework and the MedRegion-500k dataset, RegionMed-CLIP addresses key problems in medical image understanding and provides a strong foundation for multimodal medical image understanding. Abstract: Medical image understanding plays a crucial role in enabling automated diagnosis and data-driven clinical decision support. However, its progress is impeded by two primary challenges: the limited availability of high-quality annotated medical data and an overreliance on global image features, which often miss subtle but clinically significant pathological regions. To address these issues, we introduce RegionMed-CLIP, a region-aware multimodal contrastive learning framework that explicitly incorporates localized pathological signals along with holistic semantic representations. The core of our method is an innovative region-of-interest (ROI) processor that adaptively integrates fine-grained regional features with the global context, supported by a progressive training strategy that enhances hierarchical multimodal alignment. To enable large-scale region-level representation learning, we construct MedRegion-500k, a comprehensive medical image-text corpus that features extensive regional annotations and multilevel clinical descriptions. Extensive experiments on image-text retrieval, zero-shot classification, and visual question answering tasks demonstrate that RegionMed-CLIP consistently exceeds state-of-the-art vision language models by a wide margin. Our results highlight the critical importance of region-aware contrastive pre-training and position RegionMed-CLIP as a robust foundation for advancing multimodal medical image understanding.[100] A Study of Gender Classification Techniques Based on Iris Images: A Deep Survey and Analysis
Basna Mohammed Salih Hasan,Ramadhan J. Mstafa
Main category: cs.CV
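TL;DR: This paper surveys gender classification methods, with a focus on iris-based studies, analyzes their current applications and challenges, and proposes directions for future research.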
Details
Motivation: Gender classification is widely used in surveillance, human-computer interaction, and identity recognition, and the iris is a stable, non-invasive, and highly distinctive biometric trait, so its use for gender classification is worth studying. Method: The paper reviews existing gender classification methods, focusing on those based on iris images, and analyzes their strengths and weaknesses. Result: The paper provides a comprehensive analysis of existing gender classification methods and points out shortcomings in current research, giving a reference for follow-up work. Conclusion: The paper summarizes existing gender classification methods, especially those based on iris features, identifies the gaps and challenges in the field, and offers suggestions and directions for future research. Abstract: Gender classification is attractive in a range of applications, including surveillance and monitoring, corporate profiling, and human-computer interaction. Individuals' identities may be gleaned from information about their gender, which is a kind of soft biometric. Over the years, several methods for determining a person's gender have been devised. Some of the most well-known ones are based on physical characteristics like face, fingerprint, palmprint, DNA, ears, gait, and iris. On the other hand, facial features account for the vast majority of gender classification methods. Also, the iris is a significant biometric trait because the iris, according to research, remains basically constant during an individual's life. Besides that, the iris is externally visible and is non-invasive to the user, which is important for practical applications. Furthermore, there are already high-quality methods for segmenting and encoding iris images, and the current methods facilitate selecting and extracting attribute vectors from iris textures. This study discusses several approaches to determining gender. The previous literature is briefly reviewed. Additionally, there are a variety of methodologies for different steps of gender classification. This study provides researchers with knowledge and analysis of the existing gender classification approaches. Also, it will assist researchers who are interested in this specific area, as well as highlight the gaps and challenges in the field, and finally provide suggestions and future paths for improvement.[101] CF3: Compact and Fast 3D Feature Fields
Hyunjoon Lee,Joonkyu Min,Jaesik Park
Main category: cs.CV
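TL;DR: The paper proposes CF3, a top-down method for constructing 3D Gaussian feature fields that performs a fast weighted fusion of multi-view 2D features with pre-trained Gaussians and introduces adaptive sparsification, yielding a more compact and faster 3D feature field representation.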
Details
Motivation: When 3D Gaussian Splatting (3DGS) incorporates rich information from 2D foundation models, it usually relies on a bottom-up optimization process, which increases computational cost. Method: The method first performs a fast weighted fusion of multi-view 2D features with pre-trained Gaussians, then trains a per-Gaussian autoencoder directly on the lifted features, and introduces an adaptive sparsification method that optimizes Gaussian attributes while pruning and merging redundant Gaussians. Result: Compared with Feature-3DGS, the method builds a competitive 3D feature field using as little as 5% of the Gaussians. Conclusion: CF3 offers an efficient and compact way to construct 3D Gaussian feature fields, reducing computational cost while preserving geometric details. Abstract: 3D Gaussian Splatting (3DGS) has begun incorporating rich information from 2D foundation models. However, most approaches rely on a bottom-up optimization process that treats raw 2D features as ground truth, incurring increased computational costs. We propose a top-down pipeline for constructing compact and fast 3D Gaussian feature fields, namely, CF3. We first perform a fast weighted fusion of multi-view 2D features with pre-trained Gaussians. This approach enables training a per-Gaussian autoencoder directly on the lifted features, instead of training autoencoders in the 2D domain. As a result, the autoencoder better aligns with the feature distribution. More importantly, we introduce an adaptive sparsification method that optimizes the Gaussian attributes of the feature field while pruning and merging the redundant Gaussians, constructing an efficient representation with preserved geometric details. Our approach achieves a competitive 3D feature field using as little as 5% of the Gaussians compared to Feature-3DGS.[102] Robust Tracking with Particle Filtering for Fluorescent Cardiac Imaging
Suresh Guttikonda,Maximilian Neidhart,Johanna Sprenger,Johannes Petersen,Christian Detter,Alexander Schlaefer
Main category: cs.CV
TL;DR: The paper proposes a new particle filtering tracker that enables accurate, real-time image tracking during cardiac surgery.
Details
Motivation: Traditional tracking methods are limited in cardiac surgical imaging by heart motion and changes in vessel structure, so a more robust and accurate tracking method is needed. Method: A particle filtering tracker based on cyclic-consistency checks is proposed; it tracks sampled particles that follow target landmarks and provides real-time estimates during cardiac imaging. Result: The method tracks 117 targets simultaneously at 25.4 fps with a tracking error of (5.00 +/- 0.22 px), significantly outperforming deep learning and conventional trackers. Conclusion: The paper presents a particle filtering tracker based on cyclic-consistency checks for real-time tracking in fluorescent cardiac imaging, with higher accuracy and better real-time performance than other methods. Abstract: Intraoperative fluorescent cardiac imaging enables quality control following coronary bypass grafting surgery. We can estimate local quantitative indicators, such as cardiac perfusion, by tracking local feature points. However, heart motion and significant fluctuations in image characteristics caused by vessel structural enrichment limit traditional tracking methods. We propose a particle filtering tracker based on cyclic-consistency checks to robustly track particles sampled to follow target landmarks. Our method tracks 117 targets simultaneously at 25.4 fps, allowing real-time estimates during interventions. It achieves a tracking error of (5.00 +/- 0.22 px) and outperforms other deep learning trackers (22.3 +/- 1.1 px) and conventional trackers (58.1 +/- 27.1 px).[103] SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion
Xiaoyang Zhang,Zhen Hua,Yakun Ju,Wei Zhou,Jun Liu,Alex C. Kot
Main category: cs.CV
TL;DR: SGDFuse is a new method for infrared and visible image fusion that leverages a conditional diffusion model guided by the Segment Anything Model (SAM) to improve semantic understanding and preserve key targets, achieving state-of-the-art results.
Details
Motivation: Existing IVIF methods often fail to preserve key targets due to a lack of deep semantic understanding, and can introduce artifacts and detail loss. SGDFuse aims to overcome these issues by incorporating semantic understanding into the fusion process. Method: SGDFuse utilizes high-quality semantic masks from SAM as explicit priors to guide the optimization of the fusion process via a conditional diffusion model in a two-stage process. Result: Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, along with adaptability to downstream visual tasks. Conclusion: SGDFuse provides a powerful solution to the core challenges in image fusion by achieving high-fidelity and semantically-aware fusion with the help of a conditional diffusion model guided by SAM. Abstract: Infrared and visible image fusion (IVIF) aims to combine the thermal radiation information from infrared images with the rich texture details from visible images to enhance perceptual capabilities for downstream visual tasks. However, existing methods often fail to preserve key targets due to a lack of deep semantic understanding of the scene, while the fusion process itself can also introduce artifacts and detail loss, severely compromising both image quality and task performance. To address these issues, this paper proposes SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), to achieve high-fidelity and semantically-aware image fusion. The core of our method is to utilize high-quality semantic masks generated by SAM as explicit priors to guide the optimization of the fusion process via a conditional diffusion model. 
Specifically, the framework operates in a two-stage process: it first performs a preliminary fusion of multi-modal features, and then utilizes the semantic masks from SAM jointly with the preliminary fused image as a condition to drive the diffusion model's coarse-to-fine denoising generation. This ensures the fusion process not only has explicit semantic directionality but also guarantees the high fidelity of the final result. Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, as well as in its adaptability to downstream tasks, providing a powerful solution to the core challenges in image fusion. The code of SGDFuse is available at https://github.com/boshizhang123/SGDFuse.[104] B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding
Changho Choi,Youngwoo Shin,Gyojin Han,Dong-Jae Lee,Junmo Kim
Main category: cs.CV
TL;DR: This paper introduces B4DL, a benchmark and model for processing 4D LiDAR data in Multimodal Large Language Models to improve spatio-temporal reasoning in dynamic outdoor environments.
Details
Motivation: The motivation stems from the lack of high-quality annotations and suitable MLLM architectures for handling 4D LiDAR data, which limits its application in Multimodal Large Language Models. Method: The paper introduces a scalable data generation pipeline and an MLLM model that directly processes raw 4D LiDAR data, integrating it with language understanding. Result: The result is the creation of the B4DL benchmark, a new dataset, and an MLLM model that effectively processes 4D LiDAR for spatio-temporal reasoning in outdoor environments. Conclusion: The paper concludes that the proposed B4DL benchmark and the MLLM model offer a unified solution for spatio-temporal reasoning in dynamic outdoor environments by processing 4D LiDAR data. Abstract: Understanding dynamic outdoor environments requires capturing complex object interactions and their evolution over time. LiDAR-based 4D point clouds provide precise spatial geometry and rich temporal cues, making them ideal for representing real-world scenes. However, despite their potential, 4D LiDAR remains underexplored in the context of Multimodal Large Language Models (MLLMs) due to the absence of high-quality, modality-specific annotations and the lack of MLLM architectures capable of processing its high-dimensional composition. To address these challenges, we introduce B4DL, a new benchmark specifically designed for training and evaluating MLLMs on 4D LiDAR understanding. In addition, we propose a scalable data generation pipeline and an MLLM model that, for the first time, directly processes raw 4D LiDAR by bridging it with language understanding. Combined with our dataset and benchmark, our model offers a unified solution for spatio-temporal reasoning in dynamic outdoor environments. We provide rendered 4D LiDAR videos, generated dataset, and inference outputs on diverse scenarios at: https://mmb4dl.github.io/mmb4dl/[105] Wavelet-Guided Dual-Frequency Encoding for Remote Sensing Change Detection
Xiaoyang Zhang,Guodong Fan,Guang-Yong Chen,Zhen Hua,Jinjiang Li,Min Gan,C. L. Philip Chen
Main category: cs.CV
TL;DR: This paper proposes the Wavelet-Guided Dual-Frequency Encoding (WGDF) method for change detection in remote sensing imagery, combining wavelet domain feature modeling with deep learning to improve detection accuracy and robustness.
Details
Motivation: Traditional spatial-domain methods limit feature representation diversity, making it difficult to detect subtle changes in remote sensing imagery. Method: Wavelet-Guided Dual-Frequency Encoding (WGDF) combines Discrete Wavelet Transform (DWT) with a Dual-Frequency Feature Enhancement (DFFE) module and a Frequency-Domain Interactive Difference (FDID) module for high-frequency features, and Transformers with a Progressive Contextual Difference Module (PCDM) for low-frequency features. Result: Extensive experiments show that WGDF outperforms state-of-the-art methods in detection accuracy and robustness, particularly in alleviating edge ambiguity. Conclusion: The proposed WGDF method enhances change detection in remote sensing imagery by synergistically modeling high- and low-frequency components in the wavelet domain, leading to improved accuracy and robustness. Abstract: Change detection in remote sensing imagery plays a vital role in various engineering applications, such as natural disaster monitoring, urban expansion tracking, and infrastructure management. Despite the remarkable progress of deep learning in recent years, most existing methods still rely on spatial-domain modeling, where the limited diversity of feature representations hinders the detection of subtle change regions. We observe that frequency-domain feature modeling, particularly in the wavelet domain, can amplify fine-grained differences in frequency components, enhancing the perception of edge changes that are challenging to capture in the spatial domain. Thus, we propose a method called Wavelet-Guided Dual-Frequency Encoding (WGDF). Specifically, we first apply Discrete Wavelet Transform (DWT) to decompose the input images into high-frequency and low-frequency components, which are used to model local details and global structures, respectively. 
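As a rough illustration of the decomposition step described above, the following sketch applies one level of a 2D Haar wavelet transform to a small image. The Haar wavelet, the function name `haar_dwt2`, and the sub-band names are assumptions made for this example; the abstract does not state which wavelet or implementation WGDF uses.

```python
def haar_dwt2(image):
    """One level of an orthonormal 2D Haar DWT on a list-of-lists image.

    Returns (low, (col_diff, row_diff, diag_diff)): the low-frequency
    approximation and three high-frequency detail sub-bands. The image
    must have an even number of rows and columns.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    low, col_diff, row_diff, diag_diff = [], [], [], []
    for r in range(0, rows, 2):
        low_row, cd_row, rd_row, dd_row = [], [], [], []
        for c in range(0, cols, 2):
            # The four pixels of the current 2x2 block.
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            low_row.append((a + b + d + e) / 2.0)  # local average (global structure)
            cd_row.append((a - b + d - e) / 2.0)   # left column minus right column
            rd_row.append((a + b - d - e) / 2.0)   # top row minus bottom row
            dd_row.append((a - b - d + e) / 2.0)   # diagonal difference
        low.append(low_row)
        col_diff.append(cd_row)
        row_diff.append(rd_row)
        diag_diff.append(dd_row)
    return low, (col_diff, row_diff, diag_diff)


# A 4x4 image with a vertical edge between the first and second columns.
image = [[0, 1, 1, 1] for _ in range(4)]
low, (col_diff, row_diff, diag_diff) = haar_dwt2(image)
```

In this toy input the edge appears only in `col_diff`, while `low` keeps the coarse brightness layout, which is the kind of separation between local detail and global structure that the abstract relies on.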
In the high-frequency branch, we design a Dual-Frequency Feature Enhancement (DFFE) module to strengthen edge detail representation and introduce a Frequency-Domain Interactive Difference (FDID) module to enhance the modeling of fine-grained changes. In the low-frequency branch, we exploit Transformers to capture global semantic relationships and employ a Progressive Contextual Difference Module (PCDM) to progressively refine change regions, enabling precise structural semantic characterization. Finally, the high- and low-frequency features are synergistically fused to unify local sensitivity with global discriminability. Extensive experiments on multiple remote sensing datasets demonstrate that WGDF significantly alleviates edge ambiguity and achieves superior detection accuracy and robustness compared to state-of-the-art methods. The code will be available at https://github.com/boshizhang123/WGDF.[106] VS-LLM: Visual-Semantic Depression Assessment based on LLM for Drawing Projection Test
Meiqi Wu,Yaxuan Kang,Xuchen Li,Shiyu Hu,Xiaotang Chen,Yunfeng Kang,Weiqiang Wang,Kaiqi Huang
Main category: cs.CV
TL;DR: This study introduces a Visual-Semantic depression assessment method based on LLM (VS-LLM) for the automated analysis of PPAT sketches, significantly improving depression assessment efficiency and accuracy.
Details
Motivation: The interpretation of PPAT sketches for depression assessment is laborious and experience-dependent; this work aims to automate and enhance this process. Method: The Visual-Semantic depression assessment based on LLM (VS-LLM) method was developed to automate the analysis of PPAT sketches, addressing challenges like low drawing accuracy and lack of detailed depiction. Result: Experimental results showed a 17.6% improvement in performance compared to traditional psychologist assessment methods. Conclusion: The proposed VS-LLM method enhances the efficiency and accuracy of depression assessment using PPAT sketches, marking a significant contribution to mental state evaluation research. Abstract: The Drawing Projection Test (DPT) is an essential tool in art therapy, allowing psychologists to assess participants' mental states through their sketches. Specifically, sketches on the theme of "a person picking an apple from a tree (PPAT)" can reveal whether participants are in mental states such as depression. Compared with scales, the DPT can enrich psychologists' understanding of an individual's mental state. However, the interpretation of the PPAT is laborious and depends on the experience of the psychologists. To address this issue, we propose an effective identification method to support psychologists in conducting a large-scale automatic DPT. Unlike traditional sketch recognition, the DPT focuses more on the overall evaluation of the sketches, such as color usage and space utilization. Moreover, PPAT imposes a time limit and prohibits verbal reminders, resulting in low drawing accuracy and a lack of detailed depiction. 
To address these challenges, we make the following contributions: (1) an experimental environment for automated analysis of PPAT sketches for depression assessment; (2) a Visual-Semantic depression assessment based on LLM (VS-LLM) method; (3) experimental results demonstrating that our method improves by 17.6% compared to the psychologist assessment method. We anticipate that this work will contribute to research on mental state assessment based on element recognition in PPAT sketches. Our datasets and codes are available at https://github.com/wmeiqi/VS-LLM.[107] CoCAViT: Compact Vision Transformer with Robust Global Coordination
Xuyang Wang,Lingjuan Miao,Zhiqiang Zhou
Main category: cs.CV
TL;DR: This paper introduces CoCAViT, a visual backbone that enhances the generalization performance of smaller models on out-of-distribution data using a novel Coordinator-patch Cross Attention mechanism.
Details
Motivation: The motivation stems from the observation that smaller models suffer a significant performance drop on out-of-distribution data compared to larger models, indicating a lack of generalization capability in existing efficient architectures. Method: The paper identifies architectural bottlenecks in smaller models and proposes a Coordinator-patch Cross Attention (CoCA) mechanism to enhance local-global feature modeling. The CoCAViT model is developed based on these improvements. Result: CoCAViT-28M achieves 84.0% top-1 accuracy on ImageNet-1K, with significant improvements on multiple OOD benchmarks, 52.2 mAP on COCO object detection, and 51.3 mIOU on ADE20K semantic segmentation, all while maintaining low latency. Conclusion: CoCAViT is a novel visual backbone that addresses the architectural bottlenecks of smaller models, improving their generalization performance on out-of-distribution data with minimal computational overhead. Abstract: In recent years, large-scale visual backbones have demonstrated remarkable capabilities in learning general-purpose features from images via extensive pre-training. Concurrently, many efficient architectures have emerged that have performance comparable to that of larger models on in-domain benchmarks. However, we observe that for smaller models, the performance drop on out-of-distribution (OOD) data is disproportionately larger, indicating a deficiency in the generalization performance of existing efficient models. To address this, we identify key architectural bottlenecks and inappropriate design choices that contribute to this issue, retaining robustness for smaller models. To restore the global field of pure window attention, we further introduce a Coordinator-patch Cross Attention (CoCA) mechanism, featuring dynamic, domain-aware global tokens that enhance local-global feature modeling and adaptively capture robust patterns across domains with minimal computational overhead. 
Integrating these advancements, we present CoCAViT, a novel visual backbone designed for robust real-time visual representation. Extensive experiments empirically validate our design. At a resolution of 224*224, CoCAViT-28M achieves 84.0% top-1 accuracy on ImageNet-1K, with significant gains on multiple OOD benchmarks, compared to competing models. It also attains 52.2 mAP on COCO object detection and 51.3 mIOU on ADE20K semantic segmentation, while maintaining low latency.[108] mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering
Xu Yuan,Liangbo Ning,Wenqi Fan,Qing Li
Main category: cs.CV
TL;DR: The paper proposes mKG-RAG, a novel framework that integrates structured multimodal knowledge graphs into RAG-based Visual Question Answering, enhancing accuracy and reliability by leveraging semantic and modality-aligned knowledge.
Details
Motivation: The motivation is to address the limitations of vanilla RAG-based VQA methods that rely on unstructured documents and often introduce irrelevant or misleading content, thus reducing answer accuracy and reliability. Method: The method involves constructing high-quality multimodal knowledge graphs using MLLM-powered keyword extraction and vision-text matching, followed by a dual-stage retrieval strategy with a question-aware multimodal retriever to enhance retrieval efficiency and precision. Result: The proposed mKG-RAG framework significantly outperforms existing methods, setting a new state-of-the-art for knowledge-based Visual Question Answering tasks. Conclusion: The paper concludes that integrating multimodal knowledge graphs into RAG-based VQA frameworks enhances generation quality by providing structured knowledge, leading to improved accuracy and reliability in knowledge-intensive VQA tasks. Abstract: Recently, Retrieval-Augmented Generation (RAG) has been proposed to expand internal knowledge of Multimodal Large Language Models (MLLMs) by incorporating external knowledge databases into the generation process, which is widely used for knowledge-based Visual Question Answering (VQA) tasks. Despite impressive advancements, vanilla RAG-based VQA methods that rely on unstructured documents and overlook the structural relationships among knowledge elements frequently introduce irrelevant or misleading content, reducing answer accuracy and reliability. To overcome these challenges, a promising solution is to integrate multimodal knowledge graphs (KGs) into RAG-based VQA frameworks to enhance the generation by introducing structured multimodal knowledge. Therefore, in this paper, we propose a novel multimodal knowledge-augmented generation framework (mKG-RAG) based on multimodal KGs for knowledge-intensive VQA tasks. 
Specifically, our approach leverages MLLM-powered keyword extraction and vision-text matching to distill semantically consistent and modality-aligned entities/relationships from multimodal documents, constructing high-quality multimodal KGs as structured knowledge representations. In addition, a dual-stage retrieval strategy equipped with a question-aware multimodal retriever is introduced to improve retrieval efficiency while refining precision. Comprehensive experiments demonstrate that our approach significantly outperforms existing methods, setting a new state-of-the-art for knowledge-based VQA.[109] Textual Inversion for Efficient Adaptation of Open-Vocabulary Object Detectors Without Forgetting
Frank Ruis,Gertjan Burghouts,Hugo Kuijf
Main category: cs.CV
TL;DR: The paper proposes a method inspired by Textual Inversion for open-vocabulary object detection, extending the model's vocabulary without losing its original capabilities.
Details
Motivation: Although large pre-trained vision language models have strong zero-shot capabilities, fine-tuning is still necessary for optimal performance on specific targets. However, this often leads to the loss of original capabilities. Method: Inspired by Textual Inversion (TI), the method learns new or improves existing tokens to accurately detect novel or fine-grained objects, with the tokens being compatible with the original VLM weights. Result: The method was evaluated to determine if it matches or outperforms baseline methods across a wide variety of experiments, showing its effectiveness in avoiding forgetting. Conclusion: The proposed method successfully extends the VLM vocabulary while maintaining the original model's performance and capabilities, offering a more efficient alternative to full-model fine-tuning. Abstract: Recent progress in large pre-trained vision language models (VLMs) has reached state-of-the-art performance on several object detection benchmarks and boasts strong zero-shot capabilities, but for optimal performance on specific targets some form of finetuning is still necessary. While the initial VLM weights allow for great few-shot transfer learning, this usually involves the loss of the original natural language querying and zero-shot capabilities. Inspired by the success of Textual Inversion (TI) in personalizing text-to-image diffusion models, we propose a similar formulation for open-vocabulary object detection. TI allows extending the VLM vocabulary by learning new or improving existing tokens to accurately detect novel or fine-grained objects from as little as three examples. The learned tokens are completely compatible with the original VLM weights while keeping them frozen, retaining the original model's benchmark performance, and leveraging its existing capabilities such as zero-shot domain transfer (e.g., detecting a sketch of an object after training only on real photos). 
The storage and gradient calculations are limited to the token embedding dimension, requiring significantly less compute than full-model fine-tuning. We evaluated whether the method matches or outperforms the baseline methods that suffer from forgetting in a wide variety of quantitative and qualitative experiments.[110] 3DGabSplat: 3D Gabor Splatting for Frequency-adaptive Radiance Field Rendering
Junyu Zhou,Yuyang Huang,Wenrui Dai,Junni Zou,Ziyang Zheng,Nuowen Kan,Chenglin Li,Hongkai Xiong
Main category: cs.CV
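TL;DR: This paper proposes 3DGabSplat, which improves the efficiency and quality of 3D scene representation through a novel 3D Gabor-based primitive and a frequency response mechanism.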
Details
Motivation: Because 3D Gaussian Splatting (3DGS) uses the Gaussian function, which is low-pass by nature, it struggles to represent high-frequency details in 3D scenes, and it also suffers from redundant primitives, inefficient training and rendering, and excessive memory overhead. Method: A novel 3D Gabor-based primitive is proposed that combines multiple directional 3D frequency responses with a filter bank and represents the radiance field under multi-view image supervision. An efficient CUDA rasterizer and a frequency-adaptive mechanism are developed, and the primitive can serve as a plug-and-play kernel. Result: 3DGabSplat outperforms 3DGS and its variants on both real-world and synthetic scenes, with a PSNR gain of up to 1.35 dB together with fewer primitives and lower memory consumption. Conclusion: 3DGabSplat overcomes the limitations of 3DGS in representing high-frequency details, in training and rendering efficiency, and in memory overhead, and achieves state-of-the-art rendering quality in novel view synthesis. Abstract: Recent prominence in 3D Gaussian Splatting (3DGS) has enabled real-time rendering while maintaining high-fidelity novel view synthesis. However, 3DGS resorts to the Gaussian function that is low-pass by nature and is restricted in representing high-frequency details in 3D scenes. Moreover, it causes redundant primitives with degraded training and rendering efficiency and excessive memory overhead. To overcome these limitations, we propose 3D Gabor Splatting (3DGabSplat) that leverages a novel 3D Gabor-based primitive with multiple directional 3D frequency responses for radiance field representation supervised by multi-view images. The proposed 3D Gabor-based primitive forms a filter bank incorporating multiple 3D Gabor kernels at different frequencies to enhance flexibility and efficiency in capturing fine 3D details. Furthermore, to achieve novel view rendering, an efficient CUDA-based rasterizer is developed to project the multiple directional 3D frequency components characterized by 3D Gabor-based primitives onto the 2D image plane, and a frequency-adaptive mechanism is presented for adaptive joint optimization of primitives. 3DGabSplat is scalable to be a plug-and-play kernel for seamless integration into existing 3DGS paradigms to enhance both efficiency and quality of novel view synthesis. Extensive experiments demonstrate that 3DGabSplat outperforms 3DGS and its variants using alternative primitives, and achieves state-of-the-art rendering quality across both real-world and synthetic scenes. 
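The low-pass argument above can be made concrete with a simplified 1D sketch: a Gaussian kernel barely responds to a high-frequency cosine, whereas a Gabor kernel (the same Gaussian envelope multiplied by a cosine carrier) tuned to that frequency responds strongly. The 1D setting, the function names, and the parameter values are assumptions for illustration only; the paper's primitive is a 3D filter bank whose exact form is not given in the abstract.

```python
import math


def gaussian(x, sigma):
    """Unnormalized Gaussian envelope."""
    return math.exp(-x * x / (2.0 * sigma * sigma))


def gabor(x, sigma, freq):
    """1D Gabor kernel: Gaussian envelope times a cosine carrier."""
    return gaussian(x, sigma) * math.cos(2.0 * math.pi * freq * x)


def response(kernel, signal_freq, half_width=8.0, step=0.01):
    """Riemann-sum approximation of the integral of kernel(x) * cos(2*pi*f*x)."""
    n = int(round(half_width / step))
    total = 0.0
    for i in range(-n, n + 1):
        x = i * step
        total += kernel(x) * math.cos(2.0 * math.pi * signal_freq * x) * step
    return total


sigma, freq = 1.0, 2.0
gaussian_response = response(lambda x: gaussian(x, sigma), freq)
gabor_response = response(lambda x: gabor(x, sigma, freq), freq)
```

With these values the Gaussian response is numerically negligible, while the Gabor response is close to half the area of the envelope, `0.5 * sigma * sqrt(2 * pi)`.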
Remarkably, we achieve up to 1.35 dB PSNR gain over 3DGS with simultaneously reduced number of primitives and memory consumption.[111] PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation
Kang Liu,Zhuoqi Ma,Zikang Fang,Yunan Li,Kun Xie,Qiguang Miao
Main category: cs.CV
TL;DR: PriorRG is a chest X-ray report generation framework that uses patient-specific prior knowledge to reduce radiologists' workload.
Details
Motivation: Patient-specific prior knowledge (such as symptoms, medical history, and the most recent prior image) is essential for diagnostic reasoning, yet most existing methods neglect it. Method: PriorRG uses a two-stage training pipeline: Stage 1 introduces a prior-guided contrastive pre-training scheme, and Stage 2 proposes a prior-aware coarse-to-fine decoding for report generation. Result: Experiments on the MIMIC-CXR and MIMIC-ABN datasets show that PriorRG outperforms state-of-the-art methods, with improvements of 3.6% in BLEU-4 and 3.8% in F1 score on MIMIC-CXR and 5.9% in BLEU-1 on MIMIC-ABN. Conclusion: PriorRG makes effective use of patient-specific prior knowledge and improves the clinical accuracy and fluency of the generated reports. Abstract: Chest X-ray report generation aims to reduce radiologists' workload by automatically producing high-quality preliminary reports. A critical yet underexplored aspect of this task is the effective use of patient-specific prior knowledge -- including clinical context (e.g., symptoms, medical history) and the most recent prior image -- which radiologists routinely rely on for diagnostic reasoning. Most existing methods generate reports from single images, neglecting this essential prior information and thus failing to capture diagnostic intent or disease progression. To bridge this gap, we propose PriorRG, a novel chest X-ray report generation framework that emulates real-world clinical workflows via a two-stage training pipeline. In Stage 1, we introduce a prior-guided contrastive pre-training scheme that leverages clinical context to guide spatiotemporal feature extraction, allowing the model to align more closely with the intrinsic spatiotemporal semantics in radiology reports. In Stage 2, we present a prior-aware coarse-to-fine decoding for report generation that progressively integrates patient-specific prior knowledge with the vision encoder's hidden states. This decoding allows the model to align with diagnostic focus and track disease progression, thereby enhancing the clinical accuracy and fluency of the generated reports. Extensive experiments on MIMIC-CXR and MIMIC-ABN datasets demonstrate that PriorRG outperforms state-of-the-art methods, achieving a 3.6% BLEU-4 and 3.8% F1 score improvement on MIMIC-CXR, and a 5.9% BLEU-1 gain on MIMIC-ABN. 
Code and checkpoints will be released upon acceptance.[112] Cross-View Localization via Redundant Sliced Observations and A-Contrario Validation
Yongjun Zhang,Mingtao Xiong,Yi Wan,Gui-Song Xia
Main category: cs.CV
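TL;DR: Slice-Loc is a new CVL method that splits the query image into slices and filters out erroneous poses with a geometric rigidity formula, improving localization accuracy and reliability.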
Details
Motivation: CVL methods lack redundant observations; the work addresses this so that localization reliability can be assessed through the mutual validation of observational data. Method: Slice-Loc works in two stages: it first divides the query image into sub-images and estimates a 3-DoF pose for each slice, then filters out erroneous poses with a geometric rigidity formula and merges the inliers into the final camera pose. Result: In cross-city tests on the DReSS dataset, Slice-Loc reduces the mean localization error from 4.47 m to 1.86 m and the mean orientation error from 3.42° to 1.24°. Conclusion: By introducing redundant observations and a geometric rigidity formula, Slice-Loc improves the accuracy and reliability of CVL. Abstract: Cross-view localization (CVL) matches ground-level images with aerial references to determine the geo-position of a camera, enabling smart vehicles to self-localize offline in GNSS-denied environments. However, most CVL methods output only a single observation, the camera pose, and lack the redundant observations required by surveying principles, making it challenging to assess localization reliability through the mutual validation of observational data. To tackle this, we introduce Slice-Loc, a two-stage method featuring an a-contrario reliability validation for CVL. Instead of using the query image as a single input, Slice-Loc divides it into sub-images and estimates the 3-DoF pose for each slice, creating redundant and independent observations. Then, a geometric rigidity formula is proposed to filter out the erroneous 3-DoF poses, and the inliers are merged to generate the final camera pose. Furthermore, we propose a model that quantifies the meaningfulness of localization by estimating the number of false alarms (NFA), according to the distribution of the locations of the sliced images. By eliminating gross errors, Slice-Loc boosts localization accuracy and effectively detects failures. After filtering out mislocalizations, Slice-Loc reduces the proportion of errors exceeding 10 m to under 3%. In cross-city tests on the DReSS dataset, Slice-Loc cuts the mean localization error from 4.47 m to 1.86 m and the mean orientation error from 3.42° to 1.24°, outperforming state-of-the-art methods. 
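To show why redundant per-slice observations help, the sketch below merges several 2D position estimates by rejecting those far from the coordinate-wise median and averaging the rest. This median-based rule is a generic stand-in chosen for illustration; it is not the paper's geometric rigidity formula or its NFA model, and the function name, threshold, and sample coordinates are hypothetical.

```python
import math
from statistics import median


def merge_slice_estimates(estimates, max_dist):
    """Merge redundant (x, y) position estimates, one per image slice.

    Estimates farther than max_dist from the coordinate-wise median are
    treated as gross errors. Returns (merged_position, inlier_count), or
    (None, 0) when no estimate agrees with the median.
    """
    mx = median(p[0] for p in estimates)
    my = median(p[1] for p in estimates)
    inliers = [p for p in estimates
               if math.hypot(p[0] - mx, p[1] - my) <= max_dist]
    if not inliers:
        return None, 0
    merged = (sum(p[0] for p in inliers) / len(inliers),
              sum(p[1] for p in inliers) / len(inliers))
    return merged, len(inliers)


# Hypothetical slice estimates in meters: three agree, one is a gross error.
slices = [(10.0, 5.0), (10.4, 5.2), (9.6, 4.8), (60.0, -30.0)]
position, inlier_count = merge_slice_estimates(slices, max_dist=2.0)
```

A single-observation method would have no way to flag the fourth estimate; with several slices, disagreement among them becomes a usable reliability signal.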
Code and dataset will be available at: https://github.com/bnothing/Slice-Loc.
[113] CT-GRAPH: Hierarchical Graph Attention Network for Anatomy-Guided CT Report Generation
Hamza Kalisch,Fabian Hörst,Jens Kleesiek,Ken Herrmann,Constantin Seibold
Main category: cs.CV
TL;DR: CT-GRAPH introduces a hierarchical graph attention network to improve the automation of radiology report generation by capturing fine-grained organ relationships, achieving significant performance improvements on a large-scale chest CT dataset.
Details
Motivation: Current methods for automating radiology report generation rely on global image features, which fail to capture fine-grained organ relationships essential for accurate reporting. This motivates the need for a more structured approach that explicitly models radiological knowledge. Method: The study proposes CT-GRAPH, which uses a hierarchical graph attention network and pretrained 3D medical feature encoders to capture both global and organ-level features. Anatomical masks are utilized to refine these features within a graph structure, which are then integrated into a large language model to generate detailed medical reports. Result: The proposed method achieves a substantial improvement of 7.9% in F1 score over current state-of-the-art methods on the CT-RATE dataset for CT report generation. Conclusion: CT-GRAPH provides a novel method for generating radiology reports by modeling radiological knowledge through a hierarchical graph attention network, leading to more accurate reporting by linking fine-grained organ features with coarser anatomical systems and global patient context. Abstract: As medical imaging is central to diagnostic processes, automating the generation of radiology reports has become increasingly relevant to assist radiologists with their heavy workloads. Most current methods rely solely on global image features, failing to capture fine-grained organ relationships crucial for accurate reporting. To this end, we propose CT-GRAPH, a hierarchical graph attention network that explicitly models radiological knowledge by structuring anatomical regions into a graph, linking fine-grained organ features to coarser anatomical systems and a global patient context. Our method leverages pretrained 3D medical feature encoders to obtain global and organ-level features by utilizing anatomical masks. These features are further refined within the graph and then integrated into a large language model to generate detailed medical reports. 
We evaluate our approach for the task of report generation on the large-scale chest CT dataset CT-RATE. We provide an in-depth analysis of pretrained feature encoders for CT report generation and show that our method achieves a substantial absolute improvement of 7.9% in F1 score over current state-of-the-art methods. The code is publicly available at https://github.com/hakal104/CT-GRAPH.
[114] Deformable Attention Graph Representation Learning for Histopathology Whole Slide Image Analysis
Mingxi Fu,Xitong Ling,Yuxuan Chen,Jiawen Li,fanglei fu,Huaitian Yuan,Tian Guan,Yonghong He,Lianghui Zhu
Main category: cs.CV
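TL;DR: This paper proposes a graph neural network framework based on deformable attention for pathology image analysis, achieving state-of-the-art performance.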
Details
Motivation: Existing multiple instance learning methods and graph neural networks fall short in capturing the spatial dependencies among tissue structures, and conventional attention mechanisms lack specificity. Method: The authors build a novel graph neural network framework on a dynamic weighted directed graph and introduce learnable spatial offsets to adaptively attend to morphologically relevant regions. Result: The framework achieves state-of-the-art performance on four benchmark datasets, including TCGA-COAD and BRACS. Conclusion: The proposed framework effectively captures complex spatial structures in whole slide images and regions of interest. Abstract: Accurate classification of Whole Slide Images (WSIs) and Regions of Interest (ROIs) is a fundamental challenge in computational pathology. While mainstream approaches often adopt Multiple Instance Learning (MIL), they struggle to capture the spatial dependencies among tissue structures. Graph Neural Networks (GNNs) have emerged as a solution to model inter-instance relationships, yet most rely on static graph topologies and overlook the physical spatial positions of tissue patches. Moreover, conventional attention mechanisms lack specificity, limiting their ability to focus on structurally relevant regions. In this work, we propose a novel GNN framework with deformable attention for pathology image analysis. We construct a dynamic weighted directed graph based on patch features, where each node aggregates contextual information from its neighbors via attention-weighted edges. Specifically, we incorporate learnable spatial offsets informed by the real coordinates of each patch, enabling the model to adaptively attend to morphologically relevant regions across the slide. This design significantly enhances the contextual field while preserving spatial specificity. Our framework achieves state-of-the-art performance on four benchmark datasets (TCGA-COAD, BRACS, gastric intestinal metaplasia grading, and intestinal ROI classification), demonstrating the power of deformable attention in capturing complex spatial structures in WSIs and ROIs.
[115] UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation
Wonjun Kang,Byeongkeun Ahn,Minjae Lee,Kevin Galim,Seunghyuk Oh,Hyung Il Koo,Nam Ik Cho
Main category: cs.CV
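TL;DR: This paper proposes UNCAGE, a novel training-free method that improves compositional fidelity in text-to-image generation by leveraging attention maps, addressing the text-image alignment shortcomings of existing models.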
Details
Motivation: Text-to-image generation remains challenging, particularly in accurately binding attributes and achieving text-image alignment, where existing models such as diffusion and autoregressive models have limitations. Method: The authors propose Unmasking with Contrastive Attention Guidance (UNCAGE), which uses attention maps to prioritize the unmasking of tokens that clearly represent individual objects, thereby improving generation quality. Result: UNCAGE improves performance in both quantitative and qualitative evaluations across multiple benchmarks and metrics, with negligible inference overhead. Conclusion: UNCAGE is an effective training-free method that markedly improves compositional fidelity in text-to-image generation and offers a new direction for future research. Abstract: Text-to-image (T2I) generation has been actively studied using Diffusion Models and Autoregressive Models. Recently, Masked Generative Transformers have gained attention as an alternative to Autoregressive Models to overcome the inherent limitations of causal attention and autoregressive decoding through bidirectional attention and parallel decoding, enabling efficient and high-quality image generation. However, compositional T2I generation remains challenging, as even state-of-the-art Diffusion Models often fail to accurately bind attributes and achieve proper text-image alignment. While Diffusion Models have been extensively studied for this issue, Masked Generative Transformers exhibit similar limitations but have not been explored in this context. To address this, we propose Unmasking with Contrastive Attention Guidance (UNCAGE), a novel training-free method that improves compositional fidelity by leveraging attention maps to prioritize the unmasking of tokens that clearly represent individual objects. UNCAGE consistently improves performance in both quantitative and qualitative evaluations across multiple benchmarks and metrics, with negligible inference overhead. Our code is available at https://github.com/furiosa-ai/uncage.
[116] From Detection to Correction: Backdoor-Resilient Face Recognition via Vision-Language Trigger Detection and Noise-Based Neutralization
Farah Wahida,M. A. P. Chamikara,Yashothara Shanmugarasa,Mohan Baruwal Chhetri,Thilina Ranbaduge,Ibrahim Khalil
Main category: cs.CV
TL;DR: TrueBiometric is a novel method that accurately detects and corrects poisoned images in face recognition systems using vision language models and corrective noise, ensuring security without sacrificing performance.
Details
Motivation: Biometric systems are vulnerable to backdoor attacks that manipulate training data, and current defense mechanisms struggle to identify and mitigate poisoned images without reducing data utility. Method: TrueBiometric uses a majority voting mechanism leveraging multiple state-of-the-art large vision language models to detect poisoned images, and targeted, calibrated corrective noise to fix them. Result: TrueBiometric achieves 100% accuracy in detecting and correcting poisoned images while maintaining accuracy on clean data, outperforming existing state-of-the-art approaches in practicality, accuracy, and effectiveness. Conclusion: TrueBiometric is a reliable and adaptable solution for detecting and correcting poisoned images in biometric systems, offering 100% accuracy in detection and correction without compromising clean image accuracy. Abstract: Biometric systems, such as face recognition systems powered by deep neural networks (DNNs), rely on large and highly sensitive datasets. Backdoor attacks can subvert these systems by manipulating the training process. By inserting a small trigger, such as a sticker, make-up, or patterned mask, into a few training images, an adversary can later present the same trigger during authentication to be falsely recognized as another individual, thereby gaining unauthorized access. Existing defense mechanisms against backdoor attacks still face challenges in precisely identifying and mitigating poisoned images without compromising data utility, which undermines the overall reliability of the system. We propose a novel and generalizable approach, TrueBiometric: Trustworthy Biometrics, which accurately detects poisoned images using a majority voting mechanism leveraging multiple state-of-the-art large vision language models. Once identified, poisoned samples are corrected using targeted and calibrated corrective noise. 
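The majority-voting step mentioned above can be sketched in a few lines. The abstract does not say which models are used, how their answers are parsed, or how ties are resolved, so the boolean verdict list and the strict-majority rule here are assumptions for illustration only.

```python
def majority_vote(verdicts):
    """Return True if a strict majority of detectors flag the image as poisoned.

    `verdicts` is a hypothetical list of booleans, one per vision-language
    model. Ties and empty inputs count as "not poisoned" (an assumption).
    """
    flagged = sum(1 for v in verdicts if v)
    return flagged * 2 > len(verdicts)


# Two of three hypothetical detectors flag the image.
print(majority_vote([True, True, False]))
```

Using an odd number of detectors avoids ties altogether, which is one common reason to prefer three or five voters.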
Our extensive empirical results demonstrate that TrueBiometric detects and corrects poisoned images with 100% accuracy without compromising accuracy on clean images. Compared to existing state-of-the-art approaches, TrueBiometric offers a more practical, accurate, and effective solution for mitigating backdoor attacks in face recognition systems.
[117] Physical Adversarial Camouflage through Gradient Calibration and Regularization
Jiawei Liang,Siyuan Liang,Jianjie Huang,Chenxi Si,Ming Zhang,Xiaochun Cao
Main category: cs.CV
TL;DR: This paper proposes an improved adversarial camouflage framework that enhances gradient optimization to overcome challenges in physical adversarial attacks, significantly outperforming existing techniques.
Details
Motivation: Physical adversarial camouflage poses a significant security risk by deceiving object detectors, especially in safety-critical fields like autonomous driving. Existing techniques face challenges due to inconsistent sampling point densities and conflicting gradient updates from multiple angles. Method: The authors proposed a novel adversarial camouflage framework based on gradient optimization, introducing a gradient calibration strategy for consistent gradient updates and a gradient decorrelation method to eliminate redundant or conflicting updates during multi-angle optimization. Result: Extensive experimental results show that the proposed method achieves an average increase of 13.46% in attack success rate across distances and 11.03% across angles compared to the state of the art. Conclusion: The proposed adversarial camouflage framework significantly outperforms existing techniques in attack success rate and highlights the need for more robust system design in safety-critical applications. Abstract: The advancement of deep object detectors has greatly affected safety-critical fields like autonomous driving. However, physical adversarial camouflage poses a significant security risk by altering object textures to deceive detectors. Existing techniques struggle with variable physical environments, facing two main challenges: 1) inconsistent sampling point densities across distances hinder the gradient optimization from ensuring local continuity, and 2) updating texture gradients from multiple angles causes conflicts, reducing optimization stability and attack effectiveness. To address these issues, we propose a novel adversarial camouflage framework based on gradient optimization. First, we introduce a gradient calibration strategy, which ensures consistent gradient updates across distances by propagating gradients from sparsely to unsampled texture points. 
Additionally, we develop a gradient decorrelation method, which prioritizes and orthogonalizes gradients based on loss values, enhancing stability and effectiveness in multi-angle optimization by eliminating redundant or conflicting updates. Extensive experimental results on various detection models, angles and distances show that our method significantly exceeds the state of the art, with an average increase in attack success rate (ASR) of 13.46% across distances and 11.03% across angles. Furthermore, empirical evaluation in real-world scenarios highlights the need for more robust system design.
[118] Smoothing Slot Attention Iterations and Recurrences
Rongzhen Zhao,Wenyan Yang,Juho Kannala,Joni Pajarinen
Main category: cs.CV
TL;DR: SmoothSA improves Slot Attention by better handling initial and subsequent video frames, leading to more accurate object-centric learning.
Details
Motivation: The authors aim to address the limitations of Slot Attention (SA) in handling first-frame cold-start queries and the need for different transforms in subsequent video frames. Method: The authors propose SmoothSA, which preheats cold-start queries using a self-distilled module and applies differentiated transforms for first and subsequent video frames, validated through experiments on object discovery and recognition benchmarks. Result: Comprehensive experiments demonstrate that SmoothSA outperforms existing methods in object discovery, recognition, and downstream tasks, with further analyses showing improved smoothing of SA iterations and recurrences. Conclusion: SmoothSA improves the Slot Attention mechanism by preheating cold-start queries and differentiating transforms for first and non-first video frames, enhancing object discovery and recognition. Abstract: Slot Attention (SA) and its variants lie at the heart of mainstream Object-Centric Learning (OCL). Objects in an image can be aggregated into respective slot vectors by iteratively refining cold-start query vectors, typically three times, via SA on image features. For video, such aggregation is recurrently shared across frames, with queries cold-started on the first frame and transitioned from the previous frame's slots on non-first frames. However, the cold-start queries lack sample-specific cues and thus hinder precise aggregation on the image or the video's first frame. Also, non-first frames' queries are already sample-specific and thus require transforms different from the first frame's aggregation.
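The iterative refinement described above can be sketched in plain Python. This is a heavily simplified illustration, not SmoothSA and not the full Slot Attention module: it omits the learned query/key/value projections, layer normalization, and the GRU and MLP update, and simply replaces each slot with the attention-weighted mean of the features. The function name and toy data are assumptions.

```python
import math


def slot_attention(features, slots, iters=3):
    """Simplified Slot Attention on lists of lists.

    features: N x D image features; slots: K x D cold-start queries.
    Learned projections and the GRU/MLP update are omitted (assumption).
    """
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    for _ in range(iters):
        # Scaled dot-product logits between every feature and every slot.
        logits = [
            [scale * sum(f * s for f, s in zip(feat, slot)) for slot in slots]
            for feat in features
        ]
        # Softmax over slots, so that slots compete for each feature.
        attn = []
        for row in logits:
            m = max(row)
            exps = [math.exp(v - m) for v in row]
            total = sum(exps)
            attn.append([e / total for e in exps])
        # Each slot becomes the attention-weighted mean of the features.
        new_slots = []
        for k in range(len(slots)):
            weight = sum(a[k] for a in attn) + 1e-8
            new_slots.append([
                sum(a[k] * feat[d] for a, feat in zip(attn, features)) / weight
                for d in range(dim)
            ])
        slots = new_slots
    return slots


# Two toy "objects" (feature clusters) and two cold-start queries.
features = [[4.0, 0.0], [4.0, 0.4], [0.0, 4.0], [0.4, 4.0]]
slots = slot_attention(features, [[1.0, 0.0], [0.0, 1.0]], iters=3)
print(slots)
```

In this toy setting each slot drifts toward the mean of one feature cluster over the three iterations, which is the aggregation behavior the abstract refers to.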
We address these issues for the first time with our SmoothSA: (1) To smooth SA iterations on the image or video's first frame, we preheat the cold-start queries with rich information of input features, via a tiny module self-distilled inside OCL; (2) To smooth SA recurrences across all video frames, we differentiate the homogeneous transforms on the first and non-first frames, by using full and single iterations respectively. Comprehensive experiments on object discovery, recognition and downstream benchmarks validate our method's effectiveness. Further analyses intuitively illuminate how our method smooths SA iterations and recurrences. Our code is available in the supplement.
[119] Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions
Hubert Baniecki,Maximilian Muschalik,Fabian Fumagalli,Barbara Hammer,Eyke Hüllermeier,Przemyslaw Biecek
Main category: cs.CV
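TL;DR: This paper proposes FIxLIP, a game-theoretic second-order interaction explanation method for vision-language pre-trained models that improves the quality and computational efficiency of attribution explanations.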
Details
Motivation: Existing saliency map methods capture only first-order attributions and fail to reflect the complex cross-modal interactions in vision-language models. Method: The approach builds on the game-theoretic weighted Banzhaf interaction index and extends evaluation metrics such as the pointing game and insertion/deletion curves. Result: Experiments on MS COCO and ImageNet-1k show that second-order methods like FIxLIP outperform first-order attribution methods and can be used to compare different models effectively. Conclusion: FIxLIP provides a more efficient and flexible second-order interaction explanation method for analyzing feature importance in vision-language pre-trained models. Abstract: Language-image pre-training (LIP) enables the development of vision-language models capable of zero-shot classification, localization, multimodal retrieval, and semantic understanding. Various explanation methods have been proposed to visualize the importance of input image-text pairs on the model's similarity outputs. However, popular saliency maps are limited by capturing only first-order attributions, overlooking the complex cross-modal interactions intrinsic to such encoders. We introduce faithful interaction explanations of LIP models (FIxLIP) as a unified approach to decomposing the similarity in vision-language encoders. FIxLIP is rooted in game theory, where we analyze how using the weighted Banzhaf interaction index offers greater flexibility and improves computational efficiency over the Shapley interaction quantification framework. From a practical perspective, we propose how to naturally extend explanation evaluation metrics, like the pointing game and area between the insertion/deletion curves, to second-order interaction explanations. Experiments on MS COCO and ImageNet-1k benchmarks validate that second-order methods like FIxLIP outperform first-order attribution methods. Beyond delivering high-quality explanations, we demonstrate the utility of FIxLIP in comparing different models like CLIP vs. SigLIP-2 and ViT-B/32 vs. ViT-L/16.
[120] How and Why: Taming Flow Matching for Unsupervised Anomaly Detection and Localization
Liangwei Li,Lin Liu,Juanxiu Liu,Jing Zhang,Ruqian Hao,Xiaohui Du
Main category: cs.CV
TL;DR: This paper introduces WT-Flow, a new method for unsupervised anomaly detection that overcomes the limitations of traditional flow-based approaches by using Worst Transport displacement interpolation, achieving top performance on the MVTec dataset.
Details
Motivation: The motivation stems from the model expressivity limitations of conventional flow-based methods for unsupervised anomaly detection, prompting the need for a more effective approach such as Flow Matching (FM) and its reformulation as time-reversed Flow Matching (rFM). Method: The authors propose Worst Transport (WT) displacement interpolation to reconstruct a non-probabilistic evolution path, addressing the limitations of conventional flow-based methods in anomaly detection and localization. Result: The proposed WT-Flow provides a novel unsupervised paradigm with dynamical control over sample trajectories, creating 'degenerate potential wells' for anomaly-free samples and allowing anomalous samples to escape, thereby achieving state-of-the-art performance at a single scale on the MVTec dataset. Conclusion: The paper concludes that the proposed WT-Flow method enhances dynamical control over sample trajectories for unsupervised anomaly detection, offering a theoretically grounded separation mechanism and achieving state-of-the-art performance on the MVTec dataset. Abstract: We propose a new paradigm for unsupervised anomaly detection and localization using Flow Matching (FM), which fundamentally addresses the model expressivity limitations of conventional flow-based methods. To this end, we formalize the concept of time-reversed Flow Matching (rFM) as a vector field regression along a predefined probability path to transform unknown data distributions into standard Gaussian. We bring two core observations that reshape our understanding of FM. First, we rigorously prove that FM with linear interpolation probability paths is inherently non-invertible. Second, our analysis reveals that employing reversed Gaussian probability paths in high-dimensional spaces can lead to trivial vector fields. This issue arises due to the manifold-related constraints. 
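The "vector field regression along a predefined probability path" can be made concrete with a minimal sketch of the standard flow matching training target under linear interpolation, oriented from data toward Gaussian noise as in the time-reversed setting described above. This is the generic formulation, assumed here for illustration; it is not WT-Flow, and the function names are hypothetical.

```python
import random


def linear_path_sample(x_data, x_noise, t):
    """Point on the linear path from data (t=0) to noise (t=1) and its target velocity.

    x_t = (1 - t) * x_data + t * x_noise, with constant target x_noise - x_data.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x_data, x_noise)]
    target = [b - a for a, b in zip(x_data, x_noise)]
    return x_t, target


def fm_loss(predicted, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


# One training pair: a toy 2-D data sample and a standard Gaussian sample.
x_data = [1.0, 2.0]
x_noise = [random.gauss(0.0, 1.0) for _ in x_data]
x_t, target = linear_path_sample(x_data, x_noise, t=random.random())
# A real model would predict the velocity from (x_t, t); zeros stand in here.
print(fm_loss([0.0, 0.0], target))
```

In training, a network would receive `x_t` and `t` and be regressed onto `target`; the zero prediction above is only a placeholder for that network.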
Building on the second observation, we propose Worst Transport (WT) displacement interpolation to reconstruct a non-probabilistic evolution path. The proposed WT-Flow enhances dynamical control over sample trajectories, constructing "degenerate potential wells" for anomaly-free samples while allowing anomalous samples to escape. This novel unsupervised paradigm offers a theoretically grounded separation mechanism for anomalous samples. Notably, FM provides a computationally tractable framework that scales to complex data. We present the first successful application of FM for the unsupervised anomaly detection task, achieving state-of-the-art performance at a single scale on the MVTec dataset. The reproducible code for training will be released upon camera-ready submission.
[121] F2PASeg: Feature Fusion for Pituitary Anatomy Segmentation in Endoscopic Surgery
Lumin Chen,Zhiying Wu,Tianye Lei,Xuexue Bai,Ming Feng,Yuxi Wang,Gaofeng Meng,Zhen Lei,Hongbin Liu
Main category: cs.CV
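TL;DR: This paper presents PAS, a new pituitary anatomy segmentation dataset, and the F2PASeg method to enhance the safety of pituitary surgery and intraoperative planning.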
Details
Motivation: Pituitary tumors often deform or encapsulate adjacent vital structures. Segmenting anatomical structures can give surgeons early warning of surgically risky regions, but existing datasets are extremely scarce. Method: The authors introduce the PAS dataset, which contains 7,845 time-coherent images, and design F2PASeg, which combines high-resolution image features with deep semantic embeddings to address inconsistent feature representation. Result: Experiments show that F2PASeg consistently segments critical anatomical structures in real time, providing a reliable solution for pituitary surgery planning. Conclusion: The PAS dataset and the F2PASeg method effectively improve the safety and precision of pituitary surgery. Abstract: Pituitary tumors often cause deformation or encapsulation of adjacent vital structures. Anatomical structure segmentation can provide surgeons with early warnings of regions that pose surgical risks, thereby enhancing the safety of pituitary surgery. However, pixel-level annotated video stream datasets for pituitary surgeries are extremely rare. To address this challenge, we introduce a new dataset for Pituitary Anatomy Segmentation (PAS). PAS comprises 7,845 time-coherent images extracted from 120 videos. To mitigate class imbalance, we apply data augmentation techniques that simulate the presence of surgical instruments in the training data. One major challenge in pituitary anatomy segmentation is the inconsistency in feature representation due to occlusions, camera motion, and surgical bleeding. By incorporating a Feature Fusion module, F2PASeg is proposed to refine anatomical structure segmentation by leveraging both high-resolution image features and deep semantic embeddings, enhancing robustness against intraoperative variations. Experimental results demonstrate that F2PASeg consistently segments critical anatomical structures in real time, providing a reliable solution for intraoperative pituitary surgery planning. Code: https://github.com/paulili08/F2PASeg.
[122] Keep It Real: Challenges in Attacking Compression-Based Adversarial Purification
Samuel Räber,Till Aczel,Andreas Plesner,Roger Wattenhofer
Main category: cs.CV
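TL;DR: This paper studies the robustness of compression models to adversarial attacks, finds that models with high-fidelity reconstructions are more secure, and shows that future attack techniques must overcome the realism obstacle.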
Details
Motivation: Prior work suggested that preprocessing images with lossy compression can defend against adversarial perturbations, but comprehensive attack evaluations were lacking. Method: The paper constructs strong white-box and adaptive attacks to assess the security of different compression models and evaluates them across multiple attack scenarios. Result: Compression models with high-fidelity reconstructions resist attacks substantially better, whereas low-realism compression models are easily broken; this difference is not due to gradient masking. Conclusion: Compression models whose high-fidelity reconstructions stay distributionally aligned with natural images appear inherently robust, which poses a significant challenge for future adversarial attacks. Abstract: Previous work has suggested that preprocessing images through lossy compression can defend against adversarial perturbations, but comprehensive attack evaluations have been lacking. In this paper, we construct strong white-box and adaptive attacks against various compression models and identify a critical challenge for attackers: high realism in reconstructed images significantly increases attack difficulty. Through rigorous evaluation across multiple attack scenarios, we demonstrate that compression models capable of producing realistic, high-fidelity reconstructions are substantially more resistant to our attacks. In contrast, low-realism compression models can be broken. Our analysis reveals that this is not due to gradient masking. Rather, realistic reconstructions maintaining distributional alignment with natural images seem to offer inherent robustness. This work highlights a significant obstacle for future adversarial attacks and suggests that developing more effective techniques to overcome realism represents an essential challenge for comprehensive security evaluation.
[123] SMOL-MapSeg: Show Me One Label
Yunshuang Yuan,Frank Thiemann,Thorsten Dahms,Monika Sester
Main category: cs.CV
TL;DR: This paper proposes SMOL-MapSeg, a modified foundation model using OND prompting to improve semantic segmentation of historical maps, showing better performance and adaptability compared to existing methods.
Details
Motivation: Historical maps lack consistency in visual patterns, making it difficult for pre-trained foundation models—optimized for modern or domain-specific images—to perform accurate semantic segmentation. This inconsistency necessitates a new approach that can bridge the gap between abstract map features and meaningful concepts. Method: The authors introduce On-Need Declarative (OND) knowledge-based prompting by replacing the prompt encoder of the foundation model SAM. This allows explicit guidance on pattern-concept correspondence during inference. The model is then fine-tuned on historical maps to create SMOL-MapSeg. Result: SMOL-MapSeg accurately segments classes defined by OND knowledge, adapts to unseen classes through few-shot fine-tuning, and outperforms UNet-based baselines in average segmentation performance. Conclusion: The proposed SMOL-MapSeg model, which incorporates OND knowledge-based prompting, demonstrates superior performance in segmenting historical maps compared to traditional models like UNet, while also being adaptable to unseen classes through few-shot fine-tuning. Abstract: Historical maps are valuable for studying changes to the Earth's surface. With the rise of deep learning, models like UNet have been used to extract information from these maps through semantic segmentation. Recently, pre-trained foundation models have shown strong performance across domains such as autonomous driving, medical imaging, and industrial inspection. However, they struggle with historical maps. These models are trained on modern or domain-specific images, where patterns can be tied to predefined concepts through common sense or expert knowledge. Historical maps lack such consistency -- similar concepts can appear in vastly different shapes and styles. To address this, we propose On-Need Declarative (OND) knowledge-based prompting, which introduces explicit prompts to guide the model on what patterns correspond to which concepts. 
This allows users to specify the target concept and pattern during inference (on-need inference). We implement this by replacing the prompt encoder of the foundation model SAM with our OND prompting mechanism and fine-tune it on historical maps. The resulting model is called SMOL-MapSeg (Show Me One Label). Experiments show that SMOL-MapSeg can accurately segment classes defined by OND knowledge. It can also adapt to unseen classes through few-shot fine-tuning. Additionally, it outperforms a UNet-based baseline in average segmentation performance.
[124] AutoIAD: Manager-Driven Multi-Agent Collaboration for Automated Industrial Anomaly Detection
Dongwei Ji,Bingzhang Hu,Yi Zhou
Main category: cs.CV
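TL;DR: AutoIAD is an automated multi-agent collaboration framework for industrial visual anomaly detection that improves performance by integrating a domain-specific knowledge base and specialized sub-agents.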
Details
Motivation: Industrial anomaly detection typically requires substantial manual effort, so an automated framework is needed to make manufacturing quality control more efficient. Method: The paper introduces AutoIAD, a multi-agent collaboration framework in which a Manager-Driven central agent orchestrates specialized sub-agents and integrates a domain-specific knowledge base. Result: Experiments show that AutoIAD significantly outperforms existing frameworks in task completion rate and model performance (AUROC) and effectively mitigates hallucination. Conclusion: AutoIAD is a multi-agent collaboration framework designed specifically for industrial visual anomaly detection, and it significantly outperforms existing general-purpose agentic collaboration frameworks and traditional AutoML frameworks. Abstract: Industrial anomaly detection (IAD) is critical for manufacturing quality control, but conventionally requires significant manual effort for various application scenarios. This paper introduces AutoIAD, a multi-agent collaboration framework, specifically designed for end-to-end automated development of industrial visual anomaly detection. AutoIAD leverages a Manager-Driven central agent to orchestrate specialized sub-agents (including Data Preparation, Data Loader, Model Designer, Trainer) and integrates a domain-specific knowledge base, which intelligently handles the entire pipeline using raw industrial image data to develop a trained anomaly detection model. We construct a comprehensive benchmark using MVTec AD datasets to evaluate AutoIAD across various LLM backends. Extensive experiments demonstrate that AutoIAD significantly outperforms existing general-purpose agentic collaboration frameworks and traditional AutoML frameworks in task completion rate and model performance (AUROC), while effectively mitigating issues like hallucination through iterative refinement. Ablation studies further confirm the crucial roles of the Manager central agent and the domain knowledge base module in producing robust and high-quality IAD solutions.
[125] Symmetry Understanding of 3D Shapes via Chirality Disentanglement
Weikang Wang,Tobias Weißberg,Nafie El Amrani,Florian Bernard
Main category: cs.CV
TL;DR: This paper proposes an unsupervised chirality feature extraction pipeline using the Diff3F framework and 2D foundation models to enhance shape analysis by effectively distinguishing left from right symmetric parts.
Details
Motivation: Chirality information is essential in various computer vision domains, but current shape descriptors for point clouds and meshes often fail to disambiguate between left and right symmetric parts. This highlights the necessity of developing a chirality-aware feature extractor. Method: Based on the recent Diff3F framework, an unsupervised chirality feature extraction pipeline is proposed to extract chirality-aware features from 2D foundation models and apply them to shape vertices. The pipeline is evaluated through quantitative and qualitative experiments across diverse datasets. Result: The extracted chirality features demonstrate effectiveness and practical utility across downstream tasks such as left-right disentanglement, shape matching, and part segmentation. Conclusion: The proposed unsupervised chirality feature extraction pipeline based on the Diff3F framework effectively decorates shape vertices with chirality-aware information, addressing the lack of chirality-aware features in current shape descriptors. Abstract: Chirality information (i.e. information that allows distinguishing left from right) is ubiquitous for various data modes in computer vision, including images, videos, point clouds, and meshes. While chirality has been extensively studied in the image domain, its exploration in shape analysis (such as point clouds and meshes) remains underdeveloped. Although many shape vertex descriptors have shown appealing properties (e.g. robustness to rigid-body transformations), they are often not able to disambiguate between left and right symmetric parts. Considering the ubiquity of chirality information in different shape analysis problems and the lack of chirality-aware features within current shape descriptors, developing a chirality feature extractor becomes necessary and urgent. 
Based on the recent Diff3F framework, we propose an unsupervised chirality feature extraction pipeline to decorate shape vertices with chirality-aware information, extracted from 2D foundation models. We evaluated the extracted chirality features through quantitative and qualitative experiments across diverse datasets. Results from downstream tasks including left-right disentanglement, shape matching, and part segmentation demonstrate their effectiveness and practical utility. Project page: https://wei-kang-wang.github.io/chirality/
[126] MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips
Shibo Wang,Haonan He,Maria Parelli,Christoph Gebhardt,Zicong Fan,Jie Song
Main category: cs.CV
TL;DR: MagicHOI is a template-free method for reconstructing 3D hand-object interactions from short monocular videos with limited viewpoint variation by leveraging novel view synthesis diffusion models and visible contact constraints, resulting in superior performance compared to existing methods.
Details
Motivation: Most RGB-based hand-object reconstruction methods rely on object templates or assume full object visibility, which often breaks in real-world settings, resulting in implausible reconstructions. This limitation motivates the development of a template-free method that can handle limited viewpoint variation. Method: MagicHOI integrates a novel view synthesis model into a hand-object reconstruction framework and incorporates visible contact constraints to align hands with objects. Result: MagicHOI significantly outperforms existing state-of-the-art hand-object reconstruction methods, and novel view synthesis diffusion priors effectively regularize unseen object regions, enhancing 3D hand-object reconstruction. Conclusion: MagicHOI effectively reconstructs 3D hand-object interactions from monocular videos with limited viewpoint variation by leveraging novel view synthesis diffusion models and incorporating visible contact constraints, outperforming existing state-of-the-art methods. Abstract: Most RGB-based hand-object reconstruction methods rely on object templates, while template-free methods typically assume full object visibility. This assumption often breaks in real-world settings, where fixed camera viewpoints and static grips leave parts of the object unobserved, resulting in implausible reconstructions. To overcome this, we present MagicHOI, a method for reconstructing hands and objects from short monocular interaction videos, even under limited viewpoint variation. Our key insight is that, despite the scarcity of paired 3D hand-object data, large-scale novel view synthesis diffusion models offer rich object supervision. This supervision serves as a prior to regularize unseen object regions during hand interactions. Leveraging this insight, we integrate a novel view synthesis model into our hand-object reconstruction framework. We further align hand to object by incorporating visible contact constraints. 
Our results demonstrate that MagicHOI significantly outperforms existing state-of-the-art hand-object reconstruction methods. We also show that novel view synthesis diffusion priors effectively regularize unseen object regions, enhancing 3D hand-object reconstruction.
[127] Revealing Latent Information: A Physics-inspired Self-supervised Pre-training Framework for Noisy and Sparse Events
Lin Zhu,Ruonan Liu,Xiao Wang,Lizhi Wang,Hua Huang
Main category: cs.CV
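TL;DR: This paper proposes a self-supervised pre-training framework for event camera data that addresses its sparsity and noise, achieving better performance on a range of downstream tasks.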
Details
Motivation: Event cameras, a novel type of neuromorphic vision sensor, offer high temporal resolution and wide dynamic range, but their data is inherently sparse and noisy and mainly reflects brightness changes, which complicates effective feature extraction. A method is therefore needed to fully reveal the latent information in event data, such as edge information and texture cues. Method: The paper proposes a three-stage self-supervised pre-training framework: 1) Difference-guided Masked Modeling, which reconstructs temporal intensity difference maps to extract enhanced information from raw event data; 2) Backbone-fixed Feature Transition, which contrasts event and image features without updating the backbone, preserving the representations learned from masked modeling and stabilizing their effect on contrastive learning; 3) Focus-aimed Contrastive Learning, which improves semantic discrimination by focusing on high-value regions. Result: Experiments show that the framework is robust and consistently outperforms state-of-the-art methods on various downstream tasks, including object recognition, semantic segmentation, and optical flow estimation. Conclusion: The study proposes a novel self-supervised pre-training framework that fully reveals the latent information in event camera data and effectively addresses the sparsity and noise of event data, opening new possibilities for applying event cameras to computer vision tasks. Abstract: The event camera, a novel neuromorphic vision sensor, records data with high temporal resolution and wide dynamic range, offering new possibilities for accurate visual representation in challenging scenarios. However, event data is inherently sparse and noisy, mainly reflecting brightness changes, which complicates effective feature extraction. To address this, we propose a self-supervised pre-training framework to fully reveal latent information in event data, including edge information and texture cues. Our framework consists of three stages: Difference-guided Masked Modeling, inspired by the event physical sampling process, reconstructs temporal intensity difference maps to extract enhanced information from raw event data. Backbone-fixed Feature Transition contrasts event and image features without updating the backbone to preserve representations learned from masked modeling and stabilize their effect on contrastive learning. Focus-aimed Contrastive Learning updates the entire model to improve semantic discrimination by focusing on high-value regions. Extensive experiments show our framework is robust and consistently outperforms state-of-the-art methods on various downstream tasks, including object recognition, semantic segmentation, and optical flow estimation.
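As a rough illustration of what a temporal intensity difference map can look like, the sketch below sums signed event polarities per pixel over a time window. Under the idealized event-generation model, this sum scaled by the contrast threshold approximates the log-intensity change at each pixel. The event tuple layout `(t, x, y, polarity)`, the `contrast_threshold` value, and the function name are assumptions made for illustration; they are not taken from the paper or its repository.

```python
def temporal_difference_map(events, height, width, t_start, t_end,
                            contrast_threshold=0.2):
    """Accumulate signed event polarities into a per-pixel difference map.

    events: iterable of (t, x, y, polarity), polarity in {+1, -1} (assumed layout).
    Returns a height x width nested list approximating the log-intensity
    change between t_start and t_end under the ideal event model.
    """
    diff = [[0.0] * width for _ in range(height)]
    for t, x, y, polarity in events:
        inside_window = t_start <= t < t_end
        inside_frame = 0 <= x < width and 0 <= y < height
        if inside_window and inside_frame:
            diff[y][x] += contrast_threshold * polarity
    return diff


# Toy event stream (hypothetical values).
events = [
    (0.01, 1, 0, +1),
    (0.02, 1, 0, +1),
    (0.03, 2, 1, -1),
    (0.50, 0, 0, +1),  # falls outside the window used below and is ignored
]
diff_map = temporal_difference_map(events, height=2, width=3,
                                   t_start=0.0, t_end=0.1)
```

Pixels with no events stay at zero, which is one way to see why such maps are sparse and why a masked-reconstruction objective on them has to cope with mostly empty regions.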
The code and dataset are available at https://github.com/BIT-Vision/EventPretrain.
[128] Head Anchor Enhanced Detection and Association for Crowded Pedestrian Tracking
Zewei Wu,César Teixeira,Wei Ke,Zhang Xiong
Main category: cs.CV
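TL;DR: To address occlusion in visual pedestrian tracking, this paper proposes an enhanced tracking framework that combines richer feature representations with a more robust motion model.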
Details
Motivation: Real-world applications face significant occlusion challenges, and traditional tracking methods perform poorly in severe occlusion scenarios. Method: The method combines detection features from the regression and classification branches of an object detector and introduces a head keypoint detection model. For motion modeling, it proposes an iterative Kalman filtering approach that integrates 3D priors. Result: The proposed method provides more stable tracking trajectories in complex scenes where occlusion is prevalent. Conclusion: The paper proposes an enhanced tracking framework that, by combining richer feature representations with a more robust motion model, offers a more robust solution for multi-object tracking in crowded environments. Abstract: Visual pedestrian tracking represents a promising research field, with extensive applications in intelligent surveillance, behavior analysis, and human-computer interaction. However, real-world applications face significant occlusion challenges. When multiple pedestrians interact or overlap, the loss of target features severely compromises the tracker's ability to maintain stable trajectories. Traditional tracking methods, which typically rely on full-body bounding box features extracted from Re-ID models and linear constant-velocity motion assumptions, often struggle in severe occlusion scenarios. To address these limitations, this work proposes an enhanced tracking framework that leverages richer feature representations and a more robust motion model. Specifically, the proposed method incorporates detection features from both the regression and classification branches of an object detector, embedding spatial and positional information directly into the feature representations. To further mitigate occlusion challenges, a head keypoint detection model is introduced, as the head is less prone to occlusion compared to the full body. In terms of motion modeling, we propose an iterative Kalman filtering approach designed to align with modern detector assumptions, integrating 3D priors to better complete motion trajectories in complex scenes. By combining these advancements in appearance and motion modeling, the proposed method offers a more robust solution for multi-object tracking in crowded environments where occlusions are prevalent.
[129] FS-IQA: Certified Feature Smoothing for Robust Image Quality Assessment
Ekaterina Shumitskaya,Dmitriy Vatolin,Anastasia Antsiferova
Main category: cs.CV
TL;DR: The paper introduces a certified defense for IQA models using feature-space noise, offering better robustness, preserved image quality, and reduced inference time.
Details
Motivation: The motivation is to address the issue of visual quality degradation in prior methods that inject noise directly into input images. Method: The method involves randomized smoothing with noise applied in the feature space, analyzing the maximum singular value of the Jacobian to link feature-space noise with input-space perturbations. Result: The proposed method reduces inference time by 99.5% without certification and by 20.6% with certification, showing up to 30.9% improvement in correlation with subjective quality scores. Conclusion: The paper concludes that the proposed certified defense method for IQA models effectively enhances robustness while maintaining image fidelity and reducing inference time. Abstract: We propose a novel certified defense method for Image Quality Assessment (IQA) models based on randomized smoothing with noise applied in the feature space rather than the input space. Unlike prior approaches that inject Gaussian noise directly into input images, often degrading visual quality, our method preserves image fidelity while providing robustness guarantees. To formally connect noise levels in the feature space with corresponding input-space perturbations, we analyze the maximum singular value of the backbone network's Jacobian. Our approach supports both full-reference (FR) and no-reference (NR) IQA models without requiring any architectural modifications, suitable for various scenarios. It is also computationally efficient, requiring a single backbone forward pass per image. Compared to previous methods, it reduces inference time by 99.5% without certification and by 20.6% when certification is applied. We validate our method with extensive experiments on two benchmark datasets, involving six widely-used FR and NR IQA models and comparisons against five state-of-the-art certified defenses. 
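To make the link between feature-space noise and input-space perturbations concrete, here is a minimal sketch under a simplifying assumption: the backbone is treated as a linear map, so its Jacobian is a fixed matrix and its largest singular value bounds how much an input perturbation can grow in feature space. A feature-space radius then translates to an input-space radius by dividing by that singular value. The helper names, the toy Jacobian, and the radius value are hypothetical, and a real nonlinear backbone would need a bound on the Jacobian norm over the whole perturbation region, not a single matrix.

```python
import math


def spectral_norm(matrix, iterations=100):
    """Estimate the largest singular value by power iteration on M^T M.

    Assumes a non-zero matrix and a start vector that is not orthogonal
    to the top right-singular vector.
    """
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        mv = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        mtmv = [sum(matrix[i][j] * mv[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in mtmv))
        v = [x / norm for x in mtmv]
    mv = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in mv))


def input_radius(feature_radius, jacobian):
    """Convert a feature-space L2 radius into an input-space L2 radius."""
    return feature_radius / spectral_norm(jacobian)


# Toy Jacobian of a linear "backbone" (hypothetical values).
jacobian = [[3.0, 0.0], [0.0, 1.0]]
sigma_max = spectral_norm(jacobian)
eps_input = input_radius(0.6, jacobian)
```

In this toy case the largest singular value is 3, so a feature-space radius of 0.6 corresponds to an input-space radius of about 0.2: any input perturbation with L2 norm below that stays inside the feature-space ball.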
Our results demonstrate consistent improvements in correlation with subjective quality scores by up to 30.9%.
[130] Leveraging AI to Accelerate Clinical Data Cleaning: A Comparative Study of AI-Assisted vs. Traditional Methods
Matthew Purri,Amit Patel,Erik Deurrell
Main category: cs.CV
TL;DR: Octozi, an AI-assisted platform, significantly improves clinical trial data cleaning efficiency and accuracy, reducing errors and workload across reviewers of all experience levels.
Details
Motivation: Clinical trial data cleaning is a critical bottleneck due to exponentially increasing data volumes and complexity, which manual processes struggle to manage. Method: Octozi, an AI-assisted platform combining large language models with domain-specific heuristics, was evaluated through a controlled experimental study involving experienced clinical reviewers (n=10). Result: AI assistance increased data cleaning throughput by 6.03-fold, decreased cleaning errors from 54.67% to 8.48%, and reduced false positive queries by 15.48-fold, with consistent improvements across reviewers regardless of experience level. Conclusion: AI-assisted approaches like Octozi can significantly improve efficiency and accuracy in clinical trial data cleaning, suggesting a transformative potential for human-AI collaboration in pharmaceutical trials. Abstract: Clinical trial data cleaning represents a critical bottleneck in drug development, with manual review processes struggling to manage exponentially increasing data volumes and complexity. This paper presents Octozi, an artificial intelligence-assisted platform that combines large language models with domain-specific heuristics to transform clinical data review. In a controlled experimental study with experienced clinical reviewers (n=10), we demonstrate that AI assistance increased data cleaning throughput by 6.03-fold while simultaneously decreasing cleaning errors from 54.67% to 8.48% (a 6.44-fold improvement). Crucially, the system reduced false positive queries by 15.48-fold, minimizing unnecessary site burden. These improvements were consistent across reviewers regardless of experience level, suggesting broad applicability. Our findings indicate that AI-assisted approaches can address fundamental inefficiencies in clinical trial operations, potentially accelerating drug development timelines and reducing costs while maintaining regulatory compliance. 
This work establishes a framework for integrating AI into safety-critical clinical workflows and demonstrates the transformative potential of human-AI collaboration in pharmaceutical clinical trials.
[131] Optimal Brain Connection: Towards Efficient Structural Pruning
Shaowu Chen,Wei Ma,Binhua Huang,Qingyuan Wang,Guoxin Wang,Weize Sun,Lei Huang,Deepu John
Main category: cs.CV
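TL;DR: This paper proposes Optimal Brain Connection, a new structural pruning framework comprising the Jacobian Criterion and the Equivalent Pruning mechanism, to address existing methods' neglect of the interconnections among parameters.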
Details
Motivation: Existing structural pruning methods often neglect the interconnections among parameters; to overcome this limitation, the paper proposes a structural pruning framework termed Optimal Brain Connection. Method: First, the authors introduce the Jacobian Criterion, a first-order metric for evaluating the saliency of structural parameters. Second, they propose the Equivalent Pruning mechanism, which uses autoencoders to retain the contributions of all original connections (including pruned ones) during fine-tuning. Result: Experiments show that the Jacobian Criterion outperforms several popular metrics in preserving model performance, and that the Equivalent Pruning mechanism effectively mitigates performance degradation after fine-tuning. Conclusion: Optimal Brain Connection, which combines the Jacobian Criterion with Equivalent Pruning, addresses the neglect of parameter interconnections in existing structural pruning methods. Abstract: Structural pruning has been widely studied for its effectiveness in compressing neural networks. However, existing methods often neglect the interconnections among parameters. To address this limitation, this paper proposes a structural pruning framework termed Optimal Brain Connection. First, we introduce the Jacobian Criterion, a first-order metric for evaluating the saliency of structural parameters. Unlike existing first-order methods that assess parameters in isolation, our criterion explicitly captures both intra-component interactions and inter-layer dependencies. Second, we propose the Equivalent Pruning mechanism, which utilizes autoencoders to retain the contributions of all original connections, including pruned ones, during fine-tuning. Experimental results demonstrate that the Jacobian Criterion outperforms several popular metrics in preserving model performance, while the Equivalent Pruning mechanism effectively mitigates performance degradation after fine-tuning. Code: https://github.com/ShaowuChen/Optimal_Brain_Connection
[132] When Deepfake Detection Meets Graph Neural Network: a Unified and Lightweight Learning Framework
Haoyu Liu,Chaoyu Gong,Mengke He,Jiate Li,Kai Han,Siqiang Luo
Main category: cs.CV
TL;DR: This paper proposes SSTGNN, a lightweight graph neural network framework for detecting AI-generated and manipulated videos by analyzing spatial, spectral, and temporal inconsistencies.
Details
Motivation: The motivation is to address the urgent challenge of detecting AI-generated and manipulated videos, where existing methods struggle due to reliance on isolated spatial, temporal, or spectral information and require large models. Method: The method introduces SSTGNN, a Spatial-Spectral-Temporal Graph Neural Network that represents videos as structured graphs. It integrates learnable spectral filters and temporal differential modeling to capture subtle manipulation traces across multiple dimensions. Result: Experiments show that SSTGNN achieves superior performance in both in-domain and cross-domain settings, with strong robustness against unseen manipulations and using up to 42.4× fewer parameters than state-of-the-art models. Conclusion: SSTGNN is an effective and lightweight solution for detecting AI-generated and manipulated videos, offering improved generalization, robustness, and efficiency over existing approaches. Abstract: The proliferation of generative video models has made detecting AI-generated and manipulated videos an urgent challenge. Existing detection approaches often fail to generalize across diverse manipulation types due to their reliance on isolated spatial, temporal, or spectral information, and typically require large models to perform well. This paper introduces SSTGNN, a lightweight Spatial-Spectral-Temporal Graph Neural Network framework that represents videos as structured graphs, enabling joint reasoning over spatial inconsistencies, temporal artifacts, and spectral distortions. SSTGNN incorporates learnable spectral filters and temporal differential modeling into a graph-based architecture, capturing subtle manipulation traces more effectively. Extensive experiments on diverse benchmark datasets demonstrate that SSTGNN not only achieves superior performance in both in-domain and cross-domain settings, but also offers strong robustness against unseen manipulations. 
Remarkably, SSTGNN accomplishes these results with up to 42.4× fewer parameters than state-of-the-art models, making it highly lightweight and scalable for real-world deployment.
[133] AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety
Adi Levi,Or Levi,Sardhendu Mishra,Jonathan Morra
Main category: cs.CV
TL;DR: This paper explores the use of Multimodal Large Language Models (MLLMs) in brand safety classification for content moderation, showing their effectiveness compared to human reviewers while highlighting their limitations. A new dataset is also introduced to support future research.
Details
Motivation: The exponential growth of online video content has made moderation challenging for humans, both operationally and concerning mental health. Although MLLMs have shown promise in video understanding, their application to nuanced multimodal content moderation remains underexplored. Method: A comparative analysis was conducted to evaluate the effectiveness of MLLMs like Gemini, GPT, and Llama in multimodal brand safety classification. The analysis included assessing accuracy and cost efficiency compared to human reviewers. Result: MLLMs demonstrated effectiveness in brand safety classification, with results showing their potential as a viable solution for content moderation when compared to professional human reviewers. Conclusion: The study concludes that MLLMs are effective in multimodal brand safety classification, offering a promising solution for content moderation, though they still have limitations and failure cases. The dataset introduced in this work is made publicly available for future research. Abstract: As the volume of video content online grows exponentially, the demand for moderation of unsafe videos has surpassed human capabilities, posing both operational and mental health challenges. While recent studies demonstrated the merits of Multimodal Large Language Models (MLLMs) in various video understanding tasks, their application to multimodal content moderation, a domain that requires nuanced understanding of both visual and textual cues, remains relatively underexplored. In this work, we benchmark the capabilities of MLLMs in brand safety classification, a critical subset of content moderation for safe-guarding advertising integrity. To this end, we introduce a novel, multimodal and multilingual dataset, meticulously labeled by professional reviewers in a multitude of risk categories. 
Through a detailed comparative analysis, we demonstrate the effectiveness of MLLMs such as Gemini, GPT, and Llama in multimodal brand safety, and evaluate their accuracy and cost efficiency compared to professional human reviewers. Furthermore, we present an in-depth discussion shedding light on limitations of MLLMs and failure cases. We are releasing our dataset alongside this paper to facilitate future research on effective and responsible brand safety and content moderation.
[134] Looking into the Unknown: Exploring Action Discovery for Segmentation of Known and Unknown Actions
Federico Spurio,Emad Bahrami,Olga Zatsarynna,Yazan Abu Farha,Gianpiero Francesca,Juergen Gall
Main category: cs.CV
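TL;DR: This paper proposes Action Discovery, a new setup that addresses ambiguous action definitions and incomplete annotations in partially labeled datasets. It uses the annotations of known actions to guide the temporal and semantic granularity of unknown action segments, improving action segmentation.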
Details
Motivation: In many domains, such as neuroscience and behavior analysis, datasets contain actions that are ambiguously defined or incompletely annotated. Traditional methods struggle with such cases, so a new approach is needed to better identify and classify unknown actions. Method: The paper proposes a two-step approach: 1) the Granularity-Guided Segmentation Module (GGSM), which identifies temporal intervals for both known and unknown actions by mimicking the temporal granularity of annotated actions; 2) the Unknown Action Segment Assignment (UASA), which identifies semantic classes within the unknown actions based on learned embedding similarities. Result: Experiments on three challenging datasets (Breakfast, 50Salads, and Desktop Assembly) show that the method considerably improves upon existing baselines. Conclusion: The proposed Action Discovery approach effectively handles ambiguous actions in partially labeled datasets and offers a new direction for Temporal Action Segmentation. Abstract: We introduce Action Discovery, a novel setup within Temporal Action Segmentation that addresses the challenge of defining and annotating ambiguous actions and incomplete annotations in partially labeled datasets. In this setup, only a subset of actions - referred to as known actions - is annotated in the training data, while other unknown actions remain unlabeled. This scenario is particularly relevant in domains like neuroscience, where well-defined behaviors (e.g., walking, eating) coexist with subtle or infrequent actions that are often overlooked, as well as in applications where datasets are inherently partially annotated due to ambiguous or missing labels. To address this problem, we propose a two-step approach that leverages the known annotations to guide both the temporal and semantic granularity of unknown action segments. First, we introduce the Granularity-Guided Segmentation Module (GGSM), which identifies temporal intervals for both known and unknown actions by mimicking the granularity of annotated actions. Second, we propose the Unknown Action Segment Assignment (UASA), which identifies semantically meaningful classes within the unknown actions, based on learned embedding similarities. We systematically explore the proposed setting of Action Discovery on three challenging datasets - Breakfast, 50Salads, and Desktop Assembly - demonstrating that our method considerably improves upon existing baselines.
[135] Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis
Kunyu Feng,Yue Ma,Xinhua Zhang,Boshi Liu,Yikuang Yuluo,Yinhan Zhang,Runtao Liu,Hongyu Liu,Zhiyuan Qin,Shanhui Mo,Qifeng Chen,Zeyu Wang
Main category: cs.CV
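TL;DR: Follow-Your-Instruction is a framework based on Multimodal Large Language Models that automatically synthesizes high-quality 2D, 3D, and 4D data, addressing the cost and inefficiency of large-scale real-world data collection and significantly improving the performance of generative models.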
Details
Motivation: As demand for AI-generated content grows, the need for high-quality, diverse, and scalable data has become increasingly important, yet collecting large-scale real-world data remains costly and time-consuming. Method: The Follow-Your-Instruction framework collects assets and their descriptions with the MLLM-Collector, constructs 3D layouts and performs semantic refinement with the MLLM-Generator and MLLM-Optimizer, and finally uses the MLLM-Planner to generate temporally coherent future frames. Result: Experiments show that the generated data significantly boosts the performance of existing baseline models. Conclusion: Follow-Your-Instruction shows potential as a scalable and effective data engine for generative intelligence, able to synthesize high-quality 2D, 3D, and 4D data. Abstract: With the growing demands of AI-generated content (AIGC), the need for high-quality, diverse, and scalable data has become increasingly crucial. However, collecting large-scale real-world data remains costly and time-consuming, hindering the development of downstream applications. While some works attempt to collect task-specific data via a rendering process, most approaches still rely on manual scene construction, limiting their scalability and accuracy. To address these challenges, we propose Follow-Your-Instruction, a Multimodal Large Language Model (MLLM)-driven framework for automatically synthesizing high-quality 2D, 3D, and 4D data. Our Follow-Your-Instruction first collects assets and their associated descriptions through multimodal inputs using the MLLM-Collector. Then it constructs 3D layouts, and leverages Vision-Language Models (VLMs) for semantic refinement through multi-view scenes with the MLLM-Generator and MLLM-Optimizer, respectively. Finally, it uses MLLM-Planner to generate temporally coherent future frames. We evaluate the quality of the generated data through comprehensive experiments on the 2D, 3D, and 4D generative tasks. The results show that our synthetic data significantly boosts the performance of existing baseline models, demonstrating Follow-Your-Instruction's potential as a scalable and effective data engine for generative intelligence.
[136] DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition
Haijing Liu,Tao Pu,Hefeng Wu,Keze Wang,Liang Lin
Main category: cs.CV
TL;DR: DART improves Open-Vocabulary Multi-Label Recognition by integrating external relational knowledge from LLMs and refining features under weak supervision, achieving superior performance.
Details
Motivation: OV-MLR requires both precise intra-class localization and effective inter-class reasoning, but existing Vision-Language Pre-training models struggle with fine-grained localization under weak supervision and fail to leverage structured relational knowledge, limiting performance especially for unseen classes. Method: DART enhances a frozen VLP backbone using two adaptive modules: Adaptive Refinement Module (ARM) with Weakly Supervised Patch Selecting (WPS) loss for intra-class refinement, and Adaptive Transfer Module (ATM) leveraging a Class Relationship Graph (CRG) via graph attention network for inter-class transfer. Result: DART achieves new state-of-the-art performance on challenging benchmarks, validating its effectiveness. Conclusion: DART is the first framework to explicitly integrate external LLM-derived relational knowledge for adaptive inter-class transfer while performing adaptive intra-class refinement under weak supervision for OV-MLR. Abstract: Open-Vocabulary Multi-Label Recognition (OV-MLR) aims to identify multiple seen and unseen object categories within an image, requiring both precise intra-class localization to pinpoint objects and effective inter-class reasoning to model complex category dependencies. While Vision-Language Pre-training (VLP) models offer a strong open-vocabulary foundation, they often struggle with fine-grained localization under weak supervision and typically fail to explicitly leverage structured relational knowledge beyond basic semantics, limiting performance especially for unseen classes. To overcome these limitations, we propose the Dual Adaptive Refinement Transfer (DART) framework. DART enhances a frozen VLP backbone via two synergistic adaptive modules. For intra-class refinement, an Adaptive Refinement Module (ARM) refines patch features adaptively, coupled with a novel Weakly Supervised Patch Selecting (WPS) loss that enables discriminative localization using only image-level labels. 
Concurrently, for inter-class transfer, an Adaptive Transfer Module (ATM) leverages a Class Relationship Graph (CRG), constructed using structured knowledge mined from a Large Language Model (LLM), and employs a graph attention network to adaptively transfer relational information between class representations. DART is the first framework, to our knowledge, to explicitly integrate external LLM-derived relational knowledge for adaptive inter-class transfer while simultaneously performing adaptive intra-class refinement under weak supervision for OV-MLR. Extensive experiments on challenging benchmarks demonstrate that our DART achieves new state-of-the-art performance, validating its effectiveness.
[137] WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction
Shaobin Zhuang,Yiwei Guo,Canmiao Fu,Zhipeng Huang,Zeyue Tian,Ying Zhang,Chen Li,Yali Wang
Main category: cs.CV
TL;DR: WeTok is a visual tokenizer that offers superior compression and reconstruction fidelity compared to existing tokenizers. It uses two core innovations, Group-wise lookup-free Quantization (GQ) and Generative Decoding (GD), to overcome memory and computation limitations of prior tokenizers while achieving a breakthrough in reconstruction with more scalable codebooks.
Details
Motivation: Existing visual tokenizers face an unsatisfactory trade-off between compression ratios and reconstruction fidelity. Method: WeTok tokenizer uses two core innovations: Group-wise lookup-free Quantization (GQ) and Generative Decoding (GD). Result: On the ImageNet 50k validation set, WeTok achieves a record-low zero-shot rFID score of 0.12 and outperforms Cosmos in compression rate and rFID score. Conclusion: WeTok tokenizer provides superior performance in compression and reconstruction fidelity compared to previous tokenizers. Abstract: Visual tokenizer is a critical component for vision generation. However, the existing tokenizers often face unsatisfactory trade-off between compression ratios and reconstruction fidelity. To fill this gap, we introduce a powerful and concise WeTok tokenizer, which surpasses the previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups, and perform lookup-free quantization for each group. As a result, GQ can efficiently overcome memory and computation limitations of prior tokenizers, while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoding (GD). Different from prior tokenizers, we introduce a generative decoder with a prior of extra noise variable. In this case, GD can probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratios. Extensive experiments on mainstream benchmarks show superior performance of our WeTok. On the ImageNet 50k validation set, WeTok achieves a record-low zero-shot rFID (WeTok: 0.12 vs. FLUX-VAE: 0.18 vs. SD-VAE 3.5: 0.19). Furthermore, our highest compression model achieves a zero-shot rFID of 3.49 with a compression ratio of 768, outperforming Cosmos (384) 4.57 which has only 50% compression rate of ours. 
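The grouping idea behind GQ can be sketched in a few lines. The example below is a minimal illustration, not the WeTok implementation: it assumes the common sign-based form of lookup-free quantization, where each latent channel is mapped to +1 or -1 and the resulting bit pattern is read as a token index, and it applies that independently to each group of channels. The function name, the example latent vector, and the group count are hypothetical.

```python
def group_lookup_free_quantize(latent, num_groups):
    """Quantize a latent vector group by group without a stored codebook.

    Each channel is binarized by sign (assumed LFQ variant), and the bits
    of each group are packed into one integer token index.
    Returns (quantized values in {-1.0, +1.0}, one token index per group).
    """
    if len(latent) % num_groups != 0:
        raise ValueError("latent length must be divisible by num_groups")
    group_size = len(latent) // num_groups
    quantized, indices = [], []
    for g in range(num_groups):
        group = latent[g * group_size:(g + 1) * group_size]
        bits = [1 if value > 0 else 0 for value in group]
        quantized.extend(1.0 if bit else -1.0 for bit in bits)
        index = 0
        for bit in bits:  # most significant bit first
            index = (index << 1) | bit
        indices.append(index)
    return quantized, indices


# Toy 8-channel latent split into 2 groups of 4 channels (hypothetical values).
latent = [0.7, -0.2, 1.5, -0.9, -0.1, 0.3, 0.4, -2.0]
codes, token_ids = group_lookup_free_quantize(latent, num_groups=2)
```

Here each group has an implicit codebook of 2^4 = 16 entries, so the two groups together cover 16^2 = 256 combinations while any per-group computation only ever touches 16 codes, which is the kind of saving that makes larger codebooks tractable.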
Code and models are available: https://github.com/zhuangshaobin/WeTok.
[138] LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model
Tao Sun,Oliver Liu,JinJin Li,Lan Ma
Main category: cs.CV
TL;DR: LLaVA-RE is an attempt at binary image-text relevancy evaluation using a Multimodal Large Language Model.
Details
Motivation: Effectively evaluating the relevancy between images and text is important for improving multimodal generative AI, and it is a challenging task because text formats are diverse and the definition of relevancy varies across scenarios. Method: The paper presents the LLaVA-RE framework, which builds on the LLaVA architecture and adopts detailed task instructions and multimodal in-context samples. It also proposes a new binary relevancy dataset covering various tasks. Result: Experimental results demonstrate the effectiveness of the proposed LLaVA-RE framework. Conclusion: LLaVA-RE is a successful first attempt at binary image-text relevancy evaluation with a Multimodal Large Language Model and can contribute to future related research and applications. Abstract: Multimodal generative AI usually involves generating image or text responses given inputs in another modality. The evaluation of image-text relevancy is essential for measuring response quality or ranking candidate responses. In particular, binary relevancy evaluation, i.e., "Relevant" vs. "Not Relevant", is a fundamental problem. However, this is a challenging task considering that texts have diverse formats and the definition of relevancy varies in different scenarios. We find that Multimodal Large Language Models (MLLMs) are an ideal choice to build such evaluators, as they can flexibly handle complex text formats and take in additional task information. In this paper, we present LLaVA-RE, a first attempt at binary image-text relevancy evaluation with an MLLM. It follows the LLaVA architecture and adopts detailed task instructions and multimodal in-context samples. In addition, we propose a novel binary relevancy data set that covers various tasks. Experimental results validate the effectiveness of our framework.
[139] Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity
Yuhan Zhang,Long Zhuo,Ziyang Chu,Tong Wu,Zhibing Li,Liang Pan,Dahua Lin,Ziwei Liu
Main category: cs.CV
TL;DR: This paper introduces Hi3DEval, a hierarchical framework for evaluating 3D generative content, combining object-level and part-level assessments to capture spatial coherence, material authenticity, and local details more effectively than existing methods.
Details
Motivation: Despite rapid advances in 3D content generation, quality assessment remains challenging as existing methods rely on image-based metrics and operate solely at the object level, limiting their ability to capture spatial coherence, material authenticity, and high-fidelity local details. Method: The authors introduce Hi3DEval, a hierarchical evaluation framework tailored for 3D generative content. They construct Hi3DBench, a large-scale dataset with diverse 3D assets and annotations, and propose a 3D-aware automated scoring system based on hybrid 3D representations. Video-based representations are used for object-level and material-subject evaluations, while pretrained 3D features are employed for part-level perception. Result: Extensive experiments demonstrate that Hi3DEval outperforms existing image-based metrics in modeling 3D characteristics and achieves superior alignment with human preference, providing a scalable alternative to manual evaluations. Conclusion: Hi3DEval is a hierarchical evaluation framework for 3D generative content that combines object-level and part-level evaluation, enabling holistic assessments and fine-grained quality analysis. The proposed framework outperforms existing image-based metrics in modeling 3D characteristics and aligns well with human preference. Abstract: Despite rapid advances in 3D content generation, quality assessment for the generated 3D assets remains challenging. Existing methods mainly rely on image-based metrics and operate solely at the object level, limiting their ability to capture spatial coherence, material authenticity, and high-fidelity local details. 1) To address these challenges, we introduce Hi3DEval, a hierarchical evaluation framework tailored for 3D generative content. It combines both object-level and part-level evaluation, enabling holistic assessments across multiple dimensions as well as fine-grained quality analysis. 
Additionally, we extend texture evaluation beyond aesthetic appearance by explicitly assessing material realism, focusing on attributes such as albedo, saturation, and metallicness. 2) To support this framework, we construct Hi3DBench, a large-scale dataset comprising diverse 3D assets and high-quality annotations, accompanied by a reliable multi-agent annotation pipeline. We further propose a 3D-aware automated scoring system based on hybrid 3D representations. Specifically, we leverage video-based representations for object-level and material-subject evaluations to enhance modeling of spatio-temporal consistency and employ pretrained 3D features for part-level perception. Extensive experiments demonstrate that our approach outperforms existing image-based metrics in modeling 3D characteristics and achieves superior alignment with human preference, providing a scalable alternative to manual evaluations. The project page is available at https://zyh482.github.io/Hi3DEval/.
[140] MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes
Henghui Ding,Kaining Ying,Chang Liu,Shuting He,Xudong Jiang,Yu-Gang Jiang,Philip H. S. Torr,Song Bai
Main category: cs.CV
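TL;DR: MOSEv2 is a new dataset for video object segmentation that captures more real-world complexity, such as adverse weather, occlusions, and nighttime scenes, and exposes the limitations of existing methods in practical settings.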
Details
Motivation: Existing datasets such as DAVIS and YouTube-VOS primarily contain salient, dominant, and isolated objects, which limits the generalization of video object segmentation (VOS) methods to real-world scenarios. Advancing VOS toward more realistic environments requires more challenging datasets. Method: The authors build a dataset of 5,024 videos and over 701,976 high-quality masks covering 10,074 objects across 200 categories, and introduce additional real-world challenges. Result: On MOSEv2, the performance of VOS methods, including SAM2, drops significantly; for example, SAM2 scores 76.4% on MOSEv1 but only 50.9% on MOSEv2. Conclusion: MOSEv2 is a more challenging video object segmentation dataset intended to drive research on real-world scenes, and the results show that existing methods still fall well short when facing real-world complexity. Abstract: Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F) on existing benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To advance VOS toward more realistic environments, coMplex video Object SEgmentation (MOSEv1) was introduced to facilitate VOS research in complex scenes. Building on the strengths and limitations of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and over 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces significantly greater scene complexity, including more frequent object disappearance and reappearance, severe occlusions and crowding, smaller objects, as well as a range of new challenges such as adverse weather (e.g., rain, snow, fog), low-light scenes (e.g., nighttime, underwater), multi-shot sequences, camouflaged objects, non-physical targets (e.g., shadows, reflections), scenarios requiring external knowledge, etc. We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and find similar declines, demonstrating that MOSEv2 presents challenges across tasks.
These results highlight that despite high accuracy on existing datasets, current VOS methods still struggle under real-world complexities. MOSEv2 is publicly available at https://MOSE.video.
[141] GAP: Gaussianize Any Point Clouds with Text Guidance
Weiqi Zhang,Junsheng Zhou,Haotian Geng,Wenyuan Zhang,Yu-Shen Liu
Main category: cs.CV
TL;DR: This paper proposes GAP, a new method that uses text guidance to convert raw point clouds into high-fidelity 3D Gaussians.
Details
Motivation: Directly generating Gaussians from colorless 3D point clouds remains an unsolved challenge; GAP aims to bridge the gap between point clouds and Gaussian representations. Method: A multi-view optimization framework uses a depth-aware image diffusion model to synthesize consistent appearances across viewpoints. A surface-anchoring mechanism ensures geometric accuracy, and a diffusion-based inpainting strategy completes hard-to-observe regions. Result: GAP performs well on the Point-to-Gaussian generation task, from synthetic point clouds to challenging real-world scans and large-scale scenes. Conclusion: GAP offers an effective way to convert colorless 3D point clouds into high-quality 3D Gaussians and performs well across a range of complex scenes. Abstract: 3D Gaussian Splatting (3DGS) has demonstrated its advantages in achieving fast and high-quality rendering. As point clouds serve as a widely-used and easily accessible form of 3D representation, bridging the gap between point clouds and Gaussians becomes increasingly important. Recent studies have explored how to convert the colored points into Gaussians, but directly generating Gaussians from colorless 3D point clouds remains an unsolved challenge. In this paper, we propose GAP, a novel approach that gaussianizes raw point clouds into high-fidelity 3D Gaussians with text guidance. Our key idea is to design a multi-view optimization framework that leverages a depth-aware image diffusion model to synthesize consistent appearances across different viewpoints. To ensure geometric accuracy, we introduce a surface-anchoring mechanism that effectively constrains Gaussians to lie on the surfaces of 3D shapes during optimization. Furthermore, GAP incorporates a diffuse-based inpainting strategy that specifically targets at completing hard-to-observe regions. We evaluate GAP on the Point-to-Gaussian generation task across varying complexity levels, from synthetic point clouds to challenging real-world scans, and even large-scale scenes. Project Page: https://weiqi-zhang.github.io/GAP.
[142] FaceAnonyMixer: Cancelable Faces via Identity Consistent Latent Space Mixing
Mohammed Talha Alam,Fahad Shamshad,Fakhri Karray,Karthik Nandakumar
Main category: cs.CV
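TL;DR: This paper proposes FaceAnonyMixer, a new method for generating cancelable, privacy-preserving face images that achieves good recognition accuracy and strong privacy protection.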