cs.CL

[1] Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM

Thomas Thebaud,Yen-Ju Lu,Matthew Wiesner,Peter Viechnicki,Najim Dehak

Main category: cs.CL

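TL;DR: This paper proposes a method that couples frozen audio foundation models with a LLAMA language model to add speaker-characteristic metadata tags to dialogue transcripts, without task-specific fine-tuning.

Details Motivation: In dialogue transcription pipelines, LLMs are commonly used in post-processing to improve grammar, punctuation, and readability. This paper explores a complementary post-processing step: enriching transcribed dialogues with metadata tags for speaker characteristics such as age, gender, and emotion. Method: Audio foundation models (such as Whisper or WavLM) are coupled with a LLAMA language model, using lightweight connectors to bridge audio and language representations. Result: The approach achieves competitive performance on speaker profiling tasks and shows that a frozen LLAMA model can compare x-vectors directly, reaching an Equal Error Rate of 8.8% in some scenarios. Conclusion: Coupling frozen audio foundation models with a LLAMA language model enables effective speaker attribute inference while preserving modularity and speed. Abstract: In dialogue transcription pipelines, Large Language Models (LLMs) are frequently employed in post-processing to improve grammar, punctuation, and readability. We explore a complementary post-processing step: enriching transcribed dialogues by adding metadata tags for speaker characteristics such as age, gender, and emotion. Some of the tags are global to the entire dialogue, while some are time-variant. Our approach couples frozen audio foundation models, such as Whisper or WavLM, with a frozen LLAMA language model to infer these speaker attributes, without requiring task-specific fine-tuning of either model. Using lightweight, efficient connectors to bridge audio and language representations, we achieve competitive performance on speaker profiling tasks while preserving modularity and speed. Additionally, we demonstrate that a frozen LLAMA model can compare x-vectors directly, achieving an Equal Error Rate of 8.8% in some scenarios.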
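For readers unfamiliar with the Equal Error Rate quoted in this entry, the sketch below shows one common way to estimate it from verification scores. It is a minimal illustration, not the authors' evaluation code: the function name `equal_error_rate` and the toy scores are assumptions, and higher scores are assumed to mean "same speaker".

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER by sweeping thresholds over all observed scores.

    genuine: scores for same-speaker trials; impostor: scores for
    different-speaker trials. Higher score = more likely the same speaker.
    Returns (eer, threshold) at the point where the false acceptance and
    false rejection rates are closest.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        # False rejection: genuine trials scored below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        # False acceptance: impostor trials scored at or above the threshold.
        far = sum(s >= t for s in impostor) / len(impostor)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (illustrative only, not data from the paper).
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine, impostor)
```

An EER of 8.8% therefore means that, at the operating threshold where both error types are equal, about 8.8% of trials of each kind are misclassified.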

[2] Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

Negar Foroutan,Clara Meister,Debjit Paul,Joel Niklaus,Sina Ahmadi,Antoine Bosselut,Rico Sennrich

Main category: cs.CL

TL;DR: This paper introduces Parity-aware BPE to address inequality in multilingual tokenization, ensuring fairer compression across languages with minimal trade-offs in efficiency and performance.

Details Motivation: Standard tokenization methods favor high-frequency languages, leading to disproportionate tokenization for lower-resource languages and amplifying computational and financial inequalities. Method: Parity-aware BPE algorithm, a variant of the BPE algorithm, was introduced, which maximizes the compression gain for the worst-compressed language at every merge step. Result: Empirical findings indicate that Parity-aware BPE results in more equitable token counts across languages, with negligible effects on global compression rates and no significant impact on downstream language-model performance. Conclusion: Parity-aware BPE provides a solution to the inequality problem in tokenization for different languages, achieving more equitable token counts with minimal impact on global compression and language-model performance. Abstract: Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently leave lower-resource languages with tokenizations that are disproportionately longer, morphologically implausible, or even riddled with placeholders. This phenomenon ultimately amplifies computational and financial inequalities between users from different language backgrounds. To remedy this, we introduce Parity-aware Byte Pair Encoding (BPE), a variant of the widely-used BPE algorithm. At every merge step, Parity-aware BPE maximizes the compression gain of the currently worst-compressed language, trading a small amount of global compression for cross-lingual parity. We find empirically that Parity-aware BPE leads to more equitable token counts across languages, with negligible impact on global compression rate and no substantial effect on language-model performance in downstream tasks.
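The merge rule described in this entry can be illustrated with a toy sketch. This is not the authors' implementation: it assumes compression is measured as tokens per character, approximates "compression gain" by raw pair frequency in the worst-compressed language, and uses hypothetical helper names (`pair_counts`, `apply_merge`, `tokens_per_char`, `parity_aware_bpe`).

```python
from collections import Counter


def pair_counts(words):
    """Count adjacent symbol pairs over a list of tokenized words."""
    counts = Counter()
    for w in words:
        counts.update(zip(w, w[1:]))
    return counts


def apply_merge(words, pair):
    """Replace every left-to-right occurrence of `pair` with one merged symbol."""
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(w[i] + w[i + 1])
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(tuple(out))
    return merged


def tokens_per_char(words):
    """Assumed compression measure: lower means better compressed."""
    return sum(len(w) for w in words) / sum(len("".join(w)) for w in words)


def parity_aware_bpe(corpora, num_merges):
    """corpora maps a language name to a list of words (plain strings)."""
    corpora = {lang: [tuple(w) for w in ws] for lang, ws in corpora.items()}
    merges = []
    for _ in range(num_merges):
        # Pick the currently worst-compressed language ...
        worst = max(corpora, key=lambda lang: tokens_per_char(corpora[lang]))
        counts = pair_counts(corpora[worst])
        if not counts:
            break
        # ... and take the merge that helps that language most (by frequency).
        pair = max(counts, key=counts.get)
        merges.append(pair)
        # The learned merge is applied to every language's corpus.
        corpora = {lang: apply_merge(ws, pair) for lang, ws in corpora.items()}
    return merges, corpora


# Toy data: a dominant language and a small one (illustrative only).
toy = {"high": ["aaaa"] * 8, "low": ["xyxy", "xyz"]}
merges, encoded = parity_aware_bpe(toy, num_merges=3)
```

In this toy run, frequency-only BPE would keep spending merges on the dominant language, whereas the parity-aware rule turns to the small language as soon as it becomes the worst compressed.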

[3] Pitch Accent Detection improves Pretrained Automatic Speech Recognition

David Sasu,Natalie Schluter

Main category: cs.CL

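TL;DR: This paper introduces a joint automatic speech recognition and pitch accent detection model that substantially improves ASR systems built on semi-supervised speech representations while also advancing pitch accent detection.

Details Motivation: To show how a complementary pitch accent detection module can improve the performance of automatic speech recognition (ASR) systems that use semi-supervised speech representations. Method: A joint ASR and pitch accent detection model is introduced to show that the performance of ASR systems using semi-supervised speech representations can be improved. Result: The pitch accent detection component achieves a significant improvement over the state of the art for the task, closing the gap in F1-score by 41%; ASR in joint training reduces word error rate (WER) by 28.3% on LibriSpeech. Conclusion: The results demonstrate the importance of extending pretrained speech models to retain or re-learn prosodic cues such as pitch accent. Abstract: We show that the performance of Automatic Speech Recognition (ASR) systems that use semi-supervised speech representations can be boosted by a complementary pitch accent detection module, by introducing a joint ASR and pitch accent detection model. The pitch accent detection component of our model achieves a significant improvement on the state-of-the-art for the task, closing the gap in F1-score by 41%. Additionally, the ASR performance in joint training decreases WER by 28.3% on LibriSpeech, under limited resource fine-tuning. With these results, we show the importance of extending pretrained speech models to retain or re-learn important prosodic cues such as pitch accent.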

[4] Persistent Instability in LLM's Personality Measurements: Effects of Scale, Reasoning, and Conversation History

Tommaso Tosato,Saskia Helbling,Yorguin-Jose Mantilla-Ramos,Mahmood Hegazy,Alberto Tosato,David John Lemay,Irina Rish,Guillaume Dumas

Main category: cs.CL

TL;DR: The study shows that large language models exhibit substantial behavioral instability that persists even when prompt order is changed or interventions are applied, challenging current assumptions about safe model deployment.

Details Motivation: Large language models require consistent behavioral patterns for safe deployment, yet their personality-like traits remain poorly understood. Method: Using traditional (BFI-44, SD3) and novel LLM-adapted personality instruments, the authors systematically vary question order, paraphrasing, personas, and reasoning modes, evaluating 500,000+ responses from 25+ open-source models (1B-671B parameters). Result: (1) Even 400B+ models exhibit substantial response variability (SD > 0.4); (2) changing question order alone shifts personality measurements by up to 20%; (3) interventions expected to stabilize behavior can instead increase variability; (4) LLM-adapted instruments show instability comparable to human-centric ones, pointing to architectural rather than translational limitations. Conclusion: Current LLMs lack the foundations for behavioral consistency, and personality-based alignment strategies may not deliver the predictable behavior that safety-critical applications require. Abstract: Large language models require consistent behavioral patterns for safe deployment, yet their personality-like traits remain poorly understood. We present PERSIST (PERsonality Stability in Synthetic Text), a comprehensive evaluation framework testing 25+ open-source models (1B-671B parameters) across 500,000+ responses. Using traditional (BFI-44, SD3) and novel LLM-adapted personality instruments, we systematically vary question order, paraphrasing, personas, and reasoning modes. Our findings challenge fundamental deployment assumptions: (1) Even 400B+ models exhibit substantial response variability (SD > 0.4); (2) Minor prompt reordering alone shifts personality measurements by up to 20%; (3) Interventions expected to stabilize behavior, such as chain-of-thought reasoning, detailed persona instructions, and inclusion of conversation history, can paradoxically increase variability; (4) LLM-adapted instruments show equal instability to human-centric versions, confirming architectural rather than translational limitations. This persistent instability across scales and mitigation strategies suggests current LLMs lack the foundations for genuine behavioral consistency. For safety-critical applications requiring predictable behavior, these findings indicate that personality-based alignment strategies may be fundamentally inadequate.

[5] RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory

Jun Liu,Zhenglun Kong,Changdi Yang,Fan Yang,Tianqi Li,Peiyan Dong,Joannah Nanjekye,Hao Tang,Geng Yuan,Wei Niu,Wenbin Zhang,Pu Zhao,Xue Lin,Dong Huang,Yanzhi Wang

Main category: cs.CL

TL;DR: The paper introduces RCR-Router, a dynamic and efficient context routing framework for multi-agent LLM systems that reduces token usage while maintaining answer quality.

Details Motivation: The authors identified that most existing coordination schemes in multi-agent LLM systems rely on static or full-context routing strategies that lead to excessive token consumption, redundant memory exposure, and limited adaptability across interaction rounds. This prompted the need for a more efficient and adaptive collaboration method. Method: RCR-Router, a modular and role-aware context routing framework was introduced. It dynamically selects semantically relevant memory subsets for each agent based on its role and task stage while adhering to a strict token budget. A lightweight scoring policy guides memory selection, and agent outputs are integrated into a shared memory store for context refinement. Result: Experiments on three multi-hop QA benchmarks showed that RCR-Router reduced token usage by up to 30% while improving or maintaining answer quality. The proposed Answer Quality Score metric also captured LLM-generated explanations beyond standard QA accuracy. Conclusion: RCR-Router proves to be a more efficient and adaptive method for multi-agent LLM collaboration by reducing token usage while maintaining or improving answer quality. This highlights the importance of structured memory routing and output-aware evaluation in multi-agent LLM systems. Abstract: Multi-agent large language model (LLM) systems have shown strong potential in complex reasoning and collaborative decision-making tasks. However, most existing coordination schemes rely on static or full-context routing strategies, which lead to excessive token consumption, redundant memory exposure, and limited adaptability across interaction rounds. We introduce RCR-Router, a modular and role-aware context routing framework designed to enable efficient, adaptive collaboration in multi-agent LLMs. 
To our knowledge, this is the first routing approach that dynamically selects semantically relevant memory subsets for each agent based on its role and task stage, while adhering to a strict token budget. A lightweight scoring policy guides memory selection, and agent outputs are iteratively integrated into a shared memory store to facilitate progressive context refinement. To better evaluate model behavior, we further propose an Answer Quality Score metric that captures LLM-generated explanations beyond standard QA accuracy. Experiments on three multi-hop QA benchmarks -- HotPotQA, MuSiQue, and 2WikiMultihop -- demonstrate that RCR-Router reduces token usage (up to 30%) while improving or maintaining answer quality. These results highlight the importance of structured memory routing and output-aware evaluation in advancing scalable multi-agent LLM systems.
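The budgeted memory selection described in this entry can be sketched as a greedy filter. The paper's actual scoring policy is not given here, so the keyword-overlap score, the whitespace token count, and the name `route_context` are all assumptions made for illustration.

```python
def route_context(memory, role_keywords, token_budget):
    """Greedily pick the highest-scoring memory items that fit the budget.

    memory: list of text snippets in the shared memory store.
    role_keywords: set of lowercase words relevant to the agent's role
        (a hypothetical stand-in for a learned or heuristic scoring policy).
    token_budget: maximum number of tokens the agent may receive.
    """
    def score(item):
        # Assumed scoring policy: number of role keywords present in the item.
        return len(set(item.lower().split()) & role_keywords)

    def n_tokens(item):
        # Whitespace tokens stand in for a real tokenizer.
        return len(item.split())

    selected, used = [], 0
    for item in sorted(memory, key=score, reverse=True):
        cost = n_tokens(item)
        # Skip irrelevant items and anything that would exceed the budget.
        if score(item) > 0 and used + cost <= token_budget:
            selected.append(item)
            used += cost
    return selected


# Toy shared memory (illustrative only).
memory = [
    "Paris is the capital of France",
    "The Eiffel Tower opened in 1889",
    "Bananas are rich in potassium",
]
keywords = {"capital", "france", "tower"}
tight = route_context(memory, keywords, token_budget=8)
loose = route_context(memory, keywords, token_budget=12)
```

With the tighter budget only the most relevant snippet is routed to the agent; relaxing the budget admits the second one, and the unrelated snippet is never included.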

[6] I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations

Julia Kharchenko,Tanya Roosta,Aman Chadha,Chirag Shah

Main category: cs.CL

TL;DR: This paper presents a comprehensive benchmark for evaluating how large language models respond to linguistic shibboleths, revealing potential biases related to gender, social class, or regional background.

Details Motivation: To detect and measure linguistic discrimination in AI systems and to support fairness in automated decision-making. Method: Interview scenarios are simulated using 100 validated question-response pairs, with controlled linguistic variations generated to isolate specific phenomena while maintaining semantic equivalence. Result: Responses that use hedging language receive ratings that are 25.6% lower on average, and the benchmark is effective at identifying model-specific biases. Conclusion: The work establishes a foundational framework for detecting and measuring linguistic discrimination in AI systems, with broad applications. Abstract: This paper introduces a comprehensive benchmark for evaluating how Large Language Models (LLMs) respond to linguistic shibboleths: subtle linguistic markers that can inadvertently reveal demographic attributes such as gender, social class, or regional background. Through carefully constructed interview simulations using 100 validated question-response pairs, we demonstrate how LLMs systematically penalize certain linguistic patterns, particularly hedging language, despite equivalent content quality. Our benchmark generates controlled linguistic variations that isolate specific phenomena while maintaining semantic equivalence, which enables the precise measurement of demographic bias in automated evaluation systems. We validate our approach along multiple linguistic dimensions, showing that hedged responses receive 25.6% lower ratings on average, and demonstrate the benchmark's effectiveness in identifying model-specific biases. This work establishes a foundational framework for detecting and measuring linguistic discrimination in AI systems, with broad applications to fairness in automated decision-making contexts.

[7] Towards Robust Evaluation of Visual Activity Recognition: Resolving Verb Ambiguity with Sense Clustering

Louie Hong Yao,Nicholas Jarvis,Tianyu Jiang

Main category: cs.CL

TL;DR: This paper proposes a vision-language clustering framework to better evaluate visual activity recognition systems by addressing ambiguities in verb semantics and image interpretation, showing improved alignment with human judgments over standard methods.

Details Motivation: Standard exact-match evaluation methods fail to account for ambiguities in verb semantics and image interpretation, leading to an incomplete assessment of model performance. Method: A vision-language clustering framework was developed to construct verb sense clusters for evaluating activity recognition models, which was compared with standard evaluation approaches. Result: Analysis of the imSitu dataset revealed that each image maps to an average of 2.8 sense clusters, and cluster-based evaluation aligns better with human judgments, offering a more nuanced assessment. Conclusion: The proposed vision-language clustering framework provides a more robust and nuanced evaluation of visual activity recognition systems, aligning better with human judgments compared to standard evaluation methods. Abstract: Evaluating visual activity recognition systems is challenging due to inherent ambiguities in verb semantics and image interpretation. When describing actions in images, synonymous verbs can refer to the same event (e.g., brushing vs. grooming), while different perspectives can lead to equally valid but distinct verb choices (e.g., piloting vs. operating). Standard exact-match evaluation, which relies on a single gold answer, fails to capture these ambiguities, resulting in an incomplete assessment of model performance. To address this, we propose a vision-language clustering framework that constructs verb sense clusters, providing a more robust evaluation. Our analysis of the imSitu dataset shows that each image maps to an average of 2.8 sense clusters, with each cluster representing a distinct perspective of the image. We evaluate multiple activity recognition models and compare our cluster-based evaluation with standard evaluation methods. Additionally, our human alignment analysis suggests that the cluster-based evaluation better aligns with human judgements, offering a more nuanced assessment of model performance.
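The difference between exact-match and cluster-based scoring can be shown in a few lines. This is a simplified sketch rather than the paper's evaluation code: the verb-to-cluster mapping, the cluster names, and the function names are hypothetical, and each verb is assumed to belong to a single sense cluster.

```python
def exact_match(pred, gold_verb):
    """Standard evaluation: correct only if the verb equals the single gold answer."""
    return pred == gold_verb


def cluster_match(pred, gold_clusters, verb_to_cluster):
    """Cluster-based evaluation: correct if the predicted verb falls in any
    sense cluster considered valid for the image."""
    return verb_to_cluster.get(pred) in gold_clusters


def accuracy(flags):
    """Fraction of predictions marked correct."""
    return sum(flags) / len(flags)


# Hypothetical sense clusters built from the examples in the abstract.
verb_to_cluster = {
    "brushing": "groom", "grooming": "groom",
    "piloting": "operate", "operating": "operate",
    "eating": "eat",
}

# Each example: (predicted verb, gold verb, valid clusters for the image).
examples = [
    ("grooming", "brushing", {"groom"}),
    ("operating", "piloting", {"operate"}),
    ("eating", "brushing", {"groom"}),
]

exact_acc = accuracy([exact_match(p, g) for p, g, _ in examples])
cluster_acc = accuracy([cluster_match(p, c, verb_to_cluster) for p, _, c in examples])
```

Here the two synonymous predictions are penalized under exact match but accepted under cluster matching, while the unrelated prediction fails under both.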

Song Wang,Yishu Wei,Haotian Ma,Max Lovitt,Kelly Deng,Yuan Meng,Zihan Xu,Jingze Zhang,Yunyu Xiao,Ying Ding,Xuhai Xu,Joydeep Ghosh,Yifan Peng

Main category: cs.CL

TL;DR: This paper proposes a multi-stage large language model framework to enhance the extraction of suicide-related social determinants of health (SDoH) factors from unstructured text, achieving better accuracy, explainability, and efficiency compared to existing models.

Details Motivation: Understanding social determinants of health (SDoH) factors contributing to suicide incidents is crucial for early intervention and prevention. However, data-driven approaches face challenges such as long-tailed factor distributions, analyzing pivotal stressors preceding suicide incidents, and limited model explainability. Method: A multi-stage large language model framework was proposed to enhance SDoH factor extraction from unstructured text. The approach was compared with state-of-the-art models like BioBERT, GPT-3.5-turbo, and reasoning models like DeepSeek-R1. The analysis included automated comparisons and a pilot user study. Result: The proposed framework demonstrated performance improvements in extracting SDoH factors and retrieving relevant context. Fine-tuning a smaller, task-specific model achieved comparable or better performance with reduced inference costs. Additionally, the multi-stage design improved model explainability by providing intermediate explanations. Conclusion: The approach improves both the accuracy and transparency of extracting suicide-related SDoH from unstructured texts, supporting early identification of at-risk individuals and informing effective prevention strategies. Abstract: Background: Understanding social determinants of health (SDoH) factors contributing to suicide incidents is crucial for early intervention and prevention. However, data-driven approaches to this goal face challenges such as long-tailed factor distributions, analyzing pivotal stressors preceding suicide incidents, and limited model explainability. Methods: We present a multi-stage large language model framework to enhance SDoH factor extraction from unstructured text. Our approach was compared to other state-of-the-art language models (i.e., pre-trained BioBERT and GPT-3.5-turbo) and reasoning models (i.e., DeepSeek-R1). We also evaluated how the model's explanations help people annotate SDoH factors more quickly and accurately. 
The analysis included both automated comparisons and a pilot user study. Results: We show that our proposed framework demonstrated performance boosts in the overarching task of extracting SDoH factors and in the finer-grained tasks of retrieving relevant context. Additionally, we show that fine-tuning a smaller, task-specific model achieves comparable or better performance with reduced inference costs. The multi-stage design not only enhances extraction but also provides intermediate explanations, improving model explainability. Conclusions: Our approach improves both the accuracy and transparency of extracting suicide-related SDoH from unstructured texts. These advancements have the potential to support early identification of individuals at risk and inform more effective prevention strategies.

[9] Dialogues Aspect-based Sentiment Quadruple Extraction via Structural Entropy Minimization Partitioning

Kun Peng,Cong Cao,Hao Peng,Zhifeng Hao,Lei Jiang,Kongjing Gu,Yanbing Liu,Philip S. Yu

Main category: cs.CL

TL;DR: This paper proposes an efficient method for dialogue aspect-based sentiment quadruple extraction that partitions dialogues via structural entropy minimization and extracts quadruples with a two-step framework, achieving strong performance at low computational cost.

Details Motivation: Existing methods learn word relations across the entire dialogue, but dialogues usually contain multiple semantically independent sub-dialogues, which adds noise to the extraction process. A more effective approach is therefore needed. Method: A structural entropy minimization algorithm partitions the dialogue into semantically independent sub-dialogues, and a two-step framework extracts quadruples: sentiment elements are first extracted at the utterance level, then quadruples are matched at the sub-dialogue level. Result: Experiments show that the method achieves state-of-the-art performance on dialogue aspect-based sentiment quadruple extraction with significantly lower computational cost. Conclusion: The paper proposes a new dialogue partitioning method based on structural entropy minimization, combined with a two-step framework for quadruple extraction, to improve performance on dialogue aspect-based sentiment quadruple extraction. Abstract: Dialogues Aspect-based Sentiment Quadruple Extraction (DiaASQ) aims to extract all target-aspect-opinion-sentiment quadruples from a given multi-round, multi-participant dialogue. Existing methods typically learn word relations across entire dialogues, assuming a uniform distribution of sentiment elements. However, we find that dialogues often contain multiple semantically independent sub-dialogues without clear dependencies between them. Therefore, learning word relationships across the entire dialogue inevitably introduces additional noise into the extraction process. To address this, our method focuses on partitioning dialogues into semantically independent sub-dialogues. Achieving completeness while minimizing these sub-dialogues presents a significant challenge. Simply partitioning based on reply relationships is ineffective. Instead, we propose utilizing a structural entropy minimization algorithm to partition the dialogues. This approach aims to preserve relevant utterances while distinguishing irrelevant ones as much as possible. Furthermore, we introduce a two-step framework for quadruple extraction: first extracting individual sentiment elements at the utterance level, then matching quadruples at the sub-dialogue level. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in DiaASQ with much lower computational costs.

[10] Evaluation of LLMs in AMR Parsing

Shu Han Ho

Main category: cs.CL

TL;DR: Fine-tuning decoder-only LLMs like LLaMA 3.2 can match the performance of advanced AMR parsers, offering a simpler and effective approach.

Details Motivation: The motivation is to explore whether fine-tuning decoder-only large language models can offer a simpler yet effective alternative to complex AMR parsing methods. Method: The authors conducted a comprehensive evaluation by fine-tuning four different LLM architectures (Phi 3.5, Gemma 2, LLaMA 3.2, and DeepSeek R1 LLaMA Distilled) and evaluating them on the LDC2020T02 Gold AMR3.0 test set. Result: Fine-tuned LLaMA 3.2 achieved an SMATCH F1 score of 0.804, comparable to APT + Silver (IBM) at 0.804 and close to Graphene Smatch (MBSE) at 0.854. LLaMA 3.2 excelled in semantic performance, while Phi 3.5 performed best in structural validity. Conclusion: The study concludes that straightforward fine-tuning of decoder-only LLMs can achieve competitive performance compared to complex state-of-the-art AMR parsers, with LLaMA 3.2 showing particularly strong results. Abstract: Abstract Meaning Representation (AMR) is a semantic formalism that encodes sentence meaning as rooted, directed, acyclic graphs, where nodes represent concepts and edges denote semantic relations. Finetuning decoder-only Large Language Models (LLMs) represents a promising, novel, and straightforward direction for AMR parsing. This paper presents a comprehensive evaluation of finetuning four distinct LLM architectures, Phi 3.5, Gemma 2, LLaMA 3.2, and DeepSeek R1 LLaMA Distilled, using the LDC2020T02 Gold AMR3.0 test set. Our results show that straightforward finetuning of decoder-only LLMs can achieve comparable performance to complex State of the Art (SOTA) AMR parsers. Notably, LLaMA 3.2 demonstrates competitive performance against SOTA AMR parsers given a straightforward finetuning approach. We achieved SMATCH F1: 0.804 on the full LDC2020T02 test split, on par with APT + Silver (IBM) at 0.804 and approaching Graphene Smatch (MBSE) at 0.854. Across our analysis, we also observed a consistent pattern where LLaMA 3.2 leads in semantic performance while Phi 3.5 excels in structural validity.

[11] Align, Don't Divide: Revisiting the LoRA Architecture in Multi-Task Learning

Jinda Liu,Bo Cheng,Yi Chang,Yuan Wu

Main category: cs.CL

TL;DR: Align-LoRA offers a simpler and more effective approach to multi-task learning by aligning tasks within a shared adapter space.

Details Motivation: Existing multi-task learning methods rely on complex multi-adapter or multi-head architectures; the authors question this multi-component paradigm and propose a new hypothesis centered on learning shared representations. Method: Align-LoRA uses a single adapter and introduces a task-representation alignment loss to learn robust shared representations. Result: Experiments show that Align-LoRA performs strongly in multi-task learning, significantly outperforming other LoRA-based multi-task methods. Conclusion: Align-LoRA establishes a simpler yet effective multi-task learning paradigm; with a task-representation alignment loss in the shared adapter space, it significantly outperforms existing baselines. Abstract: Parameter-Efficient Fine-Tuning (PEFT) is essential for adapting Large Language Models (LLMs). In practice, LLMs are often required to handle a diverse set of tasks from multiple domains, a scenario naturally addressed by multi-task learning (MTL). Within this MTL context, a prevailing trend involves LoRA variants with multiple adapters or heads, which advocate for structural diversity to capture task-specific knowledge. Our findings present a direct challenge to this paradigm. We first show that a simplified multi-head architecture with high inter-head similarity substantially outperforms complex multi-adapter and multi-head systems. This leads us to question the multi-component paradigm itself, and we further demonstrate that a standard single-adapter LoRA, with a sufficiently increased rank, also achieves highly competitive performance. These results lead us to a new hypothesis: effective MTL generalization hinges on learning robust shared representations, not isolating task-specific features. To validate this, we propose Align-LoRA, which incorporates an explicit loss to align task representations within the shared adapter space. Experiments confirm that Align-LoRA significantly surpasses all baselines, establishing a simpler yet more effective paradigm for adapting LLMs to multiple tasks. The code is available at https://github.com/jinda-liu/Align-LoRA.

[12] Multimodal Fact Checking with Unified Visual, Textual, and Contextual Representations

Aditya Kishore,Gaurav Kumar,Jasabanta Patro

Main category: cs.CL

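TL;DR: MultiCheck is a unified framework for fine-grained multimodal fact verification that combines dedicated text and image encoders with a fusion module to reason over structured textual and visual signals and predict the veracity of claims.

Details Motivation: The growing rate of multimodal misinformation, where claims are supported by both text and images, poses significant challenges to fact-checking systems that rely primarily on textual evidence, so a unified framework that handles multimodal information is needed. Method: The proposed unified framework, MultiCheck, combines dedicated encoders for text and images with a fusion module that captures cross-modal relationships using element-wise interactions. A classification head predicts the veracity of a claim, and a contrastive learning objective encourages semantic alignment between claim-evidence pairs in a shared latent space. Result: Evaluated on the Factify 2 dataset, the method achieves a weighted F1 score of 0.84, clearly outperforming the baseline. Conclusion: The results highlight the effectiveness of explicit multimodal reasoning and show the method's potential for scalable and interpretable fact-checking in complex, real-world scenarios. Abstract: The growing rate of multimodal misinformation, where claims are supported by both text and images, poses significant challenges to fact-checking systems that rely primarily on textual evidence. In this work, we propose a unified framework for fine-grained multimodal fact verification called "MultiCheck", designed to reason over structured textual and visual signals. Our architecture combines dedicated encoders for text and images with a fusion module that captures cross-modal relationships using element-wise interactions. A classification head then predicts the veracity of a claim, supported by a contrastive learning objective that encourages semantic alignment between claim-evidence pairs in a shared latent space. We evaluate our approach on the Factify 2 dataset, achieving a weighted F1 score of 0.84, substantially outperforming the baseline. These results highlight the effectiveness of explicit multimodal reasoning and demonstrate the potential of our approach for scalable and interpretable fact-checking in complex, real-world scenarios.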

[13] BEE-RAG: Balanced Entropy Engineering for Retrieval-Augmented Generation

Yuhao Wang,Ruiyang Ren,Yucheng Wang,Jing Liu,Wayne Xin Zhao,Hua Wu,Haifeng Wang

Main category: cs.CL

TL;DR: This paper introduces BEE-RAG, an entropy-engineered framework that improves the performance and adaptability of retrieval-augmented generation systems by stabilizing attention dynamics and managing entropy in long contexts.

Details Motivation: RAG systems often suffer from performance degradation due to long retrieval contexts, which cause unconstrained entropy growth and attention dilution. This work aims to address these limitations by engineering entropy for better RAG adaptability. Method: The paper proposes the balanced entropy-engineered RAG (BEE-RAG) framework, incorporating entropy invariance to stabilize attention dynamics. It introduces a zero-shot inference strategy for multi-importance estimation and a parameter-efficient adaptive fine-tuning mechanism to optimize balancing factors. Result: Extensive experiments demonstrate that BEE-RAG effectively improves RAG performance across various tasks, ensuring stable entropy levels and reducing the negative impact of long retrieval contexts. Conclusion: The proposed BEE-RAG framework enhances the adaptability of RAG systems to varying context lengths by leveraging balanced context entropy and reformulating attention dynamics, leading to improved performance across multiple RAG tasks. Abstract: With the rapid advancement of large language models (LLMs), retrieval-augmented generation (RAG) has emerged as a critical approach to supplement the inherent knowledge limitations of LLMs. However, due to the typically large volume of retrieved information, RAG tends to operate with long context lengths. From the perspective of entropy engineering, we identify unconstrained entropy growth and attention dilution due to long retrieval context as significant factors affecting RAG performance. In this paper, we propose the balanced entropy-engineered RAG (BEE-RAG) framework, which improves the adaptability of RAG systems to varying context lengths through the principle of entropy invariance. By leveraging balanced context entropy to reformulate attention dynamics, BEE-RAG separates attention sensitivity from context length, ensuring a stable entropy level. 
Building upon this, we introduce a zero-shot inference strategy for multi-importance estimation and a parameter-efficient adaptive fine-tuning mechanism to obtain the optimal balancing factor for different settings. Extensive experiments across multiple RAG tasks demonstrate the effectiveness of BEE-RAG.

[14] Attention Basin: Why Contextual Position Matters in Large Language Models

Zihao Yi,Delong Zeng,Zhenqing Ling,Haohao Luo,Zhe Xu,Wei Liu,Jian Luan,Wanxia Cao,Ying Shen

Main category: cs.CL

TL;DR: This paper proposes AttnRank, a two-stage framework that estimates a model's intrinsic positional attention preferences and reorders content accordingly, improving large language model performance without modifying model parameters or training procedures.

Details Motivation: To investigate the mechanism behind the significant effect that the contextual position of information in the input has on large language model performance. Method: A two-stage framework named AttnRank is introduced: it estimates a model's intrinsic positional attention preferences using a small calibration set, then reorders retrieved documents or few-shot examples so that the most salient content aligns with high-attention positions. Result: Experiments show that AttnRank achieves substantial improvements across 10 large language models of varying architectures and scales, without modifying model parameters or training procedures. Conclusion: AttnRank effectively improves large language model performance on multi-hop QA and few-shot in-context learning tasks, and it is a training-free, plug-and-play method. Abstract: The performance of Large Language Models (LLMs) is significantly sensitive to the contextual position of information in the input. To investigate the mechanism behind this positional bias, our extensive experiments reveal a consistent phenomenon we term the attention basin: when presented with a sequence of structured items (e.g., retrieved documents or few-shot examples), models systematically assign higher attention to the items at the beginning and end of the sequence, while neglecting those in the middle. Crucially, our analysis further reveals that allocating higher attention to critical information is key to enhancing model performance. Based on these insights, we introduce Attention-Driven Reranking (AttnRank), a two-stage framework that (i) estimates a model's intrinsic positional attention preferences using a small calibration set, and (ii) reorders retrieved documents or few-shot examples to align the most salient content with these high-attention positions. AttnRank is a model-agnostic, training-free, and plug-and-play method with minimal computational overhead. Experiments on multi-hop QA and few-shot in-context learning tasks demonstrate that AttnRank achieves substantial improvements across 10 large language models of varying architectures and scales, without modifying model parameters or training procedures.
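The reordering stage described in this entry can be sketched as follows. This is an illustration only, not the authors' code: it assumes the per-position attention preferences have already been estimated from a calibration set, and the function name `attn_rerank` and all numbers are hypothetical.

```python
def attn_rerank(items, relevance, position_attention):
    """Place the most relevant items at the positions the model attends to most.

    items: documents or few-shot examples.
    relevance: one salience score per item (higher = more important).
    position_attention: one estimated attention weight per slot in the prompt.
    """
    n = len(items)
    # Item indices from most to least relevant.
    by_relevance = sorted(range(n), key=lambda i: relevance[i], reverse=True)
    # Slot indices from most to least attended.
    by_attention = sorted(range(n), key=lambda p: position_attention[p], reverse=True)
    ordered = [None] * n
    for item_idx, pos in zip(by_relevance, by_attention):
        ordered[pos] = items[item_idx]
    return ordered


# Toy "attention basin": high weights at both ends, low in the middle.
docs = ["d1", "d2", "d3", "d4", "d5"]
relevance = [0.9, 0.1, 0.5, 0.3, 0.7]
position_attention = [0.30, 0.12, 0.05, 0.18, 0.35]
reordered = attn_rerank(docs, relevance, position_attention)
```

In this toy case the two most relevant documents end up at the first and last positions, and the least relevant one is pushed into the low-attention middle.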

[15] Towards Assessing Medical Ethics from Knowledge to Practice

Chang Hong,Minghao Wu,Qingying Xiao,Yuchi Wang,Xiang Wan,Guangjun Yu,Benyou Wang,Yan Hu

Main category: cs.CL

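TL;DR: This study introduces PrinciplismQA, a comprehensive benchmark grounded in the principles of medical ethics, for evaluating large language models on medical ethics. Models show marked shortcomings when applying ethical principles in practice, especially in dilemmas involving the principle of Beneficence.

Details Motivation: Integrating large language models into healthcare requires a rigorous evaluation of their ethical reasoning, an area that current benchmarks often overlook. Method: A high-quality dataset of 3,648 questions is built on Principlism, comprising multiple-choice questions curated from authoritative textbooks and open-ended questions sourced from medical ethics case study literature, all validated by medical experts. Result: Experiments reveal a significant gap between models' ethical knowledge and its practical application, especially when dynamically applying ethical principles to real-world scenarios. Most LLMs struggle with dilemmas involving Beneficence and over-emphasize other principles. Frontier closed-source models, with their strong general capabilities, currently lead, and medical-domain fine-tuning can improve a model's overall ethical competence. Conclusion: PrinciplismQA offers a scalable framework for diagnosing specific ethical weaknesses, paving the way for more balanced and responsible medical AI. Abstract: The integration of large language models into healthcare necessitates a rigorous evaluation of their ethical reasoning, an area current benchmarks often overlook. We introduce PrinciplismQA, a comprehensive benchmark with 3,648 questions designed to systematically assess LLMs' alignment with core medical ethics. Grounded in Principlism, our benchmark features a high-quality dataset. This includes multiple-choice questions curated from authoritative textbooks and open-ended questions sourced from authoritative medical ethics case study literature, all validated by medical experts. Our experiments reveal a significant gap between models' ethical knowledge and their practical application, especially in dynamically applying ethical principles to real-world scenarios. Most LLMs struggle with dilemmas concerning Beneficence, often over-emphasizing other principles. Frontier closed-source models, driven by strong general capabilities, currently lead the benchmark. Notably, medical domain fine-tuning can enhance models' overall ethical competence, but further progress requires better alignment with medical ethical knowledge. PrinciplismQA offers a scalable framework to diagnose these specific ethical weaknesses, paving the way for more balanced and responsible medical AI.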

[16] ATLANTIS at SemEval-2025 Task 3: Detecting Hallucinated Text Spans in Question Answering

Catherine Kobus,François Lancelot,Marion-Cécile Martin,Nawal Ould Amer

Main category: cs.CL

TL;DR: This paper describes the ATLANTIS team's contributions to SemEval-2025 Task 3, which targets detecting hallucinated text spans in question answering systems.

Details Motivation: Large Language Models (LLMs) have significantly advanced natural language generation but remain prone to hallucinations, generating incorrect or misleading content. Method: Approaches with and without external context are explored, using few-shot prompting, token-level classification, or an LLM fine-tuned on synthetic data. Result: The approaches achieved top rankings in Spanish and competitive placements in English and German. Conclusion: The work highlights the importance of integrating relevant context to mitigate hallucinations and shows the potential of fine-tuned models and prompt engineering. Abstract: This paper presents the contributions of the ATLANTIS team to SemEval-2025 Task 3, focusing on detecting hallucinated text spans in question answering systems. Large Language Models (LLMs) have significantly advanced Natural Language Generation (NLG) but remain susceptible to hallucinations, generating incorrect or misleading content. To address this, we explored methods both with and without external context, utilizing few-shot prompting with an LLM, token-level classification, or an LLM fine-tuned on synthetic data. Notably, our approaches achieved top rankings in Spanish and competitive placements in English and German. This work highlights the importance of integrating relevant context to mitigate hallucinations and demonstrates the potential of fine-tuned models and prompt engineering.

[17] Resource-Limited Joint Multimodal Sentiment Reasoning and Classification via Chain-of-Thought Enhancement and Distillation

Haonan Shangguan,Xiaocui Yang,Shi Feng,Daling Wang,Yifei Zhang,Ge Yu

Main category: cs.CL

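TL;DR: This paper introduces MulCoT-RD, a new lightweight model for multimodal sentiment analysis in resource-constrained settings, which uses a "Teacher-Assistant-Student" distillation paradigm to achieve efficient sentiment reasoning generation and classification.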

Details Motivation: Current approaches rely mainly on the capabilities of parameter-heavy (multimodal) large language models for sentiment classification and overlook autonomous multimodal sentiment reasoning generation in resource-constrained environments. Method: The authors propose MulCoT-RD, a multimodal chain-of-thought reasoning distillation model that adopts a "Teacher-Assistant-Student" distillation paradigm. Result: With only 3B parameters, MulCoT-RD achieves strong performance on JMSRC while showing robust generalization and enhanced interpretability. Conclusion: MulCoT-RD is a lightweight model suited to resource-constrained environments; it performs well on the JMSRC task, with good generalization and enhanced interpretability. Abstract: The surge in rich multimodal content on social media platforms has greatly advanced Multimodal Sentiment Analysis (MSA), with Large Language Models (LLMs) further accelerating progress in this field. Current approaches primarily leverage the knowledge and reasoning capabilities of parameter-heavy (Multimodal) LLMs for sentiment classification, overlooking autonomous multimodal sentiment reasoning generation in resource-constrained environments. Therefore, we focus on the Resource-Limited Joint Multimodal Sentiment Reasoning and Classification task, JMSRC, which simultaneously performs multimodal sentiment reasoning chain generation and sentiment classification only with a lightweight model. We propose a Multimodal Chain-of-Thought Reasoning Distillation model, MulCoT-RD, designed for JMSRC that employs a "Teacher-Assistant-Student" distillation paradigm to address deployment constraints in resource-limited environments. We first leverage a high-performance Multimodal Large Language Model (MLLM) to generate the initial reasoning dataset and train a medium-sized assistant model with a multi-task learning mechanism. A lightweight student model is jointly trained to perform efficient multimodal sentiment reasoning generation and classification. Extensive experiments on four datasets demonstrate that MulCoT-RD with only 3B parameters achieves strong performance on JMSRC, while exhibiting robust generalization and enhanced interpretability.

[18] Pruning Large Language Models by Identifying and Preserving Functional Networks

Yiheng Liu,Junhao Ning,Sichen Xia,Xiaohui Gao,Ning Qiang,Bao Ge,Junwei Han,Xintao Hu

Main category: cs.CL

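TL;DR: This paper proposes an LLM pruning method based on identifying functional networks and preserving their key neurons, addressing the tendency of conventional structured pruning to ignore interaction and collaboration among neurons.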

Details Motivation: Current structured pruning methods usually overlook the interaction and collaboration among neurons that are crucial to LLM functionality, which disrupts the model's macro functional architecture and degrades pruning performance. Method: The authors treat an LLM as a digital brain, decompose it into functional networks, and prune the model while preserving the key neurons within those networks. Result: The paper proposes an LLM pruning method based on functional network identification and key neuron preservation, addressing conventional methods' neglect of neuron interaction and collaboration. Conclusion: Experimental results show that the method can successfully identify and locate functional networks and key neurons in LLMs, enabling efficient model pruning. Abstract: Structured pruning is one of the representative techniques for compressing large language models (LLMs) to reduce GPU memory consumption and accelerate inference speed. It offers significant practical value in improving the efficiency of LLMs in real-world applications. Current structured pruning methods typically rely on assessment of the importance of the structure units and pruning the units with less importance. Most of them overlook the interaction and collaboration among artificial neurons that are crucial for the functionalities of LLMs, leading to a disruption in the macro functional architecture of LLMs and consequently a pruning performance degradation. Inspired by the inherent similarities between artificial neural networks and functional neural networks in the human brain, we alleviate this challenge and propose to prune LLMs by identifying and preserving functional networks within LLMs in this study. To achieve this, we treat an LLM as a digital brain and decompose the LLM into functional networks, analogous to identifying functional brain networks in neuroimaging data. Afterwards, an LLM is pruned by preserving the key neurons within these functional networks. Experimental results demonstrate that the proposed method can successfully identify and locate functional networks and key neurons in LLMs, enabling efficient model pruning. Our code is available at https://github.com/WhatAboutMyStar/LLM_ACTIVATION.

[19] CodeBoost: Boosting Code LLMs by Squeezing Knowledge from Code Snippets with RL

Sijie Wang,Quanjiang Guo,Kai Zhao,Yawei Zhang,Xin Li,Xiang Li,Siqi Li,Rui She,Shangshu Yu,Wee Peng Tay

Main category: cs.CL

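TL;DR: CodeBoost is a post-training framework for code large language models that requires no human-annotated instructions and improves model performance through several novel components.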

Details Motivation: Existing code large language models are usually post-trained on human-annotated instructions, but collecting high-quality instructions is labor-intensive and hard to scale. Code snippets, by contrast, are widely available, which motivates a post-training method that does not depend on human annotation. Method: The CodeBoost framework introduces five key components to enhance code LLMs: maximum-clique curation, bi-directional prediction, error-aware prediction, heterogeneous augmentation, and heterogeneous rewarding. Result: Experiments show that CodeBoost consistently improves performance across several code LLMs and benchmarks, demonstrating its effectiveness. Conclusion: CodeBoost is an effective post-training framework for code LLMs that uses only code snippets rather than human-annotated instructions, improving model performance while remaining scalable. Abstract: Code large language models (LLMs) have become indispensable tools for building efficient and automated coding pipelines. Existing models are typically post-trained using reinforcement learning (RL) from general-purpose LLMs using "human instruction-final answer" pairs, where the instructions are usually from manual annotations. However, collecting high-quality coding instructions is both labor-intensive and difficult to scale. On the other hand, code snippets are abundantly available from various sources. This imbalance presents a major bottleneck in instruction-based post-training. We propose CodeBoost, a post-training framework that enhances code LLMs purely from code snippets, without relying on human-annotated instructions. CodeBoost introduces the following key components: (1) maximum-clique curation, which selects a representative and diverse training corpus from code; (2) bi-directional prediction, which enables the model to learn from both forward and backward prediction objectives; (3) error-aware prediction, which incorporates learning signals from both correct and incorrect outputs; (4) heterogeneous augmentation, which diversifies the training distribution to enrich code semantics; and (5) heterogeneous rewarding, which guides model learning through multiple reward types including format correctness and execution feedback from both successes and failures. Extensive experiments across several code LLMs and benchmarks verify that CodeBoost consistently improves performance, demonstrating its effectiveness as a scalable and effective training pipeline.

[20] ASCoT: An Adaptive Self-Correction Chain-of-Thought Method for Late-Stage Fragility in LLMs

Dongxu Zhang,Ning Yang,Jihua Zhu,Jinnan Yang,Miao Xin,Baoliang Tian

Main category: cs.CL

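TL;DR: The paper challenges the assumption that early errors affect a reasoning chain the most, finds that late-stage errors are more damaging, and proposes ASCoT, which improves LLM reasoning reliability through adaptive verification and targeted correction.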

Details Motivation: Although CoT prompting has significantly improved LLM reasoning, the reliability of reasoning chains remains a key challenge; in particular, the paper questions the assumption that early errors matter most. Method: Based on systematic error-injection experiments, the paper proposes ASCoT, which comprises an Adaptive Verification Manager (AVM) and a Multi-Perspective Self-Correction Engine (MSCE) and uses a Positional Impact Score function I(k) to identify high-risk steps for targeted correction. Result: Experiments show that ASCoT achieves outstanding accuracy on benchmarks such as GSM8K and MATH, confirm that late-stage errors are more damaging than early ones, and show that the method mitigates this problem. Conclusion: ASCoT effectively addresses wrong answers caused by late-stage errors in LLM reasoning chains, outperforms baselines such as standard CoT, and underscores the importance of diagnosing specific failure modes. Abstract: Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Large Language Models (LLMs), yet the reliability of these reasoning chains remains a critical challenge. A widely held "cascading failure" hypothesis suggests that errors are most detrimental when they occur early in the reasoning process. This paper challenges that assumption through systematic error-injection experiments, revealing a counter-intuitive phenomenon we term "Late-Stage Fragility": errors introduced in the later stages of a CoT chain are significantly more likely to corrupt the final answer than identical errors made at the beginning. To address this specific vulnerability, we introduce the Adaptive Self-Correction Chain-of-Thought (ASCoT) method. ASCoT employs a modular pipeline in which an Adaptive Verification Manager (AVM) operates first, followed by the Multi-Perspective Self-Correction Engine (MSCE). The AVM leverages a Positional Impact Score function I(k) that assigns different weights based on the position within the reasoning chains, addressing the Late-Stage Fragility issue by identifying and prioritizing high-risk, late-stage steps. Once these critical steps are identified, the MSCE applies robust, dual-path correction specifically to the failure parts. Extensive experiments on benchmarks such as GSM8K and MATH demonstrate that ASCoT achieves outstanding accuracy, outperforming strong baselines, including standard CoT. Our work underscores the importance of diagnosing specific failure modes in LLM reasoning and advocates for a shift from uniform verification strategies to adaptive, vulnerability-aware correction mechanisms.
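As a rough illustration of position-weighted verification, the sketch below scores each step of a reasoning chain by its position and flags the late, high-scoring steps for checking. The abstract does not give the form of I(k); the power-law form, the `alpha` exponent, and the `threshold` cutoff used here are assumptions made purely for illustration, not the authors' definition.

```python
def positional_impact(k: int, n: int, alpha: float = 2.0) -> float:
    """Hypothetical stand-in for I(k): grows with the 1-indexed position k
    in an n-step chain, so later steps receive larger weights."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return (k / n) ** alpha


def steps_to_verify(n: int, threshold: float = 0.5, alpha: float = 2.0) -> list:
    """Return the step indices whose (assumed) impact score reaches the
    threshold; these are the steps a verifier would prioritize."""
    return [k for k in range(1, n + 1) if positional_impact(k, n, alpha) >= threshold]


# In a 5-step chain with the defaults, only the last two steps are flagged.
flagged = steps_to_verify(5)
```

Any monotonically increasing score would express the same idea; the point is only that verification effort is concentrated on late steps rather than spread uniformly.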

[21] Decision-Making with Deliberation: Meta-reviewing as a Document-grounded Dialogue

Sukannya Purkayastha,Nils Dycke,Anne Lauscher,Iryna Gurevych

Main category: cs.CL

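TL;DR: This study frames meta-reviewing as a decision-making process, generates high-quality synthetic data to train tailored dialogue agents, and improves the efficiency of meta-reviewing.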

Details Motivation: Meta-reviewing is the final step in deciding whether a paper is accepted, yet prior work has mostly treated it as a summarization problem and overlooked its nature as a decision-making process. The study aims to help decision-makers meta-review more effectively with dialogue agents. Method: The authors generate synthetic data with large language models using a self-refinement strategy and use this data to train dialogue agents specialized for meta-reviewing. Result: Experiments show that the method produces high-quality synthetic data and yields dialogue agents that perform well on meta-reviewing, with their effectiveness confirmed in real-world use. Conclusion: The study shows that the generated synthetic data is effective and that the tailored dialogue agents outperform off-the-shelf LLM-based assistants in real-world meta-reviewing, improving its efficiency. Abstract: Meta-reviewing is a pivotal stage in the peer-review process, serving as the final step in determining whether a paper is recommended for acceptance. Prior research on meta-reviewing has treated this as a summarization problem over review reports. However, complementary to this perspective, meta-reviewing is a decision-making process that requires weighing reviewer arguments and placing them within a broader context. Prior research has demonstrated that decision-makers can be effectively assisted in such scenarios via dialogue agents. In line with this framing, we explore the practical challenges for realizing dialog agents that can effectively assist meta-reviewers. Concretely, we first address the issue of data scarcity for training dialogue agents by generating synthetic data using Large Language Models (LLMs) based on a self-refinement strategy to improve the relevance of these dialogues to expert domains. Our experiments demonstrate that this method produces higher-quality synthetic data and can serve as a valuable resource towards training meta-reviewing assistants. Subsequently, we utilize this data to train dialogue agents tailored for meta-reviewing and find that these agents outperform off-the-shelf LLM-based assistants for this task. Finally, we apply our agents in real-world meta-reviewing scenarios and confirm their effectiveness in enhancing the efficiency of meta-reviewing. Code and Data: https://github.com/UKPLab/arxiv2025-meta-review-as-dialog

[22] SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens

Nikita Dragunov,Temurbek Rahmatullaev,Elizaveta Goncharova,Andrey Kuznetsov,Anton Razzhigaev

Main category: cs.CL

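TL;DR: SONAR-LLM is a decoder-only transformer that operates in the continuous SONAR embedding space and is trained with token-level cross-entropy supervision; it retains LCM's semantic abstraction while eliminating its diffusion sampler and restoring a likelihood-based training signal.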

Details Motivation: To address the limitations of the diffusion sampler in the Large Concept Model (LCM) and to restore a more effective likelihood-based training signal. Method: The authors propose SONAR-LLM, a decoder-only transformer that "thinks" in the continuous SONAR embedding space and is supervised with token-level cross-entropy propagated through the frozen SONAR decoder. Result: Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. Conclusion: With its hybrid objective, SONAR-LLM combines LCM's semantic abstraction with a more effective training signal, showing good generation quality and scalability. Abstract: The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings and training with either mean-squared error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer that "thinks" in the same continuous SONAR embedding space, yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, benchmark results, and release the complete training code and all pretrained checkpoints to foster reproducibility and future research.

[23] Efficient Reasoning for Large Reasoning Language Models via Certainty-Guided Reflection Suppression

Jiameng Huang,Baijiong Lin,Guhao Feng,Jierun Chen,Di He,Lu Hou

Main category: cs.CL

TL;DR: Certainty-Guided Reflection Suppression (CGRS) dynamically suppresses reflection trigger words, reducing redundant steps and token usage while maintaining reasoning accuracy.

Details Motivation: Reflection behaviors improve model performance but lead to excessive reasoning steps, which raise costs and reduce practical utility. Method: Certainty-Guided Reflection Suppression (CGRS) dynamically suppresses the generation of reflection trigger words when the model is highly confident, reducing redundant reasoning. Result: CGRS reduces token usage by an average of 18.5% to 41.9% while maintaining accuracy across multiple benchmarks and models. Conclusion: CGRS effectively reduces redundant reasoning steps in LRLMs while preserving reasoning accuracy, making it practically useful. Abstract: Recent Large Reasoning Language Models (LRLMs) employ long chain-of-thought reasoning with complex reflection behaviors, typically signaled by specific trigger words (e.g., "Wait" and "Alternatively") to enhance performance. However, these reflection behaviors can lead to the overthinking problem, where the generation of redundant reasoning steps unnecessarily increases token usage, raises inference costs, and reduces practical utility. In this paper, we propose Certainty-Guided Reflection Suppression (CGRS), a novel method that mitigates overthinking in LRLMs while maintaining reasoning accuracy. CGRS operates by dynamically suppressing the model's generation of reflection triggers when it exhibits high confidence in its current response, thereby preventing redundant reflection cycles without compromising output quality. Our approach is model-agnostic, requires no retraining or architectural modifications, and can be integrated seamlessly with existing autoregressive generation pipelines. Extensive experiments across four reasoning benchmarks (i.e., AIME24, AMC23, MATH500, and GPQA-D) demonstrate CGRS's effectiveness: it reduces token usage by an average of 18.5% to 41.9% while preserving accuracy. It also achieves the optimal balance between length reduction and performance compared to state-of-the-art baselines. These results hold consistently across model architectures (e.g., DeepSeek-R1-Distill series, QwQ-32B, and Qwen3 family) and scales (4B to 32B parameters), highlighting CGRS's practical value for efficient reasoning.
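The suppression step can be pictured as a decoding-time filter over next-token scores. The sketch below is a toy version under stated assumptions: the abstract does not say how certainty is estimated or what cutoff is used, so the `certainty` input, the `threshold` value, and the plain dictionary of logits are all hypothetical stand-ins rather than the paper's implementation. Only the two trigger words come from the abstract.

```python
import math

# Trigger words named in the abstract; a real system would map them to token ids.
REFLECTION_TRIGGERS = {"Wait", "Alternatively"}


def suppress_reflection(logits: dict, certainty: float, threshold: float = 0.9) -> dict:
    """Return a copy of the next-token logits in which reflection triggers are
    masked out, but only when the (externally estimated) certainty is high."""
    if certainty < threshold:
        return dict(logits)  # low certainty: leave reflection available
    return {
        token: (-math.inf if token in REFLECTION_TRIGGERS else score)
        for token, score in logits.items()
    }


def greedy_pick(logits: dict) -> str:
    """Pick the highest-scoring token (greedy decoding)."""
    return max(logits, key=logits.get)


toy_logits = {"Wait": 2.0, "So": 1.5, "Alternatively": 1.0}
confident_choice = greedy_pick(suppress_reflection(toy_logits, certainty=0.95))
uncertain_choice = greedy_pick(suppress_reflection(toy_logits, certainty=0.50))
```

Because the filter only touches scores at decoding time, it is consistent with the abstract's claim that no retraining or architectural change is needed.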

[24] Evaluation of a Sign Language Avatar on Comprehensibility, User Experience & Acceptability

Fenya Wasserroth,Eleftherios Avramidis,Vera Czehmann,Tanja Kojic,Fabrizio Nunnari,Sebastian Möller

Main category: cs.CL

TL;DR: This study investigates the impact of adding adjustment features to a sign language avatar on Microsoft Hololens 2. It finds that personalization alone is insufficient, and that sign language avatars must be comprehensible by default. Key recommendations include enhancing facial animation and improving interaction interfaces.

Details Motivation: The motivation behind this study is to understand the impact of adding adjustment features to a sign language avatar and to identify key factors influencing comprehensibility, user experience, and acceptability. Method: The investigation involved analyzing interactions of expert German Sign Language (DGS) users with both adjustable and non-adjustable avatars on a Microsoft Hololens 2 device. The analysis focused on factors influencing comprehensibility, user experience (UX), and acceptability of the system. Result: Despite user preference for adjustable settings, no significant improvements in UX or comprehensibility were observed. Hedonic quality was rated higher than pragmatic quality, stress levels were higher for the adjustable avatar, and concerns were raised about the intuitiveness of Hololens adjustment gestures. Conclusion: This study concludes that personalization alone is not enough for SL avatars; they must be comprehensible by default. The research highlights the importance of enhancing mouthing and facial animation, improving interaction interfaces, and applying participatory design for better user experience and acceptability. Abstract: This paper presents an investigation into the impact of adding adjustment features to an existing sign language (SL) avatar on a Microsoft Hololens 2 device. Through a detailed analysis of interactions of expert German Sign Language (DGS) users with both adjustable and non-adjustable avatars in a specific use case, this study identifies the key factors influencing the comprehensibility, the user experience (UX), and the acceptability of such a system. Despite user preference for adjustable settings, no significant improvements in UX or comprehensibility were observed, which remained at low levels, amid missing SL elements (mouthings and facial expressions) and implementation issues (indistinct hand shapes, lack of feedback and menu positioning). 
Hedonic quality was rated higher than pragmatic quality, indicating that users found the system more emotionally or aesthetically pleasing than functionally useful. Stress levels were higher for the adjustable avatar, reflecting lower performance, greater effort and more frustration. Additionally, concerns were raised about whether the Hololens adjustment gestures are intuitive and easy to familiarise oneself with. While acceptability of the concept of adjustability was generally positive, it was strongly dependent on usability and animation quality. This study highlights that personalisation alone is insufficient, and that SL avatars must be comprehensible by default. Key recommendations include enhancing mouthing and facial animation, improving interaction interfaces, and applying participatory design.

[25] Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025

Samy Ateia,Udo Kruschwitz

Main category: cs.CL

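TL;DR: This paper studies self-feedback mechanisms for large language models in biomedical professional search, examining their effect on query expansion and on generating multiple answer types, and comparing reasoning and non-reasoning models.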

Details Motivation: Agentic retrieval-augmented generation (RAG) and "deep research" systems aim to let large language models (LLMs) iteratively refine their outputs autonomously, but in professional search tasks such as biomedicine these automated systems may reduce user involvement and misalign with experts' information needs. The paper therefore studies LLM self-correction and explores its potential and limits in professional search systems. Method: Using the BioASQ CLEF 2025 challenge data, the authors test models including Gemini-Flash 2.0, o3-mini, o4-mini and DeepSeek-R1 with a self-feedback mechanism in which the LLM generates, evaluates, and refines its own output for query expansion and for multiple answer types (yes/no, factoid, list, ideal), and examine whether reasoning models generate more useful feedback. Result: Preliminary experiments show that performance under the self-feedback strategy varies across models; reasoning models generate stronger feedback on some tasks, but the overall effect depends on the task type. Conclusion: The paper examines how well self-feedback works for current reasoning and non-reasoning LLMs in biomedical professional search; preliminary results vary across models and tasks, and the study informs future comparisons between LLM-generated feedback and direct expert input. Abstract: Agentic Retrieval Augmented Generation (RAG) and 'deep research' systems aim to enable autonomous search processes where Large Language Models (LLMs) iteratively refine outputs. However, applying these systems to domain-specific professional search, such as biomedical research, presents challenges, as automated systems may reduce user involvement and misalign with expert information needs. Professional search tasks often demand high levels of user expertise and transparency. The BioASQ CLEF 2025 challenge, using expert-formulated questions, can serve as a platform to study these issues. We explored the performance of current reasoning and nonreasoning LLMs like Gemini-Flash 2.0, o3-mini, o4-mini and DeepSeek-R1. A key aspect of our methodology was a self-feedback mechanism where LLMs generated, evaluated, and then refined their outputs for query expansion and for multiple answer types (yes/no, factoid, list, ideal). We investigated whether this iterative self-correction improves performance and if reasoning models are more capable of generating useful feedback. Preliminary results indicate varied performance for the self-feedback strategy across models and tasks. This work offers insights into LLM self-correction and informs future work on comparing the effectiveness of LLM-generated feedback with direct human expert input in these search systems.

[26] The TUB Sign Language Corpus Collection

Eleftherios Avramidis,Vera Czehmann,Fabian Deckert,Lorenz Hufe,Aljoscha Lipski,Yuni Amaloa Quintero Villalobos,Tae Kwon Rhee,Mengqian Shi,Lennart Stölting,Fabrizio Nunnari,Sebastian Möller

Main category: cs.CL

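TL;DR: This paper presents a video corpus collection covering 12 sign languages, totaling more than 1,300 hours, as a major resource for sign language research.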

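Details Motivation: To advance sign language research and related applications, the authors build a large-scale collection of sign language videos with parallel subtitles. Method: Videos in multiple sign languages were collected and processed from various online sources, mainly broadcast material from news shows, governmental bodies and educational channels, through several stages: data collection, informing content creators, seeking usage approvals, scraping, and cropping. Result: The resulting collection covers 12 sign languages, with more than 1,300 hours in 4,381 video files and subtitles containing about 14 million tokens; the corpora for 8 Latin American sign languages are the first consistent ones of their kind, and the German Sign Language corpus is ten times the size of previously available corpora. Conclusion: The paper describes the process and outcome of creating parallel corpora for 12 sign languages, providing an important data resource for sign language research and applications. Abstract: We present a collection of parallel corpora of 12 sign languages in video format, together with subtitles in the dominant spoken languages of the corresponding countries. The entire collection includes more than 1,300 hours in 4,381 video files, accompanied by 1.3M subtitles containing 14M tokens. Most notably, it includes the first consistent parallel corpora for 8 Latin American sign languages, whereas the size of the German Sign Language corpora is ten times the size of the previously available corpora. The collection was created by collecting and processing videos of multiple sign languages from various online sources, mainly broadcast material of news shows, governmental bodies and educational channels. The preparation involved several stages, including data collection, informing the content creators and seeking usage approvals, scraping, and cropping. The paper provides statistics on the collection and an overview of the methods used to collect the data.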

[27] MyCulture: Exploring Malaysia's Diverse Culture under Low-Resource Language Constraints

Zhong Ken Hew,Jia Xin Low,Sze Jue Yang,Chee Seng Chan

Main category: cs.CL

TL;DR: This paper introduces MyCulture, a new benchmark for evaluating Large Language Models on Malaysian culture, highlighting the need for culturally grounded and linguistically inclusive benchmarks due to observed disparities in cultural comprehension among existing models.

Details Motivation: The motivation for this study stems from the observed cultural biases in Large Language Models (LLMs) due to their training on data dominated by high-resource languages like English and Chinese. This bias poses challenges for accurately representing and evaluating diverse cultural contexts, particularly in low-resource language settings. Method: The researchers introduced MyCulture, a benchmark designed to evaluate LLMs on Malaysian culture using a novel open-ended multiple-choice question format without predefined options. They analyzed structural bias by comparing model performance on structured versus free-form outputs and assessed language bias through multilingual prompt variations. Result: The evaluation across various regional and international LLMs showed significant disparities in cultural comprehension, emphasizing the need for more inclusive benchmarks. Conclusion: The study concludes that there is a significant disparity in cultural comprehension among LLMs, which highlights the importance of developing benchmarks that are culturally grounded and linguistically inclusive. Abstract: Large Language Models (LLMs) often exhibit cultural biases due to training data dominated by high-resource languages like English and Chinese. This poses challenges for accurately representing and evaluating diverse cultural contexts, particularly in low-resource language settings. To address this, we introduce MyCulture, a benchmark designed to comprehensively evaluate LLMs on Malaysian culture across six pillars: arts, attire, customs, entertainment, food, and religion presented in Bahasa Melayu. Unlike conventional benchmarks, MyCulture employs a novel open-ended multiple-choice question format without predefined options, thereby reducing guessing and mitigating format bias. We provide a theoretical justification for the effectiveness of this open-ended structure in improving both fairness and discriminative power. 
Furthermore, we analyze structural bias by comparing model performance on structured versus free-form outputs, and assess language bias through multilingual prompt variations. Our evaluation across a range of regional and international LLMs reveals significant disparities in cultural comprehension, highlighting the urgent need for culturally grounded and linguistically inclusive benchmarks in the development and assessment of LLMs.

[28] LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

Ming Zhang,Yujiong Shen,Jingyi Deng,Yuhui Wang,Yue Zhang,Junzhe Wang,Shichun Liu,Shihan Dou,Huayu Sha,Qiyuan Peng,Changhao Jiang,Jingqi Tong,Yilong Wu,Zhihao Zhang,Mingqi Wu,Zhiheng Xi,Mingxu Chai,Tao Liang,Zhihui Fei,Zhen Wang,Mingyang Wan,Guojun Ma,Tao Gui,Qi Zhang,Xuanjing Huang

Main category: cs.CL

TL;DR: LLMEval-3 is a dynamic evaluation framework for Large Language Models that addresses issues like data contamination and leaderboard overfitting, offering a more trustworthy method to assess true model capabilities.

Details Motivation: Existing evaluations of LLMs on static benchmarks are vulnerable to data contamination and leaderboard overfitting, which obscures the true capabilities of models. This paper aims to introduce a more reliable evaluation framework. Method: LLMEval-3 uses a proprietary bank of 220k graduate-level questions to dynamically sample unseen test sets for each evaluation run. It ensures integrity through contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process. The framework also includes a relative ranking system for fair comparison. Result: The LLMEval-3 framework demonstrates exceptional robustness in ranking stability and consistency. A 20-month longitudinal study of nearly 50 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. Conclusion: LLMEval-3 provides a robust and credible methodology for assessing the true capabilities of Large Language Models (LLMs) beyond leaderboard scores, promoting the development of more trustworthy evaluation standards. Abstract: Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. 
A 20-month longitudinal study of nearly 50 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-3 offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.
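To make the idea of dynamically sampling unseen test sets concrete, here is a minimal sketch. The abstract does not describe the sampling procedure, so the per-model `seen` set, the function name `sample_unseen`, and the uniform sampling are assumptions for illustration only.

```python
import random


def sample_unseen(bank_ids, seen: set, k: int, rng: random.Random) -> list:
    """Draw k question ids that this model has not been shown before,
    then record them so later runs cannot reuse them."""
    candidates = [qid for qid in bank_ids if qid not in seen]
    if len(candidates) < k:
        raise ValueError("not enough unseen questions left in the bank")
    chosen = rng.sample(candidates, k)
    seen.update(chosen)
    return chosen


# Toy bank of 10 question ids; two consecutive runs never share a question.
rng = random.Random(0)
seen_by_model = set()
first_run = sample_unseen(range(10), seen_by_model, 4, rng)
second_run = sample_unseen(range(10), seen_by_model, 4, rng)
```

With a bank on the order of 220k questions, as reported in the abstract, a scheme of this kind would take many runs before exhausting unseen items.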

[29] TASE: Token Awareness and Structured Evaluation for Multilingual Language Models

Chenzhuo Zhao,Xinda Wang,Yue Huang,Junting Lu,Ziqian Liu

Main category: cs.CL

TL;DR: TASE benchmark identifies persistent weaknesses in large language models' token-level reasoning across multiple languages.

Details Motivation: LLMs struggle with fine-grained token-level understanding despite strong high-level semantic performance. Method: Introduced TASE benchmark with 10 tasks across 3 languages to evaluate token-level understanding and structural reasoning. Result: Human performance significantly outpaces current LLMs on token-level reasoning tasks. Conclusion: TASE benchmark reveals current LLMs' weaknesses in token-level reasoning and provides insights for future improvements. Abstract: While large language models (LLMs) have demonstrated remarkable performance on high-level semantic tasks, they often struggle with fine-grained, token-level understanding and structural reasoning--capabilities that are essential for applications requiring precision and control. We introduce TASE, a comprehensive benchmark designed to evaluate LLMs' ability to perceive and reason about token-level information across languages. TASE covers 10 tasks under two core categories: token awareness and structural understanding, spanning Chinese, English, and Korean, with a 35,927-instance evaluation set and a scalable synthetic data generation pipeline for training. Tasks include character counting, token alignment, syntactic structure parsing, and length constraint satisfaction. We evaluate over 30 leading commercial and open-source LLMs, including O3, Claude 4, Gemini 2.5 Pro, and DeepSeek-R1, and train a custom Qwen2.5-14B model using the GRPO training method. Results show that human performance significantly outpaces current LLMs, revealing persistent weaknesses in token-level reasoning. TASE sheds light on these limitations and provides a new diagnostic lens for future improvements in low-level language understanding and cross-lingual generalization. Our code and dataset are publicly available at https://github.com/cyzcz/Tase .

[30] Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations

Li-Chun Lu,Miri Liu,Pin-Chun Lu,Yufei Tian,Shao-Hua Sun,Nanyun Peng

Main category: cs.CL

TL;DR: This paper evaluates common creativity metrics and finds them inconsistent and limited, calling for improved frameworks that better align with human creativity judgments.

Details Motivation: The motivation is to evaluate and understand the effectiveness of existing creativity metrics in capturing the true essence of creativity across different domains. Method: The authors systematically examined, analyzed, and compared various creativity measures—creativity index, perplexity, syntactic templates, and LLM-as-a-Judge—across multiple creative domains like writing, problem-solving, and research ideation. Result: The analysis showed that the metrics exhibit limited consistency, each capturing different aspects of creativity. Issues like the creativity index's focus on lexical diversity, perplexity's dependence on model confidence, syntactic templates' failure to capture conceptual creativity, and instability and bias in LLM-as-a-Judge were identified. Conclusion: The paper concludes that current creativity measures have significant limitations and do not consistently capture the multifaceted nature of creativity. It emphasizes the need for more robust frameworks that align with human judgments. Abstract: We systematically examine, analyze, and compare representative creativity measures--creativity index, perplexity, syntactic templates, and LLM-as-a-Judge--across diverse creative domains, including creative writing, unconventional problem-solving, and research ideation. Our analyses reveal that these metrics exhibit limited consistency, capturing different dimensions of creativity. We highlight key limitations, including the creativity index's focus on lexical diversity, perplexity's sensitivity to model confidence, and syntactic templates' inability to capture conceptual creativity. Additionally, LLM-as-a-Judge shows instability and bias. Our findings underscore the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity.

[31] LAG: Logic-Augmented Generation from a Cartesian Perspective

Yilin Xiao,Chuang Zhou,Qinggang Zhang,Su Dong,Shengyuan Chen,Xiao Huang

Main category: cs.CL

TL;DR: This paper proposes Logic-Augmented Generation (LAG), a novel approach to improve the reasoning capabilities of large language models (LLMs), addressing their limitations in knowledge-intensive tasks and offering a principled alternative to retrieval-augmented generation (RAG) systems.

Details Motivation: Large language models (LLMs) exhibit critical limitations in knowledge-intensive tasks, often generating hallucinations when faced with questions requiring specialized expertise. Retrieval-augmented generation (RAG) struggles with complex reasoning scenarios due to its reliance on direct semantic retrieval and lack of structured logical organization. Method: This paper introduces Logic-Augmented Generation (LAG), which reframes knowledge augmentation through systematic question decomposition and dependency-aware reasoning. It decomposes complex questions into atomic sub-questions, resolves them sequentially while using prior answers to guide context retrieval, and synthesizes all sub-resolutions to generate verified responses. It also includes a logical termination mechanism to prevent error propagation. Result: Experiments on four benchmark datasets demonstrate that LAG significantly enhances reasoning robustness, reduces hallucination, and aligns LLM problem-solving with human cognition. Conclusion: LAG offers a principled alternative to existing RAG systems by enhancing reasoning robustness, reducing hallucination, and aligning LLM problem-solving with human cognition. Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet exhibit critical limitations in knowledge-intensive tasks, often generating hallucinations when faced with questions requiring specialized expertise. While retrieval-augmented generation (RAG) mitigates this by integrating external knowledge, it struggles with complex reasoning scenarios due to its reliance on direct semantic retrieval and lack of structured logical organization. Inspired by Cartesian principles from Discours de la méthode, this paper introduces Logic-Augmented Generation (LAG), a novel paradigm that reframes knowledge augmentation through systematic question decomposition and dependency-aware reasoning.
Specifically, LAG first decomposes complex questions into atomic sub-questions ordered by logical dependencies. It then resolves these sequentially, using prior answers to guide context retrieval for subsequent sub-questions, ensuring stepwise grounding in the logical chain. To prevent error propagation, LAG incorporates a logical termination mechanism that halts inference upon encountering unanswerable sub-questions and reduces wasted computation on excessive reasoning. Finally, it synthesizes all sub-resolutions to generate verified responses. Experiments on four benchmark datasets demonstrate that LAG significantly enhances reasoning robustness, reduces hallucination, and aligns LLM problem-solving with human cognition, offering a principled alternative to existing RAG systems.
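
The decompose, resolve, terminate loop described in the LAG abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `resolve_in_order`, the `{prev}` placeholder convention, and the lookup table `TOY_KB` are hypothetical, and the toy functions stand in for a real retriever and LLM.

```python
from typing import Callable, Optional

# Hypothetical toy knowledge base standing in for a retriever plus an LLM.
TOY_KB = {
    "author of Discours de la methode": "Rene Descartes",
    "birthplace of Rene Descartes": "La Haye en Touraine",
}


def toy_retrieve(sub_question: str, prior_answers: list[str]) -> str:
    """Rewrite the query with the most recent answer, then look it up."""
    query = sub_question
    if prior_answers:
        query = sub_question.replace("{prev}", prior_answers[-1])
    return TOY_KB.get(query, "")


def toy_answer(sub_question: str, context: str) -> Optional[str]:
    """Return None when the retrieved context cannot answer the sub-question."""
    return context or None


def resolve_in_order(
    sub_questions: list[str],
    retrieve: Callable[[str, list[str]], str],
    answer: Callable[[str, str], Optional[str]],
) -> tuple[list[tuple[str, str]], bool]:
    """Resolve sub-questions in dependency order.

    Stops at the first unanswerable sub-question (the "logical termination"
    idea) and reports whether the whole chain was resolved.
    """
    resolved: list[tuple[str, str]] = []
    for sub_question in sub_questions:
        prior = [a for _, a in resolved]
        context = retrieve(sub_question, prior)  # prior answers guide retrieval
        result = answer(sub_question, context)
        if result is None:
            return resolved, False  # halt instead of propagating an error
        resolved.append((sub_question, result))
    return resolved, True


steps, complete = resolve_in_order(
    ["author of Discours de la methode", "birthplace of {prev}"],
    toy_retrieve,
    toy_answer,
)
print(steps[-1][1], complete)
```

A final synthesis step would combine the entries of `steps` into one response; it is omitted here because the abstract does not specify how synthesis or verification is done.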

[32] The World According to LLMs: How Geographic Origin Influences LLMs' Entity Deduction Capabilities

Harsh Nishant Lalai,Raj Sanjay Shah,Jiaxin Pei,Sashank Varma,Yi-Chia Wang,Ali Emami

Main category: cs.CL

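TL;DR: Using Geo20Q+, a new dataset built on the 20 Questions game, the paper reveals implicit geographic and cultural biases in the reasoning of large language models.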

Details Motivation: Although large language models have been extensively tuned to mitigate explicit biases, they still exhibit implicit biases rooted in their pre-training data. This paper studies how models behave when they proactively ask the questions themselves, in order to surface these implicit biases. Method: The paper uses the 20 Questions game as a testbed and, with a new dataset, Geo20Q+, systematically evaluates geographic performance disparities in entity deduction over notable people and culturally significant objects from different regions. The tests cover two gameplay configurations and seven languages. Result: LLMs are substantially more successful at deducing entities from the Global North and the Global West than from the Global South and the Global East. Wikipedia pageviews and pre-training corpus frequency correlate mildly with performance but do not fully explain these disparities. Conclusion: LLMs carry implicit geographic and cultural biases in their reasoning processes; these biases are hard to detect in standard prompting setups but can be uncovered with creative, free-form evaluation frameworks. The language in which the game is played has little effect on the performance gaps. Abstract: Large Language Models (LLMs) have been extensively tuned to mitigate explicit biases, yet they often exhibit subtle implicit biases rooted in their pre-training data. Rather than directly probing LLMs with human-crafted questions that may trigger guardrails, we propose studying how models behave when they proactively ask questions themselves. The 20 Questions game, a multi-turn deduction task, serves as an ideal testbed for this purpose. We systematically evaluate geographic performance disparities in entity deduction using a new dataset, Geo20Q+, consisting of both notable people and culturally significant objects (e.g., foods, landmarks, animals) from diverse regions. We test popular LLMs across two gameplay configurations (canonical 20-question and unlimited turns) and in seven languages (English, Hindi, Mandarin, Japanese, French, Spanish, and Turkish). Our results reveal geographic disparities: LLMs are substantially more successful at deducing entities from the Global North than the Global South, and the Global West than the Global East. While Wikipedia pageviews and pre-training corpus frequency correlate mildly with performance, they fail to fully explain these disparities. Notably, the language in which the game is played has minimal impact on performance gaps. These findings demonstrate the value of creative, free-form evaluation frameworks for uncovering subtle biases in LLMs that remain hidden in standard prompting setups. 
By analyzing how models initiate and pursue reasoning goals over multiple turns, we find geographic and cultural disparities embedded in their reasoning processes. We release the dataset (Geo20Q+) and code at https://sites.google.com/view/llmbias20q/home.

Santosh T. Y. S. S,Youssef Tarek Elkhayat,Oana Ichim,Pranav Shetty,Dongsheng Wang,Zhiqiang Ma,Armineh Nourbakhsh,Xiaomo Liu

Main category: cs.CL

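TL;DR: CoCoLex improves the accuracy and faithfulness of legal text generation by dynamically interpolating the model's distribution with a context-copying mechanism.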

Details Motivation: Adoption of LLMs in the legal domain is limited by their tendency to generate inaccurate or fabricated outputs. Retrieval-augmented generation grounds outputs in external knowledge, but it does not guarantee that the context is effectively integrated. Method: Introduces Confidence-guided Copy-based Decoding (CoCoLex), which dynamically interpolates the model-produced vocabulary distribution with a distribution based on copying from the context. Result: Experiments show that CoCoLex outperforms existing context-aware decoding methods on five legal benchmarks, particularly in long-form generation tasks. Conclusion: CoCoLex is an effective decoding strategy for legal text generation that keeps generated content more faithful to the source. Abstract: Due to their ability to process long and complex contexts, LLMs can offer key benefits to the Legal domain, but their adoption has been hindered by their tendency to generate unfaithful, ungrounded, or hallucinatory outputs. While Retrieval-Augmented Generation offers a promising solution by grounding generations in external knowledge, it offers no guarantee that the provided context will be effectively integrated. To address this, context-aware decoding strategies have been proposed to amplify the influence of relevant context, but they usually do not explicitly enforce faithfulness to the context. In this work, we introduce Confidence-guided Copy-based Decoding for Legal Text Generation (CoCoLex), a decoding strategy that dynamically interpolates the model-produced vocabulary distribution with a distribution derived by copying from the context. CoCoLex encourages direct copying based on the model's confidence, ensuring greater fidelity to the source. Experimental results on five legal benchmarks demonstrate that CoCoLex outperforms existing context-aware decoding methods, particularly in long-form generation tasks.
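
To make the interpolation idea concrete, here is a small sketch of mixing a model distribution with a copy distribution, where the mixing weight depends on model confidence. The abstract does not give the exact formula, so everything specific here is an assumption made for illustration: confidence is taken as the model's top-token probability, the copy weight is `1 - confidence`, and the function name `interpolate_with_copy` is hypothetical.

```python
def interpolate_with_copy(
    p_model: dict[str, float],
    p_copy: dict[str, float],
) -> dict[str, float]:
    """Mix two next-token distributions.

    Assumption (not from the paper): the less confident the model is,
    the more probability mass is shifted toward copying from the context.
    """
    confidence = max(p_model.values())  # assumed confidence measure
    copy_weight = 1.0 - confidence
    vocab = set(p_model) | set(p_copy)
    return {
        token: (1.0 - copy_weight) * p_model.get(token, 0.0)
        + copy_weight * p_copy.get(token, 0.0)
        for token in vocab
    }


# Toy example: the model is unsure, while the context strongly supports "must".
p_model = {"shall": 0.40, "may": 0.35, "must": 0.25}
p_copy = {"must": 1.0}
mixed = interpolate_with_copy(p_model, p_copy)
best = max(mixed, key=mixed.get)
print(best)
```

Because both inputs sum to one and the weights sum to one, the mixture is again a valid distribution; in this toy case the low model confidence lets the context-supported token win.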

[34] Conformal Sets in Multiple-Choice Question Answering under Black-Box Settings with Provable Coverage Guarantees

Guang Yang,Xinyang Liu

Main category: cs.CL

TL;DR: This paper proposes a reliable uncertainty quantification method for LLMs in multiple-choice question answering, using frequency-based predictive entropy to ensure coverage guarantees and improve trustworthiness in black-box settings.

Details Motivation: The motivation stems from the unreliability of LLMs, such as hallucinations and overconfidence, which hinder their application in high-risk domains and necessitate a reliable uncertainty quantification framework. Method: The method uses conformal prediction with multiple samplings of the LLM's output distribution to calculate predictive entropy (PE), using the most frequent sample as a reference point for uncertainty quantification. Result: Experimental results show that the frequency-based PE outperforms logit-based PE in distinguishing correct from incorrect predictions (measured by AUROC) and effectively controls the empirical miscoverage rate under specified risk levels. Conclusion: The proposed frequency-based uncertainty quantification method enhances the reliability of LLMs in MCQA by providing provable coverage guarantees and effectively controlling miscoverage rates, making it a viable solution for black-box scenarios. Abstract: Large Language Models (LLMs) have shown remarkable progress in multiple-choice question answering (MCQA), but their inherent unreliability, such as hallucination and overconfidence, limits their application in high-risk domains. To address this, we propose a frequency-based uncertainty quantification method under black-box settings, leveraging conformal prediction (CP) to ensure provable coverage guarantees. Our approach involves multiple independent samplings of the model's output distribution for each input, with the most frequent sample serving as a reference to calculate predictive entropy (PE). Experimental evaluations across six LLMs and four datasets (MedMCQA, MedQA, MMLU, MMLU-Pro) demonstrate that frequency-based PE outperforms logit-based PE in distinguishing between correct and incorrect predictions, as measured by AUROC. 
Furthermore, the method effectively controls the empirical miscoverage rate under user-specified risk levels, validating that sampling frequency can serve as a viable substitute for logit-based probabilities in black-box scenarios. This work provides a distribution-free model-agnostic framework for reliable uncertainty quantification in MCQA with guaranteed coverage, enhancing the trustworthiness of LLMs in practical applications.
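
The black-box recipe above (sample repeatedly, use answer frequencies in place of logits, calibrate with conformal prediction) can be sketched with standard split conformal prediction. This is an illustration under assumptions, not the paper's exact procedure: the nonconformity score is taken to be `1 - frequency` of an option, which may differ from the predictive-entropy score the authors use, and all function names are hypothetical.

```python
import math
from collections import Counter


def option_frequencies(samples: list[str], options: list[str]) -> dict[str, float]:
    """Empirical frequency of each option among repeated samples of one question."""
    counts = Counter(samples)
    return {o: counts[o] / len(samples) for o in options}


def calibrate_threshold(cal_scores: list[float], alpha: float) -> float:
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: include every option
    return sorted(cal_scores)[k - 1]


def prediction_set(freqs: dict[str, float], threshold: float) -> set[str]:
    """Keep every option whose assumed score (1 - frequency) is within the threshold."""
    return {o for o, f in freqs.items() if 1.0 - f <= threshold}


# Calibration scores: 1 - frequency of the TRUE option on held-out questions.
cal_scores = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.9]
threshold = calibrate_threshold(cal_scores, alpha=0.25)

# Test question: ten sampled answers from the black-box model.
samples = ["B"] * 6 + ["C"] * 3 + ["A"]
freqs = option_frequencies(samples, ["A", "B", "C", "D"])
print(threshold, prediction_set(freqs, threshold))
```

Under exchangeability of calibration and test questions, sets built this way contain the true option with probability at least `1 - alpha`, which is the coverage guarantee the abstract refers to.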

[35] Do Political Opinions Transfer Between Western Languages? An Analysis of Unaligned and Aligned Multilingual LLMs

Franziska Weeber,Tanise Ceron,Sebastian Padó

Main category: cs.CL

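TL;DR: The study finds that political opinions in multilingual large language models transfer across languages, and that alignment shifts opinions almost uniformly in all languages, which points to challenges for alignment.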

Details Motivation: To investigate whether the cross-cultural differences in political opinions seen in public opinion surveys translate into cross-lingual differences in multilingual large language models. Method: Evaluates the political opinions of multilingual LLMs in different languages using political statements from voting advice applications, and aligns the models' political stance using direct preference optimization. Result: Unaligned models show very few significant cross-lingual differences, and political alignment shifts opinions almost uniformly across all five languages. Conclusion: Political opinions transfer between languages, which demonstrates the challenges in achieving socio-linguistic, cultural, and political alignment of multilingual LLMs. Abstract: Public opinion surveys show cross-cultural differences in political opinions between socio-cultural contexts. However, there is no clear evidence whether these differences translate to cross-lingual differences in multilingual large language models (MLLMs). We analyze whether opinions transfer between languages or whether there are separate opinions for each language in MLLMs of various sizes across five Western languages. We evaluate MLLMs' opinions by prompting them to report their (dis)agreement with political statements from voting advice applications. To better understand the interaction between languages in the models, we evaluate them both before and after aligning them with more left or right views using direct preference optimization and English alignment data only. Our findings reveal that unaligned models show only very few significant cross-lingual differences in the political opinions they reflect. The political alignment shifts opinions almost uniformly across all five languages. We conclude that in Western language contexts, political opinions transfer between languages, demonstrating the challenges in achieving explicit socio-linguistic, cultural, and political alignment of MLLMs.

[36] MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy

Shaoxiong Zhan,Yanlin Lai,Ziyu Lu,Dahua Lin,Ziqing Yang,Fei Tang

Main category: cs.CL

TL;DR: MathSmith is a new framework for generating challenging mathematical problems that enhance LLM reasoning by ensuring data independence, increasing difficulty through structured strategies, and improving performance across multiple benchmarks.

Details Motivation: The advancement of large language models (LLMs) in mathematical reasoning is limited by the scarcity of high-quality, high-difficulty training data, and existing synthesis methods lack diversity and scalability. Method: MathSmith synthesizes mathematical problems from scratch using concept-explanation pairs from PlanetMath and applies nine predefined strategies as soft constraints. It uses reinforcement learning to optimize structural validity, reasoning complexity, and answer consistency, with problem difficulty reflected through the length of the reasoning trace generated under autoregressive prompting. Result: Experiments across five benchmarks (GSM8K, MATH-500, AIME2024, AIME2025, OlympiadBench) show that MathSmith outperforms existing baselines under both short and long chain-of-thought (CoT) settings, with additional benefits from its weakness-focused variant generation module. Conclusion: MathSmith demonstrates strong scalability, generalization, and transferability, highlighting the potential of high-difficulty synthetic data in advancing LLM reasoning capabilities. Abstract: Large language models have achieved substantial progress in mathematical reasoning, yet their advancement is limited by the scarcity of high-quality, high-difficulty training data. Existing synthesis methods largely rely on transforming human-written templates, limiting both diversity and scalability. We propose MathSmith, a novel framework for synthesizing challenging mathematical problems to enhance LLM reasoning. Rather than modifying existing problems, MathSmith constructs new ones from scratch by randomly sampling concept-explanation pairs from PlanetMath, ensuring data independence and avoiding contamination. To increase difficulty, we design nine predefined strategies as soft constraints during rationales. We further adopt reinforcement learning to jointly optimize structural validity, reasoning complexity, and answer consistency. 
The length of the reasoning trace generated under autoregressive prompting is used to reflect cognitive complexity, encouraging the creation of more demanding problems aligned with long-chain-of-thought reasoning. Experiments across five benchmarks, categorized as easy & medium (GSM8K, MATH-500) and hard (AIME2024, AIME2025, OlympiadBench), show that MathSmith consistently outperforms existing baselines under both short and long CoT settings. Additionally, a weakness-focused variant generation module enables targeted improvement on specific concepts. Overall, MathSmith exhibits strong scalability, generalization, and transferability, highlighting the promise of high-difficulty synthetic data in advancing LLM reasoning capabilities.

[37] Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models

Haitao Hong,Yuchen Yan,Xingyu Wu,Guiyang Hou,Wenqi Zhang,Weiming Lu,Yongliang Shen,Jun Xiao

Main category: cs.CL

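TL;DR: This work proposes Cooper, a reinforcement learning framework that jointly optimizes the policy model and the reward model, addressing the limitations of existing reward paradigms and improving performance in practice.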

Details Motivation: The two mainstream reward paradigms, model-based rewards and rule-based rewards, both have limitations: rule-based rewards lack robustness, while model-based rewards are vulnerable to reward hacking. Method: Proposes Cooper, an RL framework that jointly optimizes the policy model and the reward model, together with a hybrid annotation strategy and a reference-based reward modeling paradigm. Result: Cooper alleviates reward hacking while improving end-to-end RL performance, for example a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct. Conclusion: Dynamically updating the reward model is an effective way to combat reward hacking, providing a reference for better integrating reward models into RL. Abstract: Large language models (LLMs) have demonstrated remarkable performance in reasoning tasks, where reinforcement learning (RL) serves as a key algorithm for enhancing their reasoning capabilities. Currently, there are two mainstream reward paradigms: model-based rewards and rule-based rewards. However, both approaches suffer from limitations: rule-based rewards lack robustness, while model-based rewards are vulnerable to reward hacking. To address these issues, we propose Cooper (Co-optimizing Policy Model and Reward Model), an RL framework that jointly optimizes both the policy model and the reward model. Cooper leverages the high precision of rule-based rewards when identifying correct responses, and dynamically constructs and selects positive-negative sample pairs to continue training the reward model. This design enhances robustness and mitigates the risk of reward hacking. To further support Cooper, we introduce a hybrid annotation strategy that efficiently and accurately generates training data for the reward model. We also propose a reference-based reward modeling paradigm, where the reward model takes a reference answer as input. Based on this design, we train a reward model named VerifyRM, which achieves higher accuracy on VerifyBench compared to other models of the same size. We conduct reinforcement learning using both VerifyRM and Cooper. Our experiments show that Cooper not only alleviates reward hacking but also improves end-to-end RL performance, for instance, achieving a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct. 
Our findings demonstrate that dynamically updating the reward model is an effective way to combat reward hacking, providing a reference for better integrating reward models into RL.

[38] OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks

Zixuan Wang,Dingming Li,Hongxing Li,Shuo Chen,Yuchen Yan,Wenqi Zhang,Yongliang Shen,Weiming Lu,Jun Xiao,Yueting Zhuang

Main category: cs.CL

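TL;DR: The OmniEAR framework reveals significant limitations of current language models in embodied reasoning and multi-agent collaboration, underscoring the challenges facing embodied AI systems.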

Details Motivation: Large language models excel at abstract reasoning, but their capacity for embodied agent reasoning remains largely unexplored. Method: Proposes the OmniEAR framework, which models continuous physical properties and complex spatial relationships through text-based environment representation and systematically evaluates language models on dynamic capability acquisition and autonomous coordination strategies. Result: Model performance degrades sharply on tasks that require reasoning from constraints, especially implicit collaboration and compound tasks. Fine-tuning improves single-agent tasks but helps little on multi-agent tasks. Conclusion: As a rigorous benchmark, OmniEAR exposes the limitations of current models in embodied reasoning and gives direction for advancing embodied AI systems. Abstract: Large language models excel at abstract reasoning but their capacity for embodied agent reasoning remains largely unexplored. We present OmniEAR, a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Through text-based environment representation, we model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints: while achieving 85-96% success with explicit instructions, performance drops to 56-85% for tool reasoning and 63-85% for implicit collaboration, with compound tasks showing over 50% failure rates. Surprisingly, complete environmental information degrades coordination performance, indicating models cannot filter task-relevant constraints. Fine-tuning improves single-agent tasks dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations. These findings demonstrate that embodied reasoning poses fundamentally different challenges than current models can address, establishing OmniEAR as a rigorous benchmark for evaluating and advancing embodied AI systems. Our code and data are included in the supplementary materials and will be open-sourced upon acceptance.

[39] Learning to Reason for Factuality

Xilun Chen,Ilia Kulikov,Vincent-Pierre Berges,Barlas Oğuz,Rulin Shao,Gargi Ghosh,Jason Weston,Wen-tau Yih

Main category: cs.CL

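TL;DR: The paper proposes a new reward function combined with online reinforcement learning that improves reasoning LLMs on long-form factuality tasks, reducing hallucinations and increasing the level of detail in responses.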

Details Motivation: Reasoning LLMs have advanced significantly on complex reasoning tasks but struggle with long-form factuality and are prone to hallucination. Existing automatic factuality evaluation methods lead to reward hacking when used as the reward in online RL, so a new reward function is needed. Method: Designs a reward function that jointly considers factual precision, response detail level, and answer relevance, and trains with online RL to improve reasoning LLMs on long-form factuality tasks. Result: On six long-form factuality benchmarks, the method achieves an average reduction of 23.1 percentage points in hallucination rate and a 23% increase in answer detail level, with no degradation in overall response helpfulness. Conclusion: A new reward function combined with online RL effectively improves reasoning LLMs on long-form factuality, lowering the hallucination rate while increasing response detail. Abstract: Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has utilized automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly leveraging such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or relevant responses. We propose a novel reward function that simultaneously considers the factual precision, response detail level, and answer relevance, and apply online RL to learn high quality factual reasoning. Evaluated on six long-form factuality benchmarks, our factual reasoning model achieves an average reduction of 23.1 percentage points in hallucination rate, a 23% increase in answer detail level, and no degradation in the overall response helpfulness.

[40] How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations

Brandon Jaipersaud,David Krueger,Ekdeep Singh Lubana

Main category: cs.CL

TL;DR: This paper uses linear probes to study persuasion dynamics in multi-turn conversations, showing that they are effective and efficient tools for analyzing complex behaviors like persuasion.

Details Motivation: To understand how large language models persuade humans and analyze this dynamic, especially in natural, multi-turn conversations. Method: Applied linear probes to analyze LLM representations in persuasion dynamics, based on insights from cognitive science, focusing on aspects like persuasion success, personality, and strategy. Result: Linear probes effectively captured persuasion dynamics, identified key conversation points, and outperformed prompting methods in certain aspects like uncovering persuasion strategy. Conclusion: Probes are a promising and efficient method for analyzing complex behaviors like persuasion, deception, and manipulation in multi-turn conversations and large datasets. Abstract: Large Language Models (LLMs) have started to demonstrate the ability to persuade humans, yet our understanding of how this dynamic transpires is limited. Recent work has used linear probes, lightweight tools for analyzing model representations, to study various LLM skills such as the ability to model user sentiment and political perspective. Motivated by this, we apply probes to study persuasion dynamics in natural, multi-turn conversations. We leverage insights from cognitive science to train probes on distinct aspects of persuasion: persuasion success, persuadee personality, and persuasion strategy. Despite their simplicity, we show that they capture various aspects of persuasion at both the sample and dataset levels. For instance, probes can identify the point in a conversation where the persuadee was persuaded or where persuasive success generally occurs across the entire dataset. We also show that in addition to being faster than expensive prompting-based approaches, probes can do just as well and even outperform prompting in some settings, such as when uncovering persuasion strategy. 
This suggests probes as a plausible avenue for studying other complex behaviours such as deception and manipulation, especially in multi-turn settings and large-scale dataset analysis where prompting-based methods would be computationally inefficient.
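
A linear probe of the kind described above is just a logistic-regression classifier trained on frozen hidden states. The sketch below shows the mechanics on synthetic vectors, since the paper's data and model activations are not available here: `make_synthetic_states` is a hypothetical stand-in that plants a "persuaded vs. not persuaded" signal along one dimension, and the training loop is plain stochastic gradient descent rather than whatever optimizer the authors used.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features, labels, lr=0.1, epochs=50):
    """Fit weights w and bias b of a linear probe with SGD on the logistic loss."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # derivative of the logistic loss w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def probe_score(w, b, x) -> float:
    """Probability the probe assigns to the positive class for one hidden state."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def make_synthetic_states(n, dim=8, seed=0):
    """Hypothetical stand-in for per-turn hidden states with binary labels."""
    rng = random.Random(seed)
    features, labels = [], []
    for i in range(n):
        y = i % 2
        x = [rng.gauss(0.0, 0.5) for _ in range(dim)]
        x[0] += 1.5 if y == 1 else -1.5  # planted linear signal
        features.append(x)
        labels.append(y)
    return features, labels


features, labels = make_synthetic_states(200)
w, b = train_probe(features, labels)
accuracy = sum(
    (probe_score(w, b, x) > 0.5) == bool(y) for x, y in zip(features, labels)
) / len(labels)
print(round(accuracy, 3))
```

Applying `probe_score` to the hidden state at each turn of a conversation gives a per-turn trajectory, which is one way a probe could localize the point where persuasion occurs; a probe needs only one dot product per turn, which is why it is cheaper than prompting.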

[41] H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages

Mehrdad Zakershahrak,Samira Ghodratnama

Main category: cs.CL

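TL;DR: H-NET++ is a hierarchical dynamic-chunking model that learns linguistically-informed segmentation through end-to-end training, addressing the computational challenges that byte-level language models face in morphologically-rich languages (MRLs) without a tokenizer.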

Details Motivation: Byte-level language models eliminate fragile tokenizers but face computational challenges in morphologically-rich languages (MRLs), where words span many bytes. Method: Proposes H-NET++, comprising a lightweight Transformer context-mixer (1.9M parameters) for cross-chunk attention, a two-level latent hyper-prior for document-level consistency, specialized handling of orthographic artifacts (e.g. Persian ZWNJ), and curriculum-based training with staged sequence lengths. Result: On a 1.4B-token Persian corpus, H-NET++ achieves state-of-the-art results: a 0.159 BPB reduction versus BPE-based GPT-2-fa (12% better compression), a 5.4pp gain on ParsGLUE, 53% improved robustness to ZWNJ corruption, and 73.8% F1 on gold morphological boundaries. Conclusion: H-NET++ shows that hierarchical dynamic chunking is an effective tokenizer-free solution for MRLs while maintaining computational efficiency. Abstract: Byte-level language models eliminate fragile tokenizers but face computational challenges in morphologically-rich languages (MRLs), where words span many bytes. We propose H-NET++, a hierarchical dynamic-chunking model that learns linguistically-informed segmentation through end-to-end training. Key innovations include: (1) a lightweight Transformer context-mixer (1.9M parameters) for cross-chunk attention, (2) a two-level latent hyper-prior for document-level consistency, (3) specialized handling of orthographic artifacts (e.g. Persian ZWNJ), and (4) curriculum-based training with staged sequence lengths. On a 1.4B-token Persian corpus, H-NET++ achieves state-of-the-art results: 0.159 BPB reduction versus BPE-based GPT-2-fa (12% better compression), 5.4pp gain on ParsGLUE, 53% improved robustness to ZWNJ corruption, and 73.8% F1 on gold morphological boundaries. Our learned chunks align with Persian morphology without explicit supervision, demonstrating that hierarchical dynamic chunking provides an effective tokenizer-free solution for MRLs while maintaining computational efficiency.

cs.CV [Back]

[42] RetinexDual: Retinex-based Dual Nature Approach for Generalized Ultra-High-Definition Image Restoration

Mohab Kishawy,Ali Abdellatif Hussein,Jun Chen

Main category: cs.CV

TL;DR: The paper introduces RetinexDual, a novel framework combining spatial and frequency-domain networks (SAMBA and FIA), which effectively addresses Ultra-High-Definition Image Restoration (UHD IR) tasks like deraining, deblurring, dehazing, and Low-Light Image Enhancement.

Details Motivation: Traditional methods for UHD IR, such as downsampling and frequency-domain approaches, have significant drawbacks like information loss and ineffectiveness for spatial artifacts, motivating the need for a more robust solution. Method: The paper proposes RetinexDual, a framework combining two sub-networks—SAMBA for reflectance correction and FIA for illumination correction—to address UHD IR tasks. Result: RetinexDual outperforms recent methods in four UHD IR tasks—deraining, deblurring, dehazing, and LLIE—both qualitatively and quantitatively. Conclusion: RetinexDual is an effective solution for UHD IR, overcoming the limitations of traditional methods by combining spatial and frequency domain approaches. Abstract: Advancements in image sensing have elevated the importance of Ultra-High-Definition Image Restoration (UHD IR). Traditional methods, such as extreme downsampling or transformation from the spatial to the frequency domain, encounter significant drawbacks: downsampling induces irreversible information loss in UHD images, while our frequency analysis reveals that pure frequency-domain approaches are ineffective for spatially confined image artifacts, primarily due to the loss of degradation locality. To overcome these limitations, we present RetinexDual, a novel Retinex theory-based framework designed for generalized UHD IR tasks. RetinexDual leverages two complementary sub-networks: the Scale-Attentive maMBA (SAMBA) and the Frequency Illumination Adaptor (FIA). SAMBA, responsible for correcting the reflectance component, utilizes a coarse-to-fine mechanism to overcome the causal modeling of mamba, which effectively reduces artifacts and restores intricate details. On the other hand, FIA ensures precise correction of color and illumination distortions by operating in the frequency domain and leveraging the global context provided by it. 
Evaluating RetinexDual on four UHD IR tasks, namely deraining, deblurring, dehazing, and Low-Light Image Enhancement (LLIE), shows that it outperforms recent methods qualitatively and quantitatively. Ablation studies demonstrate the importance of employing distinct designs for each branch in RetinexDual, as well as the effectiveness of its various components.

[43] ACM Multimedia Grand Challenge on ENT Endoscopy Analysis

Trong-Thuan Nguyen,Viet-Tham Huynh,Thao Thi Phuong Dao,Ha Nguyen Thi,Tien To Vu Thuy,Uyen Hanh Tran,Tam V. Nguyen,Thanh Dinh Le,Minh-Triet Tran

Main category: cs.CV

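TL;DR: The paper introduces ENTRep, a multimedia grand challenge on ENT endoscopy analysis that aims to advance fine-grained anatomical classification and image retrieval under bilingual clinical supervision.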

Details Motivation: ENT care needs automated analysis of endoscopic images, but progress is hindered by variability in devices and operators, subtle and localized findings, and fine-grained distinctions such as laterality and vocal-fold state. Clinicians also need reliable retrieval of similar cases, both visually and through concise textual descriptions, which existing public benchmarks rarely support. Method: Proposes ENTRep, a challenge that integrates fine-grained anatomical classification with image-to-image and text-to-image retrieval. The dataset comprises expert-annotated images labeled for anatomical region and normal or abnormal status, accompanied by bilingual (Vietnamese and English) narrative descriptions. Three benchmark tasks are defined, the submission protocol is standardized, and performance is evaluated on public and private test splits. Result: Reports results from the top-performing teams and provides an in-depth discussion. Conclusion: The ENTRep challenge provides a new platform for automated analysis and retrieval of ENT endoscopic images, advancing research and technology in this area. Abstract: Automated analysis of endoscopic imagery is a critical yet underdeveloped component of ENT (ear, nose, and throat) care, hindered by variability in devices and operators, subtle and localized findings, and fine-grained distinctions such as laterality and vocal-fold state. In addition to classification, clinicians require reliable retrieval of similar cases, both visually and through concise textual descriptions. These capabilities are rarely supported by existing public benchmarks. To this end, we introduce ENTRep, the ACM Multimedia 2025 Grand Challenge on ENT endoscopy analysis, which integrates fine-grained anatomical classification with image-to-image and text-to-image retrieval under bilingual (Vietnamese and English) clinical supervision. Specifically, the dataset comprises expert-annotated images, labeled for anatomical region and normal or abnormal status, and accompanied by dual-language narrative descriptions. In addition, we define three benchmark tasks, standardize the submission protocol, and evaluate performance on public and private test splits using server-side scoring. Moreover, we report results from the top-performing teams and provide an insightful discussion.

[44] CoMAD: A Multiple-Teacher Self-Supervised Distillation Framework

Sriram Mandalika,Lalitha V

Main category: cs.CV

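TL;DR: The paper proposes CoMAD, a lightweight, parameter-free framework that distills knowledge from multiple state-of-the-art self-supervised models into a small, high-performing model suited to resource-constrained settings.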

Details Motivation: Existing self-supervised learning paradigms are typically pretrained in isolation, overlooking complementary insights and yielding large models that are impractical for resource-constrained deployment. Method: CoMAD uses asymmetric masking: the student sees only 25% of patches, while each teacher receives a progressively lighter, unique mask. Teacher embeddings are aligned to the student's space via a linear adapter and layer normalization, then fused through joint consensus gating. Result: CoMAD's ViT-Tiny achieves 75.4% Top-1 on ImageNet-1K, 47.3% mIoU on ADE20K, and 44.5% box average precision and 40.5% mask average precision on MS-COCO. Conclusion: CoMAD effectively unifies knowledge from multiple self-supervised Vision Transformers into a compact student network that achieves state-of-the-art results on several benchmarks and is better suited to resource-constrained deployment. Abstract: Numerous self-supervised learning paradigms, such as contrastive learning and masked image modeling, learn powerful representations from unlabeled data but are typically pretrained in isolation, overlooking complementary insights and yielding large models that are impractical for resource-constrained deployment. To overcome these challenges, we introduce Consensus-oriented Masked Distillation (CoMAD), a lightweight, parameter-free framework that unifies knowledge from multiple current state-of-the-art self-supervised Vision Transformers into a compact student network. CoMAD distills from three pretrained ViT-Base teachers, MAE, MoCo v3, and iBOT, each offering distinct semantic and contextual priors. Rather than naively averaging teacher outputs, we apply asymmetric masking: the student sees only 25 percent of patches while each teacher receives a progressively lighter, unique mask, forcing the student to interpolate missing features under richer contexts. Teacher embeddings are aligned to the student's space via a linear adapter and layer normalization, then fused through our joint consensus gating, which weights each token by combining cosine affinity with inter-teacher agreement. The student is trained with dual-level KL divergence on visible tokens and reconstructed feature maps, capturing both local and global structure. On ImageNet-1K, CoMAD's ViT-Tiny achieves 75.4 percent Top-1, an increment of 0.4 percent over the previous state-of-the-art. 
In dense-prediction transfers, it attains 47.3 percent mIoU on ADE20K, and 44.5 percent box average precision and 40.5 percent mask average precision on MS-COCO, establishing a new state-of-the-art in compact SSL distillation.

[45] Single-Step Reconstruction-Free Anomaly Detection and Segmentation via Diffusion Models

Mehrdad Moradi,Marco Grasso,Bianca Maria Colosimo,Kamran Paynabar

Main category: cs.CV

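TL;DR: The paper proposes RADAR, a real-time attention-based diffusion model for anomaly detection that overcomes the limitations of reconstruction-based methods and achieves state-of-the-art performance on multiple datasets.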

Details Motivation: Reconstruction-based anomaly detection is computationally expensive, may reconstruct an image that does not match the input, and makes it hard to choose an intermediate noise level. Method: Proposes RADAR, which produces anomaly maps directly from the diffusion model instead of reconstructing the input image. Result: On MVTec-AD and a 3D-printed material dataset, RADAR surpasses state-of-the-art diffusion-based and statistical machine learning models on key metrics including accuracy, precision, recall, and F1 score. Conclusion: RADAR overcomes the limitations of reconstruction-based anomaly detection, improving both detection accuracy and computational efficiency. Abstract: Generative models have demonstrated significant success in anomaly detection and segmentation over the past decade. Recently, diffusion models have emerged as a powerful alternative, outperforming previous approaches such as GANs and VAEs. In typical diffusion-based anomaly detection, a model is trained on normal data, and during inference, anomalous images are perturbed to a predefined intermediate step in the forward diffusion process. The corresponding normal image is then reconstructed through iterative reverse sampling. However, reconstruction-based approaches present three major challenges: (1) the reconstruction process is computationally expensive due to multiple sampling steps, making real-time applications impractical; (2) for complex or subtle patterns, the reconstructed image may correspond to a different normal pattern rather than the original input; and (3) choosing an appropriate intermediate noise level is challenging because it is application-dependent and often assumes prior knowledge of anomalies, an assumption that does not hold in unsupervised settings. We introduce Reconstruction-free Anomaly Detection with Attention-based diffusion models in Real-time (RADAR), which overcomes the limitations of reconstruction-based anomaly detection. Unlike current SOTA methods that reconstruct the input image, RADAR directly produces anomaly maps from the diffusion model, improving both detection accuracy and computational efficiency. We evaluate RADAR on real-world 3D-printed material and the MVTec-AD dataset. Our approach surpasses state-of-the-art diffusion-based and statistical machine learning models across all key metrics, including accuracy, precision, recall, and F1 score. 
Specifically, RADAR improves F1 score by 7% on MVTec-AD and 13% on the 3D-printed material dataset compared to the next best model. Code available at: https://github.com/mehrdadmoradi124/RADAR

[46] A deep learning approach to track eye movements based on events

Chirag Seth,Divya Naiken,Keyan Lin

Main category: cs.CV

TL;DR: This research develops a deep learning-based algorithm using an event camera to accurately track eye movements and predict human attention, aiming to enhance user experience in VR and AR applications.

Details Motivation: The motivation is to develop a cost-effective and interpretable algorithm for eye tracking, which is crucial for improving comfort and user experience in VR and AR product development. This is driven by the challenge of accurately tracking rapid human eye movements typically requiring expensive high-speed cameras. Method: The study uses an event camera to locate the eye center position (x, y) and employs deep learning methods, particularly the CNN_LSTM model, to predict human attention. Result: The CNN_LSTM model achieved approximately 81% accuracy in predicting human attention based on eye movement data from an event camera. Conclusion: The research concludes that using deep learning methods, particularly the CNN_LSTM model, can effectively predict human attention through eye movement tracking, achieving about 81% accuracy. Future work will focus on improving model interpretability and performance using Layer-wise Relevance Propagation (LRP). Abstract: This research project addresses the challenge of accurately tracking eye movements during specific events by leveraging previous research. Given the rapid movements of human eyes, which can reach speeds of 300°/s, precise eye tracking typically requires expensive and high-speed cameras. Our primary objective is to locate the eye center position (x, y) using inputs from an event camera. Eye movement analysis has extensive applications in consumer electronics, especially in VR and AR product development. Therefore, our ultimate goal is to develop an interpretable and cost-effective algorithm using deep learning methods to predict human attention, thereby improving device comfort and enhancing overall user experience. To achieve this goal, we explored various approaches, with the CNN_LSTM model proving most effective, achieving approximately 81% accuracy. 
Additionally, we propose future work focusing on Layer-wise Relevance Propagation (LRP) to further enhance the model's interpretability and predictive performance.

[47] LuKAN: A Kolmogorov-Arnold Network Framework for 3D Human Motion Prediction

Md Zahidul Hasan,A. Ben Hamza,Nizar Bouguila

Main category: cs.CV

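TL;DR: LuKAN is an efficient 3D human motion prediction model based on KANs with Lucas polynomial activations; it uses the discrete wavelet transform and a spatial projection layer to produce accurate, structurally consistent predictions.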

Details Motivation: Existing 3D human motion prediction methods struggle to balance prediction accuracy and computational efficiency, calling for a model that is both more efficient and more accurate. Method: LuKAN encodes temporal information with the discrete wavelet transform, captures inter-joint dependencies with a spatial projection layer, performs efficient function approximation with a KAN layer based on Lucas polynomial activations, and finally reconstructs the motion sequence with the inverse discrete wavelet transform. Result: Experiments on three benchmark datasets show that LuKAN performs well in both quantitative and qualitative evaluations while remaining computationally efficient. Conclusion: By combining KANs with Lucas polynomial activations, LuKAN achieves efficient 3D human motion prediction with a good balance between computational efficiency and prediction accuracy. Abstract: The goal of 3D human motion prediction is to forecast future 3D poses of the human body based on historical motion data. Existing methods often face limitations in achieving a balance between prediction accuracy and computational efficiency. In this paper, we present LuKAN, an effective model based on Kolmogorov-Arnold Networks (KANs) with Lucas polynomial activations. Our model first applies the discrete wavelet transform to encode temporal information in the input motion sequence. Then, a spatial projection layer is used to capture inter-joint dependencies, ensuring structural consistency of the human body. At the core of LuKAN is the Temporal Dependency Learner, which employs a KAN layer parameterized by Lucas polynomials for efficient function approximation. These polynomials provide computational efficiency and an enhanced capability to handle oscillatory behaviors. Finally, the inverse discrete wavelet transform reconstructs motion sequences in the time domain, generating temporally coherent predictions. Extensive experiments on three benchmark datasets demonstrate the competitive performance of our model compared to strong baselines, as evidenced by both quantitative and qualitative evaluations. Moreover, its compact architecture, coupled with the linear recurrence of Lucas polynomials, ensures computational efficiency.
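
The linear recurrence of Lucas polynomials that the abstract credits for computational efficiency can be illustrated with a minimal Python sketch. The recurrence itself (L_0(x) = 2, L_1(x) = x, L_n(x) = x·L_{n-1}(x) + L_{n-2}(x)) is the standard definition; the function names and the weighted-sum form of the activation are assumptions made for illustration, not the authors' implementation.

```python
def lucas_polynomials(x: float, degree: int) -> list[float]:
    """Return [L_0(x), ..., L_degree(x)] via the linear recurrence
    L_0(x) = 2, L_1(x) = x, L_n(x) = x * L_{n-1}(x) + L_{n-2}(x)."""
    if degree < 0:
        raise ValueError("degree must be non-negative")
    values = [2.0]
    if degree >= 1:
        values.append(float(x))
    for _ in range(2, degree + 1):
        # Each new term costs one multiply and one add.
        values.append(x * values[-1] + values[-2])
    return values


def lucas_activation(x: float, coefficients: list[float]) -> float:
    """Hypothetical KAN-style edge function: a weighted sum of Lucas
    polynomials, where `coefficients` stands in for learnable weights."""
    basis = lucas_polynomials(x, len(coefficients) - 1)
    return sum(c * b for c, b in zip(coefficients, basis))
```

Evaluating all basis terms up to degree n takes O(n) operations per input, which is one plausible reading of the efficiency claim.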

[48] VER-Bench: Evaluating MLLMs on Reasoning with Fine-Grained Visual Evidence

Chenhui Qiang,Zhaoyang Wei,Xumeng Han,Zipeng Wang,Siyao Li,Xiangyuan Lan,Jianbin Jiao,Zhenjun Han

Main category: cs.CV

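TL;DR: This paper proposes VER-Bench, a new evaluation framework that assesses the ability of MLLMs to identify fine-grained visual clues and perform complex reasoning, highlighting current models' limitations in extracting subtle visual evidence and constructing evidence-based arguments.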

Details Motivation: Existing benchmarks focus mainly on local details or prominent image elements and neglect intricate analysis of subtle clues, which are essential for deep visual understanding and complex reasoning. Method: The paper introduces VER-Bench, a framework of 374 carefully designed questions spanning Geospatial, Temporal, Situational, Intent, System State, and Symbolic reasoning, each accompanied by structured evidence: visual clues and the reasoning derived from them. Result: VER-Bench reveals current models' limitations in extracting subtle visual evidence and constructing evidence-based arguments, underscoring the need to strengthen fine-grained visual evidence extraction, integration, and reasoning. Conclusion: Achieving genuine visual understanding and human-like analysis requires improving models' capabilities in fine-grained visual evidence extraction and reasoning. Abstract: With the rapid development of MLLMs, evaluating their visual capabilities has become increasingly crucial. Current benchmarks primarily fall into two main types: basic perception benchmarks, which focus on local details but lack deep reasoning (e.g., "what is in the image?"), and mainstream reasoning benchmarks, which concentrate on prominent image elements but may fail to assess subtle clues requiring intricate analysis. However, profound visual understanding and complex reasoning depend more on interpreting subtle, inconspicuous local details than on perceiving salient, macro-level objects. These details, though occupying minimal image area, often contain richer, more critical information for robust analysis. To bridge this gap, we introduce VER-Bench, a novel framework to evaluate MLLMs' ability to: 1) identify fine-grained visual clues, occupying on average just 0.25% of the image area; 2) integrate these clues with world knowledge for complex reasoning. Comprising 374 carefully designed questions across Geospatial, Temporal, Situational, Intent, System State, and Symbolic reasoning, each question in VER-Bench is accompanied by structured evidence: visual clues and question-related reasoning derived from them. VER-Bench reveals current models' limitations in extracting subtle visual evidence and constructing evidence-based arguments, highlighting the need to enhance models' capabilities in fine-grained visual evidence extraction, integration, and reasoning for genuine visual understanding and human-like analysis. Dataset and additional materials are available at https://github.com/verbta/ACMMM-25-Materials.

[49] Dual-Stream Attention with Multi-Modal Queries for Object Detection in Transportation Applications

Noreen Anwar,Guillaume-Alexandre Bilodeau,Wassim Bouachir

Main category: cs.CV

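TL;DR: DAMM is a new object detection framework that uses multi-modal queries and dual-stream attention to significantly improve detection accuracy and efficiency, particularly for occlusion and fine-grained localization in complex scenes.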

Details Motivation: Transformer-based object detectors struggle with occlusions, fine-grained localization, and computational efficiency, which DAMM is proposed to address. Method: The paper proposes DAMM, a new object detection framework with query adaptation and structured cross-attention; it uses three types of queries (appearance-based, positional, and random learned queries) and a dual-stream cross-attention module that separately refines semantic and spatial features. Result: DAMM achieves state-of-the-art average precision (AP) and recall on four challenging benchmarks, validating the effectiveness of multi-modal query adaptation and dual-stream attention. Conclusion: By introducing multi-modal queries and dual-stream attention, DAMM achieves higher accuracy and efficiency in object detection, particularly in handling occlusions and fine-grained localization. Abstract: Transformer-based object detectors often struggle with occlusions, fine-grained localization, and computational inefficiency caused by fixed queries and dense attention. We propose DAMM, Dual-stream Attention with Multi-Modal queries, a novel framework introducing both query adaptation and structured cross-attention for improved accuracy and efficiency. DAMM capitalizes on three types of queries: appearance-based queries from vision-language models, positional queries using polygonal embeddings, and random learned queries for general scene coverage. Furthermore, a dual-stream cross-attention module separately refines semantic and spatial features, boosting localization precision in cluttered scenes. We evaluated DAMM on four challenging benchmarks, and it achieved state-of-the-art performance in average precision (AP) and recall, demonstrating the effectiveness of multi-modal query adaptation and dual-stream attention. Source code is at: https://github.com/DET-LIP/DAMM.

[50] Revealing Temporal Label Noise in Multimodal Hateful Video Classification

Shuonan Yang,Tailin Chen,Rahul Singh,Jiangbei Yue,Jianbo Jiao,Zeyu Fu

Main category: cs.CV

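TL;DR: This paper examines how coarse video-level annotations affect multimodal hateful video detection and uses timestamp-based analysis to show how label ambiguity affects model decision boundaries and classification confidence.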

Details Motivation: The rapid growth of online multimedia content has intensified the spread of hate speech, posing serious societal and regulatory challenges. However, most existing methods rely on coarse video-level annotations that overlook the temporal granularity of hateful content, introducing substantial label noise. Method: The authors use annotated timestamps to trim hateful videos from the HateMM and MultiHateClip English datasets at a fine granularity, isolating explicitly hateful segments, and conduct an exploratory analysis of these segments. Result: Experiments show that timestamp noise fundamentally alters model decision boundaries and weakens classification confidence, highlighting the context dependency and temporal continuity of hate speech expression. Conclusion: The findings offer new insights into the temporal dynamics of multimodal hateful videos and underscore the need for temporally aware models and benchmarks to improve robustness and interpretability. Abstract: The rapid proliferation of online multimedia content has intensified the spread of hate speech, presenting critical societal and regulatory challenges. While recent work has advanced multimodal hateful video detection, most approaches rely on coarse, video-level annotations that overlook the temporal granularity of hateful content. This introduces substantial label noise, as videos annotated as hateful often contain long non-hateful segments. In this paper, we investigate the impact of such label ambiguity through a fine-grained approach. Specifically, we trim hateful videos from the HateMM and MultiHateClip English datasets using annotated timestamps to isolate explicitly hateful segments. We then conduct an exploratory analysis of these trimmed segments to examine the distribution and characteristics of both hateful and non-hateful content. This analysis highlights the degree of semantic overlap and the confusion introduced by coarse, video-level annotations. Finally, controlled experiments demonstrate that timestamp noise fundamentally alters model decision boundaries and weakens classification confidence, highlighting the inherent context dependency and temporal continuity of hate speech expression. Our findings provide new insights into the temporal dynamics of multimodal hateful videos and highlight the need for temporally aware models and benchmarks for improved robustness and interpretability. Code and data are available at https://github.com/Multimodal-Intelligence-Lab-MIL/HatefulVideoLabelNoise.
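
Trimming a video by annotated timestamps, as described in the abstract, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `(start, end)` tuples in seconds, and the "noise fraction" measure are hypothetical and are not taken from the authors' released code.

```python
def split_by_timestamps(duration, hateful_spans):
    """Split a video of length `duration` (seconds) into hateful and
    non-hateful segments, given annotated (start, end) spans."""
    # Clamp spans to the video, drop empty ones, and sort by start time.
    clamped = [(max(0.0, s), min(duration, e)) for s, e in hateful_spans]
    spans = sorted((s, e) for s, e in clamped if e > s)

    # Merge overlapping or touching annotations.
    hateful = []
    for s, e in spans:
        if hateful and s <= hateful[-1][1]:
            hateful[-1] = (hateful[-1][0], max(hateful[-1][1], e))
        else:
            hateful.append((s, e))

    # Everything between the merged spans is non-hateful.
    non_hateful, cursor = [], 0.0
    for s, e in hateful:
        if s > cursor:
            non_hateful.append((cursor, s))
        cursor = e
    if cursor < duration:
        non_hateful.append((cursor, duration))
    return hateful, non_hateful


def noise_fraction(duration, hateful_spans):
    """Share of a video labelled hateful that is actually non-hateful
    (a hypothetical measure of video-level label noise)."""
    _, non_hateful = split_by_timestamps(duration, hateful_spans)
    return sum(e - s for s, e in non_hateful) / duration
```

For a 60-second video with hateful content only between 10 s and 25 s and from 50 s onward, more than half of the footage carries the video-level "hateful" label without being hateful, which is the kind of label noise the paper studies.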

[51] Test-Time Adaptation for Video Highlight Detection Using Meta-Auxiliary Learning and Cross-Modality Hallucinations

Zahidul Islam,Sujoy Paul,Mrigank Rochan

Main category: cs.CV

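TL;DR: This paper proposes Highlight-TTA, a test-time adaptation framework that, combined with auxiliary-task optimization, significantly improves video highlight detection performance.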

Details Motivation: Existing video highlight detection methods use a fixed model that cannot adapt to the individual characteristics of each test video, which degrades performance; a dynamically adapting framework is therefore needed. Method: The paper proposes the Highlight-TTA test-time adaptation framework, jointly optimized with an auxiliary task, cross-modality hallucinations, and uses a meta-auxiliary training scheme to improve adaptability. Result: Experiments with three state-of-the-art models and three benchmark datasets show that adding Highlight-TTA significantly improves highlight detection. Conclusion: Through test-time adaptation, Highlight-TTA significantly improves the generalization and performance of video highlight detection models. Abstract: Existing video highlight detection methods, although advanced, struggle to generalize well to all test videos. These methods typically employ a generic highlight detection model for each test video, which is suboptimal as it fails to account for the unique characteristics and variations of individual test videos. Such fixed models do not adapt to the diverse content, styles, or audio and visual qualities present in new, unseen test videos, leading to reduced highlight detection performance. In this paper, we propose Highlight-TTA, a test-time adaptation framework for video highlight detection that addresses this limitation by dynamically adapting the model during testing to better align with the specific characteristics of each test video, thereby improving generalization and highlight detection performance. Highlight-TTA is jointly optimized with an auxiliary task, cross-modality hallucinations, alongside the primary highlight detection task. We utilize a meta-auxiliary training scheme to enable effective adaptation through the auxiliary task while enhancing the primary task. During testing, we adapt the trained model using the auxiliary task on the test video to further enhance its highlight detection performance. Extensive experiments with three state-of-the-art highlight detection models and three benchmark datasets show that the introduction of Highlight-TTA to these models improves their performance, yielding superior results.

[52] Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens

Suchisrit Gangopadhyay,Jung-Hee Kim,Xien Chen,Patrick Rim,Hyoungseob Park,Alex Wong

Main category: cs.CV

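TL;DR: This paper introduces a novel method that uses Calibration Tokens to modulate latent embeddings for alignment, allowing foundational monocular depth estimators to handle fisheye images without retraining or fine-tuning.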

Details Motivation: Although foundational monocular depth estimators are trained on tens of millions of perspective images, they remain sensitive to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, which leads to erroneous depth estimates. Method: The method is self-supervised: it recalibrates perspective images to fisheye images and enforces consistency between their estimates during training. The Calibration Tokens serve as a lightweight adaptation mechanism that modulates the latent embeddings for alignment. Result: Evaluated with several foundational monocular depth estimators in both indoor and outdoor scenes, the method consistently improves over state-of-the-art methods while using a single set of tokens for both. Conclusion: The paper presents a way to extend foundational monocular depth estimators to fisheye images without retraining or fine-tuning; modulating latent embeddings with Calibration Tokens effectively addresses the covariate shift caused by changes in camera calibration parameters. Abstract: We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a light-weight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images, and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs, both indoors and outdoors, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: https://github.com/JungHeeKim29/calibration-token.

[53] Toward Errorless Training ImageNet-1k

Bo Deng,Levi Heath

Main category: cs.CV

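TL;DR: This study trains a neural network on the ImageNet dataset with a new method and reaches high accuracy; the authors conjecture that the failure to reach 100% stems from a double-labeling problem in the dataset.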

Details Motivation: To explore the effectiveness of a new method on a large-scale image recognition task and to pursue higher classification accuracy. Method: A feedforward artificial neural network is trained on the ImageNet 2012 contest dataset using the new method proposed in [5]. Result: The model reaches an accuracy rate of 98.3% with a 99.69% Top-1 rate, and an average of 285.9 perfectly classified labels over the 10 batch partitions of the dataset. The best-performing model uses 322,430,160 parameters with 4 decimal places of precision. Conclusion: The model may fall short of 100% accuracy because the dataset contains duplicate images with different labels, a double-labeling problem. Abstract: In this paper, we describe a feedforward artificial neural network trained on the ImageNet 2012 contest dataset [7] with the new method of [5] to an accuracy rate of 98.3% with a 99.69 Top-1 rate, and an average of 285.9 labels that are perfectly classified over the 10 batch partitions of the dataset. The best-performing model uses 322,430,160 parameters, with 4 decimal places of precision. We conjecture that the reason our model does not achieve a 100% accuracy rate is a double-labeling problem, by which there are duplicate images in the dataset with different labels.

[54] Accelerating Conditional Prompt Learning via Masked Image Modeling for Vision-Language Models

Phuoc-Nguyen Bui,Khanh-Binh Nguyen,Hyunseung Choo

Main category: cs.CV

TL;DR: ProMIM is a lightweight and efficient framework that enhances conditional prompt learning for vision-language models by integrating masked image modeling, improving generalization without altering existing architectures.

Details Motivation: Vision-language models (VLMs) require resource-intensive training for adaptation, while prompt learning techniques like CoOp and CoCoOp tend to overfit to known classes, limiting generalization to unseen categories. This necessitates an efficient and generalizable adaptation framework. Method: ProMIM integrates masked image modeling (MIM) into existing Vision-Language Model (VLM) pipelines, using a masking strategy to generate robust, instance-conditioned prompts that augment methods like CoOp and CoCoOp. Result: ProMIM improves feature robustness, mitigates overfitting, and demonstrates consistent performance improvements in zero-shot and few-shot classification tasks, all with negligible additional computational cost. Conclusion: ProMIM provides a practical and lightweight solution for vision-language applications by enhancing generalization performance without modifying existing architectures. Abstract: Vision-language models (VLMs) like CLIP excel in zero-shot learning but often require resource-intensive training to adapt to new tasks. Prompt learning techniques, such as CoOp and CoCoOp, offer efficient adaptation but tend to overfit to known classes, limiting generalization to unseen categories. We introduce ProMIM, a plug-and-play framework that enhances conditional prompt learning by integrating masked image modeling (MIM) into existing VLM pipelines. ProMIM leverages a simple yet effective masking strategy to generate robust, instance-conditioned prompts, seamlessly augmenting methods like CoOp and CoCoOp without altering their core architectures. By masking only visible image patches and using these representations to guide prompt generation, ProMIM improves feature robustness and mitigates overfitting, all while introducing negligible additional computational cost. 
Extensive experiments across zero-shot and few-shot classification tasks demonstrate that ProMIM consistently boosts generalization performance when plugged into existing approaches, providing a practical, lightweight solution for real-world vision-language applications.

[55] TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring

Zhu Xu,Ting Lei,Zhimin Li,Guan Wang,Qingchao Chen,Yuxin Peng,Yang Liu

Main category: cs.CV

TL;DR: The paper proposes TRKT, a new method for Weakly Supervised Dynamic Scene Graph Generation that enhances object detection in dynamic video scenarios by incorporating relation-aware and motion-aware knowledge.

Details Motivation: To overcome the challenges in dynamic scene graph generation for videos, specifically the inaccuracies in localization and low-confidence proposals caused by using object detectors trained on static images. Method: The TRKT method includes two key components: Relation-aware knowledge mining, which uses object and relation class decoders and Inter-frame Attention Augmentation strategy, and a Dual-stream Fusion Module that integrates attention maps into external detections to refine object localization. Result: TRKT achieves state-of-the-art performance on the Action Genome dataset, demonstrating its effectiveness in improving object localization and confidence scores in dynamic, relation-aware scenarios. Conclusion: TRKT successfully addresses the limitations of existing WS-DSGG methods by enhancing detection in dynamic, relation-aware scenarios through its unique components, achieving state-of-the-art performance on the Action Genome dataset. Abstract: Dynamic Scene Graph Generation (DSGG) aims to create a scene graph for each video frame by detecting objects and predicting their relationships. Weakly Supervised DSGG (WS-DSGG) reduces annotation workload by using an unlocalized scene graph from a single frame per video for training. Existing WS-DSGG methods depend on an off-the-shelf external object detector to generate pseudo labels for subsequent DSGG training. However, detectors trained on static, object-centric images struggle in dynamic, relation-aware scenarios required for DSGG, leading to inaccurate localization and low-confidence proposals. To address the challenges posed by external object detectors in WS-DSGG, we propose a Temporal-enhanced Relation-aware Knowledge Transferring (TRKT) method, which leverages knowledge to enhance detection in relation-aware dynamic scenarios. 
TRKT is built on two key components: (1) Relation-aware knowledge mining: we first employ object and relation class decoders that generate category-specific attention maps to highlight both object regions and interactive areas. Then we propose an Inter-frame Attention Augmentation strategy that exploits optical flow for neighboring frames to enhance the attention maps, making them motion-aware and robust to motion blur. This step yields relation- and motion-aware knowledge mining for WS-DSGG. (2) We introduce a Dual-stream Fusion Module that integrates category-specific attention maps into external detections to refine object localization and boost confidence scores for object proposals. Extensive experiments demonstrate that TRKT achieves state-of-the-art performance on the Action Genome dataset. Our code is available at https://github.com/XZPKU/TRKT.git.

[56] AdvDINO: Domain-Adversarial Self-Supervised Representation Learning for Spatial Proteomics

Stella Su,Marc Harary,Scott J. Rodig,William Lotter

Main category: cs.CV

TL;DR: AdvDINO is a self-supervised learning framework for learning domain-invariant features, particularly suited to biomedical imaging.

Details Motivation: The robustness of standard SSL methods under domain shift remains uncertain, which is especially challenging for biomedical imaging. Method: AdvDINO integrates a gradient reversal layer into the DINOv2 architecture to promote domain-invariant feature learning. Result: AdvDINO mitigates slide-specific biases and learns more robust and biologically meaningful representations than non-adversarial baselines. Conclusion: AdvDINO is a broadly applicable framework that extends to other imaging domains such as radiology, remote sensing, and autonomous driving. Abstract: Self-supervised learning (SSL) has emerged as a powerful approach for learning visual representations without manual annotations. However, the robustness of standard SSL methods to domain shift -- systematic differences across data sources -- remains uncertain, posing an especially critical challenge in biomedical imaging where batch effects can obscure true biological signals. We present AdvDINO, a domain-adversarial self-supervised learning framework that integrates a gradient reversal layer into the DINOv2 architecture to promote domain-invariant feature learning. Applied to a real-world cohort of six-channel multiplex immunofluorescence (mIF) whole slide images from non-small cell lung cancer patients, AdvDINO mitigates slide-specific biases to learn more robust and biologically meaningful representations than non-adversarial baselines. Across >5.46 million mIF image tiles, the model uncovers phenotype clusters with distinct proteomic profiles and prognostic significance, and improves survival prediction in attention-based multiple instance learning. While demonstrated on mIF data, AdvDINO is broadly applicable to other imaging domains -- including radiology, remote sensing, and autonomous driving -- where domain shift and limited annotated data hinder model generalization and interpretability.
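
A gradient reversal layer, the component the abstract adds to DINOv2, is simple to state: it is the identity in the forward pass and negates (and optionally scales) the gradient in the backward pass. The sketch below shows that behaviour in plain Python with hand-written forward and backward methods. The class name, the scaling factor `lam`, and the list-of-floats interface are illustrative assumptions, and nothing here reflects how AdvDINO wires the layer into its training loop.

```python
class GradientReversal:
    """Identity on the forward pass; multiplies incoming gradients
    by -lam on the backward pass (lam is an assumed scaling factor)."""

    def __init__(self, lam: float = 1.0):
        self.lam = lam

    def forward(self, features: list[float]) -> list[float]:
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output: list[float]) -> list[float]:
        # The encoder receives the negated gradient, so a step that would
        # help the domain classifier instead makes features less
        # domain-predictive.
        return [-self.lam * g for g in grad_output]
```

In an autodiff framework the same idea is usually expressed as a custom backward rule; the manual version above only makes the sign flip explicit.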

[57] Open-world Point Cloud Semantic Segmentation: A Human-in-the-loop Framework

Peng Zhang,Songru Yang,Jinsheng Sun,Weiqing Li,Zhiyong Su

Main category: cs.CV

TL;DR: The paper proposes HOW-Seg, a human-in-the-loop framework for open-world point cloud semantic segmentation that dynamically improves predictions through iterative human feedback, achieving superior performance compared to existing methods.

Details Motivation: The motivation behind this research is the limitations of existing methods for open-world point cloud semantic segmentation, which rely on resource-intensive offline incremental learning or densely annotated support data. This limits their practicality in real-world scenarios. Method: The paper introduces HOW-Seg, a human-in-the-loop framework for open-world point cloud semantic segmentation. It constructs class prototypes directly on the query data, uses sparse human annotations as guidance, implements a hierarchical prototype disambiguation mechanism, and employs a dense conditional random field (CRF) to optimize label assignments. Result: Experiments show that HOW-Seg matches or surpasses the state-of-the-art GFS-Seg method under the 5-shot setting with sparse annotations. Using advanced backbones and denser annotations, HOW-Seg achieves 85.27% mIoU on S3DIS and 66.37% mIoU on ScanNetv2. Conclusion: HOW-Seg dynamically improves its predictions through iterative human feedback, achieving high-quality segmentation for both base and novel classes, and significantly outperforming alternatives when using advanced backbones and denser annotations. Abstract: Open-world point cloud semantic segmentation (OW-Seg) aims to predict point labels of both base and novel classes in real-world scenarios. However, existing methods rely on resource-intensive offline incremental learning or densely annotated support data, limiting their practicality. To address these limitations, we propose HOW-Seg, the first human-in-the-loop framework for OW-Seg. Specifically, we construct class prototypes, the fundamental segmentation units, directly on the query data, avoiding the prototype bias caused by intra-class distribution shifts between the support and query data. By leveraging sparse human annotations as guidance, HOW-Seg enables prototype-based segmentation for both base and novel classes. 
Considering the lack of granularity of initial prototypes, we introduce a hierarchical prototype disambiguation mechanism to refine ambiguous prototypes, which correspond to annotations of different classes. To further enrich contextual awareness, we employ a dense conditional random field (CRF) upon the refined prototypes to optimize their label assignments. Through iterative human feedback, HOW-Seg dynamically improves its predictions, achieving high-quality segmentation for both base and novel classes. Experiments demonstrate that with sparse annotations (e.g., one-novel-class-one-click), HOW-Seg matches or surpasses the state-of-the-art generalized few-shot segmentation (GFS-Seg) method under the 5-shot setting. When using advanced backbones (e.g., Stratified Transformer) and denser annotations (e.g., 10 clicks per sub-scene), HOW-Seg achieves 85.27% mIoU on S3DIS and 66.37% mIoU on ScanNetv2, significantly outperforming alternatives.
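
The basic idea of prototype-based segmentation guided by sparse clicks can be sketched as follows. This is a deliberately simplified illustration, assuming each prototype is the mean feature of the clicked points of a class and each point takes the label of its nearest prototype; the function names and data layout are hypothetical, and the hierarchical disambiguation and CRF refinement described in the abstract are not modelled.

```python
import math


def build_prototypes(features, clicks):
    """Average the features of clicked points per class.

    features: list of per-point feature vectors (lists of floats)
    clicks:   dict mapping point index -> class label (sparse annotation)
    """
    sums, counts = {}, {}
    for idx, label in clicks.items():
        vec = features[idx]
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [a + b for a, b in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [v / counts[label] for v in total]
            for label, total in sums.items()}


def assign_labels(features, prototypes):
    """Label every point with the class of its nearest prototype."""
    return [min(prototypes, key=lambda lab: math.dist(vec, prototypes[lab]))
            for vec in features]
```

With one click per class, `build_prototypes` reduces to using the clicked point's own feature as the prototype, which loosely mirrors the one-novel-class-one-click setting mentioned above.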

[58] UGOD: Uncertainty-Guided Differentiable Opacity and Soft Dropout for Enhanced Sparse-View 3DGS

Zhihao Guo,Peng Wang,Zidong Chen,Xiangyu Kong,Yan Lyu,Guanyu Gao,Liangxiu Han

Main category: cs.CV

TL;DR: This paper proposes an adaptive Gaussian weighting method in 3D Gaussian Splatting for improved rendering in sparse-view novel view synthesis.

Details Motivation: The motivation stems from the issue of overfitting in current 3DGS methods due to equal weighting of Gaussians, especially in sparse-view scenarios. Method: The method introduces learned uncertainties to adaptively weight Gaussians, guiding opacity updates and applying soft differentiable dropout regularization to improve rendering. Result: Experimental results show that the method achieves higher quality reconstruction with fewer Gaussians, with a 3.27% PSNR improvement over DropGaussian on the MipNeRF 360 dataset. Conclusion: The proposed method enhances the rendering quality in 3D Gaussian Splatting by adaptively weighting Gaussians through learned uncertainties, outperforming existing approaches in sparse-view scenarios. Abstract: 3D Gaussian Splatting (3DGS) has become a competitive approach for novel view synthesis (NVS) due to its advanced rendering efficiency through 3D Gaussian projection and blending. However, most 3DGS methods weight all Gaussians equally during rendering, making them prone to overfitting, particularly in sparse-view scenarios. To address this, we investigate how adaptive weighting of Gaussians, characterised by our proposed learned uncertainties, affects rendering quality. This learned uncertainty serves two key purposes: first, it guides the differentiable update of Gaussian opacity while preserving the 3DGS pipeline integrity; second, the uncertainty undergoes soft differentiable dropout regularisation, which strategically transforms the original uncertainty into continuous drop probabilities that govern the final Gaussian projection and blending process for rendering. 
Extensive experimental results over widely adopted datasets demonstrate that our method outperforms rivals in sparse-view 3D synthesis, achieving higher quality reconstruction with fewer Gaussians in most datasets compared to existing sparse-view approaches, e.g., compared to DropGaussian, our method achieves 3.27% PSNR improvements on the MipNeRF 360 dataset.

[59] CSRAP: Enhanced Canvas Attention Scheduling for Real-Time Mission Critical Perception

Md Iftekharul Islam Sakib,Yigong Hu,Tarek Abdelzaher

Main category: cs.CV

TL;DR: This paper proposes an improved canvas-based attention scheduling method for real-time perception on edge platforms, achieving better performance in object detection with optimized resource usage.

Details Motivation: The core challenge is executing high-resolution object detection under stringent latency constraints on limited computing resources on edge platforms. Method: The paper extends canvas-based attention scheduling by introducing variable-size canvas frames and selectable canvas frame rates, using YOLOv11 as the perception module on an NVIDIA Jetson Orin Nano for evaluation. Result: The results show consistently higher mean average precision (mAP) and recall compared to the state of the art, demonstrating improved quality/cost trade-offs. Conclusion: The proposed method with variable-size canvas frames and selectable canvas frame rates outperforms the state of the art in real-time perception on edge platforms by achieving better quality/cost trade-offs. Abstract: Real-time perception on edge platforms faces a core challenge: executing high-resolution object detection under stringent latency constraints on limited computing resources. Canvas-based attention scheduling was proposed in earlier work as a mechanism to reduce the resource demands of perception subsystems. It consolidates areas of interest in an input data frame onto a smaller area, called a canvas frame, that can be processed at the requisite frame rate. This paper extends prior canvas-based attention scheduling literature by (i) allowing for variable-size canvas frames and (ii) employing selectable canvas frame rates that may depart from the original data frame rate. We evaluate our solution by running YOLOv11, as the perception module, on an NVIDIA Jetson Orin Nano to inspect video frames from the Waymo Open Dataset. Our results show that the additional degrees of freedom improve the attainable quality/cost trade-offs, thereby allowing for a consistently higher mean average precision (mAP) and recall with respect to the state of the art.
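
Consolidating areas of interest onto a smaller canvas frame can be pictured as a rectangle-packing step. The sketch below uses a simple shelf heuristic (sort regions by height, fill rows left to right) purely as an illustration; the heuristic, the function name, and the `(width, height)` pixel tuples are assumptions, and the paper's actual scheduler, with variable-size canvases and selectable frame rates, is not reproduced here.

```python
def pack_regions(regions, canvas_width):
    """Place (width, height) regions on a canvas of fixed width using
    shelf packing. Returns ({region_index: (x, y)}, canvas_height)."""
    # Tallest regions first, so each shelf wastes little vertical space.
    order = sorted(range(len(regions)),
                   key=lambda i: regions[i][1], reverse=True)
    placements = {}
    x = y = shelf_height = 0
    for i in order:
        w, h = regions[i]
        if w > canvas_width:
            raise ValueError(f"region {i} is wider than the canvas")
        if x + w > canvas_width:
            # Current shelf is full: start a new one below it.
            y += shelf_height
            x = shelf_height = 0
        placements[i] = (x, y)
        x += w
        shelf_height = max(shelf_height, h)
    return placements, y + shelf_height
```

The returned canvas height shows how much smaller the consolidated frame is than the original; a variable-size canvas, as proposed in the paper, would let that height change from frame to frame instead of being fixed in advance.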

[60] Steering One-Step Diffusion Model with Fidelity-Rich Decoder for Fast Image Compression

Zheng Chen,Mingde Zhou,Jinpei Guo,Jiale Yuan,Yifei Ji,Yulun Zhang

Main category: cs.CV

TL;DR: SODEC is a novel, faster diffusion-based image compression method that offers better performance by using single-step decoding and fidelity guidance.

Details Motivation: To overcome the drawbacks of traditional diffusion-based image compression methods, which include excessive decoding latency and poor fidelity. Method: SODEC uses a pre-trained VAE-based model to produce informative latents and replaces the iterative denoising process with a single-step decoding. It also introduces a fidelity guidance module and employs a rate annealing training strategy. Result: SODEC improves decoding speed by more than 20x and achieves superior rate-distortion-perception performance compared to existing methods. Conclusion: SODEC is a novel single-step diffusion image compression model that significantly outperforms existing methods in rate-distortion-perception performance and decoding speed. Abstract: Diffusion-based image compression has demonstrated impressive perceptual performance. However, it suffers from two critical drawbacks: (1) excessive decoding latency due to multi-step sampling, and (2) poor fidelity resulting from over-reliance on generative priors. To address these issues, we propose SODEC, a novel single-step diffusion image compression model. We argue that in image compression, a sufficiently informative latent renders multi-step refinement unnecessary. Based on this insight, we leverage a pre-trained VAE-based model to produce latents with rich information, and replace the iterative denoising process with a single-step decoding. Meanwhile, to improve fidelity, we introduce the fidelity guidance module, encouraging output that is faithful to the original image. Furthermore, we design the rate annealing training strategy to enable effective training under extremely low bitrates. Extensive experiments show that SODEC significantly outperforms existing methods, achieving superior rate-distortion-perception performance. Moreover, compared to previous diffusion-based compression models, SODEC improves decoding speed by more than 20×. Code is released at: https://github.com/zhengchen1999/SODEC.

[61] Propagating Sparse Depth via Depth Foundation Model for Out-of-Distribution Depth Completion

Shenglun Chen,Xinzhu Ma,Hong Zhang,Haojie Li,Zhihui Wang

Main category: cs.CV

TL;DR: This paper introduces a novel depth completion framework that leverages depth foundation models and dual-space propagation to achieve robust performance in out-of-distribution scenarios without large-scale training.

Details Motivation: Existing depth completion models degrade in performance in out-of-distribution scenarios due to reliance on limited training data, prompting the use of foundation models to enhance robustness without large-scale training. Method: The framework uses a depth foundation model to extract environmental cues from RGB images, employs a dual-space propagation approach to propagate sparse depth in 2D and 3D spaces, and includes a learnable correction module for refining depth predictions. Result: The framework achieves remarkable performance on 16 out-of-distribution datasets, surpassing state-of-the-art depth completion methods. Conclusion: The proposed depth completion framework leverages depth foundation models to achieve robustness without large-scale training and outperforms existing state-of-the-art methods in OOD scenarios. Abstract: Depth completion is a pivotal challenge in computer vision, aiming at reconstructing the dense depth map from a sparse one, typically with a paired RGB image. Existing learning-based models rely on carefully prepared but limited data, leading to significant performance degradation in out-of-distribution (OOD) scenarios. Recent foundation models have demonstrated exceptional robustness in monocular depth estimation through large-scale training, and using such models to enhance the robustness of depth completion models is a promising solution. In this work, we propose a novel depth completion framework that leverages depth foundation models to attain remarkable robustness without large-scale training. Specifically, we leverage a depth foundation model to extract environmental cues, including structural and semantic context, from RGB images to guide the propagation of sparse depth information into missing regions. We further design a dual-space propagation approach, without any learnable parameters, to effectively propagate sparse depth in both 3D and 2D spaces to maintain geometric structure and local consistency. 
To refine the intricate structure, we introduce a learnable correction module to progressively adjust the depth prediction towards the real depth. We train our model on the NYUv2 and KITTI datasets as in-distribution datasets and extensively evaluate the framework on 16 other datasets. Our framework performs remarkably well in the OOD scenarios and outperforms existing state-of-the-art depth completion methods. Our models are released at https://github.com/shenglunch/PSD.

[62] Unified modality separation: A vision-language framework for unsupervised domain adaptation

Xinyao Li,Jingjing Li,Zhekai Du,Lei Zhu,Heng Tao Shen

Main category: cs.CV

TL;DR: This paper proposes a modality separation framework to enhance unsupervised domain adaptation by effectively addressing the modality gap.

Details Motivation: Direct unsupervised domain adaptation in the presence of a modality gap transfers only modality-invariant knowledge; the paper aims to overcome this limitation. Method: The method involves disentangling modality components from vision-language model features and handling them separately, with modality-adaptive ensemble weights used at test time. Result: The results show a performance gain of up to 9% with 9 times more computational efficiency, validated through extensive experiments. Conclusion: The paper concludes that by addressing the modality gap through a unified modality separation framework, improved performance in unsupervised domain adaptation can be achieved. Abstract: Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains. Recently, pre-trained vision-language models (VLMs) have demonstrated promising zero-shot performance by leveraging semantic information to facilitate target tasks. By aligning vision and text embeddings, VLMs have shown notable success in bridging domain gaps. However, inherent differences naturally exist between modalities, known as the modality gap. Our findings reveal that direct UDA in the presence of the modality gap only transfers modality-invariant knowledge, leading to suboptimal target performance. To address this limitation, we propose a unified modality separation framework that accommodates both modality-specific and modality-invariant components. During training, different modality components are disentangled from VLM features and then handled separately in a unified manner. At test time, modality-adaptive ensemble weights are automatically determined to maximize the synergy of different components. To evaluate instance-level modality characteristics, we design a modality discrepancy metric to categorize samples into modality-invariant, modality-specific, and uncertain ones. 
The modality-invariant samples are exploited to facilitate cross-modal alignment, while uncertain ones are annotated to enhance model capabilities. Building upon prompt tuning techniques, our methods achieve up to a 9% performance gain with 9 times the computational efficiency. Extensive experiments and analysis across various backbones, baselines, datasets, and adaptation settings demonstrate the efficacy of our design.

[63] Modeling Rapid Contextual Learning in the Visual Cortex with Fast-Weight Deep Autoencoder Networks

Yue Li,Weifan Wang,Tai Sing Lee

Main category: cs.CV

TL;DR: This study uses a Vision Transformer-based autoencoder to investigate how familiarity training can induce sensitivity to global context in early layers of a deep neural network, with results suggesting that a hybrid fast-and-slow weight architecture may provide a viable computational model for studying rapid global context learning in the brain.

Details Motivation: Recent neurophysiological studies have shown that the early visual cortex can rapidly learn global image context, primarily attributed to local recurrent interactions. This study aims to investigate this phenomenon from a functional perspective using a deep neural network. Method: A Vision Transformer (ViT)-based autoencoder was employed to investigate how familiarity training can induce sensitivity to global context in the early layers of a deep neural network. Low-Rank Adaptation (LoRA) was explored to implement fast weights within each Transformer layer. Result: The results show that (1) The ViT-based autoencoder's self-attention circuit performs a manifold transform similar to a neural circuit model of the familiarity effect. (2) Familiarity training aligns latent representations in early layers with those in the top layer containing global context information. (3) Familiarity training broadens the self-attention scope within the remembered image context. (4) These effects are significantly amplified by LoRA-based fast weights. Conclusion: The study concludes that familiarity training introduces global sensitivity to earlier layers in a hierarchical network, and that a hybrid fast-and-slow weight architecture may provide a viable computational model for studying rapid global context learning in the brain. Abstract: Recent neurophysiological studies have revealed that the early visual cortex can rapidly learn global image context, as evidenced by a sparsification of population responses and a reduction in mean activity when exposed to familiar versus novel image contexts. This phenomenon has been attributed primarily to local recurrent interactions, rather than changes in feedforward or feedback pathways, supported by both empirical findings and circuit-level modeling. 
Recurrent neural circuits capable of simulating these effects have been shown to reshape the geometry of neural manifolds, enhancing robustness and invariance to irrelevant variations. In this study, we employ a Vision Transformer (ViT)-based autoencoder to investigate, from a functional perspective, how familiarity training can induce sensitivity to global context in the early layers of a deep neural network. We hypothesize that rapid learning operates via fast weights, which encode transient or short-term memory traces, and we explore the use of Low-Rank Adaptation (LoRA) to implement such fast weights within each Transformer layer. Our results show that (1) The proposed ViT-based autoencoder's self-attention circuit performs a manifold transform similar to a neural circuit model of the familiarity effect. (2) Familiarity training aligns latent representations in early layers with those in the top layer that contains global context information. (3) Familiarity training broadens the self-attention scope within the remembered image context. (4) These effects are significantly amplified by LoRA-based fast weights. Together, these findings suggest that familiarity training introduces global sensitivity to earlier layers in a hierarchical network, and that a hybrid fast-and-slow weight architecture may provide a viable computational model for studying rapid global context learning in the brain.
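
To make the fast-weight idea in this abstract concrete, here is a minimal sketch of how a LoRA-style low-rank update can sit on top of a frozen ("slow") weight matrix. It is an illustration under stated assumptions, not the authors' implementation: the function names `matmul` and `lora_effective_weight`, the `alpha / rank` scaling, and the toy 2x2 matrices are all hypothetical, and plain Python lists stand in for tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w_slow, b_fast, a_fast, alpha=1.0):
    """Return W_slow + (alpha / rank) * (B @ A).

    w_slow: frozen slow weights, shape (d_out, d_in)
    b_fast: fast-weight factor B, shape (d_out, rank)
    a_fast: fast-weight factor A, shape (rank, d_in)
    Only B and A would be updated during familiarity training (assumption).
    """
    rank = len(a_fast)
    delta = matmul(b_fast, a_fast)  # low-rank update, shape (d_out, d_in)
    scale = alpha / rank
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_slow, delta)
    ]


# Toy example: a rank-1 update on a 2x2 identity matrix.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [0.0]]
a = [[0.0, 2.0]]
w_eff = lora_effective_weight(w, b, a)
```

Because the slow weights are never modified, setting `B` to zero recovers the original layer exactly, which is one reason a low-rank adapter is a convenient stand-in for a transient memory trace.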

[64] Attribute Guidance With Inherent Pseudo-label For Occluded Person Re-identification

Rui Zhi,Zhen Yang,Haiyang Zhang

Main category: cs.CV

TL;DR: AG-ReID improves person re-identification in occluded scenarios by leveraging fine-grained semantic attributes from pre-trained models.

Details Motivation: Pre-trained vision-language models struggle with occluded Re-ID scenarios by focusing on holistic image semantics and neglecting fine-grained attribute information. Method: AG-ReID uses a two-stage process: generating attribute pseudo-labels and employing a dual-guidance mechanism combining holistic and fine-grained attribute information. Result: AG-ReID achieves state-of-the-art results on multiple Re-ID datasets, showing significant improvements in handling occlusions and subtle attribute differences. Conclusion: AG-ReID is effective in handling occlusions and subtle attribute differences in person re-identification tasks without requiring additional data or annotations. Abstract: Person re-identification (Re-ID) aims to match person images across different camera views, with occluded Re-ID addressing scenarios where pedestrians are partially visible. While pre-trained vision-language models have shown effectiveness in Re-ID tasks, they face significant challenges in occluded scenarios by focusing on holistic image semantics while neglecting fine-grained attribute information. This limitation becomes particularly evident when dealing with partially occluded pedestrians or when distinguishing between individuals with subtle appearance differences. To address this limitation, we propose Attribute-Guide ReID (AG-ReID), a novel framework that leverages pre-trained models' inherent capabilities to extract fine-grained semantic attributes without additional data or annotations. Our framework operates through a two-stage process: first generating attribute pseudo-labels that capture subtle visual characteristics, then introducing a dual-guidance mechanism that combines holistic and fine-grained attribute information to enhance image feature extraction. 
Extensive experiments demonstrate that AG-ReID achieves state-of-the-art results on multiple widely-used Re-ID datasets, showing significant improvements in handling occlusions and subtle attribute differences while maintaining competitive performance on standard Re-ID scenarios.

[65] CRAM: Large-scale Video Continual Learning with Bootstrapped Compression

Shivani Mall,Joao F. Henriques

Main category: cs.CV

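TL;DR: This paper proposes CRAM, a video continual learning method that trains a video compressor online and refreshes stored video codes, substantially reducing memory requirements while achieving strong performance on large-scale video benchmarks.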

Details Motivation: Video continual learning is challenged by high memory requirements; long videos and continual streams in particular conflict with common rehearsal-buffer size constraints, so an effective way to reduce the memory footprint is needed. Method: The paper proposes CRAM (Continually Refreshed Amodal Memory), which stores video codes produced by an online-trained video compressor instead of raw inputs, and refreshes those codes to cope with catastrophic forgetting. Result: CRAM performs well on large-scale video continual learning benchmarks such as EpicKitchens-100 and Kinetics-700, storing thousands of relatively long videos in under 2 GB and significantly outperforming existing methods. Conclusion: CRAM addresses the high memory requirements of video continual learning: by training a video compressor online and refreshing video codes, it outperforms existing methods with a reduced memory footprint. Abstract: Continual learning (CL) promises to allow neural networks to learn from continuous streams of inputs, instead of IID (independent and identically distributed) sampling, which requires random access to a full dataset. This would allow for much smaller storage requirements and self-sufficiency of deployed systems that cope with natural distribution shifts, similarly to biological learning. We focus on video CL employing a rehearsal-based approach, which reinforces past samples from a memory buffer. We posit that part of the reason why practical video CL is challenging is the high memory requirements of video, further exacerbated by long videos and continual streams, which are at odds with the common rehearsal-buffer size constraints. To address this, we propose to use compressed vision, i.e., store video codes (embeddings) instead of raw inputs, and train a video classifier by IID sampling from this rolling buffer. Training a video compressor online (so not depending on any pre-trained networks) means that it is also subject to catastrophic forgetting. We propose a scheme to deal with this forgetting by refreshing video codes, which requires careful decompression with a previous version of the network and recompression with a new one. We name our method Continually Refreshed Amodal Memory (CRAM). We expand current video CL benchmarks to large-scale settings, namely EpicKitchens-100 and Kinetics-700, storing thousands of relatively long videos in under 2 GB, and demonstrate empirically that our video CL method outperforms prior art with a significantly reduced memory footprint.
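
The code-refreshing step described in the abstract (decompress with the previous network, recompress with the new one) can be sketched as follows. This is a toy illustration under stated assumptions: a uniform scalar quantizer stands in for the learned video compressor, and `make_codec` and `refresh_buffer` are hypothetical names, not part of the CRAM code.

```python
def make_codec(step):
    """Build a toy (encode, decode) pair; `step` plays the role of model weights."""

    def encode(frame):
        # Compress: map each value to an integer code.
        return [round(v / step) for v in frame]

    def decode(code):
        # Decompress: map integer codes back to approximate values.
        return [c * step for c in code]

    return encode, decode


def refresh_buffer(buffer, old_decode, new_encode):
    """Re-express every stored code in the new compressor's code space.

    Each code is decompressed with the previous compressor version and
    recompressed with the current one, so the buffer never holds raw inputs.
    """
    return [new_encode(old_decode(code)) for code in buffer]


# The compressor changes from version "old" to version "new" during training.
old_encode, old_decode = make_codec(step=0.5)
new_encode, new_decode = make_codec(step=0.25)

buffer = [old_encode([1.0, 2.0])]  # codes written by the old compressor
buffer = refresh_buffer(buffer, old_decode, new_encode)
```

Without the refresh, codes written by the old compressor would be misread by the new decoder; in this toy case `[2, 4]` would decode to `[0.5, 1.0]` instead of `[1.0, 2.0]`.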

[66] Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation

Xusheng Liang,Lihua Zhou,Nianxin Li,Miao Xu,Ziyang Song,Dong Yi,Jinlin Wu,Hongbin Liu,Jiebo Luo,Zhen Lei

Main category: cs.CV

TL;DR: MCDRL is a new medical image segmentation framework that combines CLIP with causal inference to address domain generalization.

Details Motivation: Medical images are highly variable and complex, and confounders such as equipment differences, procedure artifacts, and imaging modes cause significant domain shifts, so existing VLMs such as CLIP generalize poorly when applied to medical imaging. Method: The proposed MCDRL uses CLIP's cross-modal capabilities to identify candidate lesion regions and builds a confounder dictionary that represents domain-specific variations, then trains a causal intervention network to eliminate the influence of those variations. Result: In extensive experiments, MCDRL achieves higher segmentation accuracy and shows strong generalization. Conclusion: By integrating causal inference with a VLM, MCDRL achieves better domain generalization in medical image segmentation than existing methods. Abstract: Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot capabilities in various computer vision tasks. However, their application to medical imaging remains challenging due to the high variability and complexity of medical data. Specifically, medical images often exhibit significant domain shifts caused by various confounders, including equipment differences, procedure artifacts, and imaging modes, which can lead to poor generalization when models are applied to unseen domains. To address this limitation, we propose Multimodal Causal-Driven Representation Learning (MCDRL), a novel framework that integrates causal inference with the VLM to tackle domain generalization in medical image segmentation. MCDRL is implemented in two steps: first, it leverages CLIP's cross-modal capabilities to identify candidate lesion regions and construct a confounder dictionary through text prompts, specifically designed to represent domain-specific variations; second, it trains a causal intervention network that utilizes this dictionary to identify and eliminate the influence of these domain-specific variations while preserving the anatomical structural information critical for segmentation tasks. Extensive experiments demonstrate that MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.

[67] AU-IQA: A Benchmark Dataset for Perceptual Quality Assessment of AI-Enhanced User-Generated Content

Shushi Wang,Chunyi Li,Zicheng Zhang,Han Zhou,Wei Dong,Jun Chen,Guangtao Zhai,Xiaohong Liu

Main category: cs.CV

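TL;DR: This paper presents AU-IQA, a new benchmark dataset for assessing the perceptual quality of AI-enhanced user-generated content, and analyzes existing quality assessment models on it.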

Details Motivation: The lack of specialized quality assessment models limits progress in AI image enhancement, especially in the emerging area of AI-enhanced user-generated content (AI-UGC). Method: The authors construct AU-IQA, a benchmark dataset of 4,800 AI-UGC images, and evaluate a range of existing quality assessment models, including traditional IQA methods and large multimodal models. Result: The work provides a new benchmark dataset and a comprehensive analysis of how effective existing quality assessment models are on AI-UGC. Conclusion: AU-IQA is a benchmark dataset for assessing the perceptual quality of AI-UGC, accompanied by a comprehensive analysis of existing quality assessment models. Abstract: AI-based image enhancement techniques have been widely adopted in various visual applications, significantly improving the perceptual quality of user-generated content (UGC). However, the lack of specialized quality assessment models has become a significant limiting factor in this field, limiting user experience and hindering the advancement of enhancement methods. While perceptual quality assessment methods have shown strong performance on UGC and AIGC individually, their effectiveness on AI-enhanced UGC (AI-UGC), which blends features from both, remains largely unexplored. To address this gap, we construct AU-IQA, a benchmark dataset comprising 4,800 AI-UGC images produced by three representative enhancement types: super-resolution, low-light enhancement, and denoising. On this dataset, we further evaluate a range of existing quality assessment models, including traditional IQA methods and large multimodal models. Finally, we provide a comprehensive analysis of how well current approaches perform in assessing the perceptual quality of AI-UGC. The access link to the AU-IQA is https://github.com/WNNGGU/AU-IQA-Dataset.

[68] Skin-SOAP: A Weakly Supervised Framework for Generating Structured SOAP Notes

Sadia Kamal,Tim Oates,Joy Wan

Main category: cs.CV

TL;DR: The paper introduces skin-SOAP, an automated, weakly supervised framework for generating clinical SOAP notes for skin carcinoma patients, reducing clinician workload and annotation needs while maintaining high accuracy.

Details Motivation: Manual generation of SOAP notes is labor-intensive, contributing to clinician burnout and inefficiencies in healthcare systems. A scalable, automated solution is needed to reduce burden and healthcare costs associated with skin carcinoma treatment. Method: Development of a weakly supervised multimodal framework, skin-SOAP, that generates structured SOAP notes from lesion images and sparse clinical text. Evaluation using novel metrics MedConceptEval and Clinical Coherence Score (CCS) to assess clinical relevance. Result: The skin-SOAP framework achieves performance comparable to GPT-4o, Claude, and DeepSeek Janus Pro on clinical relevance metrics. Two novel evaluation metrics, MedConceptEval and CCS, were introduced to measure semantic alignment with medical concepts and input features. Conclusion: The proposed skin-SOAP framework effectively generates clinically structured SOAP notes using minimal inputs, reducing manual annotation requirements and clinician burden while maintaining performance comparable to state-of-the-art models. Abstract: Skin carcinoma is the most prevalent form of cancer globally, accounting for over $8 billion in annual healthcare expenditures. Early diagnosis and accurate, timely treatment are critical to improving patient survival rates. In clinical settings, physicians document patient visits using detailed SOAP (Subjective, Objective, Assessment, and Plan) notes. However, manually generating these notes is labor-intensive and contributes to clinician burnout. In this work, we propose skin-SOAP, a weakly supervised multimodal framework to generate clinically structured SOAP notes from limited inputs, including lesion images and sparse clinical text. Our approach reduces reliance on manual annotations, enabling scalable, clinically grounded documentation while alleviating clinician burden and reducing the need for large annotated data. 
Our method achieves performance comparable to GPT-4o, Claude, and DeepSeek Janus Pro across key clinical relevance metrics. To evaluate this clinical relevance, we introduce two novel metrics, MedConceptEval and Clinical Coherence Score (CCS), which assess semantic alignment with expert medical concepts and input features, respectively.

[69] A Novel Image Similarity Metric for Scene Composition Structure

Md Redwanul Haque,Manzur Murshed,Manoranjan Paul,Tsz-Kwan Lee

Main category: cs.CV

TL;DR: SCSSIM is a new training-free metric for evaluating the structural accuracy of generative AI image outputs, focusing on Scene Composition Structure integrity.

Details Motivation: Traditional image quality metrics do not effectively assess the Scene Composition Structure, which is crucial for the accuracy of generative AI outputs. Method: SCSSIM uses Cuboidal hierarchical partitioning and statistical measures to evaluate Scene Composition Structure without training. Result: SCSSIM accurately reflects unchanged SCS with high invariance to non-compositional distortions and indicates altered SCS with a strong monotonic decrease for compositional distortions. Conclusion: SCSSIM is a valuable tool for evaluating generative models by ensuring the integrity of scene composition. Abstract: The rapid advancement of generative AI models necessitates novel methods for evaluating image quality that extend beyond human perception. A critical concern for these models is the preservation of an image's underlying Scene Composition Structure (SCS), which defines the geometric relationships among objects and the background, their relative positions, sizes, orientations, etc. Maintaining SCS integrity is paramount for ensuring faithful and structurally accurate GenAI outputs. Traditional image similarity metrics often fall short in assessing SCS. Pixel-level approaches are overly sensitive to minor visual noise, while perception-based metrics prioritize human aesthetic appeal, neither adequately capturing structural fidelity. Furthermore, recent neural-network-based metrics introduce training overheads and potential generalization issues. We introduce the SCS Similarity Index Measure (SCSSIM), a novel, analytical, and training-free metric that quantifies SCS preservation by exploiting statistical measures derived from the Cuboidal hierarchical partitioning of images, robustly capturing non-object-based structural relationships. Our experiments demonstrate SCSSIM's high invariance to non-compositional distortions, accurately reflecting unchanged SCS. 
Conversely, it shows a strong monotonic decrease for compositional distortions, precisely indicating when SCS has been altered. Compared to existing metrics, SCSSIM exhibits superior properties for structural evaluation, making it an invaluable tool for developing and evaluating generative models, ensuring the integrity of scene composition.

[70] HAMoBE: Hierarchical and Adaptive Mixture of Biometric Experts for Video-based Person ReID

Yiyang Su,Yunping Shi,Feng Liu,Xiaoming Liu

Main category: cs.CV

TL;DR: This paper proposes HAMoBE, a novel framework for video-based person re-identification that adaptively integrates biometric features and dynamically adjusts expert contributions, achieving significant performance improvements.

Details Motivation: Existing video-based person re-identification methods often neglect the importance of identifying and selecting the most discriminative features for effective matching, which limits their performance in dynamic environments. Method: The HAMoBE framework utilizes multi-layer features from a pre-trained large model, extracts low-level features in the first level, and employs specialized experts for long-term, short-term, and temporal features in the second level. A dual-input decision gating network dynamically adjusts expert contributions. Result: Extensive evaluations on benchmarks like MEVID show significant performance improvements, such as a +13.0% increase in Rank-1 accuracy. Conclusion: The proposed HAMoBE framework significantly improves the performance of video-based person re-identification by adaptively integrating key biometric features and dynamically adjusting expert contributions. Abstract: Recently, research interest in person re-identification (ReID) has increasingly focused on video-based scenarios, which are essential for robust surveillance and security in varied and dynamic environments. However, existing video-based ReID methods often overlook the necessity of identifying and selecting the most discriminative features from both videos in a query-gallery pair for effective matching. To address this issue, we propose a novel Hierarchical and Adaptive Mixture of Biometric Experts (HAMoBE) framework, which leverages multi-layer features from a pre-trained large model (e.g., CLIP) and is designed to mimic human perceptual mechanisms by independently modeling key biometric features--appearance, static body shape, and dynamic gait--and adaptively integrating them. Specifically, HAMoBE includes two levels: the first level extracts low-level features from multi-layer representations provided by the frozen large model, while the second level consists of specialized experts focusing on long-term, short-term, and temporal features. 
To ensure robust matching, we introduce a new dual-input decision gating network that dynamically adjusts the contributions of each expert based on their relevance to the input scenarios. Extensive evaluations on benchmarks like MEVID demonstrate that our approach yields significant performance improvements (e.g., +13.0% Rank-1 accuracy).
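
The gating idea in this abstract, where each expert's contribution is weighted by its relevance, can be illustrated with a generic softmax-gated mixture of experts. This is a minimal sketch under stated assumptions, not HAMoBE's dual-input gating network: the gate scores are taken as given rather than computed from the query and gallery videos, and `softmax` and `mix_experts` are hypothetical helper names.

```python
import math


def softmax(scores):
    """Turn raw gate scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_features, gate_scores):
    """Combine per-expert feature vectors using softmax gate weights.

    expert_features: one feature vector per expert (e.g. appearance,
                     body shape, gait), all of the same length
    gate_scores:     one raw relevance score per expert (assumed given)
    """
    weights = softmax(gate_scores)
    dim = len(expert_features[0])
    return [
        sum(w * feats[i] for w, feats in zip(weights, expert_features))
        for i in range(dim)
    ]


# Two toy experts with equal gate scores contribute equally.
fused = mix_experts([[1.0, 0.0], [0.0, 1.0]], gate_scores=[0.0, 0.0])
```

Raising one expert's score shifts the fused feature toward that expert, which is the behavior a learned gate would exploit when, for example, appearance cues are unreliable and gait is more informative.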

[71] Finding Needles in Images: Can Multimodal LLMs Locate Fine Details?

Parth Thakkar,Ankush Agarwal,Prasad Kasu,Pulkit Bansal,Chaitanya Devaguptapu

Main category: cs.CV

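TL;DR: This paper studies how well multimodal large language models locate fine-grained details in complex documents, proposes Spot-IT to improve performance, and validates its effectiveness.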

Details Motivation: The ability of MLLMs to locate and reason about fine-grained details in complex documents remains understudied. Method: The paper proposes Spot-IT, which improves MLLM performance through intelligent patch selection and Gaussian attention. Result: Spot-IT achieves significant improvements over baseline methods in scenarios requiring precise detail extraction. Conclusion: Spot-IT is an effective way to enhance MLLMs' ability to extract details from complex documents. Abstract: While Multi-modal Large Language Models (MLLMs) have shown impressive capabilities in document understanding tasks, their ability to locate and reason about fine-grained details within complex documents remains understudied. Consider searching a restaurant menu for a specific nutritional detail or identifying a disclaimer in a lengthy newspaper article: tasks that demand careful attention to small but significant details within a broader narrative, akin to Finding Needles in Images (NiM). To address this gap, we introduce NiM, a carefully curated benchmark spanning diverse real-world documents including newspapers, menus, and lecture images, specifically designed to evaluate MLLMs' capability in these intricate tasks. Building on this, we further propose Spot-IT, a simple yet effective approach that enhances MLLMs' capability through intelligent patch selection and Gaussian attention, motivated by how humans zoom and focus when searching documents. Our extensive experiments reveal both the capabilities and limitations of current MLLMs in handling fine-grained document understanding tasks, while demonstrating the effectiveness of our approach. Spot-IT achieves significant improvements over baseline methods, particularly in scenarios requiring precise detail extraction from complex layouts.

[72] DualMat: PBR Material Estimation via Coherent Dual-Path Diffusion

Yifeng Huang,Zhang Chen,Yi Xu,Minh Hoai,Zhong Li

Main category: cs.CV

TL;DR: DualMat is a novel dual-path diffusion framework for PBR material estimation from single images, combining pretrained visual knowledge with material-specialized modeling to achieve superior accuracy and efficiency.

Details Motivation: Estimating PBR materials from single images under complex lighting is challenging, and existing methods struggle with accuracy and efficiency. DualMat aims to address these limitations through a novel dual-path diffusion approach. Method: DualMat operates in two latent spaces: an albedo-optimized path using RGB latent space and a material-specialized path for metallic and roughness estimation, with feature distillation ensuring coherence between paths. Rectified flow is used to improve efficiency. Result: DualMat achieves state-of-the-art performance on Objaverse and real-world data, with up to 28% improvement in albedo estimation and 39% reduction in metallic-roughness prediction errors, while enhancing inference efficiency via rectified flow. Conclusion: DualMat introduces a dual-path diffusion framework that achieves state-of-the-art performance in estimating PBR materials from single images, showing significant improvements in albedo and metallic-roughness estimation. Abstract: We present DualMat, a novel dual-path diffusion framework for estimating Physically Based Rendering (PBR) materials from single images under complex lighting conditions. Our approach operates in two distinct latent spaces: an albedo-optimized path leveraging pretrained visual knowledge through RGB latent space, and a material-specialized path operating in a compact latent space designed for precise metallic and roughness estimation. To ensure coherent predictions between the albedo-optimized and material-specialized paths, we introduce feature distillation during training. We employ rectified flow to enhance efficiency by reducing inference steps while maintaining quality. Our framework extends to high-resolution and multi-view inputs through patch-based estimation and cross-view attention, enabling seamless integration into image-to-3D pipelines. 
DualMat achieves state-of-the-art performance on both Objaverse and real-world data, significantly outperforming existing methods with up to 28% improvement in albedo estimation and 39% reduction in metallic-roughness prediction errors.

[73] Decoupling Continual Semantic Segmentation

Yifu Guo,Yuquan Lu,Wentao Zhang,Zishan Xu,Dexia Chen,Siyu Zhang,Yizhe Zhang,Ruixuan Wang

Main category: cs.CV

TL;DR: DecoupleCSS presents a novel two-stage approach for continual semantic segmentation that decouples class detection from segmentation, enhancing knowledge retention and adaptability.

Details Motivation: Existing CSS methods suffer from interference between old and new class learning due to tightly coupled architectures, prompting the need for a framework that better balances retention and plasticity. Method: DecoupleCSS decouples class-aware detection from class-agnostic segmentation. It uses pre-trained text and image encoders with LoRA in the first stage to generate location-aware prompts, and the Segment Anything Model (SAM) in the second stage to produce precise segmentation masks. Result: DecoupleCSS achieves state-of-the-art performance across various challenging continual semantic segmentation tasks while effectively preserving past knowledge and learning new classes. Conclusion: DecoupleCSS introduces a two-stage framework that improves continual semantic segmentation by balancing retention and adaptability, achieving state-of-the-art performance. Abstract: Continual Semantic Segmentation (CSS) requires learning new classes without forgetting previously acquired knowledge, addressing the fundamental challenge of catastrophic forgetting in dense prediction tasks. However, existing CSS methods typically employ single-stage encoder-decoder architectures where segmentation masks and class labels are tightly coupled, leading to interference between old and new class learning and suboptimal retention-plasticity balance. We introduce DecoupleCSS, a novel two-stage framework for CSS. By decoupling class-aware detection from class-agnostic segmentation, DecoupleCSS enables more effective continual learning, preserving past knowledge while learning new classes. The first stage leverages pre-trained text and image encoders, adapted using LoRA, to encode class-specific information and generate location-aware prompts. In the second stage, the Segment Anything Model (SAM) is employed to produce precise segmentation masks, ensuring that segmentation knowledge is shared across both new and previous classes. 
This approach improves the balance between retention and adaptability in CSS, achieving state-of-the-art performance across a variety of challenging tasks. Our code is publicly available at: https://github.com/euyis1019/Decoupling-Continual-Semantic-Segmentation.
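
The first stage above adapts pre-trained encoders with LoRA. As a rough illustration of that general idea only (this is not the DecoupleCSS code), the sketch below forms an effective weight from a frozen matrix plus a scaled low-rank update. The function names, the matrix shapes, and the `alpha / rank` scaling follow the commonly described LoRA formulation and are assumptions here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return w + (alpha / rank) * (b @ a).

    w: frozen weight, shape (d_out, d_in)
    a: trainable down-projection, shape (rank, d_in)
    b: trainable up-projection, shape (d_out, rank)
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Hypothetical 2x2 example with rank 1.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]
b_zero = [[0.0], [0.0]]   # a zero up-projection leaves the frozen weight unchanged
b_trained = [[1.0], [0.0]]

unchanged = lora_effective_weight(w, a, b_zero, alpha=2.0)
adapted = lora_effective_weight(w, a, b_trained, alpha=2.0)
```

Only `a` and `b` would be trained in such a setup, which keeps the number of updated parameters small while the pre-trained weight `w` stays fixed.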

[74] Automatic Image Colorization with Convolutional Neural Networks and Generative Adversarial Networks

Ruiyu Li,Changyuan Qiu,Hangrui Cao,Qihan Ren,Yuqing Qiu

Main category: cs.CV

TL;DR: This project explores automatic image colorization using classification and adversarial learning, adapting prior methods to specific scenarios for improved results, driven by the challenge and application potential of converting grayscale images to color.

Details Motivation: Image colorization is a challenging and ill-posed problem with significant applications in areas like color restoration and animation. Semantic and texture cues, along with available training data, provide opportunities for learning effective colorization models. Method: The study explores automatic image colorization using classification and adversarial learning techniques, building on previous research and adapting it to specific scenarios. Result: The project successfully demonstrates the application of classification and adversarial learning for automatic image colorization, highlighting the importance of adapting prior methods to specific scenarios. Conclusion: The project concludes that automatic image colorization can be effectively achieved through classification and adversarial learning, building on prior works and making scenario-specific modifications for improved results. Abstract: Image colorization, the task of adding colors to grayscale images, has been the focus of significant research efforts in computer vision in recent years for its various application areas such as color restoration and automatic animation colorization [15, 1]. The colorization problem is challenging as it is highly ill-posed with two out of three image dimensions lost, resulting in large degrees of freedom. However, semantics of the scene as well as the surface texture could provide important cues for colors: the sky is typically blue, the clouds are typically white and the grass is typically green, and there are huge amounts of training data available for learning such priors since any colored image could serve as a training data point [20]. Colorization is initially formulated as a regression task[5], which ignores the multi-modal nature of color prediction. In this project, we explore automatic image colorization via classification and adversarial learning. 
We will build our models on prior works, apply modifications for our specific scenario and make comparisons.
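
The abstract's point that regression ignores the multi-modal nature of color prediction can be illustrated with a toy example. The sketch below is not the project's model: the bin width, value range, and function names are hypothetical choices for a two-channel chroma space. It shows that a squared-error estimate averages plausible colors into a washed-out value, while a classification-style estimate over color bins keeps one of the modes.

```python
# Hypothetical discretisation of a 2-D chroma plane (assumed values, for illustration).
GRID = 10                  # assumed bin width
LO, HI = -110, 110         # assumed value range per axis
BINS_PER_AXIS = (HI - LO) // GRID


def ab_to_class(a, b):
    """Map a chroma pair to a single class index (clamped to the grid)."""
    ia = min(max(int((a - LO) // GRID), 0), BINS_PER_AXIS - 1)
    ib = min(max(int((b - LO) // GRID), 0), BINS_PER_AXIS - 1)
    return ia * BINS_PER_AXIS + ib


def class_to_ab(index):
    """Return the centre of the bin for a class index."""
    ia, ib = divmod(index, BINS_PER_AXIS)
    return (LO + (ia + 0.5) * GRID, LO + (ib + 0.5) * GRID)


def regression_estimate(samples):
    """Squared-error optimum: the mean of all observed colours."""
    n = len(samples)
    return (sum(a for a, _ in samples) / n, sum(b for _, b in samples) / n)


def classification_estimate(samples):
    """Pick the most frequent colour bin and decode its centre."""
    counts = {}
    for a, b in samples:
        key = ab_to_class(a, b)
        counts[key] = counts.get(key, 0) + 1
    best = max(counts, key=counts.get)
    return class_to_ab(best)


# Two plausible colours observed for the same grey pixel.
samples = [(60, 40), (60, 40), (-60, -40), (-60, -40), (60, 40)]
mean_ab = regression_estimate(samples)       # close to neutral grey
mode_ab = classification_estimate(samples)   # centre of the dominant bin
```

The mean lands near the neutral point of the chroma plane, which is the desaturated result that motivates treating colorization as classification.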

[75] FLUX-Makeup: High-Fidelity, Identity-Consistent, and Robust Makeup Transfer via Diffusion Transformer

Jian Zhu,Shanyuan Liu,Liuzhuozheng Li,Yue Gong,He Wang,Bo Cheng,Yuhang Ma,Liebucha Wu,Xiaoyu Wu,Dawei Leng,Yuhui Yin,Yang Xu

Main category: cs.CV

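TL;DR: This paper proposes FLUX-Makeup, a makeup transfer framework that requires no auxiliary face-control components, achieves high fidelity, identity consistency, and robustness, and performs well across diverse scenarios.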

Details Motivation: Existing GAN-based methods typically rely on carefully designed loss functions to balance transfer quality and facial identity consistency, while diffusion-based methods often depend on additional face-control modules or algorithms to preserve identity; these auxiliary components tend to introduce extra errors and lead to suboptimal transfer results. Method: The framework is built on FLUX-Kontext, using the source image as its native conditional input; it introduces RefLoRAInjector, a lightweight makeup feature injector that decouples the reference pathway from the backbone; and it designs a robust, scalable data generation pipeline to provide more accurate supervision during training. Result: FLUX-Makeup achieves high fidelity and identity consistency in makeup transfer and significantly outperforms existing methods; the paired makeup datasets it generates significantly surpass all existing datasets in quality. Conclusion: FLUX-Makeup is a high-quality, identity-consistent, and robust makeup transfer framework that needs no auxiliary face-control components, achieves state-of-the-art performance, and shows strong robustness across diverse scenarios. Abstract: Makeup transfer aims to apply the makeup style from a reference face to a target face and has been increasingly adopted in practical applications. Existing GAN-based approaches typically rely on carefully designed loss functions to balance transfer quality and facial identity consistency, while diffusion-based methods often depend on additional face-control modules or algorithms to preserve identity. However, these auxiliary components tend to introduce extra errors, leading to suboptimal transfer results. To overcome these limitations, we propose FLUX-Makeup, a high-fidelity, identity-consistent, and robust makeup transfer framework that eliminates the need for any auxiliary face-control components. Instead, our method directly leverages source-reference image pairs to achieve superior transfer performance. Specifically, we build our framework upon FLUX-Kontext, using the source image as its native conditional input. Furthermore, we introduce RefLoRAInjector, a lightweight makeup feature injector that decouples the reference pathway from the backbone, enabling efficient and comprehensive extraction of makeup-related information. In parallel, we design a robust and scalable data generation pipeline to provide more accurate supervision during training. The paired makeup datasets produced by this pipeline significantly surpass the quality of all existing datasets. Extensive experiments demonstrate that FLUX-Makeup achieves state-of-the-art performance, exhibiting strong robustness across diverse scenarios.

[76] AdaFusion: Prompt-Guided Inference with Adaptive Fusion of Pathology Foundation Models

Yuxiang Xiao,Yang Hu,Bin Li,Tianyang Zhang,Zexi Li,Huazhu Fu,Jens Rittscher,Kaixiang Yang

Main category: cs.CV

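TL;DR: This paper proposes AdaFusion, a prompt-guided inference framework that dynamically integrates knowledge from multiple pathology foundation models to improve performance and interpretability on downstream tasks.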

Details Motivation: Pathology foundation models (PFMs), owing to their diverse yet opaque pretraining contexts, introduce latent biases that hinder generalisability and transparency in downstream applications. Method: AdaFusion, a prompt-guided inference framework, compresses and aligns tile-level features from different models and uses a lightweight attention mechanism to adaptively fuse them based on tissue phenotype context. Result: On three real-world benchmarks, AdaFusion consistently surpasses individual PFMs on both classification and regression tasks and offers interpretable insights into each model's biosemantic specialisation. Conclusion: By dynamically integrating complementary knowledge from multiple PFMs, AdaFusion improves downstream performance and the interpretability of model-specific inductive biases. Abstract: Pathology foundation models (PFMs) have demonstrated strong representational capabilities through self-supervised pre-training on large-scale, unannotated histopathology image datasets. However, their diverse yet opaque pretraining contexts, shaped by both data-related and structural/training factors, introduce latent biases that hinder generalisability and transparency in downstream applications. In this paper, we propose AdaFusion, a novel prompt-guided inference framework that, to our knowledge, is among the very first to dynamically integrate complementary knowledge from multiple PFMs. Our method compresses and aligns tile-level features from diverse models and employs a lightweight attention mechanism to adaptively fuse them based on tissue phenotype context. We evaluate AdaFusion on three real-world benchmarks spanning treatment response prediction, tumour grading, and spatial gene expression inference. Our approach consistently surpasses individual PFMs across both classification and regression tasks, while offering interpretable insights into each model's biosemantic specialisation. These results highlight AdaFusion's ability to bridge heterogeneous PFMs, achieving both enhanced performance and interpretability of model-specific inductive biases.
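
The attention-based fusion described above can be sketched in a few lines. This is a generic dot-product attention over per-model feature vectors, not the AdaFusion architecture: the `context` vector standing in for tissue phenotype information, the function names, and the assumption that features are already aligned to a common dimension are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(features, context):
    """Weight each model's feature vector by its similarity to the context.

    features: one aligned feature vector per foundation model
    context:  hypothetical vector summarising the tile's context
    Returns the fused vector and the per-model attention weights.
    """
    scores = [sum(f * c for f, c in zip(feat, context)) for feat in features]
    weights = softmax(scores)
    dim = len(context)
    fused = [sum(w * feat[i] for w, feat in zip(weights, features)) for i in range(dim)]
    return fused, weights


# Two hypothetical models; the first agrees more with the context.
features = [[1.0, 0.0], [0.0, 1.0]]
context = [1.0, 0.0]
fused, weights = fuse_features(features, context)
```

The returned weights are what would make such a scheme inspectable: they show how much each model contributes for a given tile.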

[77] PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation

Jingxuan He,Busheng Su,Finn Wong

Main category: cs.CV

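TL;DR: PoseGen is a video generation framework that produces arbitrarily long videos of a specific subject from a single reference image and a driving pose sequence, addressing identity drift and temporal coherence.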

Details Motivation: Current diffusion models suffer from identity drift when generating long videos, offer limited control over subject identity and motion, and can only produce short clips. Method: PoseGen uses an in-context LoRA finetuning strategy that injects subject appearance at the token level to preserve identity while conditioning on pose information at the channel level for fine-grained motion control. It also uses an interleaved segment generation method that stitches video clips together seamlessly through a shared KV cache mechanism and a specialized transition process, ensuring background consistency and temporal smoothness. Result: Trained on a video dataset of only 33 hours, PoseGen significantly outperforms state-of-the-art methods in extensive experiments on identity fidelity, pose accuracy, and the generation of coherent, artifact-free videos of unlimited duration. Conclusion: PoseGen effectively addresses identity drift and temporal coherence in long video generation with current diffusion models, delivering better generation quality and longer videos. Abstract: Generating long, temporally coherent videos with precise control over subject identity and motion is a formidable challenge for current diffusion models, which often suffer from identity drift and are limited to short clips. We introduce PoseGen, a novel framework that generates arbitrarily long videos of a specific subject from a single reference image and a driving pose sequence. Our core innovation is an in-context LoRA finetuning strategy that injects subject appearance at the token level for identity preservation, while simultaneously conditioning on pose information at the channel level for fine-grained motion control. To overcome duration limits, PoseGen pioneers an interleaved segment generation method that seamlessly stitches video clips together, using a shared KV cache mechanism and a specialized transition process to ensure background consistency and temporal smoothness. Trained on a remarkably small 33-hour video dataset, extensive experiments show that PoseGen significantly outperforms state-of-the-art methods in identity fidelity, pose accuracy, and its unique ability to produce coherent, artifact-free videos of unlimited duration.

[78] Sculpting Margin Penalty: Intra-Task Adapter Merging and Classifier Calibration for Few-Shot Class-Incremental Learning

Liang Bai,Hong Song,Jinfu Li,Yucong Lin,Jingfan Fan,Tianyu Fu,Danni Ai,Deqiang Xiao,Jian Yang

Main category: cs.CV

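TL;DR: This paper proposes SMP, an FSCIL method that introduces margin penalties within a parameter-efficient fine-tuning strategy to balance base-class discriminability and new-class generalization; experiments show strong performance on multiple datasets.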

Details Motivation: Data privacy constraints and high acquisition costs in real-world applications make the assumption of sufficient training data in incremental tasks unrealistic, leading to significant performance degradation. Existing FSCIL methods struggle to balance base-class discriminability and new-class generalization, and limited data access during incremental tasks results in ambiguous class boundaries. Method: The proposed SMP method comprises Margin-aware Intra-task Adapter Merging (MIAM) and Margin Penalty-based Classifier Calibration (MPCC). MIAM trains low-rank adapters with two distinct classification losses and then merges them adaptively, while MPCC refines the classifier's decision boundaries through fine-tuning with a margin penalty. Result: Experiments on CIFAR100, ImageNet-R, and CUB200 show that SMP achieves state-of-the-art performance in FSCIL while maintaining a better balance between base and new classes. Conclusion: SMP effectively addresses the balance between base-class discriminability and new-class generalization in FSCIL; by introducing margin penalties at different stages, it improves forward compatibility and refines decision boundaries. Abstract: Real-world applications often face data privacy constraints and high acquisition costs, making the assumption of sufficient training data in incremental tasks unrealistic and leading to significant performance degradation in class-incremental learning. Forward-compatible learning, which prospectively prepares for future tasks during base task training, has emerged as a promising solution for Few-Shot Class-Incremental Learning (FSCIL). However, existing methods still struggle to balance base-class discriminability and new-class generalization. Moreover, limited access to original data during incremental tasks often results in ambiguous inter-class decision boundaries. To address these challenges, we propose SMP (Sculpting Margin Penalty), a novel FSCIL method that strategically integrates margin penalties at different stages within the parameter-efficient fine-tuning paradigm. Specifically, we introduce the Margin-aware Intra-task Adapter Merging (MIAM) mechanism for base task learning. MIAM trains two sets of low-rank adapters with distinct classification losses: one with a margin penalty to enhance base-class discriminability, and the other without margin constraints to promote generalization to future new classes. These adapters are then adaptively merged to improve forward compatibility. 
For incremental tasks, we propose a Margin Penalty-based Classifier Calibration (MPCC) strategy to refine decision boundaries by fine-tuning classifiers on all seen classes' embeddings with a margin penalty. Extensive experiments on CIFAR100, ImageNet-R, and CUB200 demonstrate that SMP achieves state-of-the-art performance in FSCIL while maintaining a better balance between base and new classes.
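
To make the role of a margin penalty concrete, here is a minimal sketch that assumes an additive margin on cosine-similarity logits, which is one common way to define such a penalty; the abstract does not specify SMP's exact loss. The `scale` value, the function names, and the fixed-weight linear merge of two adapters are hypothetical stand-ins (the paper merges adapters adaptively).

```python
import math


def margin_cross_entropy(cosines, target, margin=0.0, scale=16.0):
    """Cross-entropy over scaled cosine logits.

    The target-class cosine is reduced by `margin` before scaling, so the
    model must separate the target class by at least that much to reach
    the same loss as without a margin.
    """
    logits = [scale * (c - margin if i == target else c)
              for i, c in enumerate(cosines)]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def merge_adapters(params_a, params_b, weight):
    """Assumed merge rule: linear interpolation with a fixed weight."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(params_a, params_b)]


# Hypothetical cosine similarities between one embedding and three class prototypes.
cosines = [0.8, 0.3, 0.1]
loss_plain = margin_cross_entropy(cosines, target=0, margin=0.0)
loss_margin = margin_cross_entropy(cosines, target=0, margin=0.2)
merged = merge_adapters([1.0, 3.0], [3.0, 1.0], weight=0.5)
```

With the margin applied, the same embedding incurs a larger loss, which pushes training toward tighter, better separated class clusters.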

[79] AHDMIL: Asymmetric Hierarchical Distillation Multi-Instance Learning for Fast and Accurate Whole-Slide Image Classification

Jiuyang Dong,Jiahan Li,Junjun Jiang,Kui Jiang,Yongbing Zhang

Main category: cs.CV

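TL;DR: AHDMIL is an efficient multi-instance learning framework for pathological image classification that combines distillation strategies with a new classifier, significantly improving both performance and speed.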

Details Motivation: To address the high inference cost of multi-instance learning in pathological image classification by eliminating irrelevant patches to improve efficiency. Method: The AHDMIL framework comprises a Dynamic Multi-Instance Network (DMIN) and a Dual-Branch Lightweight Instance Pre-screening Network (DB-LIPN), combines self-distillation and asymmetric distillation strategies, and introduces a Chebyshev-polynomial-based Kolmogorov-Arnold (CKA) classifier. Result: On four datasets including Camelyon16, AHDMIL improves accuracy, AUC, F1 score, and Brier score, with average inference speedups of 1.2 to 2.1 times. Conclusion: Through a two-step training process and the new CKA classifier, AHDMIL achieves fast and accurate classification in computational pathology, and experiments show it outperforms existing methods on multiple public datasets. Abstract: Although multi-instance learning (MIL) has succeeded in pathological image classification, it faces the challenge of high inference costs due to the need to process thousands of patches from each gigapixel whole slide image (WSI). To address this, we propose AHDMIL, an Asymmetric Hierarchical Distillation Multi-Instance Learning framework that enables fast and accurate classification by eliminating irrelevant patches through a two-step training process. AHDMIL comprises two key components: the Dynamic Multi-Instance Network (DMIN), which operates on high-resolution WSIs, and the Dual-Branch Lightweight Instance Pre-screening Network (DB-LIPN), which analyzes corresponding low-resolution counterparts. In the first step, self-distillation (SD), DMIN is trained for WSI classification while generating per-instance attention scores to identify irrelevant patches. These scores guide the second step, asymmetric distillation (AD), where DB-LIPN learns to predict the relevance of each low-resolution patch. The relevant patches predicted by DB-LIPN have spatial correspondence with patches in high-resolution WSIs, which are used for fine-tuning and efficient inference of DMIN. In addition, we design the first Chebyshev-polynomial-based Kolmogorov-Arnold (CKA) classifier in computational pathology, which improves classification performance through learnable activation layers. Extensive experiments on four public datasets demonstrate that AHDMIL consistently outperforms previous state-of-the-art methods in both classification performance and inference speed. 
For example, on the Camelyon16 dataset, it achieves a relative improvement of 5.3% in accuracy and accelerates inference by 1.2×. Across all datasets, area under the curve (AUC), accuracy, F1 score, and Brier score show consistent gains, with average inference speedups ranging from 1.2 to 2.1 times. The code is available.
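
The "learnable activation layers" of a Chebyshev-based Kolmogorov-Arnold classifier can be pictured as a weighted sum of Chebyshev polynomials with trainable coefficients. The sketch below shows only that building block under stated assumptions: squashing the input with `tanh` to stay in the polynomials' natural domain, and the function names, are illustrative choices and not taken from the AHDMIL code.

```python
import math


def chebyshev_basis(x, degree):
    """Return [T_0(x), ..., T_degree(x)] via T_{k+1} = 2x T_k - T_{k-1}."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]


def chebyshev_activation(x, coeffs):
    """Learnable 1-D activation: sum_k coeffs[k] * T_k(tanh(x)).

    tanh maps the input into [-1, 1], where Chebyshev polynomials are
    bounded; the coefficients would be trained like any other weight.
    """
    z = math.tanh(x)
    basis = chebyshev_basis(z, len(coeffs) - 1)
    return sum(c * t for c, t in zip(coeffs, basis))


basis_at_half = chebyshev_basis(0.5, 3)
identity_like = chebyshev_activation(0.0, [0.0, 1.0, 0.0])
```

In a full layer, each input-output edge would carry its own coefficient vector, so the shape of the non-linearity is learned per connection.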

[80] Latent Expression Generation for Referring Image Segmentation and Grounding

Seonghoon Yu,Joonbeom Hong,Joonseok Lee,Jeany Son

Main category: cs.CV

TL;DR: This paper proposes a visual grounding framework that generates multiple latent expressions to better align rich visual details with textual descriptions, improving performance on referring image segmentation and comprehension tasks.

Details Motivation: Existing methods rely on a single textual input, which fails to capture the rich visual details necessary for accurately identifying target objects in visual grounding tasks. Method: The framework introduces subject distributor and visual concept injector modules to enrich latent representations with shared-subject and distinct-attributes concepts, along with a positive-margin contrastive learning strategy to align latent expressions with the original text. Result: The method achieves superior performance on RIS and REC benchmarks and outstanding results on the GRES benchmark, demonstrating its effectiveness and generalization ability. Conclusion: The proposed visual grounding framework outperforms state-of-the-art methods on multiple benchmarks, proving its effectiveness in capturing target-specific visual cues and handling complex visual grounding tasks. Abstract: Visual grounding tasks, such as referring image segmentation (RIS) and referring expression comprehension (REC), aim to localize a target object based on a given textual description. The target object in an image can be described in multiple ways, reflecting diverse attributes such as color, position, and more. However, most existing methods rely on a single textual input, which captures only a fraction of the rich information available in the visual domain. This mismatch between rich visual details and sparse textual cues can lead to the misidentification of similar objects. To address this, we propose a novel visual grounding framework that leverages multiple latent expressions generated from a single textual input by incorporating complementary visual details absent from the original description. Specifically, we introduce subject distributor and visual concept injector modules to embed both shared-subject and distinct-attributes concepts into the latent representations, thereby capturing unique and target-specific visual cues. 
We also propose a positive-margin contrastive learning strategy to align all latent expressions with the original text while preserving subtle variations. Experimental results show that our method not only outperforms state-of-the-art RIS and REC approaches on multiple benchmarks but also achieves outstanding performance on the generalized referring expression segmentation (GRES) benchmark.

[81] FedGIN: Federated Learning with Dynamic Global Intensity Non-linear Augmentation for Organ Segmentation using Multi-modal Images

Sachin Dudda Nagaraju,Ashkan Moradi,Bendik Skarre Abrahamsen,Mattijs Elschot

Main category: cs.CV

TL;DR: FedGIN is a privacy-preserving federated learning framework that enables effective multimodal medical image segmentation by harmonizing intensity distributions across modalities.

Details Motivation: Medical image segmentation is crucial for diagnostics and treatment, but variability across imaging modalities and privacy restrictions hinder model generalization and data sharing. A unified, privacy-preserving framework is needed to overcome these challenges. Method: FedGIN is a Federated Learning framework with a Global Intensity Non-linear (GIN) augmentation module that harmonizes modality-specific intensity distributions during local training without sharing raw patient data. Result: FedGIN improved 3D Dice scores by 12–18% on MRI test cases in the limited-data scenario compared to FL without GIN. In the complete dataset scenario, it achieved a 30% Dice improvement over MRI-only and a 10% improvement over CT-only baselines. Conclusion: FedGIN demonstrates near-centralized performance in multimodal organ segmentation while preserving privacy, achieving significant improvements over modality-specific baselines. Abstract: Medical image segmentation plays a crucial role in AI-assisted diagnostics, surgical planning, and treatment monitoring. Accurate and robust segmentation models are essential for enabling reliable, data-driven clinical decision making across diverse imaging modalities. Given the inherent variability in image characteristics across modalities, developing a unified model capable of generalizing effectively to multiple modalities would be highly beneficial. This model could streamline clinical workflows and reduce the need for modality-specific training. However, real-world deployment faces major challenges, including data scarcity, domain shift between modalities (e.g., CT vs. MRI), and privacy restrictions that prevent data sharing. To address these issues, we propose FedGIN, a Federated Learning (FL) framework that enables multimodal organ segmentation without sharing raw patient data. 
Our method integrates a lightweight Global Intensity Non-linear (GIN) augmentation module that harmonizes modality-specific intensity distributions during local training. We evaluated FedGIN using two types of datasets: an imputed dataset and a complete dataset. In the limited dataset scenario, the model was initially trained using only MRI data, and CT data was added to assess its performance improvements. In the complete dataset scenario, both MRI and CT data were fully utilized for training on all clients. In the limited-data scenario, FedGIN achieved a 12 to 18% improvement in 3D Dice scores on MRI test cases compared to FL without GIN and consistently outperformed local baselines. In the complete dataset scenario, FedGIN demonstrated near-centralized performance, with a 30% Dice score improvement over the MRI-only baseline and a 10% improvement over the CT-only baseline, highlighting its strong cross-modality generalization under privacy constraints.
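
Two ingredients of this setup lend themselves to a short sketch: the Dice score used to report results, and server-side aggregation of client models. The Dice formula is standard; the aggregation shown is plain size-weighted averaging in the style of FedAvg, which is an assumption here because the abstract does not state which aggregation rule FedGIN uses. Masks and parameters are flattened to lists for brevity.

```python
def dice_score(pred, target):
    """Dice coefficient between two flattened binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def federated_average(client_params, client_sizes):
    """Size-weighted average of client parameter vectors (FedAvg-style).

    Only parameters leave each client; raw images never do.
    """
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(n_params)
    ]


# Hypothetical example: one MRI client with 1 scan, one CT client with 3 scans.
global_params = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
dice = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

An intensity augmentation such as GIN would run inside each client's local training loop, before the parameters are sent for aggregation.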

[82] Deep Learning-based Animal Behavior Analysis: Insights from Mouse Chronic Pain Models

Yu-Hsi Chen,Wei-Hsin Chen,Chien-Yao Wang,Hong-Yuan Mark Liao,James C. Liao,Chien-Chang Chen

Main category: cs.CV

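TL;DR: This study proposes a framework for assessing chronic pain behavior in mice without manually labeled behaviors; it outperforms human experts and existing techniques on several pain classification tasks and reveals differences in drug efficacy across pain types.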

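Details Motivation: Existing methods for assessing chronic pain in mice rely mainly on manually labeled behavioral features, yet humans lack a clear understanding of which behaviors best represent chronic pain. As a result, existing methods struggle to accurately capture the insidious and persistent behavioral changes of chronic pain. Method: The study uses a universal action space projector to automatically extract mouse action features, avoiding the bias that manual labeling can introduce. It also collects a mouse pain behavior dataset covering the progression of neuropathic and inflammatory pain across multiple time points. Result: On a 15-class pain classification task, the method reaches 48.41% accuracy, significantly outperforming human experts (21.33%) and the widely used B-SOiD method (30.52%). When the classification is simplified to three categories (neuropathic pain, inflammatory pain, and no pain), accuracy reaches 73.1%, also clearly higher than human experts (48%) and B-SOiD (58.43%). In zero-shot Gabapentin drug testing, the method reveals differences in drug efficacy across pain types, consistent with past drug efficacy literature. Conclusion: The study presents a new framework for assessing chronic pain behavior in mice that does not depend on manually labeled behaviors and automatically extracts pain-related behavioral features, offering potential clinical value for pain research and related drug development. Abstract: Assessing chronic pain behavior in mice is critical for preclinical studies. However, existing methods mostly rely on manual labeling of behavioral features, and humans lack a clear understanding of which behaviors best represent chronic pain. For this reason, existing methods struggle to accurately capture the insidious and persistent behavioral changes in chronic pain. This study proposes a framework to automatically discover features related to chronic pain without relying on human-defined action labels. Our method uses a universal action space projector to automatically extract mouse action features, and avoids the potential bias of human labeling by retaining the rich behavioral information in the original video. In this paper, we also collected a mouse pain behavior dataset that captures the disease progression of both neuropathic and inflammatory pain across multiple time points. Our method achieves 48.41% accuracy in a 15-class pain classification task, significantly outperforming human experts (21.33%) and the widely used method B-SOiD (30.52%). Furthermore, when the classification is simplified to only three categories, i.e., neuropathic pain, inflammatory pain, and no pain, then our method achieves an accuracy of 73.1%, which is notably higher than that of human experts (48%) and B-SOiD (58.43%). 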
Finally, our method revealed differences in drug efficacy for different types of pain on zero-shot Gabapentin drug testing, and the results were consistent with past drug efficacy literature. This study demonstrates the potential clinical application of our method, which can provide new insights into pain research and related drug development.

[83] Rotation Equivariant Arbitrary-scale Image Super-Resolution

Qi Xie,Jiahong Fu,Zongben Xu,Deyu Meng

Main category: cs.CV

TL;DR: This paper proposes a rotation-equivariant ASISR method by redesigning encoder and INR modules, ensuring end-to-end rotational equivariance for better preservation of geometric patterns in high-resolution image recovery.

Details Motivation: ASISR faces challenges in preserving geometric patterns like edges and textures due to their deformation in low-resolution images, which leads to artifacts in high-resolution recoveries. Rotation equivariance is necessary to maintain structural integrity and original orientations. Method: The method involves redesigning the encoder and INR modules of existing ASISR networks to incorporate intrinsic rotation equivariance, enabling end-to-end rotational equivariance from input to output. Result: The proposed rotation-equivariant ASISR method demonstrates superior performance on simulated and real datasets, and can be integrated into existing ASISR methods in a plug-and-play manner. Conclusion: The proposed method successfully introduces end-to-end rotation equivariance in ASISR networks, enhancing the structural integrity and orientation preservation of recovered high-resolution images. It is theoretically analyzed and experimentally validated on simulated and real datasets. Abstract: The arbitrary-scale image super-resolution (ASISR), a recent popular topic in computer vision, aims to achieve arbitrary-scale high-resolution recoveries from a low-resolution input image. This task is realized by representing the image as a continuous implicit function through two fundamental modules, a deep-network-based encoder and an implicit neural representation (INR) module. Despite achieving notable progress, a crucial challenge of such a highly ill-posed setting is that many common geometric patterns, such as repetitive textures, edges, or shapes, are seriously warped and deformed in the low-resolution images, naturally leading to unexpected artifacts appearing in their high-resolution recoveries. 
Embedding rotation equivariance into the ASISR network is thus necessary, as it has been widely demonstrated that this enhancement enables the recovery to faithfully maintain the original orientations and structural integrity of geometric patterns underlying the input image. Motivated by this, we make efforts to construct a rotation equivariant ASISR method in this study. Specifically, we elaborately redesign the basic architectures of INR and encoder modules, incorporating intrinsic rotation equivariance capabilities beyond those of conventional ASISR networks. Through such amelioration, the ASISR network can, for the first time, be implemented with end-to-end rotational equivariance maintained from input to output. We also provide a solid theoretical analysis to evaluate its intrinsic equivariance error, demonstrating its inherent nature of embedding such an equivariance structure. The superiority of the proposed method is substantiated by experiments conducted on both simulated and real datasets. We also validate that the proposed framework can be readily integrated into current ASISR methods in a plug-and-play manner to further enhance their performance.
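
Rotation equivariance means that rotating the input and then applying the operator gives the same result as applying the operator and then rotating the output. The sketch below measures that gap for two toy image operators under 90-degree rotations. It only illustrates the property and the notion of an equivariance error; the operators, names, and zero padding are hypothetical and unrelated to the paper's encoder or INR design.

```python
def rot90(img):
    """Rotate a 2-D list of rows by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]


def neighbor_mean(img):
    """Average of the four axis-aligned neighbours (zero padding): isotropic."""
    h, w = len(img), len(img[0])

    def at(i, j):
        return img[i][j] if 0 <= i < h and 0 <= j < w else 0

    return [[(at(i - 1, j) + at(i + 1, j) + at(i, j - 1) + at(i, j + 1)) / 4
             for j in range(w)] for i in range(h)]


def horizontal_diff(img):
    """Forward difference along rows (zero padding): has a preferred direction."""
    w = len(img[0])
    return [[(row[j + 1] if j + 1 < w else 0) - row[j] for j in range(w)]
            for row in img]


def equivariance_error(op, img):
    """Largest absolute gap between op(rot90(img)) and rot90(op(img))."""
    left = op(rot90(img))
    right = rot90(op(img))
    return max(abs(a - b) for ra, rb in zip(left, right) for a, b in zip(ra, rb))


img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
err_isotropic = equivariance_error(neighbor_mean, img)
err_directional = equivariance_error(horizontal_diff, img)
```

The isotropic operator commutes with the rotation, while the directional one does not, which is the kind of discrepancy an equivariant architecture is designed to remove.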

[84] X-MoGen: Unified Motion Generation across Humans and Animals

Xuan Wang,Kai Ruan,Liyang Qian,Zhizhi Guo,Chang Su,Gaoang Wang

Main category: cs.CV

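TL;DR: X-MoGen is the first unified framework for cross-species text-driven motion generation; with a two-stage architecture and the large-scale UniMo4D dataset, it outperforms existing methods on both seen and unseen species.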

Details Motivation: Existing methods typically model human and animal motion separately, whereas a cross-species approach offers a unified representation and improved generalization. Method: X-MoGen adopts a two-stage architecture that includes a conditional graph variational autoencoder and an autoencoder, and uses a morphological consistency module during training. Result: Experiments on the UniMo4D dataset show that X-MoGen outperforms existing methods on both seen and unseen species. Conclusion: X-MoGen is a unified framework for cross-species text-driven motion generation that, through its two-stage architecture and the UniMo4D dataset, outperforms existing methods on both seen and unseen species. Abstract: Text-driven motion generation has attracted increasing attention due to its broad applications in virtual reality, animation, and robotics. While existing methods typically model human and animal motion separately, a joint cross-species approach offers key advantages, such as a unified representation and improved generalization. However, morphological differences across species remain a key challenge, often compromising motion plausibility. To address this, we propose X-MoGen, the first unified framework for cross-species text-driven motion generation covering both humans and animals. X-MoGen adopts a two-stage architecture. First, a conditional graph variational autoencoder learns canonical T-pose priors, while an autoencoder encodes motion into a shared latent space regularized by morphological loss. In the second stage, we perform masked motion modeling to generate motion embeddings conditioned on textual descriptions. During training, a morphological consistency module is employed to promote skeletal plausibility across species. To support unified modeling, we construct UniMo4D, a large-scale dataset of 115 species and 119k motion sequences, which integrates human and animal motions under a shared skeletal topology for joint training. Extensive experiments on UniMo4D demonstrate that X-MoGen outperforms state-of-the-art methods on both seen and unseen species.

[85] PhysPatch: A Physically Realizable and Transferable Adversarial Patch Attack for Multimodal Large Language Models-based Autonomous Driving Systems

Qi Guo,Xiaojun Jia,Shanmin Pang,Simeng Qin,Lin Wang,Ju Jia,Yang Liu,Qing Guo

Main category: cs.CV

TL;DR: PhysPatch is a new adversarial patch framework for MLLM-based autonomous driving systems that improves attack effectiveness and real-world applicability by jointly optimizing patch location, shape, and content.

Details Motivation: MLLMs are increasingly used in autonomous driving systems but are vulnerable to adversarial patch attacks. Existing patch-based attacks designed for object detection models do not transfer well to MLLMs due to their complex architectures and reasoning capabilities. Method: PhysPatch jointly optimizes patch location, shape, and content, using a semantic-based mask initialization strategy, an SVD-based local alignment loss with patch-guided crop-resize, and a potential field-based mask refinement method. Result: PhysPatch significantly outperforms prior methods in steering MLLM-based AD systems toward target-aligned perception and planning outputs while ensuring patches are placed in physically feasible regions. Conclusion: PhysPatch is a powerful and physically realizable adversarial patch framework specifically designed for MLLM-based autonomous driving systems, significantly outperforming previous methods in attack effectiveness and real-world applicability. Abstract: Multimodal Large Language Models (MLLMs) are becoming integral to autonomous driving (AD) systems due to their strong vision-language reasoning capabilities. However, MLLMs are vulnerable to adversarial attacks, particularly adversarial patch attacks, which can pose serious threats in real-world scenarios. Existing patch-based attack methods are primarily designed for object detection models and perform poorly when transferred to MLLM-based systems due to the latter's complex architectures and reasoning abilities. To address these limitations, we propose PhysPatch, a physically realizable and transferable adversarial patch framework tailored for MLLM-based AD systems. PhysPatch jointly optimizes patch location, shape, and content to enhance attack effectiveness and real-world applicability. 
It introduces a semantic-based mask initialization strategy for realistic placement, an SVD-based local alignment loss with patch-guided crop-resize to improve transferability, and a potential field-based mask refinement method. Extensive experiments across open-source, commercial, and reasoning-capable MLLMs demonstrate that PhysPatch significantly outperforms prior methods in steering MLLM-based AD systems toward target-aligned perception and planning outputs. Moreover, PhysPatch consistently places adversarial patches in physically feasible regions of AD scenes, ensuring strong real-world applicability and deployability.

[86] Multi-tracklet Tracking for Generic Targets with Adaptive Detection Clustering

Zewei Wu,Longhao Wang,Cui Wang,César Teixeira,Wei Ke,Zhang Xiong

Main category: cs.CV

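TL;DR: This paper proposes MTT, a multi-object tracking framework that generates robust tracklets and associates them using multiple cues, addressing low-confidence detections, weak constraints, and occlusion, and performs well on benchmarks.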

Details Motivation: Existing vision-based multi-object tracking methods often perform poorly on target categories unseen in real-world scenes because of low-confidence detections, weak motion and appearance constraints, and long-term occlusions, so a more robust tracking framework is needed. Method: The proposed tracklet-enhanced tracker, Multi-Tracklet Tracking (MTT), first adaptively clusters detections by their short-term spatio-temporal correlation into robust tracklets, then estimates the best tracklet partitions using multiple cues such as location and appearance. Result: Through flexible tracklet generation and a multi-cue association strategy, MTT mitigates error propagation in long-term association and performs well on multi-object tracking benchmarks. Conclusion: The MTT framework is competitive in experiments on a generic multiple object tracking benchmark and handles low-confidence detections, weak motion and appearance constraints, and long-term occlusions. Abstract: Tracking specific targets, such as pedestrians and vehicles, has been the focus of recent vision-based multitarget tracking studies. However, in some real-world scenarios, unseen categories often challenge existing methods due to low-confidence detections, weak motion and appearance constraints, and long-term occlusions. To address these issues, this article proposes a tracklet-enhanced tracker called Multi-Tracklet Tracking (MTT) that integrates flexible tracklet generation into a multi-tracklet association framework. This framework first adaptively clusters the detection results according to their short-term spatio-temporal correlation into robust tracklets and then estimates the best tracklet partitions using multiple clues, such as location and appearance over time, to mitigate error propagation in long-term association. Finally, extensive experiments on the benchmark for generic multiple object tracking demonstrate the competitiveness of the proposed framework.
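
The first step, grouping detections into short tracklets, can be approximated by a very simple rule. The sketch below greedily links boxes in consecutive frames by intersection-over-union (IoU). This is a deliberately simplified stand-in, not MTT's adaptive clustering: the threshold, the greedy matching, and the function names are all assumptions, and no appearance cue is used.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def build_tracklets(frames, iou_threshold=0.5):
    """Link detections across consecutive frames into tracklets.

    frames: list of per-frame box lists.
    Returns a list of tracklets, each a list of (frame_index, box).
    A tracklet ends as soon as it finds no match in the next frame.
    """
    tracklets = []
    active = []  # indices of tracklets extended in the previous frame
    for t, boxes in enumerate(frames):
        next_active, used = [], set()
        for box in boxes:
            best, best_score = None, iou_threshold
            for idx in active:
                if idx in used:
                    continue
                score = iou(tracklets[idx][-1][1], box)
                if score >= best_score:
                    best, best_score = idx, score
            if best is None:
                tracklets.append([(t, box)])
                next_active.append(len(tracklets) - 1)
            else:
                tracklets[best].append((t, box))
                used.add(best)
                next_active.append(best)
        active = next_active
    return tracklets


# A moving object seen in every frame, and a static one missed in frame 1.
frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 0, 11, 10)],
    [(2, 0, 12, 10), (50, 50, 60, 60)],
]
tracklets = build_tracklets(frames)
```

The missed detection splits the static object into two tracklets, which is exactly the kind of fragment a later tracklet-association stage would need to merge using location and appearance cues.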

[87] MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs

Yufei Gao,Jiaying Fei,Nuo Chen,Ruirui Chen,Guohang Yan,Yunshi Lan,Botian Shi

Main category: cs.CV

TL;DR: This study addresses the limitations of Multimodal Large Language Models (MLLMs) in low-resource languages by introducing MELLA, a dataset designed to enhance both linguistic and cultural understanding, leading to improved model performance.

Details Motivation: The motivation stems from the observation that current MLLMs perform poorly in low-resource languages due to limited multilingual enhancement methods that focus only on text modality or machine translation, neglecting the importance of cultural groundedness and multimodal informativeness. Method: The researchers proposed a dual-source strategy to collect data for linguistic and cultural enhancement, and introduced MELLA, a multimodal, multilingual dataset. The dataset was used for fine-tuning MLLMs, and its effectiveness was evaluated across eight languages on various MLLM backbones. Result: After fine-tuning on the MELLA dataset, there was a general performance improvement in various MLLM backbones for eight languages, with models producing richer 'thick descriptions'. The improvements were attributed to both linguistic capability and cultural knowledge enhancement. Conclusion: The study concludes that enhancing both linguistic capability and cultural groundedness significantly improves the effectiveness of Multimodal Large Language Models (MLLMs) in low-resource language settings. Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable performance in high-resource languages. However, their effectiveness diminishes significantly in the contexts of low-resource languages. Current multilingual enhancement methods are often limited to text modality or rely solely on machine translation. While such approaches help models acquire basic linguistic capabilities and produce "thin descriptions", they neglect the importance of multimodal informativeness and cultural groundedness, both of which are crucial for serving low-resource language users effectively. To bridge this gap, in this study, we identify two significant objectives for a truly effective MLLM in low-resource language settings, namely 1) linguistic capability and 2) cultural groundedness, placing special emphasis on cultural awareness. 
To achieve these dual objectives, we propose a dual-source strategy that guides the collection of data tailored to each goal, sourcing native web alt-text for culture and MLLM-generated captions for linguistics. As a concrete implementation, we introduce MELLA, a multimodal, multilingual dataset. Experiment results show that after fine-tuning on MELLA, there is a general performance improvement for the eight languages on various MLLM backbones, with models producing "thick descriptions". We verify that the performance gains are from both cultural knowledge enhancement and linguistic capability enhancement. Our dataset can be found at https://opendatalab.com/applyMultilingualCorpus.

[88] SPA++: Generalized Graph Spectral Alignment for Versatile Domain Adaptation

Zhiqing Xiao,Haobo Wang,Xu Lu,Wentao Ye,Gang Chen,Junbo Zhao

Main category: cs.CV

TL;DR: This paper introduces SPA++, a graph-based domain adaptation method that improves transferability and discriminability through spectral alignment, propagation mechanisms, and data augmentation, outperforming existing approaches.

Details Motivation: Most prior domain adaptation works focus on inter-domain transferability while neglecting intra-domain structures, leading to poor discriminability. This work addresses this tradeoff. Method: The paper proposes a graph spectral alignment framework called SPA++ that includes a coarse graph alignment mechanism with a spectral regularizer, a fine-grained neighbor-aware propagation mechanism, and incorporates data augmentation and consistency regularization. Result: Extensive experiments show that SPA++ consistently outperforms state-of-the-art methods, especially in challenging domain adaptation and distribution scenarios. Conclusion: The proposed method, SPA++, outperforms existing methods in domain adaptation by incorporating graph alignment, neighbor-aware propagation, and data augmentation, leading to improved robustness and adaptability. Abstract: Domain Adaptation (DA) aims to transfer knowledge from a labeled source domain to an unlabeled or sparsely labeled target domain under domain shifts. Most prior works focus on capturing the inter-domain transferability but largely overlook rich intra-domain structures, which empirically results in even worse discriminability. To tackle this tradeoff, we propose a generalized graph SPectral Alignment framework, SPA++. Its core is briefly condensed as follows: (1)-by casting the DA problem to graph primitives, it composes a coarse graph alignment mechanism with a novel spectral regularizer toward aligning the domain graphs in eigenspaces; (2)-we further develop a fine-grained neighbor-aware propagation mechanism for enhanced discriminability in the target domain; (3)-by incorporating data augmentation and consistency regularization, SPA++ can adapt to complex scenarios including most DA settings and even challenging distribution scenarios. 
Furthermore, we also provide theoretical analysis to support our method, including the generalization bound of graph-based DA and the role of spectral alignment and smoothing consistency. Extensive experiments on benchmark datasets demonstrate that SPA++ consistently outperforms existing cutting-edge methods, achieving superior robustness and adaptability across various challenging adaptation scenarios.

[89] Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision

Luozheng Qin,Jia Gong,Yuqing Sun,Tianjiao Li,Mengping Yang,Xiaomeng Yang,Chao Qu,Zhiyu Tan,Hao Li

Main category: cs.CV

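TL;DR: This paper proposes Uni-CoT, a unified chain-of-thought framework that addresses visual state transitions in vision-language reasoning tasks. Through a two-level reasoning paradigm and a structured training method, it achieves efficient, coherent multimodal reasoning and state-of-the-art performance on multiple benchmarks.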

Details Motivation: Existing vision-language reasoning methods are limited either in their capacity to model visual state transitions or by incoherent visual trajectories caused by fragmented architectures, so a more coherent, unified reasoning framework is needed. Method: The paper proposes Uni-CoT, a unified chain-of-thought framework that uses a model capable of both image understanding and generation to reason over visual content, and introduces a two-level reasoning paradigm: a macro-level CoT for high-level task planning and a micro-level CoT for subtask execution. Result: Uni-CoT achieves SOTA performance on the WISE image generation benchmark and the RISE and KRIS editing benchmarks, and the experiments can be completed efficiently using only 8 A100 GPUs. Conclusion: Uni-CoT is a unified chain-of-thought framework for multimodal reasoning that enables scalable, coherent vision-language reasoning and demonstrates state-of-the-art performance and strong generalization on multiple benchmarks. Abstract: Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. However, extending CoT to vision-language reasoning tasks remains challenging, as it often requires interpreting transitions of visual states to support reasoning. Existing methods often struggle with this due to limited capacity for modeling visual state transitions or incoherent visual trajectories caused by fragmented architectures. To overcome these limitations, we propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning within a single unified model. The key idea is to leverage a model capable of both image understanding and generation to reason over visual content and model evolving visual states. However, empowering a unified model to achieve that is non-trivial, given the high computational cost and the burden of training. To address this, Uni-CoT introduces a novel two-level reasoning paradigm: a Macro-Level CoT for high-level task planning and a Micro-Level CoT for subtask execution. This design significantly reduces the computational overhead. Furthermore, we introduce a structured training paradigm that combines interleaved image-text supervision for macro-level CoT with multi-task objectives for micro-level CoT. Together, these innovations allow Uni-CoT to perform scalable and coherent multi-modal reasoning. Furthermore, thanks to our design, all experiments can be efficiently completed using only 8 A100 GPUs with 80GB VRAM each. 
Experimental results on the reasoning-driven image generation benchmark (WISE) and editing benchmarks (RISE and KRIS) indicate that Uni-CoT demonstrates SOTA performance and strong generalization, establishing Uni-CoT as a promising solution for multi-modal reasoning. Project Page and Code: https://sais-fuxi.github.io/projects/uni-cot/

[90] SPEX: A Vision-Language Model for Land Cover Extraction on Spectral Remote Sensing Images

Dongchen Si,Di Wang,Erzhong Gao,Xiaolei Qin,Liu Zhao,Jing Zhang,Minqiang Xu,Jianbo Zhan,Jianshe Wang,Lin Liu,Bo Du,Liangpei Zhang

Main category: cs.CV

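TL;DR: This paper proposes SPEX, a multimodal vision-language model dedicated to land cover extraction from spectral remote sensing images. By constructing the vision-language dataset SPIE and introducing several new components and training strategies, it outperforms existing methods and can generate textual explanations for its predictions.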

Details Motivation: Although many vision-language models have been developed for pixel-level interpretation, spectral information remains underutilized, leading to suboptimal performance in multispectral scenarios. Method: The authors construct SPIE, a vision-language instruction-following dataset based on classical spectral index computations, and propose the SPEX model, whose components and training strategies include multiscale feature aggregation, token context condensation, and multispectral visual pre-training. Result: Extensive experiments on five public multispectral datasets show that SPEX consistently outperforms existing state-of-the-art methods in extracting typical land cover categories such as vegetation, buildings, and water bodies. Conclusion: According to the authors, SPEX is the first multimodal vision-language model dedicated to land cover extraction in spectral remote sensing imagery; it can generate textual explanations for its predictions, improving interpretability and user-friendliness. Abstract: Spectral information has long been recognized as a critical cue in remote sensing observations. Although numerous vision-language models have been developed for pixel-level interpretation, spectral information remains underutilized, resulting in suboptimal performance, particularly in multispectral scenarios. To address this limitation, we construct a vision-language instruction-following dataset named SPIE, which encodes spectral priors of land-cover objects into textual attributes recognizable by large language models (LLMs), based on classical spectral index computations. Leveraging this dataset, we propose SPEX, a multimodal LLM designed for instruction-driven land cover extraction. To this end, we introduce several carefully designed components and training strategies, including multiscale feature aggregation, token context condensation, and multispectral visual pre-training, to achieve precise and flexible pixel-level interpretation. To the best of our knowledge, SPEX is the first multimodal vision-language model dedicated to land cover extraction in spectral remote sensing imagery. Extensive experiments on five public multispectral datasets demonstrate that SPEX consistently outperforms existing state-of-the-art methods in extracting typical land cover categories such as vegetation, buildings, and water bodies. Moreover, SPEX is capable of generating textual explanations for its predictions, thereby enhancing interpretability and user-friendliness. Code will be released at: https://github.com/MiliLab/SPEX.
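
The abstract states that SPIE encodes spectral priors through classical spectral index computations, but it does not list which indices are used. As a rough illustration only, the sketch below computes two well-known indices, NDVI and the McFeeters NDWI, from per-pixel reflectances and turns them into a short text attribute. The function names, the thresholds, and the attribute wording are hypothetical and are not taken from SPIE or SPEX.

```python
def normalized_difference(a: float, b: float) -> float:
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    denom = a + b
    return 0.0 if denom == 0 else (a - b) / denom


def describe_pixel(green: float, red: float, nir: float) -> str:
    """Turn band reflectances into a text attribute (illustrative thresholds)."""
    ndvi = normalized_difference(nir, red)    # NDVI = (NIR - Red) / (NIR + Red)
    ndwi = normalized_difference(green, nir)  # NDWI = (Green - NIR) / (Green + NIR)
    if ndvi > 0.3:        # assumed cut-off, not from the paper
        label = "vegetation-like"
    elif ndwi > 0.0:      # assumed cut-off, not from the paper
        label = "water-like"
    else:
        label = "other"
    return f"NDVI={ndvi:.2f}, NDWI={ndwi:.2f}: {label}"


# High near-infrared and low red reflectance is typical of healthy vegetation.
print(describe_pixel(green=0.05, red=0.04, nir=0.50))
# Water absorbs near-infrared strongly, so NDWI turns positive.
print(describe_pixel(green=0.10, red=0.08, nir=0.02))
```

A string like this is one plausible way to expose spectral cues to a language model as text; how SPIE actually phrases its attributes is not specified in the abstract.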

[91] Test-Time Reinforcement Learning for GUI Grounding via Region Consistency

Yong Du,Yuchen Yan,Fei Tang,Zhengxi Lu,Chang Zong,Weiming Lu,Shengpei Jiang,Yongliang Shen

Main category: cs.CV

TL;DR: The paper proposes GUI-RC and GUI-RCPO methods to enhance GUI grounding accuracy without additional training, achieving better performance through test-time scaling and reinforcement learning.

Details Motivation: The motivation is to improve the accuracy of GUI grounding without relying on extensive supervised training or labeled pixel-level annotations, which are costly and not always available. Method: The paper introduces GUI-RC and GUI-RCPO methods for improving GUI grounding accuracy. GUI-RC uses multiple predictions to create spatial voting grids that identify consensus regions, while GUI-RCPO transforms consistency patterns into rewards for reinforcement learning. Result: GUI-RC improved accuracy by 2-3% across different architectures on ScreenSpot benchmarks. GUI-RCPO further increased accuracy to 85.14% on ScreenSpot-v2 through self-supervised optimization. Conclusion: The paper concludes that GUI-RC and GUI-RCPO offer promising approaches for improving the accuracy of GUI grounding without requiring additional training, highlighting the potential of test-time scaling and reinforcement learning for developing more robust and data-efficient GUI agents. Abstract: Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. 
We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: GUI-RC boosts Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, while GUI-RCPO further improves it to 85.14% through self-supervised optimization. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
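
The spatial voting idea behind GUI-RC can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes predictions are axis-aligned boxes in pixel coordinates, takes the bounding box of the highest-vote pixels as the consensus region, and defines a hypothetical `consistency_reward` as the mean vote inside a box scaled to [0, 1]. The paper's actual region extraction and reward may differ.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels; x2 and y2 are exclusive


def vote_grid(boxes: List[Box], width: int, height: int) -> List[List[int]]:
    """Count, for every pixel, how many sampled boxes cover it."""
    grid = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                grid[y][x] += 1
    return grid


def consensus_box(boxes: List[Box], width: int, height: int) -> Box:
    """Bounding box of the pixels with the highest vote count (a simplification)."""
    grid = vote_grid(boxes, width, height)
    top = max(max(row) for row in grid)
    if top == 0:
        raise ValueError("no prediction falls inside the screen")
    xs = [x for row in grid for x, v in enumerate(row) if v == top]
    ys = [y for y, row in enumerate(grid) if top in row]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def consistency_reward(box: Box, grid: List[List[int]], num_samples: int) -> float:
    """Hypothetical reward: mean vote inside `box`, scaled to [0, 1]."""
    x1, y1, x2, y2 = box
    cells = [grid[y][x]
             for y in range(max(0, y1), min(len(grid), y2))
             for x in range(max(0, x1), min(len(grid[0]), x2))]
    return sum(cells) / (len(cells) * num_samples) if cells else 0.0


# Three sampled predictions for the same GUI element on a 100x50 screen.
samples = [(10, 10, 50, 30), (20, 15, 60, 35), (15, 12, 55, 32)]
print(consensus_box(samples, width=100, height=50))  # (20, 15, 50, 30)
```

A prediction that sits entirely inside the region where all samples agree gets reward 1.0 under this definition, while one that extends into low-agreement areas scores lower, which is the kind of label-free signal test-time reinforcement learning could optimize.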

[92] EndoMatcher: Generalizable Endoscopic Image Matcher via Multi-Domain Pre-training for Robot-Assisted Surgery

Bingyu Yang,Qingyao Tian,Yimeng Geng,Huai Liao,Xinyan Huang,Jiebo Luo,Hongbin Liu

Main category: cs.CV

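TL;DR: EndoMatcher is a new endoscopic image matching method that achieves zero-shot generalization through large-scale multi-domain pre-training, addressing difficult visual conditions and data scarcity and substantially improving matching performance.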

Details Motivation: Endoscopic image matching is crucial for robot-assisted tasks but is challenged by difficult visual conditions and scarce annotated data, so a general matcher with zero-shot generalization is needed. Method: EndoMatcher uses a two-branch Vision Transformer to extract multi-scale features and dual interaction blocks to strengthen robust correspondence learning. The authors also construct Endo-Mix6, a multi-domain dataset of approximately 1.2M real and synthetic image pairs, and adopt a progressive multi-objective training strategy to improve cross-domain representation quality. Result: EndoMatcher increases the number of inlier matches by 140.69% and 201.43% on the Hamlyn and Bladder datasets, respectively, and improves Matching Direction Prediction Accuracy (MDPA) by 9.40% on the Gastro-Matching dataset. Conclusion: EndoMatcher is an endoscopic image matcher with zero-shot generalization that, through large-scale multi-domain pre-training, addresses difficult visual conditions and data scarcity in endoscopic image matching. Abstract: Generalizable dense feature matching in endoscopic images is crucial for robot-assisted tasks, including 3D reconstruction, navigation, and surgical scene understanding. Yet, it remains a challenge due to difficult visual conditions (e.g., weak textures, large viewpoint variations) and a scarcity of annotated data. To address these challenges, we propose EndoMatcher, a generalizable endoscopic image matcher via large-scale, multi-domain data pre-training. To address difficult visual conditions, EndoMatcher employs a two-branch Vision Transformer to extract multi-scale features, enhanced by dual interaction blocks for robust correspondence learning. To overcome data scarcity and improve domain diversity, we construct Endo-Mix6, the first multi-domain dataset for endoscopic matching. Endo-Mix6 consists of approximately 1.2M real and synthetic image pairs across six domains, with correspondence labels generated using Structure-from-Motion and simulated transformations. The diversity and scale of Endo-Mix6 introduce new challenges in training stability due to significant variations in dataset sizes, distribution shifts, and error imbalance. To address them, a progressive multi-objective training strategy is employed to promote balanced learning and improve representation quality across domains. This enables EndoMatcher to generalize across unseen organs and imaging conditions in a zero-shot fashion. 
Extensive zero-shot matching experiments demonstrate that EndoMatcher increases the number of inlier matches by 140.69% and 201.43% on the Hamlyn and Bladder datasets over state-of-the-art methods, respectively, and improves the Matching Direction Prediction Accuracy (MDPA) by 9.40% on the Gastro-Matching dataset, achieving dense and accurate matching under challenging endoscopic conditions. The code is publicly available at https://github.com/Beryl2000/EndoMatcher.

[93] VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization

Sihan Yang,Runsen Xu,Chenhang Cui,Tai Wang,Dahua Lin,Jiangmiao Pang

Main category: cs.CV

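TL;DR: VFlowOpt is a new visual token pruning framework that uses an importance map and a progressive pruning module to significantly reduce computational cost while maintaining performance.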

Details Motivation: Existing visual token pruning frameworks and strategies are overly simplistic and cause performance degradation, so more refined and thoroughly explored methods are needed. Method: The paper proposes the VFlowOpt framework, comprising an importance map derivation process and a progressive pruning module with a recycling mechanism, with hyperparameters optimized by a visual information flow-guided method. Result: Experiments show that VFlowOpt can prune 90% of visual tokens while maintaining comparable performance, reducing KV-Cache memory by 89% and speeding up inference by 3.8 times. Conclusion: VFlowOpt reduces visual tokens while maintaining performance, significantly lowering computational cost. Abstract: Large Multimodal Models (LMMs) excel in visual-language tasks by leveraging numerous visual tokens for fine-grained visual information, but this token redundancy results in significant computational costs. Previous research aimed at reducing visual tokens during inference typically leverages importance maps derived from attention scores among vision-only tokens or vision-language tokens to prune tokens across one or multiple pruning stages. Despite this progress, pruning frameworks and strategies remain simplistic and insufficiently explored, often resulting in substantial performance degradation. In this paper, we propose VFlowOpt, a token pruning framework that introduces an importance map derivation process and a progressive pruning module with a recycling mechanism. The hyperparameters of its pruning strategy are further optimized by a visual information flow-guided method. Specifically, we compute an importance map for image tokens based on their attention-derived context relevance and patch-level information entropy. We then decide which tokens to retain or prune and aggregate the pruned ones as recycled tokens to avoid potential information loss. Finally, we apply a visual information flow-guided method that regards the last token in the LMM as the most representative signal of text-visual interactions. This method minimizes the discrepancy between token representations in LMMs with and without pruning, thereby enabling superior pruning strategies tailored to different LMMs. Experiments demonstrate that VFlowOpt can prune 90% of visual tokens while maintaining comparable performance, leading to an 89% reduction in KV-Cache memory and 3.8 times faster inference.
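
The pruning-with-recycling step described in the abstract can be illustrated with a small sketch. This is not VFlowOpt's implementation: the linear mix of relevance and entropy controlled by `alpha`, the histogram-based `patch_entropy`, and the choice to merge all pruned tokens into a single score-weighted recycled token are assumptions made here for illustration. The paper tunes its own pruning hyperparameters with the visual information flow-guided method.

```python
import math
from typing import List


def patch_entropy(pixels: List[int], bins: int = 8) -> float:
    """Shannon entropy (in bits) of an intensity histogram over 0..255 values."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1
    n = len(pixels)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


def prune_with_recycling(tokens: List[List[float]], relevance: List[float],
                         entropy: List[float], keep_ratio: float = 0.1,
                         alpha: float = 0.5) -> List[List[float]]:
    """Keep the top-scoring tokens and merge the rest into one recycled token."""
    # Assumed importance score: a linear mix of the two cues.
    scores = [alpha * r + (1 - alpha) * e for r, e in zip(relevance, entropy)]
    k = max(1, round(len(tokens) * keep_ratio))
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep, drop = sorted(order[:k]), order[k:]
    kept = [tokens[i] for i in keep]  # retained tokens stay in original order
    if drop:
        total = sum(scores[i] for i in drop)
        weights = ([scores[i] / total for i in drop] if total > 0
                   else [1 / len(drop)] * len(drop))
        dim = len(tokens[0])
        # Score-weighted average of the pruned tokens, appended as one extra token.
        kept.append([sum(w * tokens[i][d] for w, i in zip(weights, drop))
                     for d in range(dim)])
    return kept


tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
out = prune_with_recycling(tokens, relevance=[0.9, 0.1, 0.3, 0.1],
                           entropy=[0.5, 0.1, 0.1, 0.3], keep_ratio=0.25)
print(len(out))  # one kept token plus one recycled token
```

Appending a recycled token keeps a compressed summary of the pruned content in the sequence, which is the stated purpose of the recycling mechanism: avoiding information loss when most visual tokens are removed.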

[94] Textual and Visual Guided Task Adaptation for Source-Free Cross-Domain Few-Shot Segmentation

Jianming Liu,Wenlong Qiu,Haitao Wei

Main category: cs.CV

TL;DR: This paper proposes a source-free cross-domain few-shot segmentation method that uses visual and textual information to improve adaptation without source data, achieving significant performance gains.

Details Motivation: The need for source-free CD-FSS approaches arises from data privacy concerns and the need to reduce data transfer and training costs. Method: The method uses Task-Specific Attention Adapters (TSAA) and aligns features using Visual-Visual Embedding Alignment (VVEA) and Text-Visual Embedding Alignment (TVEA) modules. Result: The approach achieves average improvements of 2.18% (1-shot) and 4.11% (5-shot) in segmentation accuracy across four cross-domain datasets. Conclusion: The proposed source-free CD-FSS method, leveraging textual and visual information, significantly improves cross-domain segmentation performance without source domain data. Abstract: Few-Shot Segmentation (FSS) aims to efficiently segment new objects with few labeled samples. However, its performance significantly degrades when domain discrepancies exist between training and deployment. Cross-Domain Few-Shot Segmentation (CD-FSS) is proposed to mitigate such performance degradation. Current CD-FSS methods primarily seek to develop segmentation models on a source domain capable of cross-domain generalization. However, driven by escalating concerns over data privacy and the imperative to minimize data transfer and training expenses, the development of source-free CD-FSS approaches has become essential. In this work, we propose a source-free CD-FSS method that leverages both textual and visual information to facilitate target domain task adaptation without requiring source domain data. Specifically, we first append Task-Specific Attention Adapters (TSAA) to the feature pyramid of a pretrained backbone, which adapt multi-level features extracted from the shared pre-trained backbone to the target task. Then, the parameters of the TSAA are trained through a Visual-Visual Embedding Alignment (VVEA) module and a Text-Visual Embedding Alignment (TVEA) module. 
The VVEA module utilizes global-local visual features to align image features across different views, while the TVEA module leverages textual priors from pre-aligned multi-modal features (e.g., from CLIP) to guide cross-modal adaptation. By combining the outputs of these modules through dense comparison operations and subsequent fusion via skip connections, our method produces refined prediction masks. Under both 1-shot and 5-shot settings, the proposed approach achieves average segmentation accuracy improvements of 2.18% and 4.11%, respectively, across four cross-domain datasets, significantly outperforming state-of-the-art CD-FSS methods. Code is available at https://github.com/ljm198134/TVGTANet.

[95] ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking

Xiao Wang,Liye Jin,Xufeng Lou,Shiao Wang,Lan Chen,Bo Jiang,Zhipeng Zhang

Main category: cs.CV

TL;DR: This paper introduces ReasoningTrack, a novel vision-language tracking framework that enhances performance by incorporating reasoning and language generation. It uses a pre-trained vision-language model, SFT, and reinforcement learning for optimization. A new benchmark dataset, TNLLT, is also proposed to advance research in this area.

Details Motivation: The motivation stems from the limitations of existing vision-language tracking methods, which either inadequately fuse language and vision features or fail to fully exploit large models' capabilities. The lack of interpretability and adaptability in current approaches further limits their performance, prompting the need for a more advanced framework. Method: The authors propose a reasoning-based vision-language tracking framework using a pre-trained vision-language model (Qwen2.5-VL). They employ SFT (Supervised Fine-Tuning) and GRPO (reinforcement learning) for optimizing reasoning and language generation. Updated language descriptions are combined with vision features and fed into a unified tracking backbone network, followed by a tracking head for object localization. They also introduce a new large-scale dataset, TNLLT, with 200 video sequences for evaluation. Result: Extensive experiments on multiple vision-language tracking benchmark datasets validate the effectiveness of the proposed reasoning-based natural language generation strategy. The TNLLT dataset provides a solid foundation for future research, and the source code is publicly available to promote further development in this field. Conclusion: The paper concludes that their proposed framework, ReasoningTrack, significantly improves vision-language tracking by incorporating reasoning and language generation strategies, as demonstrated by extensive experiments on benchmark datasets. Abstract: Vision-language tracking has received increasing attention in recent years, as textual information can effectively address the inflexibility and inaccuracy associated with specifying the target object to be tracked. Existing works either directly fuse the fixed language with vision features or simply modify using attention, however, their performance is still limited. 
Recently, some researchers have explored using text generation to adapt to the variations in the target during tracking. However, these works fail to provide insights into the model's reasoning process and do not fully leverage the advantages of large models, which further limits their overall performance. To address the aforementioned issues, this paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, based on the pre-trained vision-language model Qwen2.5-VL. Both SFT (Supervised Fine-Tuning) and reinforcement learning (GRPO) are used to optimize reasoning and language generation. We embed the updated language descriptions and feed them into a unified tracking backbone network together with vision features. Then, we adopt a tracking head to predict the specific location of the target object. In addition, we propose a large-scale long-term vision-language tracking benchmark dataset, termed TNLLT, which contains 200 video sequences. 20 baseline visual trackers are re-trained and evaluated on this dataset, which builds a solid foundation for the vision-language visual tracking task. Extensive experiments on multiple vision-language tracking benchmark datasets validate the effectiveness of our proposed reasoning-based natural language generation strategy. The source code of this paper will be released at https://github.com/Event-AHU/Open_VLTrack

[96] Segmenting the Complex and Irregular in Two-Phase Flows: A Real-World Empirical Study with SAM2

Semanur Küçük,Cosimo Della Santina,Angeliki Laskari

Main category: cs.CV

TL;DR: This study successfully applies a fine-tuned vision model (SAM v2.1) to segment irregular gas bubbles in multiphase flows, using minimal annotated data, addressing a longstanding challenge in industrial applications.

Details Motivation: The motivation stems from the industrial importance of segmenting gas bubbles in multiphase flows, where traditional and recent learning-based approaches fail because they assume near-spherical bubble shapes, especially in complex regimes involving deformation, coalescence, or breakup. Method: The research frames the segmentation task as a transfer learning problem, fine-tuning the Segment Anything Model (SAM v2.1) with as few as 100 annotated images to segment irregular bubble structures. Result: The fine-tuned SAM v2.1 model accurately segmented highly non-convex and irregular bubble structures using a limited dataset of annotated images. Conclusion: The study concludes that a fine-tuned Segment Anything Model (SAM v2.1) can effectively segment highly non-convex, irregular bubble structures in multiphase flows, overcoming the limitations of traditional and learning-based methods. Abstract: Segmenting gas bubbles in multiphase flows is a critical yet unsolved challenge in numerous industrial settings, from metallurgical processing to maritime drag reduction. Traditional approaches, and most recent learning-based methods, assume near-spherical shapes, limiting their effectiveness in regimes where bubbles undergo deformation, coalescence, or breakup. This complexity is particularly evident in air lubrication systems, where coalesced bubbles form amorphous and topologically diverse patches. In this work, we revisit the problem through the lens of modern vision foundation models. We cast the task as a transfer learning problem and demonstrate, for the first time, that a fine-tuned Segment Anything Model (SAM v2.1) can accurately segment highly non-convex, irregular bubble structures using as few as 100 annotated images.

[97] ArbiViewGen: Controllable Arbitrary Viewpoint Camera Data Generation for Autonomous Driving via Stable Diffusion Models

Yatong Lan,Jingfeng Chen,Yiru Wang,Lei He

Main category: cs.CV

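TL;DR: Arbiviewgen is a diffusion framework that requires no ground-truth supervision and uses FAVS and CVC-SSL to generate controllable camera images from arbitrary viewpoints for autonomous driving.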

Details Motivation: Controllable arbitrary-viewpoint image generation for autonomous driving is challenging because ground-truth data for extrapolated views is lacking, so a way to train high-fidelity generative models without ground-truth supervision is needed. Method: The paper proposes Arbiviewgen, which combines Feature-Aware Adaptive View Stitching (FAVS) and Cross-View Consistency Self-Supervised Learning (CVC-SSL). FAVS uses a hierarchical matching strategy for coarse geometric correspondence and fine-grained alignment, while CVC-SSL trains the model in a self-supervised manner to reconstruct the original camera views from the synthesized images, enforcing cross-view consistency. Result: Arbiviewgen needs only multi-camera images and their pose information for training, with no additional sensors or depth maps, to generate controllable camera images from arbitrary viewpoints. Conclusion: Arbiviewgen is a new diffusion framework that generates controllable arbitrary-viewpoint camera images from multi-camera images and poses without ground-truth supervision; to the authors' knowledge, it is the first method to do so in multiple vehicle configurations. Abstract: Arbitrary viewpoint image generation holds significant potential for autonomous driving, yet remains a challenging task due to the lack of ground-truth data for extrapolated views, which hampers the training of high-fidelity generative models. In this work, we propose Arbiviewgen, a novel diffusion-based framework for the generation of controllable camera images from arbitrary points of view. To address the absence of ground-truth data in unseen views, we introduce two key components: Feature-Aware Adaptive View Stitching (FAVS) and Cross-View Consistency Self-Supervised Learning (CVC-SSL). FAVS employs a hierarchical matching strategy that first establishes coarse geometric correspondences using camera poses, then performs fine-grained alignment through improved feature matching algorithms, and identifies high-confidence matching regions via clustering analysis. Building upon this, CVC-SSL adopts a self-supervised training paradigm where the model reconstructs the original camera views from the synthesized stitched images using a diffusion model, enforcing cross-view consistency without requiring supervision from extrapolated data. Our framework requires only multi-camera images and their associated poses for training, eliminating the need for additional sensors or depth maps. To our knowledge, Arbiviewgen is the first method capable of controllable arbitrary view camera image generation in multiple vehicle configurations.

[98] Navigating the Trade-off: A Synthesis of Defensive Strategies for Zero-Shot Adversarial Robustness in Vision-Language Models

Zane Xu,Jason Sun

Main category: cs.CV

TL;DR: This paper reviews methods to improve adversarial robustness in vision-language models while preserving their ability to generalize without additional training.

Details Motivation: The motivation is to address the central challenge of balancing adversarial robustness with the preservation of zero-shot generalization capabilities in vision-language models. Method: The paper synthesizes findings from eight seminal papers, analyzing defense paradigms like Adversarial Fine-Tuning and Training-Free/Test-Time Defenses. Result: The analysis highlights the evolution of methods from alignment-preserving techniques to embedding space re-engineering and latent-space purification techniques. Conclusion: The paper concludes that there is a need to develop hybrid defense strategies and focus on adversarial pre-training to overcome the trade-off between adversarial robustness and zero-shot generalization in VLMs. Abstract: This report synthesizes eight seminal papers on the zero-shot adversarial robustness of vision-language models (VLMs) like CLIP. A central challenge in this domain is the inherent trade-off between enhancing adversarial robustness and preserving the model's zero-shot generalization capabilities. We analyze two primary defense paradigms: Adversarial Fine-Tuning (AFT), which modifies model parameters, and Training-Free/Test-Time Defenses, which preserve them. We trace the evolution from alignment-preserving methods (TeCoA) to embedding space re-engineering (LAAT, TIMA), and from input heuristics (AOM, TTC) to latent-space purification (CLIPure). Finally, we identify key challenges and future directions including hybrid defense strategies and adversarial pre-training.

[99] RegionMed-CLIP: A Region-Aware Multimodal Contrastive Learning Pre-trained Model for Medical Image Understanding

Tianchen Fang,Guiru Liu

Main category: cs.CV

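TL;DR: This study proposes RegionMed-CLIP, a new medical image understanding method whose region-aware multimodal contrastive learning framework addresses the shortage of annotated data and the reliance on global features in medical image analysis, outperforming existing models.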

Details Motivation: Progress in medical image understanding is impeded by the limited availability of high-quality annotated medical data and an overreliance on global image features, which often miss subtle but clinically significant pathological regions. Method: The paper introduces RegionMed-CLIP, a region-aware multimodal contrastive learning framework that combines localized pathological signals with holistic semantic representations. Result: Supported by large-scale region-level representation learning, RegionMed-CLIP exceeds existing vision-language models by a wide margin in experiments on image-text retrieval, zero-shot classification, and visual question answering. Conclusion: RegionMed-CLIP provides a robust foundation for advancing multimodal medical image understanding and highlights the importance of region-aware contrastive pre-training. Abstract: Medical image understanding plays a crucial role in enabling automated diagnosis and data-driven clinical decision support. However, its progress is impeded by two primary challenges: the limited availability of high-quality annotated medical data and an overreliance on global image features, which often miss subtle but clinically significant pathological regions. To address these issues, we introduce RegionMed-CLIP, a region-aware multimodal contrastive learning framework that explicitly incorporates localized pathological signals along with holistic semantic representations. The core of our method is an innovative region-of-interest (ROI) processor that adaptively integrates fine-grained regional features with the global context, supported by a progressive training strategy that enhances hierarchical multimodal alignment. To enable large-scale region-level representation learning, we construct MedRegion-500k, a comprehensive medical image-text corpus that features extensive regional annotations and multilevel clinical descriptions. Extensive experiments on image-text retrieval, zero-shot classification, and visual question answering tasks demonstrate that RegionMed-CLIP consistently exceeds state-of-the-art vision language models by a wide margin. Our results highlight the critical importance of region-aware contrastive pre-training and position RegionMed-CLIP as a robust foundation for advancing multimodal medical image understanding.

[100] A Study of Gender Classification Techniques Based on Iris Images: A Deep Survey and Analysis

Basna Mohammed Salih Hasan,Ramadhan J. Mstafa

Main category: cs.CV

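TL;DR: This paper surveys gender classification methods, focusing on techniques based on biometric traits such as the face and iris, and discusses research gaps and future directions.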

Details Motivation: Gender classification is attractive in many applications, including surveillance, corporate profiling, and human-computer interaction; as a soft biometric, gender can also help infer an individual's identity. Method: The study reviews the gender classification methods in the literature, including approaches based on biometric traits (such as the face and iris) and the methodologies used for the different steps of classification. Result: The study summarizes current gender classification methods, especially face- and iris-based techniques, and points out the importance of the iris as a stable, externally visible, and non-invasive trait. Conclusion: The study discusses different approaches to gender classification, provides analysis and insight into existing methods, highlights gaps and challenges in the field, and offers suggestions and directions for future research. Abstract: Gender classification is attractive in a range of applications, including surveillance and monitoring, corporate profiling, and human-computer interaction. Individuals' identities may be gleaned from information about their gender, which is a kind of soft biometric. Over the years, several methods for determining a person's gender have been devised. Some of the most well-known ones are based on physical characteristics like face, fingerprint, palmprint, DNA, ears, gait, and iris. On the other hand, facial features account for the vast majority of gender classification methods. Also, the iris is a significant biometric trait because the iris, according to research, remains basically constant during an individual's life. Besides that, the iris is externally visible and is non-invasive to the user, which is important for practical applications. Furthermore, there are already high-quality methods for segmenting and encoding iris images, and the current methods facilitate selecting and extracting attribute vectors from iris textures. This study discusses several approaches to determining gender. The previous works of literature are briefly reviewed. Additionally, there are a variety of methodologies for different steps of gender classification. This study provides researchers with knowledge and analysis of the existing gender classification approaches. Also, it will assist researchers who are interested in this specific area, as well as highlight the gaps and challenges in the field, and finally provide suggestions and future paths for improvement.

[101] CF3: Compact and Fast 3D Feature Fields

Hyunjoon Lee,Joonkyu Min,Jaesik Park

Main category: cs.CV

TL;DR: CF3 improves 3D Gaussian feature field efficiency by using a top-down approach with fast fusion, direct 3D autoencoder training, and adaptive sparsification, achieving comparable results with only 5% of the Gaussians.

Details Motivation: Most 3D Gaussian Splatting approaches rely on bottom-up optimization, treating raw 2D features as ground truth and incurring high computational costs. CF3 aims to construct a compact and fast 3D Gaussian feature field. Method: CF3 uses fast weighted fusion of multi-view 2D features with pre-trained Gaussians, trains a per-Gaussian autoencoder directly on the fused features, and introduces adaptive sparsification to optimize and reduce redundant Gaussians. Result: CF3 achieves a competitive 3D feature field using only 5% of the Gaussians compared to Feature-3DGS, while preserving geometric details and improving efficiency. Conclusion: CF3 provides a more efficient and compact representation of 3D Gaussian feature fields by using a top-down pipeline involving fast weighted fusion, direct 3D autoencoder training, and adaptive sparsification. Abstract: 3D Gaussian Splatting (3DGS) has begun incorporating rich information from 2D foundation models. However, most approaches rely on a bottom-up optimization process that treats raw 2D features as ground truth, incurring increased computational costs. We propose a top-down pipeline for constructing compact and fast 3D Gaussian feature fields, namely, CF3. We first perform a fast weighted fusion of multi-view 2D features with pre-trained Gaussians. This approach enables training a per-Gaussian autoencoder directly on the lifted features, instead of training autoencoders in the 2D domain. As a result, the autoencoder better aligns with the feature distribution. More importantly, we introduce an adaptive sparsification method that optimizes the Gaussian attributes of the feature field while pruning and merging the redundant Gaussians, constructing an efficient representation with preserved geometric details. Our approach achieves a competitive 3D feature field using as little as 5% of the Gaussians compared to Feature-3DGS.
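
The "fast weighted fusion" step can be pictured as a weighted average: each Gaussian collects the 2D features of the pixels it contributes to, weighted by how strongly it contributes. The sketch below assumes the weights are per-pixel blending weights supplied by a renderer; the abstract does not state the exact weighting, and the function name and the `(gaussian_id, weight, feature)` tuple layout are hypothetical.

```python
def fuse_features(contributions, num_gaussians, feat_dim):
    """Lift 2D features onto Gaussians by weighted averaging.

    contributions: iterable of (gaussian_id, weight, feature_vector), one
    entry per (pixel, Gaussian) pair across all views. The weight is assumed
    to be the Gaussian's blending weight at that pixel.
    Returns one fused feature per Gaussian, or None if it was never observed.
    """
    sums = [[0.0] * feat_dim for _ in range(num_gaussians)]
    weights = [0.0] * num_gaussians
    for gid, w, feat in contributions:
        weights[gid] += w
        for d in range(feat_dim):
            sums[gid][d] += w * feat[d]
    fused = []
    for gid in range(num_gaussians):
        if weights[gid] > 0:
            fused.append([s / weights[gid] for s in sums[gid]])
        else:
            fused.append(None)  # no view contributed to this Gaussian
    return fused
```

Because this is a single accumulation pass with no gradient descent, it illustrates why a top-down fusion can be cheaper than optimizing per-Gaussian features against 2D feature maps.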

[102] Robust Tracking with Particle Filtering for Fluorescent Cardiac Imaging

Suresh Guttikonda,Maximilian Neidhart,Johanna Sprenger,Johannes Petersen,Christian Detter,Alexander Schlaefer

Main category: cs.CV

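TL;DR: The paper proposes a new particle filtering tracker that robustly tracks target landmarks in fluorescent cardiac imaging, with real-time estimation and low tracking error.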

Details Motivation: Heart motion and significant fluctuations in image characteristics caused by vessel structural enrichment limit traditional methods for tracking target landmarks in fluorescent cardiac imaging. Method: The paper proposes a particle filtering tracker based on cyclic consistency checks that robustly tracks particles sampled to follow target landmarks. Result: The method tracks 117 targets simultaneously at 25.4 fps, enabling real-time estimates, and achieves a tracking error of 5.00 +/- 0.22 px, outperforming other deep learning trackers (22.3 +/- 1.1 px) and conventional trackers (58.1 +/- 27.1 px). Conclusion: The proposed particle filtering tracker based on cyclic consistency checks outperforms other deep learning and conventional trackers at tracking target landmarks in fluorescent cardiac imaging. Abstract: Intraoperative fluorescent cardiac imaging enables quality control following coronary bypass grafting surgery. We can estimate local quantitative indicators, such as cardiac perfusion, by tracking local feature points. However, heart motion and significant fluctuations in image characteristics caused by vessel structural enrichment limit traditional tracking methods. We propose a particle filtering tracker based on cyclic consistency checks to robustly track particles sampled to follow target landmarks. Our method tracks 117 targets simultaneously at 25.4 fps, allowing real-time estimates during interventions. It achieves a tracking error of (5.00 +/- 0.22 px) and outperforms other deep learning trackers (22.3 +/- 1.1 px) and conventional trackers (58.1 +/- 27.1 px).
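
A cyclic consistency check tracks a point forward one frame and then backward again, and trusts it only if it returns close to where it started. The sketch below shows how such a check could weight particles around a landmark. It is an assumption-laden illustration, not the authors' tracker: `track_forward` and `track_backward` stand in for any point tracker, the Gaussian weighting and the `3 * tol` cutoff are arbitrary choices, and resampling is omitted.

```python
import math

def cycle_error(point, track_forward, track_backward):
    """Distance between a point and its forward-then-backward tracked position."""
    fx, fy = track_forward(point)
    bx, by = track_backward((fx, fy))
    return math.hypot(bx - point[0], by - point[1])

def update_particles(particles, track_forward, track_backward, tol=1.0):
    """Move particles forward and weight them by cycle consistency.

    Particles whose cycle error exceeds 3 * tol get zero weight; the rest
    get a Gaussian weight in the error (an assumed weighting).
    """
    moved, weights = [], []
    for p in particles:
        err = cycle_error(p, track_forward, track_backward)
        moved.append(track_forward(p))
        weights.append(math.exp(-0.5 * (err / tol) ** 2) if err <= 3 * tol else 0.0)
    total = sum(weights)
    if total == 0:
        # Every particle failed the check: fall back to uniform weights.
        return moved, [1.0 / len(moved)] * len(moved)
    return moved, [w / total for w in weights]

def estimate(moved, weights):
    """Weighted mean of the particle positions, used as the landmark estimate."""
    return (sum(w * x for (x, _), w in zip(moved, weights)),
            sum(w * y for (_, y), w in zip(moved, weights)))
```

With a tracker that mistracks one outlying particle, that particle receives zero weight and the landmark estimate is computed from the consistent particles only.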

[103] SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion

Xiaoyang Zhang,Zhen Hua,Yakun Ju,Wei Zhou,Jun Liu,Alex C. Kot

Main category: cs.CV

TL;DR: This paper proposes SGDFuse, a new method for infrared and visible image fusion that uses a conditional diffusion model guided by the Segment Anything Model.

Details Motivation: Existing methods often fail to preserve key targets due to a lack of deep semantic understanding of the scene, while the fusion process itself can also introduce artifacts and detail loss. Method: SGDFuse uses a conditional diffusion model guided by the Segment Anything Model (SAM) for image fusion. Result: SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, as well as in its adaptability to downstream tasks. Conclusion: SGDFuse provides a powerful solution to the core challenges in image fusion and achieves state-of-the-art performance. Abstract: Infrared and visible image fusion (IVIF) aims to combine the thermal radiation information from infrared images with the rich texture details from visible images to enhance perceptual capabilities for downstream visual tasks. However, existing methods often fail to preserve key targets due to a lack of deep semantic understanding of the scene, while the fusion process itself can also introduce artifacts and detail loss, severely compromising both image quality and task performance. To address these issues, this paper proposes SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), to achieve high-fidelity and semantically-aware image fusion. The core of our method is to utilize high-quality semantic masks generated by SAM as explicit priors to guide the optimization of the fusion process via a conditional diffusion model. Specifically, the framework operates in a two-stage process: it first performs a preliminary fusion of multi-modal features, and then utilizes the semantic masks from SAM jointly with the preliminary fused image as a condition to drive the diffusion model's coarse-to-fine denoising generation. This ensures the fusion process not only has explicit semantic directionality but also guarantees the high fidelity of the final result. 
Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, as well as in its adaptability to downstream tasks, providing a powerful solution to the core challenges in image fusion. The code of SGDFuse is available at https://github.com/boshizhang123/SGDFuse.

[104] B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding

Changho Choi,Youngwoo Shin,Gyojin Han,Dong-Jae Lee,Junmo Kim

Main category: cs.CV

TL;DR: This paper introduces B4DL, a benchmark and MLLM model for processing 4D LiDAR data, enabling better spatio-temporal reasoning in outdoor environments.

Details Motivation: The motivation is to explore the potential of LiDAR-based 4D point clouds for representing real-world scenes and to address the lack of high-quality annotations and suitable MLLM architectures for processing 4D LiDAR. Method: The paper introduces a scalable data generation pipeline and an MLLM model that directly processes raw 4D LiDAR data, integrating it with language understanding. Result: The result is the development of the B4DL benchmark, a new MLLM model capable of processing 4D LiDAR, and the provision of rendered 4D LiDAR videos and datasets for evaluation. Conclusion: The paper concludes that the proposed B4DL benchmark and MLLM model provide a unified solution for spatio-temporal reasoning in dynamic outdoor environments using 4D LiDAR. Abstract: Understanding dynamic outdoor environments requires capturing complex object interactions and their evolution over time. LiDAR-based 4D point clouds provide precise spatial geometry and rich temporal cues, making them ideal for representing real-world scenes. However, despite their potential, 4D LiDAR remains underexplored in the context of Multimodal Large Language Models (MLLMs) due to the absence of high-quality, modality-specific annotations and the lack of MLLM architectures capable of processing its high-dimensional composition. To address these challenges, we introduce B4DL, a new benchmark specifically designed for training and evaluating MLLMs on 4D LiDAR understanding. In addition, we propose a scalable data generation pipeline and an MLLM model that, for the first time, directly processes raw 4D LiDAR by bridging it with language understanding. Combined with our dataset and benchmark, our model offers a unified solution for spatio-temporal reasoning in dynamic outdoor environments. We provide rendered 4D LiDAR videos, generated dataset, and inference outputs on diverse scenarios at: https://mmb4dl.github.io/mmb4dl/

[105] Wavelet-Guided Dual-Frequency Encoding for Remote Sensing Change Detection

Xiaoyang Zhang,Guodong Fan,Guang-Yong Chen,Zhen Hua,Jinjiang Li,Min Gan,C. L. Philip Chen

Main category: cs.CV

TL;DR: This paper proposes a novel wavelet-based method for remote sensing change detection, combining high-frequency edge enhancement and low-frequency semantic modeling to achieve superior performance.

Details Motivation: Traditional spatial-domain methods have limited feature diversity, hindering the detection of subtle changes in remote sensing imagery. Method: Wavelet-Guided Dual-Frequency Encoding (WGDF) combines Discrete Wavelet Transform (DWT) for high- and low-frequency decomposition with specialized modules for edge detail enhancement, frequency-domain difference modeling, and global semantic relationship capture. Result: Experiments show that WGDF significantly reduces edge ambiguity and outperforms state-of-the-art methods in detection accuracy and robustness across multiple remote sensing datasets. Conclusion: The proposed WGDF method effectively enhances change detection in remote sensing imagery by synergistically modeling high- and low-frequency components in the wavelet domain, achieving superior accuracy and robustness. Abstract: Change detection in remote sensing imagery plays a vital role in various engineering applications, such as natural disaster monitoring, urban expansion tracking, and infrastructure management. Despite the remarkable progress of deep learning in recent years, most existing methods still rely on spatial-domain modeling, where the limited diversity of feature representations hinders the detection of subtle change regions. We observe that frequency-domain feature modeling, particularly in the wavelet domain, can amplify fine-grained differences in frequency components, enhancing the perception of edge changes that are challenging to capture in the spatial domain. Thus, we propose a method called Wavelet-Guided Dual-Frequency Encoding (WGDF). Specifically, we first apply Discrete Wavelet Transform (DWT) to decompose the input images into high-frequency and low-frequency components, which are used to model local details and global structures, respectively. 
In the high-frequency branch, we design a Dual-Frequency Feature Enhancement (DFFE) module to strengthen edge detail representation and introduce a Frequency-Domain Interactive Difference (FDID) module to enhance the modeling of fine-grained changes. In the low-frequency branch, we exploit Transformers to capture global semantic relationships and employ a Progressive Contextual Difference Module (PCDM) to progressively refine change regions, enabling precise structural semantic characterization. Finally, the high- and low-frequency features are synergistically fused to unify local sensitivity with global discriminability. Extensive experiments on multiple remote sensing datasets demonstrate that WGDF significantly alleviates edge ambiguity and achieves superior detection accuracy and robustness compared to state-of-the-art methods. The code will be available at https://github.com/boshizhang123/WGDF.
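
To make the first step concrete, here is a minimal sketch of a one-level 2D Haar transform, the simplest DWT, which splits an image into one low-frequency band and three high-frequency detail bands. The abstract does not say which wavelet WGDF uses, so the Haar choice, the function name, and the band names are assumptions.

```python
def haar_dwt2(image):
    """One-level orthonormal 2D Haar transform.

    image: list of rows with even height and width.
    Returns (low, d_rows, d_cols, d_diag), each half the input size:
      low    - local average (global structure)
      d_rows - difference between the top and bottom row of each 2x2 block
      d_cols - difference between the left and right column of each block
      d_diag - diagonal difference within each block
    """
    h, w = len(image), len(image[0])
    low, d_rows, d_cols, d_diag = [], [], [], []
    for r in range(0, h, 2):
        row_low, row_dr, row_dc, row_dd = [], [], [], []
        for c in range(0, w, 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            row_low.append((a + b + d + e) / 2.0)
            row_dr.append((a + b - d - e) / 2.0)
            row_dc.append((a - b + d - e) / 2.0)
            row_dd.append((a - b - d + e) / 2.0)
        low.append(row_low)
        d_rows.append(row_dr)
        d_cols.append(row_dc)
        d_diag.append(row_dd)
    return low, d_rows, d_cols, d_diag
```

A flat region yields zero in all three detail bands, while an edge shows up as a nonzero detail coefficient. Comparing detail bands of two co-registered images is one simple way to see why edge changes stand out in the wavelet domain.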

[106] VS-LLM: Visual-Semantic Depression Assessment based on LLM for Drawing Projection Test

Meiqi Wu,Yaxuan Kang,Xuchen Li,Shiyu Hu,Xiaotang Chen,Yunfeng Kang,Weiqiang Wang,Kaiqi Huang

Main category: cs.CV

TL;DR: This paper proposes an automated method (VS-LLM) for analyzing PPAT sketches to assess depression, improving accuracy by 17.6% over traditional methods and enabling large-scale mental state evaluation.

Details Motivation: The motivation behind this research is to overcome the limitations of manual interpretation of Drawing Projection Test (DPT) sketches, which is time-consuming and dependent on the experience of psychologists. The goal is to develop an automated method for large-scale DPT analysis for depression assessment. Method: The authors proposed a Visual-Semantic depression assessment based on LLM (VS-LLM) method to analyze PPAT sketches. They created an experimental environment for automated analysis and tested the effectiveness of their method against traditional psychologist assessment methods. Result: The experimental results demonstrated that the proposed VS-LLM method improved performance by 17.6% compared to traditional psychologist assessment methods, showing its potential for effective depression assessment based on PPAT sketches. Conclusion: The study concludes that the proposed Visual-Semantic depression assessment based on LLM (VS-LLM) method enhances the efficiency and accuracy of mental state assessment using PPAT sketches, offering a promising tool for large-scale automated DPT analysis. Abstract: The Drawing Projection Test (DPT) is an essential tool in art therapy, allowing psychologists to assess participants' mental states through their sketches. Specifically, through sketches with the theme of "a person picking an apple from a tree (PPAT)", it can be revealed whether the participants are in mental states such as depression. Compared with scales, the DPT can enrich psychologists' understanding of an individual's mental state. However, the interpretation of the PPAT is laborious and depends on the experience of the psychologists. To address this issue, we propose an effective identification method to support psychologists in conducting a large-scale automatic DPT. Unlike traditional sketch recognition, the DPT focuses more on the overall evaluation of the sketches, such as color usage and space utilization. 
Moreover, PPAT imposes a time limit and prohibits verbal reminders, resulting in low drawing accuracy and a lack of detailed depiction. To address these challenges, we propose the following efforts: (1) Providing an experimental environment for automated analysis of PPAT sketches for depression assessment; (2) Offering a Visual-Semantic depression assessment based on LLM (VS-LLM) method; (3) Experimental results demonstrate that our method improves by 17.6% compared to the psychologist assessment method. We anticipate that this work will contribute to the research in mental state assessment based on PPAT sketches' elements recognition. Our datasets and codes are available at https://github.com/wmeiqi/VS-LLM.

[107] CoCAViT: Compact Vision Transformer with Robust Global Coordination

Xuyang Wang,Lingjuan Miao,Zhiqiang Zhou

Main category: cs.CV

TL;DR: This paper introduces CoCAViT, a visual backbone that improves the generalization performance of smaller models on out-of-distribution data by addressing architectural bottlenecks and introducing a new Coordinator-patch Cross Attention mechanism.

Details Motivation: The motivation stems from the observation that smaller models experience a disproportionately larger performance drop on out-of-distribution data compared to larger models, indicating a deficiency in generalization performance. Method: The method involves identifying architectural bottlenecks and design choices affecting generalization performance and introducing a Coordinator-patch Cross Attention (CoCA) mechanism to restore the global field of pure window attention. Result: At a resolution of 224*224, CoCAViT-28M achieves 84.0% top-1 accuracy on ImageNet-1K, significant gains on multiple OOD benchmarks, 52.2 mAP on COCO object detection, and 51.3 mIOU on ADE20K semantic segmentation while maintaining low latency. Conclusion: CoCAViT is a novel visual backbone that addresses the generalization performance issue of existing efficient models on out-of-distribution data, achieving robust real-time visual representation with minimal computational overhead. Abstract: In recent years, large-scale visual backbones have demonstrated remarkable capabilities in learning general-purpose features from images via extensive pre-training. Concurrently, many efficient architectures have emerged that have performance comparable to that of larger models on in-domain benchmarks. However, we observe that for smaller models, the performance drop on out-of-distribution (OOD) data is disproportionately larger, indicating a deficiency in the generalization performance of existing efficient models. To address this, we identify key architectural bottlenecks and inappropriate design choices that contribute to this issue, retaining robustness for smaller models. To restore the global field of pure window attention, we further introduce a Coordinator-patch Cross Attention (CoCA) mechanism, featuring dynamic, domain-aware global tokens that enhance local-global feature modeling and adaptively capture robust patterns across domains with minimal computational overhead. 
Integrating these advancements, we present CoCAViT, a novel visual backbone designed for robust real-time visual representation. Extensive experiments empirically validate our design. At a resolution of 224*224, CoCAViT-28M achieves 84.0% top-1 accuracy on ImageNet-1K, with significant gains on multiple OOD benchmarks, compared to competing models. It also attains 52.2 mAP on COCO object detection and 51.3 mIOU on ADE20K semantic segmentation, while maintaining low latency.

[108] mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering

Xu Yuan,Liangbo Ning,Wenqi Fan,Qing Li

Main category: cs.CV

TL;DR: This paper proposes mKG-RAG, a novel framework integrating multimodal knowledge graphs into RAG for VQA, which improves answer accuracy and sets a new performance benchmark.

Details Motivation: Vanilla RAG-based VQA methods often introduce irrelevant or misleading content due to reliance on unstructured documents. Using structured multimodal knowledge graphs can enhance accuracy and reliability. Method: The authors propose mKG-RAG, which uses MLLM-powered keyword extraction and vision-text matching to build high-quality multimodal KGs, and employs a dual-stage retrieval strategy with a question-aware retriever to improve efficiency and precision. Result: Comprehensive experiments show that the mKG-RAG framework achieves state-of-the-art performance on knowledge-intensive VQA tasks. Conclusion: The proposed mKG-RAG framework enhances knowledge-based VQA performance by integrating structured multimodal knowledge graphs and a dual-stage retrieval strategy, significantly outperforming existing methods. Abstract: Recently, Retrieval-Augmented Generation (RAG) has been proposed to expand internal knowledge of Multimodal Large Language Models (MLLMs) by incorporating external knowledge databases into the generation process, which is widely used for knowledge-based Visual Question Answering (VQA) tasks. Despite impressive advancements, vanilla RAG-based VQA methods that rely on unstructured documents and overlook the structural relationships among knowledge elements frequently introduce irrelevant or misleading content, reducing answer accuracy and reliability. To overcome these challenges, a promising solution is to integrate multimodal knowledge graphs (KGs) into RAG-based VQA frameworks to enhance the generation by introducing structured multimodal knowledge. Therefore, in this paper, we propose a novel multimodal knowledge-augmented generation framework (mKG-RAG) based on multimodal KGs for knowledge-intensive VQA tasks. 
Specifically, our approach leverages MLLM-powered keyword extraction and vision-text matching to distill semantically consistent and modality-aligned entities/relationships from multimodal documents, constructing high-quality multimodal KGs as structured knowledge representations. In addition, a dual-stage retrieval strategy equipped with a question-aware multimodal retriever is introduced to improve retrieval efficiency while refining precision. Comprehensive experiments demonstrate that our approach significantly outperforms existing methods, setting a new state-of-the-art for knowledge-based VQA.

[109] Textual Inversion for Efficient Adaptation of Open-Vocabulary Object Detectors Without Forgetting

Frank Ruis,Gertjan Burghouts,Hugo Kuijf

Main category: cs.CV

TL;DR: This paper proposes a Textual Inversion-inspired method for open-vocabulary object detection that preserves the original VLM weights and zero-shot capabilities while enabling accurate detection of novel objects with minimal data and reduced computational cost.

Details Motivation: While large pre-trained vision language models (VLMs) have achieved state-of-the-art performance and zero-shot capabilities, fine-tuning for specific targets often results in the loss of these capabilities. The motivation is to preserve the original model's performance while enabling adaptation to new tasks with minimal data and computational cost. Method: The study introduces a method similar to Textual Inversion (TI), which learns new or improves existing tokens within the VLM to achieve accurate detection of novel or fine-grained objects using only a few examples. The approach keeps the original VLM weights frozen, ensuring compatibility and retention of existing capabilities. Result: The method successfully extends the VLM vocabulary for open-vocabulary object detection, maintaining the original model's performance while enabling few-shot learning with minimal computational cost. It outperforms baseline methods that suffer from forgetting and effectively retains zero-shot capabilities such as detecting sketches after training on real photos. Conclusion: The proposed method, inspired by Textual Inversion, effectively extends the vocabulary of VLMs for open-vocabulary object detection without fine-tuning the entire model. It maintains compatibility with the original weights, preserves zero-shot capabilities, and reduces computational overhead compared to full-model fine-tuning. Abstract: Recent progress in large pre-trained vision language models (VLMs) has reached state-of-the-art performance on several object detection benchmarks and boasts strong zero-shot capabilities, but for optimal performance on specific targets some form of finetuning is still necessary. While the initial VLM weights allow for great few-shot transfer learning, this usually involves the loss of the original natural language querying and zero-shot capabilities. 
Inspired by the success of Textual Inversion (TI) in personalizing text-to-image diffusion models, we propose a similar formulation for open-vocabulary object detection. TI allows extending the VLM vocabulary by learning new or improving existing tokens to accurately detect novel or fine-grained objects from as little as three examples. The learned tokens are completely compatible with the original VLM weights while keeping them frozen, retaining the original model's benchmark performance, and leveraging its existing capabilities such as zero-shot domain transfer (e.g., detecting a sketch of an object after training only on real photos). The storage and gradient calculations are limited to the token embedding dimension, requiring significantly less compute than full-model fine-tuning. We evaluated whether the method matches or outperforms the baseline methods that suffer from forgetting in a wide variety of quantitative and qualitative experiments.
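
The core idea of learning a token while everything else stays frozen can be shown with a toy example: the only trainable parameters are the entries of one new token embedding, and gradients never touch the features produced by the frozen model. This is a deliberately simplified sketch, not the paper's detector. The logistic matching score, the learning rate, and all function names are assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def learn_token(frozen_feats, labels, dim, lr=0.5, steps=200):
    """Fit a single token embedding by gradient descent on a logistic loss.

    frozen_feats: region features from a frozen model (never modified here).
    labels: 1 if the region shows the new object, else 0.
    Only the `token` vector of size `dim` is updated.
    """
    token = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for feat, y in zip(frozen_feats, labels):
            err = sigmoid(dot(feat, token)) - y  # d(loss)/d(score)
            for d in range(dim):
                grad[d] += err * feat[d]
        token = [t - lr * g / len(labels) for t, g in zip(token, grad)]
    return token
```

Storage and gradient computation scale with the embedding dimension only, which matches the abstract's point that this is far cheaper than full-model fine-tuning and leaves the original weights, and therefore the original capabilities, untouched.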

[110] 3DGabSplat: 3D Gabor Splatting for Frequency-adaptive Radiance Field Rendering

Junyu Zhou,Yuyang Huang,Wenrui Dai,Junni Zou,Ziyang Zheng,Nuowen Kan,Chenglin Li,Hongkai Xiong

Main category: cs.CV

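TL;DR: This paper proposes 3D Gabor Splatting (3DGabSplat), a novel splatting method based on 3D Gabor primitives that significantly improves the quality and efficiency of novel view synthesis.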

Details Motivation: Because of its inherently low-pass nature, 3D Gaussian Splatting (3DGS) is limited in representing high-frequency details, and it suffers from redundant primitives, degraded training and rendering efficiency, and excessive memory overhead. Method: The paper proposes a 3D Gabor-based primitive supervised by multi-view images, together with an efficient CUDA-based rasterizer and a frequency-adaptive optimization mechanism. Result: 3DGabSplat achieves state-of-the-art rendering quality on both real-world and synthetic scenes, with up to 1.35 dB PSNR gain over 3DGS and a reduced number of primitives and lower memory consumption. Conclusion: By introducing 3D Gabor-based primitives and a frequency-adaptive mechanism, 3DGabSplat improves the efficiency and quality of novel view synthesis and serves as a scalable plug-and-play kernel that integrates seamlessly into existing 3DGS frameworks. Abstract: Recent prominence in 3D Gaussian Splatting (3DGS) has enabled real-time rendering while maintaining high-fidelity novel view synthesis. However, 3DGS resorts to the Gaussian function that is low-pass by nature and is restricted in representing high-frequency details in 3D scenes. Moreover, it causes redundant primitives with degraded training and rendering efficiency and excessive memory overhead. To overcome these limitations, we propose 3D Gabor Splatting (3DGabSplat) that leverages a novel 3D Gabor-based primitive with multiple directional 3D frequency responses for radiance field representation supervised by multi-view images. The proposed 3D Gabor-based primitive forms a filter bank incorporating multiple 3D Gabor kernels at different frequencies to enhance flexibility and efficiency in capturing fine 3D details. Furthermore, to achieve novel view rendering, an efficient CUDA-based rasterizer is developed to project the multiple directional 3D frequency components characterized by 3D Gabor-based primitives onto the 2D image plane, and a frequency-adaptive mechanism is presented for adaptive joint optimization of primitives. 3DGabSplat is scalable to be a plug-and-play kernel for seamless integration into existing 3DGS paradigms to enhance both efficiency and quality of novel view synthesis. Extensive experiments demonstrate that 3DGabSplat outperforms 3DGS and its variants using alternative primitives, and achieves state-of-the-art rendering quality across both real-world and synthetic scenes. Remarkably, we achieve up to 1.35 dB PSNR gain over 3DGS with simultaneously reduced number of primitives and memory consumption.
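
The difference between a Gaussian and a Gabor kernel is easiest to see in one dimension: a Gabor kernel is a Gaussian envelope multiplied by a sinusoid, so it can oscillate within its support, while a plain Gaussian cannot. The sketch below is a 1D analogue only. The paper's 3D primitive and its parameterization are not given in the abstract, and the function names and the weighted-sum form of the bank are assumptions.

```python
import math

def gaussian(x, sigma):
    """Gaussian envelope: smooth and low-pass."""
    return math.exp(-0.5 * (x / sigma) ** 2)

def gabor(x, sigma, freq):
    """Gaussian envelope modulated by a cosine of the given frequency.

    With freq = 0 this reduces exactly to the Gaussian.
    """
    return gaussian(x, sigma) * math.cos(2.0 * math.pi * freq * x)

def gabor_bank(x, sigma, freqs, weights):
    """Weighted sum of Gabor kernels that share one envelope (a small filter bank)."""
    return sum(w * gabor(x, sigma, f) for f, w in zip(freqs, weights))
```

A single primitive built from such a bank can represent a fine stripe pattern that would otherwise need several narrow Gaussians, which gives some intuition for the reported reduction in primitive count.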

[111] PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation

Kang Liu,Zhuoqi Ma,Zikang Fang,Yunan Li,Kun Xie,Qiguang Miao

Main category: cs.CV

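TL;DR: PriorRG is a new chest X-ray report generation framework that emulates real-world clinical workflows and uses patient-specific prior knowledge to significantly improve the clinical accuracy and fluency of generated reports.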

Details Motivation: Chest X-ray report generation aims to reduce radiologists' workload by automatically producing high-quality preliminary reports. However, most existing methods neglect patient-specific prior information, such as clinical context and the most recent prior image, and therefore fail to capture diagnostic intent or disease progression. Method: PriorRG uses a two-stage training pipeline: Stage 1 introduces a prior-guided contrastive pre-training scheme, and Stage 2 presents a prior-aware coarse-to-fine decoding method. Result: Extensive experiments on the MIMIC-CXR and MIMIC-ABN datasets show that PriorRG outperforms state-of-the-art methods, with a 3.6% BLEU-4 and 3.8% F1 score improvement on MIMIC-CXR and a 5.9% BLEU-1 gain on MIMIC-ABN. Conclusion: PriorRG is a new chest X-ray report generation framework that emulates real-world clinical workflows and makes effective use of patient-specific prior knowledge, improving the clinical accuracy and fluency of generated reports. Abstract: Chest X-ray report generation aims to reduce radiologists' workload by automatically producing high-quality preliminary reports. A critical yet underexplored aspect of this task is the effective use of patient-specific prior knowledge -- including clinical context (e.g., symptoms, medical history) and the most recent prior image -- which radiologists routinely rely on for diagnostic reasoning. Most existing methods generate reports from single images, neglecting this essential prior information and thus failing to capture diagnostic intent or disease progression. To bridge this gap, we propose PriorRG, a novel chest X-ray report generation framework that emulates real-world clinical workflows via a two-stage training pipeline. In Stage 1, we introduce a prior-guided contrastive pre-training scheme that leverages clinical context to guide spatiotemporal feature extraction, allowing the model to align more closely with the intrinsic spatiotemporal semantics in radiology reports. In Stage 2, we present a prior-aware coarse-to-fine decoding for report generation that progressively integrates patient-specific prior knowledge with the vision encoder's hidden states. This decoding allows the model to align with diagnostic focus and track disease progression, thereby enhancing the clinical accuracy and fluency of the generated reports. Extensive experiments on MIMIC-CXR and MIMIC-ABN datasets demonstrate that PriorRG outperforms state-of-the-art methods, achieving a 3.6% BLEU-4 and 3.8% F1 score improvement on MIMIC-CXR, and a 5.9% BLEU-1 gain on MIMIC-ABN. 
Code and checkpoints will be released upon acceptance.

[112] Cross-View Localization via Redundant Sliced Observations and A-Contrario Validation

Yongjun Zhang,Mingtao Xiong,Yi Wan,Gui-Song Xia

Main category: cs.CV

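TL;DR: Slice-Loc is a two-stage CVL method that uses a-contrario reliability validation to improve localization accuracy and reliability, suited to self-localization of smart vehicles in GNSS-denied environments.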

Details Motivation: Existing CVL methods output only a single observation and lack the redundant observations required by surveying principles, which makes it difficult to assess localization reliability through mutual validation of observational data. Method: Slice-Loc divides the query image into sub-images and estimates a 3-DoF pose for each slice, creating redundant and independent observations. A geometric rigidity formula then filters out erroneous 3-DoF poses, and the meaningfulness of the localization is quantified by estimating the number of false alarms. Result: In cross-city tests on the DReSS dataset, Slice-Loc reduces the mean localization error from 4.47 m to 1.86 m and the mean orientation error from 3.42° to 1.24°. Conclusion: By introducing redundant observations and a geometric rigidity formula, Slice-Loc improves the localization accuracy and reliability of CVL and performs strongly in cross-city tests on the DReSS dataset. Abstract: Cross-view localization (CVL) matches ground-level images with aerial references to determine the geo-position of a camera, enabling smart vehicles to self-localize offline in GNSS-denied environments. However, most CVL methods output only a single observation, the camera pose, and lack the redundant observations required by surveying principles, making it challenging to assess localization reliability through the mutual validation of observational data. To tackle this, we introduce Slice-Loc, a two-stage method featuring an a-contrario reliability validation for CVL. Instead of using the query image as a single input, Slice-Loc divides it into sub-images and estimates the 3-DoF pose for each slice, creating redundant and independent observations. Then, a geometric rigidity formula is proposed to filter out the erroneous 3-DoF poses, and the inliers are merged to generate the final camera pose. Furthermore, we propose a model that quantifies the meaningfulness of localization by estimating the number of false alarms (NFA), according to the distribution of the locations of the sliced images. By eliminating gross errors, Slice-Loc boosts localization accuracy and effectively detects failures. After filtering out mislocalizations, Slice-Loc reduces the proportion of errors exceeding 10 m to under 3%. In cross-city tests on the DReSS dataset, Slice-Loc cuts the mean localization error from 4.47 m to 1.86 m and the mean orientation error from 3.42° to 1.24°, outperforming state-of-the-art methods. Code and dataset will be available at: https://github.com/bnothing/Slice-Loc.
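
In the a-contrario framework, a detection is accepted when the expected number of times it would occur by chance, the number of false alarms (NFA), is small. A common generic form multiplies the number of tests by a binomial tail probability. The sketch below uses that textbook form and is not the model proposed in the paper: `p` is an assumed probability that a single slice agrees with a candidate pose by chance, and the threshold of 1 is the usual convention rather than a value from the abstract.

```python
from math import comb

def binomial_tail(n, k, p):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

def nfa(num_tests, n, k, p):
    """Expected number of chance detections: tests times binomial tail.

    n: number of slices, k: slices that agree with the candidate pose,
    p: assumed chance-agreement probability for one slice.
    """
    return num_tests * binomial_tail(n, k, p)

def is_meaningful(num_tests, n, k, p, epsilon=1.0):
    """Accept the localization when fewer than epsilon false alarms are expected."""
    return nfa(num_tests, n, k, p) < epsilon
```

For example, with 8 slices, 10 candidate poses, and a 10% chance agreement rate, all 8 slices agreeing gives an NFA far below 1, whereas a single agreeing slice gives an NFA above 1 and would be rejected as unreliable.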

[113] CT-GRAPH: Hierarchical Graph Attention Network for Anatomy-Guided CT Report Generation

Hamza Kalisch,Fabian Hörst,Jens Kleesiek,Ken Herrmann,Constantin Seibold

Main category: cs.CV

TL;DR: CT-GRAPH is a new method for automating radiology report generation using a hierarchical graph attention network to model anatomical relationships, achieving significant performance improvements on a large chest CT dataset.

Details Motivation: The motivation is to address the limitations of current methods that rely solely on global image features and fail to capture fine-grained organ relationships essential for accurate radiology reporting. Method: The method involves a hierarchical graph attention network called CT-GRAPH that uses anatomical masks and pretrained 3D medical feature encoders to extract and refine global and organ-level features, which are then integrated into a large language model for generating medical reports. Result: The approach achieved a 7.9% improvement in F1 score over state-of-the-art methods on the CT-RATE dataset, demonstrating its effectiveness in generating detailed and accurate medical reports. Conclusion: CT-GRAPH provides a novel approach to medical imaging report generation by explicitly modeling radiological knowledge through a hierarchical graph attention network, outperforming current state-of-the-art methods. Abstract: As medical imaging is central to diagnostic processes, automating the generation of radiology reports has become increasingly relevant to assist radiologists with their heavy workloads. Most current methods rely solely on global image features, failing to capture fine-grained organ relationships crucial for accurate reporting. To this end, we propose CT-GRAPH, a hierarchical graph attention network that explicitly models radiological knowledge by structuring anatomical regions into a graph, linking fine-grained organ features to coarser anatomical systems and a global patient context. Our method leverages pretrained 3D medical feature encoders to obtain global and organ-level features by utilizing anatomical masks. These features are further refined within the graph and then integrated into a large language model to generate detailed medical reports. We evaluate our approach for the task of report generation on the large-scale chest CT dataset CT-RATE. 
We provide an in-depth analysis of pretrained feature encoders for CT report generation and show that our method achieves a substantial absolute improvement of 7.9% in F1 score over current state-of-the-art methods. The code is publicly available at https://github.com/hakal104/CT-GRAPH.

[114] Deformable Attention Graph Representation Learning for Histopathology Whole Slide Image Analysis

Mingxi Fu,Xitong Ling,Yuxuan Chen,Jiawen Li,fanglei fu,Huaitian Yuan,Tian Guan,Yonghong He,Lianghui Zhu

Main category: cs.CV

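TL;DR: Proposes a novel graph neural network framework based on deformable attention for pathology image analysis; using a dynamic weighted directed graph and learned spatial offsets, it significantly improves contextual awareness and spatial specificity.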

Details Motivation: To overcome the limitations of conventional methods in capturing spatial dependencies among tissue structures and the lack of specificity in attention mechanisms, the authors propose a new solution. Method: Builds a GNN framework on a dynamic weighted directed graph, using deformable attention and learnable spatial offsets to enlarge the contextual field while preserving spatial specificity. Result: Achieves state-of-the-art performance on four benchmark datasets (TCGA-COAD, BRACS, gastric intestinal metaplasia grading, and intestinal ROI classification). Conclusion: The study shows that a GNN with deformable attention can effectively capture complex spatial structures in pathology images, offering a new solution for pathology image analysis. Abstract: Accurate classification of Whole Slide Images (WSIs) and Regions of Interest (ROIs) is a fundamental challenge in computational pathology. While mainstream approaches often adopt Multiple Instance Learning (MIL), they struggle to capture the spatial dependencies among tissue structures. Graph Neural Networks (GNNs) have emerged as a solution to model inter-instance relationships, yet most rely on static graph topologies and overlook the physical spatial positions of tissue patches. Moreover, conventional attention mechanisms lack specificity, limiting their ability to focus on structurally relevant regions. In this work, we propose a novel GNN framework with deformable attention for pathology image analysis. We construct a dynamic weighted directed graph based on patch features, where each node aggregates contextual information from its neighbors via attention-weighted edges. Specifically, we incorporate learnable spatial offsets informed by the real coordinates of each patch, enabling the model to adaptively attend to morphologically relevant regions across the slide. This design significantly enhances the contextual field while preserving spatial specificity. Our framework achieves state-of-the-art performance on four benchmark datasets (TCGA-COAD, BRACS, gastric intestinal metaplasia grading, and intestinal ROI classification), demonstrating the power of deformable attention in capturing complex spatial structures in WSIs and ROIs.

[115] UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation

Wonjun Kang,Byeongkeun Ahn,Minjae Lee,Kevin Galim,Seunghyuk Oh,Hyung Il Koo,Nam Ik Cho

Main category: cs.CV

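TL;DR: This paper proposes UNCAGE, a new text-to-image generation method that uses attention maps to guide the unmasking process, improving the compositional fidelity and efficiency of generated images.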

Details Motivation: Although Diffusion Models and Autoregressive Models have been widely studied for text-to-image generation, compositional text-to-image generation remains challenging. Masked Generative Transformers, an alternative to Autoregressive Models, achieve efficient and high-quality image generation through bidirectional attention and parallel decoding, but still have limitations in compositional text-to-image generation. Method: Proposes UNCAGE, a new method that uses attention maps to prioritize unmasking tokens that clearly represent individual objects, improving the compositional fidelity of generated images. Result: UNCAGE shows consistent improvements in quantitative and qualitative evaluations across multiple benchmarks and metrics, with negligible inference overhead. Conclusion: UNCAGE is a training-free method that uses attention maps to prioritize unmasking of tokens representing individual objects, improving the quality and efficiency of generated images. Abstract: Text-to-image (T2I) generation has been actively studied using Diffusion Models and Autoregressive Models. Recently, Masked Generative Transformers have gained attention as an alternative to Autoregressive Models to overcome the inherent limitations of causal attention and autoregressive decoding through bidirectional attention and parallel decoding, enabling efficient and high-quality image generation. However, compositional T2I generation remains challenging, as even state-of-the-art Diffusion Models often fail to accurately bind attributes and achieve proper text-image alignment. While Diffusion Models have been extensively studied for this issue, Masked Generative Transformers exhibit similar limitations but have not been explored in this context. To address this, we propose Unmasking with Contrastive Attention Guidance (UNCAGE), a novel training-free method that improves compositional fidelity by leveraging attention maps to prioritize the unmasking of tokens that clearly represent individual objects. UNCAGE consistently improves performance in both quantitative and qualitative evaluations across multiple benchmarks and metrics, with negligible inference overhead. Our code is available at https://github.com/furiosa-ai/uncage.

[116] From Detection to Correction: Backdoor-Resilient Face Recognition via Vision-Language Trigger Detection and Noise-Based Neutralization

Farah Wahida,M. A. P. Chamikara,Yashothara Shanmugarasa,Mohan Baruwal Chhetri,Thilina Ranbaduge,Ibrahim Khalil

Main category: cs.CV

TL;DR: TrueBiometric is a novel approach to defending against backdoor attacks in face recognition systems by detecting and correcting poisoned images with high accuracy.

Details Motivation: Biometric systems are vulnerable to backdoor attacks that manipulate training data, and existing defenses struggle to identify and mitigate poisoned images without reducing system performance. Method: TrueBiometric employs a majority voting mechanism using state-of-the-art vision language models to detect poisoned images and applies calibrated corrective noise to mitigate the attacks. Result: TrueBiometric achieves 100% accuracy in detecting and correcting poisoned images while maintaining performance on clean data, outperforming existing methods. Conclusion: TrueBiometric offers a practical and effective solution for mitigating backdoor attacks in biometric systems, ensuring high accuracy and reliability. Abstract: Biometric systems, such as face recognition systems powered by deep neural networks (DNNs), rely on large and highly sensitive datasets. Backdoor attacks can subvert these systems by manipulating the training process. By inserting a small trigger, such as a sticker, make-up, or patterned mask, into a few training images, an adversary can later present the same trigger during authentication to be falsely recognized as another individual, thereby gaining unauthorized access. Existing defense mechanisms against backdoor attacks still face challenges in precisely identifying and mitigating poisoned images without compromising data utility, which undermines the overall reliability of the system. We propose a novel and generalizable approach, TrueBiometric: Trustworthy Biometrics, which accurately detects poisoned images using a majority voting mechanism leveraging multiple state-of-the-art large vision language models. Once identified, poisoned samples are corrected using targeted and calibrated corrective noise. Our extensive empirical results demonstrate that TrueBiometric detects and corrects poisoned images with 100% accuracy without compromising accuracy on clean images. 
Compared to existing state-of-the-art approaches, TrueBiometric offers a more practical, accurate, and effective solution for mitigating backdoor attacks in face recognition systems.

[117] Physical Adversarial Camouflage through Gradient Calibration and Regularization

Jiawei Liang,Siyuan Liang,Jianjie Huang,Chenxi Si,Ming Zhang,Xiaochun Cao

Main category: cs.CV

TL;DR: This paper proposes a new adversarial camouflage framework to deceive deep object detectors by addressing challenges in gradient optimization. The framework improves attack effectiveness and stability, showing significant improvements in attack success rates.

Details Motivation: Deep object detectors are crucial in safety-critical domains like autonomous driving, but physical adversarial camouflage poses a significant threat by deceiving these detectors. Existing methods struggle with varying physical environments due to challenges like inconsistent sampling point densities and conflicting gradient updates from multiple angles. Method: The paper introduces two key techniques: gradient calibration, which propagates gradients from sparse to unsampled texture points for consistent updates, and gradient decorrelation, which prioritizes and orthogonalizes gradients to eliminate conflicts. These methods are validated through extensive experiments on various detection models, angles, and distances. Result: The proposed method achieves a significant improvement over the state of the art, with an average increase in attack success rate (ASR) of 13.46% across distances and 11.03% across angles. Real-world evaluations also demonstrate its effectiveness and highlight the need for robust system design. Conclusion: The proposed adversarial camouflage framework based on gradient optimization effectively addresses the issues of existing techniques by introducing gradient calibration and decorrelation methods, which enhance attack effectiveness and stability. The empirical evaluation emphasizes the importance of robust system design. Abstract: The advancement of deep object detectors has greatly affected safety-critical fields like autonomous driving. However, physical adversarial camouflage poses a significant security risk by altering object textures to deceive detectors. Existing techniques struggle with variable physical environments, facing two main challenges: 1) inconsistent sampling point densities across distances hinder the gradient optimization from ensuring local continuity, and 2) updating texture gradients from multiple angles causes conflicts, reducing optimization stability and attack effectiveness. 
To address these issues, we propose a novel adversarial camouflage framework based on gradient optimization. First, we introduce a gradient calibration strategy, which ensures consistent gradient updates across distances by propagating gradients from sparsely to unsampled texture points. Additionally, we develop a gradient decorrelation method, which prioritizes and orthogonalizes gradients based on loss values, enhancing stability and effectiveness in multi-angle optimization by eliminating redundant or conflicting updates. Extensive experimental results on various detection models, angles and distances show that our method significantly exceeds the state of the art, with an average increase in attack success rate (ASR) of 13.46% across distances and 11.03% across angles. Furthermore, empirical evaluation in real-world scenarios highlights the need for more robust system design.
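
The gradient decorrelation step described in the abstract above can be illustrated with a small sketch. The abstract does not give the exact procedure, so the code below makes two assumptions: views with a higher loss are given priority, and "orthogonalizing" is done with plain Gram-Schmidt projection. The function names `dot` and `decorrelate` are hypothetical and do not come from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def decorrelate(grads, losses, eps=1e-12):
    """Combine per-view texture gradients after removing conflicts.

    Assumption (not from the paper): gradients are processed in order of
    descending loss, and each later gradient has its components along the
    already accepted gradients removed (Gram-Schmidt).
    """
    if not grads:
        return []
    order = sorted(range(len(grads)), key=lambda i: losses[i], reverse=True)
    basis = []  # mutually orthogonal gradients accepted so far
    for i in order:
        g = list(grads[i])
        for b in basis:
            coef = dot(g, b) / dot(b, b)
            g = [gi - coef * bi for gi, bi in zip(g, b)]
        if dot(g, g) > eps:  # drop gradients that became redundant
            basis.append(g)
    dim = len(grads[0])
    return [sum(b[k] for b in basis) for k in range(dim)]


# A directly conflicting low-priority gradient is projected away entirely.
combined = decorrelate([[1.0, 0.0], [-1.0, 0.0]], losses=[0.9, 0.1])
```

In this toy call the second gradient points exactly opposite to the first, so after projection nothing of it remains and only the high-loss view drives the update.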

[118] Smoothing Slot Attention Iterations and Recurrences

Rongzhen Zhao,Wenyan Yang,Juho Kannala,Joni Pajarinen

Main category: cs.CV

TL;DR: SmoothSA improves Slot Attention by refining initial queries and handling video frames differently, enhancing object-centric learning performance.

Details Motivation: Cold-start queries in Slot Attention lack sample-specific cues, hindering precise object aggregation, especially in the first frame of images or videos. Non-first frames have sample-specific queries requiring different transforms. Method: The authors propose SmoothSA, which preheats cold-start queries using a self-distilled module and differentiates transforms for first and non-first video frames, using full and single iterations respectively. Result: Comprehensive experiments show that SmoothSA enhances performance on object discovery, recognition, and downstream benchmarks, validating its effectiveness in smoothing SA iterations and recurrences. Conclusion: SmoothSA improves the Slot Attention mechanism by preheating cold-start queries and differentiating transforms for video frames, enhancing object-centric learning effectiveness. Abstract: Slot Attention (SA) and its variants lie at the heart of mainstream Object-Centric Learning (OCL). Objects in an image can be aggregated into respective slot vectors, by iteratively refining cold-start query vectors, typically three times, via SA on image features. For video, such aggregation is recurrently shared across frames, with queries cold-started on the first frame while transitioned from the previous frame's slots on non-first frames. However, the cold-start queries lack sample-specific cues and thus hinder precise aggregation on the image or video's first frame; also, non-first frames' queries are already sample-specific and thus require transforms different from the first frame's aggregation. 
We address these issues for the first time with our SmoothSA: (1) To smooth SA iterations on the image or video's first frame, we preheat the cold-start queries with rich information of input features, via a tiny module self-distilled inside OCL; (2) To smooth SA recurrences across all video frames, we differentiate the homogeneous transforms on the first and non-first frames, by using full and single iterations respectively. Comprehensive experiments on object discovery, recognition and downstream benchmarks validate our method's effectiveness. Further analyses intuitively illuminate how our method smooths SA iterations and recurrences. Our code is available in the supplement.
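
The iteration schedule in the abstract (full iterations on the first frame, a single iteration on later frames) can be sketched as control flow only. Everything here is an assumption for illustration: `aggregate_video`, `refine`, and `preheat` are hypothetical names, the refinement step is passed in as a callable rather than implemented, and slots are carried to the next frame unchanged instead of going through a learned transition.

```python
def aggregate_video(frames, cold_queries, refine, preheat=None, full_iters=3):
    """Return the slots obtained for each frame of a video.

    refine(queries, feats)  -> one Slot-Attention-style refinement step
    preheat(queries, feats) -> optional warm-up of the cold-start queries
    Both callables are placeholders supplied by the caller.
    """
    slots_per_frame = []
    queries = cold_queries
    for t, feats in enumerate(frames):
        if t == 0:
            # First frame: optionally preheat, then run the full iterations.
            if preheat is not None:
                queries = preheat(queries, feats)
            n_iters = full_iters
        else:
            # Later frames start from the previous slots: one iteration.
            n_iters = 1
        for _ in range(n_iters):
            queries = refine(queries, feats)
        slots_per_frame.append(queries)
    return slots_per_frame


# Toy check of the schedule: each refinement step just increments a counter.
steps = aggregate_video([None] * 4, 0, refine=lambda q, f: q + 1)
```

With four frames the toy counter reaches 3 after the first frame and grows by one per later frame, which makes the 3 + 1 + 1 + 1 schedule visible.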

[119] Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions

Hubert Baniecki,Maximilian Muschalik,Fabian Fumagalli,Barbara Hammer,Eyke Hüllermeier,Przemyslaw Biecek

Main category: cs.CV

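TL;DR: This paper proposes FIxLIP, a game-theoretic second-order interaction explanation method that improves the interpretability of vision-language pre-trained models and enables model comparison.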

Details Motivation: Existing saliency-map methods capture only first-order attributions and cannot reflect the complex cross-modal interactions in vision-language models. Method: Grounded in game theory, the method decomposes similarity using the weighted Banzhaf interaction index and extends explanation evaluation metrics to second-order interaction explanations. Result: On the MS COCO and ImageNet-1k benchmarks, second-order methods such as FIxLIP outperform first-order attribution methods and can effectively compare different models. Conclusion: FIxLIP provides a more efficient and flexible second-order interaction explanation method for analyzing the similarity outputs of vision-language pre-trained models. Abstract: Language-image pre-training (LIP) enables the development of vision-language models capable of zero-shot classification, localization, multimodal retrieval, and semantic understanding. Various explanation methods have been proposed to visualize the importance of input image-text pairs on the model's similarity outputs. However, popular saliency maps are limited by capturing only first-order attributions, overlooking the complex cross-modal interactions intrinsic to such encoders. We introduce faithful interaction explanations of LIP models (FIxLIP) as a unified approach to decomposing the similarity in vision-language encoders. FIxLIP is rooted in game theory, where we analyze how using the weighted Banzhaf interaction index offers greater flexibility and improves computational efficiency over the Shapley interaction quantification framework. From a practical perspective, we propose how to naturally extend explanation evaluation metrics, like the pointing game and area between the insertion/deletion curves, to second-order interaction explanations. Experiments on MS COCO and ImageNet-1k benchmarks validate that second-order methods like FIxLIP outperform first-order attribution methods. Beyond delivering high-quality explanations, we demonstrate the utility of FIxLIP in comparing different models like CLIP vs. SigLIP-2 and ViT-B/32 vs. ViT-L/16.
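
To make the second-order idea concrete, here is a minimal sketch of a pairwise weighted Banzhaf interaction computed by exact enumeration on a toy game. It is not the FIxLIP implementation: the definition used (every other player joins the coalition independently with probability `p`) is the common form of the weighted Banzhaf index and is assumed here, and `banzhaf_interaction` and `toy_similarity` are hypothetical names. In the paper's setting the players would be image patches and text tokens, and the value function would be the encoder's similarity on masked inputs.

```python
from itertools import combinations


def banzhaf_interaction(value, n, i, j, p=0.5):
    """Weighted Banzhaf interaction of players i and j in an n-player game.

    value(S) maps a frozenset of player indices to a number. Each of the
    other n - 2 players is included independently with probability p.
    """
    others = [k for k in range(n) if k not in (i, j)]
    m = len(others)
    total = 0.0
    for size in range(m + 1):
        weight = p ** size * (1 - p) ** (m - size)
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Discrete second-order difference of the value function.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


def toy_similarity(s):
    """Additive toy game with one synergy: players 0 and 1 together add 2."""
    return 1.0 * len(s) + (2.0 if {0, 1} <= s else 0.0)


pair_01 = banzhaf_interaction(toy_similarity, 4, 0, 1)
pair_02 = banzhaf_interaction(toy_similarity, 4, 0, 2)
```

The enumeration costs on the order of 2^(n-2) value calls per pair, so it only suits toy sizes; practical methods estimate these quantities by sampling.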

[120] How and Why: Taming Flow Matching for Unsupervised Anomaly Detection and Localization

Liangwei Li,Lin Liu,Juanxiu Liu,Jing Zhang,Ruqian Hao,Xiaohui Du

Main category: cs.CV

TL;DR: This paper proposes a new paradigm for unsupervised anomaly detection and localization using Flow Matching, offering a theoretically grounded separation mechanism for anomalous samples and achieving state-of-the-art performance on the MVTec dataset.

Details Motivation: The motivation is to address the model expressivity limitations of conventional flow-based methods in unsupervised anomaly detection and localization. Method: The paper proposes Worst Transport (WT) displacement interpolation to reconstruct a non-probabilistic evolution path for unsupervised anomaly detection and localization using Flow Matching (FM). Result: The paper presents the first successful application of Flow Matching for unsupervised anomaly detection, achieving state-of-the-art performance at a single scale on the MVTec dataset. Conclusion: The paper concludes that the proposed WT-Flow method offers a novel and effective unsupervised paradigm for anomaly detection, providing a theoretically grounded separation mechanism for anomalous samples. Abstract: We propose a new paradigm for unsupervised anomaly detection and localization using Flow Matching (FM), which fundamentally addresses the model expressivity limitations of conventional flow-based methods. To this end, we formalize the concept of time-reversed Flow Matching (rFM) as a vector field regression along a predefined probability path to transform unknown data distributions into standard Gaussian. We bring two core observations that reshape our understanding of FM. First, we rigorously prove that FM with linear interpolation probability paths is inherently non-invertible. Second, our analysis reveals that employing reversed Gaussian probability paths in high-dimensional spaces can lead to trivial vector fields. This issue arises due to the manifold-related constraints. Building on the second observation, we propose Worst Transport (WT) displacement interpolation to reconstruct a non-probabilistic evolution path. The proposed WT-Flow enhances dynamical control over sample trajectories, constructing ''degenerate potential wells'' for anomaly-free samples while allowing anomalous samples to escape. 
This novel unsupervised paradigm offers a theoretically grounded separation mechanism for anomalous samples. Notably, FM provides a computationally tractable framework that scales to complex data. We present the first successful application of FM for the unsupervised anomaly detection task, achieving state-of-the-art performance at a single scale on the MVTec dataset. The reproducible code for training will be released upon camera-ready submission.
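
For readers unfamiliar with Flow Matching, the sketch below shows only the standard linear-interpolation path and its regression target, written in the time-reversed direction the abstract describes (data at t = 0, Gaussian noise at t = 1). It does not implement the paper's Worst Transport interpolation, whose formula is not given here. The names `linear_path_sample`, `fm_loss`, and `predict_velocity` are hypothetical.

```python
def linear_path_sample(x_data, x_noise, t):
    """Point on the straight line from a data sample to a noise sample.

    Returns (x_t, u_t): the interpolated point and the constant target
    velocity x_noise - x_data that a vector field would regress onto.
    """
    x_t = [(1 - t) * a + t * b for a, b in zip(x_data, x_noise)]
    u_t = [b - a for a, b in zip(x_data, x_noise)]
    return x_t, u_t


def fm_loss(predict_velocity, x_data, x_noise, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t, u_t = linear_path_sample(x_data, x_noise, t)
    v = predict_velocity(x_t, t)
    return sum((vi - ui) ** 2 for vi, ui in zip(v, u_t)) / len(u_t)


# Toy check with fixed endpoints instead of random samples.
x_t, u_t = linear_path_sample([0.0, 0.0], [2.0, 4.0], 0.25)
perfect = fm_loss(lambda x, t: [2.0, 4.0], [0.0, 0.0], [2.0, 4.0], 0.25)
```

In training, `x_noise` would be drawn from a standard Gaussian and `t` uniformly from [0, 1]; fixed values are used here so the result can be checked by hand.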

[121] F2PASeg: Feature Fusion for Pituitary Anatomy Segmentation in Endoscopic Surgery

Lumin Chen,Zhiying Wu,Tianye Lei,Xuexue Bai,Ming Feng,Yuxi Wang,Gaofeng Meng,Zhen Lei,Hongbin Liu

Main category: cs.CV

TL;DR: The study introduces F2PASeg, a method for refining anatomical structure segmentation in pituitary surgery using a Feature Fusion module and data augmentation techniques, providing a reliable solution for real-time critical structure segmentation during surgery.

Details Motivation: Pituitary tumors often cause deformation or encapsulation of adjacent vital structures, and pixel-level annotated video stream datasets for pituitary surgeries are extremely rare. This study aims to enhance the safety of pituitary surgery by providing early warnings of regions that pose surgical risks. Method: F2PASeg incorporates a Feature Fusion module to refine anatomical structure segmentation by leveraging both high-resolution image features and deep semantic embeddings. Data augmentation techniques were also applied to mitigate class imbalance. Result: Experimental results demonstrate that F2PASeg consistently segments critical anatomical structures in real time. Conclusion: F2PASeg provides a reliable solution for intraoperative pituitary surgery planning by consistently segmenting critical anatomical structures in real time. Abstract: Pituitary tumors often cause deformation or encapsulation of adjacent vital structures. Anatomical structure segmentation can provide surgeons with early warnings of regions that pose surgical risks, thereby enhancing the safety of pituitary surgery. However, pixel-level annotated video stream datasets for pituitary surgeries are extremely rare. To address this challenge, we introduce a new dataset for Pituitary Anatomy Segmentation (PAS). PAS comprises 7,845 time-coherent images extracted from 120 videos. To mitigate class imbalance, we apply data augmentation techniques that simulate the presence of surgical instruments in the training data. One major challenge in pituitary anatomy segmentation is the inconsistency in feature representation due to occlusions, camera motion, and surgical bleeding. By incorporating a Feature Fusion module, F2PASeg is proposed to refine anatomical structure segmentation by leveraging both high-resolution image features and deep semantic embeddings, enhancing robustness against intraoperative variations. 
Experimental results demonstrate that F2PASeg consistently segments critical anatomical structures in real time, providing a reliable solution for intraoperative pituitary surgery planning. Code: https://github.com/paulili08/F2PASeg.

[122] Keep It Real: Challenges in Attacking Compression-Based Adversarial Purification

Samuel Räber,Till Aczel,Andreas Plesner,Roger Wattenhofer

Main category: cs.CV

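TL;DR: This study finds that high-realism image compression models are more robust to adversarial attacks, posing a new challenge for future adversarial attacks.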

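Details Motivation: Prior work suggests that lossy compression as image preprocessing can defend against adversarial perturbations, but comprehensive attack evaluations have been lacking. This paper aims to fill that gap by analyzing how different compression models hold up under attack. Method: The paper constructs strong white-box and adaptive attacks against various compression models and evaluates them rigorously across multiple attack scenarios, identifying the key challenge attackers face and analyzing the security of different compression models. Result: Compression models that produce realistic, high-fidelity reconstructions are more resistant to attack, while low-realism compression models are easily broken. This resistance does not stem from gradient masking but from realistic reconstructions preserving the distribution of natural images. Conclusion: High-realism compression models whose reconstructions stay distributionally aligned with natural images appear to be inherently robust, posing a major obstacle for future adversarial attacks. Overcoming the challenge of high-realism reconstruction is essential for comprehensive security evaluation. Abstract: Previous work has suggested that preprocessing images through lossy compression can defend against adversarial perturbations, but comprehensive attack evaluations have been lacking. In this paper, we construct strong white-box and adaptive attacks against various compression models and identify a critical challenge for attackers: high realism in reconstructed images significantly increases attack difficulty. Through rigorous evaluation across multiple attack scenarios, we demonstrate that compression models capable of producing realistic, high-fidelity reconstructions are substantially more resistant to our attacks. In contrast, low-realism compression models can be broken. Our analysis reveals that this is not due to gradient masking. Rather, realistic reconstructions maintaining distributional alignment with natural images seem to offer inherent robustness. This work highlights a significant obstacle for future adversarial attacks and suggests that developing more effective techniques to overcome realism represents an essential challenge for comprehensive security evaluation.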

[123] SMOL-MapSeg: Show Me One Label

Yunshuang Yuan,Frank Thiemann,Thorsten Dahms,Monika Sester

Main category: cs.CV

TL;DR: This paper introduces SMOL-MapSeg, a novel semantic segmentation model for historical maps that integrates OND knowledge-based prompting to improve performance over traditional models like UNet.

Details Motivation: Historical maps are challenging for existing pre-trained foundation models because they lack consistency in how concepts are visually represented. These models are typically trained on modern or domain-specific images with well-defined visual concepts, which do not translate well to the diverse styles of historical maps. Method: The authors modified the prompt encoder of the foundation model SAM with an On-Need Declarative (OND) knowledge-based prompting mechanism. This approach allows users to explicitly define concepts and patterns during inference. The model was fine-tuned on historical maps for semantic segmentation. Result: SMOL-MapSeg demonstrated accurate segmentation of classes defined by OND knowledge, adapted to unseen classes via few-shot learning, and outperformed a UNet-based baseline in average segmentation performance. Conclusion: The proposed SMOL-MapSeg model, which incorporates OND knowledge-based prompting, outperforms traditional models like UNet in segmenting historical maps and adapts to unseen classes through few-shot fine-tuning. Abstract: Historical maps are valuable for studying changes to the Earth's surface. With the rise of deep learning, models like UNet have been used to extract information from these maps through semantic segmentation. Recently, pre-trained foundation models have shown strong performance across domains such as autonomous driving, medical imaging, and industrial inspection. However, they struggle with historical maps. These models are trained on modern or domain-specific images, where patterns can be tied to predefined concepts through common sense or expert knowledge. Historical maps lack such consistency -- similar concepts can appear in vastly different shapes and styles. To address this, we propose On-Need Declarative (OND) knowledge-based prompting, which introduces explicit prompts to guide the model on what patterns correspond to which concepts. 
This allows users to specify the target concept and pattern during inference (on-need inference). We implement this by replacing the prompt encoder of the foundation model SAM with our OND prompting mechanism and fine-tune it on historical maps. The resulting model is called SMOL-MapSeg (Show Me One Label). Experiments show that SMOL-MapSeg can accurately segment classes defined by OND knowledge. It can also adapt to unseen classes through few-shot fine-tuning. Additionally, it outperforms a UNet-based baseline in average segmentation performance.

[124] AutoIAD: Manager-Driven Multi-Agent Collaboration for Automated Industrial Anomaly Detection

Dongwei Ji,Bingzhang Hu,Yi Zhou

Main category: cs.CV

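TL;DR: AutoIAD is a multi-agent collaboration framework designed for industrial visual anomaly detection; by integrating a domain-specific knowledge base and coordinating sub-agents, it achieves efficient, high-performing anomaly detection.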

Details Motivation: Industrial anomaly detection is critical for manufacturing quality control, but conventional approaches require substantial manual effort. Method: Introduces a Manager-Driven central agent that coordinates specialized sub-agents and integrates a domain-specific knowledge base to handle the entire pipeline from raw industrial image data to a trained anomaly detection model. Result: Experiments show that AutoIAD performs strongly in task completion rate and model performance (AUROC), and effectively mitigates hallucination through iterative refinement. Conclusion: AutoIAD is a multi-agent collaboration framework designed for industrial visual anomaly detection that significantly outperforms existing general-purpose agentic collaboration frameworks and traditional AutoML frameworks. Abstract: Industrial anomaly detection (IAD) is critical for manufacturing quality control, but conventionally requires significant manual effort for various application scenarios. This paper introduces AutoIAD, a multi-agent collaboration framework, specifically designed for end-to-end automated development of industrial visual anomaly detection. AutoIAD leverages a Manager-Driven central agent to orchestrate specialized sub-agents (including Data Preparation, Data Loader, Model Designer, Trainer) and integrates a domain-specific knowledge base, which intelligently handles the entire pipeline using raw industrial image data to develop a trained anomaly detection model. We construct a comprehensive benchmark using MVTec AD datasets to evaluate AutoIAD across various LLM backends. Extensive experiments demonstrate that AutoIAD significantly outperforms existing general-purpose agentic collaboration frameworks and traditional AutoML frameworks in task completion rate and model performance (AUROC), while effectively mitigating issues like hallucination through iterative refinement. Ablation studies further confirm the crucial roles of the Manager central agent and the domain knowledge base module in producing robust and high-quality IAD solutions.

[125] Symmetry Understanding of 3D Shapes via Chirality Disentanglement

Weikang Wang,Tobias Weißberg,Nafie El Amrani,Florian Bernard

Main category: cs.CV

TL;DR: This paper proposes an unsupervised chirality feature extraction pipeline using the Diff3F framework to address the lack of chirality-aware features in shape analysis, effectively improving the ability to distinguish left from right in symmetric parts across various tasks.

Details Motivation: The motivation stems from the underdevelopment of chirality exploration in shape analysis compared to the image domain. Current shape descriptors often fail to distinguish between left and right symmetric parts, highlighting the need for chirality-aware features. Method: The method utilizes the Diff3F framework to develop an unsupervised chirality feature extraction pipeline. This pipeline extracts chirality-aware information from 2D foundation models and applies it to shape vertices in point clouds and meshes. Result: The results show that the extracted chirality features are effective in downstream tasks, including left-right disentanglement, shape matching, and part segmentation, as demonstrated through quantitative and qualitative experiments on diverse datasets. Conclusion: The paper concludes that the proposed unsupervised chirality feature extraction pipeline effectively decorates shape vertices with chirality-aware information, demonstrating practical utility across various datasets and tasks such as left-right disentanglement, shape matching, and part segmentation. Abstract: Chirality information (i.e. information that allows distinguishing left from right) is ubiquitous for various data modes in computer vision, including images, videos, point clouds, and meshes. While chirality has been extensively studied in the image domain, its exploration in shape analysis (such as point clouds and meshes) remains underdeveloped. Although many shape vertex descriptors have shown appealing properties (e.g. robustness to rigid-body transformations), they are often not able to disambiguate between left and right symmetric parts. Considering the ubiquity of chirality information in different shape analysis problems and the lack of chirality-aware features within current shape descriptors, developing a chirality feature extractor becomes necessary and urgent. 
Based on the recent Diff3F framework, we propose an unsupervised chirality feature extraction pipeline to decorate shape vertices with chirality-aware information, extracted from 2D foundation models. We evaluated the extracted chirality features through quantitative and qualitative experiments across diverse datasets. Results from downstream tasks including left-right disentanglement, shape matching, and part segmentation demonstrate their effectiveness and practical utility. Project page: https://wei-kang-wang.github.io/chirality/

[126] MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips

Shibo Wang,Haonan He,Maria Parelli,Christoph Gebhardt,Zicong Fan,Jie Song

Main category: cs.CV

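TL;DR: MagicHOI uses diffusion-model priors and contact constraints to address hand-object reconstruction under limited viewpoints, outperforming existing methods.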

Details Motivation: Existing RGB-based hand-object reconstruction methods rely on object templates or assume full visibility, which rarely holds in real-world scenes and leads to poor reconstructions. Method: Integrates a large-scale novel view synthesis diffusion model into the hand-object reconstruction framework and aligns hand to object through visible contact constraints. Result: MagicHOI achieves high-quality hand-object reconstruction from monocular video even under limited viewpoint variation, and effectively regularizes unseen regions. Conclusion: MagicHOI significantly outperforms existing state-of-the-art hand-object reconstruction methods, and diffusion-model priors effectively regularize unseen regions, improving 3D hand-object reconstruction. Abstract: Most RGB-based hand-object reconstruction methods rely on object templates, while template-free methods typically assume full object visibility. This assumption often breaks in real-world settings, where fixed camera viewpoints and static grips leave parts of the object unobserved, resulting in implausible reconstructions. To overcome this, we present MagicHOI, a method for reconstructing hands and objects from short monocular interaction videos, even under limited viewpoint variation. Our key insight is that, despite the scarcity of paired 3D hand-object data, large-scale novel view synthesis diffusion models offer rich object supervision. This supervision serves as a prior to regularize unseen object regions during hand interactions. Leveraging this insight, we integrate a novel view synthesis model into our hand-object reconstruction framework. We further align hand to object by incorporating visible contact constraints. Our results demonstrate that MagicHOI significantly outperforms existing state-of-the-art hand-object reconstruction methods. We also show that novel view synthesis diffusion priors effectively regularize unseen object regions, enhancing 3D hand-object reconstruction.

[127] Revealing Latent Information: A Physics-inspired Self-supervised Pre-training Framework for Noisy and Sparse Events

Lin Zhu,Ruonan Liu,Xiao Wang,Lizhi Wang,Hua Huang

Main category: cs.CV

TL;DR: This paper proposes a self-supervised pre-training framework for event cameras that enhances feature extraction from sparse and noisy event data, leading to improved performance on downstream tasks like object recognition and semantic segmentation.

Details Motivation: The motivation is to address the inherent sparsity and noise in event data from neuromorphic vision sensors, which complicates effective feature extraction, by fully revealing latent information in the data. Method: The paper proposes a self-supervised pre-training framework consisting of three stages: Difference-guided Masked Modeling, Backbone-fixed Feature Transition, and Focus-aimed Contrastive Learning. Result: Extensive experiments show that the framework is robust and consistently outperforms state-of-the-art methods on various downstream tasks, including object recognition, semantic segmentation, and optical flow estimation. Conclusion: The paper concludes that the proposed self-supervised pre-training framework is robust and consistently outperforms state-of-the-art methods on various downstream tasks, including object recognition, semantic segmentation, and optical flow estimation. Abstract: Event camera, a novel neuromorphic vision sensor, records data with high temporal resolution and wide dynamic range, offering new possibilities for accurate visual representation in challenging scenarios. However, event data is inherently sparse and noisy, mainly reflecting brightness changes, which complicates effective feature extraction. To address this, we propose a self-supervised pre-training framework to fully reveal latent information in event data, including edge information and texture cues. Our framework consists of three stages: Difference-guided Masked Modeling, inspired by the event physical sampling process, reconstructs temporal intensity difference maps to extract enhanced information from raw event data. Backbone-fixed Feature Transition contrasts event and image features without updating the backbone to preserve representations learned from masked modeling and stabilizing their effect on contrastive learning. 
Focus-aimed Contrastive Learning updates the entire model to improve semantic discrimination by focusing on high-value regions. Extensive experiments show our framework is robust and consistently outperforms state-of-the-art methods on various downstream tasks, including object recognition, semantic segmentation, and optical flow estimation. The code and dataset are available at https://github.com/BIT-Vision/EventPretrain.

[128] Head Anchor Enhanced Detection and Association for Crowded Pedestrian Tracking

Zewei Wu,César Teixeira,Wei Ke,Zhang Xiong

Main category: cs.CV

TL;DR: This paper proposes an enhanced visual pedestrian tracking framework that addresses occlusion challenges by leveraging richer feature representations and a more robust motion model.

Details Motivation: Visual pedestrian tracking faces significant occlusion challenges in real-world applications. Traditional tracking methods struggle in severe occlusion scenarios. Method: The proposed method incorporates detection features from both the regression and classification branches of an object detector, and introduces a head keypoint detection model. An iterative Kalman filtering approach is proposed for motion modeling. Result: The proposed method provides an enhanced tracking framework that leverages richer feature representations and a more robust motion model, thus mitigating occlusion challenges. Conclusion: The proposed method offers a more robust solution for multi-object tracking in crowded environments where occlusions are prevalent. Abstract: Visual pedestrian tracking represents a promising research field, with extensive applications in intelligent surveillance, behavior analysis, and human-computer interaction. However, real-world applications face significant occlusion challenges. When multiple pedestrians interact or overlap, the loss of target features severely compromises the tracker's ability to maintain stable trajectories. Traditional tracking methods, which typically rely on full-body bounding box features extracted from Re-ID models and linear constant-velocity motion assumptions, often struggle in severe occlusion scenarios. To address these limitations, this work proposes an enhanced tracking framework that leverages richer feature representations and a more robust motion model. Specifically, the proposed method incorporates detection features from both the regression and classification branches of an object detector, embedding spatial and positional information directly into the feature representations. To further mitigate occlusion challenges, a head keypoint detection model is introduced, as the head is less prone to occlusion compared to the full body. 
In terms of motion modeling, we propose an iterative Kalman filtering approach designed to align with modern detector assumptions, integrating 3D priors to better complete motion trajectories in complex scenes. By combining these advancements in appearance and motion modeling, the proposed method offers a more robust solution for multi-object tracking in crowded environments where occlusions are prevalent.
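The abstract contrasts its iterative Kalman filtering with the linear constant-velocity assumption used by traditional trackers. As a point of reference only, the following is a minimal sketch of that standard constant-velocity Kalman filter, not the paper's iterative variant. The state layout `[x, y, vx, vy]`, the noise levels `q` and `r`, and all function names are assumptions made for illustration.

```python
import numpy as np

def make_cv_model(dt=1.0, q=1e-2, r=1.0):
    """Constant-velocity model: state [x, y, vx, vy], position-only measurements."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)   # state transition
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # measurement picks out (x, y)
    Q = q * np.eye(4)                           # assumed process noise
    R = r * np.eye(2)                           # assumed measurement noise
    return F, H, Q, R

def predict(x, P, F, Q):
    """Propagate the state and covariance one frame ahead."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, H, R):
    """Correct the prediction with a position measurement z."""
    y = z - H @ x                               # innovation
    S = H @ P @ H.T + R                         # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)              # Kalman gain
    return x + K @ y, (np.eye(len(x)) - K @ H) @ P

def track(measurements, dt=1.0):
    """Filter a list of (x, y) detections; None marks an occluded frame."""
    F, H, Q, R = make_cv_model(dt)
    x = np.array([measurements[0][0], measurements[0][1], 0.0, 0.0])
    P = 10.0 * np.eye(4)
    for z in measurements[1:]:
        x, P = predict(x, P, F, Q)
        if z is not None:                       # occluded frames use prediction only
            x, P = update(x, P, np.asarray(z, dtype=float), H, R)
    return x, P
```

During occluded frames the filter coasts on its velocity estimate, which is exactly where a purely linear motion assumption becomes fragile in crowded scenes.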

[129] FS-IQA: Certified Feature Smoothing for Robust Image Quality Assessment

Ekaterina Shumitskaya,Dmitriy Vatolin,Anastasia Antsiferova

Main category: cs.CV

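TL;DR: This paper proposes a new certified defense for image quality assessment models that injects noise in the feature space, preserving image quality while providing robustness guarantees, and validates it on multiple datasets.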

Details Motivation: Traditional defenses usually inject Gaussian noise directly into the input image, which tends to degrade visual quality. This paper aims to solve that problem by providing robustness guarantees while preserving image fidelity. Method: Randomized smoothing is implemented by injecting Gaussian noise in the feature space rather than the input space, and the maximum singular value of the backbone network's Jacobian is analyzed to formally connect feature-space noise levels with input-space perturbations. Result: The method is validated on two benchmark datasets with six widely used full-reference and no-reference IQA models and compared against five state-of-the-art certified defenses, improving correlation with subjective quality scores by up to 30.9%. Conclusion: The paper proposes a new certified defense for IQA models based on feature-space randomized smoothing that provides robustness guarantees while preserving image quality and applies to both full-reference and no-reference IQA models. Abstract: We propose a novel certified defense method for Image Quality Assessment (IQA) models based on randomized smoothing with noise applied in the feature space rather than the input space. Unlike prior approaches that inject Gaussian noise directly into input images, often degrading visual quality, our method preserves image fidelity while providing robustness guarantees. To formally connect noise levels in the feature space with corresponding input-space perturbations, we analyze the maximum singular value of the backbone network's Jacobian. Our approach supports both full-reference (FR) and no-reference (NR) IQA models without requiring any architectural modifications, suitable for various scenarios. It is also computationally efficient, requiring a single backbone forward pass per image. Compared to previous methods, it reduces inference time by 99.5% without certification and by 20.6% when certification is applied. We validate our method with extensive experiments on two benchmark datasets, involving six widely-used FR and NR IQA models and comparisons against five state-of-the-art certified defenses. Our results demonstrate consistent improvements in correlation with subjective quality scores by up to 30.9%.
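The two ideas in this abstract, smoothing in feature space and relating feature-space radii to input-space radii through the largest singular value of the backbone Jacobian, can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the median aggregation, the function names, and the sample count are assumptions, and the radius conversion is exact only for a linear backbone (for a real network it is a local first-order bound).

```python
import numpy as np

def smoothed_score(features, head, sigma, n_samples=1000, seed=0):
    """Monte Carlo estimate of a quality score smoothed with Gaussian feature noise.

    The backbone is assumed to have run once already; only the lightweight
    head is evaluated on each noisy copy of the features.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n_samples, features.shape[0]))
    scores = np.array([head(features + n) for n in noise])
    return float(np.median(scores))   # median aggregation is an assumption here

def input_radius(feature_radius, jacobian):
    """Convert a feature-space L2 radius into an input-space L2 radius.

    If s_max is the largest singular value of the backbone Jacobian, an input
    perturbation with norm <= feature_radius / s_max moves the features by at
    most feature_radius (exactly so when the backbone is linear).
    """
    s_max = np.linalg.svd(jacobian, compute_uv=False)[0]
    return feature_radius / s_max
```

For a toy linear backbone with Jacobian `np.diag([2.0, 0.5])`, the largest singular value is 2, so a feature-space radius of 1.0 corresponds to an input-space radius of 0.5. Because noise is added after the backbone, the expensive forward pass happens once per image, which matches the efficiency argument in the abstract.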

[130] Leveraging AI to Accelerate Clinical Data Cleaning: A Comparative Study of AI-Assisted vs. Traditional Methods

Matthew Purri,Amit Patel,Erik Deurrell

Main category: cs.CV

TL;DR: Octozi, an AI-assisted platform for clinical trial data cleaning, significantly improves efficiency and accuracy, reducing errors and workload across reviewers of all experience levels.

Details Motivation: Clinical trial data cleaning is a critical bottleneck due to increasing data volumes and complexity, which manual processes struggle to manage. Method: Octozi combines large language models with domain-specific heuristics. The platform was tested in a controlled experimental study involving 10 experienced clinical reviewers. Result: AI assistance increased data cleaning throughput by 6.03-fold, reduced cleaning errors from 54.67% to 8.48%, and decreased false positive queries by 15.48-fold, with consistent improvements across reviewers of all experience levels. Conclusion: AI-assisted approaches like Octozi can significantly improve the efficiency and accuracy of clinical trial data cleaning, demonstrating the potential of human-AI collaboration in pharmaceutical trials. Abstract: Clinical trial data cleaning represents a critical bottleneck in drug development, with manual review processes struggling to manage exponentially increasing data volumes and complexity. This paper presents Octozi, an artificial intelligence-assisted platform that combines large language models with domain-specific heuristics to transform clinical data review. In a controlled experimental study with experienced clinical reviewers (n=10), we demonstrate that AI assistance increased data cleaning throughput by 6.03-fold while simultaneously decreasing cleaning errors from 54.67% to 8.48% (a 6.44-fold improvement). Crucially, the system reduced false positive queries by 15.48-fold, minimizing unnecessary site burden. These improvements were consistent across reviewers regardless of experience level, suggesting broad applicability. Our findings indicate that AI-assisted approaches can address fundamental inefficiencies in clinical trial operations, potentially accelerating drug development timelines and reducing costs while maintaining regulatory compliance. 
This work establishes a framework for integrating AI into safety-critical clinical workflows and demonstrates the transformative potential of human-AI collaboration in pharmaceutical clinical trials.

[131] Optimal Brain Connection: Towards Efficient Structural Pruning

Shaowu Chen,Wei Ma,Binhua Huang,Qingyuan Wang,Guoxin Wang,Weize Sun,Lei Huang,Deepu John

Main category: cs.CV

TL;DR: This paper proposes Optimal Brain Connection, a structural pruning framework that considers parameter interconnections, resulting in improved model performance preservation.

Details Motivation: The motivation is to overcome the limitation of existing structural pruning methods that neglect interconnections among parameters. Method: The paper introduces the Jacobian Criterion for evaluating parameter saliency and proposes the Equivalent Pruning mechanism using autoencoders for preserving contributions during fine-tuning. Result: The Jacobian Criterion outperforms popular metrics in preserving model performance, and the Equivalent Pruning mechanism effectively reduces performance degradation after fine-tuning. Conclusion: Optimal Brain Connection proves to be an effective structural pruning framework that addresses the limitations of existing methods by considering parameter interconnections. Abstract: Structural pruning has been widely studied for its effectiveness in compressing neural networks. However, existing methods often neglect the interconnections among parameters. To address this limitation, this paper proposes a structural pruning framework termed Optimal Brain Connection. First, we introduce the Jacobian Criterion, a first-order metric for evaluating the saliency of structural parameters. Unlike existing first-order methods that assess parameters in isolation, our criterion explicitly captures both intra-component interactions and inter-layer dependencies. Second, we propose the Equivalent Pruning mechanism, which utilizes autoencoders to retain the contributions of all original connections, including pruned ones, during fine-tuning. Experimental results demonstrate that the Jacobian Criterion outperforms several popular metrics in preserving model performance, while the Equivalent Pruning mechanism effectively mitigates performance degradation after fine-tuning. Code: https://github.com/ShaowuChen/Optimal_Brain_Connection
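For readers unfamiliar with first-order saliency metrics, the sketch below shows the generic first-order Taylor criterion for structural (channel-level) pruning. It is not the paper's Jacobian Criterion, whose definition is not given in the abstract; it only illustrates the family of metrics the criterion belongs to. The array shapes and function names are assumptions.

```python
import numpy as np

def channel_saliency(weights, grads):
    """First-order Taylor saliency per output channel.

    Zeroing a channel's weights w changes the loss by roughly -(g . w),
    so |g . w| estimates how much the loss would move if the channel
    were removed. weights and grads have shape (out_channels, fan_in).
    """
    return np.abs((weights * grads).sum(axis=1))

def channels_to_prune(weights, grads, num_prune):
    """Return the indices of the num_prune least salient channels."""
    saliency = channel_saliency(weights, grads)
    return np.sort(np.argsort(saliency)[:num_prune])
```

Summing the weight-gradient products inside a channel before taking the absolute value scores the channel as a unit rather than parameter by parameter. Dependencies across layers, which the abstract says the Jacobian Criterion also captures, are outside the scope of this simple baseline.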

[132] When Deepfake Detection Meets Graph Neural Network: a Unified and Lightweight Learning Framework

Haoyu Liu,Chaoyu Gong,Mengke He,Jiate Li,Kai Han,Siqiang Luo

Main category: cs.CV

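TL;DR: This paper proposes SSTGNN, a lightweight video detection method that uses a graph neural network to combine spatial, spectral, and temporal information, detecting AI-generated and manipulated videos with high performance and strong robustness.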

Details Motivation: Existing video detection methods rely on isolated spatial, temporal, or spectral information and use large models, which makes it hard to generalize across diverse manipulation types. Method: SSTGNN represents videos as structured graphs, combining spatial, spectral, and temporal information, and introduces learnable spectral filters and temporal differential modeling. Result: SSTGNN performs strongly on multiple benchmark datasets, especially in cross-domain settings and on unseen manipulation types, with up to 42.4× fewer parameters than state-of-the-art models. Conclusion: SSTGNN is a lightweight video detection framework that performs well across different types of AI-generated and manipulated videos, with strong robustness and scalability. Abstract: The proliferation of generative video models has made detecting AI-generated and manipulated videos an urgent challenge. Existing detection approaches often fail to generalize across diverse manipulation types due to their reliance on isolated spatial, temporal, or spectral information, and typically require large models to perform well. This paper introduces SSTGNN, a lightweight Spatial-Spectral-Temporal Graph Neural Network framework that represents videos as structured graphs, enabling joint reasoning over spatial inconsistencies, temporal artifacts, and spectral distortions. SSTGNN incorporates learnable spectral filters and temporal differential modeling into a graph-based architecture, capturing subtle manipulation traces more effectively. Extensive experiments on diverse benchmark datasets demonstrate that SSTGNN not only achieves superior performance in both in-domain and cross-domain settings, but also offers strong robustness against unseen manipulations. Remarkably, SSTGNN accomplishes these results with up to 42.4× fewer parameters than state-of-the-art models, making it highly lightweight and scalable for real-world deployment.

[133] AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety

Adi Levi,Or Levi,Sardhendu Mishra,Jonathan Morra

Main category: cs.CV

TL;DR: This paper explores the use of Multimodal Large Language Models (MLLMs) like Gemini, GPT, and Llama in brand safety classification for content moderation, showing their effectiveness compared to human reviewers, while identifying limitations and failure cases.

Details Motivation: The exponential growth of online video content has made moderation beyond human capabilities, creating a need for automated solutions. While MLLMs have shown promise in video understanding tasks, their application in nuanced multimodal content moderation remains underexplored. Method: A detailed comparative analysis was conducted on MLLMs using a newly introduced, multimodal, and multilingual dataset labeled by professional reviewers in various risk categories. The performance of MLLMs was evaluated in terms of accuracy and cost efficiency compared to human reviewers. Result: MLLMs demonstrated effectiveness in multimodal brand safety classification, showing potential for content moderation while also revealing specific limitations and failure cases. Conclusion: The study concludes that MLLMs like Gemini, GPT, and Llama are effective in multimodal brand safety classification, offering a viable solution for content moderation, although they have certain limitations. Abstract: As the volume of video content online grows exponentially, the demand for moderation of unsafe videos has surpassed human capabilities, posing both operational and mental health challenges. While recent studies demonstrated the merits of Multimodal Large Language Models (MLLMs) in various video understanding tasks, their application to multimodal content moderation, a domain that requires nuanced understanding of both visual and textual cues, remains relatively underexplored. In this work, we benchmark the capabilities of MLLMs in brand safety classification, a critical subset of content moderation for safe-guarding advertising integrity. To this end, we introduce a novel, multimodal and multilingual dataset, meticulously labeled by professional reviewers in a multitude of risk categories. 
Through a detailed comparative analysis, we demonstrate the effectiveness of MLLMs such as Gemini, GPT, and Llama in multimodal brand safety, and evaluate their accuracy and cost efficiency compared to professional human reviewers. Furthermore, we present an in-depth discussion shedding light on limitations of MLLMs and failure cases. We are releasing our dataset alongside this paper to facilitate future research on effective and responsible brand safety and content moderation.

[134] Looking into the Unknown: Exploring Action Discovery for Segmentation of Known and Unknown Actions

Federico Spurio,Emad Bahrami,Olga Zatsarynna,Yazan Abu Farha,Gianpiero Francesca,Juergen Gall

Main category: cs.CV

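TL;DR: This paper introduces Action Discovery, a new setup within temporal action segmentation, and a two-step approach (GGSM and UASA) for identifying ambiguous or unlabeled actions in partially annotated datasets, validated on multiple datasets.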

Details Motivation: The paper addresses the challenges that ambiguous action definitions and partially annotated datasets pose for temporal action segmentation, especially when only a subset of actions (the known actions) is annotated and other, unknown actions are left unlabeled. Method: The paper proposes a two-step approach: 1) the Granularity-Guided Segmentation Module (GGSM), which identifies temporal intervals for both known and unknown actions; 2) the Unknown Action Segment Assignment (UASA), which identifies semantic classes within the unknown actions based on learned embedding similarities. Result: The approach is explored systematically on three challenging datasets (Breakfast, 50Salads, and Desktop Assembly), where it considerably improves upon existing baselines. Conclusion: The paper proposes a two-step approach, the Granularity-Guided Segmentation Module (GGSM) followed by the Unknown Action Segment Assignment (UASA), to handle ambiguous action definitions and incomplete annotations in partially labeled datasets, and shows on three challenging datasets that it outperforms existing baselines. Abstract: We introduce Action Discovery, a novel setup within Temporal Action Segmentation that addresses the challenge of defining and annotating ambiguous actions and incomplete annotations in partially labeled datasets. In this setup, only a subset of actions - referred to as known actions - is annotated in the training data, while other unknown actions remain unlabeled. This scenario is particularly relevant in domains like neuroscience, where well-defined behaviors (e.g., walking, eating) coexist with subtle or infrequent actions that are often overlooked, as well as in applications where datasets are inherently partially annotated due to ambiguous or missing labels. To address this problem, we propose a two-step approach that leverages the known annotations to guide both the temporal and semantic granularity of unknown action segments. First, we introduce the Granularity-Guided Segmentation Module (GGSM), which identifies temporal intervals for both known and unknown actions by mimicking the granularity of annotated actions. Second, we propose the Unknown Action Segment Assignment (UASA), which identifies semantically meaningful classes within the unknown actions, based on learned embedding similarities. We systematically explore the proposed setting of Action Discovery on three challenging datasets - Breakfast, 50Salads, and Desktop Assembly - demonstrating that our method considerably improves upon existing baselines.

[135] Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis

Kunyu Feng,Yue Ma,Xinhua Zhang,Boshi Liu,Yikuang Yuluo,Yinhan Zhang,Runtao Liu,Hongyu Liu,Zhiyuan Qin,Shanhui Mo,Qifeng Chen,Zeyu Wang

Main category: cs.CV

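TL;DR: Follow-Your-Instruction is an MLLM-driven framework that automatically synthesizes high-quality 2D, 3D, and 4D data and effectively improves the performance of generative models.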

Details Motivation: As demand for AI-generated content (AIGC) grows, so does the need for high-quality, diverse, and scalable data, but traditional data collection is costly and time-consuming. Method: An MLLM-driven framework with four modules (MLLM-Collector, MLLM-Generator, MLLM-Optimizer, and MLLM-Planner) automatically synthesizes high-quality 2D, 3D, and 4D data. Result: Comprehensive experiments on 2D, 3D, and 4D generative tasks show that the synthetic data significantly improves the performance of existing baseline models. Conclusion: Follow-Your-Instruction shows potential as a scalable and effective data engine for generative intelligence, significantly improving the performance of existing baseline models. Abstract: With the growing demands of AI-generated content (AIGC), the need for high-quality, diverse, and scalable data has become increasingly crucial. However, collecting large-scale real-world data remains costly and time-consuming, hindering the development of downstream applications. While some works attempt to collect task-specific data via a rendering process, most approaches still rely on manual scene construction, limiting their scalability and accuracy. To address these challenges, we propose Follow-Your-Instruction, a Multimodal Large Language Model (MLLM)-driven framework for automatically synthesizing high-quality 2D, 3D, and 4D data. Our Follow-Your-Instruction first collects assets and their associated descriptions through multimodal inputs using the MLLM-Collector. Then it constructs 3D layouts, and leverages Vision-Language Models (VLMs) for semantic refinement through multi-view scenes with the MLLM-Generator and MLLM-Optimizer, respectively. Finally, it uses MLLM-Planner to generate temporally coherent future frames. We evaluate the quality of the generated data through comprehensive experiments on the 2D, 3D, and 4D generative tasks. The results show that our synthetic data significantly boosts the performance of existing baseline models, demonstrating Follow-Your-Instruction's potential as a scalable and effective data engine for generative intelligence.

[136] DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition

Haijing Liu,Tao Pu,Hefeng Wu,Keze Wang,Liang Lin

Main category: cs.CV

TL;DR: The paper proposes DART, a framework for Open-Vocabulary Multi-Label Recognition that improves both intra-class localization and inter-class reasoning by integrating adaptive refinement and LLM-derived relational knowledge transfer.

Details Motivation: Existing Vision-Language Pre-training (VLP) models struggle with fine-grained localization under weak supervision and fail to utilize structured relational knowledge, limiting their performance on Open-Vocabulary Multi-Label Recognition (OV-MLR), especially for unseen classes. Method: DART enhances a frozen Vision-Language Pre-training (VLP) backbone with two adaptive modules: an Adaptive Refinement Module (ARM) with a Weakly Supervised Patch Selecting (WPS) loss for intra-class refinement, and an Adaptive Transfer Module (ATM) leveraging a Class Relationship Graph (CRG) from a Large Language Model (LLM) for inter-class reasoning. Result: Extensive experiments show that DART achieves new state-of-the-art performance on challenging OV-MLR benchmarks, demonstrating its effectiveness in both intra-class localization and inter-class reasoning through adaptive refinement and transfer mechanisms. Conclusion: DART is the first framework to explicitly integrate external LLM-derived relational knowledge for adaptive inter-class transfer while performing adaptive intra-class refinement under weak supervision, achieving state-of-the-art performance on OV-MLR tasks. Abstract: Open-Vocabulary Multi-Label Recognition (OV-MLR) aims to identify multiple seen and unseen object categories within an image, requiring both precise intra-class localization to pinpoint objects and effective inter-class reasoning to model complex category dependencies. While Vision-Language Pre-training (VLP) models offer a strong open-vocabulary foundation, they often struggle with fine-grained localization under weak supervision and typically fail to explicitly leverage structured relational knowledge beyond basic semantics, limiting performance especially for unseen classes. To overcome these limitations, we propose the Dual Adaptive Refinement Transfer (DART) framework. DART enhances a frozen VLP backbone via two synergistic adaptive modules. 
For intra-class refinement, an Adaptive Refinement Module (ARM) refines patch features adaptively, coupled with a novel Weakly Supervised Patch Selecting (WPS) loss that enables discriminative localization using only image-level labels. Concurrently, for inter-class transfer, an Adaptive Transfer Module (ATM) leverages a Class Relationship Graph (CRG), constructed using structured knowledge mined from a Large Language Model (LLM), and employs graph attention network to adaptively transfer relational information between class representations. DART is the first framework, to our knowledge, to explicitly integrate external LLM-derived relational knowledge for adaptive inter-class transfer while simultaneously performing adaptive intra-class refinement under weak supervision for OV-MLR. Extensive experiments on challenging benchmarks demonstrate that our DART achieves new state-of-the-art performance, validating its effectiveness.

[137] WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction

Shaobin Zhuang,Yiwei Guo,Canmiao Fu,Zhipeng Huang,Zeyue Tian,Ying Zhang,Chen Li,Yali Wang

Main category: cs.CV

TL;DR: The paper introduces WeTok, a visual tokenizer that improves compression ratio and reconstruction fidelity compared to previous tokenizers, by implementing Group-wise lookup-free Quantization (GQ) and Generative Decoding (GD).

Details Motivation: The existing tokenizers often face unsatisfactory trade-off between compression ratios and reconstruction fidelity. Method: WeTok tokenizer uses Group-wise lookup-free Quantization (GQ) and Generative Decoding (GD). Result: On the ImageNet 50k validation set, WeTok achieves a record-low zero-shot rFID of 0.12. Conclusion: WeTok tokenizer outperforms previous tokenizers in terms of compression ratio and reconstruction fidelity. Abstract: Visual tokenizer is a critical component for vision generation. However, the existing tokenizers often face unsatisfactory trade-off between compression ratios and reconstruction fidelity. To fill this gap, we introduce a powerful and concise WeTok tokenizer, which surpasses the previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups, and perform lookup-free quantization for each group. As a result, GQ can efficiently overcome memory and computation limitations of prior tokenizers, while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoding (GD). Different from prior tokenizers, we introduce a generative decoder with a prior of extra noise variable. In this case, GD can probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratios. Extensive experiments on mainstream benchmarks show superior performance of our WeTok. On the ImageNet 50k validation set, WeTok achieves a record-low zero-shot rFID (WeTok: 0.12 vs. FLUX-VAE: 0.18 vs. SD-VAE 3.5: 0.19). Furthermore, our highest compression model achieves a zero-shot rFID of 3.49 with a compression ratio of 768, outperforming Cosmos (384) 4.57 which has only 50% compression rate of ours. Code and models are available: https://github.com/zhuangshaobin/WeTok.
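The grouping idea behind GQ can be illustrated with a small sketch. It assumes the common lookup-free quantization recipe, in which each latent channel is binarized by its sign and the bit pattern itself serves as the token index, so no codebook table is stored. Whether WeTok uses exactly this binarization is not stated in the abstract, and the function name and shapes are hypothetical.

```python
import numpy as np

def group_lfq(latent, num_groups):
    """Group-wise lookup-free quantization of one latent vector.

    latent: 1-D array of real-valued channels.
    Returns (codes, indices): codes holds the quantized values in {-1, +1},
    indices holds one integer token per group.
    """
    channels = latent.shape[0]
    if channels % num_groups != 0:
        raise ValueError("channels must be divisible by num_groups")
    groups = latent.reshape(num_groups, channels // num_groups)
    bits = (groups > 0).astype(np.int64)       # 1 where positive, else 0
    codes = 2.0 * bits - 1.0                   # map bits to {-1, +1}
    powers = 2 ** np.arange(groups.shape[1])   # binary weight of each position
    indices = bits @ powers                    # bit pattern read as an integer
    return codes.reshape(channels), indices
```

With 8 channels split into 2 groups, each group indexes an implicit codebook of 2^4 = 16 entries. The two indices together still distinguish 2^8 = 256 patterns, but no table of that size is ever materialized, which is the memory advantage the abstract attributes to grouping.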

[138] LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model

Tao Sun,Oliver Liu,JinJin Li,Lan Ma

Main category: cs.CV

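TL;DR: This paper presents LLaVA-RE, a binary image-text relevancy evaluation framework based on a multimodal large language model, and builds a new relevancy dataset.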

Details Motivation: Binary image-text relevancy evaluation is a key problem for measuring the response quality of multimodal generative AI, but it is challenging because text formats are diverse and the definition of relevancy varies. Method: Built on the LLaVA architecture, the framework adopts detailed task instructions and multimodal in-context samples; a new binary relevancy dataset is also proposed. Result: Experimental results validate the effectiveness of the LLaVA-RE framework. Conclusion: LLaVA-RE is an effective binary image-text relevancy evaluation framework that uses a multimodal large language model to handle complex text formats and incorporate task information. Abstract: Multimodal generative AI usually involves generating image or text responses given inputs in another modality. The evaluation of image-text relevancy is essential for measuring response quality or ranking candidate responses. In particular, binary relevancy evaluation, i.e., "Relevant" vs. "Not Relevant", is a fundamental problem. However, this is a challenging task considering that texts have diverse formats and the definition of relevancy varies in different scenarios. We find that Multimodal Large Language Models (MLLMs) are an ideal choice to build such evaluators, as they can flexibly handle complex text formats and take in additional task information. In this paper, we present LLaVA-RE, a first attempt for binary image-text relevancy evaluation with MLLM. It follows the LLaVA architecture and adopts detailed task instructions and multimodal in-context samples. In addition, we propose a novel binary relevancy data set that covers various tasks. Experimental results validate the effectiveness of our framework.

[139] Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity

Yuhan Zhang,Long Zhuo,Ziyang Chu,Tong Wu,Zhibing Li,Liang Pan,Dahua Lin,Ziwei Liu

Main category: cs.CV

TL;DR: Hi3DEval is a new hierarchical framework for evaluating 3D generative content that combines object- and part-level assessments, includes material realism analysis, and uses a hybrid 3D representation scoring system. It aligns better with human preference and offers a scalable alternative to manual evaluation.

Details Motivation: Existing 3D content evaluation methods are limited to object-level and image-based metrics, failing to capture spatial coherence, material authenticity, and fine details. There is a need for a more comprehensive and automated evaluation framework that aligns with human perception. Method: Hi3DEval uses a hierarchical framework combining object-level and part-level evaluations. It uses video-based representations for object and material evaluation and pre-trained 3D features for part-level perception. A 3D-aware automated scoring system is developed using a hybrid approach, supported by the Hi3DBench dataset and a multi-agent annotation pipeline. Result: Hi3DEval outperforms existing image-based metrics in capturing 3D characteristics and aligns better with human judgment. The method demonstrates effectiveness across multiple dimensions and provides scalable, automated evaluation for 3D generative models. Conclusion: The proposed Hi3DEval framework provides a scalable and comprehensive solution for evaluating 3D generative content, with strong alignment to human preferences and support for large-scale assessment. Abstract: Despite rapid advances in 3D content generation, quality assessment for the generated 3D assets remains challenging. Existing methods mainly rely on image-based metrics and operate solely at the object level, limiting their ability to capture spatial coherence, material authenticity, and high-fidelity local details. 1) To address these challenges, we introduce Hi3DEval, a hierarchical evaluation framework tailored for 3D generative content. It combines both object-level and part-level evaluation, enabling holistic assessments across multiple dimensions as well as fine-grained quality analysis. Additionally, we extend texture evaluation beyond aesthetic appearance by explicitly assessing material realism, focusing on attributes such as albedo, saturation, and metallicness. 
2) To support this framework, we construct Hi3DBench, a large-scale dataset comprising diverse 3D assets and high-quality annotations, accompanied by a reliable multi-agent annotation pipeline. We further propose a 3D-aware automated scoring system based on hybrid 3D representations. Specifically, we leverage video-based representations for object-level and material-subject evaluations to enhance modeling of spatio-temporal consistency and employ pretrained 3D features for part-level perception. Extensive experiments demonstrate that our approach outperforms existing image-based metrics in modeling 3D characteristics and achieves superior alignment with human preference, providing a scalable alternative to manual evaluations. The project page is available at https://zyh482.github.io/Hi3DEval/.

[140] MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes

Henghui Ding,Kaining Ying,Chang Liu,Shuting He,Xudong Jiang,Yu-Gang Jiang,Philip H. S. Torr,Song Bai

Main category: cs.CV

TL;DR: MOSEv2 is a more complex and more challenging video object segmentation dataset that exposes the limitations of current methods in real-world scenes.

Details Motivation: Existing datasets such as DAVIS and YouTube-VOS mainly contain salient, isolated objects, which limits generalization to real-world scenes. Method: A new dataset with 5,024 videos and over 701,976 high-quality masks was built, and existing methods were benchmarked on it. Result: 20 representative VOS methods show significant performance drops on MOSEv2; for example, SAM2 drops from 76.4% on MOSEv1 to 50.9%. Conclusion: MOSEv2 is a more challenging video object segmentation dataset intended to advance VOS methods in complex real-world scenes. Abstract: Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F) on existing benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To advance VOS toward more realistic environments, coMplex video Object SEgmentation (MOSEv1) was introduced to facilitate VOS research in complex scenes. Building on the strengths and limitations of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and over 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces significantly greater scene complexity, including more frequent object disappearance and reappearance, severe occlusions and crowding, smaller objects, as well as a range of new challenges such as adverse weather (e.g., rain, snow, fog), low-light scenes (e.g., nighttime, underwater), multi-shot sequences, camouflaged objects, non-physical targets (e.g., shadows, reflections), scenarios requiring external knowledge, etc. We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and find similar declines, demonstrating that MOSEv2 presents challenges across tasks. 
These results highlight that despite high accuracy on existing datasets, current VOS methods still struggle under real-world complexities. MOSEv2 is publicly available at https://MOSE.video.
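The J&F score quoted in the abstract is conventionally the average of region similarity J (the intersection-over-union of predicted and ground-truth masks) and boundary accuracy F. As a rough illustration of the J half only, here is a minimal sketch in plain Python. The function names, the nested-list mask representation, and the handling of empty masks are assumptions made for this example; the abstract does not describe the exact evaluation protocol used for MOSEv2, and F is omitted.

```python
from typing import Sequence

# Hypothetical representation: a 2D binary mask as nested sequences, 1 = object pixel.
Mask = Sequence[Sequence[int]]


def region_similarity(pred: Mask, gt: Mask) -> float:
    """Jaccard index (intersection over union) of two same-sized binary masks."""
    if len(pred) != len(gt) or any(len(p) != len(g) for p, g in zip(pred, gt)):
        raise ValueError("masks must have the same shape")
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_region_similarity(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """Average the per-frame J scores over a sequence of frames."""
    if not preds or len(preds) != len(gts):
        raise ValueError("need the same non-zero number of predicted and ground-truth frames")
    scores = [region_similarity(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
print(region_similarity(pred, gt))  # 2 shared pixels / 4 pixels in the union -> 0.5
```

On the numbers reported above, the SAM2 result corresponds to an absolute drop of 76.4 - 50.9 = 25.5 percentage points between MOSEv1 and MOSEv2.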

[141] GAP: Gaussianize Any Point Clouds with Text Guidance

Weiqi Zhang,Junsheng Zhou,Haotian Geng,Wenyuan Zhang,Yu-Shen Liu

Main category: cs.CV

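TL;DR: This paper proposes GAP, a method that uses text guidance to convert colorless 3D point clouds into high-fidelity 3D Gaussians, combining a diffusion model with a surface-anchoring mechanism to achieve efficient generation across scenes of varying complexity.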

Details Motivation: 3D Gaussian Splatting excels at fast, high-quality rendering, but converting colorless 3D point clouds into Gaussians remains a challenge. Method: Designs a multi-view optimization framework based on a depth-aware image diffusion model, combined with a surface-anchoring mechanism and a diffusion-based inpainting strategy, to generate Gaussians from colorless point clouds. Result: GAP performs well across scenes of varying complexity, from synthetic point clouds to real-world scans, and generates high-fidelity 3D Gaussians. Conclusion: GAP effectively converts colorless 3D point clouds into high-fidelity 3D Gaussians, balancing geometric accuracy and appearance consistency through its multi-view optimization framework and surface-anchoring mechanism. Abstract: 3D Gaussian Splatting (3DGS) has demonstrated its advantages in achieving fast and high-quality rendering. As point clouds serve as a widely-used and easily accessible form of 3D representation, bridging the gap between point clouds and Gaussians becomes increasingly important. Recent studies have explored how to convert the colored points into Gaussians, but directly generating Gaussians from colorless 3D point clouds remains an unsolved challenge. In this paper, we propose GAP, a novel approach that gaussianizes raw point clouds into high-fidelity 3D Gaussians with text guidance. Our key idea is to design a multi-view optimization framework that leverages a depth-aware image diffusion model to synthesize consistent appearances across different viewpoints. To ensure geometric accuracy, we introduce a surface-anchoring mechanism that effectively constrains Gaussians to lie on the surfaces of 3D shapes during optimization. Furthermore, GAP incorporates a diffuse-based inpainting strategy that specifically targets completing hard-to-observe regions. We evaluate GAP on the Point-to-Gaussian generation task across varying complexity levels, from synthetic point clouds to challenging real-world scans, and even large-scale scenes. Project Page: https://weiqi-zhang.github.io/GAP.

[142] FaceAnonyMixer: Cancelable Faces via Identity Consistent Latent Space Mixing

Mohammed Talha Alam,Fahad Shamshad,Fakhri Karray,Karthik Nandakumar

Main category: cs.CV

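TL;DR: This paper proposes FaceAnonyMixer, a cancelable face generation framework that irreversibly mixes the latent code of a real face image with a synthetic code derived from a revocable key, protecting privacy while preserving recognition utility.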

Details Motivation: Advances in face recognition have raised privacy concerns, calling for methods that protect identity while preserving recognition utility. Existing methods fail to meet the requirements of biometric template protection, including revocability, unlinkability, and irreversibility. Method: Leverages the latent space of a pre-trained generative model to irreversibly mix the latent code of a real face image with a synthetic code derived from a revocable key, then refines the result with a multi-objective loss to satisfy cancelable biometric requirements. Result: FaceAnonyMixer generates high-quality cancelable face images and achieves superior recognition accuracy without modifying existing recognition systems, with a gain of over 11% on a commercial API compared to recent cancelable biometric methods. Conclusion: FaceAnonyMixer is a cancelable face generation framework that protects privacy while preserving recognition utility. Abstract: Advancements in face recognition (FR) technologies have amplified privacy concerns, necessitating methods that protect identity while maintaining recognition utility. Existing face anonymization methods typically focus on obscuring identity but fail to meet the requirements of biometric template protection, including revocability, unlinkability, and irreversibility. We propose FaceAnonyMixer, a cancelable face generation framework that leverages the latent space of a pre-trained generative model to synthesize privacy-preserving face images. The core idea of FaceAnonyMixer is to irreversibly mix the latent code of a real face image with a synthetic code derived from a revocable key. The mixed latent code is further refined through a carefully designed multi-objective loss to satisfy all cancelable biometric requirements. FaceAnonyMixer is capable of generating high-quality cancelable faces that can be directly matched using existing FR systems without requiring any modifications. Extensive experiments on benchmark datasets demonstrate that FaceAnonyMixer delivers superior recognition accuracy while providing significantly stronger privacy protection, achieving over an 11% gain on a commercial API compared to recent cancelable biometric methods. Code is available at: https://github.com/talha-alam/faceanonymixer.