Table of Contents
cs.CL
[1] Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM
Thomas Thebaud,Yen-Ju Lu,Matthew Wiesner,Peter Viechnicki,Najim Dehak
Main category: cs.CL
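TL;DR: This paper proposes a method for enriching dialogue transcripts with speaker characteristics by combining a frozen audio model with a frozen language model; without any fine-tuning, it achieves competitive performance while remaining efficient and modular.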
Details
Motivation: The paper explores a complementary step in dialogue-transcription post-processing: enriching transcripts with metadata tags for speaker characteristics such as age, gender, and emotion. Method: Frozen audio foundation models (such as Whisper or WavLM) are coupled with a frozen LLAMA language model, using lightweight, efficient connectors to bridge the audio and language representations. Result: The approach achieves competitive performance on speaker profiling tasks while preserving modularity and speed; a frozen LLAMA model can also compare x-vectors directly, reaching an Equal Error Rate of 8.8% in some scenarios. Conclusion: Combining frozen audio foundation models with a frozen LLAMA language model through lightweight connectors makes it possible to infer speaker attributes effectively without task-specific fine-tuning of either model. Abstract: In dialogue transcription pipelines, Large Language Models (LLMs) are frequently employed in post-processing to improve grammar, punctuation, and readability. We explore a complementary post-processing step: enriching transcribed dialogues by adding metadata tags for speaker characteristics such as age, gender, and emotion. Some of the tags are global to the entire dialogue, while some are time-variant. Our approach couples frozen audio foundation models, such as Whisper or WavLM, with a frozen LLAMA language model to infer these speaker attributes, without requiring task-specific fine-tuning of either model. Using lightweight, efficient connectors to bridge audio and language representations, we achieve competitive performance on speaker profiling tasks while preserving modularity and speed. Additionally, we demonstrate that a frozen LLAMA model can compare x-vectors directly, achieving an Equal Error Rate of 8.8% in some scenarios.
[2] Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization
Negar Foroutan,Clara Meister,Debjit Paul,Joel Niklaus,Sina Ahmadi,Antoine Bosselut,Rico Sennrich
Main category: cs.CL
TL;DR: This paper introduces Parity-aware BPE, a solution to the issue of inequality in tokenization across languages. It provides equitable token counts with minimal impact on compression and no significant effect on model performance.
Details
Motivation: The standard algorithms for learning tokenizers rely on frequency-based objectives, which favor dominant languages in the training data, leaving lower-resource languages with suboptimal tokenizations. This results in increased computational and financial inequalities between users from different language backgrounds. Method: Parity-aware BPE, a variant of the widely-used BPE algorithm, was introduced and evaluated empirically. At every merge step, it maximizes the compression gain of the currently worst-compressed language. Result: Parity-aware BPE leads to more equitable token counts across languages, while having negligible impact on global compression rate and no substantial effect on language-model performance in downstream tasks. Conclusion: Parity-aware BPE is an effective solution to the problem of tokenization inequality across languages, as it leads to more equitable token counts with negligible impact on global compression rate and no substantial effect on language-model performance in downstream tasks. Abstract: Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently leave lower-resource languages with tokenizations that are disproportionately longer, morphologically implausible, or even riddled with
[3] Pitch Accent Detection improves Pretrained Automatic Speech Recognition
David Sasu,Natalie Schluter
Main category: cs.CL
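TL;DR: This study proposes a joint automatic speech recognition and pitch accent detection model; pairing semi-supervised speech representations with a complementary pitch accent detection module significantly improves recognition performance.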
Details
Motivation: The importance of prosodic cues such as pitch accent for speech recognition, and the shortcomings of current models in capturing them, motivate this work. Method: A joint ASR and pitch accent detection model is introduced, combining semi-supervised speech representations with a complementary pitch accent detection module to improve recognition performance. Result: The pitch accent detection component significantly improves on the state of the art for the task, closing the gap in F1-score by 41%; under limited-resource fine-tuning, joint training reduces ASR word error rate on LibriSpeech by 28.3%. Conclusion: Extending pretrained speech models to retain or re-learn prosodic cues such as pitch accent is an important way to improve ASR performance. Abstract: We show that the performance of Automatic Speech Recognition (ASR) systems that use semi-supervised speech representations can be boosted by a complementary pitch accent detection module, by introducing a joint ASR and pitch accent detection model. The pitch accent detection component of our model achieves a significant improvement on the state-of-the-art for the task, closing the gap in F1-score by 41%. Additionally, the ASR performance in joint training decreases WER by 28.3% on LibriSpeech, under limited resource fine-tuning. With these results, we show the importance of extending pretrained speech models to retain or re-learn important prosodic cues such as pitch accent.
[4] Persistent Instability in LLM's Personality Measurements: Effects of Scale, Reasoning, and Conversation History
Tommaso Tosato,Saskia Helbling,Yorguin-Jose Mantilla-Ramos,Mahmood Hegazy,Alberto Tosato,David John Lemay,Irina Rish,Guillaume Dumas
Main category: cs.CL
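TL;DR: The study finds that large language models exhibit substantial instability in personality and behavior across conditions, challenging the assumption of model consistency and raising a caution for safety-critical applications.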
Details
Motivation: Large language models need consistent behavioral patterns for deployment, yet their personality-like traits remain poorly understood, which affects reliability and safety. Method: The PERSIST framework systematically tests response variability across 25+ open-source models, using both traditional and newly developed LLM-adapted personality instruments. Result: Even very large models (400B+ parameters) show substantial response variability, and simple prompt reordering alone shifts personality measurements by up to 20%. Conclusion: Current LLMs lack the foundations for genuine behavioral consistency, and personality-based alignment strategies may be inadequate for achieving predictable behavior in safety-critical applications. Abstract: Large language models require consistent behavioral patterns for safe deployment, yet their personality-like traits remain poorly understood. We present PERSIST (PERsonality Stability in Synthetic Text), a comprehensive evaluation framework testing 25+ open-source models (1B-671B parameters) across 500,000+ responses. Using traditional (BFI-44, SD3) and novel LLM-adapted personality instruments, we systematically vary question order, paraphrasing, personas, and reasoning modes. Our findings challenge fundamental deployment assumptions: (1) Even 400B+ models exhibit substantial response variability (SD > 0.4); (2) Minor prompt reordering alone shifts personality measurements by up to 20%; (3) Interventions expected to stabilize behavior, such as chain-of-thought reasoning, detailed persona instructions, and inclusion of conversation history, can paradoxically increase variability; (4) LLM-adapted instruments show equal instability to human-centric versions, confirming architectural rather than translational limitations. This persistent instability across scales and mitigation strategies suggests current LLMs lack the foundations for genuine behavioral consistency. For safety-critical applications requiring predictable behavior, these findings indicate that personality-based alignment strategies may be fundamentally inadequate.
[5] RCR-Router: Efficient Role-Aware Context Routing for Multi-Agent LLM Systems with Structured Memory
Jun Liu,Zhenglun Kong,Changdi Yang,Fan Yang,Tianqi Li,Peiyan Dong,Joannah Nanjekye,Hao Tang,Geng Yuan,Wei Niu,Wenbin Zhang,Pu Zhao,Xue Lin,Dong Huang,Yanzhi Wang
Main category: cs.CL
TL;DR: This paper proposes RCR-Router, a dynamic and efficient context routing framework for multi-agent LLM systems that reduces token consumption while maintaining or improving answer quality, along with a new evaluation metric for assessing LLM-generated explanations.
Details
Motivation: Existing coordination schemes in multi-agent LLM systems use static or full-context routing strategies, which lead to excessive token consumption, redundant memory exposure, and limited adaptability across interaction rounds. Method: The paper introduces RCR-Router, a modular and role-aware context routing framework that dynamically selects semantically relevant memory subsets for each agent, using a lightweight scoring policy. Agent outputs are integrated into a shared memory store for progressive refinement. Result: Experiments on three multi-hop QA benchmarks show that RCR-Router reduces token usage by up to 30% while improving or maintaining answer quality. Additionally, the proposed Answer Quality Score metric captures LLM-generated explanations beyond standard QA accuracy. Conclusion: RCR-Router proves to be a more efficient and adaptive context routing framework for multi-agent LLMs, reducing token usage while maintaining or improving answer quality. Abstract: Multi-agent large language model (LLM) systems have shown strong potential in complex reasoning and collaborative decision-making tasks. However, most existing coordination schemes rely on static or full-context routing strategies, which lead to excessive token consumption, redundant memory exposure, and limited adaptability across interaction rounds. We introduce RCR-Router, a modular and role-aware context routing framework designed to enable efficient, adaptive collaboration in multi-agent LLMs. To our knowledge, this is the first routing approach that dynamically selects semantically relevant memory subsets for each agent based on its role and task stage, while adhering to a strict token budget. A lightweight scoring policy guides memory selection, and agent outputs are iteratively integrated into a shared memory store to facilitate progressive context refinement. 
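As a rough illustration of budgeted, role-aware memory selection, the sketch below ranks memory items with a simple score and greedily keeps those that fit a token budget. The `MemoryItem` fields, the `score` weights, and the whitespace token count are assumptions made for this example; they are not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    roles: frozenset  # roles the item is tagged as relevant to (hypothetical)
    stage: str        # task stage that produced the item (hypothetical)


def count_tokens(text: str) -> int:
    # Whitespace splitting stands in for a real tokenizer.
    return len(text.split())


def score(item: MemoryItem, role: str, stage: str, query_terms: set) -> float:
    # Hypothetical lightweight scoring policy:
    # role match + stage match + keyword overlap with the query.
    s = 0.0
    if role in item.roles:
        s += 2.0
    if item.stage == stage:
        s += 1.0
    s += len(query_terms & set(item.text.lower().split()))
    return s


def route_context(memory, role, stage, query, budget):
    """Return the highest-scoring items whose total token cost fits the budget."""
    terms = set(query.lower().split())
    ranked = sorted(memory, key=lambda m: score(m, role, stage, terms), reverse=True)
    selected, used = [], 0
    for item in ranked:
        cost = count_tokens(item.text)
        if used + cost <= budget:  # skip items that would exceed the budget
            selected.append(item)
            used += cost
    return selected
```

Greedy selection is only one possible policy; it is used here because it keeps the token budget a hard constraint, which is the property the abstract emphasizes.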
To better evaluate model behavior, we further propose an Answer Quality Score metric that captures LLM-generated explanations beyond standard QA accuracy. Experiments on three multi-hop QA benchmarks -- HotPotQA, MuSiQue, and 2WikiMultihop -- demonstrate that RCR-Router reduces token usage (up to 30%) while improving or maintaining answer quality. These results highlight the importance of structured memory routing and output-aware evaluation in advancing scalable multi-agent LLM systems.
[6] I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations
Julia Kharchenko,Tanya Roosta,Aman Chadha,Chirag Shah
Main category: cs.CL
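TL;DR: This paper develops a benchmark for evaluating how large language models handle demographic information implicit in language, and finds that models systematically penalize certain linguistic patterns, such as hedging language, despite equivalent content quality.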
Details
Motivation: The paper aims to evaluate how large language models (LLMs) respond to linguistic shibboleths, subtle markers that can inadvertently reveal demographic attributes such as gender, social class, or regional background, in order to detect linguistic discrimination by the models. Method: A comprehensive benchmark simulates interview settings using 100 validated question-response pairs and generates controlled linguistic variations that isolate specific phenomena while maintaining semantic equivalence. Result: Hedged responses receive 25.6% lower ratings on average, and the benchmark is effective at identifying model-specific biases. Conclusion: LLMs systematically penalize certain linguistic patterns, particularly hedging language, despite equivalent content quality. The work lays a foundation for detecting and measuring linguistic discrimination in AI systems, with broad applications to fairness in automated decision-making. Abstract: This paper introduces a comprehensive benchmark for evaluating how Large Language Models (LLMs) respond to linguistic shibboleths: subtle linguistic markers that can inadvertently reveal demographic attributes such as gender, social class, or regional background. Through carefully constructed interview simulations using 100 validated question-response pairs, we demonstrate how LLMs systematically penalize certain linguistic patterns, particularly hedging language, despite equivalent content quality. Our benchmark generates controlled linguistic variations that isolate specific phenomena while maintaining semantic equivalence, which enables the precise measurement of demographic bias in automated evaluation systems. We validate our approach along multiple linguistic dimensions, showing that hedged responses receive 25.6% lower ratings on average, and demonstrate the benchmark's effectiveness in identifying model-specific biases. This work establishes a foundational framework for detecting and measuring linguistic discrimination in AI systems, with broad applications to fairness in automated decision-making contexts.
[7] Towards Robust Evaluation of Visual Activity Recognition: Resolving Verb Ambiguity with Sense Clustering
Louie Hong Yao,Nicholas Jarvis,Tianyu Jiang
Main category: cs.CL
TL;DR: The paper proposes a vision-language clustering approach to better evaluate visual activity recognition systems by addressing ambiguities in verb semantics and multiple valid interpretations of images.
Details
Motivation: Evaluating visual activity recognition systems is challenging due to ambiguities in verb semantics and image interpretation. Standard exact-match evaluations are insufficient because they rely on a single gold answer, which fails to capture synonymous verbs or different perspectives describing the same event. Method: The authors propose a vision-language clustering framework that constructs verb sense clusters to evaluate activity recognition models. They analyze the imSitu dataset and conduct a comparison between their cluster-based evaluation and standard evaluation methods. Human alignment analysis is also performed. Result: The analysis of the imSitu dataset revealed that each image maps to an average of 2.8 sense clusters, each representing a distinct perspective. The cluster-based evaluation showed better alignment with human judgments compared to standard methods, providing a more robust and nuanced assessment of model performance. Conclusion: The proposed vision-language clustering framework offers a more nuanced and accurate evaluation of visual activity recognition systems by addressing inherent ambiguities in verb semantics and image interpretation. Abstract: Evaluating visual activity recognition systems is challenging due to inherent ambiguities in verb semantics and image interpretation. When describing actions in images, synonymous verbs can refer to the same event (e.g., brushing vs. grooming), while different perspectives can lead to equally valid but distinct verb choices (e.g., piloting vs. operating). Standard exact-match evaluation, which relies on a single gold answer, fails to capture these ambiguities, resulting in an incomplete assessment of model performance. To address this, we propose a vision-language clustering framework that constructs verb sense clusters, providing a more robust evaluation. 
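The difference between exact-match and cluster-based scoring can be sketched in a few lines. The function names, the verb-to-cluster mapping, and the toy data are hypothetical and chosen only to mirror the brushing/grooming and piloting/operating examples above; the paper's actual clustering procedure is not reproduced here.

```python
def exact_match_accuracy(predictions, gold):
    """Fraction of images whose predicted verb equals the single gold verb."""
    hits = sum(1 for img, verb in predictions.items() if verb == gold[img])
    return hits / len(predictions)


def cluster_accuracy(predictions, gold_clusters, verb_to_cluster):
    """Fraction of images whose predicted verb falls in any accepted sense cluster."""
    hits = 0
    for img, verb in predictions.items():
        cluster = verb_to_cluster.get(verb)  # None for verbs outside every cluster
        if cluster is not None and cluster in gold_clusters[img]:
            hits += 1
    return hits / len(predictions)


# Toy data (assumed): synonyms share a cluster, perspectives get separate clusters.
verb_to_cluster = {
    "brushing": "groom", "grooming": "groom",
    "piloting": "fly", "operating": "operate",
    "eating": "eat",
}
gold = {"img1": "brushing", "img2": "piloting", "img3": "eating"}
gold_clusters = {"img1": {"groom"}, "img2": {"fly", "operate"}, "img3": {"eat"}}
predictions = {"img1": "grooming", "img2": "operating", "img3": "eating"}

exact = exact_match_accuracy(predictions, gold)
clustered = cluster_accuracy(predictions, gold_clusters, verb_to_cluster)
```

On this toy data exact match credits only one of three predictions, while the cluster-based score credits all three, since each prediction is a synonym or an alternative valid perspective.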
Our analysis of the imSitu dataset shows that each image maps to an average of 2.8 sense clusters, with each cluster representing a distinct perspective of the image. We evaluate multiple activity recognition models and compare our cluster-based evaluation with standard evaluation methods. Additionally, our human alignment analysis suggests that the cluster-based evaluation better aligns with human judgements, offering a more nuanced assessment of model performance.
[8] A Multi-Stage Large Language Model Framework for Extracting Suicide-Related Social Determinants of Health
Song Wang,Yishu Wei,Haotian Ma,Max Lovitt,Kelly Deng,Yuan Meng,Zihan Xu,Jingze Zhang,Yunyu Xiao,Ying Ding,Xuhai Xu,Joydeep Ghosh,Yifan Peng
Main category: cs.CL
TL;DR: This paper presents a multi-stage large language model framework to enhance SDoH factor extraction from unstructured text for suicide prevention, showing improved accuracy, transparency, and efficiency compared to other models.
Details
Motivation: Understanding social determinants of health (SDoH) factors contributing to suicide incidents is crucial for early intervention and prevention. However, data-driven approaches to this goal face challenges such as long-tailed factor distributions, analyzing pivotal stressors preceding suicide incidents, and limited model explainability. Method: We present a multi-stage large language model framework to enhance SDoH factor extraction from unstructured text. Our approach was compared to other state-of-the-art language models (i.e., pre-trained BioBERT and GPT-3.5-turbo) and reasoning models (i.e., DeepSeek-R1). We also evaluated how the model's explanations help people annotate SDoH factors more quickly and accurately. The analysis included both automated comparisons and a pilot user study. Result: We show that our proposed framework demonstrated performance boosts in the overarching task of extracting SDoH factors and in the finer-grained tasks of retrieving relevant context. Additionally, we show that fine-tuning a smaller, task-specific model achieves comparable or better performance with reduced inference costs. The multi-stage design not only enhances extraction but also provides intermediate explanations, improving model explainability. Conclusion: Our approach improves both the accuracy and transparency of extracting suicide-related SDoH from unstructured texts. These advancements have the potential to support early identification of individuals at risk and inform more effective prevention strategies. Abstract: Background: Understanding social determinants of health (SDoH) factors contributing to suicide incidents is crucial for early intervention and prevention. However, data-driven approaches to this goal face challenges such as long-tailed factor distributions, analyzing pivotal stressors preceding suicide incidents, and limited model explainability. 
Methods: We present a multi-stage large language model framework to enhance SDoH factor extraction from unstructured text. Our approach was compared to other state-of-the-art language models (i.e., pre-trained BioBERT and GPT-3.5-turbo) and reasoning models (i.e., DeepSeek-R1). We also evaluated how the model's explanations help people annotate SDoH factors more quickly and accurately. The analysis included both automated comparisons and a pilot user study. Results: We show that our proposed framework demonstrated performance boosts in the overarching task of extracting SDoH factors and in the finer-grained tasks of retrieving relevant context. Additionally, we show that fine-tuning a smaller, task-specific model achieves comparable or better performance with reduced inference costs. The multi-stage design not only enhances extraction but also provides intermediate explanations, improving model explainability. Conclusions: Our approach improves both the accuracy and transparency of extracting suicide-related SDoH from unstructured texts. These advancements have the potential to support early identification of individuals at risk and inform more effective prevention strategies.
[9] Dialogues Aspect-based Sentiment Quadruple Extraction via Structural Entropy Minimization Partitioning
Kun Peng,Cong Cao,Hao Peng,Zhifeng Hao,Lei Jiang,Kongjing Gu,Yanbing Liu,Philip S. Yu
Main category: cs.CL
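TL;DR: This study proposes a new method that partitions dialogues and then extracts quadruples, improving sentiment analysis performance while lowering computational cost.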
Details
Motivation: Existing methods learn word relations across the entire dialogue and assume a uniform distribution of sentiment elements, but real dialogues contain multiple semantically independent sub-dialogues, so this approach tends to introduce noise. Method: A structural entropy minimization algorithm partitions each dialogue into semantically independent sub-dialogues, and a two-step framework first extracts sentiment elements at the utterance level and then matches quadruples at the sub-dialogue level. Result: Experiments show that the method achieves state-of-the-art performance on dialogue aspect-based sentiment quadruple extraction with much lower computational cost. Conclusion: Partitioning dialogues into semantically independent sub-dialogues and matching quadruples with a two-step framework yields state-of-the-art DiaASQ performance at lower computational cost. Abstract: Dialogues Aspect-based Sentiment Quadruple Extraction (DiaASQ) aims to extract all target-aspect-opinion-sentiment quadruples from a given multi-round, multi-participant dialogue. Existing methods typically learn word relations across entire dialogues, assuming a uniform distribution of sentiment elements. However, we find that dialogues often contain multiple semantically independent sub-dialogues without clear dependencies between them. Therefore, learning word relationships across the entire dialogue inevitably introduces additional noise into the extraction process. To address this, our method focuses on partitioning dialogues into semantically independent sub-dialogues. Achieving completeness while minimizing these sub-dialogues presents a significant challenge. Simply partitioning based on reply relationships is ineffective. Instead, we propose utilizing a structural entropy minimization algorithm to partition the dialogues. This approach aims to preserve relevant utterances while distinguishing irrelevant ones as much as possible. Furthermore, we introduce a two-step framework for quadruple extraction: first extracting individual sentiment elements at the utterance level, then matching quadruples at the sub-dialogue level. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in DiaASQ with much lower computational costs.
[10] Evaluation of LLMs in AMR Parsing
Shu Han Ho
Main category: cs.CL
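TL;DR: This paper studies AMR parsing by straightforward fine-tuning of decoder-only large language models (LLMs) and finds that performance is comparable to that of complex parsers.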
Details
Motivation: To find a simpler yet effective approach to AMR parsing as an alternative to complex state-of-the-art (SOTA) AMR parsers. Method: Four LLM architectures (Phi 3.5, Gemma 2, LLaMA 3.2, and DeepSeek R1 LLaMA Distilled) are fine-tuned and evaluated on the LDC2020T02 Gold AMR3.0 test set. Result: Straightforward fine-tuning of decoder-only LLMs achieves performance comparable to complex SOTA AMR parsers; LLaMA 3.2 reaches a SMATCH F1 of 0.804, approaching Graphene Smatch (MBSE) at 0.854. Conclusion: Fine-tuned decoder-only LLMs perform well on AMR parsing, with LLaMA 3.2 leading in semantic performance and Phi 3.5 excelling in structural validity. Abstract: Abstract Meaning Representation (AMR) is a semantic formalism that encodes sentence meaning as rooted, directed, acyclic graphs, where nodes represent concepts and edges denote semantic relations. Finetuning decoder-only Large Language Models (LLMs) represents a promising, novel, and straightforward direction for AMR parsing. This paper presents a comprehensive evaluation of finetuning four distinct LLM architectures, Phi 3.5, Gemma 2, LLaMA 3.2, and DeepSeek R1 LLaMA Distilled, using the LDC2020T02 Gold AMR3.0 test set. Our results show that straightforward finetuning of decoder-only LLMs can achieve comparable performance to complex State of the Art (SOTA) AMR parsers. Notably, LLaMA 3.2 demonstrates competitive performance against SOTA AMR parsers given a straightforward finetuning approach. We achieved SMATCH F1: 0.804 on the full LDC2020T02 test split, on par with APT + Silver (IBM) at 0.804 and approaching Graphene Smatch (MBSE) at 0.854. Across our analysis, we also observed a consistent pattern where LLaMA 3.2 leads in semantic performance while Phi 3.5 excels in structural validity.
[11] Align, Don't Divide: Revisiting the LoRA Architecture in Multi-Task Learning
Jinda Liu,Bo Cheng,Yi Chang,Yuan Wu
Main category: cs.CL
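TL;DR: This paper challenges the prevailing paradigm of complex multi-adapter architectures in multi-task learning and proposes Align-LoRA, a simpler and more effective method; the code is open source.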
Details
Motivation: The study finds that current multi-adapter and multi-head architectures are not optimal: effective multi-task learning hinges on learning robust shared representations rather than isolating task-specific features. Method: Align-LoRA is proposed, which uses an explicit loss to align task representations within the shared adapter space. Result: A simplified multi-head architecture outperforms complex systems; a standard single-adapter LoRA with increased rank is also competitive; and Align-LoRA significantly surpasses all baselines in experiments. Conclusion: By aligning task representations within the shared adapter space, Align-LoRA significantly surpasses all baselines and offers a simpler yet more effective paradigm for multi-task learning. Abstract: Parameter-Efficient Fine-Tuning (PEFT) is essential for adapting Large Language Models (LLMs). In practice, LLMs are often required to handle a diverse set of tasks from multiple domains, a scenario naturally addressed by multi-task learning (MTL). Within this MTL context, a prevailing trend involves LoRA variants with multiple adapters or heads, which advocate for structural diversity to capture task-specific knowledge. Our findings present a direct challenge to this paradigm. We first show that a simplified multi-head architecture with high inter-head similarity substantially outperforms complex multi-adapter and multi-head systems. This leads us to question the multi-component paradigm itself, and we further demonstrate that a standard single-adapter LoRA, with a sufficiently increased rank, also achieves highly competitive performance. These results lead us to a new hypothesis: effective MTL generalization hinges on learning robust shared representations, not isolating task-specific features. To validate this, we propose Align-LoRA, which incorporates an explicit loss to align task representations within the shared adapter space. Experiments confirm that Align-LoRA significantly surpasses all baselines, establishing a simpler yet more effective paradigm for adapting LLMs to multiple tasks. The code is available at https://github.com/jinda-liu/Align-LoRA.
[12] Multimodal Fact Checking with Unified Visual, Textual, and Contextual Representations
Aditya Kishore,Gaurav Kumar,Jasabanta Patro
Main category: cs.CL
TL;DR: This paper introduces "MultiCheck", a multimodal fact-checking framework that combines text and image analysis to improve the detection of misinformation, achieving strong results on the Factify 2 dataset.
Details
Motivation: The increasing prevalence of multimodal misinformation, where claims are supported by both text and images, poses challenges to traditional fact-checking systems that primarily rely on textual evidence. Method: The authors introduced a unified framework named "MultiCheck" that utilizes dedicated encoders for text and images, along with a fusion module to capture cross-modal relationships. This is complemented by a classification head and a contrastive learning objective to align claim-evidence pairs in a shared latent space. Result: The approach was evaluated on the Factify 2 dataset, achieving a weighted F1 score of 0.84, which significantly outperforms the baseline. Conclusion: The proposed MultiCheck framework effectively addresses the challenges of multimodal misinformation by enabling scalable and interpretable fact-checking through explicit multimodal reasoning. Abstract: The growing rate of multimodal misinformation, where claims are supported by both text and images, poses significant challenges to fact-checking systems that rely primarily on textual evidence. In this work, we have proposed a unified framework for fine-grained multimodal fact verification called "MultiCheck", designed to reason over structured textual and visual signals. Our architecture combines dedicated encoders for text and images with a fusion module that captures cross-modal relationships using element-wise interactions. A classification head then predicts the veracity of a claim, supported by a contrastive learning objective that encourages semantic alignment between claim-evidence pairs in a shared latent space. We evaluate our approach on the Factify 2 dataset, achieving a weighted F1 score of 0.84, substantially outperforming the baseline. 
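For readers unfamiliar with the reported metric, the following is a minimal sketch of a support-weighted F1 score: per-class F1 values are averaged with weights proportional to how often each class appears in the gold labels. The label names and the `weighted_f1` helper are illustrative assumptions and do not come from the paper or the dataset.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its share of gold labels."""
    labels = set(y_true) | set(y_pred)
    support = Counter(y_true)
    total = len(y_true)
    result = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        result += support[label] / total * f1
    return result


# Hypothetical labels, used only to exercise the function.
y_true = ["support", "support", "refute", "refute", "refute", "insufficient"]
y_pred = ["support", "refute", "refute", "refute", "support", "insufficient"]
value = weighted_f1(y_true, y_pred)
```

Because the weights follow class frequency, frequent classes dominate the score; a macro-averaged F1 would instead weight every class equally.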
These results highlight the effectiveness of explicit multimodal reasoning and demonstrate the potential of our approach for scalable and interpretable fact-checking in complex, real-world scenarios.
[13] BEE-RAG: Balanced Entropy Engineering for Retrieval-Augmented Generation
Yuhao Wang,Ruiyang Ren,Yucheng Wang,Jing Liu,Wayne Xin Zhao,Hua Wu,Haifeng Wang
Main category: cs.CL
TL;DR: This paper introduces BEE-RAG, a framework that enhances RAG systems' adaptability to different context lengths using entropy engineering and attention reformulation.
Details
Motivation: The authors aim to address the issues of unconstrained entropy growth and attention dilution caused by long retrieval contexts in RAG systems. Method: The paper proposes the balanced entropy-engineered RAG (BEE-RAG) framework, incorporating entropy invariance, zero-shot inference strategy, and parameter-efficient adaptive fine-tuning mechanism. Result: Extensive experiments demonstrate the effectiveness of the proposed BEE-RAG framework in improving RAG performance across multiple tasks. Conclusion: The BEE-RAG framework effectively improves the adaptability of RAG systems to varying context lengths by leveraging balanced context entropy and reformulating attention dynamics. Abstract: With the rapid advancement of large language models (LLMs), retrieval-augmented generation (RAG) has emerged as a critical approach to supplement the inherent knowledge limitations of LLMs. However, due to the typically large volume of retrieved information, RAG tends to operate with long context lengths. From the perspective of entropy engineering, we identify unconstrained entropy growth and attention dilution due to long retrieval context as significant factors affecting RAG performance. In this paper, we propose the balanced entropy-engineered RAG (BEE-RAG) framework, which improves the adaptability of RAG systems to varying context lengths through the principle of entropy invariance. By leveraging balanced context entropy to reformulate attention dynamics, BEE-RAG separates attention sensitivity from context length, ensuring a stable entropy level. Building upon this, we introduce a zero-shot inference strategy for multi-importance estimation and a parameter-efficient adaptive fine-tuning mechanism to obtain the optimal balancing factor for different settings. Extensive experiments across multiple RAG tasks demonstrate the effectiveness of BEE-RAG.
[14] Attention Basin: Why Contextual Position Matters in Large Language Models
Zihao Yi,Delong Zeng,Zhenqing Ling,Haohao Luo,Zhe Xu,Wei Liu,Jian Luan,Wanxia Cao,Ying Shen
Main category: cs.CL
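TL;DR: This paper studies the sensitivity of large language models to the contextual position of input information and proposes AttnRank, an attention-based reranking framework, to improve model performance.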
Details
Motivation: LLM performance is highly sensitive to the contextual position of information in the input, so the mechanism behind this sensitivity needs to be investigated. Method: Extensive experiments reveal a phenomenon termed the attention basin, and the proposed AttnRank method works in two stages: it estimates the model's positional attention preferences and then reorders retrieved documents or few-shot examples. Result: Experiments show that AttnRank achieves substantial improvements across 10 large language models of varying architectures and scales. Conclusion: AttnRank is a model-agnostic, training-free, plug-and-play method with minimal computational overhead that effectively improves model performance. Abstract: The performance of Large Language Models (LLMs) is significantly sensitive to the contextual position of information in the input. To investigate the mechanism behind this positional bias, our extensive experiments reveal a consistent phenomenon we term the attention basin: when presented with a sequence of structured items (e.g., retrieved documents or few-shot examples), models systematically assign higher attention to the items at the beginning and end of the sequence, while neglecting those in the middle. Crucially, our analysis further reveals that allocating higher attention to critical information is key to enhancing model performance. Based on these insights, we introduce Attention-Driven Reranking (AttnRank), a two-stage framework that (i) estimates a model's intrinsic positional attention preferences using a small calibration set, and (ii) reorders retrieved documents or few-shot examples to align the most salient content with these high-attention positions. AttnRank is a model-agnostic, training-free, and plug-and-play method with minimal computational overhead. Experiments on multi-hop QA and few-shot in-context learning tasks demonstrate that AttnRank achieves substantial improvements across 10 large language models of varying architectures and scales, without modifying model parameters or training procedures.
[15] Towards Assessing Medical Ethics from Knowledge to Practice
Chang Hong,Minghao Wu,Qingying Xiao,Yuchi Wang,Xiang Wan,Guangjun Yu,Benyou Wang,Yan Hu
Main category: cs.CL
TL;DR: We introduce PrinciplismQA, a comprehensive benchmark for systematically assessing the alignment of large language models (LLMs) with core medical ethics.
Details
Motivation: Integrating large language models into healthcare requires rigorous evaluation of their ethical reasoning, an area that current benchmarks often overlook. Method: We introduce PrinciplismQA, a comprehensive benchmark with 3,648 questions designed to systematically assess LLMs' alignment with core medical ethics. Result: Experiments reveal a significant gap between models' ethical knowledge and their practical application, especially when ethical principles must be applied dynamically to real-world scenarios. Conclusion: PrinciplismQA offers a scalable framework for diagnosing specific ethical weaknesses, paving the way for more balanced and responsible medical AI. Abstract: The integration of large language models into healthcare necessitates a rigorous evaluation of their ethical reasoning, an area current benchmarks often overlook. We introduce PrinciplismQA, a comprehensive benchmark with 3,648 questions designed to systematically assess LLMs' alignment with core medical ethics. Grounded in Principlism, our benchmark features a high-quality dataset. This includes multiple-choice questions curated from authoritative textbooks and open-ended questions sourced from authoritative medical ethics case study literature, all validated by medical experts. Our experiments reveal a significant gap between models' ethical knowledge and their practical application, especially in dynamically applying ethical principles to real-world scenarios. Most LLMs struggle with dilemmas concerning Beneficence, often over-emphasizing other principles. Frontier closed-source models, driven by strong general capabilities, currently lead the benchmark. Notably, medical domain fine-tuning can enhance models' overall ethical competence, but further progress requires better alignment with medical ethical knowledge. PrinciplismQA offers a scalable framework to diagnose these specific ethical weaknesses, paving the way for more balanced and responsible medical AI.
[16] ATLANTIS at SemEval-2025 Task 3: Detecting Hallucinated Text Spans in Question Answering
Catherine Kobus,François Lancelot,Marion-Cécile Martin,Nawal Ould Amer
Main category: cs.CL
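TL;DR: The ATLANTIS team's contribution to SemEval-2025 Task 3: detecting hallucinated text spans in question answering systems by integrating context and using fine-tuned models or prompt engineering.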
Details
Motivation: Large Language Models (LLMs) have significantly advanced natural language generation but remain susceptible to hallucinations, generating incorrect or misleading content. Method: Methods both with and without external context are explored, using few-shot prompting, token-level classification, or an LLM fine-tuned on synthetic data. Result: The approaches achieved top rankings in Spanish and competitive placements in English and German. Conclusion: The work highlights the importance of integrating relevant context to mitigate hallucinations and demonstrates the potential of fine-tuned models and prompt engineering. Abstract: This paper presents the contributions of the ATLANTIS team to SemEval-2025 Task 3, focusing on detecting hallucinated text spans in question answering systems. Large Language Models (LLMs) have significantly advanced Natural Language Generation (NLG) but remain susceptible to hallucinations, generating incorrect or misleading content. To address this, we explored methods both with and without external context, utilizing few-shot prompting with an LLM, token-level classification, or an LLM fine-tuned on synthetic data. Notably, our approaches achieved top rankings in Spanish and competitive placements in English and German. This work highlights the importance of integrating relevant context to mitigate hallucinations and demonstrates the potential of fine-tuned models and prompt engineering.
[17] Resource-Limited Joint Multimodal Sentiment Reasoning and Classification via Chain-of-Thought Enhancement and Distillation
Haonan Shangguan,Xiaocui Yang,Shi Feng,Daling Wang,Yifei Zhang,Ge Yu
Main category: cs.CL
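TL;DR: This paper proposes MulCoT-RD, a multimodal sentiment reasoning and classification model for resource-constrained environments that achieves efficient reasoning and classification through a "Teacher-Assistant-Student" distillation paradigm.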
Details
Motivation: Current methods rely on parameter-heavy multimodal large language models for sentiment classification and overlook autonomous multimodal sentiment reasoning generation in resource-constrained environments. Method: The authors propose MulCoT-RD, a Multimodal Chain-of-Thought Reasoning Distillation model trained with a "Teacher-Assistant-Student" distillation paradigm. Result: MulCoT-RD performs well on all four datasets, achieving strong performance and generalization with only 3B parameters. Conclusion: MulCoT-RD effectively performs multimodal sentiment reasoning and classification in resource-constrained environments while maintaining high performance and interpretability. Abstract: The surge in rich multimodal content on social media platforms has greatly advanced Multimodal Sentiment Analysis (MSA), with Large Language Models (LLMs) further accelerating progress in this field. Current approaches primarily leverage the knowledge and reasoning capabilities of parameter-heavy (Multimodal) LLMs for sentiment classification, overlooking autonomous multimodal sentiment reasoning generation in resource-constrained environments. Therefore, we focus on the Resource-Limited Joint Multimodal Sentiment Reasoning and Classification task, JMSRC, which simultaneously performs multimodal sentiment reasoning chain generation and sentiment classification with only a lightweight model. We propose a Multimodal Chain-of-Thought Reasoning Distillation model, MulCoT-RD, designed for JMSRC that employs a "Teacher-Assistant-Student" distillation paradigm to address deployment constraints in resource-limited environments. We first leverage a high-performance Multimodal Large Language Model (MLLM) to generate the initial reasoning dataset and train a medium-sized assistant model with a multi-task learning mechanism. A lightweight student model is jointly trained to perform efficient multimodal sentiment reasoning generation and classification. Extensive experiments on four datasets demonstrate that MulCoT-RD with only 3B parameters achieves strong performance on JMSRC, while exhibiting robust generalization and enhanced interpretability.
[18] Pruning Large Language Models by Identifying and Preserving Functional Networks
Yiheng Liu,Junhao Ning,Sichen Xia,Xiaohui Gao,Ning Qiang,Bao Ge,Junwei Han,Xintao Hu
Main category: cs.CL
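TL;DR: This paper proposes a new structured pruning method for large language models that improves pruning results by identifying and preserving the key neurons in functional networks.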
Details
Motivation: Existing structured pruning methods overlook the interaction and collaboration among artificial neurons, which disrupts the model's macro functional architecture and degrades pruning performance. A pruning method that preserves the model's functional structure is therefore needed. Method: Treating the LLM as a digital brain and borrowing the approach used to identify functional brain networks in neuroimaging data, the method decomposes the model into functional networks and prunes it while preserving the key neurons within them. Result: Experiments show that the method successfully identifies and locates functional networks and key neurons in LLMs, enabling efficient model compression. Conclusion: The study proposes a new structured pruning method for LLMs based on identifying and preserving functional networks; experiments show that it effectively identifies and preserves key neurons, yielding efficient model pruning. Abstract: Structured pruning is one of the representative techniques for compressing large language models (LLMs) to reduce GPU memory consumption and accelerate inference. It offers significant practical value in improving the efficiency of LLMs in real-world applications. Current structured pruning methods typically rely on assessing the importance of structural units and pruning the less important ones. Most of them overlook the interaction and collaboration among artificial neurons that are crucial for the functionalities of LLMs, leading to a disruption in the macro functional architecture of LLMs and consequently a pruning performance degradation. Inspired by the inherent similarities between artificial neural networks and functional neural networks in the human brain, we alleviate this challenge and propose to prune LLMs by identifying and preserving functional networks within LLMs in this study. To achieve this, we treat an LLM as a digital brain and decompose the LLM into functional networks, analogous to identifying functional brain networks in neuroimaging data. Afterwards, an LLM is pruned by preserving the key neurons within these functional networks. Experimental results demonstrate that the proposed method can successfully identify and locate functional networks and key neurons in LLMs, enabling efficient model pruning. Our code is available at https://github.com/WhatAboutMyStar/LLM_ACTIVATION.
[19] CodeBoost: Boosting Code LLMs by Squeezing Knowledge from Code Snippets with RL
Sijie Wang,Quanjiang Guo,Kai Zhao,Yawei Zhang,Xin Li,Xiang Li,Siqi Li,Rui She,Shangshu Yu,Wee Peng Tay
Main category: cs.CL
TL;DR: CodeBoost is a post-training framework that uses code snippets to improve code LLMs without human-annotated instructions, relying on five key components.
Details
Motivation: Collecting high-quality coding instructions is labor-intensive and hard to scale, whereas code snippets are widely available, so a post-training method that does not depend on human annotation is needed. Method: CodeBoost introduces five key components: maximum-clique curation, bi-directional prediction, error-aware prediction, heterogeneous augmentation, and heterogeneous rewarding. Result: Extensive experiments verify CodeBoost's effectiveness across several code LLMs and benchmarks, indicating its potential as a scalable training pipeline. Conclusion: CodeBoost is an effective post-training framework for code LLMs that improves model performance using only code snippets, without human-annotated instructions. Abstract: Code large language models (LLMs) have become indispensable tools for building efficient and automated coding pipelines. Existing models are typically post-trained using reinforcement learning (RL) from general-purpose LLMs using "human instruction-final answer" pairs, where the instructions are usually from manual annotations. However, collecting high-quality coding instructions is both labor-intensive and difficult to scale. On the other hand, code snippets are abundantly available from various sources. This imbalance presents a major bottleneck in instruction-based post-training. We propose CodeBoost, a post-training framework that enhances code LLMs purely from code snippets, without relying on human-annotated instructions. CodeBoost introduces the following key components: (1) maximum-clique curation, which selects a representative and diverse training corpus from code; (2) bi-directional prediction, which enables the model to learn from both forward and backward prediction objectives; (3) error-aware prediction, which incorporates learning signals from both correct and incorrect outputs; (4) heterogeneous augmentation, which diversifies the training distribution to enrich code semantics; and (5) heterogeneous rewarding, which guides model learning through multiple reward types including format correctness and execution feedback from both successes and failures. Extensive experiments across several code LLMs and benchmarks verify that CodeBoost consistently improves performance, demonstrating its effectiveness as a scalable training pipeline.
[20] ASCoT: An Adaptive Self-Correction Chain-of-Thought Method for Late-Stage Fragility in LLMs
Dongxu Zhang,Ning Yang,Jihua Zhu,Jinnan Yang,Miao Xin,Baoliang Tian
Main category: cs.CL
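TL;DR: This paper challenges the conventional assumption about how errors affect chain-of-thought reasoning, proposes ASCoT to address late-stage fragility, and validates its superior benchmark performance experimentally.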
Details
Motivation: The paper challenges the prevailing "cascading failure" hypothesis, which holds that early errors affect a reasoning chain the most, and reports a counter-intuitive phenomenon, "Late-Stage Fragility": errors introduced in the later stages of a CoT chain are more likely to corrupt the final answer. Method: The paper runs systematic error-injection experiments and proposes ASCoT, which comprises an Adaptive Verification Manager (AVM) and a Multi-Perspective Self-Correction Engine (MSCE); a Positional Impact Score function I(k) identifies high-risk, late-stage steps, which then receive dual-path correction. Result: ASCoT performs strongly on benchmarks such as GSM8K and MATH, with accuracy exceeding strong baselines, including standard CoT. Conclusion: ASCoT effectively addresses late-stage fragility in LLM reasoning chains; the paper stresses the importance of diagnosing specific failure modes and advocates a shift from uniform verification strategies to adaptive, vulnerability-aware correction mechanisms. Abstract: Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Large Language Models (LLMs), yet the reliability of these reasoning chains remains a critical challenge. A widely held "cascading failure" hypothesis suggests that errors are most detrimental when they occur early in the reasoning process. This paper challenges that assumption through systematic error-injection experiments, revealing a counter-intuitive phenomenon we term "Late-Stage Fragility": errors introduced in the later stages of a CoT chain are significantly more likely to corrupt the final answer than identical errors made at the beginning. To address this specific vulnerability, we introduce the Adaptive Self-Correction Chain-of-Thought (ASCoT) method. ASCoT employs a modular pipeline in which an Adaptive Verification Manager (AVM) operates first, followed by the Multi-Perspective Self-Correction Engine (MSCE). The AVM leverages a Positional Impact Score function I(k) that assigns different weights based on the position within the reasoning chains, addressing the Late-Stage Fragility issue by identifying and prioritizing high-risk, late-stage steps. Once these critical steps are identified, the MSCE applies robust, dual-path correction specifically to the failure parts. Extensive experiments on benchmarks such as GSM8K and MATH demonstrate that ASCoT achieves outstanding accuracy, outperforming strong baselines, including standard CoT.
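The abstract does not state the form of I(k), so the following is only a minimal sketch of how a position-dependent score could prioritize late steps for verification. The power-law form, the `alpha` exponent, the `threshold` value, and both function names are assumptions made for illustration and are not taken from the paper.

```python
def positional_impact(k: int, n: int, alpha: float = 2.0) -> float:
    """Hypothetical I(k): grows with the relative position k/n of step k."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return (k / n) ** alpha


def steps_to_verify(n: int, threshold: float = 0.5, alpha: float = 2.0) -> list[int]:
    """Return the 1-based indices of steps whose score reaches the threshold."""
    return [k for k in range(1, n + 1) if positional_impact(k, n, alpha) >= threshold]


if __name__ == "__main__":
    # For a 5-step chain the scores are roughly 0.04, 0.16, 0.36, 0.64, 1.0,
    # so only the last two steps are flagged for verification.
    print(steps_to_verify(5))
```

Any monotonically increasing function of k would serve the same illustrative purpose; the exponent only controls how sharply attention concentrates on the final steps.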
Our work underscores the importance of diagnosing specific failure modes in LLM reasoning and advocates for a shift from uniform verification strategies to adaptive, vulnerability-aware correction mechanisms.
[21] Decision-Making with Deliberation: Meta-reviewing as a Document-grounded Dialogue
Sukannya Purkayastha,Nils Dycke,Anne Lauscher,Iryna Gurevych
Main category: cs.CL
TL;DR: This paper explores the use of dialogue agents to assist in the meta-reviewing process, showing that synthetic data can be used to train effective agents that improve the efficiency of meta-reviewing.
Details
Motivation: Meta-reviewing is a decision-making process that requires weighing reviewer arguments and placing them within a broader context, which can be supported by dialogue agents. Method: Synthetic data was generated using a self-refinement strategy with LLMs, and dialogue agents were trained and tested in real-world scenarios. Result: The synthetic data generation method produced high-quality data, and the trained dialogue agents outperformed off-the-shelf LLM-based assistants in real-world meta-reviewing scenarios. Conclusion: Meta-reviewing can be effectively supported by dialogue agents, which can be trained using synthetic data generated by LLMs. Abstract: Meta-reviewing is a pivotal stage in the peer-review process, serving as the final step in determining whether a paper is recommended for acceptance. Prior research on meta-reviewing has treated this as a summarization problem over review reports. However, complementary to this perspective, meta-reviewing is a decision-making process that requires weighing reviewer arguments and placing them within a broader context. Prior research has demonstrated that decision-makers can be effectively assisted in such scenarios via dialogue agents. In line with this framing, we explore the practical challenges for realizing dialogue agents that can effectively assist meta-reviewers. Concretely, we first address the issue of data scarcity for training dialogue agents by generating synthetic data using Large Language Models (LLMs) based on a self-refinement strategy to improve the relevance of these dialogues to expert domains. Our experiments demonstrate that this method produces higher-quality synthetic data and can serve as a valuable resource towards training meta-reviewing assistants. Subsequently, we utilize this data to train dialogue agents tailored for meta-reviewing and find that these agents outperform off-the-shelf LLM-based assistants for this task.
Finally, we apply our agents in real-world meta-reviewing scenarios and confirm their effectiveness in enhancing the efficiency of meta-reviewing. Code and data: https://github.com/UKPLab/arxiv2025-meta-review-as-dialog
[22] SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens
Nikita Dragunov,Temurbek Rahmatullaev,Elizaveta Goncharova,Andrey Kuznetsov,Anton Razzhigaev
Main category: cs.CL
TL;DR: SONAR-LLM is a decoder-only transformer that improves upon the Large Concept Model by using a hybrid objective, resulting in competitive generation quality and enhanced training efficiency.
Details
Motivation: The motivation is to retain the semantic abstraction of the Large Concept Model (LCM) while eliminating its diffusion sampler and restoring a likelihood-based training signal for more effective learning. Method: SONAR-LLM uses a decoder-only transformer that operates in the continuous SONAR embedding space and is supervised through token-level cross-entropy via a frozen SONAR decoder, combining semantic abstraction with likelihood-based training. Result: SONAR-LLM achieves competitive generation quality across various model sizes, supported by scaling trends, ablations, and benchmark results. Conclusion: SONAR-LLM attains competitive generation quality across model sizes from 39M to 1.3B parameters, and the training code along with pretrained checkpoints are released for reproducibility and future research. Abstract: The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings and training with either mean-squared error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer that "thinks" in the same continuous SONAR embedding space, yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, benchmark results, and release the complete training code and all pretrained checkpoints to foster reproducibility and future research.
[23] Efficient Reasoning for Large Reasoning Language Models via Certainty-Guided Reflection Suppression
Jiameng Huang,Baijiong Lin,Guhao Feng,Jierun Chen,Di He,Lu Hou
Main category: cs.CL
TL;DR: This paper proposes Certainty-Guided Reflection Suppression (CGRS), a new method that mitigates overthinking in large reasoning language models while maintaining reasoning accuracy.
Details
Motivation: The reflection behaviors of existing large reasoning language models can cause overthinking, which increases token usage, raises inference costs, and reduces practical utility. Method: The paper proposes Certainty-Guided Reflection Suppression (CGRS), which dynamically suppresses the generation of reflection triggers when the model is highly confident. Result: Experiments show that CGRS reduces token usage by 18.5% to 41.9% on average while preserving accuracy, and achieves the best balance between length reduction and performance. Conclusion: CGRS is a novel method that mitigates overthinking in LRLMs while maintaining reasoning accuracy. Abstract: Recent Large Reasoning Language Models (LRLMs) employ long chain-of-thought reasoning with complex reflection behaviors, typically signaled by specific trigger words (e.g., "Wait" and "Alternatively") to enhance performance. However, these reflection behaviors can lead to the overthinking problem, where the generation of redundant reasoning steps unnecessarily increases token usage, raises inference costs, and reduces practical utility. In this paper, we propose Certainty-Guided Reflection Suppression (CGRS), a novel method that mitigates overthinking in LRLMs while maintaining reasoning accuracy. CGRS operates by dynamically suppressing the model's generation of reflection triggers when it exhibits high confidence in its current response, thereby preventing redundant reflection cycles without compromising output quality. Our approach is model-agnostic, requires no retraining or architectural modifications, and can be integrated seamlessly with existing autoregressive generation pipelines. Extensive experiments across four reasoning benchmarks (i.e., AIME24, AMC23, MATH500, and GPQA-D) demonstrate CGRS's effectiveness: it reduces token usage by an average of 18.5% to 41.9% while preserving accuracy. It also achieves the optimal balance between length reduction and performance compared to state-of-the-art baselines. These results hold consistently across model architectures (e.g., DeepSeek-R1-Distill series, QwQ-32B, and Qwen3 family) and scales (4B to 32B parameters), highlighting CGRS's practical value for efficient reasoning.
[24] Evaluation of a Sign Language Avatar on Comprehensibility, User Experience & Acceptability
Fenya Wasserroth,Eleftherios Avramidis,Vera Czehmann,Tanja Kojic,Fabrizio Nunnari,Sebastian Möller
Main category: cs.CL
TL;DR: Adjustable sign language avatars on Hololens 2 showed no significant UX or comprehensibility improvements. Personalization alone is insufficient; comprehensibility by default is essential.
Details
Motivation: To understand the impact of adjustable features on sign language avatars and identify factors influencing user experience and comprehensibility. Method: Expert German Sign Language users interacted with adjustable and non-adjustable avatars on a Microsoft Hololens 2 device. The study analyzed comprehensibility, user experience, and acceptability. Result: No significant improvements in UX or comprehensibility were observed despite user preference for adjustability. Hedonic quality was higher than pragmatic quality, and stress levels were elevated for the adjustable avatar. Conclusion: Personalization alone is insufficient for SL avatars; comprehensibility by default is crucial. Key recommendations include enhancing facial animations, improving interaction interfaces, and applying participatory design. Abstract: This paper presents an investigation into the impact of adding adjustment features to an existing sign language (SL) avatar on a Microsoft Hololens 2 device. Through a detailed analysis of interactions of expert German Sign Language (DGS) users with both adjustable and non-adjustable avatars in a specific use case, this study identifies the key factors influencing the comprehensibility, the user experience (UX), and the acceptability of such a system. Despite user preference for adjustable settings, no significant improvements in UX or comprehensibility were observed, which remained at low levels, amid missing SL elements (mouthings and facial expressions) and implementation issues (indistinct hand shapes, lack of feedback and menu positioning). Hedonic quality was rated higher than pragmatic quality, indicating that users found the system more emotionally or aesthetically pleasing than functionally useful. Stress levels were higher for the adjustable avatar, reflecting lower performance, greater effort and more frustration. 
Additionally, concerns were raised about whether the Hololens adjustment gestures are intuitive and easy to familiarise oneself with. While acceptability of the concept of adjustability was generally positive, it was strongly dependent on usability and animation quality. This study highlights that personalisation alone is insufficient, and that SL avatars must be comprehensible by default. Key recommendations include enhancing mouthing and facial animation, improving interaction interfaces, and applying participatory design.
[25] Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025
Samy Ateia,Udo Kruschwitz
Main category: cs.CL
TL;DR: This paper explores the effectiveness of iterative self-correction in Retrieval Augmented Generation systems for biomedical research, comparing reasoning and non-reasoning LLMs.
Details
Motivation: Applying autonomous search processes to domain-specific professional search presents challenges, particularly in maintaining user involvement and alignment with expert information needs. Method: The study used a self-feedback mechanism where LLMs generated, evaluated, and refined their outputs for query expansion and multiple answer types in the context of the BioASQ CLEF 2025 challenge. Result: Preliminary results indicate varied performance for the self-feedback strategy across models and tasks. Conclusion: The work offers insights into LLM self-correction and suggests future research comparing the effectiveness of LLM-generated feedback with direct human expert input in search systems. Abstract: Agentic Retrieval Augmented Generation (RAG) and 'deep research' systems aim to enable autonomous search processes where Large Language Models (LLMs) iteratively refine outputs. However, applying these systems to domain-specific professional search, such as biomedical research, presents challenges, as automated systems may reduce user involvement and misalign with expert information needs. Professional search tasks often demand high levels of user expertise and transparency. The BioASQ CLEF 2025 challenge, using expert-formulated questions, can serve as a platform to study these issues. We explored the performance of current reasoning and nonreasoning LLMs like Gemini-Flash 2.0, o3-mini, o4-mini and DeepSeek-R1. A key aspect of our methodology was a self-feedback mechanism where LLMs generated, evaluated, and then refined their outputs for query expansion and for multiple answer types (yes/no, factoid, list, ideal). We investigated whether this iterative self-correction improves performance and if reasoning models are more capable of generating useful feedback. Preliminary results indicate varied performance for the self-feedback strategy across models and tasks. 
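The generate, evaluate, and refine cycle described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions: the three callables stand in for LLM calls, and the function name `self_feedback`, the `max_rounds` limit, and the convention that the critic returns `None` when it has no objection are all hypothetical rather than details from the paper.

```python
from typing import Callable, Optional

# Hypothetical signatures for the three LLM-backed steps.
Generate = Callable[[str], str]
Critique = Callable[[str, str], Optional[str]]
Refine = Callable[[str, str, str], str]


def self_feedback(
    question: str,
    generate: Generate,
    critique: Critique,
    refine: Refine,
    max_rounds: int = 2,
) -> str:
    """Draft an answer, then revise it until the critic accepts or rounds run out."""
    answer = generate(question)
    for _ in range(max_rounds):
        feedback = critique(question, answer)
        if feedback is None:  # assumed convention: None means "no objection"
            break
        answer = refine(question, answer, feedback)
    return answer


if __name__ == "__main__":
    # Stub "models" for a yes/no question; real use would call an LLM API here.
    draft = lambda q: "possibly"
    critic = lambda q, a: None if a in {"yes", "no"} else "Answer strictly yes or no."
    reviser = lambda q, a, fb: "yes"
    print(self_feedback("Is BRCA1 a tumor suppressor gene?", draft, critic, reviser))
```

Keeping the critic separate from the reviser makes it straightforward to swap the model-generated feedback for human expert feedback, which is the comparison the authors name as future work.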
This work offers insights into LLM self-correction and informs future work on comparing the effectiveness of LLM-generated feedback with direct human expert input in these search systems.
[26] The TUB Sign Language Corpus Collection
Eleftherios Avramidis,Vera Czehmann,Fabian Deckert,Lorenz Hufe,Aljoscha Lipski,Yuni Amaloa Quintero Villalobos,Tae Kwon Rhee,Mengqian Shi,Lennart Stölting,Fabrizio Nunnari,Sebastian Möller
Main category: cs.CL
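TL;DR: The paper presents a large collection of parallel video corpora covering 12 sign languages, with more than 1,300 hours of video and subtitles, focusing on Latin American sign languages and German Sign Language.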
Details
Motivation: To create large-scale, consistent parallel sign language corpora that support sign language research and technology development. Method: Videos in multiple sign languages were collected from various online sources and processed; the stages included data collection, informing content creators, obtaining usage approvals, scraping, and cropping. Result: The resulting collection contains more than 1,300 hours of video in 4,381 video files with 1.3 M subtitles, covering 12 sign languages, including 8 Latin American sign languages. Conclusion: The paper collects and curates parallel corpora for 12 sign languages and provides statistics together with an overview of the data collection methods. Abstract: We present a collection of parallel corpora of 12 sign languages in video format, together with subtitles in the dominant spoken languages of the corresponding countries. The entire collection includes more than 1,300 hours in 4,381 video files, accompanied by 1.3 M subtitles containing 14 M tokens. Most notably, it includes the first consistent parallel corpora for 8 Latin American sign languages, whereas the size of the German Sign Language corpora is ten times the size of the previously available corpora. The collection was created by collecting and processing videos of multiple sign languages from various online sources, mainly broadcast material of news shows, governmental bodies and educational channels. The preparation involved several stages, including data collection, informing the content creators and seeking usage approvals, scraping, and cropping. The paper provides statistics on the collection and an overview of the methods used to collect the data.
[27] MyCulture: Exploring Malaysia's Diverse Culture under Low-Resource Language Constraints
Zhong Ken Hew,Jia Xin Low,Sze Jue Yang,Chee Seng Chan
Main category: cs.CL
TL;DR: This paper introduces MyCulture, a new benchmark for evaluating LLMs on Malaysian cultural contexts, revealing disparities in cultural comprehension and advocating for more inclusive and fair evaluation methods.
Details
Motivation: The motivation stems from the cultural biases in LLMs due to training data dominated by high-resource languages, which challenges the accurate representation and evaluation of diverse cultural contexts, especially in low-resource language settings. Method: The researchers introduced a benchmark called MyCulture, which evaluates LLMs on Malaysian culture using a novel open-ended multiple-choice question format without predefined options. They provided a theoretical justification for this format's effectiveness and analyzed structural and language biases. Result: The evaluation across various LLMs revealed significant disparities in cultural comprehension. The open-ended structure proved effective in improving fairness and discriminative power, highlighting the need for more inclusive benchmarks. Conclusion: The study concludes that there are significant disparities in cultural comprehension among various LLMs, emphasizing the necessity for benchmarks that are culturally grounded and linguistically inclusive in LLM development and assessment. Abstract: Large Language Models (LLMs) often exhibit cultural biases due to training data dominated by high-resource languages like English and Chinese. This poses challenges for accurately representing and evaluating diverse cultural contexts, particularly in low-resource language settings. To address this, we introduce MyCulture, a benchmark designed to comprehensively evaluate LLMs on Malaysian culture across six pillars: arts, attire, customs, entertainment, food, and religion presented in Bahasa Melayu. Unlike conventional benchmarks, MyCulture employs a novel open-ended multiple-choice question format without predefined options, thereby reducing guessing and mitigating format bias. We provide a theoretical justification for the effectiveness of this open-ended structure in improving both fairness and discriminative power. 
Furthermore, we analyze structural bias by comparing model performance on structured versus free-form outputs, and assess language bias through multilingual prompt variations. Our evaluation across a range of regional and international LLMs reveals significant disparities in cultural comprehension, highlighting the urgent need for culturally grounded and linguistically inclusive benchmarks in the development and assessment of LLMs.
[28] LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
Ming Zhang,Yujiong Shen,Jingyi Deng,Yuhui Wang,Yue Zhang,Junzhe Wang,Shichun Liu,Shihan Dou,Huayu Sha,Qiyuan Peng,Changhao Jiang,Jingqi Tong,Yilong Wu,Zhihao Zhang,Mingqi Wu,Zhiheng Xi,Mingxu Chai,Tao Liang,Zhihui Fei,Zhen Wang,Mingyang Wan,Guojun Ma,Tao Gui,Qi Zhang,Xuanjing Huang
Main category: cs.CL
TL;DR: LLMEval-3 introduces a dynamic evaluation framework for Large Language Models, addressing data contamination and overfitting issues with static benchmarks, ensuring more accurate and trustworthy assessment of model capabilities.
Details
Motivation: Existing evaluation of LLMs on static benchmarks is vulnerable to data contamination and leaderboard overfitting, which obscures true model capabilities. Method: LLMEval-3 dynamically samples unseen test sets from a proprietary bank of 220k graduate-level questions for each evaluation run. It uses an automated pipeline with contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts. Result: A 20-month longitudinal study of nearly 50 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks. The framework demonstrates exceptional robustness in ranking stability and consistency. Conclusion: LLMEval-3 offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards. Abstract: Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run. Its automated pipeline ensures integrity via contamination-resistant data curation, a novel anti-cheating architecture, and a calibrated LLM-as-a-judge process achieving 90% agreement with human experts, complemented by a relative ranking system for fair comparison. A 20-month longitudinal study of nearly 50 leading models reveals a performance ceiling on knowledge memorization and exposes data contamination vulnerabilities undetectable by static benchmarks.
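The idea of drawing a fresh, unseen test set for every evaluation run can be sketched as sampling without replacement against a per-model history. This is a minimal illustration, not the LLMEval-3 implementation: the `QuestionBank` class, its method names, and the in-memory bookkeeping are assumptions made for the example.

```python
import random
from typing import Dict, Hashable, Iterable, List, Set


class QuestionBank:
    """Hypothetical question bank that never serves a model the same item twice."""

    def __init__(self, question_ids: Iterable[Hashable], seed: int = 0) -> None:
        self._ids: List[Hashable] = list(question_ids)
        self._served: Dict[str, Set[Hashable]] = {}  # model name -> ids already used
        self._rng = random.Random(seed)

    def sample_unseen(self, model: str, k: int) -> List[Hashable]:
        """Draw k questions the given model has not been evaluated on yet."""
        seen = self._served.setdefault(model, set())
        pool = [q for q in self._ids if q not in seen]
        if k > len(pool):
            raise ValueError("not enough unseen questions left for this model")
        batch = self._rng.sample(pool, k)
        seen.update(batch)
        return batch


if __name__ == "__main__":
    bank = QuestionBank(range(1000), seed=42)
    run_1 = bank.sample_unseen("model-a", 100)
    run_2 = bank.sample_unseen("model-a", 100)
    print(set(run_1).isdisjoint(run_2))  # the two runs share no question
```

Because different models receive different samples under such a scheme, raw scores are not directly comparable across runs, which is one plausible reason the framework pairs sampling with a relative ranking system.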
The framework demonstrates exceptional robustness in ranking stability and consistency, providing strong empirical validation for the dynamic evaluation paradigm. LLMEval-3 offers a robust and credible methodology for assessing the true capabilities of LLMs beyond leaderboard scores, promoting the development of more trustworthy evaluation standards.
[29] TASE: Token Awareness and Structured Evaluation for Multilingual Language Models
Chenzhuo Zhao,Xinda Wang,Yue Huang,Junting Lu,Ziqian Liu
Main category: cs.CL
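TL;DR: This paper introduces TASE, a benchmark for evaluating large language models' token-level understanding and structural reasoning across languages, and identifies the shortcomings of current models.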
Details
Motivation: Large language models (LLMs) excel at high-level semantic tasks but struggle with fine-grained, token-level understanding and structural reasoning, capabilities that are essential for applications requiring precision and control. Method: The paper introduces TASE, a comprehensive benchmark covering 10 tasks, with a 35,927-instance evaluation set and a scalable synthetic data generation pipeline for training. Result: More than 30 leading commercial and open-source LLMs were evaluated, including O3, Claude 4, Gemini 2.5 Pro, and DeepSeek-R1, and a custom Qwen2.5-14B model was trained with the GRPO method; human performance significantly outpaces current LLMs. Conclusion: TASE exposes the limitations of current LLMs in cross-lingual and low-level language understanding and offers a new diagnostic lens for future improvements. Abstract: While large language models (LLMs) have demonstrated remarkable performance on high-level semantic tasks, they often struggle with fine-grained, token-level understanding and structural reasoning, capabilities that are essential for applications requiring precision and control. We introduce TASE, a comprehensive benchmark designed to evaluate LLMs' ability to perceive and reason about token-level information across languages. TASE covers 10 tasks under two core categories: token awareness and structural understanding, spanning Chinese, English, and Korean, with a 35,927-instance evaluation set and a scalable synthetic data generation pipeline for training. Tasks include character counting, token alignment, syntactic structure parsing, and length constraint satisfaction. We evaluate over 30 leading commercial and open-source LLMs, including O3, Claude 4, Gemini 2.5 Pro, and DeepSeek-R1, and train a custom Qwen2.5-14B model using the GRPO training method. Results show that human performance significantly outpaces current LLMs, revealing persistent weaknesses in token-level reasoning. TASE sheds light on these limitations and provides a new diagnostic lens for future improvements in low-level language understanding and cross-lingual generalization. Our code and dataset are publicly available at https://github.com/cyzcz/Tase.
[30] Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations
Li-Chun Lu,Miri Liu,Pin-Chun Lu,Yufei Tian,Shao-Hua Sun,Nanyun Peng
Main category: cs.CL
TL;DR: The paper examines and compares creativity measures across different domains and finds them limited and in need of more robust evaluation frameworks that align with human judgments of creativity.
Details
Motivation: To understand the limitations of current creativity measures and the need for better evaluation frameworks. Method: The paper systematically examines, analyzes, and compares creativity measures across diverse creative domains. Result: The analyses reveal that these metrics exhibit limited consistency, capturing different dimensions of creativity. Conclusion: The paper concludes that current creativity measures have limitations and there is a need for more robust evaluation frameworks that align with human judgments of creativity. Abstract: We systematically examine, analyze, and compare representative creativity measures (creativity index, perplexity, syntactic templates, and LLM-as-a-Judge) across diverse creative domains, including creative writing, unconventional problem-solving, and research ideation. Our analyses reveal that these metrics exhibit limited consistency, capturing different dimensions of creativity. We highlight key limitations, including the creativity index's focus on lexical diversity, perplexity's sensitivity to model confidence, and syntactic templates' inability to capture conceptual creativity. Additionally, LLM-as-a-Judge shows instability and bias. Our findings underscore the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity.
[31] LAG: Logic-Augmented Generation from a Cartesian Perspective
Yilin Xiao,Chuang Zhou,Qinggang Zhang,Su Dong,Shengyuan Chen,Xiao Huang
Main category: cs.CL
TL;DR: This paper introduces Logic-Augmented Generation (LAG), a new approach that improves large language model reasoning by breaking down complex questions into logical steps, reducing hallucinations, and aligning with human problem-solving.
Details
Motivation: The motivation is to overcome the limitations of large language models (LLMs) in knowledge-intensive tasks, particularly their tendency to generate hallucinations, and to improve upon retrieval-augmented generation (RAG) by introducing structured logical organization and dependency-aware reasoning. Method: The paper introduces Logic-Augmented Generation (LAG), which decomposes complex questions into atomic sub-questions ordered by logical dependencies. It resolves these sequentially, using prior answers to guide context retrieval, incorporates a logical termination mechanism to prevent error propagation, and synthesizes sub-resolutions for verified responses. Result: Experiments on four benchmark datasets demonstrate that LAG significantly enhances reasoning robustness, reduces hallucination, and aligns LLM problem-solving with human cognition. Conclusion: The paper concludes that LAG offers a principled alternative to existing RAG systems by enhancing reasoning robustness, reducing hallucination, and aligning LLM problem-solving with human cognition. Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet exhibit critical limitations in knowledge-intensive tasks, often generating hallucinations when faced with questions requiring specialized expertise. While retrieval-augmented generation (RAG) mitigates this by integrating external knowledge, it struggles with complex reasoning scenarios due to its reliance on direct semantic retrieval and lack of structured logical organization. Inspired by Cartesian principles from Discours de la méthode, this paper introduces Logic-Augmented Generation (LAG), a novel paradigm that reframes knowledge augmentation through systematic question decomposition and dependency-aware reasoning. Specifically, LAG first decomposes complex questions into atomic sub-questions ordered by logical dependencies.
It then resolves these sequentially, using prior answers to guide context retrieval for subsequent sub-questions, ensuring stepwise grounding in the logical chain. To prevent error propagation, LAG incorporates a logical termination mechanism that halts inference upon encountering unanswerable sub-questions and reduces wasted computation on excessive reasoning. Finally, it synthesizes all sub-resolutions to generate verified responses. Experiments on four benchmark datasets demonstrate that LAG significantly enhances reasoning robustness, reduces hallucination, and aligns LLM problem-solving with human cognition, offering a principled alternative to existing RAG systems.[32] The World According to LLMs: How Geographic Origin Influences LLMs' Entity Deduction Capabilities
Harsh Nishant Lalai,Raj Sanjay Shah,Jiaxin Pei,Sashank Varma,Yi-Chia Wang,Ali Emami
Main category: cs.CL
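TL;DR: This paper studies the implicit biases LLMs exhibit in the 20 Questions game, finds that they deduce entities from the Global North and West more successfully than those from the Global South and East, and releases a new dataset, Geo20Q+.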
Details
Motivation: The motivation is to study implicit biases rooted in LLMs' pre-training data, rather than probing LLMs directly with human-crafted questions that may trigger guardrails. Method: The paper proposes a new evaluation framework that uses the 20 Questions game as a testbed to systematically evaluate geographic performance disparities in entity deduction. It uses a new dataset, Geo20Q+, and tests popular LLMs across two gameplay configurations and seven languages. Result: The results reveal geographic disparities: LLMs are substantially more successful at deducing entities from the Global North and West than from the Global South and East. Wikipedia pageviews and pre-training corpus frequency correlate mildly with performance but fail to fully explain these disparities. Conclusion: The paper concludes that creative, free-form evaluation frameworks can uncover subtle biases in LLMs that remain hidden in standard prompting setups. LLMs deduce entities from the Global North and West better than those from the Global South and East, and the language in which the game is played has minimal impact on the performance gaps. Abstract: Large Language Models (LLMs) have been extensively tuned to mitigate explicit biases, yet they often exhibit subtle implicit biases rooted in their pre-training data. Rather than directly probing LLMs with human-crafted questions that may trigger guardrails, we propose studying how models behave when they proactively ask questions themselves. The 20 Questions game, a multi-turn deduction task, serves as an ideal testbed for this purpose. We systematically evaluate geographic performance disparities in entity deduction using a new dataset, Geo20Q+, consisting of both notable people and culturally significant objects (e.g., foods, landmarks, animals) from diverse regions. We test popular LLMs across two gameplay configurations (canonical 20-question and unlimited turns) and in seven languages (English, Hindi, Mandarin, Japanese, French, Spanish, and Turkish). Our results reveal geographic disparities: LLMs are substantially more successful at deducing entities from the Global North than the Global South, and the Global West than the Global East. While Wikipedia pageviews and pre-training corpus frequency correlate mildly with performance, they fail to fully explain these disparities. Notably, the language in which the game is played has minimal impact on performance gaps. These findings demonstrate the value of creative, free-form evaluation frameworks for uncovering subtle biases in LLMs that remain hidden in standard prompting setups. 
By analyzing how models initiate and pursue reasoning goals over multiple turns, we find geographic and cultural disparities embedded in their reasoning processes. We release the dataset (Geo20Q+) and code at https://sites.google.com/view/llmbias20q/home.[33] CoCoLex: Confidence-guided Copy-based Decoding for Grounded Legal Text Generation
Santosh T. Y. S. S,Youssef Tarek Elkhayat,Oana Ichim,Pranav Shetty,Dongsheng Wang,Zhiqiang Ma,Armineh Nourbakhsh,Xiaomo Liu
Main category: cs.CL
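TL;DR: This paper proposes CoCoLex, a decoding strategy that dynamically combines model generation with copying from the context to improve the faithfulness and quality of legal text generation.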
Details
Motivation: The legal domain requires processing long and complex contexts, but existing large language models tend to generate unfaithful or hallucinated outputs, and conventional retrieval-augmented generation cannot guarantee that the context is effectively integrated. Method: The paper proposes a confidence-guided copy-based decoding strategy (CoCoLex) that dynamically interpolates the model's vocabulary distribution with a distribution derived from copying from the context, improving the faithfulness of the generated text. Result: On five legal benchmarks, CoCoLex outperforms existing context-aware decoding methods, particularly in long-form generation tasks. Conclusion: CoCoLex is an effective decoding strategy that improves the accuracy and reliability of legal text generation, and it outperforms existing methods particularly in long-form generation tasks. Abstract: Due to their ability to process long and complex contexts, LLMs can offer key benefits to the Legal domain, but their adoption has been hindered by their tendency to generate unfaithful, ungrounded, or hallucinatory outputs. While Retrieval-Augmented Generation offers a promising solution by grounding generations in external knowledge, it offers no guarantee that the provided context will be effectively integrated. To address this, context-aware decoding strategies have been proposed to amplify the influence of relevant context, but they usually do not explicitly enforce faithfulness to the context. In this work, we introduce Confidence-guided Copy-based Decoding for Legal Text Generation (CoCoLex), a decoding strategy that dynamically interpolates the model-produced vocabulary distribution with a distribution derived from copying from the context. CoCoLex encourages direct copying based on the model's confidence, ensuring greater fidelity to the source. Experimental results on five legal benchmarks demonstrate that CoCoLex outperforms existing context-aware decoding methods, particularly in long-form generation tasks.[34] Conformal Sets in Multiple-Choice Question Answering under Black-Box Settings with Provable Coverage Guarantees
Guang Yang,Xinyang Liu
Main category: cs.CL
TL;DR: This paper proposes a frequency-based uncertainty quantification method for LLMs in multiple-choice question answering, enhancing reliability and trustworthiness in high-risk applications by leveraging conformal prediction.
Details
Motivation: The motivation is to address the unreliability of LLMs, such as hallucinations and overconfidence, which limits their use in high-risk domains, by providing a reliable uncertainty quantification method. Method: The authors proposed a frequency-based uncertainty quantification method using conformal prediction, involving multiple independent samplings of the model's output distribution to calculate predictive entropy. Result: The experiments showed that the frequency-based predictive entropy outperforms logit-based predictive entropy in distinguishing correct and incorrect predictions, effectively controlling the miscoverage rate under specified risk levels. Conclusion: This paper concludes that frequency-based predictive entropy is a viable method for uncertainty quantification in black-box settings for LLMs, offering reliable coverage guarantees and improving trustworthiness in practical applications. Abstract: Large Language Models (LLMs) have shown remarkable progress in multiple-choice question answering (MCQA), but their inherent unreliability, such as hallucination and overconfidence, limits their application in high-risk domains. To address this, we propose a frequency-based uncertainty quantification method under black-box settings, leveraging conformal prediction (CP) to ensure provable coverage guarantees. Our approach involves multiple independent samplings of the model's output distribution for each input, with the most frequent sample serving as a reference to calculate predictive entropy (PE). Experimental evaluations across six LLMs and four datasets (MedMCQA, MedQA, MMLU, MMLU-Pro) demonstrate that frequency-based PE outperforms logit-based PE in distinguishing between correct and incorrect predictions, as measured by AUROC. 
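As a rough illustration of the sampling idea, the sketch below estimates option frequencies from repeated black-box samples, computes an entropy over those frequencies, and builds a prediction set with a standard split-conformal threshold. It is a generic sketch under assumptions, not the authors' exact procedure: the nonconformity score (one minus the sampled frequency of an option) and all function names are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def option_frequencies(samples, options):
    """Empirical frequency of each answer option among sampled outputs."""
    counts = Counter(samples)
    return {o: counts.get(o, 0) / len(samples) for o in options}


def predictive_entropy(freqs):
    """Shannon entropy (in nats) of the frequency distribution."""
    return -sum(p * math.log(p) for p in freqs.values() if p > 0)


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: keep every option
    return sorted(cal_scores)[k - 1]


def prediction_set(freqs, qhat):
    """Keep options whose assumed score, 1 - frequency, is within the threshold."""
    return [o for o, p in freqs.items() if 1 - p <= qhat]


# Toy usage: four samples of a model's answer to one question.
freqs = option_frequencies(["A", "A", "B", "A"], ["A", "B", "C", "D"])
entropy = predictive_entropy(freqs)

# Calibration scores would be 1 - frequency of the correct option per question.
qhat = conformal_threshold([0.3, 0.1, 0.5], alpha=0.25)
chosen = prediction_set(freqs, qhat)
```

Because only sampled answers are used, nothing here needs access to logits, which is what makes a frequency-based score usable in the black-box setting.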
Furthermore, the method effectively controls the empirical miscoverage rate under user-specified risk levels, validating that sampling frequency can serve as a viable substitute for logit-based probabilities in black-box scenarios. This work provides a distribution-free model-agnostic framework for reliable uncertainty quantification in MCQA with guaranteed coverage, enhancing the trustworthiness of LLMs in practical applications.[35] Do Political Opinions Transfer Between Western Languages? An Analysis of Unaligned and Aligned Multilingual LLMs
Franziska Weeber,Tanise Ceron,Sebastian Padó
Main category: cs.CL
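TL;DR: The study finds that multilingual large language models show cross-lingual transfer of political opinions in Western language contexts and that alignment shifts opinions across the board, highlighting the challenge of achieving multilingual cultural and political alignment.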
Details
Motivation: Public opinion surveys show differences in political opinions across cultural contexts, but it is unclear whether these differences translate into cross-lingual differences in multilingual large language models. Method: The study evaluates the opinions of multilingual LLMs in different languages using political statements from voting advice applications, both before and after alignment. Result: Unaligned models show very few significant cross-lingual differences, while political alignment shifts opinions in almost all five languages. Conclusion: Political opinions transfer between languages, demonstrating the challenges in achieving socio-linguistic, cultural, and political alignment of multilingual LLMs. Abstract: Public opinion surveys show cross-cultural differences in political opinions between socio-cultural contexts. However, there is no clear evidence whether these differences translate to cross-lingual differences in multilingual large language models (MLLMs). We analyze whether opinions transfer between languages or whether there are separate opinions for each language in MLLMs of various sizes across five Western languages. We evaluate MLLMs' opinions by prompting them to report their (dis)agreement with political statements from voting advice applications. To better understand the interaction between languages in the models, we evaluate them both before and after aligning them with more left or right views using direct preference optimization and English alignment data only. Our findings reveal that unaligned models show only very few significant cross-lingual differences in the political opinions they reflect. The political alignment shifts opinions almost uniformly across all five languages. We conclude that in Western language contexts, political opinions transfer between languages, demonstrating the challenges in achieving explicit socio-linguistic, cultural, and political alignment of MLLMs.[36] MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy
Shaoxiong Zhan,Yanlin Lai,Ziyu Lu,Dahua Lin,Ziqing Yang,Fei Tang
Main category: cs.CL
TL;DR: MathSmith is a novel framework for synthesizing challenging mathematical problems to enhance LLM reasoning by creating new problems from scratch using PlanetMath, incorporating reinforcement learning, and measuring cognitive complexity through reasoning traces, showing consistent performance improvements over existing methods.
Details
Motivation: The advancement of large language models in mathematical reasoning is limited by the scarcity of high-quality, high-difficulty training data. Existing synthesis methods are constrained by diversity and scalability issues. Method: MathSmith synthesizes mathematical problems by constructing concept-explanation pairs from PlanetMath, using nine predefined strategies as soft constraints and reinforcement learning to optimize structural validity, reasoning complexity, and answer consistency. Cognitive complexity is measured by the length of reasoning traces under autoregressive prompting. Result: Experiments across five benchmarks (GSM8K, MATH-500, AIME2024, AIME2025, OlympiadBench) show that MathSmith consistently outperforms existing baselines under both short and long chain-of-thought settings. Conclusion: MathSmith demonstrates strong scalability, generalization, and transferability, highlighting the potential of high-difficulty synthetic data in advancing LLM reasoning capabilities. Abstract: Large language models have achieved substantial progress in mathematical reasoning, yet their advancement is limited by the scarcity of high-quality, high-difficulty training data. Existing synthesis methods largely rely on transforming human-written templates, limiting both diversity and scalability. We propose MathSmith, a novel framework for synthesizing challenging mathematical problems to enhance LLM reasoning. Rather than modifying existing problems, MathSmith constructs new ones from scratch by randomly sampling concept-explanation pairs from PlanetMath, ensuring data independence and avoiding contamination. To increase difficulty, we design nine predefined strategies as soft constraints during rationales. We further adopt reinforcement learning to jointly optimize structural validity, reasoning complexity, and answer consistency. 
The length of the reasoning trace generated under autoregressive prompting is used to reflect cognitive complexity, encouraging the creation of more demanding problems aligned with long-chain-of-thought reasoning. Experiments across five benchmarks, categorized as easy & medium (GSM8K, MATH-500) and hard (AIME2024, AIME2025, OlympiadBench), show that MathSmith consistently outperforms existing baselines under both short and long CoT settings. Additionally, a weakness-focused variant generation module enables targeted improvement on specific concepts. Overall, MathSmith exhibits strong scalability, generalization, and transferability, highlighting the promise of high-difficulty synthetic data in advancing LLM reasoning capabilities.[37] Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models
Haitao Hong,Yuchen Yan,Xingyu Wu,Guiyang Hou,Wenqi Zhang,Weiming Lu,Yongliang Shen,Jun Xiao
Main category: cs.CL
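TL;DR: This paper proposes Cooper, a reinforcement learning framework that jointly optimizes the policy model and dynamically updates the reward model, effectively mitigating reward hacking and improving RL performance.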
Details
Motivation: The two mainstream reward paradigms, model-based rewards and rule-based rewards, both have limitations: rule-based rewards lack robustness, while model-based rewards are vulnerable to reward hacking. Method: The paper proposes Cooper, an RL framework that jointly optimizes the policy model and the reward model, and introduces a hybrid annotation strategy to generate training data for the reward model efficiently and accurately. It also proposes a reference-based reward modeling paradigm and, based on this design, trains a reward model named VerifyRM. Result: Cooper not only alleviates reward hacking but also improves end-to-end RL performance, for example achieving a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct. VerifyRM also achieves higher accuracy on VerifyBench than other models of the same size. Conclusion: Dynamically updating the reward model is an effective way to combat reward hacking, providing a reference for better integrating reward models into RL. Abstract: Large language models (LLMs) have demonstrated remarkable performance in reasoning tasks, where reinforcement learning (RL) serves as a key algorithm for enhancing their reasoning capabilities. Currently, there are two mainstream reward paradigms: model-based rewards and rule-based rewards. However, both approaches suffer from limitations: rule-based rewards lack robustness, while model-based rewards are vulnerable to reward hacking. To address these issues, we propose Cooper (Co-optimizing Policy Model and Reward Model), an RL framework that jointly optimizes both the policy model and the reward model. Cooper leverages the high precision of rule-based rewards when identifying correct responses, and dynamically constructs and selects positive-negative sample pairs for continued training of the reward model. This design enhances robustness and mitigates the risk of reward hacking. To further support Cooper, we introduce a hybrid annotation strategy that efficiently and accurately generates training data for the reward model. We also propose a reference-based reward modeling paradigm, where the reward model takes a reference answer as input. Based on this design, we train a reward model named VerifyRM, which achieves higher accuracy on VerifyBench compared to other models of the same size. We conduct reinforcement learning using both VerifyRM and Cooper. Our experiments show that Cooper not only alleviates reward hacking but also improves end-to-end RL performance, for instance, achieving a 0.54% gain in average accuracy on Qwen2.5-1.5B-Instruct. 
Our findings demonstrate that dynamically updating the reward model is an effective way to combat reward hacking, providing a reference for better integrating reward models into RL.[38] OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks
Zixuan Wang,Dingming Li,Hongxing Li,Shuo Chen,Yuchen Yan,Wenqi Zhang,Yongliang Shen,Weiming Lu,Jun Xiao,Yueting Zhuang
Main category: cs.CL
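TL;DR: This study proposes the OmniEAR framework for evaluating language models' reasoning in embodied tasks, revealing major limitations of current models in dynamic capability acquisition and multi-agent coordination.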
Details
Motivation: Large language models excel at abstract reasoning, but their capacity for embodied agent reasoning remains unclear. Method: The paper presents the OmniEAR framework, which models continuous physical properties and complex spatial relationships through text-based environment representation, covering 1,500 scenarios in household and industrial domains. Result: Experiments show that model performance degrades significantly when models must reason from task constraints, that complete environmental information actually degrades coordination performance, and that fine-tuning substantially improves single-agent tasks but yields limited gains on multi-agent tasks. Conclusion: As a new benchmark for evaluating embodied AI systems, OmniEAR reveals fundamental challenges and architectural limitations of current models in embodied reasoning. Abstract: Large language models excel at abstract reasoning but their capacity for embodied agent reasoning remains largely unexplored. We present OmniEAR, a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Through text-based environment representation, we model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints: while achieving 85-96% success with explicit instructions, performance drops to 56-85% for tool reasoning and 63-85% for implicit collaboration, with compound tasks showing over 50% failure rates. Surprisingly, complete environmental information degrades coordination performance, indicating models cannot filter task-relevant constraints. Fine-tuning improves single-agent tasks dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations. These findings demonstrate that embodied reasoning poses fundamentally different challenges than current models can address, establishing OmniEAR as a rigorous benchmark for evaluating and advancing embodied AI systems. Our code and data are included in the supplementary materials and will be open-sourced upon acceptance.[39] Learning to Reason for Factuality
Xilun Chen,Ilia Kulikov,Vincent-Pierre Berges,Barlas Oğuz,Rulin Shao,Gargi Ghosh,Jason Weston,Wen-tau Yih
Main category: cs.CL
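TL;DR: This paper proposes a new reward function for R-LLMs and applies online RL to learn high-quality factual reasoning, effectively reducing the hallucination rate and increasing the level of detail in answers.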
Details
Motivation: In the long-form factuality setting, R-LLMs face the challenge of lacking reliable verification methods, and directly using existing methods as the reward leads to reward hacking. Method: The paper proposes a new reward function that simultaneously considers factual precision, response detail level, and answer relevance, and applies online RL to learn high-quality factual reasoning. Result: Across six long-form factuality benchmarks, the hallucination rate drops by an average of 23.1 percentage points and answer detail level rises by 23%, with no degradation in overall response helpfulness. Conclusion: R-LLMs perform well on complex reasoning tasks but are prone to hallucination on long-form factuality benchmarks; the proposed reward function effectively reduces the hallucination rate and increases answer detail level. Abstract: Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has utilized automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly leveraging such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or relevant responses. We propose a novel reward function that simultaneously considers the factual precision, response detail level, and answer relevance, and applies online RL to learn high-quality factual reasoning. Evaluated on six long-form factuality benchmarks, our factual reasoning model achieves an average reduction of 23.1 percentage points in hallucination rate, a 23% increase in answer detail level, and no degradation in the overall response helpfulness.[40] How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations
Brandon Jaipersaud,David Krueger,Ekdeep Singh Lubana
Main category: cs.CL
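TL;DR: This paper trains linear probes to study the persuasion dynamics of large language models in multi-turn conversations, finds that they can match and in some settings outperform prompting-based methods, and suggests they may be applied to studying other complex behaviours.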
Details
Motivation: Large language models have shown the ability to persuade humans, but our understanding of this dynamic is limited, so the paper uses linear probes to study it. Method: Drawing on insights from cognitive science, the paper trains linear probes for persuasion success, persuadee personality, and persuasion strategy. Result: Despite their simplicity, the linear probes capture multiple aspects of persuasion at both the sample and dataset levels, and they outperform prompting-based methods in some settings. Conclusion: Linear probes can effectively study persuasion dynamics in multi-turn conversations and may be used to study other complex behaviours such as deception and manipulation. Abstract: Large Language Models (LLMs) have started to demonstrate the ability to persuade humans, yet our understanding of how this dynamic transpires is limited. Recent work has used linear probes, lightweight tools for analyzing model representations, to study various LLM skills such as the ability to model user sentiment and political perspective. Motivated by this, we apply probes to study persuasion dynamics in natural, multi-turn conversations. We leverage insights from cognitive science to train probes on distinct aspects of persuasion: persuasion success, persuadee personality, and persuasion strategy. Despite their simplicity, we show that they capture various aspects of persuasion at both the sample and dataset levels. For instance, probes can identify the point in a conversation where the persuadee was persuaded or where persuasive success generally occurs across the entire dataset. We also show that in addition to being faster than expensive prompting-based approaches, probes can do just as well and even outperform prompting in some settings, such as when uncovering persuasion strategy. This suggests probes as a plausible avenue for studying other complex behaviours such as deception and manipulation, especially in multi-turn settings and large-scale dataset analysis where prompting-based methods would be computationally inefficient.[41] H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages
Mehrdad Zakershahrak,Samira Ghodratnama
Main category: cs.CL
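TL;DR: H-NET++ is a tokenizer-free hierarchical dynamic-chunking model that addresses the computational challenges of byte-level language models in morphologically-rich languages and performs strongly on Persian tasks.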
Details
Motivation: Byte-level language models avoid fragile tokenizers but face computational challenges in morphologically-rich languages, where words typically span many bytes. Method: H-NET++ introduces a hierarchical dynamic-chunking model that combines a lightweight Transformer context-mixer, a two-level latent hyper-prior for document-level consistency, specialized handling of orthographic artifacts (such as Persian ZWNJ), and curriculum-based training with staged sequence lengths. Result: On a 1.4B-token Persian corpus, H-NET++ achieves state-of-the-art results: a 0.159 BPB reduction versus BPE-based GPT-2-fa (12% better compression), a 5.4pp gain on ParsGLUE, 53% improved robustness to ZWNJ corruption, and 73.8% F1 on gold morphological boundaries. Conclusion: H-NET++ offers a tokenizer-free hierarchical dynamic-chunking solution that effectively addresses the computational challenges of byte-level language models in morphologically-rich languages such as Persian. Abstract: Byte-level language models eliminate fragile tokenizers but face computational challenges in morphologically-rich languages (MRLs), where words span many bytes. We propose H-NET++, a hierarchical dynamic-chunking model that learns linguistically-informed segmentation through end-to-end training. Key innovations include: (1) a lightweight Transformer context-mixer (1.9M parameters) for cross-chunk attention, (2) a two-level latent hyper-prior for document-level consistency, (3) specialized handling of orthographic artifacts (e.g. Persian ZWNJ), and (4) curriculum-based training with staged sequence lengths. On a 1.4B-token Persian corpus, H-NET++ achieves state-of-the-art results: 0.159 BPB reduction versus BPE-based GPT-2-fa (12% better compression), 5.4pp gain on ParsGLUE, 53% improved robustness to ZWNJ corruption, and 73.8% F1 on gold morphological boundaries. Our learned chunks align with Persian morphology without explicit supervision, demonstrating that hierarchical dynamic chunking provides an effective tokenizer-free solution for MRLs while maintaining computational efficiency.
cs.CV [Back]
[42] RetinexDual: Retinex-based Dual Nature Approach for Generalized Ultra-High-Definition Image Restoration
Mohab Kishawy,Ali Abdellatif Hussein,Jun Chen
Main category: cs.CV
TL;DR: RetinexDual improves Ultra-High-Definition Image Restoration by combining spatial and frequency domain techniques, outperforming existing methods in multiple tasks.
Details
Motivation: Traditional methods for UHD IR have drawbacks like information loss and inefficiency in handling spatial artifacts, necessitating a more effective approach. Method: RetinexDual uses two sub-networks: SAMBA for reflectance correction and FIA for illumination correction, combining spatial and frequency domain techniques. Result: RetinexDual outperforms recent methods in UHD IR tasks like deraining, deblurring, dehazing, and LLIE, as shown by qualitative and quantitative evaluations. Conclusion: RetinexDual is an effective framework for UHD IR tasks, overcoming limitations of traditional methods by combining spatial and frequency domain approaches. Abstract: Advancements in image sensing have elevated the importance of Ultra-High-Definition Image Restoration (UHD IR). Traditional methods, such as extreme downsampling or transformation from the spatial to the frequency domain, encounter significant drawbacks: downsampling induces irreversible information loss in UHD images, while our frequency analysis reveals that pure frequency-domain approaches are ineffective for spatially confined image artifacts, primarily due to the loss of degradation locality. To overcome these limitations, we present RetinexDual, a novel Retinex theory-based framework designed for generalized UHD IR tasks. RetinexDual leverages two complementary sub-networks: the Scale-Attentive maMBA (SAMBA) and the Frequency Illumination Adaptor (FIA). SAMBA, responsible for correcting the reflectance component, utilizes a coarse-to-fine mechanism to overcome the causal modeling of mamba, which effectively reduces artifacts and restores intricate details. On the other hand, FIA ensures precise correction of color and illumination distortions by operating in the frequency domain and leveraging the global context provided by it. 
Evaluating RetinexDual on four UHD IR tasks, namely deraining, deblurring, dehazing, and Low-Light Image Enhancement (LLIE), shows that it outperforms recent methods qualitatively and quantitatively. Ablation studies demonstrate the importance of employing distinct designs for each branch in RetinexDual, as well as the effectiveness of its various components.[43] ACM Multimedia Grand Challenge on ENT Endoscopy Analysis
Trong-Thuan Nguyen,Viet-Tham Huynh,Thao Thi Phuong Dao,Ha Nguyen Thi,Tien To Vu Thuy,Uyen Hanh Tran,Tam V. Nguyen,Thanh Dinh Le,Minh-Triet Tran
Main category: cs.CV
TL;DR: This paper introduces ENTRep, a new benchmark for automated analysis of ENT endoscopic images with multilingual support and retrieval capabilities.
Details
Motivation: The motivation stems from the lack of reliable automated analysis tools for ENT endoscopic imagery due to variability in devices, operators, and the need for fine-grained distinctions and multilingual support. Method: The authors introduced ENTRep, a benchmark dataset with expert annotations, and defined three benchmark tasks with standardized submission protocols evaluated on public and private test splits. Result: The result includes the creation of the ENTRep dataset, benchmark tasks, and evaluation results from top-performing teams, along with insights into future directions. Conclusion: The paper concludes that ENTRep provides a novel benchmark for ENT endoscopy analysis, integrating fine-grained classification and retrieval capabilities under bilingual supervision. Abstract: Automated analysis of endoscopic imagery is a critical yet underdeveloped component of ENT (ear, nose, and throat) care, hindered by variability in devices and operators, subtle and localized findings, and fine-grained distinctions such as laterality and vocal-fold state. In addition to classification, clinicians require reliable retrieval of similar cases, both visually and through concise textual descriptions. These capabilities are rarely supported by existing public benchmarks. To this end, we introduce ENTRep, the ACM Multimedia 2025 Grand Challenge on ENT endoscopy analysis, which integrates fine-grained anatomical classification with image-to-image and text-to-image retrieval under bilingual (Vietnamese and English) clinical supervision. Specifically, the dataset comprises expert-annotated images, labeled for anatomical region and normal or abnormal status, and accompanied by dual-language narrative descriptions. In addition, we define three benchmark tasks, standardize the submission protocol, and evaluate performance on public and private test splits using server-side scoring. 
Moreover, we report results from the top-performing teams and provide an insightful discussion.[44] CoMAD: A Multiple-Teacher Self-Supervised Distillation Framework
Sriram Mandalika,Lalitha V
Main category: cs.CV
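TL;DR: CoMAD is a lightweight model distillation method that integrates knowledge from multiple self-supervised learning models to train a stronger small model.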
Details
Motivation: Existing self-supervised learning methods are typically pretrained in isolation, overlooking complementary information, and yield large models that are impractical for resource-constrained deployment. Method: Using asymmetric masking and a joint consensus gating mechanism, CoMAD distills knowledge from three pretrained models (MAE, MoCo v3, and iBOT) into a lightweight student model. Result: CoMAD's ViT-Tiny achieves 75.4% Top-1 accuracy on ImageNet-1K and also attains state-of-the-art results on dense-prediction tasks such as ADE20K and MS-COCO. Conclusion: The CoMAD framework successfully unifies knowledge from multiple self-supervised Vision Transformers into a compact student network that achieves state-of-the-art performance on multiple tasks. Abstract: Numerous self-supervised learning paradigms, such as contrastive learning and masked image modeling, learn powerful representations from unlabeled data but are typically pretrained in isolation, overlooking complementary insights and yielding large models that are impractical for resource-constrained deployment. To overcome these challenges, we introduce Consensus-oriented Masked Distillation (CoMAD), a lightweight, parameter-free framework that unifies knowledge from multiple current state-of-the-art self-supervised Vision Transformers into a compact student network. CoMAD distills from three pretrained ViT-Base teachers, MAE, MoCo v3, and iBOT, each offering distinct semantic and contextual priors. Rather than naively averaging teacher outputs, we apply asymmetric masking: the student sees only 25 percent of patches while each teacher receives a progressively lighter, unique mask, forcing the student to interpolate missing features under richer contexts. Teacher embeddings are aligned to the student's space via a linear adapter and layer normalization, then fused through our joint consensus gating, which weights each token by combining cosine affinity with inter-teacher agreement. The student is trained with dual-level KL divergence on visible tokens and reconstructed feature maps, capturing both local and global structure. On ImageNet-1K, CoMAD's ViT-Tiny achieves 75.4 percent Top-1, an increment of 0.4 percent over the previous state-of-the-art. 
In dense-prediction transfers, it attains 47.3 percent mIoU on ADE20K, and 44.5 percent box average precision and 40.5 percent mask average precision on MS-COCO, establishing a new state-of-the-art in compact SSL distillation.[45] Single-Step Reconstruction-Free Anomaly Detection and Segmentation via Diffusion Models
Mehrdad Moradi,Marco Grasso,Bianca Maria Colosimo,Kamran Paynabar
Main category: cs.CV
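TL;DR: RADAR is an attention-based diffusion model for real-time anomaly detection that requires no image reconstruction and significantly improves detection accuracy and computational efficiency.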
Details
Motivation: The main problems of existing diffusion models in anomaly detection are the high computational cost of reconstruction, poor reconstruction of complex or subtle patterns, and the difficulty of choosing an appropriate intermediate noise level. Method: The paper introduces RADAR, a real-time, reconstruction-free, attention-based diffusion model that produces anomaly maps directly from the diffusion model instead of reconstructing the input image. Result: RADAR surpasses state-of-the-art diffusion-based and statistical machine learning models on both the MVTec-AD dataset and a 3D-printed material dataset, improving F1 score by 7% and 13%, respectively. Conclusion: By producing anomaly maps directly, RADAR overcomes the limitations of reconstruction-based methods and outperforms current state-of-the-art diffusion models and other statistical machine learning models in both computational efficiency and detection accuracy. Abstract: Generative models have demonstrated significant success in anomaly detection and segmentation over the past decade. Recently, diffusion models have emerged as a powerful alternative, outperforming previous approaches such as GANs and VAEs. In typical diffusion-based anomaly detection, a model is trained on normal data, and during inference, anomalous images are perturbed to a predefined intermediate step in the forward diffusion process. The corresponding normal image is then reconstructed through iterative reverse sampling. However, reconstruction-based approaches present three major challenges: (1) the reconstruction process is computationally expensive due to multiple sampling steps, making real-time applications impractical; (2) for complex or subtle patterns, the reconstructed image may correspond to a different normal pattern rather than the original input; and (3) choosing an appropriate intermediate noise level is challenging because it is application-dependent and often assumes prior knowledge of anomalies, an assumption that does not hold in unsupervised settings. We introduce Reconstruction-free Anomaly Detection with Attention-based diffusion models in Real-time (RADAR), which overcomes the limitations of reconstruction-based anomaly detection. Unlike current SOTA methods that reconstruct the input image, RADAR directly produces anomaly maps from the diffusion model, improving both detection accuracy and computational efficiency. We evaluate RADAR on real-world 3D-printed material and the MVTec-AD dataset. 
Our approach surpasses state-of-the-art diffusion-based and statistical machine learning models across all key metrics, including accuracy, precision, recall, and F1 score. Specifically, RADAR improves F1 score by 7% on MVTec-AD and 13% on the 3D-printed material dataset compared to the next best model. Code available at: https://github.com/mehrdadmoradi124/RADAR[46] A deep learning approach to track eye movements based on events
Chirag Seth,Divya Naiken,Keyan Lin
Main category: cs.CV
TL;DR: This research develops a deep learning-based algorithm using a CNN_LSTM model to track eye movements with about 81% accuracy, aiming to enhance VR/AR user experience while proposing future improvements through Layer-wise Relevance Propagation (LRP).
Details
Motivation: The motivation is to develop a cost-effective and interpretable algorithm for eye movement tracking, which has extensive applications in consumer electronics, especially VR and AR product development. Method: The study uses deep learning methods, particularly exploring the CNN_LSTM model, to predict human attention and track eye movements using inputs from an event camera. Result: The CNN_LSTM model achieved approximately 81% accuracy in tracking eye movements, demonstrating its effectiveness in predicting human attention. Conclusion: The research concludes that the CNN_LSTM model is the most effective approach for tracking eye movements with an accuracy of about 81%, and future work will focus on Layer-wise Relevance Propagation (LRP) to enhance model interpretability and predictive performance. Abstract: This research project addresses the challenge of accurately tracking eye movements during specific events by leveraging previous research. Given the rapid movements of human eyes, which can reach speeds of 300°/s, precise eye tracking typically requires expensive and high-speed cameras. Our primary objective is to locate the eye center position (x, y) using inputs from an event camera. Eye movement analysis has extensive applications in consumer electronics, especially in VR and AR product development. Therefore, our ultimate goal is to develop an interpretable and cost-effective algorithm using deep learning methods to predict human attention, thereby improving device comfort and enhancing overall user experience. To achieve this goal, we explored various approaches, with the CNN_LSTM model proving most effective, achieving approximately 81% accuracy. Additionally, we propose future work focusing on Layer-wise Relevance Propagation (LRP) to further enhance the model's interpretability and predictive performance.
[47] LuKAN: A Kolmogorov-Arnold Network Framework for 3D Human Motion Prediction
Md Zahidul Hasan,A. Ben Hamza,Nizar Bouguila
Main category: cs.CV
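TL;DR: This paper proposes LuKAN, an efficient 3D human motion prediction model based on Kolmogorov-Arnold Networks with Lucas polynomial activations. It combines a discrete wavelet transform, a spatial projection layer, and a temporal dependency learner to produce temporally coherent predictions, and experiments show strong performance with high computational efficiency.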
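The Lucas polynomials named in this entry satisfy a three-term linear recurrence: L_0(x) = 2, L_1(x) = x, and L_n(x) = x L_{n-1}(x) + L_{n-2}(x). The sketch below evaluates that basis and forms a weighted sum of it as a stand-in for a learnable activation. It is only an illustration under assumptions: the function names `lucas_polynomials` and `lucas_activation` are hypothetical, and the digest does not say how LuKAN normalizes its inputs or parameterizes its KAN layer.

```python
def lucas_polynomials(x: float, max_degree: int) -> list[float]:
    """Evaluate L_0(x), ..., L_max_degree(x) with the linear recurrence.

    Each new value costs one multiplication and one addition, which is
    why the recurrence is cheap to evaluate.
    """
    if max_degree < 0:
        raise ValueError("max_degree must be non-negative")
    values = [2.0]                      # L_0(x) = 2
    if max_degree >= 1:
        values.append(float(x))         # L_1(x) = x
    for _ in range(2, max_degree + 1):
        # L_n(x) = x * L_{n-1}(x) + L_{n-2}(x)
        values.append(x * values[-1] + values[-2])
    return values


def lucas_activation(x: float, coefficients: list[float]) -> float:
    """Hypothetical learnable activation: a weighted sum of the Lucas basis.

    `coefficients[k]` multiplies L_k(x); in a KAN-style layer these weights
    would be the trainable parameters (an assumption of this sketch).
    """
    basis = lucas_polynomials(x, len(coefficients) - 1)
    return sum(c * b for c, b in zip(coefficients, basis))
```

At x = 1 the basis reduces to the Lucas numbers 2, 1, 3, 4, 7, which makes a convenient sanity check for the recurrence.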
Details
Motivation: Existing 3D human motion prediction methods struggle to balance prediction accuracy and computational efficiency, which motivates a new, effective model. Method: LuKAN encodes temporal information in the input motion sequence with the discrete wavelet transform, captures inter-joint dependencies with a spatial projection layer, uses a KAN layer parameterized by Lucas polynomials as the temporal dependency learner, and finally reconstructs the motion sequence in the time domain with the inverse discrete wavelet transform. Result: Extensive experiments on three benchmark datasets show that LuKAN performs competitively with strong baselines in both quantitative and qualitative evaluations while remaining computationally efficient. Conclusion: LuKAN is a model based on Kolmogorov-Arnold Networks with Lucas polynomial activations for efficient and accurate 3D human motion prediction; its compact architecture and the linear recurrence of Lucas polynomials ensure computational efficiency. Abstract: The goal of 3D human motion prediction is to forecast future 3D poses of the human body based on historical motion data. Existing methods often face limitations in achieving a balance between prediction accuracy and computational efficiency. In this paper, we present LuKAN, an effective model based on Kolmogorov-Arnold Networks (KANs) with Lucas polynomial activations. Our model first applies the discrete wavelet transform to encode temporal information in the input motion sequence. Then, a spatial projection layer is used to capture inter-joint dependencies, ensuring structural consistency of the human body. At the core of LuKAN is the Temporal Dependency Learner, which employs a KAN layer parameterized by Lucas polynomials for efficient function approximation. These polynomials provide computational efficiency and an enhanced capability to handle oscillatory behaviors. Finally, the inverse discrete wavelet transform reconstructs motion sequences in the time domain, generating temporally coherent predictions. Extensive experiments on three benchmark datasets demonstrate the competitive performance of our model compared to strong baselines, as evidenced by both quantitative and qualitative evaluations. Moreover, its compact architecture, coupled with the linear recurrence of Lucas polynomials, ensures computational efficiency.
[48] VER-Bench: Evaluating MLLMs on Reasoning with Fine-Grained Visual Evidence
Chenhui Qiang,Zhaoyang Wei,Xumeng Han,Zipeng Wang,Siyao Li,Xiangyuan Lan,Jianbin Jiao,Zhenjun Han
Main category: cs.CV
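TL;DR: This paper introduces VER-Bench, a new evaluation framework that assesses the ability of multimodal large language models (MLLMs) to identify fine-grained visual clues in images and combine them with world knowledge for complex reasoning.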
Details
Motivation: Existing benchmarks fall mainly into basic perception benchmarks and mainstream reasoning benchmarks, and neither adequately assesses subtle clues that require intricate analysis. Profound visual understanding and complex reasoning depend more on interpreting subtle, inconspicuous local details than on recognizing salient, macro-level objects. Method: The authors build VER-Bench, a new framework of 374 carefully designed questions, each accompanied by structured evidence: visual clues and the related reasoning. Result: VER-Bench evaluates MLLMs' ability to identify fine-grained visual clues (occupying on average only 0.25% of the image area) and to integrate them with world knowledge for complex reasoning. Conclusion: VER-Bench reveals the limitations of current MLLMs in extracting subtle visual evidence and constructing evidence-based arguments, underscoring the need to strengthen fine-grained visual evidence extraction, integration, and reasoning. Abstract: With the rapid development of MLLMs, evaluating their visual capabilities has become increasingly crucial. Current benchmarks primarily fall into two main types: basic perception benchmarks, which focus on local details but lack deep reasoning (e.g., "what is in the image?"), and mainstream reasoning benchmarks, which concentrate on prominent image elements but may fail to assess subtle clues requiring intricate analysis. However, profound visual understanding and complex reasoning depend more on interpreting subtle, inconspicuous local details than on perceiving salient, macro-level objects. These details, though occupying minimal image area, often contain richer, more critical information for robust analysis. To bridge this gap, we introduce VER-Bench, a novel framework to evaluate MLLMs' ability to: 1) identify fine-grained visual clues, occupying on average just 0.25% of the image area; 2) integrate these clues with world knowledge for complex reasoning. Comprising 374 carefully designed questions across Geospatial, Temporal, Situational, Intent, System State, and Symbolic reasoning, each question in VER-Bench is accompanied by structured evidence: visual clues and question-related reasoning derived from them. VER-Bench reveals current models' limitations in extracting subtle visual evidence and constructing evidence-based arguments, highlighting the need to enhance models' capabilities in fine-grained visual evidence extraction, integration, and reasoning for genuine visual understanding and human-like analysis. 
Dataset and additional materials are available at https://github.com/verbta/ACMMM-25-Materials.
[49] Dual-Stream Attention with Multi-Modal Queries for Object Detection in Transportation Applications
Noreen Anwar,Guillaume-Alexandre Bilodeau,Wassim Bouachir
Main category: cs.CV
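TL;DR: This paper proposes DAMM, which improves Transformer-based object detection through multi-modal queries and a dual-stream attention mechanism, achieving higher accuracy and efficiency in complex scenes.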
Details
Motivation: Transformer-based object detectors fall short on occlusions, fine-grained localization, and computational efficiency, mainly because of fixed queries and dense attention. Method: The DAMM framework combines appearance-based queries from vision-language models, positional queries using polygonal embeddings, and random learned queries, and adds a dual-stream cross-attention module that refines semantic and spatial features separately. Result: DAMM achieves state-of-the-art average precision (AP) and recall on four challenging datasets. Conclusion: With multi-modal query adaptation and a dual-stream attention module, DAMM improves detection accuracy and efficiency and reaches SOTA performance. Abstract: Transformer-based object detectors often struggle with occlusions, fine-grained localization, and computational inefficiency caused by fixed queries and dense attention. We propose DAMM, Dual-stream Attention with Multi-Modal queries, a novel framework introducing both query adaptation and structured cross-attention for improved accuracy and efficiency. DAMM capitalizes on three types of queries: appearance-based queries from vision-language models, positional queries using polygonal embeddings, and random learned queries for general scene coverage. Furthermore, a dual-stream cross-attention module separately refines semantic and spatial features, boosting localization precision in cluttered scenes. We evaluated DAMM on four challenging benchmarks, and it achieved state-of-the-art performance in average precision (AP) and recall, demonstrating the effectiveness of multi-modal query adaptation and dual-stream attention. Source code is at: https://github.com/DET-LIP/DAMM.
[50] Revealing Temporal Label Noise in Multimodal Hateful Video Classification
Shuonan Yang,Tailin Chen,Rahul Singh,Jiangbei Yue,Jianbo Jiao,Zeyu Fu
Main category: cs.CV
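TL;DR: This study analyzes the impact of label noise in multimodal hateful videos, uses fine-grained analysis to reveal the temporal dynamics of hateful content, and argues that temporally aware models are needed to improve detection robustness and interpretability.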
Details
Motivation: Existing multimodal hateful video detection methods rely on coarse, video-level annotations that ignore the temporal granularity of hateful content, which introduces substantial label noise and calls for finer-grained analysis. Method: Using annotated timestamps, the authors perform a fine-grained analysis of hateful videos from the HateMM and MultiHateClip English datasets, trimming out the explicitly hateful segments and examining them in an exploratory analysis. Result: Time-stamp noise fundamentally alters model decision boundaries and weakens classification confidence, pointing to the context dependency and temporal continuity of hate speech expression. Conclusion: The findings highlight the temporal dynamics of multimodal hateful videos and the need for temporally aware models and benchmarks to improve robustness and interpretability. Abstract: The rapid proliferation of online multimedia content has intensified the spread of hate speech, presenting critical societal and regulatory challenges. While recent work has advanced multimodal hateful video detection, most approaches rely on coarse, video-level annotations that overlook the temporal granularity of hateful content. This introduces substantial label noise, as videos annotated as hateful often contain long non-hateful segments. In this paper, we investigate the impact of such label ambiguity through a fine-grained approach. Specifically, we trim hateful videos from the HateMM and MultiHateClip English datasets using annotated timestamps to isolate explicitly hateful segments. We then conduct an exploratory analysis of these trimmed segments to examine the distribution and characteristics of both hateful and non-hateful content. This analysis highlights the degree of semantic overlap and the confusion introduced by coarse, video-level annotations. Finally, controlled experiments demonstrate that time-stamp noise fundamentally alters model decision boundaries and weakens classification confidence, highlighting the inherent context dependency and temporal continuity of hate speech expression. Our findings provide new insights into the temporal dynamics of multimodal hateful videos and highlight the need for temporally aware models and benchmarks for improved robustness and interpretability. Code and data are available at https://github.com/Multimodal-Intelligence-Lab-MIL/HatefulVideoLabelNoise.
[51] Test-Time Adaptation for Video Highlight Detection Using Meta-Auxiliary Learning and Cross-Modality Hallucinations
Zahidul Islam,Sujoy Paul,Mrigank Rochan
Main category: cs.CV
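TL;DR: This paper proposes Highlight-TTA, a test-time adaptation framework for video highlight detection that dynamically adapts the model to the specific characteristics of each test video and is jointly optimized with an auxiliary cross-modality hallucination task, improving generalization and detection performance.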
Details
Motivation: Existing video highlight detection methods, although advanced, generalize poorly across test videos. They typically apply one generic highlight detection model to every test video, and such a fixed model cannot adapt to the diverse content, styles, or audio and visual quality of new test videos, which degrades highlight detection performance. Method: Highlight-TTA is jointly optimized with an auxiliary task, cross-modality hallucinations, alongside the primary highlight detection task. A meta-auxiliary training scheme enables effective adaptation through the auxiliary task while enhancing the primary task. Result: Experiments show that adding Highlight-TTA to three state-of-the-art highlight detection models across three benchmark datasets improves their performance. Conclusion: Highlight-TTA is a test-time adaptation framework for video highlight detection that dynamically adapts the model during testing to better match the specific characteristics of each test video, improving generalization and highlight detection performance. Abstract: Existing video highlight detection methods, although advanced, struggle to generalize well to all test videos. These methods typically employ a generic highlight detection model for each test video, which is suboptimal as it fails to account for the unique characteristics and variations of individual test videos. Such fixed models do not adapt to the diverse content, styles, or audio and visual qualities present in new, unseen test videos, leading to reduced highlight detection performance. In this paper, we propose Highlight-TTA, a test-time adaptation framework for video highlight detection that addresses this limitation by dynamically adapting the model during testing to better align with the specific characteristics of each test video, thereby improving generalization and highlight detection performance. Highlight-TTA is jointly optimized with an auxiliary task, cross-modality hallucinations, alongside the primary highlight detection task. We utilize a meta-auxiliary training scheme to enable effective adaptation through the auxiliary task while enhancing the primary task. During testing, we adapt the trained model using the auxiliary task on the test video to further enhance its highlight detection performance. Extensive experiments with three state-of-the-art highlight detection models and three benchmark datasets show that the introduction of Highlight-TTA to these models improves their performance, yielding superior results.
[52] Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens
Suchisrit Gangopadhyay,Jung-Hee Kim,Xien Chen,Patrick Rim,Hyoungseob Park,Alex Wong
Main category: cs.CV
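TL;DR: The paper proposes a method for extending foundational monocular depth estimators (FMDEs) to fisheye images without retraining or finetuning; it introduces Calibration Tokens that align the distributions of latent embeddings and achieves superior performance in both indoor and outdoor scenes.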
Details
Motivation: Although foundational monocular depth estimators (FMDEs) are trained on tens of millions of perspective images, they remain sensitive to the covariate shift caused by changes in camera calibration (intrinsic, distortion) parameters, which leads to erroneous depth estimates; an effective way to improve their performance on fisheye images is therefore needed. Method: The paper proposes a self-supervised method that recalibrates perspective images to fisheye images and enforces consistency between their estimates during training, which lets it use existing large-scale perspective image datasets. The method also introduces Calibration Tokens that modulate the latent embeddings for alignment. Result: The proposed method consistently outperforms state-of-the-art methods both indoors and outdoors, using a single set of Calibration Tokens for both. Conclusion: Introducing Calibration Tokens as a light-weight adaptation mechanism effectively aligns the latent embedding distributions of fisheye and perspective images, so FMDEs can be extended to fisheye images without retraining or finetuning. Abstract: We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a light-weight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images, and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs, both indoors and outdoors, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: https://github.com/JungHeeKim29/calibration-token.
[53] Toward Errorless Training ImageNet-1k
Bo Deng,Levi Heath
Main category: cs.CV
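TL;DR: This paper presents a feedforward neural network trained with a new method that achieves high accuracy on the ImageNet dataset, and analyzes what keeps it from reaching 100% accuracy.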
Details
Motivation: Improve image classification accuracy and analyze why the model falls short of 100% accuracy. Method: A feedforward artificial neural network is trained on the ImageNet 2012 contest dataset with the new method of [5]. Result: The model reaches an accuracy rate of 98.3% with a 99.69% Top-1 rate, and on average 285.9 labels are perfectly classified over the 10 batch partitions of the dataset. The best model uses 322,430,160 parameters with 4 decimal places of precision. Conclusion: The model may fall short of 100% accuracy because the dataset contains duplicate images with different labels, a double-labeling problem. Abstract: In this paper, we describe a feedforward artificial neural network trained on the ImageNet 2012 contest dataset [7] with the new method of [5] to an accuracy rate of 98.3% with a 99.69 Top-1 rate, and an average of 285.9 labels that are perfectly classified over the 10 batch partitions of the dataset. The best performing model uses 322,430,160 parameters, with 4 decimal places precision. We conjecture that the reason our model does not achieve a 100% accuracy rate is due to a double-labeling problem, by which there are duplicate images in the dataset with different labels.
[54] Accelerating Conditional Prompt Learning via Masked Image Modeling for Vision-Language Models
Phuoc-Nguyen Bui,Khanh-Binh Nguyen,Hyunseung Choo
Main category: cs.CV
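TL;DR: ProMIM is a lightweight solution that strengthens the adaptation of existing vision-language models by introducing masked image modeling to improve the generalization and robustness of prompt learning.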
Details
Motivation: Vision-language models (VLMs) such as CLIP excel at zero-shot learning but often require resource-intensive training to adapt to new tasks. Prompt learning techniques such as CoOp and CoCoOp adapt efficiently but tend to overfit to known classes, limiting generalization to unseen categories. Method: ProMIM masks only visible image patches and uses the resulting representations to guide prompt generation, improving feature robustness and mitigating overfitting. Result: Experiments show that ProMIM consistently improves generalization on zero-shot and few-shot classification tasks when plugged into existing methods. Conclusion: ProMIM is a plug-and-play framework that enhances conditional prompt learning by integrating masked image modeling (MIM) into existing vision-language model (VLM) pipelines. Abstract: Vision-language models (VLMs) like CLIP excel in zero-shot learning but often require resource-intensive training to adapt to new tasks. Prompt learning techniques, such as CoOp and CoCoOp, offer efficient adaptation but tend to overfit to known classes, limiting generalization to unseen categories. We introduce ProMIM, a plug-and-play framework that enhances conditional prompt learning by integrating masked image modeling (MIM) into existing VLM pipelines. ProMIM leverages a simple yet effective masking strategy to generate robust, instance-conditioned prompts, seamlessly augmenting methods like CoOp and CoCoOp without altering their core architectures. By masking only visible image patches and using these representations to guide prompt generation, ProMIM improves feature robustness and mitigates overfitting, all while introducing negligible additional computational cost. Extensive experiments across zero-shot and few-shot classification tasks demonstrate that ProMIM consistently boosts generalization performance when plugged into existing approaches, providing a practical, lightweight solution for real-world vision-language applications.
[55] TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring
Zhu Xu,Ting Lei,Zhimin Li,Guan Wang,Qingchao Chen,Yuxin Peng,Yang liu
Main category: cs.CV
TL;DR: This paper proposes TRKT, a Temporal-enhanced Relation-aware Knowledge Transferring method for Weakly Supervised Dynamic Scene Graph Generation (WS-DSGG), which improves object detection and localization in dynamic video scenes by leveraging motion- and relation-aware knowledge mining and fusion techniques.
Details
Motivation: Existing Weakly Supervised Dynamic Scene Graph Generation (WS-DSGG) methods rely on external object detectors trained on static images, which perform poorly in dynamic, relation-aware scenarios due to inaccurate localization and low-confidence proposals. This work aims to address these limitations by enhancing detection through knowledge transfer tailored for dynamic scenes. Method: The proposed Temporal-enhanced Relation-aware Knowledge Transferring (TRKT) method includes two components: (1) Relation-aware knowledge mining with object and relation class decoders and an Inter-frame Attention Augmentation strategy leveraging optical flow, and (2) a Dual-stream Fusion Module that integrates category-specific attention maps into external detections to refine object localization and improve confidence scores. Result: Extensive experiments demonstrate that the TRKT method achieves state-of-the-art performance on the Action Genome dataset, showing its effectiveness in improving object detection and localization in dynamic, relation-aware video scenarios. Conclusion: TRKT achieves state-of-the-art performance on the Action Genome dataset, offering improved object localization and confidence scores in dynamic, relation-aware scenarios by leveraging relation- and motion-aware knowledge mining and a dual-stream fusion module. Abstract: Dynamic Scene Graph Generation (DSGG) aims to create a scene graph for each video frame by detecting objects and predicting their relationships. Weakly Supervised DSGG (WS-DSGG) reduces annotation workload by using an unlocalized scene graph from a single frame per video for training. Existing WS-DSGG methods depend on an off-the-shelf external object detector to generate pseudo labels for subsequent DSGG training. However, detectors trained on static, object-centric images struggle in dynamic, relation-aware scenarios required for DSGG, leading to inaccurate localization and low-confidence proposals. 
To address the challenges posed by external object detectors in WS-DSGG, we propose a Temporal-enhanced Relation-aware Knowledge Transferring (TRKT) method, which leverages knowledge to enhance detection in relation-aware dynamic scenarios. TRKT is built on two key components: (1) Relation-aware knowledge mining: we first employ object and relation class decoders that generate category-specific attention maps to highlight both object regions and interactive areas. Then we propose an Inter-frame Attention Augmentation strategy that exploits optical flow for neighboring frames to enhance the attention maps, making them motion-aware and robust to motion blur. This step yields relation- and motion-aware knowledge mining for WS-DSGG. (2) Dual-stream fusion: we introduce a Dual-stream Fusion Module that integrates category-specific attention maps into external detections to refine object localization and boost confidence scores for object proposals. Extensive experiments demonstrate that TRKT achieves state-of-the-art performance on the Action Genome dataset. Our code is available at https://github.com/XZPKU/TRKT.git.
[56] AdvDINO: Domain-Adversarial Self-Supervised Representation Learning for Spatial Proteomics
Stella Su,Marc Harary,Scott J. Rodig,William Lotter
Main category: cs.CV
TL;DR: AdvDINO is a new domain-adversarial self-supervised learning framework that addresses domain shifts in biomedical imaging and other fields, producing more robust and biologically meaningful representations.
Details
Motivation: The robustness of standard SSL methods to domain shifts remains uncertain, especially in biomedical imaging, where batch effects may mask true biological signals. Method: AdvDINO uses a domain-adversarial self-supervised learning framework that incorporates a gradient reversal layer into the DINOv2 architecture to achieve domain-invariant feature learning. Result: In the analysis of more than 5.46 million mIF image blocks, AdvDINO discovered phenotype clusters with different proteomic profiles and prognostic significance, and improved survival prediction in attention-based multiple instance learning. Conclusion: AdvDINO is a widely applicable framework that effectively addresses domain shifts in self-supervised learning, particularly in biomedical imaging, but also in areas such as radiology, remote sensing, and autonomous driving. Abstract: Self-supervised learning (SSL) has emerged as a powerful approach for learning visual representations without manual annotations. However, the robustness of standard SSL methods to domain shift -- systematic differences across data sources -- remains uncertain, posing an especially critical challenge in biomedical imaging where batch effects can obscure true biological signals. We present AdvDINO, a domain-adversarial self-supervised learning framework that integrates a gradient reversal layer into the DINOv2 architecture to promote domain-invariant feature learning. Applied to a real-world cohort of six-channel multiplex immunofluorescence (mIF) whole slide images from non-small cell lung cancer patients, AdvDINO mitigates slide-specific biases to learn more robust and biologically meaningful representations than non-adversarial baselines. Across $>5.46$ million mIF image tiles, the model uncovers phenotype clusters with distinct proteomic profiles and prognostic significance, and improves survival prediction in attention-based multiple instance learning. 
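The gradient reversal layer mentioned in the abstract can be illustrated with a minimal sketch: the layer is the identity in the forward pass and multiplies incoming gradients by a negative factor in the backward pass, so the feature extractor is pushed to make the domain classifier's job harder. The class below is a framework-free illustration with hand-written forward and backward methods; the name `GradientReversal` and the scaling factor `lam` are assumptions of this sketch, and nothing here reflects how AdvDINO wires the layer into DINOv2.

```python
class GradientReversal:
    """Identity on the forward pass, sign-flipped gradients on the backward pass.

    `lam` (hypothetical name) scales the reversed gradient; a larger value
    penalizes domain-specific features more strongly.
    """

    def __init__(self, lam: float = 1.0) -> None:
        self.lam = lam

    def forward(self, features: list[float]) -> list[float]:
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output: list[float]) -> list[float]:
        # The gradient flowing back to the feature extractor is negated,
        # so minimizing the domain loss downstream maximizes it upstream.
        return [-self.lam * g for g in grad_output]


# Usage: the domain classifier's gradient is reversed before reaching the encoder.
grl = GradientReversal(lam=0.5)
passed = grl.forward([1.0, -2.0])
reversed_grad = grl.backward([0.2, -0.4])
```

In an automatic-differentiation framework the same behavior is normally expressed as a custom operation with an overridden backward rule rather than explicit method calls.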
While demonstrated on mIF data, AdvDINO is broadly applicable to other imaging domains -- including radiology, remote sensing, and autonomous driving -- where domain shift and limited annotated data hinder model generalization and interpretability.
[57] Open-world Point Cloud Semantic Segmentation: A Human-in-the-loop Framework
Peng Zhang,Songru Yang,Jinsheng Sun,Weiqing Li,Zhiyong Su
Main category: cs.CV
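TL;DR: HOW-Seg is a new open-world point cloud semantic segmentation framework that combines sparse human annotations with hierarchical prototype refinement and outperforms existing methods.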
Details
Motivation: Existing methods rely on resource-intensive offline incremental learning or densely annotated support data, which limits their use in practice. Method: Class prototypes are constructed and refined directly on the query data, with sparse human annotations as guidance and a dense conditional random field (CRF) to optimize label assignments. Result: HOW-Seg achieves 85.27% mIoU on S3DIS and 66.37% mIoU on ScanNetv2, significantly outperforming other methods. Conclusion: Using sparse human feedback and a hierarchical prototype disambiguation mechanism, HOW-Seg produces high-quality predictions in open-world point cloud semantic segmentation and surpasses existing methods. Abstract: Open-world point cloud semantic segmentation (OW-Seg) aims to predict point labels of both base and novel classes in real-world scenarios. However, existing methods rely on resource-intensive offline incremental learning or densely annotated support data, limiting their practicality. To address these limitations, we propose HOW-Seg, the first human-in-the-loop framework for OW-Seg. Specifically, we construct class prototypes, the fundamental segmentation units, directly on the query data, avoiding the prototype bias caused by intra-class distribution shifts between the support and query data. By leveraging sparse human annotations as guidance, HOW-Seg enables prototype-based segmentation for both base and novel classes. Considering the lack of granularity of initial prototypes, we introduce a hierarchical prototype disambiguation mechanism to refine ambiguous prototypes, which correspond to annotations of different classes. To further enrich contextual awareness, we employ a dense conditional random field (CRF) upon the refined prototypes to optimize their label assignments. Through iterative human feedback, HOW-Seg dynamically improves its predictions, achieving high-quality segmentation for both base and novel classes. Experiments demonstrate that with sparse annotations (e.g., one-novel-class-one-click), HOW-Seg matches or surpasses the state-of-the-art generalized few-shot segmentation (GFS-Seg) method under the 5-shot setting. 
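The idea of prototype-based segmentation guided by sparse clicks can be sketched as follows: average the features of the few annotated points per class to form a prototype, then label every point with its nearest prototype. This is a deliberately simplified illustration, not HOW-Seg itself: it omits the hierarchical prototype disambiguation and the dense CRF, and the function names, the dictionary-of-clicks input, and the Euclidean distance are all assumptions of the sketch.

```python
import math


def build_prototypes(features: list[list[float]],
                     clicks: dict[int, str]) -> dict[str, list[float]]:
    """Average the feature vectors of clicked points, one prototype per class.

    `clicks` maps a point index to the class label a human assigned to it.
    """
    sums: dict[str, list[float]] = {}
    counts: dict[str, int] = {}
    for index, label in clicks.items():
        vector = features[index]
        accumulator = sums.setdefault(label, [0.0] * len(vector))
        for dim, value in enumerate(vector):
            accumulator[dim] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def segment(features: list[list[float]],
            prototypes: dict[str, list[float]]) -> list[str]:
    """Assign each point the label of its nearest prototype (Euclidean)."""
    return [
        min(prototypes, key=lambda label: math.dist(vector, prototypes[label]))
        for vector in features
    ]


# Usage: four points with toy 2-D features and one click per class.
toy_features = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
toy_prototypes = build_prototypes(toy_features, {0: "floor", 2: "chair"})
toy_labels = segment(toy_features, toy_prototypes)
```

Adding another click only updates one prototype and re-runs the assignment, which is the sense in which iterative feedback can refine predictions in this simplified picture.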
When using advanced backbones (e.g., Stratified Transformer) and denser annotations (e.g., 10 clicks per sub-scene), HOW-Seg achieves 85.27% mIoU on S3DIS and 66.37% mIoU on ScanNetv2, significantly outperforming alternatives.
[58] UGOD: Uncertainty-Guided Differentiable Opacity and Soft Dropout for Enhanced Sparse-View 3DGS
Zhihao Guo,Peng Wang,Zidong Chen,Xiangyu Kong,Yan Lyu,Guanyu Gao,Liangxiu Han
Main category: cs.CV
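TL;DR: This paper proposes an adaptive Gaussian weighting method based on learned uncertainty that improves the rendering quality of 3D Gaussian Splatting under sparse views.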
Details
Motivation: Conventional 3D Gaussian Splatting methods weight all Gaussians equally during rendering, which makes them prone to overfitting, especially under sparse views; an adaptive weighting mechanism is therefore needed to improve rendering quality. Method: A learned uncertainty guides the differentiable update of Gaussian opacity and undergoes soft differentiable dropout regularisation, which turns it into continuous drop probabilities that govern the final Gaussian projection and blending. Result: Extensive experiments on widely used datasets show that the method outperforms existing approaches in sparse-view 3D synthesis, for example a 3.27% PSNR improvement over DropGaussian on the MipNeRF 360 dataset, and achieves high-quality reconstruction with fewer Gaussians on most datasets. Conclusion: The paper proposes an adaptive Gaussian weighting method based on learned uncertainty to address overfitting of 3D Gaussian Splatting under sparse views, and experiments confirm its advantage on sparse-view 3D synthesis. Abstract: 3D Gaussian Splatting (3DGS) has become a competitive approach for novel view synthesis (NVS) due to its advanced rendering efficiency through 3D Gaussian projection and blending. However, Gaussians are treated as equally weighted for rendering in most 3DGS methods, making them prone to overfitting, which is particularly the case in sparse-view scenarios. To address this, we investigate how adaptive weighting of Gaussians affects rendering quality, which we characterise through learned uncertainties. This learned uncertainty serves two key purposes: first, it guides the differentiable update of Gaussian opacity while preserving the 3DGS pipeline integrity; second, the uncertainty undergoes soft differentiable dropout regularisation, which strategically transforms the original uncertainty into continuous drop probabilities that govern the final Gaussian projection and blending process for rendering. Extensive experimental results over widely adopted datasets demonstrate that our method outperforms rivals in sparse-view 3D synthesis, achieving higher quality reconstruction with fewer Gaussians in most datasets compared to existing sparse-view approaches, e.g., compared to DropGaussian, our method achieves 3.27% PSNR improvements on the MipNeRF 360 dataset.
[59] CSRAP: Enhanced Canvas Attention Scheduling for Real-Time Mission Critical Perception
Md Iftekharul Islam Sakib,Yigong Hu,Tarek Abdelzaher
Main category: cs.CV
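TL;DR: This paper proposes an improved canvas-based attention scheduling method that uses canvas frames of variable size and frame rate to achieve a better balance between object detection performance and resource consumption on edge platforms.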
Details
Motivation: A core challenge on edge platforms is running high-resolution object detection under limited computing resources and strict latency constraints. The paper aims to reduce the resource demands of the perception subsystem by improving the attention scheduling strategy. Method: A canvas-based attention scheduling mechanism consolidates areas of interest onto smaller canvas frames for processing, and the paper adds variable-size canvas frames and selectable canvas frame rates to improve resource utilization. Result: Experiments running YOLOv11 on an NVIDIA Jetson Orin Nano show that the proposed method achieves higher mean average precision (mAP) and recall than the state of the art while meeting real-time requirements. Conclusion: Variable-size canvas frames and selectable frame rates improve the quality/cost trade-off of real-time perception on edge platforms, yielding higher mAP and recall than the state of the art. Abstract: Real-time perception on edge platforms faces a core challenge: executing high-resolution object detection under stringent latency constraints on limited computing resources. Canvas-based attention scheduling was proposed in earlier work as a mechanism to reduce the resource demands of perception subsystems. It consolidates areas of interest in an input data frame onto a smaller area, called a canvas frame, that can be processed at the requisite frame rate. This paper extends prior canvas-based attention scheduling literature by (i) allowing for variable-size canvas frames and (ii) employing selectable canvas frame rates that may depart from the original data frame rate. We evaluate our solution by running YOLOv11, as the perception module, on an NVIDIA Jetson Orin Nano to inspect video frames from the Waymo Open Dataset. Our results show that the additional degrees of freedom improve the attainable quality/cost trade-offs, thereby allowing for a consistently higher mean average precision (mAP) and recall with respect to the state of the art.
[60] Steering One-Step Diffusion Model with Fidelity-Rich Decoder for Fast Image Compression
Zheng Chen,Mingde Zhou,Jinpei Guo,Jiale Yuan,Yifei Ji,Yulun Zhang
Main category: cs.CV
TL;DR: SODEC is a faster and more efficient image compression method that addresses the limitations of traditional diffusion models by using single-step decoding and fidelity improvements.
Details
Motivation: To overcome the drawbacks of traditional diffusion-based image compression, which include excessive decoding latency and poor fidelity. Method: SODEC uses a pre-trained VAE-based model to generate informative latents, replaces iterative denoising with single-step decoding, and introduces a fidelity guidance module and rate annealing training strategy. Result: SODEC achieves superior rate-distortion-perception performance and improves decoding speed by more than 20x compared to previous diffusion-based models. Conclusion: SODEC is a novel single-step diffusion image compression model that outperforms existing methods in rate-distortion-perception performance and decoding speed. Abstract: Diffusion-based image compression has demonstrated impressive perceptual performance. However, it suffers from two critical drawbacks: (1) excessive decoding latency due to multi-step sampling, and (2) poor fidelity resulting from over-reliance on generative priors. To address these issues, we propose SODEC, a novel single-step diffusion image compression model. We argue that in image compression, a sufficiently informative latent renders multi-step refinement unnecessary. Based on this insight, we leverage a pre-trained VAE-based model to produce latents with rich information, and replace the iterative denoising process with a single-step decoding. Meanwhile, to improve fidelity, we introduce the fidelity guidance module, encouraging output that is faithful to the original image. Furthermore, we design the rate annealing training strategy to enable effective training under extremely low bitrates. Extensive experiments show that SODEC significantly outperforms existing methods, achieving superior rate-distortion-perception performance. Moreover, compared to previous diffusion-based compression models, SODEC improves decoding speed by more than 20$\times$. 
Code is released at: https://github.com/zhengchen1999/SODEC.
[61] Propagating Sparse Depth via Depth Foundation Model for Out-of-Distribution Depth Completion
Shenglun Chen,Xinzhu Ma,Hong Zhang,Haojie Li,Zhihui Wang
Main category: cs.CV
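TL;DR: This paper proposes a depth completion framework built on a depth foundation model, combined with dual-space propagation and a correction module, that generalizes strongly to out-of-distribution data without large-scale training.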
Details
Motivation: Existing learning-based depth completion models rely on carefully prepared but limited data, and their performance degrades significantly in out-of-distribution (OOD) scenarios. Foundation models trained at large scale perform very well in monocular depth estimation, so using them to make depth completion models more robust is a promising solution. Method: The paper uses a depth foundation model to extract environmental cues, including structural and semantic context, from RGB images to guide the propagation of sparse depth into missing regions. It also designs a parameter-free dual-space propagation approach that propagates sparse depth in both 3D and 2D spaces, and introduces a learnable correction module that progressively adjusts the depth prediction. Result: The model is trained on the NYUv2 and KITTI datasets and evaluated extensively on 16 other datasets; it performs well in OOD scenarios and surpasses existing methods. Conclusion: The paper proposes a new depth completion framework that leverages a depth foundation model to achieve notable robustness without large-scale training, performing well in OOD scenarios and outperforming existing state-of-the-art methods. Abstract: Depth completion is a pivotal challenge in computer vision, aiming at reconstructing the dense depth map from a sparse one, typically with a paired RGB image. Existing learning-based models rely on carefully prepared but limited data, leading to significant performance degradation in out-of-distribution (OOD) scenarios. Recent foundation models have demonstrated exceptional robustness in monocular depth estimation through large-scale training, and using such models to enhance the robustness of depth completion models is a promising solution. In this work, we propose a novel depth completion framework that leverages depth foundation models to attain remarkable robustness without large-scale training. Specifically, we leverage a depth foundation model to extract environmental cues, including structural and semantic context, from RGB images to guide the propagation of sparse depth information into missing regions. We further design a dual-space propagation approach, without any learnable parameters, to effectively propagate sparse depth in both 3D and 2D spaces to maintain geometric structure and local consistency. To refine the intricate structure, we introduce a learnable correction module to progressively adjust the depth prediction towards the real depth. We train our model on the NYUv2 and KITTI datasets as in-distribution datasets and extensively evaluate the framework on 16 other datasets. Our framework performs remarkably well in the OOD scenarios and outperforms existing state-of-the-art depth completion methods. 
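To make the "3D and 2D spaces" in the abstract concrete, the sketch below shows the standard pinhole-camera relation that links them: a pixel with a known depth back-projects to a 3D point, and a 3D point projects back to a pixel. It is only background for reading the abstract, not the paper's propagation method; the function names and the intrinsics `fx`, `fy`, `cx`, `cy` (focal lengths and principal point in pixels) are assumptions of this sketch, and lens distortion is ignored.

```python
def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> tuple[float, float, float]:
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def project(point: tuple[float, float, float],
            fx: float, fy: float, cx: float, cy: float) -> tuple[float, float]:
    """Project a 3D camera-frame point (with z > 0) back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


# Usage: a sparse depth sample 100 pixels right of the principal point, 2 m away.
sample_point = backproject(420.0, 240.0, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
sample_pixel = project(sample_point, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Neighbors in 3D defined this way respect scene geometry, while neighbors on the 2D pixel grid respect image locality, which is one plausible reading of why a method would propagate in both spaces.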
Our models are released at https://github.com/shenglunch/PSD.[62] Unified modality separation: A vision-language framework for unsupervised domain adaptation
Xinyao Li,Jingjing Li,Zhekai Du,Lei Zhu,Heng Tao Shen
Main category: cs.CV
TL;DR: This paper introduces a modality separation framework to address the modality gap in unsupervised domain adaptation for vision-language models, achieving better performance and efficiency.
Details
Motivation: The motivation stems from the limitations of current unsupervised domain adaptation (UDA) methods when applied to vision-language models (VLMs), particularly due to the inherent modality gap. This gap restricts transfer to modality-invariant knowledge only, resulting in suboptimal performance. Method: A unified modality separation framework was developed to disentangle modality-specific and modality-invariant features from vision-language models (VLMs). These components were processed separately and combined using modality-adaptive ensemble weights during testing. A modality discrepancy metric was also designed to classify samples into modality-invariant, modality-specific, and uncertain categories. Result: The proposed method achieves up to a 9% performance improvement with 9 times the computational efficiency. Extensive experiments confirmed its effectiveness across various backbones, datasets, and adaptation settings. Conclusion: The proposed unified modality separation framework effectively addresses the modality gap issue in unsupervised domain adaptation by separately handling modality-specific and modality-invariant components, leading to improved performance. Abstract: Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains. Recently, pre-trained vision-language models (VLMs) have demonstrated promising zero-shot performance by leveraging semantic information to facilitate target tasks. By aligning vision and text embeddings, VLMs have shown notable success in bridging domain gaps. However, inherent differences naturally exist between modalities, known as the modality gap. Our findings reveal that direct UDA in the presence of the modality gap only transfers modality-invariant knowledge, leading to suboptimal target performance. 
To address this limitation, we propose a unified modality separation framework that accommodates both modality-specific and modality-invariant components. During training, different modality components are disentangled from VLM features and then handled separately in a unified manner. At test time, modality-adaptive ensemble weights are automatically determined to maximize the synergy of different components. To evaluate instance-level modality characteristics, we design a modality discrepancy metric to categorize samples into modality-invariant, modality-specific, and uncertain ones. The modality-invariant samples are exploited to facilitate cross-modal alignment, while uncertain ones are annotated to enhance model capabilities. Building upon prompt tuning techniques, our methods achieve up to 9% performance gain with 9 times the computational efficiency. Extensive experiments and analysis across various backbones, baselines, datasets and adaptation settings demonstrate the efficacy of our design.[63] Modeling Rapid Contextual Learning in the Visual Cortex with Fast-Weight Deep Autoencoder Networks
Yue Li,Weifan Wang,Tai Sing Lee
Main category: cs.CV
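TL;DR: This study uses a Vision Transformer (ViT)-based autoencoder to explore how familiarity training induces sensitivity to global context in the early layers of a deep neural network, and hypothesizes that rapid learning operates via fast weights that encode transient or short-term memory traces.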
Details
Motivation: Recent neurophysiological studies show that the early visual cortex can rapidly learn global image context, a phenomenon attributed primarily to local recurrent interactions rather than changes in feedforward or feedback pathways. The authors investigate, from a functional perspective, how familiarity training affects sensitivity to global context in the early layers of a deep neural network. Method: The study employs a Vision Transformer (ViT)-based autoencoder and explores using Low-Rank Adaptation (LoRA) to implement fast weights within each Transformer layer, modeling the rapid learning process. Result: The results show that (1) the self-attention mechanism of the proposed ViT autoencoder performs a manifold transform similar to a neural circuit model of the familiarity effect; (2) familiarity training aligns latent representations in early layers with those in the top layer, which contains global context information; (3) familiarity training broadens the self-attention scope within the remembered image context; and (4) LoRA-based fast weights significantly amplify these effects. Conclusion: Familiarity training induces sensitivity to global context in the early layers of a hierarchical network, and a hybrid fast-and-slow weight architecture may provide a viable computational model for studying rapid global context learning in the brain. Abstract: Recent neurophysiological studies have revealed that the early visual cortex can rapidly learn global image context, as evidenced by a sparsification of population responses and a reduction in mean activity when exposed to familiar versus novel image contexts. This phenomenon has been attributed primarily to local recurrent interactions, rather than changes in feedforward or feedback pathways, supported by both empirical findings and circuit-level modeling. Recurrent neural circuits capable of simulating these effects have been shown to reshape the geometry of neural manifolds, enhancing robustness and invariance to irrelevant variations. In this study, we employ a Vision Transformer (ViT)-based autoencoder to investigate, from a functional perspective, how familiarity training can induce sensitivity to global context in the early layers of a deep neural network. We hypothesize that rapid learning operates via fast weights, which encode transient or short-term memory traces, and we explore the use of Low-Rank Adaptation (LoRA) to implement such fast weights within each Transformer layer. Our results show that (1) The proposed ViT-based autoencoder's self-attention circuit performs a manifold transform similar to a neural circuit model of the familiarity effect. (2) Familiarity training aligns latent representations in early layers with those in the top layer that contains global context information. (3) Familiarity training broadens the self-attention scope within the remembered image context. 
(4) These effects are significantly amplified by LoRA-based fast weights. Together, these findings suggest that familiarity training introduces global sensitivity to earlier layers in a hierarchical network, and that a hybrid fast-and-slow weight architecture may provide a viable computational model for studying rapid global context learning in the brain.[64] Attribute Guidance With Inherent Pseudo-label For Occluded Person Re-identification
Rui Zhi,Zhen Yang,Haiyang Zhang
Main category: cs.CV
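TL;DR: This paper proposes AG-ReID, a new person re-identification framework that uses the inherent capabilities of pre-trained models to extract fine-grained semantic attributes, addressing partial occlusion and subtle appearance differences and achieving state-of-the-art performance on multiple datasets.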
Details
Motivation: Although pre-trained vision-language models have proven effective in Re-ID tasks, they face significant challenges in occluded scenarios because they focus on holistic image semantics while neglecting fine-grained attribute information. This limitation is especially evident when handling partially occluded pedestrians or distinguishing individuals with subtle appearance differences. Method: The AG-ReID framework has two stages: it first generates attribute pseudo-labels that capture subtle visual characteristics, then introduces a dual-guidance mechanism that combines holistic and fine-grained attribute information to enhance image feature extraction. Result: AG-ReID achieves state-of-the-art results on multiple widely used Re-ID datasets, significantly improving the handling of occlusions and subtle attribute differences while maintaining competitive performance in standard Re-ID scenarios. Conclusion: Through a dual-guidance mechanism that combines holistic and fine-grained attribute information, AG-ReID achieves state-of-the-art results on multiple widely used Re-ID datasets, significantly improving the handling of occlusions and subtle attribute differences while remaining competitive in standard Re-ID scenarios. Abstract: Person re-identification (Re-ID) aims to match person images across different camera views, with occluded Re-ID addressing scenarios where pedestrians are partially visible. While pre-trained vision-language models have shown effectiveness in Re-ID tasks, they face significant challenges in occluded scenarios by focusing on holistic image semantics while neglecting fine-grained attribute information. This limitation becomes particularly evident when dealing with partially occluded pedestrians or when distinguishing between individuals with subtle appearance differences. To address this limitation, we propose Attribute-Guide ReID (AG-ReID), a novel framework that leverages pre-trained models' inherent capabilities to extract fine-grained semantic attributes without additional data or annotations. Our framework operates through a two-stage process: first generating attribute pseudo-labels that capture subtle visual characteristics, then introducing a dual-guidance mechanism that combines holistic and fine-grained attribute information to enhance image feature extraction. Extensive experiments demonstrate that AG-ReID achieves state-of-the-art results on multiple widely-used Re-ID datasets, showing significant improvements in handling occlusions and subtle attribute differences while maintaining competitive performance on standard Re-ID scenarios.[65] CRAM: Large-scale Video Continual Learning with Bootstrapped Compression
Shivani Mall,Joao F. Henriques
Main category: cs.CV
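TL;DR: This paper proposes CRAM, a video continual learning method that trains a video compressor online and refreshes video codes to address high memory requirements, and performs well on large-scale video continual learning benchmarks.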
Details
Motivation: Video continual learning faces high memory requirements, especially for long videos and continual streams, which conflict with common rehearsal-buffer size constraints. Method: The paper proposes CRAM, a method that trains a video compressor online and refreshes video codes to counter catastrophic forgetting. Result: By storing compressed video codes instead of raw inputs, CRAM outperforms prior art on large-scale video continual learning benchmarks such as EpicKitchens-100 and Kinetics-700, with a significantly reduced memory footprint. Conclusion: Experiments show that CRAM outperforms existing methods on video continual learning tasks while reducing memory usage. Abstract: Continual learning (CL) promises to allow neural networks to learn from continuous streams of inputs, instead of IID (independent and identically distributed) sampling, which requires random access to a full dataset. This would allow for much smaller storage requirements and self-sufficiency of deployed systems that cope with natural distribution shifts, similarly to biological learning. We focus on video CL employing a rehearsal-based approach, which reinforces past samples from a memory buffer. We posit that part of the reason why practical video CL is challenging is the high memory requirements of video, further exacerbated by long videos and continual streams, which are at odds with the common rehearsal-buffer size constraints. To address this, we propose to use compressed vision, i.e. store video codes (embeddings) instead of raw inputs, and train a video classifier by IID sampling from this rolling buffer. Training a video compressor online (so not depending on any pre-trained networks) means that it is also subject to catastrophic forgetting. We propose a scheme to deal with this forgetting by refreshing video codes, which requires careful decompression with a previous version of the network and recompression with a new one. We name our method Continually Refreshed Amodal Memory (CRAM). We expand current video CL benchmarks to large-scale settings, namely EpicKitchens-100 and Kinetics-700, storing thousands of relatively long videos in under 2 GB, and demonstrate empirically that our video CL method outperforms prior art with a significantly reduced memory footprint.[66] Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation
Xusheng Liang,Lihua Zhou,Nianxin Li,Miao Xu,Ziyang Song,Dong Yi,Jinlin Wu,Hongbin Liu,Jiebo Luo,Zhen Lei
Main category: cs.CV
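TL;DR: This paper proposes MCDRL, a new framework that combines causal inference with vision-language models (VLMs) to address domain generalization in medical image segmentation.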
Details
Motivation: Because medical image data are highly variable and complex, applying vision-language models (VLMs) to medical imaging is challenging; in particular, significant domain shifts caused by confounders such as equipment differences, procedure artifacts, and imaging modes lead to poor generalization on unseen domains. Method: The MCDRL framework has two steps: first, it uses CLIP's cross-modal capabilities to identify candidate lesion regions through text prompts and to build a confounder dictionary that represents domain-specific variations; second, it trains a causal intervention network that uses this dictionary to identify and eliminate the influence of these domain-specific variations while preserving the anatomical structural information critical to segmentation. Result: Extensive experiments show that MCDRL consistently outperforms existing methods on medical image segmentation, with higher segmentation accuracy and robust generalization. Conclusion: MCDRL offers an effective solution to domain generalization in medical image segmentation, combining the strengths of causal inference and VLMs to improve performance on unseen domains. Abstract: Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot capabilities in various computer vision tasks. However, their application to medical imaging remains challenging due to the high variability and complexity of medical data. Specifically, medical images often exhibit significant domain shifts caused by various confounders, including equipment differences, procedure artifacts, and imaging modes, which can lead to poor generalization when models are applied to unseen domains. To address this limitation, we propose Multimodal Causal-Driven Representation Learning (MCDRL), a novel framework that integrates causal inference with the VLM to tackle domain generalization in medical image segmentation. MCDRL is implemented in two steps: first, it leverages CLIP's cross-modal capabilities to identify candidate lesion regions and construct a confounder dictionary through text prompts, specifically designed to represent domain-specific variations; second, it trains a causal intervention network that utilizes this dictionary to identify and eliminate the influence of these domain-specific variations while preserving the anatomical structural information critical for segmentation tasks. Extensive experiments demonstrate that MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.[67] AU-IQA: A Benchmark Dataset for Perceptual Quality Assessment of AI-Enhanced User-Generated Content
Shushi Wang,Chunyi Li,Zicheng Zhang,Han Zhou,Wei Dong,Jun Chen,Guangtao Zhai,Xiaohong Liu
Main category: cs.CV
TL;DR: This paper introduces AU-IQA, a new dataset for evaluating perceptual quality assessment models on AI-enhanced user-generated content, showing that current methods are insufficient.
Details
Motivation: The motivation is to address the lack of specialized quality assessment models for AI-enhanced UGC, which limits user experience and method advancement. Method: The researchers constructed a benchmark dataset named AU-IQA, containing 4,800 AI-UGC images, and evaluated various quality assessment models on this dataset. Result: The evaluation of existing models on the AU-IQA dataset revealed gaps in their effectiveness for assessing the perceptual quality of AI-UGC. Conclusion: The study concludes that current perceptual quality assessment models do not perform effectively on AI-enhanced UGC, highlighting the need for specialized models. Abstract: AI-based image enhancement techniques have been widely adopted in various visual applications, significantly improving the perceptual quality of user-generated content (UGC). However, the lack of specialized quality assessment models has become a significant limiting factor in this field, limiting user experience and hindering the advancement of enhancement methods. While perceptual quality assessment methods have shown strong performance on UGC and AIGC individually, their effectiveness on AI-enhanced UGC (AI-UGC) which blends features from both, remains largely unexplored. To address this gap, we construct AU-IQA, a benchmark dataset comprising 4,800 AI-UGC images produced by three representative enhancement types which include super-resolution, low-light enhancement, and denoising. On this dataset, we further evaluate a range of existing quality assessment models, including traditional IQA methods and large multimodal models. Finally, we provide a comprehensive analysis of how well current approaches perform in assessing the perceptual quality of AI-UGC. The access link to the AU-IQA is https://github.com/WNNGGU/AU-IQA-Dataset.[68] Skin-SOAP: A Weakly Supervised Framework for Generating Structured SOAP Notes
Sadia Kamal,Tim Oates,Joy Wan
Main category: cs.CV
TL;DR: The paper introduces skin-SOAP, a weakly supervised multimodal framework that automates the generation of structured SOAP notes for skin carcinoma patients, reducing clinician workload and achieving performance comparable to state-of-the-art language models using novel evaluation metrics.
Details
Motivation: Manually generating detailed SOAP notes is labor-intensive and contributes to clinician burnout. There is a need for efficient, accurate, and scalable methods to produce clinically structured documentation to improve healthcare outcomes and reduce costs in the context of widespread skin carcinoma. Method: The paper proposes a weakly supervised multimodal framework called skin-SOAP that generates structured SOAP notes using limited inputs such as lesion images and sparse clinical text. It introduces two novel evaluation metrics, MedConceptEval and Clinical Coherence Score (CCS), to assess the semantic alignment and clinical relevance of the generated notes. Result: The skin-SOAP method achieves performance comparable to advanced language models like GPT-4o, Claude, and DeepSeek Janus Pro across key clinical relevance metrics. It successfully reduces reliance on large annotated datasets while maintaining clinical accuracy and coherence. Conclusion: The skin-SOAP framework effectively generates structured SOAP notes with minimal manual annotation, offering scalable and clinically grounded documentation that reduces clinician burden and performs comparably to advanced language models in clinical relevance metrics. Abstract: Skin carcinoma is the most prevalent form of cancer globally, accounting for over $8 billion in annual healthcare expenditures. Early diagnosis, accurate and timely treatment are critical to improving patient survival rates. In clinical settings, physicians document patient visits using detailed SOAP (Subjective, Objective, Assessment, and Plan) notes. However, manually generating these notes is labor-intensive and contributes to clinician burnout. In this work, we propose skin-SOAP, a weakly supervised multimodal framework to generate clinically structured SOAP notes from limited inputs, including lesion images and sparse clinical text. 
Our approach reduces reliance on manual annotations, enabling scalable, clinically grounded documentation while alleviating clinician burden and reducing the need for large annotated data. Our method achieves performance comparable to GPT-4o, Claude, and DeepSeek Janus Pro across key clinical relevance metrics. To evaluate this clinical relevance, we introduce two novel metrics MedConceptEval and Clinical Coherence Score (CCS) which assess semantic alignment with expert medical concepts and input features, respectively.[69] A Novel Image Similarity Metric for Scene Composition Structure
Md Redwanul Haque,Manzur Murshed,Manoranjan Paul,Tsz-Kwan Lee
Main category: cs.CV
TL;DR: The paper introduces SCSSIM, a new training-free metric for evaluating the preservation of Scene Composition Structure (SCS) in images generated by AI models, showing better performance than existing methods.
Details
Motivation: Traditional image similarity metrics often fall short in assessing Scene Composition Structure (SCS), which is critical for ensuring faithful and structurally accurate GenAI outputs. Pixel-level approaches are overly sensitive to minor visual noise, while perception-based metrics prioritize human aesthetic appeal, neither adequately capturing structural fidelity. Method: The SCS Similarity Index Measure (SCSSIM) is a novel, analytical, and training-free metric that quantifies SCS preservation by exploiting statistical measures derived from the Cuboidal hierarchical partitioning of images. Result: SCSSIM demonstrates high invariance to non-compositional distortions and a strong monotonic decrease for compositional distortions, accurately reflecting unchanged and altered SCS respectively. Conclusion: SCSSIM exhibits superior properties for structural evaluation, making it an invaluable tool for developing and evaluating generative models, ensuring the integrity of scene composition. Abstract: The rapid advancement of generative AI models necessitates novel methods for evaluating image quality that extend beyond human perception. A critical concern for these models is the preservation of an image's underlying Scene Composition Structure (SCS), which defines the geometric relationships among objects and the background, their relative positions, sizes, orientations, etc. Maintaining SCS integrity is paramount for ensuring faithful and structurally accurate GenAI outputs. Traditional image similarity metrics often fall short in assessing SCS. Pixel-level approaches are overly sensitive to minor visual noise, while perception-based metrics prioritize human aesthetic appeal, neither adequately capturing structural fidelity. Furthermore, recent neural-network-based metrics introduce training overheads and potential generalization issues. 
We introduce the SCS Similarity Index Measure (SCSSIM), a novel, analytical, and training-free metric that quantifies SCS preservation by exploiting statistical measures derived from the Cuboidal hierarchical partitioning of images, robustly capturing non-object-based structural relationships. Our experiments demonstrate SCSSIM's high invariance to non-compositional distortions, accurately reflecting unchanged SCS. Conversely, it shows a strong monotonic decrease for compositional distortions, precisely indicating when SCS has been altered. Compared to existing metrics, SCSSIM exhibits superior properties for structural evaluation, making it an invaluable tool for developing and evaluating generative models, ensuring the integrity of scene composition.[70] HAMoBE: Hierarchical and Adaptive Mixture of Biometric Experts for Video-based Person ReID
Yiyang Su,Yunping Shi,Feng Liu,Xiaoming Liu
Main category: cs.CV
TL;DR: This paper proposes HAMoBE, a novel framework for video-based person re-identification that adaptively integrates biometric features and dynamically adjusts expert contributions, resulting in significant performance improvements.
Details
Motivation: Existing video-based ReID methods often overlook the necessity of identifying and selecting the most discriminative features for effective matching in dynamic environments. Method: The HAMoBE framework uses multi-layer features from a pre-trained large model, models key biometric features (appearance, body shape, gait), and integrates them using a dual-input decision gating network. Result: Extensive evaluations on benchmarks like MEVID show significant performance improvements, e.g., +13.0% Rank-1 accuracy. Conclusion: The proposed HAMoBE framework significantly improves the performance of video-based person re-identification by adaptively integrating key biometric features and dynamically adjusting expert contributions. Abstract: Recently, research interest in person re-identification (ReID) has increasingly focused on video-based scenarios, which are essential for robust surveillance and security in varied and dynamic environments. However, existing video-based ReID methods often overlook the necessity of identifying and selecting the most discriminative features from both videos in a query-gallery pair for effective matching. To address this issue, we propose a novel Hierarchical and Adaptive Mixture of Biometric Experts (HAMoBE) framework, which leverages multi-layer features from a pre-trained large model (e.g., CLIP) and is designed to mimic human perceptual mechanisms by independently modeling key biometric features--appearance, static body shape, and dynamic gait--and adaptively integrating them. Specifically, HAMoBE includes two levels: the first level extracts low-level features from multi-layer representations provided by the frozen large model, while the second level consists of specialized experts focusing on long-term, short-term, and temporal features. To ensure robust matching, we introduce a new dual-input decision gating network that dynamically adjusts the contributions of each expert based on their relevance to the input scenarios. 
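The dual-input gating idea described above can be illustrated with a minimal sketch. It assumes a linear gate over the concatenated query and gallery features followed by a softmax, and a weighted sum of per-expert similarity scores; the function names, the linear form of the gate, and the fusion rule are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(query_feat, gallery_feat, gate_matrix):
    """Hypothetical dual-input gate: one score per expert, computed from
    the concatenated query and gallery features (an assumed linear gate)."""
    pair = list(query_feat) + list(gallery_feat)
    scores = [sum(w * x for w, x in zip(row, pair)) for row in gate_matrix]
    return softmax(scores)


def fused_similarity(expert_sims, weights):
    """Combine per-expert similarity scores with the gate weights."""
    return sum(w * s for w, s in zip(weights, expert_sims))


# Three assumed experts (appearance, body shape, gait) and 2-d toy features.
weights = gate_weights([1.0, 0.0], [0.0, 1.0], [[0.0] * 4 for _ in range(3)])
score = fused_similarity([0.9, 0.4, 0.2], weights)
```

With an all-zero gate matrix, as in the toy call, every expert receives the same weight and the fused score is the plain average of the expert scores; a trained gate would shift weight toward whichever expert is most relevant for the given query-gallery pair.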
Extensive evaluations on benchmarks like MEVID demonstrate that our approach yields significant performance improvements (e.g., +13.0% Rank-1 accuracy).[71] Finding Needles in Images: Can Multimodal LLMs Locate Fine Details?
Parth Thakkar,Ankush Agarwal,Prasad Kasu,Pulkit Bansal,Chaitanya Devaguptapu
Main category: cs.CV
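TL;DR: This paper proposes Spot-IT, which uses intelligent patch selection and Gaussian attention to improve the ability of MLLMs to extract fine-grained details from complex documents.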
Details
Motivation: Although MLLMs perform well on document understanding tasks, their ability to locate and reason about fine-grained details in complex documents remains understudied. Method: The authors build a benchmark called NiM and propose Spot-IT, which enhances MLLMs through intelligent patch selection and Gaussian attention. Result: Experiments show that Spot-IT improves significantly over baseline methods, particularly when extracting precise details from documents with complex layouts. Conclusion: Spot-IT effectively improves the ability of MLLMs to extract details from complex documents, especially in scenarios that require precise detail extraction. Abstract: While Multi-modal Large Language Models (MLLMs) have shown impressive capabilities in document understanding tasks, their ability to locate and reason about fine-grained details within complex documents remains understudied. Consider searching a restaurant menu for a specific nutritional detail or identifying a disclaimer in a lengthy newspaper article: tasks that demand careful attention to small but significant details within a broader narrative, akin to Finding Needles in Images (NiM). To address this gap, we introduce NiM, a carefully curated benchmark spanning diverse real-world documents including newspapers, menus, and lecture images, specifically designed to evaluate MLLMs' capability in these intricate tasks. Building on this, we further propose Spot-IT, a simple yet effective approach that enhances MLLMs' capability through intelligent patch selection and Gaussian attention, motivated by how humans zoom and focus when searching documents. Our extensive experiments reveal both the capabilities and limitations of current MLLMs in handling fine-grained document understanding tasks, while demonstrating the effectiveness of our approach. Spot-IT achieves significant improvements over baseline methods, particularly in scenarios requiring precise detail extraction from complex layouts.[72] DualMat: PBR Material Estimation via Coherent Dual-Path Diffusion
Yifeng Huang,Zhang Chen,Yi Xu,Minh Hoai,Zhong Li
Main category: cs.CV
TL;DR: DualMat is a novel dual-path diffusion model that efficiently estimates PBR materials from single images, achieving superior accuracy in albedo and metallic-roughness estimation under complex lighting conditions.
Details
Motivation: Estimating PBR materials from single images under complex lighting is challenging due to the need for precise albedo, metallic, and roughness estimation. Existing methods struggle with coherence and efficiency, motivating the development of a novel dual-path framework. Method: DualMat operates in two latent spaces: an albedo-optimized path using RGB latent space and a material-specialized path for precise metallic and roughness estimation. It uses feature distillation for coherence, rectified flow for efficiency, and extends to high-resolution and multi-view inputs via patch-based estimation and cross-view attention. Result: DualMat achieves state-of-the-art performance on Objaverse and real-world data, with up to 28% improvement in albedo estimation and 39% reduction in metallic-roughness prediction errors. Conclusion: DualMat introduces a dual-path diffusion framework that achieves state-of-the-art performance in estimating PBR materials from single images under complex lighting, showing significant improvements in albedo and metallic-roughness estimation. Abstract: We present DualMat, a novel dual-path diffusion framework for estimating Physically Based Rendering (PBR) materials from single images under complex lighting conditions. Our approach operates in two distinct latent spaces: an albedo-optimized path leveraging pretrained visual knowledge through RGB latent space, and a material-specialized path operating in a compact latent space designed for precise metallic and roughness estimation. To ensure coherent predictions between the albedo-optimized and material-specialized paths, we introduce feature distillation during training. We employ rectified flow to enhance efficiency by reducing inference steps while maintaining quality. Our framework extends to high-resolution and multi-view inputs through patch-based estimation and cross-view attention, enabling seamless integration into image-to-3D pipelines. 
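To illustrate why rectified flow can reduce the number of inference steps, the toy sketch below integrates a velocity field with plain Euler steps from a starting sample at t = 0 towards a target at t = 1. The velocity function is a hand-written stand-in for a learned network, and all names are hypothetical; this is not DualMat's sampler. Because the assumed path is a straight line, the velocity is constant along it and even a single Euler step lands on the target.

```python
def toy_velocity(x, t, target):
    """Hand-written stand-in for a learned velocity network (an assumption).
    On the straight path from the start point to `target`, this evaluates
    to the constant vector target - start."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(x0, target, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so toy_velocity never divides by zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


one_step = euler_sample([0.0, 2.0], [1.0, -1.0], num_steps=1)
four_steps = euler_sample([0.0, 2.0], [1.0, -1.0], num_steps=4)
```

Both calls end at the target up to floating-point error, which is the property that lets straight-path flows trade many small steps for a few large ones; a curved trajectory would not tolerate this.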
DualMat achieves state-of-the-art performance on both Objaverse and real-world data, significantly outperforming existing methods with up to 28% improvement in albedo estimation and 39% reduction in metallic-roughness prediction errors.[73] Decoupling Continual Semantic Segmentation
Yifu Guo,Yuquan Lu,Wentao Zhang,Zishan Xu,Dexia Chen,Siyu Zhang,Yizhe Zhang,Ruixuan Wang
Main category: cs.CV
TL;DR: DecoupleCSS proposes a two-stage framework for continual semantic segmentation that decouples class-aware detection from class-agnostic segmentation, effectively addressing catastrophic forgetting and achieving superior performance.
Details
Motivation: The motivation is to address the challenge of catastrophic forgetting in continual semantic segmentation tasks, where existing methods suffer from interference between old and new class learning due to tightly coupled segmentation masks and class labels. Method: The method uses a two-stage framework: the first stage encodes class-specific information using pre-trained text and image encoders adapted with LoRA to generate location-aware prompts, while the second stage employs the Segment Anything Model (SAM) to produce precise segmentation masks, ensuring shared segmentation knowledge across new and previous classes. Result: DecoupleCSS achieves state-of-the-art performance across various challenging continual semantic segmentation tasks by improving the balance between retention and adaptability. Conclusion: DecoupleCSS introduces a two-stage framework for continual semantic segmentation that effectively balances retention and adaptability by decoupling class-aware detection from class-agnostic segmentation. Abstract: Continual Semantic Segmentation (CSS) requires learning new classes without forgetting previously acquired knowledge, addressing the fundamental challenge of catastrophic forgetting in dense prediction tasks. However, existing CSS methods typically employ single-stage encoder-decoder architectures where segmentation masks and class labels are tightly coupled, leading to interference between old and new class learning and suboptimal retention-plasticity balance. We introduce DecoupleCSS, a novel two-stage framework for CSS. By decoupling class-aware detection from class-agnostic segmentation, DecoupleCSS enables more effective continual learning, preserving past knowledge while learning new classes. The first stage leverages pre-trained text and image encoders, adapted using LoRA, to encode class-specific information and generate location-aware prompts. 
In the second stage, the Segment Anything Model (SAM) is employed to produce precise segmentation masks, ensuring that segmentation knowledge is shared across both new and previous classes. This approach improves the balance between retention and adaptability in CSS, achieving state-of-the-art performance across a variety of challenging tasks. Our code is publicly available at: https://github.com/euyis1019/Decoupling-Continual-Semantic-Segmentation.[74] Automatic Image Colorization with Convolutional Neural Networks and Generative Adversarial Networks
Ruiyu Li,Changyuan Qiu,Hangrui Cao,Qihan Ren,Yuqing Qiu
Main category: cs.CV
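TL;DR: This project addresses image colorization through classification and adversarial learning, building on prior work with modifications and comparisons.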
Details
Motivation: Image colorization is challenging because two image dimensions are lost, but scene semantics and surface texture can provide important color cues. Method: The project formulates colorization as classification and adversarial learning tasks and will build models for experiments. Result: None. Conclusion: The project explores automatic image colorization via classification and adversarial learning, and plans to modify and compare against prior work. Abstract: Image colorization, the task of adding colors to grayscale images, has been the focus of significant research efforts in computer vision in recent years for its various application areas such as color restoration and automatic animation colorization [15, 1]. The colorization problem is challenging as it is highly ill-posed with two out of three image dimensions lost, resulting in large degrees of freedom. However, semantics of the scene as well as the surface texture could provide important cues for colors: the sky is typically blue, the clouds are typically white and the grass is typically green, and there are huge amounts of training data available for learning such priors since any colored image could serve as a training data point [20]. Colorization was initially formulated as a regression task [5], which ignores the multi-modal nature of color prediction. In this project, we explore automatic image colorization via classification and adversarial learning. We will build our models on prior works, apply modifications for our specific scenario and make comparisons.[75] FLUX-Makeup: High-Fidelity, Identity-Consistent, and Robust Makeup Transfer via Diffusion Transformer
Jian Zhu,Shanyuan Liu,Liuzhuozheng Li,Yue Gong,He Wang,Bo Cheng,Yuhang Ma,Liebucha Wu,Xiaoyu Wu,Dawei Leng,Yuhui Yin,Yang Xu
Main category: cs.CV
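TL;DR: This paper proposes FLUX-Makeup, a high-fidelity, identity-consistent, and robust makeup transfer framework that needs no auxiliary face-control components and achieves superior transfer performance from source-reference image pairs.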
Details
Motivation: Existing GAN-based methods rely on carefully designed loss functions, while diffusion-based methods require additional face-control modules; these introduce extra errors and lead to suboptimal transfer results. Method: Building on the FLUX-Kontext framework, the method uses the source image as the native conditional input, introduces RefLoRAInjector, a lightweight makeup feature injector, and designs a robust and scalable data generation pipeline. Result: Experiments show that FLUX-Makeup achieves state-of-the-art performance across diverse scenarios, with strong robustness. Conclusion: FLUX-Makeup is an efficient, high-quality makeup transfer method that removes the dependence of existing methods on auxiliary components and significantly improves transfer quality and applicability. Abstract: Makeup transfer aims to apply the makeup style from a reference face to a target face and has been increasingly adopted in practical applications. Existing GAN-based approaches typically rely on carefully designed loss functions to balance transfer quality and facial identity consistency, while diffusion-based methods often depend on additional face-control modules or algorithms to preserve identity. However, these auxiliary components tend to introduce extra errors, leading to suboptimal transfer results. To overcome these limitations, we propose FLUX-Makeup, a high-fidelity, identity-consistent, and robust makeup transfer framework that eliminates the need for any auxiliary face-control components. Instead, our method directly leverages source-reference image pairs to achieve superior transfer performance. Specifically, we build our framework upon FLUX-Kontext, using the source image as its native conditional input. Furthermore, we introduce RefLoRAInjector, a lightweight makeup feature injector that decouples the reference pathway from the backbone, enabling efficient and comprehensive extraction of makeup-related information. In parallel, we design a robust and scalable data generation pipeline to provide more accurate supervision during training. The paired makeup datasets produced by this pipeline significantly surpass the quality of all existing datasets. Extensive experiments demonstrate that FLUX-Makeup achieves state-of-the-art performance, exhibiting strong robustness across diverse scenarios.[76] AdaFusion: Prompt-Guided Inference with Adaptive Fusion of Pathology Foundation Models
Yuxiang Xiao,Yang Hu,Bin Li,Tianyang Zhang,Zexi Li,Huazhu Fu,Jens Rittscher,Kaixiang Yang
Main category: cs.CV
TL;DR: AdaFusion is a novel prompt-guided inference framework that dynamically integrates complementary knowledge from multiple PFMs, enhancing performance and interpretability in real-world applications.
Details
Motivation: The motivation is to overcome latent biases introduced by diverse yet opaque pre-training contexts of PFMs, aiming to improve generalisability and transparency in downstream applications. Method: AdaFusion uses a prompt-guided inference framework that compresses and aligns tile-level features from diverse models and employs a lightweight attention mechanism to adaptively fuse them based on tissue phenotype context. Result: AdaFusion consistently surpasses individual PFMs across both classification and regression tasks while offering interpretable insights into each model's biosemantic specialisation. Conclusion: AdaFusion successfully integrates knowledge from multiple PFMs, enhancing performance and interpretability in various real-world applications, such as treatment response prediction, tumor grading, and spatial gene expression inference. Abstract: Pathology foundation models (PFMs) have demonstrated strong representational capabilities through self-supervised pre-training on large-scale, unannotated histopathology image datasets. However, their diverse yet opaque pretraining contexts, shaped by both data-related and structural/training factors, introduce latent biases that hinder generalisability and transparency in downstream applications. In this paper, we propose AdaFusion, a novel prompt-guided inference framework that, to our knowledge, is among the very first to dynamically integrate complementary knowledge from multiple PFMs. Our method compresses and aligns tile-level features from diverse models and employs a lightweight attention mechanism to adaptively fuse them based on tissue phenotype context. We evaluate AdaFusion on three real-world benchmarks spanning treatment response prediction, tumour grading, and spatial gene expression inference. Our approach consistently surpasses individual PFMs across both classification and regression tasks, while offering interpretable insights into each model's biosemantic specialisation. 
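The adaptive fusion step described above can be illustrated with a minimal sketch. It assumes the tile-level features from each PFM have already been compressed and aligned to a common dimension, and it uses plain scaled dot-product attention against a context vector; the names `fuse_features`, `aligned_features`, and `context` are hypothetical, and the abstract does not specify the exact attention form AdaFusion uses.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(aligned_features, context):
    """Fuse one feature vector per model using attention weights.

    aligned_features: list of K vectors, all of length d (assumed to be
        already compressed and aligned to a shared space).
    context: length-d vector standing in for the tissue-phenotype prompt
        (a hypothetical stand-in, not the paper's actual prompt encoding).
    Returns the fused vector and the per-model attention weights.
    """
    d = len(context)
    # Scaled dot-product score between the context and each model's feature.
    scores = [
        sum(c * f for c, f in zip(context, feat)) / math.sqrt(d)
        for feat in aligned_features
    ]
    weights = softmax(scores)
    # Weighted sum of the per-model features.
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, aligned_features))
        for i in range(d)
    ]
    return fused, weights


if __name__ == "__main__":
    features = [[1.0, 0.0], [0.0, 1.0]]  # two toy "models"
    fused, weights = fuse_features(features, context=[1.0, 0.0])
    print(fused, weights)
```

The returned weights are what would make such a scheme inspectable: a larger weight for one model on a given tile suggests that model's features match the context more closely.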
These results highlight AdaFusion's ability to bridge heterogeneous PFMs, achieving both enhanced performance and interpretability of model-specific inductive biases.
[77] PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation
Jingxuan He,Busheng Su,Finn Wong
Main category: cs.CV
TL;DR: PoseGen is a novel framework that generates long, coherent videos with precise control over subject identity and motion, overcoming the limitations of current diffusion models.
Details
Motivation: The motivation is to overcome the challenges faced by current diffusion models in generating long, temporally coherent videos with precise control over subject identity and motion, particularly addressing identity drift and limitations to short clips. Method: PoseGen utilizes an in-context LoRA finetuning strategy for identity preservation and conditions on pose information for motion control. It also employs an interleaved segment generation method with a shared KV cache mechanism for seamless video clip stitching. Result: PoseGen can generate arbitrarily long videos of a specific subject from a single reference image and a driving pose sequence, showing significant improvements in identity fidelity, pose accuracy, and video coherence. Conclusion: PoseGen significantly outperforms state-of-the-art methods in identity fidelity, pose accuracy, and the unique ability to produce coherent, artifact-free videos of unlimited duration. Abstract: Generating long, temporally coherent videos with precise control over subject identity and motion is a formidable challenge for current diffusion models, which often suffer from identity drift and are limited to short clips. We introduce PoseGen, a novel framework that generates arbitrarily long videos of a specific subject from a single reference image and a driving pose sequence. Our core innovation is an in-context LoRA finetuning strategy that injects subject appearance at the token level for identity preservation, while simultaneously conditioning on pose information at the channel level for fine-grained motion control. To overcome duration limits, PoseGen pioneers an interleaved segment generation method that seamlessly stitches video clips together, using a shared KV cache mechanism and a specialized transition process to ensure background consistency and temporal smoothness. 
Extensive experiments show that PoseGen, trained on a remarkably small 33-hour video dataset, significantly outperforms state-of-the-art methods in identity fidelity and pose accuracy, and that it can produce coherent, artifact-free videos of unlimited duration.
[78] Sculpting Margin Penalty: Intra-Task Adapter Merging and Classifier Calibration for Few-Shot Class-Incremental Learning
Liang Bai,Hong Song,Jinfu Li,Yucong Lin,Jingfan Fan,Tianyu Fu,Danni Ai,Deqiang Xiao,Jian Yang
Main category: cs.CV
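TL;DR: This paper proposes SMP, a new Few-Shot Class-Incremental Learning (FSCIL) method that introduces Margin-aware Intra-task Adapter Merging (MIAM) and Margin Penalty-based Classifier Calibration (MPCC) to balance base-class discriminability against new-class generalization, achieving strong performance.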
Details
Motivation: Data privacy constraints and high acquisition costs in real-world applications make the assumptions of conventional incremental learning unrealistic, so an FSCIL method is needed that better balances base-class discriminability and new-class generalization. Method: SMP proposes Margin-aware Intra-task Adapter Merging (MIAM) for base task learning and Margin Penalty-based Classifier Calibration (MPCC) for classifier calibration in incremental tasks. Result: Experiments on CIFAR100, ImageNet-R, and CUB200 show that SMP achieves state-of-the-art FSCIL performance while maintaining a better balance between base and new classes. Conclusion: SMP effectively addresses the balance between base-class discriminability and new-class generalization in FSCIL, and achieves better forward compatibility by introducing margin penalties at different stages. Abstract: Real-world applications often face data privacy constraints and high acquisition costs, making the assumption of sufficient training data in incremental tasks unrealistic and leading to significant performance degradation in class-incremental learning. Forward-compatible learning, which prospectively prepares for future tasks during base task training, has emerged as a promising solution for Few-Shot Class-Incremental Learning (FSCIL). However, existing methods still struggle to balance base-class discriminability and new-class generalization. Moreover, limited access to original data during incremental tasks often results in ambiguous inter-class decision boundaries. To address these challenges, we propose SMP (Sculpting Margin Penalty), a novel FSCIL method that strategically integrates margin penalties at different stages within the parameter-efficient fine-tuning paradigm. Specifically, we introduce the Margin-aware Intra-task Adapter Merging (MIAM) mechanism for base task learning. MIAM trains two sets of low-rank adapters with distinct classification losses: one with a margin penalty to enhance base-class discriminability, and the other without margin constraints to promote generalization to future new classes. These adapters are then adaptively merged to improve forward compatibility. For incremental tasks, we propose a Margin Penalty-based Classifier Calibration (MPCC) strategy to refine decision boundaries by fine-tuning classifiers on all seen classes' embeddings with a margin penalty. 
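To make the idea of a classification loss "with a margin penalty" concrete, here is a minimal sketch. It assumes an additive cosine margin on the true-class logit (in the style of CosFace), which is only one common way to write a margin penalty; the abstract does not state which form SMP uses, and `margin_cross_entropy`, `margin`, and `scale` are hypothetical names and values.

```python
import math


def margin_cross_entropy(cosines, target, margin=0.0, scale=16.0):
    """Cross-entropy over scaled cosine logits with an additive margin.

    cosines: cosine similarity between an embedding and each class prototype.
    target: index of the true class.
    margin: value subtracted from the true-class cosine before scaling
        (assumed additive form; margin=0.0 gives the plain loss).
    scale: temperature-like multiplier applied to all logits (illustrative).
    """
    logits = [
        scale * (c - margin if k == target else c)
        for k, c in enumerate(cosines)
    ]
    # log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


if __name__ == "__main__":
    cosines = [0.8, 0.3, 0.1]
    print(margin_cross_entropy(cosines, target=0, margin=0.0))
    print(margin_cross_entropy(cosines, target=0, margin=0.2))
```

Because the margin lowers only the true-class logit, the same embedding incurs a larger loss, which pushes training toward a wider gap between classes. In this reading, the two MIAM adapter sets would correspond to training with `margin > 0` and with `margin = 0`, respectively.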
Extensive experiments on CIFAR100, ImageNet-R, and CUB200 demonstrate that SMP achieves state-of-the-art performance in FSCIL while maintaining a better balance between base and new classes.
[79] AHDMIL: Asymmetric Hierarchical Distillation Multi-Instance Learning for Fast and Accurate Whole-Slide Image Classification
Jiuyang Dong,Jiahan Li,Junjun Jiang,Kui Jiang,Yongbing Zhang
Main category: cs.CV
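TL;DR: This paper proposes AHDMIL, an asymmetric hierarchical distillation multi-instance learning framework that achieves fast and accurate classification in computational pathology by eliminating irrelevant image patches.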
Details
Motivation: Multi-instance learning (MIL) has succeeded in pathological image classification, but its inference cost is high because thousands of patches from each gigapixel whole slide image (WSI) must be processed. The AHDMIL framework is proposed to address this and enable fast, accurate classification. Method: AHDMIL comprises two key components: the Dynamic Multi-Instance Network (DMIN), which processes high-resolution WSIs, and the Dual-Branch Lightweight Instance Pre-screening Network (DB-LIPN), which analyzes their low-resolution counterparts. The per-instance attention scores that DMIN produces during the self-distillation stage guide the training of DB-LIPN, and the high-resolution patches that DB-LIPN predicts to be relevant are used for fine-tuning and efficient inference of DMIN. A Chebyshev-polynomial-based Kolmogorov-Arnold (CKA) classifier is also designed. Result: Experiments show that AHDMIL outperforms previous state-of-the-art methods on four public datasets. For example, on Camelyon16 it achieves a relative accuracy improvement of 5.3% and accelerates inference by 1.2 times. Area under the curve (AUC), accuracy, F1 score, and Brier score improve consistently on all datasets, with average inference speedups of 1.2 to 2.1 times. Conclusion: AHDMIL achieves fast and accurate classification in computational pathology, lowering inference cost by eliminating irrelevant patches and significantly outperforming existing methods on multiple datasets. Abstract: Although multi-instance learning (MIL) has succeeded in pathological image classification, it faces the challenge of high inference costs due to the need to process thousands of patches from each gigapixel whole slide image (WSI). To address this, we propose AHDMIL, an Asymmetric Hierarchical Distillation Multi-Instance Learning framework that enables fast and accurate classification by eliminating irrelevant patches through a two-step training process. AHDMIL comprises two key components: the Dynamic Multi-Instance Network (DMIN), which operates on high-resolution WSIs, and the Dual-Branch Lightweight Instance Pre-screening Network (DB-LIPN), which analyzes corresponding low-resolution counterparts. In the first step, self-distillation (SD), DMIN is trained for WSI classification while generating per-instance attention scores to identify irrelevant patches. These scores guide the second step, asymmetric distillation (AD), where DB-LIPN learns to predict the relevance of each low-resolution patch. The relevant patches predicted by DB-LIPN have spatial correspondence with patches in high-resolution WSIs, which are used for fine-tuning and efficient inference of DMIN. In addition, we design the first Chebyshev-polynomial-based Kolmogorov-Arnold (CKA) classifier in computational pathology, which improves classification performance through learnable activation layers. 
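As a rough illustration of what a Chebyshev-polynomial-based learnable activation can look like, the sketch below evaluates a weighted sum of Chebyshev polynomials of the first kind via the standard recurrence T_0(x) = 1, T_1(x) = x, T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x). Squashing the input with tanh so that it lies in [-1, 1] is an assumption borrowed from common Chebyshev-KAN implementations, not something the abstract states; `chebyshev_basis`, `learnable_activation`, and `coeffs` are hypothetical names.

```python
import math


def chebyshev_basis(x, degree):
    """Return [T_0(x), ..., T_degree(x)] using the three-term recurrence."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]


def learnable_activation(x, coeffs):
    """Evaluate phi(x) = sum_k coeffs[k] * T_k(tanh(x)).

    coeffs plays the role of the trainable parameters of one activation;
    in a real network they would be updated by gradient descent.
    """
    z = math.tanh(x)  # assumed normalisation into the Chebyshev domain [-1, 1]
    basis = chebyshev_basis(z, len(coeffs) - 1)
    return sum(c * t for c, t in zip(coeffs, basis))


if __name__ == "__main__":
    print(chebyshev_basis(0.5, 3))
    print(learnable_activation(0.0, [0.3, 1.0, 2.0]))
```

In a Kolmogorov-Arnold style layer, each input-output edge would carry its own coefficient vector, so the shape of the nonlinearity is learned rather than fixed.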
Extensive experiments on four public datasets demonstrate that AHDMIL consistently outperforms previous state-of-the-art methods in both classification performance and inference speed. For example, on the Camelyon16 dataset, it achieves a relative improvement of 5.3% in accuracy and accelerates inference by 1.2 times. Across all datasets, area under the curve (AUC), accuracy, F1 score, and Brier score show consistent gains, with average inference speedups ranging from 1.2 to 2.1 times. The code is available.
[80] MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs
Yufei Gao,Jiaying Fei,Nuo Chen,Ruirui Chen,Guohang Yan,Yunshi Lan,Botian Shi
Main category: cs.CV
TL;DR: This paper introduces MELLA, a dataset designed to improve Multimodal Large Language Models' performance in low-resource languages by enhancing both linguistic capabilities and cultural awareness.
Details
Motivation: MLLMs perform poorly in low-resource languages due to limited multilingual enhancement methods that often focus only on text modality or machine translation, neglecting cultural and multimodal relevance. Method: The researchers proposed a dual-source strategy to collect data for linguistic and cultural enhancement and introduced the MELLA dataset for multimodal, multilingual training. Result: Fine-tuning on the MELLA dataset led to improved performance across eight languages and demonstrated that both linguistic and cultural enhancements contribute to model effectiveness. Conclusion: The study concludes that enhancing both linguistic capability and cultural groundedness significantly improves the performance of Multimodal Large Language Models (MLLMs) in low-resource language settings. Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable performance in high-resource languages. However, their effectiveness diminishes significantly in the contexts of low-resource languages. Current multilingual enhancement methods are often limited to text modality or rely solely on machine translation. While such approaches help models acquire basic linguistic capabilities and produce "thin descriptions", they neglect the importance of multimodal informativeness and cultural groundedness, both of which are crucial for serving low-resource language users effectively. To bridge this gap, in this study, we identify two significant objectives for a truly effective MLLM in low-resource language settings, namely 1) linguistic capability and 2) cultural groundedness, placing special emphasis on cultural awareness. To achieve these dual objectives, we propose a dual-source strategy that guides the collection of data tailored to each goal, sourcing native web alt-text for culture and MLLM-generated captions for linguistics. As a concrete implementation, we introduce MELLA, a multimodal, multilingual dataset. 
Experimental results show that after fine-tuning on MELLA, there is a general performance improvement for the eight languages on various MLLM backbones, with models producing "thick descriptions". We verify that the performance gains are from both cultural knowledge enhancement and linguistic capability enhancement. Our dataset can be found at https://opendatalab.com/applyMultilingualCorpus.
[81] Latent Expression Generation for Referring Image Segmentation and Grounding
Seonghoon Yu,Joonbeom Hong,Joonseok Lee,Jeany Son
Main category: cs.CV
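TL;DR: This paper proposes a new visual grounding framework that generates multiple latent expressions from a single textual input and exploits complementary visual details to improve the accuracy of visual grounding tasks.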
Details
Motivation: Existing methods rely on a single textual input and cannot fully exploit the rich information in the visual domain, so similar objects are easily misidentified. Method: Subject distributor and visual concept injector modules embed shared-subject and distinct-attribute concepts into the latent representations, and a positive-margin contrastive learning strategy aligns all latent expressions with the original text. Result: Experiments show that the method not only outperforms existing RIS and REC methods on multiple benchmarks but also performs strongly on the generalized referring expression segmentation (GRES) benchmark. Conclusion: The method effectively resolves the mismatch between a single textual input and rich visual details, improving performance on visual grounding tasks. Abstract: Visual grounding tasks, such as referring image segmentation (RIS) and referring expression comprehension (REC), aim to localize a target object based on a given textual description. The target object in an image can be described in multiple ways, reflecting diverse attributes such as color, position, and more. However, most existing methods rely on a single textual input, which captures only a fraction of the rich information available in the visual domain. This mismatch between rich visual details and sparse textual cues can lead to the misidentification of similar objects. To address this, we propose a novel visual grounding framework that leverages multiple latent expressions generated from a single textual input by incorporating complementary visual details absent from the original description. Specifically, we introduce subject distributor and visual concept injector modules to embed both shared-subject and distinct-attributes concepts into the latent representations, thereby capturing unique and target-specific visual cues. We also propose a positive-margin contrastive learning strategy to align all latent expressions with the original text while preserving subtle variations. Experimental results show that our method not only outperforms state-of-the-art RIS and REC approaches on multiple benchmarks but also achieves outstanding performance on the generalized referring expression segmentation (GRES) benchmark.
[82] Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
Luozheng Qin,Jia Gong,Yuqing Sun,Tianjiao Li,Mengping Yang,Xiaomeng Yang,Chao Qu,Zhiyu Tan,Hao Li
Main category: cs.CV
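TL;DR: This paper proposes Uni-CoT, a unified two-level reasoning framework that addresses visual state transitions and coherence in multimodal tasks, achieving strong performance and generalization through an efficient training strategy.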
Details
Motivation: Existing multimodal reasoning methods are limited in modeling visual state transitions or in producing coherent visual trajectories, so a more efficient, unified reasoning framework is needed. Method: Uni-CoT performs multimodal reasoning with a single unified model, introducing macro-level task planning and micro-level subtask execution, and optimizes the model with a training strategy that combines interleaved image-text supervision with multi-task objectives. Result: Uni-CoT achieves SOTA performance on several image generation and editing benchmarks, and all experiments can be completed with only 8 A100 GPUs with 80GB of VRAM each. Conclusion: Through its two-level macro and micro reasoning paradigm, Uni-CoT achieves efficient and coherent multimodal reasoning, with state-of-the-art performance and strong generalization. Abstract: Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. However, extending CoT to vision-language reasoning tasks remains challenging, as it often requires interpreting transitions of visual states to support reasoning. Existing methods often struggle with this due to limited capacity for modeling visual state transitions or incoherent visual trajectories caused by fragmented architectures. To overcome these limitations, we propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning within a single unified model. The key idea is to leverage a model capable of both image understanding and generation to reason over visual content and model evolving visual states. However, empowering a unified model to achieve that is non-trivial, given the high computational cost and the burden of training. To address this, Uni-CoT introduces a novel two-level reasoning paradigm: a Macro-Level CoT for high-level task planning and a Micro-Level CoT for subtask execution. This design significantly reduces the computational overhead. Furthermore, we introduce a structured training paradigm that combines interleaved image-text supervision for macro-level CoT with multi-task objectives for micro-level CoT. Together, these innovations allow Uni-CoT to perform scalable and coherent multi-modal reasoning. Furthermore, thanks to our design, all experiments can be efficiently completed using only 8 A100 GPUs with 80GB VRAM each. 
Experimental results on the reasoning-driven image generation benchmark (WISE) and the editing benchmarks (RISE and KRIS) indicate that Uni-CoT demonstrates SOTA performance and strong generalization, establishing Uni-CoT as a promising solution for multi-modal reasoning. Project Page and Code: https://sais-fuxi.github.io/projects/uni-cot/
[83] FedGIN: Federated Learning with Dynamic Global Intensity Non-linear Augmentation for Organ Segmentation using Multi-modal Images
Sachin Dudda Nagaraju,Ashkan Moradi,Bendik Skarre Abrahamsen,Mattijs Elschot
Main category: cs.CV
TL;DR: FedGIN is a Federated Learning framework with GIN augmentation for multimodal medical image segmentation that preserves privacy, improves performance, and achieves near-centralized results.
Details
Motivation: Medical image segmentation is crucial for diagnostics and treatment monitoring, but real-world deployment faces challenges like data scarcity, domain shift between modalities, and privacy restrictions. A unified model capable of generalizing across modalities while preserving privacy would be highly beneficial. Method: FedGIN, a Federated Learning (FL) framework with a lightweight Global Intensity Non-linear (GIN) augmentation module, was developed to enable multimodal organ segmentation without sharing raw patient data. It harmonizes modality-specific intensity distributions during local training. Result: FedGIN achieved a 12 to 18% improvement in 3D Dice scores on MRI test cases compared to FL without GIN in the limited dataset scenario. In the complete dataset scenario, it showed a 30% Dice score improvement over the MRI-only baseline and a 10% improvement over the CT-only baseline, achieving near-centralized performance. Conclusion: FedGIN demonstrates strong cross-modality generalization under privacy constraints, achieving near-centralized performance and outperforming local baselines and FL without GIN in multimodal organ segmentation. Abstract: Medical image segmentation plays a crucial role in AI-assisted diagnostics, surgical planning, and treatment monitoring. Accurate and robust segmentation models are essential for enabling reliable, data-driven clinical decision making across diverse imaging modalities. Given the inherent variability in image characteristics across modalities, developing a unified model capable of generalizing effectively to multiple modalities would be highly beneficial. This model could streamline clinical workflows and reduce the need for modality-specific training. However, real-world deployment faces major challenges, including data scarcity, domain shift between modalities (e.g., CT vs. MRI), and privacy restrictions that prevent data sharing. 
To address these issues, we propose FedGIN, a Federated Learning (FL) framework that enables multimodal organ segmentation without sharing raw patient data. Our method integrates a lightweight Global Intensity Non-linear (GIN) augmentation module that harmonizes modality-specific intensity distributions during local training. We evaluated FedGIN using two types of datasets: an imputed dataset and a complete dataset. In the limited dataset scenario, the model was initially trained using only MRI data, and CT data was added to assess its performance improvements. In the complete dataset scenario, both MRI and CT data were fully utilized for training on all clients. In the limited-data scenario, FedGIN achieved a 12 to 18% improvement in 3D Dice scores on MRI test cases compared to FL without GIN and consistently outperformed local baselines. In the complete dataset scenario, FedGIN demonstrated near-centralized performance, with a 30% Dice score improvement over the MRI-only baseline and a 10% improvement over the CT-only baseline, highlighting its strong cross-modality generalization under privacy constraints.
[84] Test-Time Reinforcement Learning for GUI Grounding via Region Consistency
Yong Du,Yuchen Yan,Fei Tang,Zhengxi Lu,Chang Zong,Weiming Lu,Shengpei Jiang,Yongliang Shen
Main category: cs.CV
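TL;DR: This work proposes a test-time scaling method (GUI-RC) and a consistency-based reinforcement learning method (GUI-RCPO) that exploit the spatial consistency of a model's predictions to improve GUI grounding accuracy, with GUI-RC requiring no training at all.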
Details
Motivation: Existing methods depend on extensive supervised training or reinforcement learning with labeled rewards, and are limited by the cost and availability of pixel-level annotations. This work aims to improve performance by exploiting the spatial overlap patterns among multiple predictions generated by the model. Method: The method builds spatial voting grids and uses the consistency of multiple predictions sampled at test time to improve grounding accuracy (GUI-RC); it then converts these consistency patterns into rewards for test-time reinforcement learning (GUI-RCPO). Result: GUI-RC raises the accuracy of Qwen2.5-VL-3B-Instruct on ScreenSpot-v2 from 80.11% to 83.57%, and GUI-RCPO further raises it to 85.14%. Conclusion: The paper proposes a training-free test-time scaling method (GUI-RC) and a consistency-based reinforcement learning method (GUI-RCPO) for improving GUI grounding accuracy, demonstrating the potential of test-time scaling and self-supervised optimization. Abstract: Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to precise screen coordinates, is fundamental to autonomous GUI agents. While existing methods achieve strong performance through extensive supervised training or reinforcement learning with labeled rewards, they remain constrained by the cost and availability of pixel-level annotations. We observe that when models generate multiple predictions for the same GUI element, the spatial overlap patterns reveal implicit confidence signals that can guide more accurate localization. Leveraging this insight, we propose GUI-RC (Region Consistency), a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions where models show highest agreement. Without any training, GUI-RC improves accuracy by 2-3% across various architectures on ScreenSpot benchmarks. We further introduce GUI-RCPO (Region Consistency Policy Optimization), which transforms these consistency patterns into rewards for test-time reinforcement learning. By computing how well each prediction aligns with the collective consensus, GUI-RCPO enables models to iteratively refine their outputs on unlabeled data during inference. Extensive experiments demonstrate the generality of our approach: GUI-RC boosts Qwen2.5-VL-3B-Instruct from 80.11% to 83.57% on ScreenSpot-v2, while GUI-RCPO further improves it to 85.14% through self-supervised optimization. 
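The spatial voting idea behind GUI-RC can be sketched in a few lines. The version below assumes each sampled prediction is an axis-aligned box `(x1, y1, x2, y2)` in integer pixel coordinates with exclusive upper bounds, and it takes the consensus region to be the bounding box of all cells with the maximum vote count; both choices, and the name `consensus_region`, are illustrative assumptions rather than the paper's exact procedure.

```python
def consensus_region(boxes, width, height):
    """Vote over a pixel grid and return the region of highest agreement.

    boxes: iterable of (x1, y1, x2, y2) predictions, upper bounds exclusive.
    width, height: screen size in pixels.
    Returns ((x1, y1, x2, y2), votes) where votes is the maximum number of
    predictions that overlap on any single cell.
    """
    votes = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        # Clip each box to the screen and add one vote per covered cell.
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                votes[y][x] += 1

    top = max(max(row) for row in votes)
    cells = [
        (x, y)
        for y in range(height)
        for x in range(width)
        if votes[y][x] == top
    ]
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    # Bounding box of the maximally voted cells (a simplifying assumption).
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1), top


if __name__ == "__main__":
    samples = [(0, 0, 4, 4), (2, 2, 6, 6), (3, 3, 8, 8)]
    print(consensus_region(samples, width=10, height=10))
```

A click target could then be taken as the centre of the returned region. For real screen resolutions a coarser grid or an array library would be preferable to nested Python loops.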
Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents.
[85] Deep Learning-based Animal Behavior Analysis: Insights from Mouse Chronic Pain Models
Yu-Hsi Chen,Wei-Hsin Chen,Chien-Yao Wang,Hong-Yuan Mark Liao,James C. Liao,Chien-Chang Chen
Main category: cs.CV
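TL;DR: This study proposes a new framework that automatically discovers features related to chronic pain, avoiding the bias introduced by manual labeling, and outperforms human experts and existing methods on pain classification tasks.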
Details
Motivation: Existing methods for assessing chronic pain behavior rely mainly on manually labeled behavioral features, yet humans lack a clear understanding of which behaviors best represent chronic pain, so these methods struggle to capture the insidious and persistent behavioral changes of chronic pain. Method: The study proposes a universal action space projector that requires no human-defined action labels and automatically extracts mouse action features from raw video, avoiding the bias that manual labeling can introduce. Result: On a 15-class pain classification task the method reaches 48.41% accuracy, significantly outperforming human experts (21.33%) and the widely used B-SOiD method (30.52%). On a three-class task (neuropathic pain, inflammatory pain, and no pain) it reaches 73.1% accuracy, also clearly above human experts (48%) and B-SOiD (58.43%). Conclusion: The study presents a new approach to chronic pain behavior analysis with potential clinical application, offering new perspectives for pain research and related drug development. Abstract: Assessing chronic pain behavior in mice is critical for preclinical studies. However, existing methods mostly rely on manual labeling of behavioral features, and humans lack a clear understanding of which behaviors best represent chronic pain. For this reason, existing methods struggle to accurately capture the insidious and persistent behavioral changes in chronic pain. This study proposes a framework to automatically discover features related to chronic pain without relying on human-defined action labels. Our method uses a universal action space projector to automatically extract mouse action features, and avoids the potential bias of human labeling by retaining the rich behavioral information in the original video. In this paper, we also collected a mouse pain behavior dataset that captures the disease progression of both neuropathic and inflammatory pain across multiple time points. Our method achieves 48.41% accuracy in a 15-class pain classification task, significantly outperforming human experts (21.33%) and the widely used method B-SOiD (30.52%). Furthermore, when the classification is simplified to only three categories, i.e., neuropathic pain, inflammatory pain, and no pain, our method achieves an accuracy of 73.1%, which is notably higher than that of human experts (48%) and B-SOiD (58.43%). Finally, our method revealed differences in drug efficacy for different types of pain on zero-shot Gabapentin drug testing, and the results were consistent with past drug efficacy literature. 
This study demonstrates the potential clinical application of our method, which can provide new insights into pain research and related drug development.
[86] Rotation Equivariant Arbitrary-scale Image Super-Resolution
Qi Xie,Jiahong Fu,Zongben Xu,Deyu Meng
Main category: cs.CV
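TL;DR: This paper proposes a rotation equivariant arbitrary-scale image super-resolution (ASISR) method that redesigns the architectures of the INR and encoder modules to achieve end-to-end rotation equivariance from input to output, thereby preserving the original orientations and structural integrity of geometric patterns in the input image.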
Details
Motivation: In ASISR, common geometric patterns in low-resolution images (such as repetitive textures, edges, or shapes) are severely warped and deformed, which causes unexpected artifacts in the high-resolution reconstruction; embedding rotation equivariance into the ASISR network is needed to address this. Method: The basic architectures of the INR and encoder modules are redesigned to achieve end-to-end rotation equivariance from input to output. Result: The work realizes, for the first time, an ASISR network with end-to-end rotation equivariance from input to output, and a theoretical analysis of its intrinsic equivariance error demonstrates that the equivariant structure is inherent to the network. Conclusion: Experiments on simulated and real datasets demonstrate the superiority of the proposed method, and the framework can be integrated into existing ASISR methods in a plug-and-play manner to further improve their performance. Abstract: The arbitrary-scale image super-resolution (ASISR), a recent popular topic in computer vision, aims to achieve arbitrary-scale high-resolution recoveries from a low-resolution input image. This task is realized by representing the image as a continuous implicit function through two fundamental modules, a deep-network-based encoder and an implicit neural representation (INR) module. Despite achieving notable progress, a crucial challenge of such a highly ill-posed setting is that many common geometric patterns, such as repetitive textures, edges, or shapes, are seriously warped and deformed in the low-resolution images, naturally leading to unexpected artifacts appearing in their high-resolution recoveries. Embedding rotation equivariance into the ASISR network is thus necessary, as it has been widely demonstrated that this enhancement enables the recovery to faithfully maintain the original orientations and structural integrity of geometric patterns underlying the input image. Motivated by this, we make efforts to construct a rotation equivariant ASISR method in this study. Specifically, we elaborately redesign the basic architectures of INR and encoder modules, incorporating intrinsic rotation equivariance capabilities beyond those of conventional ASISR networks. Through such amelioration, the ASISR network can, for the first time, be implemented with end-to-end rotational equivariance maintained from input to output. We also provide a solid theoretical analysis to evaluate its intrinsic equivariance error, demonstrating its inherent nature of embedding such an equivariance structure. 
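Rotation equivariance means that rotating the input and then applying the operator gives the same result as applying the operator and then rotating the output, i.e. f(R x) = R f(x). The sketch below measures the violation of that identity for 90-degree rotations only, on toy operators; it is a simplified illustration of what an "equivariance error" is, not the paper's network or its theoretical analysis, and all function names are hypothetical.

```python
def rot90(img):
    """Rotate a 2D list of numbers by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]


def neighbour_mean(img):
    """Average of the four axis-aligned neighbours with zero padding.

    The stencil is symmetric under 90-degree rotations, so this operator
    is equivariant to them.
    """
    h, w = len(img), len(img[0])

    def at(y, x):
        return img[y][x] if 0 <= y < h and 0 <= x < w else 0.0

    return [
        [(at(y - 1, x) + at(y + 1, x) + at(y, x - 1) + at(y, x + 1)) / 4.0
         for x in range(w)]
        for y in range(h)
    ]


def horizontal_diff(img):
    """Difference with the left neighbour: direction-dependent, so not equivariant."""
    return [
        [row[x] - (row[x - 1] if x > 0 else 0.0) for x in range(len(row))]
        for row in img
    ]


def equivariance_error(op, img):
    """Largest absolute difference between op(rot90(img)) and rot90(op(img))."""
    rotated_then_applied = op(rot90(img))
    applied_then_rotated = rot90(op(img))
    return max(
        abs(p - q)
        for row_a, row_b in zip(rotated_then_applied, applied_then_rotated)
        for p, q in zip(row_a, row_b)
    )


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(equivariance_error(neighbour_mean, image))
    print(equivariance_error(horizontal_diff, image))
```

An operator with zero error preserves the orientation of patterns under this rotation, which is the property the abstract argues reduces warping artifacts; the paper itself targets equivariance beyond this discrete 90-degree case.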
The superiority of the proposed method is substantiated by experiments conducted on both simulated and real datasets. We also validate that the proposed framework can be readily integrated into current ASISR methods in a plug-and-play manner to further enhance their performance.
[87] X-MoGen: Unified Motion Generation across Humans and Animals
Xuan Wang,Kai Ruan,Liyang Qian,Zhizhi Guo,Chang Su,Gaoang Wang
Main category: cs.CV
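TL;DR: This paper proposes X-MoGen, a unified framework for cross-species text-driven motion generation, and builds UniMo4D, a large-scale dataset that supports unified modeling.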
Details
Motivation: Text-driven motion generation has attracted growing attention because of its broad applications in virtual reality, animation, and robotics. Existing methods usually model human and animal motion separately, whereas a unified cross-species approach could offer broader generality and a unified representation. Morphological differences across species, however, remain a key challenge. Method: X-MoGen adopts a two-stage architecture. In the first stage, a conditional graph variational autoencoder learns canonical T-pose priors, while an autoencoder encodes motion into a shared latent space regularized by a morphological loss. In the second stage, masked motion modeling generates motion embeddings conditioned on textual descriptions. A morphological consistency module is used during training to improve skeletal plausibility. A large-scale dataset, UniMo4D, covering 115 species and 119k motion sequences, is built to support unified modeling. Result: Extensive experiments on UniMo4D show that X-MoGen outperforms state-of-the-art methods on both seen and unseen species. Conclusion: X-MoGen is the first unified framework for cross-species text-driven motion generation; it handles morphological differences across species effectively and performs strongly on the large-scale UniMo4D dataset. Abstract: Text-driven motion generation has attracted increasing attention due to its broad applications in virtual reality, animation, and robotics. While existing methods typically model human and animal motion separately, a joint cross-species approach offers key advantages, such as a unified representation and improved generalization. However, morphological differences across species remain a key challenge, often compromising motion plausibility. To address this, we propose X-MoGen, the first unified framework for cross-species text-driven motion generation covering both humans and animals. X-MoGen adopts a two-stage architecture. First, a conditional graph variational autoencoder learns canonical T-pose priors, while an autoencoder encodes motion into a shared latent space regularized by morphological loss. In the second stage, we perform masked motion modeling to generate motion embeddings conditioned on textual descriptions. During training, a morphological consistency module is employed to promote skeletal plausibility across species. To support unified modeling, we construct UniMo4D, a large-scale dataset of 115 species and 119k motion sequences, which integrates human and animal motions under a shared skeletal topology for joint training. 
Extensive experiments on UniMo4D demonstrate that X-MoGen outperforms state-of-the-art methods on both seen and unseen species.
[88] PhysPatch: A Physically Realizable and Transferable Adversarial Patch Attack for Multimodal Large Language Models-based Autonomous Driving Systems
Qi Guo,Xiaojun Jia,Shanmin Pang,Simeng Qin,Lin Wang,Ju Jia,Yang Liu,Qing Guo
Main category: cs.CV
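TL;DR: This paper proposes PhysPatch, an adversarial patch framework designed for autonomous driving systems based on multimodal large language models (MLLMs), which optimizes patch location, shape, and content to improve attack effectiveness and real-world applicability.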
Details
Motivation: Existing patch-based attack methods mainly target object detection models and transfer poorly to MLLM-based systems, so a new method is needed to cope with the complex architectures and reasoning abilities of MLLMs. Method: PhysPatch jointly optimizes patch location, shape, and content, introducing a semantic-based mask initialization strategy, an SVD-based local alignment loss, and a potential field-based mask refinement method. Result: Experiments show that PhysPatch significantly outperforms prior methods in steering MLLM-based autonomous driving systems toward target-aligned perception and planning outputs, and that its patches can be placed in physically feasible regions of driving scenes. Conclusion: PhysPatch is a physically realizable and transferable adversarial patch framework tailored to MLLM-based autonomous driving systems; it significantly outperforms existing methods and ensures real-world applicability and deployability. Abstract: Multimodal Large Language Models (MLLMs) are becoming integral to autonomous driving (AD) systems due to their strong vision-language reasoning capabilities. However, MLLMs are vulnerable to adversarial attacks, particularly adversarial patch attacks, which can pose serious threats in real-world scenarios. Existing patch-based attack methods are primarily designed for object detection models and perform poorly when transferred to MLLM-based systems due to the latter's complex architectures and reasoning abilities. To address these limitations, we propose PhysPatch, a physically realizable and transferable adversarial patch framework tailored for MLLM-based AD systems. PhysPatch jointly optimizes patch location, shape, and content to enhance attack effectiveness and real-world applicability. It introduces a semantic-based mask initialization strategy for realistic placement, an SVD-based local alignment loss with patch-guided crop-resize to improve transferability, and a potential field-based mask refinement method. Extensive experiments across open-source, commercial, and reasoning-capable MLLMs demonstrate that PhysPatch significantly outperforms prior methods in steering MLLM-based AD systems toward target-aligned perception and planning outputs. Moreover, PhysPatch consistently places adversarial patches in physically feasible regions of AD scenes, ensuring strong real-world applicability and deployability.
[89] Multi-tracklet Tracking for Generic Targets with Adaptive Detection Clustering
Zewei Wu,Longhao Wang,Cui Wang,César Teixeira,Wei Ke,Zhang Xiong
Main category: cs.CV
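TL;DR: This paper proposes a Multi-Tracklet Tracking (MTT) framework that addresses low-confidence detections, weak motion and appearance constraints, and long-term occlusions in multi-object tracking.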
Details
Motivation: Unseen categories often challenge existing methods because of low-confidence detections, weak motion and appearance constraints, and long-term occlusions. Method: The paper proposes a tracklet-enhanced tracker called Multi-Tracklet Tracking (MTT) that integrates flexible tracklet generation into a multi-tracklet association framework. Result: Experiments show that the method performs well at mitigating error propagation in long-term association. Conclusion: The MTT framework is competitive on the generic multiple object tracking benchmark. Abstract: Tracking specific targets, such as pedestrians and vehicles, has been the focus of recent vision-based multitarget tracking studies. However, in some real-world scenarios, unseen categories often challenge existing methods due to low-confidence detections, weak motion and appearance constraints, and long-term occlusions. To address these issues, this article proposes a tracklet-enhanced tracker called Multi-Tracklet Tracking (MTT) that integrates flexible tracklet generation into a multi-tracklet association framework. This framework first adaptively clusters the detection results according to their short-term spatio-temporal correlation into robust tracklets and then estimates the best tracklet partitions using multiple clues, such as location and appearance over time, to mitigate error propagation in long-term association. Finally, extensive experiments on the benchmark for generic multiple object tracking demonstrate the competitiveness of the proposed framework.
[90] SPA++: Generalized Graph Spectral Alignment for Versatile Domain Adaptation
Zhiqing Xiao,Haobo Wang,Xu Lu,Wentao Ye,Gang Chen,Junbo Zhao
Main category: cs.CV
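TL;DR: The paper proposes SPA++, a new domain adaptation framework that improves cross-domain knowledge transfer through graph spectral alignment and a neighbor-aware propagation mechanism.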
Details
Motivation: Existing domain adaptation methods focus mainly on inter-domain transferability and overlook rich intra-domain structures, which degrades discriminability. Method: The paper proposes SPA++, a generalized graph spectral alignment framework that combines a coarse graph alignment mechanism, a fine-grained neighbor-aware propagation mechanism, data augmentation, and consistency regularization. Result: Extensive experiments on benchmark datasets show that SPA++ consistently outperforms existing methods, and a theoretical analysis is provided to support the method. Conclusion: SPA++ shows superior robustness and adaptability across a variety of challenging adaptation scenarios, outperforming existing state-of-the-art methods. Abstract: Domain Adaptation (DA) aims to transfer knowledge from a labeled source domain to an unlabeled or sparsely labeled target domain under domain shifts. Most prior works focus on capturing the inter-domain transferability but largely overlook rich intra-domain structures, which empirically results in even worse discriminability. To tackle this tradeoff, we propose a generalized graph SPectral Alignment framework, SPA++. Its core is briefly condensed as follows: (1) by casting the DA problem to graph primitives, it composes a coarse graph alignment mechanism with a novel spectral regularizer toward aligning the domain graphs in eigenspaces; (2) we further develop a fine-grained neighbor-aware propagation mechanism for enhanced discriminability in the target domain; (3) by incorporating data augmentation and consistency regularization, SPA++ can adapt to complex scenarios including most DA settings and even challenging distribution scenarios. Furthermore, we also provide theoretical analysis to support our method, including the generalization bound of graph-based DA and the role of spectral alignment and smoothing consistency. Extensive experiments on benchmark datasets demonstrate that SPA++ consistently outperforms existing cutting-edge methods, achieving superior robustness and adaptability across various challenging adaptation scenarios.
[91] SPEX: A Vision-Language Model for Land Cover Extraction on Spectral Remote Sensing Images
Dongchen Si,Di Wang,Erzhong Gao,Xiaolei Qin,Liu Zhao,Jing Zhang,Minqiang Xu,Jianbo Zhan,Jianshe Wang,Lin Liu,Bo Du,Liangpei Zhang
Main category: cs.CV
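TL;DR: This paper proposes SPEX, a multimodal vision-language model for instruction-driven land cover extraction in spectral remote sensing images, and builds the accompanying dataset SPIE. Using multiscale feature aggregation, token context condensation, and multispectral visual pre-training, SPEX achieves precise and flexible pixel-level interpretation and can generate textual explanations that improve interpretability and user-friendliness. Experiments show that SPEX outperforms existing methods on multiple datasets.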
Details
Motivation: Spectral information is critical in remote sensing observations, but current vision-language models do not fully exploit it, leading to poor performance, especially in multispectral scenarios. Method: The authors build SPIE, a vision-language instruction-following dataset that encodes spectral priors of land-cover objects into textual attributes recognizable by large language models, and propose the SPEX model, which introduces components and training strategies including multiscale feature aggregation, token context condensation, and multispectral visual pre-training. Result: Extensive experiments on five public multispectral datasets show that SPEX consistently outperforms existing state-of-the-art methods in extracting typical land cover categories such as vegetation, buildings, and water bodies. Conclusion: SPEX is the first multimodal vision-language model dedicated to land cover extraction in spectral remote sensing imagery, and it can generate textual explanations for its predictions, improving interpretability and user-friendliness. Abstract: Spectral information has long been recognized as a critical cue in remote sensing observations. Although numerous vision-language models have been developed for pixel-level interpretation, spectral information remains underutilized, resulting in suboptimal performance, particularly in multispectral scenarios. To address this limitation, we construct a vision-language instruction-following dataset named SPIE, which encodes spectral priors of land-cover objects into textual attributes recognizable by large language models (LLMs), based on classical spectral index computations. Leveraging this dataset, we propose SPEX, a multimodal LLM designed for instruction-driven land cover extraction. To this end, we introduce several carefully designed components and training strategies, including multiscale feature aggregation, token context condensation, and multispectral visual pre-training, to achieve precise and flexible pixel-level interpretation. To the best of our knowledge, SPEX is the first multimodal vision-language model dedicated to land cover extraction in spectral remote sensing imagery. Extensive experiments on five public multispectral datasets demonstrate that SPEX consistently outperforms existing state-of-the-art methods in extracting typical land cover categories such as vegetation, buildings, and water bodies. Moreover, SPEX is capable of generating textual explanations for its predictions, thereby enhancing interpretability and user-friendliness. Code will be released at: https://github.com/MiliLab/SPEX.
[92] EndoMatcher: Generalizable Endoscopic Image Matcher via Multi-Domain Pre-training for Robot-Assisted Surgery
Bingyu Yang,Qingyao Tian,Yimeng Geng,Huai Liao,Xinyan Huang,Jiebo Luo,Hongbin Liu
Main category: cs.CV
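TL;DR: EndoMatcher is an endoscopic image matching method pre-trained on large-scale multi-domain data; with a two-branch Vision Transformer and a multi-objective training strategy on the Endo-Mix6 dataset, it achieves accurate and generalizable dense matching under challenging endoscopic conditions.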
Details
Motivation: Dense feature matching in endoscopic images is crucial for robot-assisted tasks, but generalizable matching remains challenging because of poor visual conditions and a scarcity of annotated data. Method: The paper proposes EndoMatcher, which uses a two-branch Vision Transformer with dual interaction blocks to extract multi-scale features, and constructs the Endo-Mix6 dataset for multi-objective training to improve matching. Result: EndoMatcher increases the number of inlier matches by 140.69% and 201.43% on the Hamlyn and Bladder datasets, respectively, and improves the Matching Direction Prediction Accuracy (MDPA) by 9.40% on the Gastro-Matching dataset. Conclusion: EndoMatcher achieves zero-shot dense feature matching across different organs and imaging conditions; constructing Endo-Mix6 and adopting a progressive multi-objective training strategy improve matching accuracy and generalization. Abstract: Generalizable dense feature matching in endoscopic images is crucial for robot-assisted tasks, including 3D reconstruction, navigation, and surgical scene understanding. Yet, it remains a challenge due to difficult visual conditions (e.g., weak textures, large viewpoint variations) and a scarcity of annotated data. To address these challenges, we propose EndoMatcher, a generalizable endoscopic image matcher via large-scale, multi-domain data pre-training. To address difficult visual conditions, EndoMatcher employs a two-branch Vision Transformer to extract multi-scale features, enhanced by dual interaction blocks for robust correspondence learning. To overcome data scarcity and improve domain diversity, we construct Endo-Mix6, the first multi-domain dataset for endoscopic matching. Endo-Mix6 consists of approximately 1.2M real and synthetic image pairs across six domains, with correspondence labels generated using Structure-from-Motion and simulated transformations. The diversity and scale of Endo-Mix6 introduce new challenges in training stability due to significant variations in dataset sizes, distribution shifts, and error imbalance. To address them, a progressive multi-objective training strategy is employed to promote balanced learning and improve representation quality across domains. This enables EndoMatcher to generalize across unseen organs and imaging conditions in a zero-shot fashion.
Extensive zero-shot matching experiments demonstrate that EndoMatcher increases the number of inlier matches by 140.69% and 201.43% on the Hamlyn and Bladder datasets over state-of-the-art methods, respectively, and improves the Matching Direction Prediction Accuracy (MDPA) by 9.40% on the Gastro-Matching dataset, achieving dense and accurate matching under challenging endoscopic conditions. The code is publicly available at https://github.com/Beryl2000/EndoMatcher.
[93] VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization
Sihan Yang,Runsen Xu,Chenhang Cui,Tai Wang,Dahua Lin,Jiangmiao Pang
Main category: cs.CV
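TL;DR: VFlowOpt is an efficient visual token pruning framework that uses an importance map and a progressive pruning mechanism to substantially reduce computational cost while maintaining performance.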
Details
Motivation: Existing visual token pruning methods are simplistic and ineffective, causing performance degradation, so a more effective pruning framework is needed. Method: The framework introduces an importance map based on attention and information entropy, together with a progressive pruning module and a recycling mechanism, and optimizes its hyperparameters with a visual information flow-guided method. Result: VFlowOpt can prune 90% of visual tokens while maintaining performance, reducing KV-Cache memory by 89% and making inference 3.8 times faster. Conclusion: VFlowOpt effectively reduces the number of visual tokens in vision-language models while maintaining performance, significantly lowering computational cost and memory usage. Abstract: Large Multimodal Models (LMMs) excel in visual-language tasks by leveraging numerous visual tokens for fine-grained visual information, but this token redundancy results in significant computational costs. Previous research aimed at reducing visual tokens during inference typically leverages importance maps derived from attention scores among vision-only tokens or vision-language tokens to prune tokens across one or multiple pruning stages. Despite this progress, pruning frameworks and strategies remain simplistic and insufficiently explored, often resulting in substantial performance degradation. In this paper, we propose VFlowOpt, a token pruning framework that introduces an importance map derivation process and a progressive pruning module with a recycling mechanism. The hyperparameters of its pruning strategy are further optimized by a visual information flow-guided method. Specifically, we compute an importance map for image tokens based on their attention-derived context relevance and patch-level information entropy. We then decide which tokens to retain or prune and aggregate the pruned ones as recycled tokens to avoid potential information loss. Finally, we apply a visual information flow-guided method that regards the last token in the LMM as the most representative signal of text-visual interactions. This method minimizes the discrepancy between token representations in LMMs with and without pruning, thereby enabling superior pruning strategies tailored to different LMMs.
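The pruning-with-recycling step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `prune_with_recycling`, the min-max normalization, the convex combination weighted by `alpha`, the single pruning stage, and the importance-weighted averaging of pruned tokens are all assumptions made for illustration, since the abstract does not specify how the two cues are combined or how recycled tokens are aggregated.

```python
from typing import List, Sequence, Tuple


def _min_max(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def prune_with_recycling(
    tokens: Sequence[Sequence[float]],
    relevance: Sequence[float],
    entropy: Sequence[float],
    keep_ratio: float = 0.1,
    alpha: float = 0.5,
) -> Tuple[List[List[float]], List[int]]:
    """Keep the most important tokens and merge the rest into one token.

    Hypothetical single-stage variant: `relevance` stands in for the
    attention-derived context relevance and `entropy` for the patch-level
    information entropy of each visual token.
    """
    n = len(tokens)
    if n == 0 or len(relevance) != n or len(entropy) != n:
        raise ValueError("tokens, relevance and entropy must share a non-zero length")

    # Assumed importance map: convex combination of the two normalized cues.
    rel, ent = _min_max(relevance), _min_max(entropy)
    importance = [alpha * r + (1.0 - alpha) * e for r, e in zip(rel, ent)]

    # Retain the top-k tokens, restoring their original order.
    k = max(1, int(n * keep_ratio))
    ranked = sorted(range(n), key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:k])
    pruned = ranked[k:]

    out = [list(tokens[i]) for i in kept]
    if pruned:
        # Assumed recycling rule: importance-weighted mean of pruned tokens.
        weights = [importance[i] for i in pruned]
        total = sum(weights)
        if total == 0.0:
            weights, total = [1.0] * len(pruned), float(len(pruned))
        dim = len(tokens[0])
        recycled = [
            sum(w * tokens[i][d] for w, i in zip(weights, pruned)) / total
            for d in range(dim)
        ]
        out.append(recycled)
    return out, kept
```

With `keep_ratio=0.1` the sketch retains roughly 10% of the tokens plus one recycled token, which mirrors the 90% pruning rate reported in the abstract only in spirit; the progressive multi-stage schedule and the flow-guided hyperparameter search are not modeled here.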
Experiments demonstrate that VFlowOpt can prune 90% of visual tokens while maintaining comparable performance, leading to an 89% reduction in KV-Cache memory and 3.8 times faster inference.
[94] Textual and Visual Guided Task Adaptation for Source-Free Cross-Domain Few-Shot Segmentation
Jianming Liu,Wenlong Qiu,Haitao Wei
Main category: cs.CV
TL;DR: This paper introduces a source-free Cross-Domain Few-Shot Segmentation method that uses textual and visual information for adaptation, achieving significant improvements in segmentation accuracy.
Details
Motivation: The work addresses performance degradation in Few-Shot Segmentation caused by domain discrepancies, together with growing concerns over data privacy and training costs. Method: The method appends Task-Specific Attention Adapters (TSAA) to a pretrained backbone and trains the TSAA parameters using Visual-Visual Embedding Alignment (VVEA) and Text-Visual Embedding Alignment (TVEA) modules. Result: Under the 1-shot and 5-shot settings, the approach achieves average segmentation accuracy improvements of 2.18% and 4.11%, respectively, across four cross-domain datasets. Conclusion: The paper proposes a source-free CD-FSS method that uses textual and visual information for target domain task adaptation, significantly outperforming state-of-the-art CD-FSS methods. Abstract: Few-Shot Segmentation (FSS) aims to efficiently segment new objects with few labeled samples. However, its performance significantly degrades when domain discrepancies exist between training and deployment. Cross-Domain Few-Shot Segmentation (CD-FSS) is proposed to mitigate such performance degradation. Current CD-FSS methods primarily seek to develop segmentation models on a source domain capable of cross-domain generalization. However, driven by escalating concerns over data privacy and the imperative to minimize data transfer and training expenses, the development of source-free CD-FSS approaches has become essential. In this work, we propose a source-free CD-FSS method that leverages both textual and visual information to facilitate target domain task adaptation without requiring source domain data. Specifically, we first append Task-Specific Attention Adapters (TSAA) to the feature pyramid of a pretrained backbone, which adapt multi-level features extracted from the shared pre-trained backbone to the target task.
Then, the parameters of the TSAA are trained through a Visual-Visual Embedding Alignment (VVEA) module and a Text-Visual Embedding Alignment (TVEA) module. The VVEA module utilizes global-local visual features to align image features across different views, while the TVEA module leverages textual priors from pre-aligned multi-modal features (e.g., from CLIP) to guide cross-modal adaptation. By combining the outputs of these modules through dense comparison operations and subsequent fusion via skip connections, our method produces refined prediction masks. Under both 1-shot and 5-shot settings, the proposed approach achieves average segmentation accuracy improvements of 2.18% and 4.11%, respectively, across four cross-domain datasets, significantly outperforming state-of-the-art CD-FSS methods. Code is available at https://github.com/ljm198134/TVGTANet.
[95] ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking
Xiao Wang,Liye Jin,Xufeng Lou,Shiao Wang,Lan Chen,Bo Jiang,Zhipeng Zhang
Main category: cs.CV
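TL;DR: This paper proposes ReasoningTrack, a new reasoning-based vision-language tracking framework that combines a pre-trained vision-language model with reinforcement learning to optimize language generation, and builds TNLLT, a large-scale long-term vision-language tracking dataset; experiments validate the effectiveness of the method.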
Details
Motivation: Existing vision-language tracking methods are limited when the target changes and do not fully exploit the advantages of large models; this paper aims to address these issues. Method: Built on the pre-trained vision-language model Qwen2.5-VL, the method uses SFT (Supervised Fine-Tuning) and the reinforcement learning algorithm GRPO to optimize reasoning and language generation, feeds the updated language descriptions together with vision features into a unified tracking backbone network, and uses a tracking head to predict the target location. Result: Experiments on multiple vision-language tracking benchmark datasets validate the effectiveness of the proposed method; the authors also build TNLLT, a large-scale dataset of 200 video sequences, and re-train and evaluate 20 baseline visual trackers on it. Conclusion: The paper proposes ReasoningTrack, a reasoning-based vision-language tracking framework, and builds TNLLT, a large-scale long-term vision-language tracking benchmark dataset; experiments demonstrate the effectiveness of the method. Abstract: Vision-language tracking has received increasing attention in recent years, as textual information can effectively address the inflexibility and inaccuracy associated with specifying the target object to be tracked. Existing works either directly fuse the fixed language with vision features or simply modify using attention; however, their performance is still limited. Recently, some researchers have explored using text generation to adapt to the variations in the target during tracking; however, these works fail to provide insights into the model's reasoning process and do not fully leverage the advantages of large models, which further limits their overall performance. To address the aforementioned issues, this paper proposes a novel reasoning-based vision-language tracking framework, named ReasoningTrack, based on a pre-trained vision-language model Qwen2.5-VL. Both SFT (Supervised Fine-Tuning) and reinforcement learning GRPO are used for the optimization of reasoning and language generation. We embed the updated language descriptions and feed them into a unified tracking backbone network together with vision features. Then, we adopt a tracking head to predict the specific location of the target object. In addition, we propose a large-scale long-term vision-language tracking benchmark dataset, termed TNLLT, which contains 200 video sequences. 20 baseline visual trackers are re-trained and evaluated on this dataset, which builds a solid foundation for the vision-language visual tracking task.
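For readers unfamiliar with GRPO, the core idea of scoring each sampled response relative to its own group can be sketched in a few lines of Python. This is a generic illustration of the group-relative advantage, not ReasoningTrack's training code: the function name `group_relative_advantages`, the use of the population standard deviation, the `eps` stabilizer, and the example rewards are assumptions, and the abstract does not say how the paper defines its rewards.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize the rewards of one group of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so no separate
    value network is needed. Using the population standard deviation is an
    assumption; some implementations use the sample standard deviation.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four generated target descriptions of one frame.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses that score above the group average receive positive advantages and are reinforced, while below-average ones are discouraged; when all rewards in a group are equal, every advantage is zero and the group contributes no gradient signal.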
Extensive experiments on multiple vision-language tracking benchmark datasets fully validated the effectiveness of our proposed reasoning-based natural language generation strategy. The source code of this paper will be released on https://github.com/Event-AHU/Open_VLTrack
[96] Segmenting the Complex and Irregular in Two-Phase Flows: A Real-World Empirical Study with SAM2
Semanur Küçük,Cosimo Della Santina,Angeliki Laskari
Main category: cs.CV
TL;DR: This paper shows that a fine-tuned Segment Anything Model (SAM v2.1) can accurately segment complex, non-spherical bubble structures in multiphase flows using a small dataset, addressing a key challenge in industrial applications where traditional methods fall short.
Details
Motivation: The motivation behind this study is the challenge of accurately segmenting gas bubbles in multiphase flows, particularly in industrial applications like metallurgical processing and maritime drag reduction, where bubbles often undergo deformation, coalescence, or breakup, forming amorphous and topologically diverse patches. Method: The researchers approached the problem as a transfer learning task, utilizing the Segment Anything Model (SAM v2.1) and fine-tuning it with a limited dataset of 100 annotated images to segment complex bubble structures. Result: The proposed method demonstrated that a fine-tuned SAM v2.1 model can accurately segment highly non-convex, irregular bubble structures using a small dataset of only 100 annotated images, marking a significant improvement over traditional and recent learning-based approaches. Conclusion: The study concludes that using a fine-tuned Segment Anything Model (SAM v2.1) can effectively segment highly non-convex and irregular bubble structures in multiphase flows, overcoming the limitations of traditional and learning-based methods that assume near-spherical bubble shapes. Abstract: Segmenting gas bubbles in multiphase flows is a critical yet unsolved challenge in numerous industrial settings, from metallurgical processing to maritime drag reduction. Traditional approaches, and most recent learning-based methods, assume near-spherical shapes, limiting their effectiveness in regimes where bubbles undergo deformation, coalescence, or breakup. This complexity is particularly evident in air lubrication systems, where coalesced bubbles form amorphous and topologically diverse patches. In this work, we revisit the problem through the lens of modern vision foundation models.
We cast the task as a transfer learning problem and demonstrate, for the first time, that a fine-tuned Segment Anything Model (SAM v2.1) can accurately segment highly non-convex, irregular bubble structures using as few as 100 annotated images.
[97] ArbiViewGen: Controllable Arbitrary Viewpoint Camera Data Generation for Autonomous Driving via Stable Diffusion Models
Yatong Lan,Jingfeng Chen,Yiru Wang,Lei He
Main category: cs.CV
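TL;DR: The paper proposes Arbiviewgen, a new method that uses diffusion models and self-supervised learning to generate controllable arbitrary-viewpoint images without ground-truth data, targeting autonomous driving.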
Details
Motivation: Controllable viewpoint image generation for autonomous driving is challenging because ground-truth data for extrapolated views is lacking, so a high-fidelity generative model that needs no ground-truth supervision is required. Method: The paper proposes Arbiviewgen, a diffusion-based framework with two key components: Feature-Aware Adaptive View Stitching (FAVS) and Cross-View Consistency Self-Supervised Learning (CVC-SSL). Result: Arbiviewgen requires only multi-camera images and their associated poses to generate high-fidelity arbitrary-viewpoint images, and it performs well on cross-view consistency. Conclusion: Arbiviewgen is the first method capable of controllable arbitrary-viewpoint camera image generation across multiple vehicle configurations, and it can be trained without additional sensors or depth maps. Abstract: Arbitrary viewpoint image generation holds significant potential for autonomous driving, yet remains a challenging task due to the lack of ground-truth data for extrapolated views, which hampers the training of high-fidelity generative models. In this work, we propose Arbiviewgen, a novel diffusion-based framework for the generation of controllable camera images from arbitrary points of view. To address the absence of ground-truth data in unseen views, we introduce two key components: Feature-Aware Adaptive View Stitching (FAVS) and Cross-View Consistency Self-Supervised Learning (CVC-SSL). FAVS employs a hierarchical matching strategy that first establishes coarse geometric correspondences using camera poses, then performs fine-grained alignment through improved feature matching algorithms, and identifies high-confidence matching regions via clustering analysis. Building upon this, CVC-SSL adopts a self-supervised training paradigm where the model reconstructs the original camera views from the synthesized stitched images using a diffusion model, enforcing cross-view consistency without requiring supervision from extrapolated data. Our framework requires only multi-camera images and their associated poses for training, eliminating the need for additional sensors or depth maps. To our knowledge, Arbiviewgen is the first method capable of controllable arbitrary view camera image generation in multiple vehicle configurations.
[98] Navigating the Trade-off: A Synthesis of Defensive Strategies for Zero-Shot Adversarial Robustness in Vision-Language Models
Zane Xu,Jason Sun
Main category: cs.CV
TL;DR: This report analyzes eight papers on improving the adversarial robustness of vision-language models (VLMs) like CLIP while maintaining zero-shot generalization. It compares parameter-modifying and training-free defense strategies and outlines future research directions.
Details
Motivation: The central challenge is balancing adversarial robustness with the preservation of zero-shot generalization capabilities in VLMs, which requires a systematic analysis of existing defense paradigms. Method: The report synthesizes findings from eight seminal papers on VLM robustness, analyzing two defense paradigms: Adversarial Fine-Tuning (AFT) and Training-Free/Test-Time Defenses. It reviews the evolution of techniques from alignment-preserving methods to embedding space re-engineering and latent-space purification. Result: Key defense strategies were categorized into parameter-modifying (e.g., AFT) and parameter-preserving (e.g., input heuristics to latent purification) approaches. The evolution of methods from TeCoA to LAAT, TIMA, and CLIPure was traced, highlighting progress in balancing robustness with generalization. Conclusion: The analysis identifies the need for hybrid defense strategies and adversarial pre-training as future directions to enhance robustness while maintaining the unique capabilities of VLMs. Abstract: This report synthesizes eight seminal papers on the zero-shot adversarial robustness of vision-language models (VLMs) like CLIP. A central challenge in this domain is the inherent trade-off between enhancing adversarial robustness and preserving the model's zero-shot generalization capabilities. We analyze two primary defense paradigms: Adversarial Fine-Tuning (AFT), which modifies model parameters, and Training-Free/Test-Time Defenses, which preserve them. We trace the evolution from alignment-preserving methods (TeCoA) to embedding space re-engineering (LAAT, TIMA), and from input heuristics (AOM, TTC) to latent-space purification (CLIPure). Finally, we identify key challenges and future directions including hybrid defense strategies and adversarial pre-training.
[99] RegionMed-CLIP: A Region-Aware Multimodal Contrastive Learning Pre-trained Model for Medical Image Understanding
Tianchen Fang,Guiru Liu
Main category: cs.CV
TL;DR: RegionMed-CLIP is a region-aware multimodal contrastive learning framework designed to improve medical image understanding by incorporating localized pathological signals and holistic semantic representations.
Details
Motivation: The progress of medical image understanding is impeded by the limited availability of high-quality annotated medical data and an overreliance on global image features that miss subtle pathological regions. Method: The method involves an ROI processor that integrates regional features with global context, supported by a progressive training strategy. It utilizes the MedRegion-500k corpus for region-level representation learning. Result: Experiments showed that RegionMed-CLIP outperforms state-of-the-art vision language models in tasks like image-text retrieval, zero-shot classification, and visual question answering. Conclusion: Region-aware contrastive pre-training is crucial for medical image understanding, positioning RegionMed-CLIP as a robust foundation for advancing this field. Abstract: Medical image understanding plays a crucial role in enabling automated diagnosis and data-driven clinical decision support. However, its progress is impeded by two primary challenges: the limited availability of high-quality annotated medical data and an overreliance on global image features, which often miss subtle but clinically significant pathological regions. To address these issues, we introduce RegionMed-CLIP, a region-aware multimodal contrastive learning framework that explicitly incorporates localized pathological signals along with holistic semantic representations. The core of our method is an innovative region-of-interest (ROI) processor that adaptively integrates fine-grained regional features with the global context, supported by a progressive training strategy that enhances hierarchical multimodal alignment. To enable large-scale region-level representation learning, we construct MedRegion-500k, a comprehensive medical image-text corpus that features extensive regional annotations and multilevel clinical descriptions. 
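As background for the contrastive pre-training mentioned here, the following Python sketch shows the standard CLIP-style symmetric contrastive (InfoNCE) objective over a batch of paired image and text embeddings. It is a generic illustration rather than RegionMed-CLIP's actual loss: the function name `symmetric_contrastive_loss`, the temperature of 0.07, and the toy embeddings are assumptions, and the ROI processor and progressive training strategy described in the abstract are not modeled.

```python
import math
from typing import List, Sequence


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits: Sequence[Sequence[float]]) -> float:
    """Mean cross-entropy where row i should select column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(
    image_embs: Sequence[Sequence[float]],
    text_embs: Sequence[Sequence[float]],
    temperature: float = 0.07,
) -> float:
    """Average of image-to-text and text-to-image InfoNCE losses.

    Row i of `image_embs` is assumed to be paired with row i of `text_embs`.
    """
    if len(image_embs) != len(text_embs) or not image_embs:
        raise ValueError("need the same non-zero number of image and text embeddings")
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    # Cosine similarity matrix scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


# Hypothetical toy batch: correctly paired embeddings give a near-zero loss.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

In a region-aware variant, the image embedding fed to this loss would presumably combine regional and global features, but how that fusion is done is specific to the paper and is not shown here.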
Extensive experiments on image-text retrieval, zero-shot classification, and visual question answering tasks demonstrate that RegionMed-CLIP consistently exceeds state-of-the-art vision language models by a wide margin. Our results highlight the critical importance of region-aware contrastive pre-training and position RegionMed-CLIP as a robust foundation for advancing multimodal medical image understanding.
[100] A Study of Gender Classification Techniques Based on Iris Images: A Deep Survey and Analysis
Basna Mohammed Salih Hasan,Ramadhan J. Mstafa
Main category: cs.CV
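TL;DR: This paper surveys gender classification methods, focusing on face-based and iris-based techniques, discusses the strengths and weaknesses of existing approaches, and offers suggestions for future research.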
Details
Motivation: Gender classification is attractive in applications such as surveillance, corporate profiling, and human-computer interaction, because information about gender can help infer an individual's identity, and the iris is an important biometric trait that is stable and non-invasive. Method: The study reviews the gender classification approaches in the literature and analyzes the methods used at different steps. Result: The study summarizes existing gender classification methods, particularly those based on facial features and iris texture, and discusses their potential and challenges in practical applications. Conclusion: The study surveys gender classification methods, provides knowledge and analysis of existing approaches, identifies gaps and challenges in the field, and proposes improvements and future research directions. Abstract: Gender classification is attractive in a range of applications, including surveillance and monitoring, corporate profiling, and human-computer interaction. Individuals' identities may be gleaned from information about their gender, which is a kind of soft biometric. Over the years, several methods for determining a person's gender have been devised. Some of the most well-known ones are based on physical characteristics like face, fingerprint, palmprint, DNA, ears, gait, and iris. On the other hand, facial features account for the vast majority of gender classification methods. Also, the iris is a significant biometric trait because the iris, according to research, remains basically constant during an individual's life. Besides that, the iris is externally visible and is non-invasive to the user, which is important for practical applications. Furthermore, there are already high-quality methods for segmenting and encoding iris images, and the current methods facilitate selecting and extracting attribute vectors from iris textures. This study discusses several approaches to determining gender. The previous works of literature are briefly reviewed. Additionally, there are a variety of methodologies for different steps of gender classification. This study provides researchers with knowledge and analysis of the existing gender classification approaches. Also, it will assist researchers who are interested in this specific area, as well as highlight the gaps and challenges in the field, and finally provide suggestions and future paths for improvement.
[101] CF3: Compact and Fast 3D Feature Fields
Hyunjoon Lee,Joonkyu Min,Jaesik Park
Main category: cs.CV
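TL;DR: The paper proposes CF3, an efficient method for constructing 3D Gaussian feature fields that uses as little as 5% of the Gaussians compared to Feature-3DGS, yielding a more compact and faster 3D feature field representation.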
Details
Motivation: To address the reliance of existing 3D Gaussian Splatting approaches on a computationally expensive bottom-up optimization process, the paper proposes a new top-down approach. Method: It first performs a fast weighted fusion of multi-view 2D features and lifts them onto pre-trained Gaussians; it then trains a per-Gaussian autoencoder directly on the lifted features and introduces an adaptive sparsification method that optimizes Gaussian attributes while pruning and merging redundant Gaussians. Result: Compared with existing methods, the approach significantly reduces the number of Gaussians while preserving geometric details, yielding an efficient 3D feature field representation. Conclusion: CF3 is an efficient method for constructing 3D Gaussian feature fields; feature lifting and adaptive sparsification significantly reduce computational cost while keeping the autoencoder aligned with the feature distribution. Abstract: 3D Gaussian Splatting (3DGS) has begun incorporating rich information from 2D foundation models. However, most approaches rely on a bottom-up optimization process that treats raw 2D features as ground truth, incurring increased computational costs. We propose a top-down pipeline for constructing compact and fast 3D Gaussian feature fields, namely, CF3. We first perform a fast weighted fusion of multi-view 2D features with pre-trained Gaussians. This approach enables training a per-Gaussian autoencoder directly on the lifted features, instead of training autoencoders in the 2D domain. As a result, the autoencoder better aligns with the feature distribution. More importantly, we introduce an adaptive sparsification method that optimizes the Gaussian attributes of the feature field while pruning and merging the redundant Gaussians, constructing an efficient representation with preserved geometric details. Our approach achieves a competitive 3D feature field using as little as 5% of the Gaussians compared to Feature-3DGS.
[102] Robust Tracking with Particle Filtering for Fluorescent Cardiac Imaging
Suresh Guttikonda,Maximilian Neidhart,Johanna Sprenger,Johannes Petersen,Christian Detter,Alexander Schlaefer
Main category: cs.CV
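TL;DR: This paper proposes a new particle filtering tracker based on cyclic consistency checks for fluorescent cardiac imaging, offering high tracking accuracy and real-time performance.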
Details
Motivation: Heart motion and significant fluctuations in image characteristics caused by vessel structural enrichment limit traditional tracking methods in fluorescent cardiac imaging. Method: The paper proposes a particle filtering tracker based on cyclic consistency checks to track cardiac feature points. Result: The method tracks 117 targets simultaneously at 25.4 fps with a tracking error of (5.00 +/- 0.22 px), outperforming other deep learning trackers (22.3 +/- 1.1 px) and conventional trackers (58.1 +/- 27.1 px). Conclusion: Using cyclic consistency checks, the particle filtering tracker robustly tracks particles sampled to follow target landmarks and outperforms other deep learning and conventional trackers. Abstract: Intraoperative fluorescent cardiac imaging enables quality control following coronary bypass grafting surgery. We can estimate local quantitative indicators, such as cardiac perfusion, by tracking local feature points. However, heart motion and significant fluctuations in image characteristics caused by vessel structural enrichment limit traditional tracking methods. We propose a particle filtering tracker based on cyclic consistency checks to robustly track particles sampled to follow target landmarks. Our method tracks 117 targets simultaneously at 25.4 fps, allowing real-time estimates during interventions. It achieves a tracking error of (5.00 +/- 0.22 px) and outperforms other deep learning trackers (22.3 +/- 1.1 px) and conventional trackers (58.1 +/- 27.1 px).
[103] SGDFuse: SAM-Guided Diffusion for High-Fidelity Infrared and Visible Image Fusion
Xiaoyang Zhang,Zhen Hua,Yakun Ju,Wei Zhou,Jun Liu,Alex C. Kot
Main category: cs.CV
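TL;DR: This paper proposes SGDFuse, which combines semantic masks generated by SAM with a conditional diffusion model to achieve high-fidelity, semantically aware infrared and visible image fusion; it addresses the loss of key targets and details, and experiments show that it outperforms existing methods.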
Details
Motivation: Existing infrared and visible image fusion methods lack a deep semantic understanding of the scene, so key targets are lost, and the fusion process itself tends to introduce artifacts and detail loss, harming image quality and task performance. Method: The paper proposes SGDFuse, which uses semantic masks generated by SAM as explicit priors to guide a diffusion model in optimizing the fusion process. It adopts a two-stage framework: the first stage performs a preliminary fusion of multi-modal features, and the second stage uses the semantic masks together with the preliminary fused image to drive the diffusion model's coarse-to-fine denoising generation. Result: SGDFuse achieves state-of-the-art performance in subjective and objective evaluations and in adaptability to downstream tasks, delivering high-fidelity, semantically aware image fusion and resolving key problems of existing methods. Conclusion: By combining high-quality semantic masks generated by SAM with a conditional diffusion model, SGDFuse achieves high-fidelity, semantically aware image fusion and addresses the problems of key target preservation and detail loss in existing methods. Experiments show that SGDFuse achieves state-of-the-art performance in subjective and objective evaluations and in adaptability to downstream tasks. Abstract: Infrared and visible image fusion (IVIF) aims to combine the thermal radiation information from infrared images with the rich texture details from visible images to enhance perceptual capabilities for downstream visual tasks. However, existing methods often fail to preserve key targets due to a lack of deep semantic understanding of the scene, while the fusion process itself can also introduce artifacts and detail loss, severely compromising both image quality and task performance. To address these issues, this paper proposes SGDFuse, a conditional diffusion model guided by the Segment Anything Model (SAM), to achieve high-fidelity and semantically-aware image fusion. The core of our method is to utilize high-quality semantic masks generated by SAM as explicit priors to guide the optimization of the fusion process via a conditional diffusion model. Specifically, the framework operates in a two-stage process: it first performs a preliminary fusion of multi-modal features, and then utilizes the semantic masks from SAM jointly with the preliminary fused image as a condition to drive the diffusion model's coarse-to-fine denoising generation. This ensures the fusion process not only has explicit semantic directionality but also guarantees the high fidelity of the final result. Extensive experiments demonstrate that SGDFuse achieves state-of-the-art performance in both subjective and objective evaluations, as well as in its adaptability to downstream tasks, providing a powerful solution to the core challenges in image fusion.
The code of SGDFuse is available at https://github.com/boshizhang123/SGDFuse.
[104] B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding
Changho Choi,Youngwoo Shin,Gyojin Han,Dong-Jae Lee,Junmo Kim
Main category: cs.CV
TL;DR: The paper introduces B4DL, a new benchmark and model for processing 4D LiDAR with Multimodal Large Language Models, offering a solution for understanding dynamic outdoor environments.
Details
Motivation: The motivation stems from the underexplored nature of 4D LiDAR in the context of Multimodal Large Language Models due to the lack of high-quality annotations and suitable MLLM architectures, despite its potential in representing real-world scenes with precise spatial geometry and rich temporal cues. Method: The method involves the introduction of a new benchmark (B4DL) for training and evaluating MLLMs on 4D LiDAR understanding, a scalable data generation pipeline, and an MLLM model designed to directly process raw 4D LiDAR data. Result: The result is the successful development of a new benchmark, a scalable data generation pipeline, and an MLLM model that directly processes raw 4D LiDAR. This combination provides a comprehensive solution for understanding dynamic outdoor environments through spatio-temporal reasoning. Conclusion: The paper concludes that by introducing B4DL, a new benchmark, along with a scalable data generation pipeline and an MLLM model capable of processing raw 4D LiDAR, they offer a unified solution for spatio-temporal reasoning in dynamic outdoor environments. Abstract: Understanding dynamic outdoor environments requires capturing complex object interactions and their evolution over time. LiDAR-based 4D point clouds provide precise spatial geometry and rich temporal cues, making them ideal for representing real-world scenes. However, despite their potential, 4D LiDAR remains underexplored in the context of Multimodal Large Language Models (MLLMs) due to the absence of high-quality, modality-specific annotations and the lack of MLLM architectures capable of processing its high-dimensional composition. To address these challenges, we introduce B4DL, a new benchmark specifically designed for training and evaluating MLLMs on 4D LiDAR understanding. In addition, we propose a scalable data generation pipeline and an MLLM model that, for the first time, directly processes raw 4D LiDAR by bridging it with language understanding. 
Combined with our dataset and benchmark, our model offers a unified solution for spatio-temporal reasoning in dynamic outdoor environments. We provide rendered 4D LiDAR videos, generated dataset, and inference outputs on diverse scenarios at: https://mmb4dl.github.io/mmb4dl/
[105] Wavelet-Guided Dual-Frequency Encoding for Remote Sensing Change Detection
Xiaoyang Zhang,Guodong Fan,Guang-Yong Chen,Zhen Hua,Jinjiang Li,Min Gan,C. L. Philip Chen
Main category: cs.CV
TL;DR: This paper proposes a new method for change detection in remote sensing imagery using frequency-domain modeling, which improves edge clarity and detection accuracy.
Details
Motivation: To overcome the limitations of spatial-domain modeling in detecting subtle change regions in remote sensing imagery. Method: Wavelet-Guided Dual-Frequency Encoding (WGDF) using Discrete Wavelet Transform (DWT), Dual-Frequency Feature Enhancement (DFFE), Frequency-Domain Interactive Difference (FDID), and Progressive Contextual Difference Module (PCDM). Result: Superior detection accuracy and robustness on multiple remote sensing datasets. Conclusion: The proposed WGDF method significantly improves edge clarity and achieves better detection accuracy and robustness compared to existing methods. Abstract: Change detection in remote sensing imagery plays a vital role in various engineering applications, such as natural disaster monitoring, urban expansion tracking, and infrastructure management. Despite the remarkable progress of deep learning in recent years, most existing methods still rely on spatial-domain modeling, where the limited diversity of feature representations hinders the detection of subtle change regions. We observe that frequency-domain feature modeling, particularly in the wavelet domain, can amplify fine-grained differences in frequency components, enhancing the perception of edge changes that are challenging to capture in the spatial domain. Thus, we propose a method called Wavelet-Guided Dual-Frequency Encoding (WGDF). Specifically, we first apply Discrete Wavelet Transform (DWT) to decompose the input images into high-frequency and low-frequency components, which are used to model local details and global structures, respectively. In the high-frequency branch, we design a Dual-Frequency Feature Enhancement (DFFE) module to strengthen edge detail representation and introduce a Frequency-Domain Interactive Difference (FDID) module to enhance the modeling of fine-grained changes.
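As a rough illustration of the decomposition step described above, the sketch below applies a single-level 2D Haar transform to a small grayscale image and separates the low-frequency approximation from three high-frequency detail bands. The Haar wavelet, the averaging normalization, and the name `haar_dwt2` are assumptions made for illustration; the abstract does not say which wavelet or normalization WGDF uses.

```python
def haar_dwt2(img):
    """Single-level 2D Haar transform of a grayscale image given as a list of rows.

    Returns (low, horiz, vert, diag), each half the input size:
      low   - local 2x2 averages (coarse, global structure)
      horiz - differences between left and right columns (responds to vertical edges)
      vert  - differences between top and bottom rows (responds to horizontal edges)
      diag  - diagonal differences
    The divide-by-4 normalization is an illustrative choice, not the paper's.
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("image height and width must be even")
    low, horiz, vert, diag = [], [], [], []
    for i in range(0, h, 2):
        rows = ([], [], [], [])
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]          # top-left, top-right
            c, d = img[i + 1][j], img[i + 1][j + 1]  # bottom-left, bottom-right
            rows[0].append((a + b + c + d) / 4)
            rows[1].append((a - b + c - d) / 4)
            rows[2].append((a + b - c - d) / 4)
            rows[3].append((a - b - c + d) / 4)
        for band, row in zip((low, horiz, vert, diag), rows):
            band.append(row)
    return low, horiz, vert, diag


# A 4x4 image with a vertical edge between columns 0 and 1.
image = [[0, 10, 10, 10] for _ in range(4)]
low, horiz, vert, diag = haar_dwt2(image)
```

In this toy image the edge shows up only in the `horiz` band, while `low` keeps a blurred copy of the scene, which is the kind of split between local detail and global structure that the two branches are described as operating on.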
In the low-frequency branch, we exploit Transformers to capture global semantic relationships and employ a Progressive Contextual Difference Module (PCDM) to progressively refine change regions, enabling precise structural semantic characterization. Finally, the high- and low-frequency features are synergistically fused to unify local sensitivity with global discriminability. Extensive experiments on multiple remote sensing datasets demonstrate that WGDF significantly alleviates edge ambiguity and achieves superior detection accuracy and robustness compared to state-of-the-art methods. The code will be available at https://github.com/boshizhang123/WGDF.
[106] VS-LLM: Visual-Semantic Depression Assessment based on LLM for Drawing Projection Test
Meiqi Wu,Yaxuan Kang,Xuchen Li,Shiyu Hu,Xiaotang Chen,Yunfeng Kang,Weiqiang Wang,Kaiqi Huang
Main category: cs.CV
TL;DR: This paper proposes a new automated method, VS-LLM, for analyzing PPAT sketches to assess depression, showing a 17.6% improvement over traditional methods and offering a promising tool for large-scale mental state evaluation.
Details
Motivation: The motivation of the study is to address the laborious and experience-dependent nature of interpreting PPAT sketches for depression assessment, aiming to support psychologists with a large-scale automated analysis tool. Method: The study introduces the VS-LLM method for visual-semantic depression assessment, designed to automate the interpretation of PPAT sketches by analyzing elements like color usage and space utilization, and was validated through experimental comparison with traditional psychologist assessment methods. Result: The proposed VS-LLM method demonstrated a 17.6% improvement in performance compared to traditional psychologist assessment methods in analyzing PPAT sketches for depression. Conclusion: The study concludes that the proposed Visual-Semantic depression assessment based on LLM (VS-LLM) method effectively enhances the automated analysis of PPAT sketches, offering a promising contribution to mental state assessment research. Abstract: The Drawing Projection Test (DPT) is an essential tool in art therapy, allowing psychologists to assess participants' mental states through their sketches. Specifically, through sketches with the theme of "a person picking an apple from a tree (PPAT)", it can be revealed whether the participants are in mental states such as depression. Compared with scales, the DPT can enrich psychologists' understanding of an individual's mental state. However, the interpretation of the PPAT is laborious and depends on the experience of the psychologists. To address this issue, we propose an effective identification method to support psychologists in conducting a large-scale automatic DPT. Unlike traditional sketch recognition, the DPT focuses more on the overall evaluation of the sketches, such as color usage and space utilization. Moreover, PPAT imposes a time limit and prohibits verbal reminders, resulting in low drawing accuracy and a lack of detailed depiction.
To address these challenges, we make the following contributions: (1) we provide an experimental environment for automated analysis of PPAT sketches for depression assessment; (2) we offer a Visual-Semantic depression assessment method based on LLM (VS-LLM); (3) we show experimentally that our method improves by 17.6% over the psychologist assessment method. We anticipate that this work will contribute to research on mental state assessment based on element recognition in PPAT sketches. Our datasets and codes are available at https://github.com/wmeiqi/VS-LLM.
[107] CoCAViT: Compact Vision Transformer with Robust Global Coordination
Xuyang Wang,Lingjuan Miao,Zhiqiang Zhou
Main category: cs.CV
TL;DR: The paper proposes CoCAViT, a visual backbone that enhances the generalization performance of smaller models across multiple vision tasks while maintaining low computational overhead.
Details
Motivation: The motivation is to address the disproportionately larger performance drop of smaller models on out-of-distribution (OOD) data, aiming to improve their generalization capabilities. Method: The paper identifies architectural bottlenecks in existing efficient models and proposes a Coordinator-patch Cross Attention (CoCA) mechanism to enhance local-global feature modeling. It introduces CoCAViT, which integrates these improvements. Result: CoCAViT achieves 84.0% top-1 accuracy on ImageNet-1K, 52.2 mAP on COCO object detection, and 51.3 mIOU on ADE20K semantic segmentation, with significant improvements on OOD benchmarks compared to competing models. Conclusion: The paper concludes that CoCAViT, a new visual backbone architecture, improves the generalization and robustness of smaller models on various vision tasks while maintaining low latency. Abstract: In recent years, large-scale visual backbones have demonstrated remarkable capabilities in learning general-purpose features from images via extensive pre-training. Concurrently, many efficient architectures have emerged that have performance comparable to that of larger models on in-domain benchmarks. However, we observe that for smaller models, the performance drop on out-of-distribution (OOD) data is disproportionately larger, indicating a deficiency in the generalization performance of existing efficient models. To address this, we identify key architectural bottlenecks and inappropriate design choices that contribute to this issue, retaining robustness for smaller models. To restore the global field of pure window attention, we further introduce a Coordinator-patch Cross Attention (CoCA) mechanism, featuring dynamic, domain-aware global tokens that enhance local-global feature modeling and adaptively capture robust patterns across domains with minimal computational overhead. Integrating these advancements, we present CoCAViT, a novel visual backbone designed for robust real-time visual representation. 
Extensive experiments empirically validate our design. At a resolution of 224×224, CoCAViT-28M achieves 84.0% top-1 accuracy on ImageNet-1K, with significant gains on multiple OOD benchmarks, compared to competing models. It also attains 52.2 mAP on COCO object detection and 51.3 mIOU on ADE20K semantic segmentation, while maintaining low latency.
[108] mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering
Xu Yuan,Liangbo Ning,Wenqi Fan,Qing Li
Main category: cs.CV
TL;DR: This paper proposes mKG-RAG, a RAG framework that incorporates multimodal knowledge graphs to improve performance on visual question answering.
Details
Motivation: Existing RAG-based methods rely on unstructured documents and overlook the structural relationships among knowledge elements, which introduces irrelevant or misleading content and reduces answer accuracy and reliability. Method: The approach uses MLLM-powered keyword extraction and vision-text matching to extract semantically consistent, modality-aligned entities and relationships, builds high-quality multimodal knowledge graphs, and introduces a dual-stage retrieval strategy to improve retrieval efficiency and precision. Result: Comprehensive experiments show that the method significantly outperforms existing methods on knowledge-intensive VQA tasks, setting a new state of the art. Conclusion: The paper proposes mKG-RAG, a retrieval-augmented generation framework based on multimodal knowledge graphs, which significantly outperforms existing methods on knowledge-intensive visual question answering and sets a new SOTA. Abstract: Recently, Retrieval-Augmented Generation (RAG) has been proposed to expand internal knowledge of Multimodal Large Language Models (MLLMs) by incorporating external knowledge databases into the generation process, which is widely used for knowledge-based Visual Question Answering (VQA) tasks. Despite impressive advancements, vanilla RAG-based VQA methods that rely on unstructured documents and overlook the structural relationships among knowledge elements frequently introduce irrelevant or misleading content, reducing answer accuracy and reliability. To overcome these challenges, a promising solution is to integrate multimodal knowledge graphs (KGs) into RAG-based VQA frameworks to enhance the generation by introducing structured multimodal knowledge. Therefore, in this paper, we propose a novel multimodal knowledge-augmented generation framework (mKG-RAG) based on multimodal KGs for knowledge-intensive VQA tasks. Specifically, our approach leverages MLLM-powered keyword extraction and vision-text matching to distill semantically consistent and modality-aligned entities/relationships from multimodal documents, constructing high-quality multimodal KGs as structured knowledge representations. In addition, a dual-stage retrieval strategy equipped with a question-aware multimodal retriever is introduced to improve retrieval efficiency while refining precision. Comprehensive experiments demonstrate that our approach significantly outperforms existing methods, setting a new state-of-the-art for knowledge-based VQA.
[109] Textual Inversion for Efficient Adaptation of Open-Vocabulary Object Detectors Without Forgetting
Frank Ruis,Gertjan Burghouts,Hugo Kuijf
Main category: cs.CV
TL;DR: A new method inspired by Textual Inversion extends the vocabulary of pre-trained vision language models for improved object detection without compromising the original model's performance or requiring extensive computational resources.
Details
Motivation: To overcome the limitations of fine-tuning large pre-trained vision language models (VLMs) which often results in the loss of original natural language querying and zero-shot capabilities. Method: Inspired by Textual Inversion (TI), the method involves learning new tokens or improving existing ones to enhance object detection capabilities without modifying the original VLM weights. Result: The proposed method allows for accurate detection of novel or fine-grained objects from few examples, retains original model performance and zero-shot capabilities, and requires less computational resources compared to full-model fine-tuning. Conclusion: The proposed method successfully extends the VLM vocabulary while maintaining original model performance and capabilities, offering a more efficient alternative to full-model fine-tuning. Abstract: Recent progress in large pre-trained vision language models (VLMs) has reached state-of-the-art performance on several object detection benchmarks and boasts strong zero-shot capabilities, but for optimal performance on specific targets some form of finetuning is still necessary. While the initial VLM weights allow for great few-shot transfer learning, this usually involves the loss of the original natural language querying and zero-shot capabilities. Inspired by the success of Textual Inversion (TI) in personalizing text-to-image diffusion models, we propose a similar formulation for open-vocabulary object detection. TI allows extending the VLM vocabulary by learning new or improving existing tokens to accurately detect novel or fine-grained objects from as little as three examples. The learned tokens are completely compatible with the original VLM weights while keeping them frozen, retaining the original model's benchmark performance, and leveraging its existing capabilities such as zero-shot domain transfer (e.g., detecting a sketch of an object after training only on real photos). 
The storage and gradient calculations are limited to the token embedding dimension, requiring significantly less compute than full-model fine-tuning. We evaluated whether the method matches or outperforms the baseline methods that suffer from forgetting in a wide variety of quantitative and qualitative experiments.
[110] 3DGabSplat: 3D Gabor Splatting for Frequency-adaptive Radiance Field Rendering
Junyu Zhou,Yuyang Huang,Wenrui Dai,Junni Zou,Ziyang Zheng,Nuowen Kan,Chenglin Li,Hongkai Xiong
Main category: cs.CV
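TL;DR: This paper proposes 3DGabSplat, which improves 3D Gaussian Splatting with 3D Gabor-based primitives to achieve more efficient, higher-quality novel view synthesis.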
Details
Motivation: 3DGS relies on the Gaussian function, which is inherently low-pass and therefore limited in representing high-frequency details; it also produces redundant primitives and memory overhead. Method: The approach uses 3D Gabor-based primitives with a CUDA-based rasterizer and introduces a frequency-adaptive mechanism for optimization. Result: 3DGabSplat achieves state-of-the-art rendering quality on both real-world and synthetic scenes, with a PSNR gain of up to 1.35 dB and a reduced number of primitives and memory consumption. Conclusion: 3DGabSplat overcomes the limitations of 3DGS, using 3D Gabor-based primitives to achieve more efficient, higher-quality novel view synthesis. Abstract: Recent prominence in 3D Gaussian Splatting (3DGS) has enabled real-time rendering while maintaining high-fidelity novel view synthesis. However, 3DGS resorts to the Gaussian function that is low-pass by nature and is restricted in representing high-frequency details in 3D scenes. Moreover, it causes redundant primitives with degraded training and rendering efficiency and excessive memory overhead. To overcome these limitations, we propose 3D Gabor Splatting (3DGabSplat) that leverages a novel 3D Gabor-based primitive with multiple directional 3D frequency responses for radiance field representation supervised by multi-view images. The proposed 3D Gabor-based primitive forms a filter bank incorporating multiple 3D Gabor kernels at different frequencies to enhance flexibility and efficiency in capturing fine 3D details. Furthermore, to achieve novel view rendering, an efficient CUDA-based rasterizer is developed to project the multiple directional 3D frequency components characterized by 3D Gabor-based primitives onto the 2D image plane, and a frequency-adaptive mechanism is presented for adaptive joint optimization of primitives. 3DGabSplat is scalable to be a plug-and-play kernel for seamless integration into existing 3DGS paradigms to enhance both efficiency and quality of novel view synthesis. Extensive experiments demonstrate that 3DGabSplat outperforms 3DGS and its variants using alternative primitives, and achieves state-of-the-art rendering quality across both real-world and synthetic scenes.
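To make the contrast between a Gaussian and a Gabor kernel concrete, the sketch below evaluates both at a 3D point. A Gabor function is a Gaussian envelope multiplied by a sinusoidal carrier, so it can oscillate inside its footprint where a plain Gaussian cannot. The isotropic `sigma`, the cosine carrier, and the function names are assumptions for illustration only; the paper's primitive combines multiple kernels and is not reproduced here.

```python
import math


def gaussian3d(x, center, sigma):
    """Isotropic, unnormalized 3D Gaussian evaluated at point x."""
    r2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-r2 / (2 * sigma ** 2))


def gabor3d(x, center, sigma, freq):
    """Gaussian envelope times a cosine carrier.

    freq is a 3D spatial-frequency vector in cycles per unit length
    (a hypothetical parameterization chosen for this sketch).
    """
    phase = 2 * math.pi * sum(f * (xi - ci) for f, xi, ci in zip(freq, x, center))
    return gaussian3d(x, center, sigma) * math.cos(phase)


center = (0.0, 0.0, 0.0)
point = (0.5, 0.0, 0.0)
smooth = gaussian3d(point, center, sigma=1.0)
oscillating = gabor3d(point, center, sigma=1.0, freq=(1.0, 0.0, 0.0))
```

With a zero frequency vector the Gabor kernel reduces to the Gaussian, which is one way to see why such a primitive could act as a drop-in generalization.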
Remarkably, we achieve up to 1.35 dB PSNR gain over 3DGS with simultaneously reduced number of primitives and memory consumption.
[111] PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation
Kang Liu,Zhuoqi Ma,Zikang Fang,Yunan Li,Kun Xie,Qiguang Miao
Main category: cs.CV
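TL;DR: PriorRG is a new framework that uses patient-specific prior knowledge to generate chest X-ray reports, improving clinical accuracy and fluency through contrastive pre-training and prior-aware decoding.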
Details
Motivation: Existing methods neglect the patient-specific prior knowledge (such as clinical context and the most recent prior image) that radiologists rely on in routine diagnosis, which limits report quality. Method: PriorRG uses a two-stage training framework: Stage 1 applies prior-guided contrastive pre-training, and Stage 2 applies a prior-aware coarse-to-fine decoding strategy to integrate patient-specific information. Result: PriorRG improves BLEU-4 by 3.6% and F1 by 3.8% on MIMIC-CXR, and BLEU-1 by 5.9% on MIMIC-ABN. Conclusion: PriorRG makes effective use of patient prior information, improving the clinical accuracy and fluency of generated X-ray reports and outperforming existing state-of-the-art methods. Abstract: Chest X-ray report generation aims to reduce radiologists' workload by automatically producing high-quality preliminary reports. A critical yet underexplored aspect of this task is the effective use of patient-specific prior knowledge -- including clinical context (e.g., symptoms, medical history) and the most recent prior image -- which radiologists routinely rely on for diagnostic reasoning. Most existing methods generate reports from single images, neglecting this essential prior information and thus failing to capture diagnostic intent or disease progression. To bridge this gap, we propose PriorRG, a novel chest X-ray report generation framework that emulates real-world clinical workflows via a two-stage training pipeline. In Stage 1, we introduce a prior-guided contrastive pre-training scheme that leverages clinical context to guide spatiotemporal feature extraction, allowing the model to align more closely with the intrinsic spatiotemporal semantics in radiology reports. In Stage 2, we present a prior-aware coarse-to-fine decoding for report generation that progressively integrates patient-specific prior knowledge with the vision encoder's hidden states. This decoding allows the model to align with diagnostic focus and track disease progression, thereby enhancing the clinical accuracy and fluency of the generated reports. Extensive experiments on MIMIC-CXR and MIMIC-ABN datasets demonstrate that PriorRG outperforms state-of-the-art methods, achieving a 3.6% BLEU-4 and 3.8% F1 score improvement on MIMIC-CXR, and a 5.9% BLEU-1 gain on MIMIC-ABN.
Code and checkpoints will be released upon acceptance.
[112] Cross-View Localization via Redundant Sliced Observations and A-Contrario Validation
Yongjun Zhang,Mingtao Xiong,Yi Wan,Gui-Song Xia
Main category: cs.CV
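TL;DR: Slice-Loc improves the localization accuracy and reliability of CVL methods by introducing redundant observations and a geometric rigidity formula.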
Details
Motivation: CVL methods lack redundant observations, which makes localization reliability hard to assess; Slice-Loc is introduced to address this. Method: Slice-Loc is a two-stage method that divides the query image into sub-images, estimates a 3-DoF pose for each slice, filters out erroneous poses with a geometric rigidity formula, and merges the inliers to produce the final camera pose. Result: In cross-city tests on the DReSS dataset, Slice-Loc reduces the mean localization error from 4.47 m to 1.86 m and the mean orientation error from 3.42° to 1.24°. Conclusion: By introducing redundant observations and a geometric rigidity formula, Slice-Loc improves the localization accuracy and reliability of CVL methods. Abstract: Cross-view localization (CVL) matches ground-level images with aerial references to determine the geo-position of a camera, enabling smart vehicles to self-localize offline in GNSS-denied environments. However, most CVL methods output only a single observation, the camera pose, and lack the redundant observations required by surveying principles, making it challenging to assess localization reliability through the mutual validation of observational data. To tackle this, we introduce Slice-Loc, a two-stage method featuring an a-contrario reliability validation for CVL. Instead of using the query image as a single input, Slice-Loc divides it into sub-images and estimates the 3-DoF pose for each slice, creating redundant and independent observations. Then, a geometric rigidity formula is proposed to filter out the erroneous 3-DoF poses, and the inliers are merged to generate the final camera pose. Furthermore, we propose a model that quantifies the meaningfulness of localization by estimating the number of false alarms (NFA), according to the distribution of the locations of the sliced images. By eliminating gross errors, Slice-Loc boosts localization accuracy and effectively detects failures. After filtering out mislocalizations, Slice-Loc reduces the proportion of errors exceeding 10 m to under 3%. In cross-city tests on the DReSS dataset, Slice-Loc cuts the mean localization error from 4.47 m to 1.86 m and the mean orientation error from 3.42° to 1.24°, outperforming state-of-the-art methods.
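The value of redundant per-slice observations can be sketched with a simple consensus step: estimate one pose per slice, discard estimates that disagree with the majority, and merge the rest. The code below is only a stand-in under stated assumptions. It uses a median-distance test rather than the paper's geometric rigidity formula or NFA model, and it assumes each slice estimate has already been converted to a camera pose `(x_m, y_m, yaw_deg)`. The name `merge_slice_poses` and the `max_dist` threshold are hypothetical.

```python
import math
from statistics import median


def merge_slice_poses(poses, max_dist=5.0):
    """Merge per-slice 3-DoF pose estimates (x_m, y_m, yaw_deg) into one pose.

    Inliers are the estimates within max_dist metres of the coordinate-wise
    median position (a simple stand-in for the paper's rigidity check).
    Returns ((x, y, yaw_deg), inliers).
    """
    mx = median(p[0] for p in poses)
    my = median(p[1] for p in poses)
    inliers = [p for p in poses if math.hypot(p[0] - mx, p[1] - my) <= max_dist]
    if not inliers:
        raise ValueError("no consistent slice poses; localization should be rejected")
    x = sum(p[0] for p in inliers) / len(inliers)
    y = sum(p[1] for p in inliers) / len(inliers)
    # Circular mean, so that headings such as 358 and 2 degrees average to about 0.
    s = sum(math.sin(math.radians(p[2])) for p in inliers)
    c = sum(math.cos(math.radians(p[2])) for p in inliers)
    yaw = math.degrees(math.atan2(s, c)) % 360
    return (x, y, yaw), inliers


# Three slices agree; the fourth is a gross error and is dropped.
slice_poses = [(10.0, 20.0, 358.0), (11.0, 21.0, 2.0), (9.0, 19.0, 0.0), (60.0, -40.0, 170.0)]
pose, kept = merge_slice_poses(slice_poses)
```

A single-observation pipeline would have no way to notice the fourth estimate was wrong; with several slices, the disagreement itself becomes a usable reliability signal.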
Code and dataset will be available at: https://github.com/bnothing/Slice-Loc.
[113] CT-GRAPH: Hierarchical Graph Attention Network for Anatomy-Guided CT Report Generation
Hamza Kalisch,Fabian Hörst,Jens Kleesiek,Ken Herrmann,Constantin Seibold
Main category: cs.CV
TL;DR: CT-GRAPH is a new method for automated radiology report generation that uses a hierarchical graph attention network to model anatomical relationships, achieving better performance than existing approaches.
Details
Motivation: Current methods for automating radiology report generation rely solely on global image features, which fail to capture important fine-grained organ relationships necessary for accurate reporting. This limitation motivated the development of a more structured and knowledge-based approach like CT-GRAPH. Method: The authors proposed CT-GRAPH, a hierarchical graph attention network that structures anatomical regions into a graph, integrating fine-grained organ features with coarser anatomical systems and global patient context. The features are obtained using pretrained 3D medical feature encoders and anatomical masks, refined within the graph, and then fed into a large language model to generate medical reports. Result: The method achieved a substantial improvement of 7.9% in F1 score over state-of-the-art methods on the CT-RATE dataset for chest CT report generation. Conclusion: CT-GRAPH provides a novel method for generating radiology reports by explicitly modeling radiological knowledge through a hierarchical graph attention network, leading to improved performance compared to existing methods. Abstract: As medical imaging is central to diagnostic processes, automating the generation of radiology reports has become increasingly relevant to assist radiologists with their heavy workloads. Most current methods rely solely on global image features, failing to capture fine-grained organ relationships crucial for accurate reporting. To this end, we propose CT-GRAPH, a hierarchical graph attention network that explicitly models radiological knowledge by structuring anatomical regions into a graph, linking fine-grained organ features to coarser anatomical systems and a global patient context. Our method leverages pretrained 3D medical feature encoders to obtain global and organ-level features by utilizing anatomical masks. These features are further refined within the graph and then integrated into a large language model to generate detailed medical reports. 
We evaluate our approach for the task of report generation on the large-scale chest CT dataset CT-RATE. We provide an in-depth analysis of pretrained feature encoders for CT report generation and show that our method achieves a substantial absolute improvement of 7.9% in F1 score over current state-of-the-art methods. The code is publicly available at https://github.com/hakal104/CT-GRAPH.
[114] Deformable Attention Graph Representation Learning for Histopathology Whole Slide Image Analysis
Mingxi Fu,Xitong Ling,Yuxuan Chen,Jiawen Li,fanglei fu,Huaitian Yuan,Tian Guan,Yonghong He,Lianghui Zhu
Main category: cs.CV
TL;DR: This paper proposes a novel GNN framework with deformable attention to improve the classification of pathology images by capturing spatial dependencies and enhancing contextual information.
Details
Motivation: Accurate classification of Whole Slide Images and Regions of Interest is challenging due to spatial dependencies among tissue structures, which current methods fail to capture effectively. Method: A dynamic weighted directed graph was constructed based on patch features, incorporating learnable spatial offsets informed by real coordinates of each patch to adaptively attend to morphologically relevant regions. Result: The framework achieved state-of-the-art performance on four benchmark datasets, demonstrating its effectiveness in capturing complex spatial structures in WSIs and ROIs. Conclusion: The proposed GNN framework with deformable attention significantly enhances contextual field while preserving spatial specificity for pathology image analysis. Abstract: Accurate classification of Whole Slide Images (WSIs) and Regions of Interest (ROIs) is a fundamental challenge in computational pathology. While mainstream approaches often adopt Multiple Instance Learning (MIL), they struggle to capture the spatial dependencies among tissue structures. Graph Neural Networks (GNNs) have emerged as a solution to model inter-instance relationships, yet most rely on static graph topologies and overlook the physical spatial positions of tissue patches. Moreover, conventional attention mechanisms lack specificity, limiting their ability to focus on structurally relevant regions. In this work, we propose a novel GNN framework with deformable attention for pathology image analysis. We construct a dynamic weighted directed graph based on patch features, where each node aggregates contextual information from its neighbors via attention-weighted edges. Specifically, we incorporate learnable spatial offsets informed by the real coordinates of each patch, enabling the model to adaptively attend to morphologically relevant regions across the slide. This design significantly enhances the contextual field while preserving spatial specificity. 
Our framework achieves state-of-the-art performance on four benchmark datasets (TCGA-COAD, BRACS, gastric intestinal metaplasia grading, and intestinal ROI classification), demonstrating the power of deformable attention in capturing complex spatial structures in WSIs and ROIs.
[115] UNCAGE: Contrastive Attention Guidance for Masked Generative Transformers in Text-to-Image Generation
Wonjun Kang,Byeongkeun Ahn,Minjae Lee,Kevin Galim,Seunghyuk Oh,Hyung Il Koo,Nam Ik Cho
Main category: cs.CV
TL;DR: The paper introduces UNCAGE, a training-free method using attention maps to enhance compositional fidelity in text-to-image generation, showing significant performance improvements.
Details
Motivation: The motivation stems from the challenges in compositional text-to-image generation, where existing models like Diffusion Models and Masked Generative Transformers struggle with accurate attribute binding and text-image alignment. Method: The paper proposes UNCAGE, a training-free method that utilizes attention maps to prioritize the unmasking of tokens representing individual objects, aiming to improve compositional fidelity in text-to-image generation. Result: UNCAGE demonstrates consistent improvements in both quantitative and qualitative evaluations across multiple benchmarks, achieving better compositional fidelity with negligible inference overhead. Conclusion: The paper concludes that UNCAGE effectively enhances compositional fidelity in text-to-image generation by leveraging attention maps to prioritize unmasking relevant tokens, offering consistent performance improvements with minimal inference overhead. Abstract: Text-to-image (T2I) generation has been actively studied using Diffusion Models and Autoregressive Models. Recently, Masked Generative Transformers have gained attention as an alternative to Autoregressive Models to overcome the inherent limitations of causal attention and autoregressive decoding through bidirectional attention and parallel decoding, enabling efficient and high-quality image generation. However, compositional T2I generation remains challenging, as even state-of-the-art Diffusion Models often fail to accurately bind attributes and achieve proper text-image alignment. While Diffusion Models have been extensively studied for this issue, Masked Generative Transformers exhibit similar limitations but have not been explored in this context. To address this, we propose Unmasking with Contrastive Attention Guidance (UNCAGE), a novel training-free method that improves compositional fidelity by leveraging attention maps to prioritize the unmasking of tokens that clearly represent individual objects. 
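One way to picture attention-guided unmasking is as a re-ranking of the tokens a masked generative model would reveal next. The sketch below is an illustrative guess, not the paper's formulation: it scores each image token by how much more it attends to one object than to the next most attended object, adds that score to the model's confidence, and unmasks the top `k`. The score definition, the `weight` parameter, and both function names are hypothetical.

```python
def contrastive_scores(attn):
    """attn[t][o] is the attention of image token t to object o (assumed pre-aggregated).

    A token scores high when it attends strongly to exactly one object.
    """
    scores = []
    for row in attn:
        ranked = sorted(row, reverse=True)
        runner_up = ranked[1] if len(ranked) > 1 else 0.0
        scores.append(ranked[0] - runner_up)
    return scores


def pick_tokens_to_unmask(confidence, attn, masked, k, weight=1.0):
    """Return the indices of the k masked tokens with the highest guided score."""
    scores = contrastive_scores(attn)
    guided = {t: confidence[t] + weight * scores[t] for t in masked}
    return sorted(guided, key=guided.get, reverse=True)[:k]


confidence = [0.5, 0.5, 0.5, 0.9]
attention = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.4, 0.4]]
chosen = pick_tokens_to_unmask(confidence, attention, masked={0, 1, 2}, k=2)
```

Because the guidance only changes the unmasking order, a scheme of this shape needs no retraining, which is consistent with the method being described as training-free.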
UNCAGE consistently improves performance in both quantitative and qualitative evaluations across multiple benchmarks and metrics, with negligible inference overhead. Our code is available at https://github.com/furiosa-ai/uncage.
[116] From Detection to Correction: Backdoor-Resilient Face Recognition via Vision-Language Trigger Detection and Noise-Based Neutralization
Farah Wahida,M. A. P. Chamikara,Yashothara Shanmugarasa,Mohan Baruwal Chhetri,Thilina Ranbaduge,Ibrahim Khalil
Main category: cs.CV
TL;DR: TrueBiometric is a novel approach that accurately detects and corrects poisoned images in biometric systems using a majority voting mechanism and targeted corrective noise, offering a more practical and effective solution for mitigating backdoor attacks.
Details
Motivation: Existing defense mechanisms against backdoor attacks face challenges in precisely identifying and mitigating poisoned images without compromising data utility. This undermines the overall reliability of biometric systems like face recognition systems. Method: TrueBiometric uses a majority voting mechanism leveraging multiple state-of-the-art large vision language models to detect poisoned images, and then corrects poisoned samples using targeted and calibrated corrective noise. Result: TrueBiometric detects and corrects poisoned images with 100% accuracy without compromising accuracy on clean images. Conclusion: TrueBiometric offers a more practical, accurate, and effective solution for mitigating backdoor attacks in face recognition systems. Abstract: Biometric systems, such as face recognition systems powered by deep neural networks (DNNs), rely on large and highly sensitive datasets. Backdoor attacks can subvert these systems by manipulating the training process. By inserting a small trigger, such as a sticker, make-up, or patterned mask, into a few training images, an adversary can later present the same trigger during authentication to be falsely recognized as another individual, thereby gaining unauthorized access. Existing defense mechanisms against backdoor attacks still face challenges in precisely identifying and mitigating poisoned images without compromising data utility, which undermines the overall reliability of the system. We propose a novel and generalizable approach, TrueBiometric: Trustworthy Biometrics, which accurately detects poisoned images using a majority voting mechanism leveraging multiple state-of-the-art large vision language models. Once identified, poisoned samples are corrected using targeted and calibrated corrective noise. Our extensive empirical results demonstrate that TrueBiometric detects and corrects poisoned images with 100% accuracy without compromising accuracy on clean images.
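The majority-voting step can be sketched as follows. Each detector is a hypothetical callable standing in for a query to one vision language model; the abstract does not specify how the models are prompted, how ties are handled, or how the corrective noise is computed, so none of that is modeled here. In this sketch a tie counts as "not poisoned".

```python
def majority_vote_poisoned(image, detectors):
    """Flag an image as poisoned when more than half of the detectors say so.

    detectors: callables taking the image and returning a truthy value if they
    believe a backdoor trigger is present (placeholders for VLM queries).
    """
    if not detectors:
        raise ValueError("at least one detector is required")
    votes = [bool(detect(image)) for detect in detectors]
    return sum(votes) > len(votes) / 2


# Toy detectors that inspect a dictionary describing the image.
detectors = [
    lambda img: img.get("sticker", False),
    lambda img: img.get("patterned_mask", False),
    lambda img: img.get("sticker", False) or img.get("make_up", False),
]
flagged = majority_vote_poisoned({"sticker": True}, detectors)
```

Using an odd number of voters avoids ties altogether, and disagreement among voters could also be logged as a signal that an image deserves manual review.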
Compared to existing state-of-the-art approaches, TrueBiometric offers a more practical, accurate, and effective solution for mitigating backdoor attacks in face recognition systems.
[117] Physical Adversarial Camouflage through Gradient Calibration and Regularization
Jiawei Liang,Siyuan Liang,Jianjie Huang,Chenxi Si,Ming Zhang,Xiaochun Cao
Main category: cs.CV
TL;DR: This paper introduces a novel adversarial camouflage framework to improve the effectiveness of physical attacks on deep object detectors, particularly addressing challenges like inconsistent sampling and conflicting gradient updates, with significant improvements in attack success rates.
Details
Motivation: Physical adversarial camouflage poses a security risk by deceiving object detectors through texture alterations, and existing techniques struggle with inconsistent sampling and multi-angle gradient conflicts. Method: The paper proposes a novel adversarial camouflage framework based on gradient optimization, introducing a gradient calibration strategy for consistent gradient updates and a gradient decorrelation method to prioritize and orthogonalize gradients. Result: Experimental results show that the proposed method achieves an average increase in attack success rate (ASR) of 13.46% across distances and 11.03% across angles compared to the state of the art. Conclusion: The proposed adversarial camouflage framework significantly outperforms existing methods in attack success rate and highlights the need for more robust system design in safety-critical applications like autonomous driving. Abstract: The advancement of deep object detectors has greatly affected safety-critical fields like autonomous driving. However, physical adversarial camouflage poses a significant security risk by altering object textures to deceive detectors. Existing techniques struggle with variable physical environments, facing two main challenges: 1) inconsistent sampling point densities across distances hinder the gradient optimization from ensuring local continuity, and 2) updating texture gradients from multiple angles causes conflicts, reducing optimization stability and attack effectiveness. To address these issues, we propose a novel adversarial camouflage framework based on gradient optimization. First, we introduce a gradient calibration strategy, which ensures consistent gradient updates across distances by propagating gradients from sparsely to unsampled texture points. 
Additionally, we develop a gradient decorrelation method, which prioritizes and orthogonalizes gradients based on loss values, enhancing stability and effectiveness in multi-angle optimization by eliminating redundant or conflicting updates. Extensive experimental results on various detection models, angles and distances show that our method significantly exceeds the state of the art, with an average increase in attack success rate (ASR) of 13.46% across distances and 11.03% across angles. Furthermore, empirical evaluation in real-world scenarios highlights the need for more robust system design.
[118] Smoothing Slot Attention Iterations and Recurrences
Rongzhen Zhao,Wenyan Yang,Juho Kannala,Joni Pajarinen
Main category: cs.CV
TL;DR: SmoothSA improves Slot Attention by addressing limitations in cold-start queries and transform differentiation for video frames.
Details
Motivation: Cold-start queries lack sample-specific cues, hindering precise aggregation in first frames, while non-first frames require different transforms. Method: SmoothSA preheats cold-start queries using a self-distilled module and applies differentiated transforms for aggregation in video frames. Result: Experiments demonstrate that SmoothSA enhances performance on object discovery, recognition, and downstream benchmarks. Conclusion: SmoothSA improves the Slot Attention mechanism by preheating cold-start queries and differentiating transforms for first and non-first video frames, enhancing object-centric learning effectiveness. Abstract: Slot Attention (SA) and its variants lie at the heart of mainstream Object-Centric Learning (OCL). Objects in an image can be aggregated into respective slot vectors, by iteratively refining cold-start query vectors, typically three times, via SA on image features. For video, such aggregation is recurrently shared across frames, with queries cold-started on the first frame while transitioned from the previous frame's slots on non-first frames. However, the cold-start queries lack sample-specific cues thus hinder precise aggregation on the image or video's first frame; also, non-first frames' queries are already sample-specific thus require transforms different from the first frame's aggregation. We address these issues for the first time with our SmoothSA: (1) To smooth SA iterations on the image or video's first frame, we preheat the cold-start queries with rich information of input features, via a tiny module self-distilled inside OCL; (2) To smooth SA recurrences across all video frames, we differentiate the homogeneous transforms on the first and non-first frames, by using full and single iterations respectively. Comprehensive experiments on object discovery, recognition and downstream benchmarks validate our method's effectiveness.
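The iteration schedule in point (2) above, full iterations on the first frame and a single iteration on later frames, can be illustrated with a heavily simplified sketch. Everything here is an assumption made for illustration: the attention step omits the learned projections, GRU, and MLP of real Slot Attention, the preheating module from point (1) is left out, slots are passed unchanged between frames instead of through a transition module, and the names `sa_step` and `aggregate_video` are hypothetical.

```python
import numpy as np


def sa_step(slots: np.ndarray, feats: np.ndarray) -> np.ndarray:
    """One simplified slot-attention refinement: slots (K, D), feats (N, D)."""
    logits = feats @ slots.T / np.sqrt(feats.shape[1])      # (N, K)
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)           # softmax over slots
    weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)  # normalize over inputs
    return weights.T @ feats                                # weighted mean, (K, D)


def aggregate_video(frames, init_queries, full_iters=3):
    """Run `full_iters` steps on the first frame and one step on later frames."""
    slots = init_queries
    slots_per_frame, steps_per_frame = [], []
    for t, feats in enumerate(frames):
        n_steps = full_iters if t == 0 else 1
        for _ in range(n_steps):
            slots = sa_step(slots, feats)
        slots_per_frame.append(slots)
        steps_per_frame.append(n_steps)
    return slots_per_frame, steps_per_frame


rng = np.random.default_rng(0)
frames = [rng.standard_normal((5, 4)) for _ in range(3)]   # 3 frames, 5 features each
queries = rng.standard_normal((2, 4))                      # 2 cold-start queries
video_slots, video_steps = aggregate_video(frames, queries)
```

The point of the sketch is only the schedule: later frames start from slots that already carry sample-specific information, so they receive fewer refinement steps than the cold-started first frame.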
Further analyses intuitively illuminate how our method smooths SA iterations and recurrences. Our code is available in the supplement.
[119] Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions
Hubert Baniecki,Maximilian Muschalik,Fabian Fumagalli,Barbara Hammer,Eyke Hüllermeier,Przemyslaw Biecek
Main category: cs.CV
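TL;DR: This paper proposes FIxLIP, a second-order interaction explanation method for vision-language models. Rooted in game theory and built on the weighted Banzhaf interaction index, it outperforms conventional saliency-map methods on several benchmarks and can also be used to compare models.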
Details
Motivation: Existing saliency-map methods capture only first-order attributions and overlook cross-modal interactions, so a more comprehensive and flexible method is needed to explain the outputs of vision-language models. Method: Building on game theory and the weighted Banzhaf interaction index, the paper proposes FIxLIP, a method for decomposing the similarity in language-image pre-trained models. Result: Experiments show that FIxLIP outperforms first-order attribution methods on the MS COCO and ImageNet-1k benchmarks and is useful for comparing different models (e.g., CLIP vs. SigLIP-2, ViT-B/32 vs. ViT-L/16). Conclusion: FIxLIP provides a stronger second-order interaction explanation method that effectively decomposes the similarity in vision-language encoders and outperforms first-order attribution methods in practice. Abstract: Language-image pre-training (LIP) enables the development of vision-language models capable of zero-shot classification, localization, multimodal retrieval, and semantic understanding. Various explanation methods have been proposed to visualize the importance of input image-text pairs on the model's similarity outputs. However, popular saliency maps are limited by capturing only first-order attributions, overlooking the complex cross-modal interactions intrinsic to such encoders. We introduce faithful interaction explanations of LIP models (FIxLIP) as a unified approach to decomposing the similarity in vision-language encoders. FIxLIP is rooted in game theory, where we analyze how using the weighted Banzhaf interaction index offers greater flexibility and improves computational efficiency over the Shapley interaction quantification framework. From a practical perspective, we propose how to naturally extend explanation evaluation metrics, like the pointing game and area between the insertion/deletion curves, to second-order interaction explanations. Experiments on MS COCO and ImageNet-1k benchmarks validate that second-order methods like FIxLIP outperform first-order attribution methods. Beyond delivering high-quality explanations, we demonstrate the utility of FIxLIP in comparing different models like CLIP vs. SigLIP-2 and ViT-B/32 vs. ViT-L/16.
[120] How and Why: Taming Flow Matching for Unsupervised Anomaly Detection and Localization
Liangwei Li,Lin Liu,Juanxiu Liu,Jing Zhang,Ruqian Hao,Xiaohui Du
Main category: cs.CV
TL;DR: The paper introduces a new paradigm for unsupervised anomaly detection and localization using Flow Matching, which improves upon conventional flow-based methods by enhancing dynamical control over sample trajectories.
Details
Motivation: The motivation is to address the model expressivity limitations of conventional flow-based methods in unsupervised anomaly detection and localization. Method: The method involves formalizing time-reversed Flow Matching (rFM) as a vector field regression and proposing Worst Transport (WT) displacement interpolation to reconstruct a non-probabilistic evolution path. Result: The paper achieves state-of-the-art performance at a single scale on the MVTec dataset with the proposed WT-Flow method. Conclusion: The paper concludes that the proposed WT-Flow method enhances dynamical control over sample trajectories, offering a theoretically grounded separation mechanism for anomalous samples, with the potential to scale to complex data. Abstract: We propose a new paradigm for unsupervised anomaly detection and localization using Flow Matching (FM), which fundamentally addresses the model expressivity limitations of conventional flow-based methods. To this end, we formalize the concept of time-reversed Flow Matching (rFM) as a vector field regression along a predefined probability path to transform unknown data distributions into standard Gaussian. We bring two core observations that reshape our understanding of FM. First, we rigorously prove that FM with linear interpolation probability paths is inherently non-invertible. Second, our analysis reveals that employing reversed Gaussian probability paths in high-dimensional spaces can lead to trivial vector fields. This issue arises due to the manifold-related constraints. Building on the second observation, we propose Worst Transport (WT) displacement interpolation to reconstruct a non-probabilistic evolution path. The proposed WT-Flow enhances dynamical control over sample trajectories, constructing "degenerate potential wells" for anomaly-free samples while allowing anomalous samples to escape. This novel unsupervised paradigm offers a theoretically grounded separation mechanism for anomalous samples.
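For readers unfamiliar with Flow Matching, the linear-interpolation path that the abstract analyzes can be illustrated with the standard construction: a point on the straight line between a data sample and a Gaussian sample, with the constant displacement between them as the regression target for the vector field. The sketch below shows only this generic linear path in the time-reversed direction (data at t = 0, noise at t = 1). It is not the WT-Flow method, and the function name `linear_path_sample` and the time convention are assumptions.

```python
import numpy as np


def linear_path_sample(x_data: np.ndarray, x_noise: np.ndarray, t: float):
    """Return the point at time t on the straight path and its target velocity.

    Assumed convention: t = 0 is the data sample, t = 1 is the Gaussian sample.
    """
    x_t = (1.0 - t) * x_data + t * x_noise
    target_velocity = x_noise - x_data   # constant along a straight path
    return x_t, target_velocity


rng = np.random.default_rng(0)
x_data = rng.normal(loc=3.0, size=(4, 2))   # stand-in for data features
x_noise = rng.standard_normal((4, 2))       # standard Gaussian samples
x_mid, velocity = linear_path_sample(x_data, x_noise, 0.5)
```

A network trained with Flow Matching would regress `target_velocity` from `(x_t, t)`; the abstract's first observation concerns the invertibility of flows learned along exactly this kind of path.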
Notably, FM provides a computationally tractable framework that scales to complex data. We present the first successful application of FM for the unsupervised anomaly detection task, achieving state-of-the-art performance at a single scale on the MVTec dataset. The reproducible code for training will be released upon camera-ready submission.
[121] F2PASeg: Feature Fusion for Pituitary Anatomy Segmentation in Endoscopic Surgery
Lumin Chen,Zhiying Wu,Tianye Lei,Xuexue Bai,Ming Feng,Yuxi Wang,Gaofeng Meng,Zhen Lei,Hongbin Liu
Main category: cs.CV
TL;DR: The study presents F2PASeg, a feature fusion-based method for pituitary anatomy segmentation, offering real-time, reliable intraoperative surgical planning by consistently segmenting critical anatomical structures.
Details
Motivation: Pituitary tumors can deform or encapsulate adjacent vital structures, and anatomical structure segmentation can enhance the safety of pituitary surgery by providing early warnings of surgical risks. However, pixel-level annotated video stream datasets for pituitary surgeries are rare, which motivated the creation of the PAS dataset and the development of F2PASeg. Method: The authors introduced the Pituitary Anatomy Segmentation (PAS) dataset comprising 7,845 time-coherent images from 120 videos. They applied data augmentation techniques to mitigate class imbalance and proposed F2PASeg with a Feature Fusion module to enhance segmentation robustness against intraoperative variations. Result: Experimental results showed that F2PASeg consistently segments critical anatomical structures in real time, demonstrating its effectiveness and reliability for intraoperative pituitary surgery planning. Conclusion: F2PASeg provides a reliable solution for intraoperative pituitary surgery planning by consistently segmenting critical anatomical structures in real time. Abstract: Pituitary tumors often cause deformation or encapsulation of adjacent vital structures. Anatomical structure segmentation can provide surgeons with early warnings of regions that pose surgical risks, thereby enhancing the safety of pituitary surgery. However, pixel-level annotated video stream datasets for pituitary surgeries are extremely rare. To address this challenge, we introduce a new dataset for Pituitary Anatomy Segmentation (PAS). PAS comprises 7,845 time-coherent images extracted from 120 videos. To mitigate class imbalance, we apply data augmentation techniques that simulate the presence of surgical instruments in the training data. One major challenge in pituitary anatomy segmentation is the inconsistency in feature representation due to occlusions, camera motion, and surgical bleeding. 
By incorporating a Feature Fusion module, F2PASeg is proposed to refine anatomical structure segmentation by leveraging both high-resolution image features and deep semantic embeddings, enhancing robustness against intraoperative variations. Experimental results demonstrate that F2PASeg consistently segments critical anatomical structures in real time, providing a reliable solution for intraoperative pituitary surgery planning. Code: https://github.com/paulili08/F2PASeg.
[122] Keep It Real: Challenges in Attacking Compression-Based Adversarial Purification
Samuel Räber,Till Aczel,Andreas Plesner,Roger Wattenhofer
Main category: cs.CV
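TL;DR: The paper finds that image compression models with high realism are more robust to adversarial attacks, which poses a significant challenge for future attack methods.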
Details
Motivation: Prior work suggested that preprocessing images with lossy compression can defend against adversarial perturbations, but comprehensive attack evaluations were lacking. The paper aims to fill this gap and assess how secure compression models are under strong attacks. Method: The paper constructs strong white-box and adaptive attacks against various compression models and analyzes them through rigorous evaluation across multiple attack scenarios. Result: Compression models that produce realistic, high-fidelity reconstructions resist the attacks better than low-realism models. This robustness does not stem from gradient masking but appears to come from the distributional alignment of reconstructions with natural images. Conclusion: High-fidelity, realistic reconstructions that stay distributionally aligned with natural images appear inherently robust, which presents a significant obstacle for future adversarial attacks. Developing more effective techniques to overcome realism is an essential challenge for comprehensive security evaluation. Abstract: Previous work has suggested that preprocessing images through lossy compression can defend against adversarial perturbations, but comprehensive attack evaluations have been lacking. In this paper, we construct strong white-box and adaptive attacks against various compression models and identify a critical challenge for attackers: high realism in reconstructed images significantly increases attack difficulty. Through rigorous evaluation across multiple attack scenarios, we demonstrate that compression models capable of producing realistic, high-fidelity reconstructions are substantially more resistant to our attacks. In contrast, low-realism compression models can be broken. Our analysis reveals that this is not due to gradient masking. Rather, realistic reconstructions maintaining distributional alignment with natural images seem to offer inherent robustness. This work highlights a significant obstacle for future adversarial attacks and suggests that developing more effective techniques to overcome realism represents an essential challenge for comprehensive security evaluation.
[123] SMOL-MapSeg: Show Me One Label
Yunshuang Yuan,Frank Thiemann,Thorsten Dahms,Monika Sester
Main category: cs.CV
TL;DR: This paper proposes SMOL-MapSeg, a model that uses OND prompting to improve semantic segmentation of historical maps and that generalizes well.
Details
Motivation: Existing pre-trained models perform poorly on historical maps because, unlike modern or domain-specific images, these maps lack consistent styles and patterns. Method: The prompt encoder of SAM is replaced with the OND prompting mechanism and the model is fine-tuned on historical map data, yielding SMOL-MapSeg. Result: SMOL-MapSeg accurately segments classes defined by OND knowledge, can adapt to new classes from a few samples, and outperforms a UNet baseline in average segmentation performance. Conclusion: With OND prompting, SMOL-MapSeg performs well on semantic segmentation of historical maps, outperforms a UNet baseline, and adapts to new classes through few-shot fine-tuning. Abstract: Historical maps are valuable for studying changes to the Earth's surface. With the rise of deep learning, models like UNet have been used to extract information from these maps through semantic segmentation. Recently, pre-trained foundation models have shown strong performance across domains such as autonomous driving, medical imaging, and industrial inspection. However, they struggle with historical maps. These models are trained on modern or domain-specific images, where patterns can be tied to predefined concepts through common sense or expert knowledge. Historical maps lack such consistency -- similar concepts can appear in vastly different shapes and styles. To address this, we propose On-Need Declarative (OND) knowledge-based prompting, which introduces explicit prompts to guide the model on what patterns correspond to which concepts. This allows users to specify the target concept and pattern during inference (on-need inference). We implement this by replacing the prompt encoder of the foundation model SAM with our OND prompting mechanism and fine-tune it on historical maps. The resulting model is called SMOL-MapSeg (Show Me One Label). Experiments show that SMOL-MapSeg can accurately segment classes defined by OND knowledge. It can also adapt to unseen classes through few-shot fine-tuning. Additionally, it outperforms a UNet-based baseline in average segmentation performance.
[124] AutoIAD: Manager-Driven Multi-Agent Collaboration for Automated Industrial Anomaly Detection
Dongwei Ji,Bingzhang Hu,Yi Zhou
Main category: cs.CV
TL;DR: This paper proposes AutoIAD, an automated framework for industrial visual anomaly detection that improves performance and reduces manual effort.
Details
Motivation: Industrial anomaly detection typically demands extensive manual work, and AutoIAD aims to automate this process. Method: AutoIAD uses a Manager-Driven agent to coordinate specialized sub-agents and incorporates a domain-specific knowledge base. Result: AutoIAD outperforms general-purpose frameworks in task completion rate and AUROC, with reduced hallucination through refinement. Conclusion: AutoIAD is a specialized multi-agent framework for industrial visual anomaly detection that achieves better performance compared to existing frameworks. Abstract: Industrial anomaly detection (IAD) is critical for manufacturing quality control, but conventionally requires significant manual effort for various application scenarios. This paper introduces AutoIAD, a multi-agent collaboration framework, specifically designed for end-to-end automated development of industrial visual anomaly detection. AutoIAD leverages a Manager-Driven central agent to orchestrate specialized sub-agents (including Data Preparation, Data Loader, Model Designer, Trainer) and integrates a domain-specific knowledge base, which intelligently handles the entire pipeline using raw industrial image data to develop a trained anomaly detection model. We construct a comprehensive benchmark using MVTec AD datasets to evaluate AutoIAD across various LLM backends. Extensive experiments demonstrate that AutoIAD significantly outperforms existing general-purpose agentic collaboration frameworks and traditional AutoML frameworks in task completion rate and model performance (AUROC), while effectively mitigating issues like hallucination through iterative refinement. Ablation studies further confirm the crucial roles of the Manager central agent and the domain knowledge base module in producing robust and high-quality IAD solutions.
[125] Symmetry Understanding of 3D Shapes via Chirality Disentanglement
Weikang Wang,Tobias Weißberg,Nafie El Amrani,Florian Bernard
Main category: cs.CV
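TL;DR: This paper proposes an unsupervised chirality feature extraction pipeline based on 2D foundation models that decorates shape vertices with chirality-aware information, addressing the lack of chirality-aware features in current shape descriptors.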
Details
Motivation: Chirality information is ubiquitous across computer vision data modes such as images, videos, point clouds, and meshes. Although many shape vertex descriptors have appealing properties, they often cannot distinguish left from right symmetric parts, so a chirality feature extractor is needed. Method: Based on the Diff3F framework, the paper proposes an unsupervised chirality feature extraction pipeline that decorates shape vertices with information extracted from 2D foundation models. Result: Quantitative and qualitative experiments across multiple datasets show that the method performs well on downstream tasks including left-right disentanglement, shape matching, and part segmentation, demonstrating its effectiveness and practical utility. Conclusion: The proposed chirality feature extraction method effectively captures chirality information in shapes and offers a new solution for several shape analysis tasks. Abstract: Chirality information (i.e. information that allows distinguishing left from right) is ubiquitous for various data modes in computer vision, including images, videos, point clouds, and meshes. While chirality has been extensively studied in the image domain, its exploration in shape analysis (such as point clouds and meshes) remains underdeveloped. Although many shape vertex descriptors have shown appealing properties (e.g. robustness to rigid-body transformations), they are often not able to disambiguate between left and right symmetric parts. Considering the ubiquity of chirality information in different shape analysis problems and the lack of chirality-aware features within current shape descriptors, developing a chirality feature extractor becomes necessary and urgent. Based on the recent Diff3F framework, we propose an unsupervised chirality feature extraction pipeline to decorate shape vertices with chirality-aware information, extracted from 2D foundation models. We evaluated the extracted chirality features through quantitative and qualitative experiments across diverse datasets. Results from downstream tasks including left-right disentanglement, shape matching, and part segmentation demonstrate their effectiveness and practical utility. Project page: https://wei-kang-wang.github.io/chirality/
[126] MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips
Shibo Wang,Haonan He,Maria Parelli,Christoph Gebhardt,Zicong Fan,Jie Song
Main category: cs.CV
TL;DR: MagicHOI improves hand-object reconstruction from short monocular videos with limited viewpoint variation by integrating a novel view synthesis model and incorporating visible contact constraints, outperforming existing methods.
Details
Motivation: The authors aim to overcome the limitations of current RGB-based hand-object reconstruction methods which either rely on object templates or assume full object visibility, an assumption that often breaks in real-world settings. Method: MagicHOI integrates a novel view synthesis model into the hand-object reconstruction framework and aligns hand to object by incorporating visible contact constraints. Result: MagicHOI significantly outperforms existing state-of-the-art hand-object reconstruction methods. The novel view synthesis diffusion priors effectively regularize unseen object regions, enhancing 3D hand-object reconstruction. Conclusion: MagicHOI is an effective method for reconstructing hands and objects from short monocular interaction videos, especially under limited viewpoint variation. Abstract: Most RGB-based hand-object reconstruction methods rely on object templates, while template-free methods typically assume full object visibility. This assumption often breaks in real-world settings, where fixed camera viewpoints and static grips leave parts of the object unobserved, resulting in implausible reconstructions. To overcome this, we present MagicHOI, a method for reconstructing hands and objects from short monocular interaction videos, even under limited viewpoint variation. Our key insight is that, despite the scarcity of paired 3D hand-object data, large-scale novel view synthesis diffusion models offer rich object supervision. This supervision serves as a prior to regularize unseen object regions during hand interactions. Leveraging this insight, we integrate a novel view synthesis model into our hand-object reconstruction framework. We further align hand to object by incorporating visible contact constraints. Our results demonstrate that MagicHOI significantly outperforms existing state-of-the-art hand-object reconstruction methods. 
We also show that novel view synthesis diffusion priors effectively regularize unseen object regions, enhancing 3D hand-object reconstruction.
[127] Revealing Latent Information: A Physics-inspired Self-supervised Pre-training Framework for Noisy and Sparse Events
Lin Zhu,Ruonan Liu,Xiao Wang,Lizhi Wang,Hua Huang
Main category: cs.CV
TL;DR: This paper introduces a self-supervised pre-training framework for event camera data, enhancing feature extraction and achieving superior performance in visual tasks.
Details
Motivation: Event camera data is inherently sparse and noisy, complicating effective feature extraction. This study aims to address these challenges by proposing a self-supervised pre-training framework to fully reveal latent information in event data. Method: The framework includes three stages: Difference-guided Masked Modeling, Backbone-fixed Feature Transition, and Focus-aimed Contrastive Learning, aiming to extract enhanced information from raw event data while improving semantic discrimination. Result: Extensive experiments show the framework's robustness and consistent outperformance of state-of-the-art methods in various visual tasks. Conclusion: The proposed self-supervised pre-training framework demonstrates robustness and superior performance on various downstream tasks, including object recognition, semantic segmentation, and optical flow estimation. Abstract: Event camera, a novel neuromorphic vision sensor, records data with high temporal resolution and wide dynamic range, offering new possibilities for accurate visual representation in challenging scenarios. However, event data is inherently sparse and noisy, mainly reflecting brightness changes, which complicates effective feature extraction. To address this, we propose a self-supervised pre-training framework to fully reveal latent information in event data, including edge information and texture cues. Our framework consists of three stages: Difference-guided Masked Modeling, inspired by the event physical sampling process, reconstructs temporal intensity difference maps to extract enhanced information from raw event data. Backbone-fixed Feature Transition contrasts event and image features without updating the backbone to preserve representations learned from masked modeling and stabilizing their effect on contrastive learning. Focus-aimed Contrastive Learning updates the entire model to improve semantic discrimination by focusing on high-value regions. 
Extensive experiments show our framework is robust and consistently outperforms state-of-the-art methods on various downstream tasks, including object recognition, semantic segmentation, and optical flow estimation. The code and dataset are available at https://github.com/BIT-Vision/EventPretrain.
[128] Head Anchor Enhanced Detection and Association for Crowded Pedestrian Tracking
Zewei Wu,César Teixeira,Wei Ke,Zhang Xiong
Main category: cs.CV
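TL;DR: This paper proposes an improved multi-pedestrian tracking method that combines feature enhancement with a refined motion model to handle occlusion and improve tracking stability.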
Details
Motivation: Visual pedestrian tracking faces severe occlusion in real-world applications. Traditional methods rely on full-body bounding box features and linear motion assumptions, and perform poorly in complex scenes. Method: The method combines features from the classification and regression branches of an object detector and introduces head keypoint detection to reduce the impact of occlusion. For motion modeling, it uses an iterative Kalman filtering approach with 3D priors that better matches the assumptions of modern detectors. Result: The method is more robust in occluded scenes and maintains pedestrian trajectories more stably, making it suitable for applications such as intelligent surveillance and human-computer interaction. Conclusion: The study proposes an enhanced multi-object tracking framework that combines richer feature representations with a more robust motion model, effectively addressing pedestrian occlusion and improving tracking in crowded environments. Abstract: Visual pedestrian tracking represents a promising research field, with extensive applications in intelligent surveillance, behavior analysis, and human-computer interaction. However, real-world applications face significant occlusion challenges. When multiple pedestrians interact or overlap, the loss of target features severely compromises the tracker's ability to maintain stable trajectories. Traditional tracking methods, which typically rely on full-body bounding box features extracted from Re-ID models and linear constant-velocity motion assumptions, often struggle in severe occlusion scenarios. To address these limitations, this work proposes an enhanced tracking framework that leverages richer feature representations and a more robust motion model. Specifically, the proposed method incorporates detection features from both the regression and classification branches of an object detector, embedding spatial and positional information directly into the feature representations. To further mitigate occlusion challenges, a head keypoint detection model is introduced, as the head is less prone to occlusion compared to the full body. In terms of motion modeling, we propose an iterative Kalman filtering approach designed to align with modern detector assumptions, integrating 3D priors to better complete motion trajectories in complex scenes. By combining these advancements in appearance and motion modeling, the proposed method offers a more robust solution for multi-object tracking in crowded environments where occlusions are prevalent.
[129] FS-IQA: Certified Feature Smoothing for Robust Image Quality Assessment
Ekaterina Shumitskaya,Dmitriy Vatolin,Anastasia Antsiferova
Main category: cs.CV
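TL;DR: This paper proposes a novel certified defense for image quality assessment models based on randomized smoothing in the feature space. It preserves image quality while providing robustness guarantees and is computationally efficient.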
Details
Motivation: Prior methods inject Gaussian noise directly into input images, which often degrades visual quality, so a defense is needed that provides robustness guarantees while preserving image fidelity. Method: The paper formally connects noise levels in the feature space to input-space perturbations by analyzing the maximum singular value of the backbone network's Jacobian. Result: The method is validated on two benchmark datasets with six widely used full-reference and no-reference image quality assessment models and compared against five state-of-the-art certified defenses; correlation with subjective quality scores improves by up to 30.9%. Conclusion: The approach supports both full-reference and no-reference image quality assessment models without architectural modifications, suits a variety of scenarios, and is computationally efficient. Abstract: We propose a novel certified defense method for Image Quality Assessment (IQA) models based on randomized smoothing with noise applied in the feature space rather than the input space. Unlike prior approaches that inject Gaussian noise directly into input images, often degrading visual quality, our method preserves image fidelity while providing robustness guarantees. To formally connect noise levels in the feature space with corresponding input-space perturbations, we analyze the maximum singular value of the backbone network's Jacobian. Our approach supports both full-reference (FR) and no-reference (NR) IQA models without requiring any architectural modifications, suitable for various scenarios. It is also computationally efficient, requiring a single backbone forward pass per image. Compared to previous methods, it reduces inference time by 99.5% without certification and by 20.6% when certification is applied. We validate our method with extensive experiments on two benchmark datasets, involving six widely-used FR and NR IQA models and comparisons against five state-of-the-art certified defenses. Our results demonstrate consistent improvements in correlation with subjective quality scores by up to 30.9%.
[130] Leveraging AI to Accelerate Clinical Data Cleaning: A Comparative Study of AI-Assisted vs. Traditional Methods
Matthew Purri,Amit Patel,Erik Deurrell
Main category: cs.CV
TL;DR: Octozi, an AI-assisted platform for clinical trial data cleaning, significantly improves efficiency and accuracy while reducing burdens, highlighting the potential of AI in pharmaceutical trials.
Details
Motivation: Clinical trial data cleaning is a critical bottleneck due to exponentially increasing data volumes and complexity, which manual review processes struggle to manage effectively. Method: Octozi was developed by integrating large language models with domain-specific heuristics. A controlled experimental study involving 10 experienced clinical reviewers evaluated the platform's performance in improving data cleaning efficiency and accuracy. Result: AI assistance increased data cleaning throughput by 6.03-fold, decreased cleaning errors from 54.67% to 8.48% (a 6.44-fold improvement), and reduced false positive queries by 15.48-fold. These improvements were consistent across reviewers regardless of experience level. Conclusion: Octozi, an AI-assisted platform combining large language models with domain-specific heuristics, demonstrates transformative potential in enhancing clinical trial data cleaning by significantly improving throughput, reducing errors, and minimizing unnecessary burdens, suggesting broad applicability and the promise of AI in pharmaceutical clinical trials. Abstract: Clinical trial data cleaning represents a critical bottleneck in drug development, with manual review processes struggling to manage exponentially increasing data volumes and complexity. This paper presents Octozi, an artificial intelligence-assisted platform that combines large language models with domain-specific heuristics to transform clinical data review. In a controlled experimental study with experienced clinical reviewers (n=10), we demonstrate that AI assistance increased data cleaning throughput by 6.03-fold while simultaneously decreasing cleaning errors from 54.67% to 8.48% (a 6.44-fold improvement). Crucially, the system reduced false positive queries by 15.48-fold, minimizing unnecessary site burden. These improvements were consistent across reviewers regardless of experience level, suggesting broad applicability. 
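The fold changes quoted above are ratios of a before value to an after value. As a quick check on the error-rate figures (54.67% before, 8.48% with AI assistance), the following sketch computes that ratio; the helper name `fold_change` is illustrative and not taken from the paper.

```python
def fold_change(before: float, after: float) -> float:
    """Return how many times larger `before` is than `after`."""
    if after == 0:
        raise ValueError("`after` must be non-zero")
    return before / after


# Error rates in percent, as reported in the abstract.
error_fold = fold_change(54.67, 8.48)
print(f"error-rate fold change: {error_fold:.4f}")
```

The ratio lies between 6.44 and 6.45, so it agrees with the reported 6.44-fold improvement if the value was truncated rather than rounded to two decimals.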
Our findings indicate that AI-assisted approaches can address fundamental inefficiencies in clinical trial operations, potentially accelerating drug development timelines and reducing costs while maintaining regulatory compliance. This work establishes a framework for integrating AI into safety-critical clinical workflows and demonstrates the transformative potential of human-AI collaboration in pharmaceutical clinical trials.
[131] Optimal Brain Connection: Towards Efficient Structural Pruning
Shaowu Chen,Wei Ma,Binhua Huang,Qingyuan Wang,Guoxin Wang,Weize Sun,Lei Huang,Deepu John
Main category: cs.CV
TL;DR: This paper proposes Optimal Brain Connection, a structural pruning framework that improves neural network compression by considering parameter interconnections. It introduces the Jacobian Criterion for saliency evaluation and the Equivalent Pruning mechanism for retaining connection contributions during fine-tuning, both of which enhance model performance preservation.
Details
Motivation: The motivation stems from the observation that existing structural pruning methods often neglect the interconnections among parameters, which can lead to suboptimal compression and performance degradation. The authors aim to address this limitation by proposing a more holistic framework for structural pruning. Method: The paper introduces two key components: the Jacobian Criterion, which evaluates the saliency of structural parameters by capturing both intra-component interactions and inter-layer dependencies, and the Equivalent Pruning mechanism, which uses autoencoders to retain contributions of all connections—including pruned ones—during fine-tuning. Result: Experimental results show that the Jacobian Criterion outperforms several popular metrics in preserving model performance, and the Equivalent Pruning mechanism effectively mitigates performance degradation after fine-tuning. Conclusion: The paper concludes that the proposed structural pruning framework, Optimal Brain Connection, effectively compresses neural networks while maintaining model performance. The Jacobian Criterion and Equivalent Pruning mechanism are shown to be effective in preserving model accuracy and reducing performance degradation after fine-tuning. Abstract: Structural pruning has been widely studied for its effectiveness in compressing neural networks. However, existing methods often neglect the interconnections among parameters. To address this limitation, this paper proposes a structural pruning framework termed Optimal Brain Connection. First, we introduce the Jacobian Criterion, a first-order metric for evaluating the saliency of structural parameters. Unlike existing first-order methods that assess parameters in isolation, our criterion explicitly captures both intra-component interactions and inter-layer dependencies. 
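To make the contrast between scoring parameters in isolation and scoring them jointly more concrete, here is a minimal sketch. It is not the paper's Jacobian Criterion, whose exact form the abstract does not give. It assumes a generic first-order Taylor estimate of the loss change, and the function names `isolated_saliency` and `grouped_saliency` are hypothetical.

```python
def isolated_saliency(weights, grads):
    """Sum of per-parameter first-order terms, each taken in isolation."""
    return sum(abs(w * g) for w, g in zip(weights, grads))


def grouped_saliency(weights, grads):
    """Magnitude of the first-order loss change when the whole group is removed.

    Removing the group sets every weight to zero, so the Taylor estimate of
    the loss change is -sum(g_i * w_i); its magnitude is returned.
    """
    return abs(sum(w * g for w, g in zip(weights, grads)))


# Toy structural group with three weights whose first-order terms partly cancel.
weights = [0.5, -0.25, 1.0]
grads = [2.0, 4.0, -1.0]

print(isolated_saliency(weights, grads))  # 1.0 + 1.0 + 1.0
print(grouped_saliency(weights, grads))   # |1.0 - 1.0 - 1.0|
```

In this toy case the isolated score is three times the joint score, because terms of opposite sign cancel when the group is treated as one unit. That cancellation is one simple way interactions inside a component can change a saliency ranking.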
Second, we propose the Equivalent Pruning mechanism, which utilizes autoencoders to retain the contributions of all original connections, including pruned ones, during fine-tuning. Experimental results demonstrate that the Jacobian Criterion outperforms several popular metrics in preserving model performance, while the Equivalent Pruning mechanism effectively mitigates performance degradation after fine-tuning. Code: https://github.com/ShaowuChen/Optimal_Brain_Connection
[132] When Deepfake Detection Meets Graph Neural Network:a Unified and Lightweight Learning Framework
Haoyu Liu,Chaoyu Gong,Mengke He,Jiate Li,Kai Han,Siqiang Luo
Main category: cs.CV
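TL;DR: This paper proposes SSTGNN, a lightweight video detection framework that represents videos as structured graphs, effectively combining spatial, temporal, and spectral information to detect AI-generated and manipulated videos.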
Details
Motivation: Existing video detection methods typically rely on isolated spatial, temporal, or spectral information, struggle to generalize across different manipulation types, and usually require large models to perform well. Method: SSTGNN adopts a graph-based architecture and introduces learnable spectral filters and temporal differential modeling to analyze videos jointly across the spatial, temporal, and spectral dimensions, capturing subtle manipulation traces more effectively. Result: Experiments show that SSTGNN performs well on a variety of benchmark datasets, achieving superior performance in both in-domain and cross-domain settings and strong robustness against unseen manipulations. It also has up to 42.4× fewer parameters than existing state-of-the-art models. Conclusion: SSTGNN is an efficient, lightweight method for detecting video manipulation, with good generalization and potential for real-world deployment. Abstract: The proliferation of generative video models has made detecting AI-generated and manipulated videos an urgent challenge. Existing detection approaches often fail to generalize across diverse manipulation types due to their reliance on isolated spatial, temporal, or spectral information, and typically require large models to perform well. This paper introduces SSTGNN, a lightweight Spatial-Spectral-Temporal Graph Neural Network framework that represents videos as structured graphs, enabling joint reasoning over spatial inconsistencies, temporal artifacts, and spectral distortions. SSTGNN incorporates learnable spectral filters and temporal differential modeling into a graph-based architecture, capturing subtle manipulation traces more effectively. Extensive experiments on diverse benchmark datasets demonstrate that SSTGNN not only achieves superior performance in both in-domain and cross-domain settings, but also offers strong robustness against unseen manipulations. Remarkably, SSTGNN accomplishes these results with up to 42.4× fewer parameters than state-of-the-art models, making it highly lightweight and scalable for real-world deployment.
[133] AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety
Adi Levi,Or Levi,Sardhendu Mishra,Jonathan Morra
Main category: cs.CV
TL;DR: This paper explores the use of Multimodal Large Language Models (MLLMs) like Gemini, GPT, and Llama in brand safety classification for content moderation. The authors introduce a new multimodal, multilingual dataset labeled by professionals and show that MLLMs can effectively handle brand safety tasks with cost and time efficiency, although certain limitations and failure cases are identified.
Details
Motivation: The exponential growth of online video content has led to a surge in unsafe content that exceeds human moderation capabilities, creating operational and mental health challenges. While MLLMs have shown promise in video understanding tasks, their use in multimodal content moderation, which requires nuanced interpretation of visual and textual cues, remains underexplored. Method: The authors benchmarked MLLMs (Gemini, GPT, Llama) using a newly introduced, multimodal and multilingual dataset labeled by professional reviewers across multiple risk categories. They conducted a comparative analysis to assess accuracy and cost efficiency of these models against human reviewers. Result: MLLMs demonstrated effectiveness in brand safety classification, showing potential for addressing content moderation challenges. The models were evaluated for accuracy and cost efficiency in comparison to professional human reviewers. Conclusion: The study concludes that MLLMs like Gemini, GPT, and Llama are effective in multimodal brand safety classification, offering a viable solution to the growing demand for video content moderation, while also highlighting their limitations and failure cases. Abstract: As the volume of video content online grows exponentially, the demand for moderation of unsafe videos has surpassed human capabilities, posing both operational and mental health challenges. While recent studies demonstrated the merits of Multimodal Large Language Models (MLLMs) in various video understanding tasks, their application to multimodal content moderation, a domain that requires nuanced understanding of both visual and textual cues, remains relatively underexplored. In this work, we benchmark the capabilities of MLLMs in brand safety classification, a critical subset of content moderation for safe-guarding advertising integrity. 
To this end, we introduce a novel, multimodal and multilingual dataset, meticulously labeled by professional reviewers in a multitude of risk categories. Through a detailed comparative analysis, we demonstrate the effectiveness of MLLMs such as Gemini, GPT, and Llama in multimodal brand safety, and evaluate their accuracy and cost efficiency compared to professional human reviewers. Furthermore, we present an in-depth discussion shedding light on limitations of MLLMs and failure cases. We are releasing our dataset alongside this paper to facilitate future research on effective and responsible brand safety and content moderation.
[134] Looking into the Unknown: Exploring Action Discovery for Segmentation of Known and Unknown Actions
Federico Spurio,Emad Bahrami,Olga Zatsarynna,Yazan Abu Farha,Gianpiero Francesca,Juergen Gall
Main category: cs.CV
TL;DR: This paper introduces Action Discovery, a two-step method for Temporal Action Segmentation that effectively identifies and segments unknown actions in partially labeled datasets, outperforming existing approaches.
Details
Motivation: The motivation stems from the challenges in Temporal Action Segmentation, particularly in defining and annotating ambiguous actions and handling incomplete annotations in partially labeled datasets. This is common in domains like neuroscience and other applications where labels may be missing or ambiguous. Method: The method involves two modules: the Granularity-Guided Segmentation Module (GGSM), which identifies temporal intervals for both known and unknown actions, and the Unknown Action Segment Assignment (UASA), which assigns semantically meaningful classes to unknown actions based on embedding similarities. Result: The proposed Action Discovery approach was evaluated on three challenging datasets (Breakfast, 50Salads, and Desktop Assembly) and demonstrated significant improvements over existing baselines in handling unknown actions. Conclusion: The proposed two-step approach, Action Discovery, effectively addresses the challenge of identifying and segmenting unknown actions in partially labeled datasets, significantly outperforming existing baselines. Abstract: We introduce Action Discovery, a novel setup within Temporal Action Segmentation that addresses the challenge of defining and annotating ambiguous actions and incomplete annotations in partially labeled datasets. In this setup, only a subset of actions - referred to as known actions - is annotated in the training data, while other unknown actions remain unlabeled. This scenario is particularly relevant in domains like neuroscience, where well-defined behaviors (e.g., walking, eating) coexist with subtle or infrequent actions that are often overlooked, as well as in applications where datasets are inherently partially annotated due to ambiguous or missing labels. To address this problem, we propose a two-step approach that leverages the known annotations to guide both the temporal and semantic granularity of unknown action segments. 
First, we introduce the Granularity-Guided Segmentation Module (GGSM), which identifies temporal intervals for both known and unknown actions by mimicking the granularity of annotated actions. Second, we propose the Unknown Action Segment Assignment (UASA), which identifies semantically meaningful classes within the unknown actions, based on learned embedding similarities. We systematically explore the proposed setting of Action Discovery on three challenging datasets - Breakfast, 50Salads, and Desktop Assembly - demonstrating that our method considerably improves upon existing baselines.
[135] Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis
Kunyu Feng,Yue Ma,Xinhua Zhang,Boshi Liu,Yikuang Yuluo,Yinhan Zhang,Runtao Liu,Hongyu Liu,Zhiyuan Qin,Shanhui Mo,Qifeng Chen,Zeyu Wang
Main category: cs.CV
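TL;DR: This paper proposes Follow-Your-Instruction, a framework based on multimodal large language models that automatically synthesizes high-quality 2D, 3D, and 4D data and improves the performance of generative intelligence models.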
Details
Motivation: As demand for AI-generated content (AIGC) grows, the need for high-quality, diverse, and scalable data keeps increasing, but collecting large-scale real-world data remains costly and time-consuming. Method: A framework based on a Multimodal Large Language Model (MLLM) automatically synthesizes high-quality 2D, 3D, and 4D data through four components: MLLM-Collector, MLLM-Generator, MLLM-Optimizer, and MLLM-Planner. Result: Comprehensive experiments show that the generated data significantly improves the performance of existing baseline models on 2D, 3D, and 4D generative tasks. Conclusion: Follow-Your-Instruction shows potential as a scalable and effective data engine for generative intelligence, significantly improving the performance of existing baseline models. Abstract: With the growing demands of AI-generated content (AIGC), the need for high-quality, diverse, and scalable data has become increasingly crucial. However, collecting large-scale real-world data remains costly and time-consuming, hindering the development of downstream applications. While some works attempt to collect task-specific data via a rendering process, most approaches still rely on manual scene construction, limiting their scalability and accuracy. To address these challenges, we propose Follow-Your-Instruction, a Multimodal Large Language Model (MLLM)-driven framework for automatically synthesizing high-quality 2D, 3D, and 4D data. Our Follow-Your-Instruction first collects assets and their associated descriptions through multimodal inputs using the MLLM-Collector. Then it constructs 3D layouts, and leverages Vision-Language Models (VLMs) for semantic refinement through multi-view scenes with the MLLM-Generator and MLLM-Optimizer, respectively. Finally, it uses MLLM-Planner to generate temporally coherent future frames. We evaluate the quality of the generated data through comprehensive experiments on the 2D, 3D, and 4D generative tasks. The results show that our synthetic data significantly boosts the performance of existing baseline models, demonstrating Follow-Your-Instruction's potential as a scalable and effective data engine for generative intelligence.
[136] DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition
Haijing Liu,Tao Pu,Hefeng Wu,Keze Wang,Liang Lin
Main category: cs.CV
TL;DR: DART improves Open-Vocabulary Multi-Label Recognition by combining adaptive intra-class refinement with inter-class transfer that uses external relational knowledge from an LLM, outperforming existing methods.
Details
Motivation: Current Vision-Language Pre-training (VLP) models struggle with fine-grained localization under weak supervision and do not effectively utilize structured relational knowledge, limiting performance on unseen classes in Open-Vocabulary Multi-Label Recognition (OV-MLR). Method: The DART framework uses two modules: Adaptive Refinement Module (ARM) with Weakly Supervised Patch Selecting (WPS) loss for intra-class refinement, and Adaptive Transfer Module (ATM) using a Class Relationship Graph (CRG) and graph attention network for inter-class transfer. Result: Extensive experiments show that DART achieves new state-of-the-art performance on challenging OV-MLR benchmarks. Conclusion: DART is the first framework to integrate external relational knowledge from LLM for adaptive inter-class transfer and perform intra-class refinement under weak supervision, achieving state-of-the-art performance on OV-MLR benchmarks. Abstract: Open-Vocabulary Multi-Label Recognition (OV-MLR) aims to identify multiple seen and unseen object categories within an image, requiring both precise intra-class localization to pinpoint objects and effective inter-class reasoning to model complex category dependencies. While Vision-Language Pre-training (VLP) models offer a strong open-vocabulary foundation, they often struggle with fine-grained localization under weak supervision and typically fail to explicitly leverage structured relational knowledge beyond basic semantics, limiting performance especially for unseen classes. To overcome these limitations, we propose the Dual Adaptive Refinement Transfer (DART) framework. DART enhances a frozen VLP backbone via two synergistic adaptive modules. For intra-class refinement, an Adaptive Refinement Module (ARM) refines patch features adaptively, coupled with a novel Weakly Supervised Patch Selecting (WPS) loss that enables discriminative localization using only image-level labels. 
Concurrently, for inter-class transfer, an Adaptive Transfer Module (ATM) leverages a Class Relationship Graph (CRG), constructed using structured knowledge mined from a Large Language Model (LLM), and employs a graph attention network to adaptively transfer relational information between class representations. DART is the first framework, to our knowledge, to explicitly integrate external LLM-derived relational knowledge for adaptive inter-class transfer while simultaneously performing adaptive intra-class refinement under weak supervision for OV-MLR. Extensive experiments on challenging benchmarks demonstrate that our DART achieves new state-of-the-art performance, validating its effectiveness.
[137] WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction
Shaobin Zhuang,Yiwei Guo,Canmiao Fu,Zhipeng Huang,Zeyue Tian,Ying Zhang,Chen Li,Yali Wang
Main category: cs.CV
TL;DR: WeTok is a novel visual tokenizer that improves compression and reconstruction performance in vision generation tasks, achieving state-of-the-art results on benchmarks.
Details
Motivation: To address the trade-off between compression ratios and reconstruction fidelity in existing visual tokenizers. Method: Two core innovations: (1) Group-wise lookup-free Quantization (GQ) for efficient memory and computation usage; (2) Generative Decoding (GD) to probabilistically model visual data distribution conditioned on discrete tokens. Result: On the ImageNet 50k validation set, WeTok achieved a record-low zero-shot rFID of 0.12. Its highest-compression model reached a zero-shot rFID of 3.49 at a compression ratio of 768, better than Cosmos, which scores 4.57 at a compression ratio of 384. Conclusion: WeTok tokenizer achieves superior performance in visual generation tasks, outperforming previous tokenizers in terms of compression ratio and reconstruction fidelity. Abstract: The visual tokenizer is a critical component for vision generation. However, existing tokenizers often face an unsatisfactory trade-off between compression ratios and reconstruction fidelity. To fill this gap, we introduce a powerful and concise WeTok tokenizer, which surpasses the previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups, and perform lookup-free quantization for each group. As a result, GQ can efficiently overcome memory and computation limitations of prior tokenizers, while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoding (GD). Different from prior tokenizers, we introduce a generative decoder with a prior of an extra noise variable. In this case, GD can probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratios. Extensive experiments on mainstream benchmarks show superior performance of our WeTok. On the ImageNet 50k validation set, WeTok achieves a record-low zero-shot rFID (WeTok: 0.12 vs. FLUX-VAE: 0.18 vs. SD-VAE 3.5: 0.19).
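The group-wise quantization step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not WeTok's implementation: it assumes that lookup-free quantization means binarizing each latent dimension by sign and reading the sign pattern as an integer index, and the function names, group count, and toy latent vector are hypothetical.

```python
def lookup_free_quantize(group):
    """Quantize one group by sign and return (codes, integer index)."""
    codes = [1 if x > 0 else -1 for x in group]
    # Read the sign pattern as a binary number: bit i is set when codes[i] == +1.
    index = sum(1 << i for i, c in enumerate(codes) if c == 1)
    return codes, index


def groupwise_quantize(latent, num_groups):
    """Split a latent vector into equal groups and quantize each one."""
    if len(latent) % num_groups != 0:
        raise ValueError("latent length must be divisible by num_groups")
    size = len(latent) // num_groups
    groups = [latent[i:i + size] for i in range(0, len(latent), size)]
    return [lookup_free_quantize(g) for g in groups]


# Toy 8-dimensional latent split into 2 groups of 4 dimensions.
latent = [0.3, -1.2, 0.8, -0.1, 2.0, 0.4, -0.7, -0.9]
result = groupwise_quantize(latent, num_groups=2)
for codes, index in result:
    print(codes, index)
```

With two groups of four bits, each group indexes an implicit codebook of 16 entries, while the pair of indices together still distinguishes 256 sign patterns. No embedding table is stored or searched, which is the sense in which the quantization is lookup-free.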
Furthermore, our highest-compression model achieves a zero-shot rFID of 3.49 at a compression ratio of 768, outperforming Cosmos (rFID 4.57 at a compression ratio of 384, only 50% of ours). Code and models are available at https://github.com/zhuangshaobin/WeTok.
[138] LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model
Tao Sun,Oliver Liu,JinJin Li,Lan Ma
Main category: cs.CV
TL;DR: This paper introduces LLaVA-RE, a framework for binary image-text relevancy evaluation using Multimodal Large Language Models, showing promising results across various tasks.
Details
Motivation: Binary relevancy evaluation is essential for assessing the quality of responses in multimodal generative AI, but it is challenging due to diverse text formats and varying relevancy definitions. Method: LLaVA-RE uses the LLaVA architecture with detailed task instructions and multimodal in-context samples for binary relevancy evaluation. Result: The experimental results show the effectiveness of LLaVA-RE for binary image-text relevancy evaluation across various tasks. Conclusion: LLaVA-RE is an effective framework for binary image-text relevancy evaluation using Multimodal Large Language Models. Abstract: Multimodal generative AI usually involves generating image or text responses given inputs in another modality. The evaluation of image-text relevancy is essential for measuring response quality or ranking candidate responses. In particular, binary relevancy evaluation, i.e., "Relevant" vs. "Not Relevant", is a fundamental problem. However, this is a challenging task considering that texts have diverse formats and the definition of relevancy varies in different scenarios. We find that Multimodal Large Language Models (MLLMs) are an ideal choice to build such evaluators, as they can flexibly handle complex text formats and take in additional task information. In this paper, we present LLaVA-RE, a first attempt at binary image-text relevancy evaluation with MLLM. It follows the LLaVA architecture and adopts detailed task instructions and multimodal in-context samples. In addition, we propose a novel binary relevancy data set that covers various tasks. Experimental results validate the effectiveness of our framework.
[139] Hi3DEval: Advancing 3D Generation Evaluation with Hierarchical Validity
Yuhan Zhang,Long Zhuo,Ziyang Chu,Tong Wu,Zhibing Li,Liang Pan,Dahua Lin,Ziwei Liu
Main category: cs.CV
TL;DR: The paper introduces Hi3DEval, a new hierarchical framework for evaluating 3D generative content that overcomes the limitations of current image-based metrics by incorporating both object-level and part-level assessments, as well as evaluating material realism beyond aesthetic appearance. The accompanying Hi3DBench dataset and automated scoring system enhance the evaluation process, resulting in superior performance in modeling 3D characteristics and alignment with human preferences.
Details
Motivation: The motivation is to address the challenges in quality assessment for 3D generative content, which current methods fail to capture effectively due to their reliance on image-based metrics and object-level operations. These limitations hinder the ability to evaluate spatial coherence, material authenticity, and high-fidelity local details. Method: The paper introduces Hi3DEval, a hierarchical framework combining object-level and part-level evaluations to assess 3D generative content. It evaluates material realism using attributes like albedo, saturation, and metallicness. The supporting Hi3DBench dataset includes diverse 3D assets and annotations, and a multi-agent annotation pipeline is used. A 3D-aware automated scoring system leverages video-based representations for object-level and material-subject evaluations and uses pretrained 3D features for part-level perception. Result: The result of the paper is that Hi3DEval, supported by Hi3DBench and the 3D-aware automated scoring system, outperforms existing image-based metrics in modeling 3D characteristics and aligns better with human preferences, offering a scalable alternative to manual evaluations. Conclusion: Hi3DEval is a hierarchical evaluation framework for 3D generative content that overcomes the limitations of existing image-based metrics by incorporating object-level and part-level evaluations and extending texture evaluation to include material realism. Hi3DBench, a large-scale dataset, and a 3D-aware automated scoring system support the framework. The approach demonstrates superior performance in modeling 3D characteristics and aligning with human preferences. Abstract: Despite rapid advances in 3D content generation, quality assessment for the generated 3D assets remains challenging. Existing methods mainly rely on image-based metrics and operate solely at the object level, limiting their ability to capture spatial coherence, material authenticity, and high-fidelity local details. 
1) To address these challenges, we introduce Hi3DEval, a hierarchical evaluation framework tailored for 3D generative content. It combines both object-level and part-level evaluation, enabling holistic assessments across multiple dimensions as well as fine-grained quality analysis. Additionally, we extend texture evaluation beyond aesthetic appearance by explicitly assessing material realism, focusing on attributes such as albedo, saturation, and metallicness. 2) To support this framework, we construct Hi3DBench, a large-scale dataset comprising diverse 3D assets and high-quality annotations, accompanied by a reliable multi-agent annotation pipeline. We further propose a 3D-aware automated scoring system based on hybrid 3D representations. Specifically, we leverage video-based representations for object-level and material-subject evaluations to enhance modeling of spatio-temporal consistency and employ pretrained 3D features for part-level perception. Extensive experiments demonstrate that our approach outperforms existing image-based metrics in modeling 3D characteristics and achieves superior alignment with human preference, providing a scalable alternative to manual evaluations. The project page is available at https://zyh482.github.io/Hi3DEval/.
[140] MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes
Henghui Ding,Kaining Ying,Chang Liu,Shuting He,Xudong Jiang,Yu-Gang Jiang,Philip H. S. Torr,Song Bai
Main category: cs.CV
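TL;DR: MOSEv2 is a highly complex dataset for video object segmentation research that includes more real-world challenges, such as occlusion, adverse weather, and low-light environments; current methods perform markedly worse on it.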
Details
Motivation: Existing datasets such as DAVIS and YouTube-VOS mainly contain salient, isolated objects and fail to reflect real-world complexity, so a more challenging dataset is needed to advance VOS research. Method: The authors built a new dataset of 5,024 videos and over 701,976 high-quality masks, and benchmarked 20 representative VOS methods and 9 video object tracking methods on it. Result: VOS methods tested on MOSEv2 show significant performance drops; for example, SAM2 falls from 76.4% on MOSEv1 to 50.9% on MOSEv2, indicating that MOSEv2 successfully raises the task difficulty. Conclusion: MOSEv2 is a video object segmentation dataset aimed at complex real-world scenes; it is more challenging than previous datasets and exposes the limitations of existing methods in complex real-world environments. Abstract: Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state-of-the-art methods have achieved impressive performance (e.g., 90+% J&F) on existing benchmarks such as DAVIS and YouTube-VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real-world scenarios. To advance VOS toward more realistic environments, coMplex video Object SEgmentation (MOSEv1) was introduced to facilitate VOS research in complex scenes. Building on the strengths and limitations of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real-world conditions. MOSEv2 consists of 5,024 videos and over 701,976 high-quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces significantly greater scene complexity, including more frequent object disappearance and reappearance, severe occlusions and crowding, smaller objects, as well as a range of new challenges such as adverse weather (e.g., rain, snow, fog), low-light scenes (e.g., nighttime, underwater), multi-shot sequences, camouflaged objects, non-physical targets (e.g., shadows, reflections), scenarios requiring external knowledge, etc. We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and find similar declines, demonstrating that MOSEv2 presents challenges across tasks.
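The headline numbers quoted above can be put in perspective with a short calculation. This sketch uses only figures stated in the abstract; the variable names are hypothetical, and the masks-per-video value is a lower bound because the mask count is reported as "over 701,976".

```python
# Figures quoted in the abstract.
sam2_mosev1 = 76.4    # SAM2 score on MOSEv1 (%)
sam2_mosev2 = 50.9    # SAM2 score on MOSEv2 (%)
num_videos = 5024
num_masks = 701_976   # reported as a lower bound
num_objects = 10_074

absolute_drop = sam2_mosev1 - sam2_mosev2          # percentage points
relative_drop = absolute_drop / sam2_mosev1 * 100  # percent of the MOSEv1 score

masks_per_video = num_masks / num_videos
objects_per_video = num_objects / num_videos

print(f"absolute drop: {absolute_drop:.1f} points")
print(f"relative drop: {relative_drop:.1f}%")
print(f"masks per video (at least): {masks_per_video:.1f}")
print(f"objects per video: {objects_per_video:.2f}")
```

In other words, SAM2 loses roughly a third of its MOSEv1 score, and each video carries on average about two annotated objects.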
These results highlight that despite high accuracy on existing datasets, current VOS methods still struggle under real-world complexities. MOSEv2 is publicly available at https://MOSE.video.
[141] GAP: Gaussianize Any Point Clouds with Text Guidance
Weiqi Zhang,Junsheng Zhou,Haotian Geng,Wenyuan Zhang,Yu-Shen Liu
Main category: cs.CV
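TL;DR: This paper proposes GAP, a novel method that converts colorless point clouds into high-fidelity 3D Gaussians; using a multi-view optimization framework and a surface-anchoring mechanism, it balances geometric accuracy with appearance consistency.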
Details
Motivation: 3D Gaussian Splatting (3DGS) excels at fast, high-quality rendering, but converting colorless point clouds into Gaussians remains an unsolved problem that this paper aims to address. Method: The proposed method, GAP, performs multi-view optimization with a depth-aware image diffusion model, uses a surface-anchoring mechanism to constrain Gaussians to lie on the surfaces of 3D shapes, and adopts a diffusion-based inpainting strategy to complete hard-to-observe regions. Result: GAP performs well on point-to-Gaussian generation tasks ranging from synthetic point clouds to complex real-world scans and large-scale scenes, producing high-fidelity 3D Gaussians while maintaining geometric accuracy and appearance consistency. Conclusion: GAP effectively converts colorless point clouds into high-fidelity 3D Gaussians, balancing geometric accuracy and appearance consistency through its multi-view optimization framework and surface-anchoring mechanism, and shows strong performance across a variety of complex scenes. Abstract: 3D Gaussian Splatting (3DGS) has demonstrated its advantages in achieving fast and high-quality rendering. As point clouds serve as a widely-used and easily accessible form of 3D representation, bridging the gap between point clouds and Gaussians becomes increasingly important. Recent studies have explored how to convert colored points into Gaussians, but directly generating Gaussians from colorless 3D point clouds remains an unsolved challenge. In this paper, we propose GAP, a novel approach that gaussianizes raw point clouds into high-fidelity 3D Gaussians with text guidance. Our key idea is to design a multi-view optimization framework that leverages a depth-aware image diffusion model to synthesize consistent appearances across different viewpoints. To ensure geometric accuracy, we introduce a surface-anchoring mechanism that effectively constrains Gaussians to lie on the surfaces of 3D shapes during optimization. Furthermore, GAP incorporates a diffuse-based inpainting strategy that specifically targets the completion of hard-to-observe regions. We evaluate GAP on the Point-to-Gaussian generation task across varying complexity levels, from synthetic point clouds to challenging real-world scans, and even large-scale scenes. Project Page: https://weiqi-zhang.github.io/GAP.
[142] FaceAnonyMixer: Cancelable Faces via Identity Consistent Latent Space Mixing
Mohammed Talha Alam,Fahad Shamshad,Fakhri Karray,Karthik Nandakumar
Main category: cs.CV
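TL;DR: This paper proposes FaceAnonyMixer, a cancelable face generation method based on mixing in the latent space of a generative model, achieving high recognition accuracy together with strong privacy protection.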